Commit Graph

19301 Commits

Author SHA1 Message Date
Jakub Sitnicki
d57f4b8749 tcp: Update bind bucket state on port release
Today, once an inet_bind_bucket enters a state where fastreuse >= 0 or
fastreuseport >= 0 after a socket is explicitly bound to a port, it remains
in that state until all sockets are removed and the bucket is destroyed.

In this state, the bucket is skipped during ephemeral port selection in
connect(). For applications using a reduced ephemeral port
range (IP_LOCAL_PORT_RANGE socket option), this can cause faster port
exhaustion since blocked buckets are excluded from reuse.

The reason the bucket state isn't updated on port release is unclear.
Possibly a performance trade-off to avoid scanning bucket owners, or just
an oversight.

Fix it by recalculating the bucket state when a socket releases a port. To
limit overhead, each inet_bind2_bucket stores its own (fastreuse,
fastreuseport) state. On port release, only the relevant port-addr bucket
is scanned, and the overall state is derived from these.

Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250917-update-bind-bucket-state-on-unhash-v5-1-57168b661b47@cloudflare.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-09-23 10:12:15 +02:00
Eric Dumazet
649091ef59 tcp: reclaim 8 bytes in struct request_sock_queue
synflood_warned had to be u32 for xchg(), but ensuring
atomicity is not really needed.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250919204856.2977245-9-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-09-22 17:55:25 -07:00
Eric Dumazet
9303c3ced1 net: move sk->sk_err_soft and sk->sk_sndbuf
sk->sk_sndbuf is read-mostly in tx path, so move it from
sock_write_tx group to more appropriate sock_read_tx.

sk->sk_err_soft was not identified previously, but
is used from tcp_ack().

Move it to sock_write_tx group for better cache locality.

Also change tcp_ack() to clear sk->sk_err_soft only if needed.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250919204856.2977245-3-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-09-22 17:55:24 -07:00
Eric Dumazet
17b14d235f net: move sk_uid and sk_protocol to sock_read_tx
sk_uid and sk_protocol are read from inet6_csk_route_socket()
for each TCP transmit.

Also read from udpv6_sendmsg(), udp_sendmsg() and others.

Move them to sock_read_tx for better cache locality.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250919204856.2977245-2-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-09-22 17:55:24 -07:00
Kuniyuki Iwashima
0ac44301e3 tcp: Remove inet6_hash().
inet_hash() and inet6_hash() are exactly the same.

Also, we do not need to export inet6_hash().

Let's consolidate the two into __inet_hash() and rename it to inet_hash().

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250919083706.1863217-3-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-09-22 11:38:43 -07:00
Kuniyuki Iwashima
6445bb832d tcp: Remove osk from __inet_hash() arg.
__inet_hash() is called from inet_hash() and inet6_hash with osk NULL.

Let's remove the 2nd arg from __inet_hash().

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250919083706.1863217-2-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-09-22 11:38:42 -07:00
Johannes Berg
e0d3bba84f wifi: cfg80211: remove IEEE80211_CHAN_{1,2,4,8,16}MHZ flags
These were used by S1G for older chandef representation, but
are no longer needed. Clean them up, even if we can't drop
them from the userspace API entirely.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-09-22 09:03:14 +02:00
Luiz Augusto von Dentz
9e622804d5 Bluetooth: hci_event: Fix UAF in hci_acl_create_conn_sync
This fixes the following UFA in hci_acl_create_conn_sync where a
connection still pending is command submission (conn->state == BT_OPEN)
maybe freed, also since this also can happen with the likes of
hci_le_create_conn_sync fix it as well:

BUG: KASAN: slab-use-after-free in hci_acl_create_conn_sync+0x5ef/0x790 net/bluetooth/hci_sync.c:6861
Write of size 2 at addr ffff88805ffcc038 by task kworker/u11:2/9541

CPU: 1 UID: 0 PID: 9541 Comm: kworker/u11:2 Not tainted 6.16.0-rc7 #3 PREEMPT(full)
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014
Workqueue: hci3 hci_cmd_sync_work
Call Trace:
 <TASK>
 dump_stack_lvl+0x189/0x250 lib/dump_stack.c:120
 print_address_description mm/kasan/report.c:378 [inline]
 print_report+0xca/0x230 mm/kasan/report.c:480
 kasan_report+0x118/0x150 mm/kasan/report.c:593
 hci_acl_create_conn_sync+0x5ef/0x790 net/bluetooth/hci_sync.c:6861
 hci_cmd_sync_work+0x210/0x3a0 net/bluetooth/hci_sync.c:332
 process_one_work kernel/workqueue.c:3238 [inline]
 process_scheduled_works+0xae1/0x17b0 kernel/workqueue.c:3321
 worker_thread+0x8a0/0xda0 kernel/workqueue.c:3402
 kthread+0x70e/0x8a0 kernel/kthread.c:464
 ret_from_fork+0x3fc/0x770 arch/x86/kernel/process.c:148
 ret_from_fork_asm+0x1a/0x30 home/kwqcheii/source/fuzzing/kernel/kasan/linux-6.16-rc7/arch/x86/entry/entry_64.S:245
 </TASK>

Allocated by task 123736:
 kasan_save_stack mm/kasan/common.c:47 [inline]
 kasan_save_track+0x3e/0x80 mm/kasan/common.c:68
 poison_kmalloc_redzone mm/kasan/common.c:377 [inline]
 __kasan_kmalloc+0x93/0xb0 mm/kasan/common.c:394
 kasan_kmalloc include/linux/kasan.h:260 [inline]
 __kmalloc_cache_noprof+0x230/0x3d0 mm/slub.c:4359
 kmalloc_noprof include/linux/slab.h:905 [inline]
 kzalloc_noprof include/linux/slab.h:1039 [inline]
 __hci_conn_add+0x233/0x1b30 net/bluetooth/hci_conn.c:939
 hci_conn_add_unset net/bluetooth/hci_conn.c:1051 [inline]
 hci_connect_acl+0x16c/0x4e0 net/bluetooth/hci_conn.c:1634
 pair_device+0x418/0xa70 net/bluetooth/mgmt.c:3556
 hci_mgmt_cmd+0x9c9/0xef0 net/bluetooth/hci_sock.c:1719
 hci_sock_sendmsg+0x6ca/0xef0 net/bluetooth/hci_sock.c:1839
 sock_sendmsg_nosec net/socket.c:712 [inline]
 __sock_sendmsg+0x219/0x270 net/socket.c:727
 sock_write_iter+0x258/0x330 net/socket.c:1131
 new_sync_write fs/read_write.c:593 [inline]
 vfs_write+0x54b/0xa90 fs/read_write.c:686
 ksys_write+0x145/0x250 fs/read_write.c:738
 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
 do_syscall_64+0xfa/0x3b0 arch/x86/entry/syscall_64.c:94
 entry_SYSCALL_64_after_hwframe+0x77/0x7f

Freed by task 103680:
 kasan_save_stack mm/kasan/common.c:47 [inline]
 kasan_save_track+0x3e/0x80 mm/kasan/common.c:68
 kasan_save_free_info+0x46/0x50 mm/kasan/generic.c:576
 poison_slab_object mm/kasan/common.c:247 [inline]
 __kasan_slab_free+0x62/0x70 mm/kasan/common.c:264
 kasan_slab_free include/linux/kasan.h:233 [inline]
 slab_free_hook mm/slub.c:2381 [inline]
 slab_free mm/slub.c:4643 [inline]
 kfree+0x18e/0x440 mm/slub.c:4842
 device_release+0x9c/0x1c0
 kobject_cleanup lib/kobject.c:689 [inline]
 kobject_release lib/kobject.c:720 [inline]
 kref_put include/linux/kref.h:65 [inline]
 kobject_put+0x22b/0x480 lib/kobject.c:737
 hci_conn_cleanup net/bluetooth/hci_conn.c:175 [inline]
 hci_conn_del+0x8ff/0xcb0 net/bluetooth/hci_conn.c:1173
 hci_conn_complete_evt+0x3c7/0x1040 net/bluetooth/hci_event.c:3199
 hci_event_func net/bluetooth/hci_event.c:7477 [inline]
 hci_event_packet+0x7e0/0x1200 net/bluetooth/hci_event.c:7531
 hci_rx_work+0x46a/0xe80 net/bluetooth/hci_core.c:4070
 process_one_work kernel/workqueue.c:3238 [inline]
 process_scheduled_works+0xae1/0x17b0 kernel/workqueue.c:3321
 worker_thread+0x8a0/0xda0 kernel/workqueue.c:3402
 kthread+0x70e/0x8a0 kernel/kthread.c:464
 ret_from_fork+0x3fc/0x770 arch/x86/kernel/process.c:148
 ret_from_fork_asm+0x1a/0x30 home/kwqcheii/source/fuzzing/kernel/kasan/linux-6.16-rc7/arch/x86/entry/entry_64.S:245

Last potentially related work creation:
 kasan_save_stack+0x3e/0x60 mm/kasan/common.c:47
 kasan_record_aux_stack+0xbd/0xd0 mm/kasan/generic.c:548
 insert_work+0x3d/0x330 kernel/workqueue.c:2183
 __queue_work+0xbd9/0xfe0 kernel/workqueue.c:2345
 queue_delayed_work_on+0x18b/0x280 kernel/workqueue.c:2561
 pairing_complete+0x1e7/0x2b0 net/bluetooth/mgmt.c:3451
 pairing_complete_cb+0x1ac/0x230 net/bluetooth/mgmt.c:3487
 hci_connect_cfm include/net/bluetooth/hci_core.h:2064 [inline]
 hci_conn_failed+0x24d/0x310 net/bluetooth/hci_conn.c:1275
 hci_conn_complete_evt+0x3c7/0x1040 net/bluetooth/hci_event.c:3199
 hci_event_func net/bluetooth/hci_event.c:7477 [inline]
 hci_event_packet+0x7e0/0x1200 net/bluetooth/hci_event.c:7531
 hci_rx_work+0x46a/0xe80 net/bluetooth/hci_core.c:4070
 process_one_work kernel/workqueue.c:3238 [inline]
 process_scheduled_works+0xae1/0x17b0 kernel/workqueue.c:3321
 worker_thread+0x8a0/0xda0 kernel/workqueue.c:3402
 kthread+0x70e/0x8a0 kernel/kthread.c:464
 ret_from_fork+0x3fc/0x770 arch/x86/kernel/process.c:148
 ret_from_fork_asm+0x1a/0x30 home/kwqcheii/source/fuzzing/kernel/kasan/linux-6.16-rc7/arch/x86/entry/entry_64.S:245

Fixes: aef2aa4fa9 ("Bluetooth: hci_event: Fix creating hci_conn object on error status")
Reported-by: Junvyyang, Tencent Zhuque Lab <zhuque@tencent.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2025-09-20 11:01:10 -04:00
Antoine Tenart
1c7e4a6185 net: ipv4: make udp_v4_early_demux explicitly return drop reason
udp_v4_early_demux already returns drop reasons as it either returns 0
or ip_mc_validate_source, which itself returns drop reasons. Its return
value is also already used as a drop reason itself.

Makes this explicit by making it return drop reasons.

Signed-off-by: Antoine Tenart <atenart@kernel.org>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/20250915091958.15382-2-atenart@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-09-19 17:35:51 -07:00
Kuniyuki Iwashima
f1bf77491d psp: Fix typo in kdoc for struct psp_dev_caps.assoc_drv_spc.
assoc_drv_spc is the size of psp_assoc.drv_data[].

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250918192539.1587586-1-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-09-19 17:02:27 -07:00
Daniel Zahka
28bb24dadd psp: don't use flags for checking sk_state
Using flags to check sk_state only makes sense to check for a subset
of states in parallel e.g. sk_fullsock(). We are not doing that
here. Compare for individual states directly.

Signed-off-by: Daniel Zahka <daniel.zahka@gmail.com>
Link: https://patch.msgid.link/20250918155205.2197603-4-daniel.zahka@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-09-19 17:01:20 -07:00
Daniel Zahka
803cdb6ddc psp: fix preemptive inet_twsk() cast in psp_sk_get_assoc_rcu()
It is weird to cast to a timewait_sock before checking sk_state, even
if the use is after such a check. Remove the tw local variable, and
use inet_twsk() directly in the timewait branch.

Signed-off-by: Daniel Zahka <daniel.zahka@gmail.com>
Link: https://patch.msgid.link/20250918155205.2197603-3-daniel.zahka@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-09-19 17:01:20 -07:00
Daniel Zahka
f8d2f8205b psp: make struct sock argument const in psp_sk_get_assoc_rcu()
This function does not need a mutable reference to its argument.

Signed-off-by: Daniel Zahka <daniel.zahka@gmail.com>
Link: https://patch.msgid.link/20250918155205.2197603-2-daniel.zahka@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-09-19 17:01:20 -07:00
Christian Brauner
99d33ce100
net: port to ns_ref_*() helpers
Stop accessing ns.count directly.

Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-09-19 16:22:38 +02:00
Christian Brauner
d7afdf8895
ns: add to_<type>_ns() to respective headers
Every namespace type has a container_of(ns, <ns_type>, ns) static inline
function that is currently not exposed in the header. So we have a bunch
of places that open-code it via container_of(). Move it to the headers
so we can use it directly.

Reviewed-by: Aleksa Sarai <cyphar@cyphar.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-09-19 14:26:16 +02:00
Lachlan Hodges
cbcd507f01 wifi: cfg80211: remove ieee80211_s1g_channel_width
With the introduction of proper S1G channel flags, this function is no
longer used. Remove it.

Signed-off-by: Lachlan Hodges <lachlan.hodges@morsemicro.com>
Link: https://patch.msgid.link/20250918051913.500781-4-lachlan.hodges@morsemicro.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-09-19 11:56:07 +02:00
Lachlan Hodges
d0688dc2b1 wifi: cfg80211: correctly implement and validate S1G chandef
Currently, the S1G channelisation implementation differs from that of
VHT, which is the PHY that S1G is based on. The major difference between
the clock rate is 1/10th of VHT. However how their channelisation is
represented within cfg80211 and mac80211 vastly differ.

To rectify this, remove the use of IEEE80211_CHAN_1/2/4.. flags that were
previously used to indicate the control channel width, however it should be
implied that the control channels are 1MHz in the case of S1G. Additionally,
introduce the invert - being IEEE80211_CHAN_NO_4/8/16MHz - that imply
the control channel may not be used for a certain bandwidth. With these
new flags, we can perform regulatory and chandef validation just as we would
for VHT.

To deal with the notion that S1G PHYs may contain a 2MHz primary channel,
introduce a new variable, s1g_primary_2mhz, which indicates whether we are
operating on a 2MHz primary channel. In this case, the chandef::chan points to
the 1MHz primary channel pointed to by the primary channel location. Alongside
this, introduce some new helper routines that can extract the sibling 1MHz
channel. The sibling being the alternate 1MHz primary subchannel within the
2MHz primary channel that is not pointed to by chandef::chan.

Furthermore, due to unique restrictions imposed on S1G PHYs, introduce
a new flag, IEEE80211_CHAN_S1G_NO_PRIMARY, which states that the 1MHz channel
cannot be used as a primary channel. This is assumed to be set by vendors
as it is hardware and regdom specific, When we validate a 2MHz primary channel,
we need to ensure both 1MHz subchannels do not contain this flag. If one or
both of the 1MHz subchannels contain this flag then the 2MHz primary is not
permitted for use as a primary channel.

Properly integrate S1G channel validation such that it is implemented
according with other PHY types such as VHT. Additionally, implement a new
S1G-specific regulatory flag to allow cfg80211 to understand specific
vendor requirements for S1G PHYs.

Signed-off-by: Arien Judge <arien.judge@morsemicro.com>
Signed-off-by: Andrew Pope <andrew.pope@morsemicro.com>
Signed-off-by: Lachlan Hodges <lachlan.hodges@morsemicro.com>
Link: https://patch.msgid.link/20250918051913.500781-2-lachlan.hodges@morsemicro.com
[remove redundant NL80211_ATTR_S1G_PRIMARY_2MHZ check]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-09-19 11:55:56 +02:00
Ilan Peer
04f17cfea2 wifi: mac80211: Export an API to check if NAN is started
So it can be used by drivers to check if NAN Device interface
is started or not.

Signed-off-by: Ilan Peer <ilan.peer@intel.com>
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
Link: https://patch.msgid.link/20250908140015.c69652f77eb6.Ie4f3d197e0706e742e3d97614fadc11b22adfbc6@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-09-19 11:26:23 +02:00
Ilan Peer
fc41f4a28a wifi: mac80211: Support Tx of action frame for NAN
Add support for sending management frame over a NAN Device
interface:

- Declare support for the supported management frames types.
- Since action frame transmissions over a NAN Device interface
  do not necessarily require a channel configuration, e.g., they
  can be transmitted during DW, modify the Tx path to avoid
  accessing channel information for NAN Device interface.
- In addition modify the points in the Tx path logic to account
  for cases that a band is not specified in the Tx information.

Signed-off-by: Ilan Peer <ilan.peer@intel.com>
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
Link: https://patch.msgid.link/20250908140015.23b160089228.I65a58af753bcbcfb5c4ad8ef372d546f889725ba@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-09-19 11:26:22 +02:00
Ilan Peer
1884e2594b wifi: cfg80211: Store the NAN cluster ID
When the driver indicates that the device has joined
a cluster, store the cluster ID. This is needed for data
path operations, e.g., filtering received frames etc.

Signed-off-by: Ilan Peer <ilan.peer@intel.com>
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
Link: https://patch.msgid.link/20250908140015.63e9fef2a3aa.I6c858185c9e71f84bd2c5174d7ee45902b4391c3@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-09-19 11:26:22 +02:00
Ilan Peer
b9c3d426c8 wifi: cfg80211: Advertise supported NAN capabilities
Allow drivers to specify the supported NAN capabilities and support
advertising the NAN capabilities to user space.

Signed-off-by: Ilan Peer <ilan.peer@intel.com>
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
Link: https://patch.msgid.link/20250908140015.2976966556f5.Ic6e43b10049573180c909dad806f279cfb31143e@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-09-19 11:26:22 +02:00
Andrei Otcheretianski
1ccfd8db34 wifi: cfg80211: Add cluster joined notification APIs
The drivers should notify upper layers and user space when a NAN device
joins a cluster. This is needed, for example, to set the correct addr3
in SDF frames. Add API to report cluster join event.

Signed-off-by: Andrei Otcheretianski <andrei.otcheretianski@intel.com>
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
Link: https://patch.msgid.link/20250908140015.ad27b7b6e4d9.I70b213a2a49f18d1ba2ad325e67e8eff51cc7a1f@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-09-19 11:26:21 +02:00
Andrei Otcheretianski
ba9b2ceaa2 wifi: nl80211: Add NAN Discovery Window (DW) notification
This notification will be used by the device to inform user space
about upcoming DW. When received, user space will be able to prepare
multicast Service Discovery Frames (SDFs) to be transmitted during the
next DW using %NL80211_CMD_FRAME command on the NAN management interface.
The device/driver will take care to transmit the frames in the correct
timing. This allows to implement a synchronized Discovery Engine (DE)
in user space, if the device doesn't support DE offload.
Note that this notification can be sent before the actual DW starts as
long as the driver/device handles the actual timing of the SDF
transmission.

Signed-off-by: Andrei Otcheretianski <andrei.otcheretianski@intel.com>
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
Link: https://patch.msgid.link/20250908140015.0e1d15031bab.I5b1721e61b63910452b3c5cdcdc1e94cb094d4c9@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-09-19 11:26:21 +02:00
Andrei Otcheretianski
01b4a3061b wifi: nl80211: Add more configuration options for NAN commands
Current NAN APIs have only basic configuration for master
preference and operating bands. Add and parse additional parameters
which provide more control over NAN synchronization. The newly added
attributes allow to publish additional NAN attributes and vendor
elements in NAN beacons, control scan and discovery beacons
periodicity, enable/disable DW notifications etc.

Signed-off-by: Andrei Otcheretianski <andrei.otcheretianski@intel.com>
tested: Miriam Rachel Korenblit <miriam.rachel.korenblit@intel.com>
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
Link: https://patch.msgid.link/20250908140015.a4779492bf8e.I375feb919bd72358173766b9fe10010c40796b33@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-09-19 11:26:21 +02:00
Jakub Kicinski
f2cdc4c22b Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR (net-6.17-rc7).

No conflicts.

Adjacent changes:

drivers/net/ethernet/mellanox/mlx5/core/en/fs.h
  9536fbe10c ("net/mlx5e: Add PSP steering in local NIC RX")
  7601a0a462 ("net/mlx5e: Add a miss level for ipsec crypto offload")

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-09-18 11:26:06 -07:00
Eric Dumazet
87ebb628a5 net: clear sk->sk_ino in sk_set_socket(sk, NULL)
Andrei Vagin reported that blamed commit broke CRIU.

Indeed, while we want to keep sk_uid unchanged when a socket
is cloned, we want to clear sk->sk_ino.

Otherwise, sock_diag might report multiple sockets sharing
the same inode number.

Move the clearing part from sock_orphan() to sk_set_socket(sk, NULL),
called both from sock_orphan() and sk_clone_lock().

Fixes: 5d6b58c932 ("net: lockless sock_i_ino()")
Closes: https://lore.kernel.org/netdev/aMhX-VnXkYDpKd9V@google.com/
Closes: https://github.com/checkpoint-restore/criu/issues/2744
Reported-by: Andrei Vagin <avagin@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Andrei Vagin <avagin@google.com>
Link: https://patch.msgid.link/20250917135337.1736101-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-09-18 07:47:17 -07:00
Paolo Abeni
64d2616972 Merge branch 'add-basic-psp-encryption-for-tcp-connections'
Daniel Zahka says:

==================
add basic PSP encryption for TCP connections

This is v13 of the PSP RFC [1] posted by Jakub Kicinski one year
ago. General developments since v1 include a fork of packetdrill [2]
with support for PSP added, as well as some test cases, and an
implementation of PSP key exchange and connection upgrade [3]
integrated into the fbthrift RPC library. Both [2] and [3] have been
tested on server platforms with PSP-capable CX7 NICs. Below is the
cover letter from the original RFC:

Add support for PSP encryption of TCP connections.

PSP is a protocol out of Google:
https://github.com/google/psp/blob/main/doc/PSP_Arch_Spec.pdf
which shares some similarities with IPsec. I added some more info
in the first patch so I'll keep it short here.

The protocol can work in multiple modes including tunneling.
But I'm mostly interested in using it as TLS replacement because
of its superior offload characteristics. So this patch does three
things:

 - it adds "core" PSP code
   PSP is offload-centric, and requires some additional care and
   feeding, so first chunk of the code exposes device info.
   This part can be reused by PSP implementations in xfrm, tunneling etc.

 - TCP integration TLS style
   Reuse some of the existing concepts from TLS offload, such as
   attaching crypto state to a socket, marking skbs as "decrypted",
   egress validation. PSP does not prescribe key exchange protocols.
   To use PSP as a more efficient TLS offload we intend to perform
   a TLS handshake ("inline" in the same TCP connection) and negotiate
   switching to PSP based on capabilities of both endpoints.
   This is also why I'm not including a software implementation.
   Nobody would use it in production, software TLS is faster,
   it has larger crypto records.

 - mlx5 implementation
   That's mostly other people's work, not 100% sure those folks
   consider it ready hence the RFC in the title. But it works :)

Not posted, queued a branch [4] are follow up pieces:
 - standard stats
 - netdevsim implementation and tests

[1] https://lore.kernel.org/netdev/20240510030435.120935-1-kuba@kernel.org/
[2] https://github.com/danieldzahka/packetdrill
[3] https://github.com/danieldzahka/fbthrift/tree/dzahka/psp
[4] https://github.com/kuba-moo/linux/tree/psp

Comments we intend to defer to future series:
   - we prefer to keep the version field in the tx-assoc netlink
     request, because it makes parsing keys require less state early
     on, but we are willing to change in the next version of this
     series.
   - using a static branch to wrap psp_enqueue_set_decrypted() and
     other functions called from tcp.
   - using INDIRECT_CALL for tls/psp in sk_validate_xmit_skb(). We
     prefer to address this in a dedicated patch series, so that this
     series does not need to modify the way tls_validate_xmit_skb() is
     declared and stubbed out.

v12: https://lore.kernel.org/netdev/20250916000559.1320151-1-kuba@kernel.org/
v11: https://lore.kernel.org/20250911014735.118695-1-daniel.zahka@gmail.com
v10: https://lore.kernel.org/netdev/20250828162953.2707727-1-daniel.zahka@gmail.com/
v9: https://lore.kernel.org/netdev/20250827155340.2738246-1-daniel.zahka@gmail.com/
v8: https://lore.kernel.org/netdev/20250825200112.1750547-1-daniel.zahka@gmail.com/
v7: https://lore.kernel.org/netdev/20250820113120.992829-1-daniel.zahka@gmail.com/
v6: https://lore.kernel.org/netdev/20250812003009.2455540-1-daniel.zahka@gmail.com/
v5: https://lore.kernel.org/netdev/20250723203454.519540-1-daniel.zahka@gmail.com/
v4: https://lore.kernel.org/netdev/20250716144551.3646755-1-daniel.zahka@gmail.com/
v3: https://lore.kernel.org/netdev/20250702171326.3265825-1-daniel.zahka@gmail.com/
v2: https://lore.kernel.org/netdev/20250625135210.2975231-1-daniel.zahka@gmail.com/
v1: https://lore.kernel.org/netdev/20240510030435.120935-1-kuba@kernel.org/
==================

Links: https://patch.msgid.link/20250917000954.859376-1-daniel.zahka@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
* add-basic-psp-encryption-for-tcp-connections:
  net/mlx5e: Implement PSP key_rotate operation
  net/mlx5e: Add Rx data path offload
  psp: provide decapsulation and receive helper for drivers
  net/mlx5e: Configure PSP Rx flow steering rules
  net/mlx5e: Add PSP steering in local NIC RX
  net/mlx5e: Implement PSP Tx data path
  psp: provide encapsulation helper for drivers
  net/mlx5e: Implement PSP operations .assoc_add and .assoc_del
  net/mlx5e: Support PSP offload functionality
  psp: track generations of device key
  net: psp: update the TCP MSS to reflect PSP packet overhead
  net: psp: add socket security association code
  net: tcp: allow tcp_timewait_sock to validate skbs before handing to device
  net: move sk_validate_xmit_skb() to net/core/dev.c
  psp: add op for rotation of device key
  tcp: add datapath logic for PSP with inline key exchange
  net: modify core data structures for PSP datapath support
  psp: base PSP device support
  psp: add documentation
2025-09-18 12:38:34 +02:00
Raed Salem
0eddb8023c psp: provide decapsulation and receive helper for drivers
Create psp_dev_rcv(), which drivers can call to psp decapsulate and attach
a psp_skb_ext to an skb.

psp_dev_rcv() only supports what the PSP architecture specification
refers to as "transport mode" packets, where the L3 header is either
IPv6 or IPv4.

Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Raed Salem <raeds@nvidia.com>
Signed-off-by: Rahul Rameshbabu <rrameshbabu@nvidia.com>
Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com>
Co-developed-by: Daniel Zahka <daniel.zahka@gmail.com>
Signed-off-by: Daniel Zahka <daniel.zahka@gmail.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250917000954.859376-18-daniel.zahka@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-09-18 12:32:07 +02:00
Raed Salem
fc72451574 psp: provide encapsulation helper for drivers
Create a new function psp_encapsulate(), which takes a TCP packet and
PSP encapsulates it according to the "Transport Mode Packet Format"
section of the PSP Architecture Specification.

psp_encapsulate() does not push a PSP trailer onto the skb. Both IPv6
and IPv4 are supported. Virtualization cookie is not included.

Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Raed Salem <raeds@nvidia.com>
Signed-off-by: Rahul Rameshbabu <rrameshbabu@nvidia.com>
Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com>
Co-developed-by: Daniel Zahka <daniel.zahka@gmail.com>
Signed-off-by: Daniel Zahka <daniel.zahka@gmail.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250917000954.859376-14-daniel.zahka@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-09-18 12:32:07 +02:00
Jakub Kicinski
e78851058b psp: track generations of device key
There is a (somewhat theoretical in absence of multi-host support)
possibility that another entity will rotate the key and we won't
know. This may lead to accepting packets with matching SPI but
which used different crypto keys than we expected.

The PSP Architecture specification mentions that an implementation
should track device key generation when device keys are managed by the
NIC. Some PSP implementations may opt to include this key generation
state in decryption metadata each time a device key is used to decrypt
a packet. If that is the case, that key generation counter can also be
used when policy checking a decrypted skb against a psp_assoc. This is
an optional feature that is not explicitly part of the PSP spec, but
can provide additional security in the case where an attacker may have
the ability to force key rotations faster than rekeying can occur.

Since we're tracking "key generations" more explicitly now,
maintain different lists for associations from different generations.
This way we can catch stale associations (the user space should
listen to rotation notifications and change the keys).

Drivers can "opt out" of generation tracking by setting
the generation value to 0.

Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Daniel Zahka <daniel.zahka@gmail.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250917000954.859376-11-daniel.zahka@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-09-18 12:32:06 +02:00
Jakub Kicinski
e97269257f net: psp: update the TCP MSS to reflect PSP packet overhead
PSP eats 40B of header space. Adjust MSS appropriately.

We can either modify tcp_mtu_to_mss() / tcp_mss_to_mtu()
or reuse icsk_ext_hdr_len. The former option is more TCP
specific and has runtime overhead. The latter is a bit
of a hack as PSP is not an ext_hdr. If one squints hard
enough, UDP encap is just a more practical version of
IPv6 exthdr, so go with the latter. Happy to change.

Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Daniel Zahka <daniel.zahka@gmail.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250917000954.859376-10-daniel.zahka@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-09-18 12:32:06 +02:00
Jakub Kicinski
6b46ca260e net: psp: add socket security association code
Add the ability to install PSP Rx and Tx crypto keys on TCP
connections. Netlink ops are provided for both operations.
Rx side combines allocating a new Rx key and installing it
on the socket. Theoretically these are separate actions,
but in practice they will always be used one after the
other. We can add distinct "alloc" and "install" ops later.

Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Co-developed-by: Daniel Zahka <daniel.zahka@gmail.com>
Signed-off-by: Daniel Zahka <daniel.zahka@gmail.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250917000954.859376-9-daniel.zahka@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-09-18 12:32:06 +02:00
Daniel Zahka
0917bb139e net: tcp: allow tcp_timewait_sock to validate skbs before handing to device
Provide a callback to validate skb's originating from tcp timewait
socks before passing to the device layer. Full socks have a
sk_validate_xmit_skb member for checking that a device is capable of
performing offloads required for transmitting an skb. With psp, tcp
timewait socks will inherit the crypto state from their corresponding
full socks. Any ACKs or RSTs that originate from a tcp timewait sock
carrying psp state should be psp encapsulated.

Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Daniel Zahka <daniel.zahka@gmail.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250917000954.859376-8-daniel.zahka@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-09-18 12:32:06 +02:00
Daniel Zahka
8c511c1df3 net: move sk_validate_xmit_skb() to net/core/dev.c
Move definition of sk_validate_xmit_skb() from net/core/sock.c to
net/core/dev.c.

This change is in preparation of the next patch, where
sk_validate_xmit_skb() will need to cast sk to a tcp_timewait_sock *,
and access member fields. Including linux/tcp.h from linux/sock.h
creates a circular dependency, and dev.c is the only current call site
of this function.

Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Daniel Zahka <daniel.zahka@gmail.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250917000954.859376-7-daniel.zahka@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-09-18 12:32:06 +02:00
Jakub Kicinski
117f02a49b psp: add op for rotation of device key
Rotating the device key is a key part of the PSP protocol design.
Some external daemon needs to do it once a day, or so.
Add a netlink op to perform this operation.
Add a notification group for informing users that key has been
rotated and they should rekey (next rotation will cut them off).

Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Daniel Zahka <daniel.zahka@gmail.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250917000954.859376-6-daniel.zahka@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-09-18 12:32:06 +02:00
Jakub Kicinski
659a2899a5 tcp: add datapath logic for PSP with inline key exchange
Add validation points and state propagation to support PSP key
exchange inline, on TCP connections. The expectation is that
application will use some well established mechanism like TLS
handshake to establish a secure channel over the connection and
if both endpoints are PSP-capable - exchange and install PSP keys.
Because the connection can existing in PSP-unsecured and PSP-secured
state we need to make sure that there are no race conditions or
retransmission leaks.

On Tx - mark packets with the skb->decrypted bit when PSP key
is at the enqueue time. Drivers should only encrypt packets with
this bit set. This prevents retransmissions getting encrypted when
original transmission was not. Similarly to TLS, we'll use
sk->sk_validate_xmit_skb to make sure PSP skbs can't "escape"
via a PSP-unaware device without being encrypted.

On Rx - validation is done under socket lock. This moves the validation
point later than xfrm, for example. Please see the documentation patch
for more details on the flow of securing a connection, but for
the purpose of this patch what's important is that we want to
enforce the invariant that once connection is secured any skb
in the receive queue has been encrypted with PSP.

Add GRO and coalescing checks to prevent PSP authenticated data from
being combined with cleartext data, or data with non-matching PSP
state. On Rx, check skb's with psp_skb_coalesce_diff() at points
before psp_sk_rx_policy_check(). After skb's are policy checked and on
the socket receive queue, skb_cmp_decrypted() is sufficient for
checking for coalescable PSP state. On Tx, tcp_write_collapse_fence()
should be called when transitioning a socket into PSP Tx state to
prevent data sent as cleartext from being coalesced with PSP
encapsulated data.

This change only adds the validation points, for ease of review.
Subsequent change will add the ability to install keys, and flesh
the enforcement logic out

Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Co-developed-by: Daniel Zahka <daniel.zahka@gmail.com>
Signed-off-by: Daniel Zahka <daniel.zahka@gmail.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250917000954.859376-5-daniel.zahka@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-09-18 12:32:06 +02:00
Jakub Kicinski
ed8a507b74 net: modify core data structures for PSP datapath support
Add pointers to psp data structures to core networking structs,
and an SKB extension to carry the PSP information from the drivers
to the socket layer.

Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Co-developed-by: Daniel Zahka <daniel.zahka@gmail.com>
Signed-off-by: Daniel Zahka <daniel.zahka@gmail.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250917000954.859376-4-daniel.zahka@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-09-18 12:32:06 +02:00
Jakub Kicinski
00c94ca2b9 psp: base PSP device support
Add a netlink family for PSP and allow drivers to register support.

The "PSP device" is its own object. This allows us to perform more
flexible reference counting / lifetime control than if PSP information
was part of net_device. In the future we should also be able
to "delegate" PSP access to software devices, such as *vlan, veth
or netkit more easily.

Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Daniel Zahka <daniel.zahka@gmail.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250917000954.859376-3-daniel.zahka@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-09-18 12:32:06 +02:00
Eric Dumazet
3cd04c8f4a udp: make busylock per socket
While having all spinlocks packed into an array was a space saver,
this also caused NUMA imbalance and hash collisions.

UDPv6 socket size becomes 1600 after this patch.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250916160951.541279-10-edumazet@google.com
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-09-18 10:17:10 +02:00
Eric Dumazet
9db27c8062 udp: add udp_drops_inc() helper
Generic sk_drops_inc() reads sk->sk_drop_counters.
We know the precise location for UDP sockets.

Move sk_drop_counters out of sock_read_rxtx
so that sock_write_rxtx starts at a cache line boundary.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250916160951.541279-9-edumazet@google.com
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-09-18 10:17:10 +02:00
Eric Dumazet
4effb335b5 net: group sk_backlog and sk_receive_queue
UDP receivers suffer from sk_rmem_alloc updates,
currently sharing a cache line with fields that
need to be read-mostly (sock_read_rx group):

1) RFS enabled hosts read sk_napi_id
from __udpv6_queue_rcv_skb().

2) sk->sk_rcvbuf is read from __udp_enqueue_schedule_skb()

/* --- cacheline 3 boundary (192 bytes) --- */
struct {
    atomic_t           rmem_alloc;           /*  0xc0   0x4 */   // Oops
    int                len;                  /*  0xc4   0x4 */
    struct sk_buff *   head;                 /*  0xc8   0x8 */
    struct sk_buff *   tail;                 /*  0xd0   0x8 */
} sk_backlog;                                /*  0xc0  0x18 */
__u8                       __cacheline_group_end__sock_write_rx[0]; /*  0xd8     0 */
__u8                       __cacheline_group_begin__sock_read_rx[0]; /*  0xd8     0 */
struct dst_entry *         sk_rx_dst;        /*  0xd8   0x8 */
int                        sk_rx_dst_ifindex;/*  0xe0   0x4 */
u32                        sk_rx_dst_cookie; /*  0xe4   0x4 */
unsigned int               sk_ll_usec;       /*  0xe8   0x4 */
unsigned int               sk_napi_id;       /*  0xec   0x4 */
u16                        sk_busy_poll_budget;/*  0xf0   0x2 */
u8                         sk_prefer_busy_poll;/*  0xf2   0x1 */
u8                         sk_userlocks;     /*  0xf3   0x1 */
int                        sk_rcvbuf;        /*  0xf4   0x4 */
struct sk_filter *         sk_filter;        /*  0xf8   0x8 */

Move sk_error (which is less often dirtied) there.

Alternative would be to cache align sock_read_rx but
this has more implications/risks.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250916160951.541279-8-edumazet@google.com
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-09-18 10:17:10 +02:00
Eric Dumazet
5489f333ef ipv6: make ipv6_pinfo.daddr_cache a boolean
ipv6_pinfo.daddr_cache is either NULL or &sk->sk_v6_daddr

We do not need 8 bytes, a boolean is enough.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250916160951.541279-3-edumazet@google.com
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-09-18 10:17:09 +02:00
Eric Dumazet
3fbb2a6f3a ipv6: make ipv6_pinfo.saddr_cache a boolean
ipv6_pinfo.saddr_cache is either NULL or &np->saddr.

We do not need 8 bytes, a boolean is enough.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250916160951.541279-2-edumazet@google.com
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-09-18 10:17:09 +02:00
Ilpo Järvinen
fe2cddc648 tcp: accecn: AccECN option ceb/cep and ACE field multi-wrap heuristics
The AccECN option ceb/cep heuristic algorithm is from AccECN spec
Appendix A.2.2 to mitigate against false ACE field overflows. Armed
with ceb delta from option, delivered bytes, and delivered packets it
is possible to estimate how many times ACE field wrapped.

This calculation is necessary only if more than one wrap is possible.
Without SACK, delivered bytes and packets are not always trustworthy in
which case TCP falls back to the simpler no-or-all wraps ceb algorithm.

Signed-off-by: Ilpo Järvinen <ij@kernel.org>
Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250916082434.100722-10-chia-yu.chang@nokia-bell-labs.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-09-18 08:47:52 +02:00
Chia-Yu Chang
b40671b5ee tcp: accecn: AccECN option failure handling
AccECN option may fail in various way, handle these:
- Attempt to negotiate the use of AccECN on the 1st retransmitted SYN
	- From the 2nd retransmitted SYN, stop AccECN negotiation
- Remove option from SYN/ACK rexmits to handle blackholes
- If no option arrives in SYN/ACK, assume Option is not usable
        - If an option arrives later, re-enabled
- If option is zeroed, disable AccECN option processing

This patch use existing padding bits in tcp_request_sock and
holes in tcp_sock without increasing the size.

Signed-off-by: Ilpo Järvinen <ij@kernel.org>
Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250916082434.100722-9-chia-yu.chang@nokia-bell-labs.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-09-18 08:47:52 +02:00
Chia-Yu Chang
aa55a7dde7 tcp: accecn: AccECN option send control
Instead of sending the option in every ACK, limit sending to
those ACKs where the option is necessary:
- Handshake
- "Change-triggered ACK" + the ACK following it. The
  2nd ACK is necessary to unambiguously indicate which
  of the ECN byte counters in increasing. The first
  ACK has two counters increasing due to the ecnfield
  edge.
- ACKs with CE to allow CEP delta validations to take
  advantage of the option.
- Force option to be sent every at least once per 2^22
  bytes. The check is done using the bit edges of the
  byte counters (avoids need for extra variables).
- AccECN option beacon to send a few times per RTT even if
  nothing in the ECN state requires that. The default is 3
  times per RTT, and its period can be set via
  sysctl_tcp_ecn_option_beacon.

Below are the pahole outcomes before and after this patch,
in which the group size of tcp_sock_write_tx is increased
from 89 to 97 due to the new u64 accecn_opt_tstamp member:

[BEFORE THIS PATCH]
struct tcp_sock {
    [...]
    u64                        tcp_wstamp_ns;        /*  2488     8 */
    struct list_head           tsorted_sent_queue;   /*  2496    16 */

    [...]
    __cacheline_group_end__tcp_sock_write_tx[0];     /*  2521     0 */
    __cacheline_group_begin__tcp_sock_write_txrx[0]; /*  2521     0 */
    u8                         nonagle:4;            /*  2521: 0  1 */
    u8                         rate_app_limited:1;   /*  2521: 4  1 */
    /* XXX 3 bits hole, try to pack */

    /* Force alignment to the next boundary: */
    u8                         :0;
    u8                         received_ce_pending:4;/*  2522: 0  1 */
    u8                         unused2:4;            /*  2522: 4  1 */
    u8                         accecn_minlen:2;      /*  2523: 0  1 */
    u8                         est_ecnfield:2;       /*  2523: 2  1 */
    u8                         unused3:4;            /*  2523: 4  1 */

    [...]
    __cacheline_group_end__tcp_sock_write_txrx[0];   /*  2628     0 */

    [...]
    /* size: 3200, cachelines: 50, members: 171 */
}

[AFTER THIS PATCH]
struct tcp_sock {
    [...]
    u64                        tcp_wstamp_ns;        /*  2488     8 */
    u64                        accecn_opt_tstamp;    /*  2596     8 */
    struct list_head           tsorted_sent_queue;   /*  2504    16 */

    [...]
    __cacheline_group_end__tcp_sock_write_tx[0];     /*  2529     0 */
    __cacheline_group_begin__tcp_sock_write_txrx[0]; /*  2529     0 */
    u8                         nonagle:4;            /*  2529: 0  1 */
    u8                         rate_app_limited:1;   /*  2529: 4  1 */
    /* XXX 3 bits hole, try to pack */

    /* Force alignment to the next boundary: */
    u8                         :0;
    u8                         received_ce_pending:4;/*  2530: 0  1 */
    u8                         unused2:4;            /*  2530: 4  1 */
    u8                         accecn_minlen:2;      /*  2531: 0  1 */
    u8                         est_ecnfield:2;       /*  2531: 2  1 */
    u8                         accecn_opt_demand:2;  /*  2531: 4  1 */
    u8                         prev_ecnfield:2;      /*  2531: 6  1 */

    [...]
    __cacheline_group_end__tcp_sock_write_txrx[0];   /*  2636     0 */

    [...]
    /* size: 3200, cachelines: 50, members: 173 */
}

Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
Co-developed-by: Ilpo Järvinen <ij@kernel.org>
Signed-off-by: Ilpo Järvinen <ij@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250916082434.100722-8-chia-yu.chang@nokia-bell-labs.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-09-18 08:47:52 +02:00
Ilpo Järvinen
b5e74132df tcp: accecn: AccECN option
The Accurate ECN allows echoing back the sum of bytes for
each IP ECN field value in the received packets using
AccECN option. This change implements AccECN option tx & rx
side processing without option send control related features
that are added by a later change.

Based on specification:
  https://tools.ietf.org/id/draft-ietf-tcpm-accurate-ecn-28.txt
(Some features of the spec will be added in the later changes
rather than in this one).

A full-length AccECN option is always attempted but if it does
not fit, the minimum length is selected based on the counters
that have changed since the last update. The AccECN option
(with 24-bit fields) often ends in odd sizes so the option
write code tries to take advantage of some nop used to pad
the other TCP options.

The delivered_ecn_bytes pairs with received_ecn_bytes similar
to how delivered_ce pairs with received_ce. In contrast to
ACE field, however, the option is not always available to update
delivered_ecn_bytes. For ACK w/o AccECN option, the delivered
bytes calculated based on the cumulative ACK+SACK information
are assigned to one of the counters using an estimation
heuristic to select the most likely ECN byte counter. Any
estimation error is corrected when the next AccECN option
arrives. It may occur that the heuristic gets too confused
when there are enough different byte counter deltas between
ACKs with the AccECN option in which case the heuristic just
gives up on updating the counters for a while.

tcp_ecn_option sysctl can be used to select option sending
mode for AccECN: TCP_ECN_OPTION_DISABLED, TCP_ECN_OPTION_MINIMUM,
and TCP_ECN_OPTION_FULL.

This patch increases the size of tcp_info struct, as there is
no existing holes for new u32 variables. Below are the pahole
outcomes before and after this patch:

[BEFORE THIS PATCH]
struct tcp_info {
    [...]
     __u32                     tcpi_total_rto_time;  /*   244     4 */

    /* size: 248, cachelines: 4, members: 61 */
}

[AFTER THIS PATCH]
struct tcp_info {
    [...]
    __u32                      tcpi_total_rto_time;  /*   244     4 */
    __u32                      tcpi_received_ce;     /*   248     4 */
    __u32                      tcpi_delivered_e1_bytes; /*   252     4 */
    __u32                      tcpi_delivered_e0_bytes; /*   256     4 */
    __u32                      tcpi_delivered_ce_bytes; /*   260     4 */
    __u32                      tcpi_received_e1_bytes; /*   264     4 */
    __u32                      tcpi_received_e0_bytes; /*   268     4 */
    __u32                      tcpi_received_ce_bytes; /*   272     4 */

    /* size: 280, cachelines: 5, members: 68 */
}

This patch uses the existing 1-byte holes in the tcp_sock_write_txrx
group for new u8 members, but adds a 4-byte hole in tcp_sock_write_rx
group after the new u32 delivered_ecn_bytes[3] member. Therefore, the
group size of tcp_sock_write_rx is increased from 96 to 112. Below
are the pahole outcomes before and after this patch:

[BEFORE THIS PATCH]
struct tcp_sock {
    [...]
    u8                         received_ce_pending:4; /*  2522: 0  1 */
    u8                         unused2:4;             /*  2522: 4  1 */
    /* XXX 1 byte hole, try to pack */

    [...]
    u32                        rcv_rtt_last_tsecr;    /*  2668     4 */

    [...]
    __cacheline_group_end__tcp_sock_write_rx[0];      /*  2728     0 */

    [...]
    /* size: 3200, cachelines: 50, members: 167 */
}

[AFTER THIS PATCH]
struct tcp_sock {
    [...]
    u8                         received_ce_pending:4;/*  2522: 0  1 */
    u8                         unused2:4;            /*  2522: 4  1 */
    u8                         accecn_minlen:2;      /*  2523: 0  1 */
    u8                         est_ecnfield:2;       /*  2523: 2  1 */
    u8                         unused3:4;            /*  2523: 4  1 */

    [...]
    u32                        rcv_rtt_last_tsecr;   /*  2668     4 */
    u32                        delivered_ecn_bytes[3];/*  2672    12 */
    /* XXX 4 bytes hole, try to pack */

    [...]
    __cacheline_group_end__tcp_sock_write_rx[0];     /*  2744     0 */

    [...]
    /* size: 3200, cachelines: 50, members: 171 */
}

Signed-off-by: Ilpo Järvinen <ij@kernel.org>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Co-developed-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250916082434.100722-7-chia-yu.chang@nokia-bell-labs.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-09-18 08:47:52 +02:00
Ilpo Järvinen
9a01127744 tcp: accecn: add AccECN rx byte counters
These three byte counters track IP ECN field payload byte sums for
all arriving (acceptable) packets for ECT0, ECT1, and CE. The
AccECN option (added by a later patch in the series) echoes these
counters back to sender side; therefore, it is placed within the
group of tcp_sock_write_txrx.

Below are the pahole outcomes before and after this patch, in which
the group size of tcp_sock_write_txrx is increased from 95 + 4 to
107 + 4 and an extra 4-byte hole is created but will be exploited
in later patches:

[BEFORE THIS PATCH]
struct tcp_sock {
    [...]
    u32                        delivered_ce;         /*  2576     4 */
    u32                        received_ce;          /*  2580     4 */
    u32                        app_limited;          /*  2584     4 */
    u32                        rcv_wnd;              /*  2588     4 */
    struct tcp_options_received rx_opt;              /*  2592    24 */
    __cacheline_group_end__tcp_sock_write_txrx[0];   /*  2616     0 */

    [...]
    /* size: 3200, cachelines: 50, members: 166 */
}

[AFTER THIS PATCH]
struct tcp_sock {
    [...]
    u32                        delivered_ce;         /*  2576     4 */
    u32                        received_ce;          /*  2580     4 */
    u32                        received_ecn_bytes[3];/*  2584    12 */
    u32                        app_limited;          /*  2596     4 */
    u32                        rcv_wnd;              /*  2600     4 */
    struct tcp_options_received rx_opt;              /*  2604    24 */
    __cacheline_group_end__tcp_sock_write_txrx[0];   /*  2628     0 */
    /* XXX 4 bytes hole, try to pack */

    [...]
    /* size: 3200, cachelines: 50, members: 167 */
}

Signed-off-by: Ilpo Järvinen <ij@kernel.org>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Co-developed-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250916082434.100722-4-chia-yu.chang@nokia-bell-labs.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-09-18 08:47:51 +02:00
Ilpo Järvinen
3cae34274c tcp: accecn: AccECN negotiation
Accurate ECN negotiation parts based on the specification:
  https://tools.ietf.org/id/draft-ietf-tcpm-accurate-ecn-28.txt

Accurate ECN is negotiated using ECE, CWR and AE flags in the
TCP header. TCP falls back into using RFC3168 ECN if one of the
ends supports only RFC3168-style ECN.

The AccECN negotiation includes reflecting IP ECN field value
seen in SYN and SYNACK back using the same bits as negotiation
to allow responding to SYN CE marks and to detect ECN field
mangling. CE marks should not occur currently because SYN=1
segments are sent with Non-ECT in IP ECN field (but proposal
exists to remove this restriction).

Reflecting SYN IP ECN field in SYNACK is relatively simple.
Reflecting SYNACK IP ECN field in the final/third ACK of
the handshake is more challenging. Linux TCP code is not well
prepared for using the final/third ACK a signalling channel
which makes things somewhat complicated here.

tcp_ecn sysctl can be used to select the highest ECN variant
(Accurate ECN, ECN, No ECN) that is attemped to be negotiated and
requested for incoming connection and outgoing connection:
TCP_ECN_IN_NOECN_OUT_NOECN, TCP_ECN_IN_ECN_OUT_ECN,
TCP_ECN_IN_ECN_OUT_NOECN, TCP_ECN_IN_ACCECN_OUT_ACCECN,
TCP_ECN_IN_ACCECN_OUT_ECN, and TCP_ECN_IN_ACCECN_OUT_NOECN.

After this patch, the size of tcp_request_sock remains unchanged
and no new holes are added. Below are the pahole outcomes before
and after this patch:

[BEFORE THIS PATCH]
struct tcp_request_sock {
    [...]
    u32                        rcv_nxt;              /*   352     4 */
    u8                         syn_tos;              /*   356     1 */

    /* size: 360, cachelines: 6, members: 16 */
}

[AFTER THIS PATCH]
struct tcp_request_sock {
    [...]
    u32                        rcv_nxt;              /*   352     4 */
    u8                         syn_tos;              /*   356     1 */
    bool                       accecn_ok;            /*   357     1 */
    u8                         syn_ect_snt:2;        /*   358: 0  1 */
    u8                         syn_ect_rcv:2;        /*   358: 2  1 */
    u8                         accecn_fail_mode:4;   /*   358: 4  1 */

    /* size: 360, cachelines: 6, members: 20 */
}

After this patch, the size of tcp_sock remains unchanged and no new
holes are added. Also, 4 bits of the existing 2-byte hole are exploited.
Below are the pahole outcomes before and after this patch:

[BEFORE THIS PATCH]
struct tcp_sock {
    [...]
    u8                         dup_ack_counter:2;    /*  2761: 0  1 */
    u8                         tlp_retrans:1;        /*  2761: 2  1 */
    u8                         unused:5;             /*  2761: 3  1 */
    u8                         thin_lto:1;           /*  2762: 0  1 */
    u8                         fastopen_connect:1;   /*  2762: 1  1 */
    u8                         fastopen_no_cookie:1; /*  2762: 2  1 */
    u8                         fastopen_client_fail:2; /*  2762: 3  1 */
    u8                         frto:1;               /*  2762: 5  1 */
    /* XXX 2 bits hole, try to pack */

    [...]
    u8                         keepalive_probes;     /*  2765     1 */
    /* XXX 2 bytes hole, try to pack */

    [...]
    /* size: 3200, cachelines: 50, members: 164 */
}

[AFTER THIS PATCH]
struct tcp_sock {
    [...]
    u8                         dup_ack_counter:2;    /*  2761: 0  1 */
    u8                         tlp_retrans:1;        /*  2761: 2  1 */
    u8                         syn_ect_snt:2;        /*  2761: 3  1 */
    u8                         syn_ect_rcv:2;        /*  2761: 5  1 */
    u8                         thin_lto:1;           /*  2761: 7  1 */
    u8                         fastopen_connect:1;   /*  2762: 0  1 */
    u8                         fastopen_no_cookie:1; /*  2762: 1  1 */
    u8                         fastopen_client_fail:2; /*  2762: 2  1 */
    u8                         frto:1;               /*  2762: 4  1 */
    /* XXX 3 bits hole, try to pack */

    [...]
    u8                         keepalive_probes;     /*  2765     1 */
    u8                         accecn_fail_mode:4;   /*  2766: 0  1 */
    /* XXX 4 bits hole, try to pack */
    /* XXX 1 byte hole, try to pack */

    [...]
    /* size: 3200, cachelines: 50, members: 166 */
}

Signed-off-by: Ilpo Järvinen <ij@kernel.org>
Co-developed-by: Olivier Tilmans <olivier.tilmans@nokia.com>
Signed-off-by: Olivier Tilmans <olivier.tilmans@nokia.com>
Co-developed-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250916082434.100722-3-chia-yu.chang@nokia-bell-labs.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-09-18 08:47:51 +02:00
Ilpo Järvinen
542a495cba tcp: AccECN core
This change implements Accurate ECN without negotiation and
AccECN Option (that will be added by later changes). Based on
AccECN specifications:
  https://tools.ietf.org/id/draft-ietf-tcpm-accurate-ecn-28.txt

Accurate ECN allows feeding back the number of CE (congestion
experienced) marks accurately to the sender in contrast to
RFC3168 ECN that can only signal one marks-seen-yes/no per RTT.
Congestion control algorithms can take advantage of the accurate
ECN information to fine-tune their congestion response to avoid
drastic rate reduction when only mild congestion is encountered.

With Accurate ECN, tp->received_ce (r.cep in AccECN spec) keeps
track of how many segments have arrived with a CE mark. Accurate
ECN uses ACE field (ECE, CWR, AE) to communicate the value back
to the sender which updates tp->delivered_ce (s.cep) based on the
feedback. This signalling channel is lossy when ACE field overflow
occurs.

Conservative strategy is selected here to deal with the ACE
overflow, however, some strategies using the AccECN option later
in the overall patchset mitigate against false overflows detected.

The ACE field values on the wire are offset by
TCP_ACCECN_CEP_INIT_OFFSET. Delivered_ce/received_ce count the
real CE marks rather than forcing all downstream users to adapt
to the wire offset.

This patch uses the first 1-byte hole and the last 4-byte hole of
the tcp_sock_write_txrx for 'received_ce_pending' and 'received_ce'.
Also, the group size of tcp_sock_write_txrx is increased from
91 + 4 to 95 + 4 due to the new u32 received_ce member. Below are
the trimmed pahole outcomes before and after this patch.

[BEFORE THIS PATCH]
struct tcp_sock {
    [...]
    __cacheline_group_begin__tcp_sock_write_txrx[0]; /*  2521     0 */
    u8                         nonagle:4;            /*  2521: 0  1 */
    u8                         rate_app_limited:1;   /*  2521: 4  1 */
    /* XXX 3 bits hole, try to pack */
    /* XXX 2 bytes hole, try to pack */

    [...]
    u32                        delivered_ce;         /*  2576     4 */
    u32                        app_limited;          /*  2580     4 */
    u32                        rcv_wnd;              /*  2684     4 */
    struct tcp_options_received rx_opt;              /*  2688    24 */
    __cacheline_group_end__tcp_sock_write_txrx[0];   /*  2612     0 */
    /* XXX 4 bytes hole, try to pack */

    [...]
    /* size: 3200, cachelines: 50, members: 161 */
}

[AFTER THIS PATCH]
struct tcp_sock {
    [...]
    __cacheline_group_begin__tcp_sock_write_txrx[0]; /*  2521     0 */
    u8                         nonagle:4;            /*  2521: 0  1 */
    u8                         rate_app_limited:1;   /*  2521: 4  1 */
    /* XXX 3 bits hole, try to pack */

    /* Force alignment to the next boundary: */
    u8                         :0;
    u8                         received_ce_pending:4;/*  2522: 0  1 */
    u8                         unused2:4;            /*  2522: 4  1 */
    /* XXX 1 byte hole, try to pack */

    [...]
    u32                        delivered_ce;         /*  2576     4 */
    u32                        received_ce;          /*  2580     4 */
    u32                        app_limited;          /*  2584     4 */
    u32                        rcv_wnd;              /*  2588     4 */
    struct tcp_options_received rx_opt;              /*  2592    24 */
    __cacheline_group_end__tcp_sock_write_txrx[0];   /*  2616     0 */

    [...]
    /* size: 3200, cachelines: 50, members: 164 */
}

Signed-off-by: Ilpo Järvinen <ij@kernel.org>
Co-developed-by: Olivier Tilmans <olivier.tilmans@nokia.com>
Signed-off-by: Olivier Tilmans <olivier.tilmans@nokia.com>
Co-developed-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250916082434.100722-2-chia-yu.chang@nokia-bell-labs.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-09-18 08:47:51 +02:00
Cosmin Ratiu
6bdcb735fe devlink: Add a 'num_doorbells' driverinit param
This parameter can be used by drivers to configure a different number of
doorbells.

Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com>
Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-09-17 18:30:51 -07:00
Chia-Yu Chang
30f5ca0062 tcp: ecn functions in separated include file
The following patches will modify ECN helpers and add AccECN herlpers,
and this patch moves the existing ones into a separated include file.

No functional changes.

Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250911110642.87529-5-chia-yu.chang@nokia-bell-labs.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-09-15 16:26:33 -07:00
Ilpo Järvinen
61b2f7baa9 tcp: fast path functions later
The following patch will use tcp_ecn_mode_accecn(),
TCP_ACCECN_CEP_INIT_OFFSET, TCP_ACCECN_CEP_ACE_MASK in
__tcp_fast_path_on() to make new flag for AccECN.

No functional changes.

Signed-off-by: Ilpo Järvinen <ij@kernel.org>
Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250911110642.87529-3-chia-yu.chang@nokia-bell-labs.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-09-15 16:26:33 -07:00
Ilya Maximets
a9888628cb net: dst_metadata: fix IP_DF bit not extracted from tunnel headers
Both OVS and TC flower allow extracting and matching on the DF bit of
the outer IP header via OVS_TUNNEL_KEY_ATTR_DONT_FRAGMENT in the
OVS_KEY_ATTR_TUNNEL and TCA_FLOWER_KEY_FLAGS_TUNNEL_DONT_FRAGMENT in
the TCA_FLOWER_KEY_ENC_FLAGS respectively.  Flow dissector extracts
this information as FLOW_DIS_F_TUNNEL_DONT_FRAGMENT from the tunnel
info key.

However, the IP_TUNNEL_DONT_FRAGMENT_BIT in the tunnel key is never
actually set, because the tunneling code doesn't actually extract it
from the IP header.  OAM and CRIT_OPT are extracted by the tunnel
implementation code, same code also sets the KEY flag, if present.
UDP tunnel core takes care of setting the CSUM flag if the checksum
is present in the UDP header, but the DONT_FRAGMENT is not handled at
any layer.

Fix that by checking the bit and setting the corresponding flag while
populating the tunnel info in the IP layer where it belongs.

Not using __assign_bit as we don't really need to clear the bit in a
just initialized field.  It also doesn't seem like using __assign_bit
will make the code look better.

Clearly, users didn't rely on this functionality for anything very
important until now.  The reason why this doesn't break OVS logic is
that it only matches on what kernel previously parsed out and if kernel
consistently reports this bit as zero, OVS will only match on it to be
zero, which sort of works.  But it is still a bug that the uAPI reports
and allows matching on the field that is not actually checked in the
packet.  And this is causing misleading -df reporting in OVS datapath
flows, while the tunnel traffic actually has the bit set in most cases.

This may also cause issues if a hardware properly implements support
for tunnel flag matching as it will disagree with the implementation
in a software path of TC flower.

Fixes: 7d5437c709 ("openvswitch: Add tunneling interface.")
Fixes: 1d17568e74 ("net/sched: cls_flower: add support for matching tunnel control flags")
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Link: https://patch.msgid.link/20250909165440.229890-2-i.maximets@ovn.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-09-14 14:28:12 -07:00
Yafang Shao
66048f8b3c net/cls_cgroup: Fix task_get_classid() during qdisc run
During recent testing with the netem qdisc to inject delays into TCP
traffic, we observed that our CLS BPF program failed to function correctly
due to incorrect classid retrieval from task_get_classid(). The issue
manifests in the following call stack:

        bpf_get_cgroup_classid+5
        cls_bpf_classify+507
        __tcf_classify+90
        tcf_classify+217
        __dev_queue_xmit+798
        bond_dev_queue_xmit+43
        __bond_start_xmit+211
        bond_start_xmit+70
        dev_hard_start_xmit+142
        sch_direct_xmit+161
        __qdisc_run+102             <<<<< Issue location
        __dev_xmit_skb+1015
        __dev_queue_xmit+637
        neigh_hh_output+159
        ip_finish_output2+461
        __ip_finish_output+183
        ip_finish_output+41
        ip_output+120
        ip_local_out+94
        __ip_queue_xmit+394
        ip_queue_xmit+21
        __tcp_transmit_skb+2169
        tcp_write_xmit+959
        __tcp_push_pending_frames+55
        tcp_push+264
        tcp_sendmsg_locked+661
        tcp_sendmsg+45
        inet_sendmsg+67
        sock_sendmsg+98
        sock_write_iter+147
        vfs_write+786
        ksys_write+181
        __x64_sys_write+25
        do_syscall_64+56
        entry_SYSCALL_64_after_hwframe+100

The problem occurs when multiple tasks share a single qdisc. In such cases,
__qdisc_run() may transmit skbs created by different tasks. Consequently,
task_get_classid() retrieves an incorrect classid since it references the
current task's context rather than the skb's originating task.

Given that dev_queue_xmit() always executes with bh disabled, we can use
softirq_count() instead to obtain the correct classid.

The simple steps to reproduce this issue:
1. Add network delay to the network interface:
  such as: tc qdisc add dev bond0 root netem delay 1.5ms
2. Build two distinct net_cls cgroups, each with a network-intensive task
3. Initiate parallel TCP streams from both tasks to external servers.

Under this specific condition, the issue reliably occurs. The kernel
eventually dequeues an SKB that originated from Task-A while executing in
the context of Task-B.

It is worth noting that it will change the established behavior for a
slightly different scenario:

  <sock S is created by task A>
  <class ID for task A is changed>
  <skb is created by sock S xmit and classified>

prior to this patch the skb will be classified with the 'new' task A
classid, now with the old/original one. The bpf_get_cgroup_classid_curr()
function is a more appropriate choice for this case.

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Thomas Graf <tgraf@suug.ch>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://patch.msgid.link/20250902062933.30087-1-laoar.shao@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-09-14 11:55:04 -07:00
Eric Dumazet
fdae0ab67d net: use NUMA drop counters for softnet_data.dropped
Hosts under DOS attack can suffer from false sharing
in enqueue_to_backlog() : atomic_inc(&sd->dropped).

This is because sd->dropped can be touched from many cpus,
possibly residing on different NUMA nodes.

Generalize the sk_drop_counters infrastucture
added in commit c51613fa27 ("net: add sk->sk_drop_counters")
and use it to replace softnet_data.dropped
with NUMA friendly softnet_data.drop_counters.

This adds 64 bytes per cpu, maybe more in the future
if we increase the number of counters (currently 2)
per 'struct numa_drop_counters'.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250909121942.1202585-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-09-14 11:35:17 -07:00
Dmitry Safonov
51e547e8c8 tcp: Free TCP-AO/TCP-MD5 info/keys without RCU
Now that the destruction of info/keys is delayed until the socket
destructor, it's safe to use kfree() without an RCU callback.
The socket is in TCP_CLOSE state either because it never left it,
or it's already closed and the refcounter is zero. In any way,
no one can discover it anymore, it's safe to release memory
straight away.

Similar thing was possible for twsk already.

Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Signed-off-by: Dmitry Safonov <dima@arista.com>
Link: https://patch.msgid.link/20250909-b4-tcp-ao-md5-rst-finwait2-v5-2-9ffaaaf8b236@arista.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-09-11 19:05:56 -07:00
Dmitry Safonov
9e472d9e84 tcp: Destroy TCP-AO, TCP-MD5 keys in .sk_destruct()
Currently there are a couple of minor issues with destroying the keys
tcp_v4_destroy_sock():

1. The socket is yet in TCP bind buckets, making it reachable for
   incoming segments [on another CPU core], potentially available to send
   late FIN/ACK/RST replies.

2. There is at least one code path, where tcp_done() is called before
   sending RST [kudos to Bob for investigation]. This is a case of
   a server, that finished sending its data and just called close().

   The socket is in TCP_FIN_WAIT2 and has RCV_SHUTDOWN (set by
   __tcp_close())

   tcp_v4_do_rcv()/tcp_v6_do_rcv()
     tcp_rcv_state_process()            /* LINUX_MIB_TCPABORTONDATA */
       tcp_reset()
         tcp_done_with_error()
           tcp_done()
             inet_csk_destroy_sock()    /* Destroys AO/MD5 keys */
     /* tcp_rcv_state_process() returns SKB_DROP_REASON_TCP_ABORT_ON_DATA */
   tcp_v4_send_reset()                  /* Sends an unsigned RST segment */

   tcpdump:
> 22:53:15.399377 00:00:b2:1f:00:00 > 00:00:01:01:00:00, ethertype IPv4 (0x0800), length 74: (tos 0x0, ttl 64, id 33929, offset 0, flags [DF], proto TCP (6), length 60)
>     1.0.0.1.34567 > 1.0.0.2.49848: Flags [F.], seq 2185658590, ack 3969644355, win 502, options [nop,nop,md5 valid], length 0
> 22:53:15.399396 00:00:01:01:00:00 > 00:00:b2:1f:00:00, ethertype IPv4 (0x0800), length 86: (tos 0x0, ttl 64, id 51951, offset 0, flags [DF], proto TCP (6), length 72)
>     1.0.0.2.49848 > 1.0.0.1.34567: Flags [.], seq 3969644375, ack 2185658591, win 128, options [nop,nop,md5 valid,nop,nop,sack 1 {2185658590:2185658591}], length 0
> 22:53:16.429588 00:00:b2:1f:00:00 > 00:00:01:01:00:00, ethertype IPv4 (0x0800), length 60: (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto TCP (6), length 40)
>     1.0.0.1.34567 > 1.0.0.2.49848: Flags [R], seq 2185658590, win 0, length 0
> 22:53:16.664725 00:00:b2:1f:00:00 > 00:00:01:01:00:00, ethertype IPv4 (0x0800), length 74: (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto TCP (6), length 60)
>     1.0.0.1.34567 > 1.0.0.2.49848: Flags [R], seq 2185658591, win 0, options [nop,nop,md5 valid], length 0
> 22:53:17.289832 00:00:b2:1f:00:00 > 00:00:01:01:00:00, ethertype IPv4 (0x0800), length 74: (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto TCP (6), length 60)
>     1.0.0.1.34567 > 1.0.0.2.49848: Flags [R], seq 2185658591, win 0, options [nop,nop,md5 valid], length 0

  Note the signed RSTs later in the dump - those are sent by the server
  when the fin-wait socket gets removed from hash buckets, by
  the listener socket.

Instead of destroying AO/MD5 info and their keys in inet_csk_destroy_sock(),
slightly delay it until the actual socket .sk_destruct(). As shutdown'ed
socket can yet send non-data replies, they should be signed in order for
the peer to process them. Now it also matches how AO/MD5 gets destructed
for TIME-WAIT sockets (in tcp_twsk_destructor()).

This seems optimal for TCP-MD5, while for TCP-AO it seems to have an
open problem: once RST get sent and socket gets actually destructed,
there is no information on the initial sequence numbers. So, in case
this last RST gets lost in the network, the server's listener socket
won't be able to properly sign another RST. Nothing in RFC 1122
prescribes keeping any local state after non-graceful reset.
Luckily, BGP are known to use keep alive(s).

While the issue is quite minor/cosmetic, these days monitoring network
counters is a common practice and getting invalid signed segments from
a trusted BGP peer can get customers worried.

Investigated-by: Bob Gilligan <gilligan@arista.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Signed-off-by: Dmitry Safonov <dima@arista.com>
Link: https://patch.msgid.link/20250909-b4-tcp-ao-md5-rst-finwait2-v5-1-9ffaaaf8b236@arista.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-09-11 19:05:56 -07:00
Jakub Kicinski
d103f26a5c Plenty of things going on, notably:
- iwlwifi: major cleanups/rework
  - brcmfmac: gets AP isolation support
  - mac80211: gets more S1G support
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEpeA8sTs3M8SN2hR410qiO8sPaAAFAmjCn20ACgkQ10qiO8sP
 aADlog//eLH6dmMJSF5d3HCCXzpABholMM7SgFmeL7zz2JVhMe+TD9sdM972A9/d
 aSTPhFvj3SESiayohbziKD2Fk2EWLYh35MVxP1VQG/pBMdXGryReWxXuMc+/iep8
 wZJjvR7b5dwXc7xWi4BDYxRj4Z+3IMFulmwXMJl9pfBG8l/M2vRd1ihvpLyXwbZ1
 k67rcmFltjvIh4DpGdxypO0onuTIgLycGF84YhyFrjeOpB75tM3knzCQMjnHZYo5
 YL295cTPfpLY+7iyMhrFv5ootWm5eLfGX+jXA5EmEojOwinZbiN6AB4xDeMWC3fz
 rxk5ESo/wGEbJpLiV+CVkkptHYubGk7pA+EPSMQg94+HV4WL+GekZ9/5bRKLZ/Xv
 BXNkIL71KxhVd9DiDG5WomgnO2hDpk0zxpMADgZNEPkdgDRcl1Kh1oDw7S/4bp58
 V/Ea349wRInesxK10BdJwtsJ0khDg0gaN6f+OG8L7dRw4mSl9akk2e0KVhHoOxYT
 rBbK4QKt77308TjlYqzJ+SUFhrb6PhPNgRMFx2zjSgH1eaHONVEM2zYOscQGNgqR
 Rfz8d6LUXx+aR87aOT31vUWqjl1KgwqbICiI6KWaQpcKTVK61WVFUiksddK6XyB7
 drwoPFekBEjVXOVYaQtyPG3qEGEOVCEjooHXFPQJ+SJeR1sMHCc=
 =Nlkd
 -----END PGP SIGNATURE-----

Merge tag 'wireless-next-2025-09-11' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next

Johannes Berg says:

====================
Plenty of things going on, notably:
 - iwlwifi: major cleanups/rework
 - brcmfmac: gets AP isolation support
 - mac80211: gets more S1G support

* tag 'wireless-next-2025-09-11' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next: (94 commits)
  wifi: mwifiex: fix endianness handling in mwifiex_send_rgpower_table
  wifi: cfg80211: Remove the redundant wiphy_dev
  wifi: mac80211: fix incorrect comment
  wifi: cfg80211: update the time stamps in hidden ssid
  wifi: mac80211: Fix HE capabilities element check
  wifi: mac80211: add tx_handlers_drop statistics to ethtool
  wifi: mac80211: fix reporting of all valid links in sta_set_sinfo()
  wifi: iwlwifi: mld: CHANNEL_SURVEY_NOTIF is always supported
  wifi: iwlwifi: mld: remove support of iwl_esr_mode_notif version 1
  wifi: iwlwifi: mld: remove support from of sta cmd version 1
  wifi: iwlwifi: mld: remove support of roc cmd version 5
  wifi: iwlwifi: mld: remove support of mac cmd ver 2
  wifi: iwlwifi: mld: don't consider phy cmd version 5
  wifi: iwlwifi: implement wowlan status notification API update
  wifi: iwlwifi: fw: Add ASUS to PPAG and TAS list
  wifi: iwlwifi: add kunit tests for nvm parse
  wifi: iwlwifi: api: add a flag to iwl_link_ctx_modify_flags
  wifi: iwlwifi: pcie: move ltr_enabled to the specific transport
  wifi: iwlwifi: pcie: move pm_support to the specific transport
  wifi: iwlwifi: rename iwl_finish_nic_init
  ...
====================

Link: https://patch.msgid.link/20250911100854.20445-3-johannes@sipsolutions.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-09-11 17:50:46 -07:00
Jakub Kicinski
fc3a281041 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR (net-6.17-rc6).

Conflicts:

net/netfilter/nft_set_pipapo.c
net/netfilter/nft_set_pipapo_avx2.c
  c4eaca2e10 ("netfilter: nft_set_pipapo: don't check genbit from packetpath lookups")
  84c1da7b38 ("netfilter: nft_set_pipapo: use avx2 algorithm for insertions too")

Only trivial adjacent changes (in a doc and a Makefile).

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-09-11 17:40:13 -07:00
Ido Schimmel
0d3c4a4416 ipv4: icmp: Pass IPv4 control block structure as an argument to __icmp_send()
__icmp_send() is used to generate ICMP error messages in response to
various situations such as MTU errors (i.e., "Fragmentation Required")
and too many hops (i.e., "Time Exceeded").

The skb that generated the error does not necessarily come from the IPv4
layer and does not always have a valid IPv4 control block in skb->cb.

Therefore, commit 9ef6b42ad6 ("net: Add __icmp_send helper.") changed
the function to take the IP options structure as argument instead of
deriving it from the skb's control block. Some callers of this function
such as icmp_send() pass the IP options structure from the skb's control
block as in these call paths the control block is known to be valid, but
other callers simply pass a zeroed structure.

A subsequent patch will need __icmp_send() to access more information
from the IPv4 control block (specifically, the ifindex of the input
interface). As a preparation for this change, change the function to
take the IPv4 control block structure as an argument instead of the IP
options structure. This makes the function similar to its IPv6
counterpart that already takes the IPv6 control block structure as an
argument.

No functional changes intended.

Reviewed-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250908073238.119240-3-idosch@nvidia.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-09-11 12:22:38 +02:00
Jakub Kicinski
6bffdc0f88 net: xdp: handle frags with unreadable memory
We don't expect frags with unreadable memory to be presented
to XDP programs today, but the XDP helpers are designed to be
usable whether XDP is enabled or not. Support handling frags
with unreadable memory.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Link: https://patch.msgid.link/20250905221539.2930285-3-kuba@kernel.org
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-09-11 12:00:20 +02:00
Jakub Kicinski
1827f773e4 net: xdp: pass full flags to xdp_update_skb_shared_info()
xdp_update_skb_shared_info() needs to update skb state which
was maintained in xdp_buff / frame. Pass full flags into it,
instead of breaking it out bit by bit. We will need to add
a bit for unreadable frags (even tho XDP doesn't support
those the driver paths may be common), at which point almost
all call sites would become:

    xdp_update_skb_shared_info(skb, num_frags,
                               sinfo->xdp_frags_size,
                               MY_PAGE_SIZE * num_frags,
                               xdp_buff_is_frag_pfmemalloc(xdp),
                               xdp_buff_is_frag_unreadable(xdp));

Keep a helper for accessing the flags, in case we need to
transform them somehow in the future (e.g. to cover up xdp_buff
vs xdp_frame differences).

While we are touching call callers - rename the helper to
xdp_update_skb_frags_info(), previous name may have implied that
it's shinfo that's updated. We are updating flags in struct sk_buff
based on frags that got attched.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Acked-by: Jesper Dangaard Brouer <hawk@kernel.org>
Link: https://patch.msgid.link/20250905221539.2930285-2-kuba@kernel.org
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-09-11 12:00:20 +02:00
Florian Westphal
11fe5a82e5 netfilter: nf_tables: make nft_set_do_lookup available unconditionally
This function was added for retpoline mitigation and is replaced by a
static inline helper if mitigations are not enabled.

Enable this helper function unconditionally so next patch can add a lookup
restart mechanism to fix possible false negatives while transactions are
in progress.

Adding lookup restarts in nft_lookup_eval doesn't work as nft_objref would
then need the same copypaste loop.

This patch is separate to ease review of the actual bug fix.

Suggested-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Florian Westphal <fw@strlen.de>
2025-09-10 20:30:37 +02:00
Florian Westphal
64102d9bbc netfilter: nf_tables: place base_seq in struct net
This will soon be read from packet path around same time as the gencursor.

Both gencursor and base_seq get incremented almost at the same time, so
it makes sense to place them in the same structure.

This doesn't increase struct net size on 64bit due to padding.

Signed-off-by: Florian Westphal <fw@strlen.de>
2025-09-10 20:30:37 +02:00
Vlad Dumitrescu
ce0b015e26 devlink: Add 'total_vfs' generic device param
NICs are typically configured with total_vfs=0, forcing users to rely
on external tools to enable SR-IOV (a widely used and essential feature).

Add total_vfs parameter to devlink for SR-IOV max VF configurability.
Enables standard kernel tools to manage SR-IOV, addressing the need for
flexible VF configuration.

Signed-off-by: Vlad Dumitrescu <vdumitrescu@nvidia.com>
Tested-by: Kamal Heib <kheib@redhat.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250907012953.301746-2-saeed@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-09-09 19:14:23 -07:00
Jakub Kicinski
4ea83b7573 Merge branch '200GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue
Tony Nguyen says:

====================
idpf: add XDP support

Alexander Lobakin says:

Add XDP support (w/o XSk for now) to the idpf driver using the libeth_xdp
sublib. All possible verdicts, .ndo_xdp_xmit(), multi-buffer etc. are here.
In general, nothing outstanding comparing to ice, except performance --
let's say, up to 2x for .ndo_xdp_xmit() on certain platforms and
scenarios.
idpf doesn't support VLAN Rx offload, so only the hash hint is
available for now.

Patches 1-7 are prereqs, without which XDP would either not work at all
or work slower/worse/...

* '200GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue:
  idpf: add XDP RSS hash hint
  idpf: add support for .ndo_xdp_xmit()
  idpf: add support for XDP on Rx
  idpf: use generic functions to build xdp_buff and skb
  idpf: implement XDP_SETUP_PROG in ndo_bpf for splitq
  idpf: prepare structures to support XDP
  idpf: add support for nointerrupt queues
  idpf: remove SW marker handling from NAPI
  idpf: add 4-byte completion descriptor definition
  idpf: link NAPIs to queues
  idpf: use a saner limit for default number of queues to allocate
  idpf: fix Rx descriptor ready check barrier in splitq
  xdp, libeth: make the xdp_init_buff() micro-optimization generic
====================

Link: https://patch.msgid.link/20250908195748.1707057-1-anthony.l.nguyen@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-09-09 18:44:07 -07:00
Hangbin Liu
e5a6643435 bonding: support aggregator selection based on port priority
Add a new ad_select policy 'port_priority' that uses the per-port
actor priority values (set via ad_actor_port_prio) to determine
aggregator selection.

This allows administrators to influence which ports are preferred
for aggregation by assigning different priority values, providing
more flexible load balancing control in LACP configurations.

Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Link: https://patch.msgid.link/20250902064501.360822-3-liuhangbin@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-09-09 10:56:02 +02:00
Hangbin Liu
6b6dc81ee7 bonding: add support for per-port LACP actor priority
Introduce a new netlink attribute 'actor_port_prio' to allow setting
the LACP actor port priority on a per-slave basis. This extends the
existing bonding infrastructure to support more granular control over
LACP negotiations.

The priority value is embedded in LACPDU packets and will be used by
subsequent patches to influence aggregator selection policies.

Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Link: https://patch.msgid.link/20250902064501.360822-2-liuhangbin@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-09-09 10:56:02 +02:00
Eric Dumazet
20d3d26815 net: snmp: remove SNMP_MIB_SENTINEL
No more user of SNMP_MIB_SENTINEL, we can remove it.

Also remove snmp_get_cpu_field[64]_batch() helpers.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Link: https://patch.msgid.link/20250905165813.1470708-10-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-09-08 18:06:21 -07:00
Eric Dumazet
ceac1fb229 ipv6: snmp: do not use SNMP_MIB_SENTINEL anymore
Use ARRAY_SIZE(), so that we know the limit at compile time.

Following patch needs this preliminary change.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Link: https://patch.msgid.link/20250905165813.1470708-3-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-09-08 18:06:20 -07:00
Alexander Lobakin
17d370a70b xdp, libeth: make the xdp_init_buff() micro-optimization generic
Often times the compilers are not able to expand two consecutive 32-bit
writes into one 64-bit on the corresponding architectures. This applies
to xdp_init_buff() called for every received frame (or at least once
per each 64 frames when the frag size is fixed).
Move the not-so-pretty hack from libeth_xdp straight to xdp_init_buff(),
but using a proper union around ::frame_sz and ::flags.
The optimization is limited to LE architectures due to the structure
layout.

One simple example from idpf with the XDP series applied (Clang 22-git,
CONFIG_CC_OPTIMIZE_FOR_PERFORMANCE => -O2):

add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-27 (-27)
Function                                     old     new   delta
idpf_vport_splitq_napi_poll                 5076    5049     -27

The perf difference with XDP_DROP is around +0.8-1% which I see as more
than satisfying.

Suggested-by: Simon Horman <horms@kernel.org>
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Tested-by: Ramu R <ramu.r@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-09-08 10:26:25 -07:00
Jakub Kicinski
5ef04a7b06 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR (net-6.17-rc5).

No conflicts.

Adjacent changes:

include/net/sock.h
  c51613fa27 ("net: add sk->sk_drop_counters")
  5d6b58c932 ("net: lockless sock_i_ino()")

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-09-04 13:33:00 -07:00
Arend van Spriel
2418553491 wifi: nl80211: allow drivers to support subset of NL80211_CMD_SET_BSS
The so-called fullmac devices rely on firmware functionality and/or API to
change BSS parameters. Today there are limited drivers supporting the
nl80211 primitive, but they only handle a subset of the bss parameters
passed if any. The mac80211 driver does handle all parameters and stores
their configured values. Some of the BSS parameters were already conditional
by wiphy->features. For these the wiphy->bss_param_support and wiphy->features
fields are silently aligned in wiphy_register(). Maybe better to issue a warning
instead when they are misaligned.

Signed-off-by: Arend van Spriel <arend.vanspriel@broadcom.com>
Link: https://patch.msgid.link/20250817190435.1495094-2-arend.vanspriel@broadcom.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-09-04 11:19:02 +02:00
Muna Sinada
d0bf06158c wifi: nl80211: Add EHT fixed Tx rate support
Add new attributes to support EHT MCS/NSS Tx rates and EHT GI/LTF.
Parse EHT fixed MCS/NSS Tx rates and EHT GI/LTF values passed by the
userspace, validate and add as part of cfg80211_bitrate_mask.

MCS mask is constructed by new function, eht_build_mcs_mask(). Max NSS
supported for MCS rates of 7, 9, 11 and 13 is utilized to set MCS
bitmask for each NSS. MCS rates 14, and 15 if supported, are set only
for NSS = 0.

Co-developed-by: Aloka Dixit <aloka.dixit@oss.qualcomm.com>
Signed-off-by: Aloka Dixit <aloka.dixit@oss.qualcomm.com>
Signed-off-by: Muna Sinada <muna.sinada@oss.qualcomm.com>
Link: https://patch.msgid.link/20250815213011.2704803-1-muna.sinada@oss.qualcomm.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-09-04 11:19:02 +02:00
Aditya Kumar Singh
5f9d5fd8e0 wifi: cfg80211: fix return value in cfg80211_get_radio_idx_by_chan()
If a valid radio index is not found, the function returns -ENOENT. If the
channel argument itself is invalid, it returns -EINVAL. However, since the
caller only checks for < 0, the distinction between these error codes is
not utilized much. Also, handling these two distinct error codes throughout
the codebase adds complexity, as both cases must be addressed separately. A
subsequent change aims to simplify this by using a single error code for
all invalid cases, making error handling more consistent and streamlined.

To support this change, update the return value to -EINVAL when a valid
radio index is not found. This is still appropriate because, even if the
channel argument is structurally valid, the absence of a corresponding
radio index implies that the argument is effectively invalid—otherwise, a
valid index would have been found.

Signed-off-by: Aditya Kumar Singh <aditya.kumar.singh@oss.qualcomm.com>
Link: https://patch.msgid.link/20250812-fix_scan_ap_flag_requirement_during_mlo-v4-1-383ffb6da213@oss.qualcomm.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-09-04 11:19:01 +02:00
Jakub Kicinski
3ceb08838b net: add helper to pre-check if PP for an Rx queue will be unreadable
mlx5 pokes into the rxq state to check if the queue has a memory
provider, and therefore whether it may produce unreadable mem.
Add a helper for doing this in the page pool API. fbnic will want
a similar thing (tho, for a slightly different reason).

Reviewed-by: Mina Almasry <almasrymina@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Link: https://patch.msgid.link/20250901211214.1027927-11-kuba@kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-09-04 10:19:17 +02:00
Juraj Šarinay
21f82062d0 net: nfc: nci: Increase NCI_DATA_TIMEOUT to 3000 ms
An exchange with a NFC target must complete within NCI_DATA_TIMEOUT.
A delay of 700 ms is not sufficient for cryptographic operations on smart
cards. CardOS 6.0 may need up to 1.3 seconds to perform 256-bit ECDH
or 3072-bit RSA. To prevent brute-force attacks, passports and similar
documents introduce even longer delays into access control protocols
(BAC/PACE).

The timeout should be higher, but not too much. The expiration allows
us to detect that a NFC target has disappeared.

Signed-off-by: Juraj Šarinay <juraj@sarinay.com>
Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Link: https://patch.msgid.link/20250902113630.62393-1-juraj@sarinay.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-09-03 17:02:12 -07:00
Eric Dumazet
5d6b58c932 net: lockless sock_i_ino()
Followup of commit c51da3f7a1 ("net: remove sock_i_uid()")

A recent syzbot report was the trigger for this change.

Over the years, we had many problems caused by the
read_lock[_bh](&sk->sk_callback_lock) in sock_i_uid().

We could fix smc_diag_dump_proto() or make a more radical move:

Instead of waiting for new syzbot reports, cache the socket
inode number in sk->sk_ino, so that we no longer
need to acquire sk->sk_callback_lock in sock_i_ino().

This makes socket dumps faster (one less cache line miss,
and two atomic ops avoided).

Prior art:

commit 25a9c8a443 ("netlink: Add __sock_i_ino() for __netlink_diag_dump().")
commit 4f9bf2a2f5 ("tcp: Don't acquire inet_listen_hashbucket::lock with disabled BH.")
commit efc3dbc374 ("rds: Make rds_sock_lock BH rather than IRQ safe.")

Fixes: d2d6422f8b ("x86: Allow to enable PREEMPT_RT.")
Reported-by: syzbot+50603c05bbdf4dfdaffa@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/netdev/68b73804.050a0220.3db4df.01d8.GAE@google.com/T/#u
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Link: https://patch.msgid.link/20250902183603.740428-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-09-03 16:08:24 -07:00
Jakub Kicinski
24ee9feeb3 netfilter pull request nf-next-25-09-02
-----BEGIN PGP SIGNATURE-----
 
 iQJBBAABCAArFiEEgKkgxbID4Gn1hq6fcJGo2a1f9gAFAmi28PMNHGZ3QHN0cmxl
 bi5kZQAKCRBwkajZrV/2AJ7LD/9vb6dwLS7u7+wrksZUdDdkIvIN7TOeiN0JJ/xd
 XOAiABE2LVwoAyA7zHDdQKjkB+3kJwhSQcS3RO+qn3E77JGlADRWxYI4tg5FfWMC
 WrDGSVW5vU92611s2i8GvLdCeZ7u7/qoMMAIjlnS8XHBq+6SHMacmwbH/W/4pMKR
 58XFW+fYOJF4q9ATH9F6DXAFys9ZC873dv2rUw7/jXiAcHggwpWFlPVPbimWcrT/
 u+z07preK9jALaYj5ctUsRznwqd0x7AwSsLI/e6VgPtXFhdTPdsH//aWwkMlX1Uw
 eQ0i9XoarYzoA5wnYaB2t5e0b0haAll+rxJRM3ISavizmBx6BPHOyerpe5J3kmO5
 H5A8Low+KB/zJkiovTY9A5uLzpXaScS00kD5a6obAPFfCm64kwctBTmV64+zPfYo
 /36/zze/mpfjI8FMdyCZfOZgSfQ5KeCZg7FMi6uNZMg0PoMkW5yCrnjXX2Oaz2Rj
 TukUGXg6lFRMBjK4jiCy/HXusn+Xuv1FrHwYwAvYvtbOxwbUZkVo1h65f3Fng/vU
 vAtWfw6Gh+I9AKWiTyC2FRm0rZ26K+p/8OrErGIJmsGbcqXcYFZE0bqvfhIDzMWg
 VzykeGbK51q3s4nFR9C/7wcprDTaQCi3/sobu4ussetkeDMU3d9aEsxkOxNPkz1j
 dxJT/w==
 =wamJ
 -----END PGP SIGNATURE-----

Merge tag 'nf-next-25-09-02' of https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next

Florian Westphal says:

====================
netfilter: updates for net-next

1) prefer vmalloc_array in ebtables, from  Qianfeng Rong.

2) Use csum_replace4 instead of open-coding it, from Christophe Leroy.

3+4) Get rid of GFP_ATOMIC in transaction object allocations, those
     cause silly failures with large sets under memory pressure, from
     myself.

5) Remove test for AVX cpu feature in nftables pipapo set type,
   testing for AVX2 feature is sufficient.

6) Unexport a few function in nf_reject infra: no external callers.

7) Extend payload offset to u16, this was restricted to values <=255
   so far, from Fernando Fernandez Mancera.

* tag 'nf-next-25-09-02' of https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next:
  netfilter: nft_payload: extend offset to 65535 bytes
  netfilter: nf_reject: remove unneeded exports
  netfilter: nft_set_pipapo: remove redundant test for avx feature bit
  netfilter: nf_tables: all transaction allocations can now sleep
  netfilter: nf_tables: allow iter callbacks to sleep
  netfilter: nft_payload: Use csum_replace4() instead of opencoding
  netfilter: ebtables: Use vmalloc_array() to improve code
====================

Link: https://patch.msgid.link/20250902133549.15945-1-fw@strlen.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-09-03 16:06:45 -07:00
Asbjørn Sloth Tønnesen
017bda80fd genetlink: fix typo in comment
In this context "not that ..." should properly be "note that ...".

Signed-off-by: Asbjørn Sloth Tønnesen <ast@fiberby.net>
Reviewed-by: Donald Hunter <donald.hunter@gmail.com>
Link: https://patch.msgid.link/20250902154640.759815-4-ast@fiberby.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-09-03 15:16:49 -07:00
Christoph Paasch
929324913e net: Add rfs_needed() helper
Add a helper to check if RFS is needed or not. Allows to make the code a
bit cleaner and the next patch to have MPTCP use this helper to decide
whether or not to iterate over the subflows.

tun_flow_update() was calling sock_rps_record_flow_hash() regardless of
the state of rfs_needed. This was not really a bug as sock_flow_table
simply ends up being NULL and thus everything will be fine.
This commit here thus also implicitly makes tun_flow_update() respect
the state of rfs_needed.

Suggested-by: Matthieu Baerts <matttbe@kernel.org>
Signed-off-by: Christoph Paasch <cpaasch@openai.com>
Acked-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20250902-net-next-mptcp-misc-feat-6-18-v2-3-fa02bb3188b1@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-09-03 15:08:20 -07:00
Eric Dumazet
5d14bbf9d1 net_sched: act: remove tcfa_qstats
tcfa_qstats is currently only used to hold drops and overlimits counters.

tcf_action_inc_drop_qstats() and tcf_action_inc_overlimit_qstats()
currently acquire a->tcfa_lock to increment these counters.

Switch to two atomic_t to get lock-free accounting.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Jamal Hadi Salim <jhs@mojatatu.com>
Link: https://patch.msgid.link/20250901093141.2093176-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-09-02 15:52:24 -07:00
Jay Vosburgh
23a6037ce7 bonding: Remove support for use_carrier
Remove the implementation of use_carrier, the link monitoring
method that utilizes ethtool or ioctl to determine the link state of an
interface in a bond.  Bonding will always behaves as if use_carrier=1,
which relies on netif_carrier_ok() to determine the link state of
interfaces.

	To avoid acquiring RTNL many times per second, bonding inspects
link state under RCU, but not under RTNL.  However, ethtool
implementations in drivers may sleep, and therefore this strategy is
unsuitable for use with calls into driver ethtool functions.

	The use_carrier option was introduced in 2003, to provide
backwards compatibility for network device drivers that did not support
the then-new netif_carrier_ok/on/off system.  Device drivers are now
expected to support netif_carrier_*, and the use_carrier backwards
compatibility logic is no longer necessary.

	The option itself remains, but when queried always returns 1,
and may only be set to 1.

Link: https://lore.kernel.org/000000000000eb54bf061cfd666a@google.com
Link: https://lore.kernel.org/20240718122017.d2e33aaac43a.I10ab9c9ded97163aef4e4de10985cd8f7de60d28@changeid
Signed-off-by: Jay Vosburgh <jv@jvosburgh.net>
Reported-by: syzbot+b8c48ea38ca27d150063@syzkaller.appspotmail.com
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/2029487.1756512517@famine
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-09-02 14:01:54 -07:00
Fernando Fernandez Mancera
077dc4a275 netfilter: nft_payload: extend offset to 65535 bytes
In some situations 255 bytes offset is not enough to match or manipulate
the desired packet field. Increase the offset limit to 65535 or U16_MAX.

In addition, the nla policy maximum value is not set anymore as it is
limited to s16. Instead, the maximum value is checked during the payload
expression initialization function.

Tested with the nft command line tool.

table ip filter {
	chain output {
		@nh,2040,8 set 0xff
		@nh,524280,8 set 0xff
		@nh,524280,8 0xff
		@nh,2040,8 0xff
	}
}

Signed-off-by: Fernando Fernandez Mancera <fmancera@suse.de>
Signed-off-by: Florian Westphal <fw@strlen.de>
2025-09-02 15:28:18 +02:00
Florian Westphal
f4f9e05904 netfilter: nf_reject: remove unneeded exports
These functions have no external callers and can be static.

Signed-off-by: Florian Westphal <fw@strlen.de>
2025-09-02 15:28:17 +02:00
Florian Westphal
a60a5abe19 netfilter: nf_tables: allow iter callbacks to sleep
Quoting Sven Auhagen:
  we do see on occasions that we get the following error message, more so on
  x86 systems than on arm64:

  Error: Could not process rule: Cannot allocate memory delete table inet filter

  It is not a consistent error and does not happen all the time.
  We are on Kernel 6.6.80, seems to me like we have something along the lines
  of the nf_tables: allow clone callbacks to sleep problem using GFP_ATOMIC.

As hinted at by Sven, this is because of GFP_ATOMIC allocations during
set flush.

When set is flushed, all elements are deactivated. This triggers a set
walk and each element gets added to the transaction list.

The rbtree and rhashtable sets don't allow the iter callback to sleep:
rbtree walk acquires read side of an rwlock with bh disabled, rhashtable
walk happens with rcu read lock held.

Rbtree is simple enough to resolve:
When the walk context is ITER_READ, no change is needed (the iter
callback must not deactivate elements; we're not in a transaction).

When the iter type is ITER_UPDATE, the rwlock isn't needed because the
caller holds the transaction mutex, this prevents any and all changes to
the ruleset, including add/remove of set elements.

Rhashtable is slightly more complex.
When the iter type is ITER_READ, no change is needed, like rbtree.

For ITER_UPDATE, we hold transaction mutex which prevents elements from
getting free'd, even outside of rcu read lock section.

So build a temporary list of all elements while doing the rcu iteration
and then call the iterator in a second pass.

The disadvantage is the need to iterate twice, but this cost comes with
the benefit to allow the iter callback to use GFP_KERNEL allocations in
a followup patch.

The new list based logic makes it necessary to catch recursive calls to
the same set earlier.

Such walk -> iter -> walk recursion for the same set can happen during
ruleset validation in case userspace gave us a bogus (cyclic) ruleset
where verdict map m jumps to chain that sooner or later also calls
"vmap @m".

Before the new ->in_update_walk test, the ruleset is rejected because the
infinite recursion causes ctx->level to exceed the allowed maximum.

But with the new logic added here, elements would get skipped:

nft_rhash_walk_update would see elements that are on the walk_list of
an older stack frame.

As all recursive calls into same map results in -EMLINK, we can avoid this
problem by using the new in_update_walk flag and reject immediately.

Next patch converts the problematic GFP_ATOMIC allocations.

Reported-by: Sven Auhagen <Sven.Auhagen@belden.com>
Closes: https://lore.kernel.org/netfilter-devel/BY1PR18MB5874110CAFF1ED098D0BC4E7E07BA@BY1PR18MB5874.namprd18.prod.outlook.com/
Signed-off-by: Florian Westphal <fw@strlen.de>
2025-09-02 15:28:17 +02:00
Eric Dumazet
689adb36bd inet: ping: make ping_port_rover per netns
Provide isolation between netns for ping idents.

Randomize initial ping_port_rover value at netns creation.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/20250829153054.474201-4-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-09-01 13:15:14 -07:00
Eric Dumazet
10343e7e6c inet: ping: remove ping_hash()
There is no point in keeping ping_hash().

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Yue Haibing <yuehaibing@huawei.com>
Link: https://patch.msgid.link/20250829153054.474201-3-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-09-01 13:15:14 -07:00
Kuniyuki Iwashima
7051b54fb5 tcp: Remove sk->sk_prot->orphan_count.
TCP tracks the number of orphaned (SOCK_DEAD but not yet destructed)
sockets in tcp_orphan_count.

In some code that was shared with DCCP, tcp_orphan_count is referenced
via sk->sk_prot->orphan_count.

Let's reference tcp_orphan_count directly.

inet_csk_prepare_for_destroy_sock() is moved to inet_connection_sock.c
due to header dependency.

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250829215641.711664-1-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-09-01 12:52:09 -07:00
Simon Schuster
edd3cb05c0 copy_process: pass clone_flags as u64 across calltree
With the introduction of clone3 in commit 7f192e3cd3 ("fork: add
clone3") the effective bit width of clone_flags on all architectures was
increased from 32-bit to 64-bit, with a new type of u64 for the flags.
However, for most consumers of clone_flags the interface was not
changed from the previous type of unsigned long.

While this works fine as long as none of the new 64-bit flag bits
(CLONE_CLEAR_SIGHAND and CLONE_INTO_CGROUP) are evaluated, this is still
undesirable in terms of the principle of least surprise.

Thus, this commit fixes all relevant interfaces of callees to
sys_clone3/copy_process (excluding the architecture-specific
copy_thread) to consistently pass clone_flags as u64, so that
no truncation to 32-bit integers occurs on 32-bit architectures.

Signed-off-by: Simon Schuster <schuster.simon@siemens-energy.com>
Link: https://lore.kernel.org/20250901-nios2-implement-clone3-v2-2-53fcf5577d57@siemens-energy.com
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-09-01 15:31:34 +02:00
Eric Dumazet
99a2ace61b net: use dst_dev_rcu() in sk_setup_caps()
Use RCU to protect accesses to dst->dev from sk_setup_caps()
and sk_dst_gso_max_size().

Also use dst_dev_rcu() in ip6_dst_mtu_maybe_forward(),
and ip_dst_mtu_maybe_forward().

ip4_dst_hoplimit() can use dst_dev_net_rcu().

Fixes: 4a6ce2b6f2 ("net: introduce a new function dst_dev_put()")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/20250828195823.3958522-6-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-08-29 19:36:32 -07:00
Eric Dumazet
caedcc5b6d net: dst: introduce dst->dev_rcu
Followup of commit 88fe14253e ("net: dst: add four helpers
to annotate data-races around dst->dev").

We want to gradually add explicit RCU protection to dst->dev,
including lockdep support.

Add an union to alias dst->dev_rcu and dst->dev.

Add dst_dev_net_rcu() helper.

Fixes: 4a6ce2b6f2 ("net: introduce a new function dst_dev_put()")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/20250828195823.3958522-2-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-08-29 19:36:31 -07:00
Jakub Kicinski
d23ad54de7 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR (net-6.17-rc4).

No conflicts.

Adjacent changes:

drivers/net/ethernet/intel/idpf/idpf_txrx.c
  02614eee26 ("idpf: do not linearize big TSO packets")
  6c4e684802 ("idpf: remove obsolete stashing code")

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-08-29 11:48:01 -07:00
Eric Dumazet
53df77e785 net_sched: act_skbmod: use RCU in tcf_skbmod_dump()
Also storing tcf_action into struct tcf_skbmod_params
makes sure there is no discrepancy in tcf_skbmod_act().

No longer block BH in tcf_skbmod_init() when acquiring tcf_lock.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250827125349.3505302-5-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-08-28 16:46:23 -07:00
Eric Dumazet
e97ae74297 net_sched: act_tunnel_key: use RCU in tunnel_key_dump()
Also storing tcf_action into struct tcf_tunnel_key_params
makes sure there is no discrepancy in tunnel_key_act().

No longer block BH in tunnel_key_init() when acquiring tcf_lock.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250827125349.3505302-4-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-08-28 16:46:23 -07:00
Eric Dumazet
48b5e5dbdb net_sched: act_vlan: use RCU in tcf_vlan_dump()
Also storing tcf_action into struct tcf_vlan_params
makes sure there is no discrepancy in tcf_vlan_act().

No longer block BH in tcf_vlan_init() when acquiring tcf_lock.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250827125349.3505302-3-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-08-28 16:46:23 -07:00
Dragos Tatulea
13d8e05adf queue_api: add support for fetching per queue DMA dev
For zerocopy (io_uring, devmem), there is an assumption that the
parent device can do DMA. However that is not always the case:
- Scalable Function netdevs [1] have the DMA device in the grandparent.
- For Multi-PF netdevs [2] queues can be associated to different DMA
devices.

This patch introduces the a queue based interface for allowing drivers
to expose a different DMA device for zerocopy.

[1] Documentation/networking/device_drivers/ethernet/mellanox/mlx5/switchdev.rst
[2] Documentation/networking/multi-pf-netdev.rst

Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Mina Almasry <almasrymina@google.com>
Link: https://patch.msgid.link/20250827144017.1529208-3-dtatulea@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-08-28 16:05:31 -07:00
Eric Dumazet
b81aa23234 inet: raw: add drop_counters to raw sockets
When a packet flood hits one or more RAW sockets, many cpus
have to update sk->sk_drops.

This slows down other cpus, because currently
sk_drops is in sock_write_rx group.

Add a socket_drop_counters structure to raw sockets.

Using dedicated cache lines to hold drop counters
makes sure that consumers no longer suffer from
false sharing if/when producers only change sk->sk_drops.

This adds 128 bytes per RAW socket.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250826125031.1578842-6-edumazet@google.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-08-28 13:14:50 +02:00
Eric Dumazet
51132b99f0 udp: add drop_counters to udp socket
When a packet flood hits one or more UDP sockets, many cpus
have to update sk->sk_drops.

This slows down other cpus, because currently
sk_drops is in sock_write_rx group.

Add a socket_drop_counters structure to udp sockets.

Using dedicated cache lines to hold drop counters
makes sure that consumers no longer suffer from
false sharing if/when producers only change sk->sk_drops.

This adds 128 bytes per UDP socket.

Tested with the following stress test, sending about 11 Mpps
to a dual socket AMD EPYC 7B13 64-Core.

super_netperf 20 -t UDP_STREAM -H DUT -l10 -- -n -P,1000 -m 120
Note: due to socket lookup, only one UDP socket is receiving
packets on DUT.

Then measure receiver (DUT) behavior. We can see both
consumer and BH handlers can process more packets per second.

Before:

nstat -n ; sleep 1 ; nstat | grep Udp
Udp6InDatagrams                 615091             0.0
Udp6InErrors                    3904277            0.0
Udp6RcvbufErrors                3904277            0.0

After:

nstat -n ; sleep 1 ; nstat | grep Udp
Udp6InDatagrams                 816281             0.0
Udp6InErrors                    7497093            0.0
Udp6RcvbufErrors                7497093            0.0

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250826125031.1578842-5-edumazet@google.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-08-28 13:14:50 +02:00
Eric Dumazet
c51613fa27 net: add sk->sk_drop_counters
Some sockets suffer from heavy false sharing on sk->sk_drops,
and fields in the same cache line.

Add sk->sk_drop_counters to:

- move the drop counter(s) to dedicated cache lines.
- Add basic NUMA awareness to these drop counter(s).

Following patches will use this infrastructure for UDP and RAW sockets.

sk_clone_lock() is not yet ready, it would need to properly
set newsk->sk_drop_counters if we plan to use this for TCP sockets.

v2: used Paolo suggestion from https://lore.kernel.org/netdev/8f09830a-d83d-43c9-b36b-88ba0a23e9b2@redhat.com/

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250826125031.1578842-4-edumazet@google.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-08-28 13:14:50 +02:00
Eric Dumazet
cb4d5a6eb6 net: add sk_drops_skbadd() helper
Existing sk_drops_add() helper is renamed to sk_drops_skbadd().

Add sk_drops_add() and convert sk_drops_inc() to use it.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250826125031.1578842-3-edumazet@google.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-08-28 13:14:50 +02:00
Eric Dumazet
f86f42ed2c net: add sk_drops_read(), sk_drops_inc() and sk_drops_reset() helpers
We want to split sk->sk_drops in the future to reduce
potential contention on this field.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250826125031.1578842-2-edumazet@google.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-08-28 13:14:50 +02:00
Krishna Kumar
97bcc5b6f4 net: Prevent RPS table overwrite of active flows
This patch fixes an issue where two different flows on the same RXq
produce the same hash resulting in continuous flow overwrites.

Flow #1: A packet for Flow #1 comes in, kernel calls the steering
         function. The driver gives back a filter id. The kernel saves
	 this filter id in the selected slot. Later, the driver's
	 service task checks if any filters have expired and then
	 installs the rule for Flow #1.
Flow #2: A packet for Flow #2 comes in. It goes through the same steps.
         But this time, the chosen slot is being used by Flow #1. The
	 driver gives a new filter id and the kernel saves it in the
	 same slot. When the driver's service task runs, it runs through
	 all the flows, checks if Flow #1 should be expired, the kernel
	 returns True as the slot has a different filter id, and then
	 the driver installs the rule for Flow #2.
Flow #1: Another packet for Flow #1 comes in. The same thing repeats.
         The slot is overwritten with a new filter id for Flow #1.

This causes a repeated cycle of flow programming for missed packets,
wasting CPU cycles while not improving performance. This problem happens
at higher rates when the RPS table is small, but tests show it still
happens even with 12,000 connections and an RPS size of 16K per queue
(global table size = 144x16K = 64K).

This patch prevents overwriting an rps_dev_flow entry if it is active.
The intention is that it is better to do aRFS for the first flow instead
of hurting all flows on the same hash. Without this, two (or more) flows
on one RX queue with the same hash can keep overwriting each other. This
causes the driver to reprogram the flow repeatedly.

Changes:
  1. Add a new 'hash' field to struct rps_dev_flow.
  2. Add rps_flow_is_active(): a helper function to check if a flow is
     active or not, extracted from rps_may_expire_flow(). It is further
     simplified as per reviewer feedback.
  3. In set_rps_cpu():
     - Avoid overwriting by programming a new filter if:
        - The slot is not in use, or
        - The slot is in use but the flow is not active, or
        - The slot has an active flow with the same hash, but target CPU
          differs.
     - Save the hash in the rps_dev_flow entry.
  4. rps_may_expire_flow(): Use earlier extracted rps_flow_is_active().

Testing & results:
  - Driver: ice (E810 NIC), Kernel: net-next
  - #CPUs = #RXq = 144 (1:1)
  - Number of flows: 12K
  - Eight RPS settings from 256 to 32768. Though RPS=256 is not ideal,
    it is still sufficient to cover 12K flows (256*144 rx-queues = 64K
    global table slots)
  - Global Table Size = 144 * RPS (effectively equal to 256 * RPS)
  - Each RPS test duration = 8 mins (org code) + 8 mins (new code).
  - Metrics captured on client

Legend for following tables:
Steer-C: #times ndo_rx_flow_steer() was Called by set_rps_cpu()
Steer-L: #times ice_arfs_flow_steer() Looped over aRFS entries
Add:     #times driver actually programmed aRFS (ice_arfs_build_entry())
Del:     #times driver deleted the flow (ice_arfs_del_flow_rules())
Units:   K = 1,000 times, M = 1 million times

  |-------|---------|------|     Org Code    |---------|---------|
  | RPS   | Latency | CPU  | Add    |  Del   | Steer-C | Steer-L |
  |-------|---------|------|--------|--------|---------|---------|
  | 256   | 227.0   | 93.2 | 1.6M   | 1.6M   | 121.7M  | 267.6M  |
  | 512   | 225.9   | 94.1 | 11.5M  | 11.2M  | 65.7M   | 199.6M  |
  | 1024  | 223.5   | 95.6 | 16.5M  | 16.5M  | 27.1M   | 187.3M  |
  | 2048  | 222.2   | 96.3 | 10.5M  | 10.5M  | 12.5M   | 115.2M  |
  | 4096  | 223.9   | 94.1 | 5.5M   | 5.5M   | 7.2M    | 65.9M   |
  | 8192  | 224.7   | 92.5 | 2.7M   | 2.7M   | 3.0M    | 29.9M   |
  | 16384 | 223.5   | 92.5 | 1.3M   | 1.3M   | 1.4M    | 13.9M   |
  | 32768 | 219.6   | 93.2 | 838.1K | 838.1K | 965.1K  | 8.9M    |
  |-------|---------|------|   New Code      |---------|---------|
  | 256   | 201.5   | 99.1 | 13.4K  | 5.0K   | 13.7K   | 75.2K   |
  | 512   | 202.5   | 98.2 | 11.2K  | 5.9K   | 11.2K   | 55.5K   |
  | 1024  | 207.3   | 93.9 | 11.5K  | 9.7K   | 11.5K   | 59.6K   |
  | 2048  | 207.5   | 96.7 | 11.8K  | 11.1K  | 15.5K   | 79.3K   |
  | 4096  | 206.9   | 96.6 | 11.8K  | 11.7K  | 11.8K   | 63.2K   |
  | 8192  | 205.8   | 96.7 | 11.9K  | 11.8K  | 11.9K   | 63.9K   |
  | 16384 | 200.9   | 98.2 | 11.9K  | 11.9K  | 11.9K   | 64.2K   |
  | 32768 | 202.5   | 98.0 | 11.9K  | 11.9K  | 11.9K   | 64.2K   |
  |-------|---------|------|--------|--------|---------|---------|

Some observations:
  1. Overall Latency improved: (1790.19-1634.94)/1790.19*100 = 8.67%
  2. Overall CPU increased:    (777.32-751.49)/751.45*100    = 3.44%
  3. Flow Management (add/delete) remained almost constant at ~11K
     compared to values in millions.

Signed-off-by: Krishna Kumar <krikku@gmail.com>
Link: https://patch.msgid.link/20250825031005.3674864-2-krikku@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-08-27 18:24:13 -07:00
Takamitsu Iwai
d860d1faa6 net: rose: convert 'use' field to refcount_t
The 'use' field in struct rose_neigh is used as a reference counter but
lacks atomicity. This can lead to race conditions where a rose_neigh
structure is freed while still being referenced by other code paths.

For example, when rose_neigh->use becomes zero during an ioctl operation
via rose_rt_ioctl(), the structure may be removed while its timer is
still active, potentially causing use-after-free issues.

This patch changes the type of 'use' from unsigned short to refcount_t and
updates all code paths to use rose_neigh_hold() and rose_neigh_put() which
operate reference counts atomically.

Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Takamitsu Iwai <takamitz@amazon.co.jp>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250823085857.47674-3-takamitz@amazon.co.jp
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-08-27 07:43:08 -07:00
Takamitsu Iwai
dcb3465902 net: rose: split remove and free operations in rose_remove_neigh()
The current rose_remove_neigh() performs two distinct operations:
1. Removes rose_neigh from rose_neigh_list
2. Frees the rose_neigh structure

Split these operations into separate functions to improve maintainability
and prepare for upcoming refcount_t conversion. The timer cleanup remains
in rose_remove_neigh() because free operations can be called from timer
itself.

This patch introduce rose_neigh_put() to handle the freeing of rose_neigh
structures and modify rose_remove_neigh() to handle removal only.

Signed-off-by: Takamitsu Iwai <takamitz@amazon.co.jp>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250823085857.47674-2-takamitz@amazon.co.jp
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-08-27 07:43:08 -07:00
Eric Biggers
fe60065689 ipv6: sr: Prepare HMAC key ahead of time
Prepare the HMAC key when it is added to the kernel, instead of
preparing it implicitly for every packet.  This significantly improves
the performance of seg6_hmac_compute().  A microbenchmark on x86_64
shows seg6_hmac_compute() (with HMAC-SHA256) dropping from ~1978 cycles
to ~1419 cycles, a 28% improvement.

The size of 'struct seg6_hmac_info' increases by 128 bytes, but that
should be fine, since there should not be a massive number of keys.

Signed-off-by: Eric Biggers <ebiggers@kernel.org>
Link: https://patch.msgid.link/20250824013644.71928-3-ebiggers@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-08-26 18:11:29 -07:00
Eric Biggers
095928e7d8 ipv6: sr: Use HMAC-SHA1 and HMAC-SHA256 library functions
Use the HMAC-SHA1 and HMAC-SHA256 library functions instead of
crypto_shash.  This is simpler and faster.  Pre-allocating per-CPU hash
transformation objects and descriptors is no longer needed, and a
microbenchmark on x86_64 shows seg6_hmac_compute() (with HMAC-SHA256)
dropping from ~2494 cycles to ~1978 cycles, a 20% improvement.

Signed-off-by: Eric Biggers <ebiggers@kernel.org>
Link: https://patch.msgid.link/20250824013644.71928-2-ebiggers@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-08-26 18:11:29 -07:00
Guillaume Nault
1bec9d0c00 ipv4: Convert ->flowi4_tos to dscp_t.
Convert the ->flowic_tos field of struct flowi_common from __u8 to
dscp_t, rename it ->flowic_dscp and propagate these changes to struct
flowi and struct flowi4.

We've had several bugs in the past where ECN bits could interfere with
IPv4 routing, because these bits were not properly cleared when setting
->flowi4_tos. These bugs should be fixed now and the dscp_t type has
been introduced to ensure that variables carrying DSCP values don't
accidentally have any ECN bits set. Several variables and structure
fields have been converted to dscp_t already, but the main IPv4 routing
structure, struct flowi4, is still using a __u8. To avoid any future
regression, this patch converts it to dscp_t.

There are many users to convert at once. Fortunately, around half of
->flowi4_tos users already have a dscp_t value at hand, which they
currently convert to __u8 using inet_dscp_to_dsfield(). For all of
these users, we just need to drop that conversion.

But, although we try to do the __u8 <-> dscp_t conversions at the
boundaries of the network or of user space, some places still store
TOS/DSCP variables as __u8 in core networking code. Those can hardly be
converted either because the data structure is part of UAPI or because
the same variable or field is also used for handling ECN in other parts
of the code. In all of these cases where we don't have a dscp_t
variable at hand, we need to use inet_dsfield_to_dscp() when
interacting with ->flowi4_dscp.

Changes since v1:
  * Fix space alignment in __bpf_redirect_neigh_v4() (Ido).

Signed-off-by: Guillaume Nault <gnault@redhat.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Link: https://patch.msgid.link/29acecb45e911d17446b9a3dbdb1ab7b821ea371.1756128932.git.gnault@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-08-26 17:34:31 -07:00
Shahar Shitrit
6a06d8c405 devlink: Introduce burst period for health reporter
Currently, the devlink health reporter starts the grace period
immediately after handling an error, blocking any further recoveries
until it finished.

However, when a single root cause triggers multiple errors in a short
time frame, it is desirable to treat them as a bulk of errors and to
allow their recoveries, avoiding premature blocking of subsequent
related errors, and reducing the risk of inconsistent or incomplete
error handling.

To address this, introduce a configurable burst period for devlink
health reporter. Start this period when the first error is handled,
and allow recovery attempts for reported errors during this window.
Once burst period expires, begin the grace period to block further
recoveries until it concludes.

Timeline summary:

----|--------|------------------------------/----------------------/--
error is  error is       burst period           grace period
reported  recovered  (recoveries allowed)    (recoveries blocked)

For calculating the burst period duration, use the same
last_recovery_ts as the grace period. Update it on recovery only
when the burst period is inactive (either disabled or at the
first error).

This patch implements the framework for the burst period and
effectively sets its value to 0 at reporter creation, so the current
behavior remains unchanged, which ensures backward compatibility.

A downstream patch will make the burst period configurable.

Signed-off-by: Shahar Shitrit <shshitrit@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Mark Bloch <mbloch@nvidia.com>
Link: https://patch.msgid.link/20250824084354.533182-4-mbloch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-08-26 17:24:16 -07:00
Shahar Shitrit
d2b0073745 devlink: Move graceful period parameter to reporter ops
Move the default graceful period from a parameter to
devlink_health_reporter_create() to a field in the
devlink_health_reporter_ops structure.

This change improves consistency, as the graceful period is inherently
tied to the reporter's behavior and recovery policy. It simplifies the
signature of devlink_health_reporter_create() and its internal helper
functions. It also centralizes the reporter configuration at the ops
structure, preparing the groundwork for a downstream patch that will
introduce a devlink health reporter burst period attribute whose
default value will similarly be provided by the driver via the ops
structure.

Signed-off-by: Shahar Shitrit <shshitrit@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Mark Bloch <mbloch@nvidia.com>
Link: https://patch.msgid.link/20250824084354.533182-2-mbloch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-08-26 17:24:16 -07:00
Kuniyuki Iwashima
cb16f4b6c7 tcp: Don't pass hashinfo to socket lookup helpers.
These socket lookup functions required struct inet_hashinfo because
they are shared by TCP and DCCP.

  * __inet_lookup_established()
  * __inet_lookup_listener()
  * __inet6_lookup_established()
  * inet6_lookup_listener()

DCCP has gone, and we don't need to pass hashinfo down to them.

Let's fetch net->ipv4.tcp_death_row.hashinfo directly in the above
4 functions.

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250822190803.540788-5-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-08-25 17:53:35 -07:00
Kuniyuki Iwashima
2d842b6c67 tcp: Remove timewait_sock_ops.twsk_destructor().
Since DCCP has been removed, sk->sk_prot->twsk_prot->twsk_destructor
is always tcp_twsk_destructor().

Let's call tcp_twsk_destructor() directly in inet_twsk_free() and
remove ->twsk_destructor().

While at it, tcp_twsk_destructor() is un-exported.

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250822190803.540788-3-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-08-25 17:53:35 -07:00
Pavel Shpakovskiy
6bbd0d3f0c Bluetooth: hci_sync: fix set_local_name race condition
Function set_name_sync() uses hdev->dev_name field to send
HCI_OP_WRITE_LOCAL_NAME command, but copying from data to hdev->dev_name
is called after mgmt cmd was queued, so it is possible that function
set_name_sync() will read old name value.

This change adds name as a parameter for function hci_update_name_sync()
to avoid race condition.

Fixes: 6f6ff38a1e ("Bluetooth: hci_sync: Convert MGMT_OP_SET_LOCAL_NAME")
Signed-off-by: Pavel Shpakovskiy <pashpakovskii@salutedevices.com>
Reviewed-by: Paul Menzel <pmenzel@molgen.mpg.de>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2025-08-22 13:57:31 -04:00
Jakub Kicinski
a9af709fda Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR (net-6.17-rc3).

No conflicts or adjacent changes.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-08-21 11:33:15 -07:00
Jakub Kicinski
07cf71bf25 net: page_pool: add page_pool_get()
There is a page_pool_put() function but no get equivalent.
Having multiple references to a page pool is quite useful.
It avoids branching in create / destroy paths in drivers
which support memory providers.

Use the new helper in bnxt.

Acked-by: Jesper Dangaard Brouer <hawk@kernel.org>
Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com>
Reviewed-by: Mina Almasry <almasrymina@google.com>
Link: https://patch.msgid.link/20250820025704.166248-2-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-08-21 08:03:54 -07:00
Hangbin Liu
b64d035f77 bonding: update LACP activity flag after setting lacp_active
The port's actor_oper_port_state activity flag should be updated immediately
after changing the lacp_active option to reflect the current mode correctly.

Fixes: 3a755cd8b7 ("bonding: add new option lacp_active")
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Link: https://patch.msgid.link/20250815062000.22220-2-liuhangbin@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-08-21 09:35:20 +02:00
Eric Dumazet
a6d4f25888 net: set net.core.rmem_max and net.core.wmem_max to 4 MB
SO_RCVBUF and SO_SNDBUF have limited range today, unless
distros or system admins change rmem_max and wmem_max.

Even iproute2 uses 1 MB SO_RCVBUF which is capped by
the kernel.

Decouple [rw]mem_max and [rw]mem_default and increase
[rw]mem_max to 4 MB.

Before:

$ sysctl net.core.rmem_default net.core.rmem_max net.core.wmem_default net.core.wmem_max
net.core.rmem_default = 212992
net.core.rmem_max = 212992
net.core.wmem_default = 212992
net.core.wmem_max = 212992

After:

$ sysctl net.core.rmem_default net.core.rmem_max net.core.wmem_default net.core.wmem_max
net.core.rmem_default = 212992
net.core.rmem_max = 4194304
net.core.wmem_default = 212992
net.core.wmem_max = 4194304

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Neal Cardwell <ncardwell@google.com>
Link: https://patch.msgid.link/20250819174030.1986278-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-08-20 19:35:00 -07:00
Eric Biggers
2f3dd6ec90 sctp: Convert cookie authentication to use HMAC-SHA256
Convert SCTP cookies to use HMAC-SHA256, instead of the previous choice
of the legacy algorithms HMAC-MD5 and HMAC-SHA1.  Simplify and optimize
the code by using the HMAC-SHA256 library instead of crypto_shash, and
by preparing the HMAC key when it is generated instead of per-operation.

This doesn't break compatibility, since the cookie format is an
implementation detail, not part of the SCTP protocol itself.

Note that the cookie size doesn't change either.  The HMAC field was
already 32 bytes, even though previously at most 20 bytes were actually
compared.  32 bytes exactly fits an untruncated HMAC-SHA256 value.  So,
although we could safely truncate the MAC to something slightly shorter,
for now just keep the cookie size the same.

I also considered SipHash, but that would generate only 8-byte MACs.  An
8-byte MAC *might* suffice here.  However, there's quite a lot of
information in the SCTP cookies: more than in TCP SYN cookies.  So
absent an analysis that occasional forgeries of all that information is
okay in SCTP, I errored on the side of caution.

Remove HMAC-MD5 and HMAC-SHA1 as options, since the new HMAC-SHA256
option is just better.  It's faster as well as more secure.  For
example, benchmarking on x86_64, cookie authentication is now nearly 3x
as fast as the previous default choice and implementation of HMAC-MD5.

Also just make the kernel always support cookie authentication if SCTP
is supported at all, rather than making it optional in the build.  (It
was sort of optional before, but it didn't really work properly.  E.g.,
a kernel with CONFIG_SCTP_COOKIE_HMAC_MD5=n still supported HMAC-MD5
cookie authentication if CONFIG_CRYPTO_HMAC and CONFIG_CRYPTO_MD5
happened to be enabled in the kconfig for other reasons.)

Acked-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
Link: https://patch.msgid.link/20250818205426.30222-5-ebiggers@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-08-19 19:36:26 -07:00
Eric Biggers
bf40785fa4 sctp: Use HMAC-SHA1 and HMAC-SHA256 library for chunk authentication
For SCTP chunk authentication, use the HMAC-SHA1 and HMAC-SHA256 library
functions instead of crypto_shash.  This is simpler and faster.  There's
no longer any need to pre-allocate 'crypto_shash' objects; the SCTP code
now simply calls into the HMAC code directly.

As part of this, make SCTP always support both HMAC-SHA1 and
HMAC-SHA256.  Previously, it only guaranteed support for HMAC-SHA1.
However, HMAC-SHA256 tended to be supported too anyway, as it was
supported if CONFIG_CRYPTO_SHA256 was enabled elsewhere in the kconfig.

Acked-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
Link: https://patch.msgid.link/20250818205426.30222-4-ebiggers@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-08-19 19:36:25 -07:00
Kuniyuki Iwashima
bf64002c94 net: Define sk_memcg under CONFIG_MEMCG.
Except for sk_clone_lock(), all accesses to sk->sk_memcg
is done under CONFIG_MEMCG.

As a bonus, let's define sk->sk_memcg under CONFIG_MEMCG.

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
Link: https://patch.msgid.link/20250815201712.1745332-11-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-08-19 19:20:59 -07:00
Kuniyuki Iwashima
b2ffd10cdd net-memcg: Pass struct sock to mem_cgroup_sk_under_memory_pressure().
We will store a flag in the lowest bit of sk->sk_memcg.

Then, we cannot pass the raw pointer to mem_cgroup_under_socket_pressure().

Let's pass struct sock to it and rename the function to match other
functions starting with mem_cgroup_sk_.

Note that the helper is moved to sock.h to use mem_cgroup_from_sk().

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
Link: https://patch.msgid.link/20250815201712.1745332-10-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-08-19 19:20:59 -07:00
Kuniyuki Iwashima
43049b0db0 net-memcg: Introduce mem_cgroup_sk_enabled().
The socket memcg feature is enabled by a static key and
only works for non-root cgroup.

We check both conditions in many places.

Let's factorise it as a helper function.

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
Link: https://patch.msgid.link/20250815201712.1745332-8-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-08-19 19:20:59 -07:00
Kuniyuki Iwashima
f7161b234f net-memcg: Introduce mem_cgroup_from_sk().
We will store a flag in the lowest bit of sk->sk_memcg.

Then, directly dereferencing sk->sk_memcg will be illegal, and we
do not want to allow touching the raw sk->sk_memcg in many places.

Let's introduce mem_cgroup_from_sk().

Other places accessing the raw sk->sk_memcg will be converted later.

Note that we cannot define the helper as an inline function in
memcontrol.h as we cannot access any fields of struct sock there
due to circular dependency, so it is placed in sock.h.

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
Link: https://patch.msgid.link/20250815201712.1745332-7-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-08-19 19:20:59 -07:00
Dipayaan Roy
730ff06d3f net: mana: Use page pool fragments for RX buffers instead of full pages to improve memory efficiency.
This patch enhances RX buffer handling in the mana driver by allocating
pages from a page pool and slicing them into MTU-sized fragments, rather
than dedicating a full page per packet. This approach is especially
beneficial on systems with large base page sizes like 64KB.

Key improvements:

- Proper integration of page pool for RX buffer allocations.
- MTU-sized buffer slicing to improve memory utilization.
- Reduce overall per Rx queue memory footprint.
- Automatic fallback to full-page buffers when:
   * Jumbo frames are enabled (MTU > PAGE_SIZE / 2).
   * The XDP path is active, to avoid complexities with fragment reuse.

Testing on VMs with 64KB pages shows around 200% throughput improvement.
Memory efficiency is significantly improved due to reduced wastage in page
allocations. Example: We are now able to fit 35 rx buffers in a single 64kb
page for MTU size of 1500, instead of 1 rx buffer per page previously.

Tested:

- iperf3, iperf2, and nttcp benchmarks.
- Jumbo frames with MTU 9000.
- Native XDP programs (XDP_PASS, XDP_DROP, XDP_TX, XDP_REDIRECT) for
  testing the XDP path in driver.
- Memory leak detection (kmemleak).
- Driver load/unload, reboot, and stress scenarios.

Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Saurabh Sengar <ssengar@linux.microsoft.com>
Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
Signed-off-by: Dipayaan Roy <dipayanroy@linux.microsoft.com>
Link: https://patch.msgid.link/20250814140410.GA22089@linuxonhyperv3.guj3yctzbm1etfxqx2vob5hsef.xx.internal.cloudapp.net
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-08-19 14:42:44 +02:00
Luiz Augusto von Dentz
9d4b01a0bf Bluetooth: hci_core: Fix not accounting for BIS/CIS/PA links separately
This fixes the likes of hci_conn_num(CIS_LINK) returning the total of
ISO connection which includes BIS_LINK as well, so this splits the
iso_num into each link type and introduces hci_iso_num that can be used
in places where the total number of ISO connection still needs to be
used.

Fixes: 23205562ff ("Bluetooth: separate CIS_LINK and BIS_LINK link types")
Fixes: a7bcffc673 ("Bluetooth: Add PA_LINK to distinguish BIG sync and PA sync connections")
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2025-08-15 10:13:41 -04:00
Luiz Augusto von Dentz
3dcf7175f2 Bluetooth: hci_core: Fix using ll_privacy_capable for current settings
ll_privacy_capable only indicates that the controller supports the
feature but it doesnt' check that LE is enabled so it end up being
marked as active in the current settings when it shouldn't.

Fixes: ad383c2c65 ("Bluetooth: hci_sync: Enable advertising when LL privacy is enabled")
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2025-08-15 09:44:49 -04:00
Luiz Augusto von Dentz
709788b154 Bluetooth: hci_core: Fix using {cis,bis}_capable for current settings
{cis,bis}_capable only indicates the controller supports the feature
since it doesn't check that LE is enabled so it shall not be used for
current setting, instead this introduces {cis,bis}_enabled macros that
can be used to indicate that these features are currently enabled.

Fixes: 26afbd826e ("Bluetooth: Add initial implementation of CIS connections")
Fixes: eca0ae4aea ("Bluetooth: Add initial implementation of BIS connections")
Fixes: ae75336131 ("Bluetooth: Check for ISO support in controller")
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2025-08-15 09:44:49 -04:00
William Liu
52bf272636 net/sched: Fix backlog accounting in qdisc_dequeue_internal
This issue applies for the following qdiscs: hhf, fq, fq_codel, and
fq_pie, and occurs in their change handlers when adjusting to the new
limit. The problem is the following in the values passed to the
subsequent qdisc_tree_reduce_backlog call given a tbf parent:

   When the tbf parent runs out of tokens, skbs of these qdiscs will
   be placed in gso_skb. Their peek handlers are qdisc_peek_dequeued,
   which accounts for both qlen and backlog. However, in the case of
   qdisc_dequeue_internal, ONLY qlen is accounted for when pulling
   from gso_skb. This means that these qdiscs are missing a
   qdisc_qstats_backlog_dec when dropping packets to satisfy the
   new limit in their change handlers.

   One can observe this issue with the following (with tc patched to
   support a limit of 0):

   export TARGET=fq
   tc qdisc del dev lo root
   tc qdisc add dev lo root handle 1: tbf rate 8bit burst 100b latency 1ms
   tc qdisc replace dev lo handle 3: parent 1:1 $TARGET limit 1000
   echo ''; echo 'add child'; tc -s -d qdisc show dev lo
   ping -I lo -f -c2 -s32 -W0.001 127.0.0.1 2>&1 >/dev/null
   echo ''; echo 'after ping'; tc -s -d qdisc show dev lo
   tc qdisc change dev lo handle 3: parent 1:1 $TARGET limit 0
   echo ''; echo 'after limit drop'; tc -s -d qdisc show dev lo
   tc qdisc replace dev lo handle 2: parent 1:1 sfq
   echo ''; echo 'post graft'; tc -s -d qdisc show dev lo

   The second to last show command shows 0 packets but a positive
   number (74) of backlog bytes. The problem becomes clearer in the
   last show command, where qdisc_purge_queue triggers
   qdisc_tree_reduce_backlog with the positive backlog and causes an
   underflow in the tbf parent's backlog (4096 Mb instead of 0).

To fix this issue, the codepath for all clients of qdisc_dequeue_internal
has been simplified: codel, pie, hhf, fq, fq_pie, and fq_codel.
qdisc_dequeue_internal handles the backlog adjustments for all cases that
do not directly use the dequeue handler.

The old fq_codel_change limit adjustment loop accumulated the arguments to
the subsequent qdisc_tree_reduce_backlog call through the cstats field.
However, this is confusing and error prone as fq_codel_dequeue could also
potentially mutate this field (which qdisc_dequeue_internal calls in the
non gso_skb case), so we have unified the code here with other qdiscs.

Fixes: 2d3cbfd6d5 ("net_sched: Flush gso_skb list too during ->change()")
Fixes: 4b549a2ef4 ("fq_codel: Fair Queue Codel AQM")
Fixes: 10239edf86 ("net-qdisc-hhf: Heavy-Hitter Filter (HHF) qdisc")
Signed-off-by: William Liu <will@willsroot.io>
Reviewed-by: Savino Dicanosa <savy@syst3mfailure.io>
Link: https://patch.msgid.link/20250812235725.45243-1-will@willsroot.io
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-08-14 17:52:29 -07:00
Parav Pandit
41a6e8ab18 devlink/port: Check attributes early and constify
Constify the devlink port attributes to indicate they are read only
and does not depend on anything else. Therefore, validate it early
before setting in the devlink port.

Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Parav Pandit <parav@nvidia.com>
Link: https://patch.msgid.link/20250813094417.7269-3-parav@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-08-14 17:35:20 -07:00
Sven Stegemann
52565a9352 net: kcm: Fix race condition in kcm_unattach()
syzbot found a race condition when kcm_unattach(psock)
and kcm_release(kcm) are executed at the same time.

kcm_unattach() is missing a check of the flag
kcm->tx_stopped before calling queue_work().

If the kcm has a reserved psock, kcm_unattach() might get executed
between cancel_work_sync() and unreserve_psock() in kcm_release(),
requeuing kcm->tx_work right before kcm gets freed in kcm_done().

Remove kcm->tx_stopped and replace it by the less
error-prone disable_work_sync().

Fixes: ab7ac4eb98 ("kcm: Kernel Connection Multiplexor module")
Reported-by: syzbot+e62c9db591c30e174662@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=e62c9db591c30e174662
Reported-by: syzbot+d199b52665b6c3069b94@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=d199b52665b6c3069b94
Reported-by: syzbot+be6b1fdfeae512726b4e@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=be6b1fdfeae512726b4e
Signed-off-by: Sven Stegemann <sven@stegemann.de>
Link: https://patch.msgid.link/20250812191810.27777-1-sven@stegemann.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-08-13 18:18:33 -07:00
Jakub Kicinski
212122775e Merge branch '10GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue
Tony Nguyen says:

====================
ixgbe: bypass devlink phys_port_name generation

Jedrzej adds option to skip phys_port_name generation and opts
ixgbe into it as some configurations rely on pre-devlink naming
which could end up broken as a result.

* '10GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue:
  ixgbe: prevent from unwanted interface name changes
  devlink: let driver opt out of automatic phys_port_name generation
====================

Link: https://patch.msgid.link/20250812205226.1984369-1-anthony.l.nguyen@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-08-13 17:31:46 -07:00
Frederic Weisbecker
c0a23bbc98 ipvs: Fix estimator kthreads preferred affinity
The estimator kthreads' affinity are defined by sysctl overwritten
preferences and applied through a plain call to the scheduler's affinity
API.

However since the introduction of managed kthreads preferred affinity,
such a practice shortcuts the kthreads core code which eventually
overwrites the target to the default unbound affinity.

Fix this with using the appropriate kthread's API.

Fixes: d1a8919758 ("kthread: Default affine kthread to its preferred NUMA node")
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Florian Westphal <fw@strlen.de>
2025-08-13 08:34:33 +02:00
Jedrzej Jagielski
c5ec7f49b4 devlink: let driver opt out of automatic phys_port_name generation
Currently when adding devlink port, phys_port_name is automatically
generated within devlink port initialization flow. As a result adding
devlink port support to driver may result in forced changes of interface
names, which breaks already existing network configs.

This is an expected behavior but in some scenarios it would not be
preferable to provide such limitation for legacy driver not being able to
keep 'pre-devlink' interface name.

Add flag no_phys_port_name to devlink_port_attrs struct which indicates
if devlink should not alter name of interface.

Suggested-by: Jiri Pirko <jiri@resnulli.us>
Link: https://lore.kernel.org/all/nbwrfnjhvrcduqzjl4a2jafnvvud6qsbxlvxaxilnryglf4j7r@btuqrimnfuly/
Signed-off-by: Jedrzej Jagielski <jedrzej.jagielski@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-08-12 13:23:39 -07:00
Jakub Kicinski
64fdaa94bf net: page_pool: allow enabling recycling late, fix false positive warning
Page pool can have pages "directly" (locklessly) recycled to it,
if the NAPI that owns the page pool is scheduled to run on the same CPU.
To make this safe we check that the NAPI is disabled while we destroy
the page pool. In most cases NAPI and page pool lifetimes are tied
together so this happens naturally.

The queue API expects the following order of calls:
 -> mem_alloc
    alloc new pp
 -> stop
    napi_disable
 -> start
    napi_enable
 -> mem_free
    free old pp

Here we allocate the page pool in ->mem_alloc and free in ->mem_free.
But the NAPIs are only stopped between ->stop and ->start. We created
page_pool_disable_direct_recycling() to safely shut down the recycling
in ->stop. This way the page_pool_destroy() call in ->mem_free doesn't
have to worry about recycling any more.

Unfortunately, the page_pool_disable_direct_recycling() is not enough
to deal with failures which necessitate freeing the _new_ page pool.
If we hit a failure in ->mem_alloc or ->stop the new page pool has
to be freed while the NAPI is active (assuming driver attaches the
page pool to an existing NAPI instance and doesn't reallocate NAPIs).

Freeing the new page pool is technically safe because it hasn't been
used for any packets, yet, so there can be no recycling. But the check
in napi_assert_will_not_race() has no way of knowing that. We could
check if page pool is empty but that'd make the check much less likely
to trigger during development.

Add page_pool_enable_direct_recycling(), pairing with
page_pool_disable_direct_recycling(). It will allow us to create the new
page pools in "disabled" state and only enable recycling when we know
the reconfig operation will not fail.

Coincidentally it will also let us re-enable the recycling for the old
pool, if the reconfig failed:

 -> mem_alloc (new)
 -> stop (old)
    # disables direct recycling for old
 -> start (new)
    # fail!!
 -> start (old)
    # go back to old pp but direct recycling is lost :(
 -> mem_free (new)

The new helper is idempotent to make the life easier for drivers,
which can operate in HDS mode and support zero-copy Rx.
The driver can call the helper twice whether there are two pools
or it has multiple references to a single pool.

Fixes: 40eca00ae6 ("bnxt_en: unlink page pool when stopping Rx queue")
Tested-by: David Wei <dw@davidwei.uk>
Link: https://patch.msgid.link/20250805003654.2944974-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-08-08 12:54:42 -07:00
Linus Torvalds
3781648824 Previous releases - regressions:
- netlink: avoid infinite retry looping in netlink_unicast()
 
 Previous releases - always broken:
 
  - packet: fix a race in packet_set_ring() and packet_notifier()
 
  - ipv6: reject malicious packets in ipv6_gso_segment()
 
  - sched: mqprio: fix stack out-of-bounds write in tc entry parsing
 
  - net: drop UFO packets (injected via virtio) in udp_rcv_segment()
 
  - eth: mlx5: correctly set gso_segs when LRO is used, avoid false
    positive checksum validation errors
 
  - netpoll: prevent hanging NAPI when netcons gets enabled
 
  - phy: mscc: fix parsing of unicast frames for PTP timestamping
 
  - number of device tree / OF reference leak fixes
 
 Signed-off-by: Jakub Kicinski <kuba@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmiUvMcACgkQMUZtbf5S
 IrucuA//bQGZdQkpRo/2zWDFS4wN7hV8PkR+kI0F4rUyNQjFDyUOrccYnhHl8VRn
 DfnmVC9oiWxwuW2QgrgH1KKSw9dxhDPVmhLezAHaZv9XAPPgPO/Yb2Dr3oKHF+yK
 +/QW57FN8rNg8mHUzMS/26Y+rH6OubTaf3MKcsT2uZhuIXKHScOUedmr6EeEkp/L
 c32MpWfcVC7M2jjvCH+HO4k4RawWaB8W93iMUrLMSVI1oE3Tsjyhl5Cv9TBixh7g
 3KZW98qVqMXKBQr0QTF4LBR0IWgKOS4KMVJqrgQ3CZszbDWbmBsFfi8olr/AS0Nt
 vgk6hTd8vVHXa+sOvdFpDbocdjXBq6vf5bDd9p57bt3JyxFMfFUWbG1eDj8bUpMI
 YuKAhQg9mTr6R8DaLwqANY8zrSET2FiYbNkUsP9TD8++q2j34U5/a468cxymfjJ/
 90vSrBGUpwxYPhx2B+Slu1SZJ1g0NbUL34jrKBuSNloVoZDRuzq2zd310BigG/35
 T5CTr5OH+USlFjL6KpgGHnygiMQB/h2WgEjWbFoZHnIb+exVKCgS6IiNBVV3b2OK
 Nv6iVyFKPdKJ8qxeaiYMWA16wG8buOytlBn/1hYKLBvyVKAAx9Ib4ao2TK9w/1Vs
 50sX21h+Awj2a/17sXlSXA+RVdeBcAzHdsGWHr1ql31htn2Xl10=
 =nwy7
 -----END PGP SIGNATURE-----

Merge tag 'net-6.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Pull networking fixes from Jakub Kicinski:
  Previous releases - regressions:

   - netlink: avoid infinite retry looping in netlink_unicast()

  Previous releases - always broken:

   - packet: fix a race in packet_set_ring() and packet_notifier()

   - ipv6: reject malicious packets in ipv6_gso_segment()

   - sched: mqprio: fix stack out-of-bounds write in tc entry parsing

   - net: drop UFO packets (injected via virtio) in udp_rcv_segment()

   - eth: mlx5: correctly set gso_segs when LRO is used, avoid false
     positive checksum validation errors

   - netpoll: prevent hanging NAPI when netcons gets enabled

   - phy: mscc: fix parsing of unicast frames for PTP timestamping

   - a number of device tree / OF reference leak fixes"

* tag 'net-6.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (44 commits)
  pptp: fix pptp_xmit() error path
  net: ti: icssg-prueth: Fix skb handling for XDP_PASS
  net: Update threaded state in napi config in netif_set_threaded
  selftests: netdevsim: Xfail nexthop test on slow machines
  eth: fbnic: Lock the tx_dropped update
  eth: fbnic: Fix tx_dropped reporting
  eth: fbnic: remove the debugging trick of super high page bias
  net: ftgmac100: fix potential NULL pointer access in ftgmac100_phy_disconnect
  dt-bindings: net: Replace bouncing Alexandru Tachici emails
  dpll: zl3073x: ZL3073X_I2C and ZL3073X_SPI should depend on NET
  net/sched: mqprio: fix stack out-of-bounds write in tc entry parsing
  Revert "net: mdio_bus: Use devm for getting reset GPIO"
  selftests: net: packetdrill: xfail all problems on slow machines
  net/packet: fix a race in packet_set_ring() and packet_notifier()
  benet: fix BUG when creating VFs
  net: airoha: npu: Add missing MODULE_FIRMWARE macros
  net: devmem: fix DMA direction on unmapping
  ipa: fix compile-testing with qcom-mdt=m
  eth: fbnic: unlink NAPIs from queues on error to open
  net: Add locking to protect skb->dev access in ip_output
  ...
2025-08-08 07:03:25 +03:00
Linus Torvalds
a530a36bb5 Kbuild updates for v6.17
- Fix a shortcut key issue in menuconfig
 
  - Fix missing rebuild of kheaders
 
  - Sort the symbol dump generated by gendwarfsyms
 
  - Support zboot extraction in scripts/extract-vmlinux
 
  - Migrate gconfig to GTK 3
 
  - Add TAR variable to allow overriding the default tar command
 
  - Hand over Kbuild maintainership
 -----BEGIN PGP SIGNATURE-----
 
 iQJJBAABCgAzFiEEbmPs18K1szRHjPqEPYsBB53g2wYFAmiSr38VHG1hc2FoaXJv
 eUBrZXJuZWwub3JnAAoJED2LAQed4NsGcTIP/RCVr/OEJgVXg8dOBNhNuhGoidnM
 2uqRcaza68tOSegpFGcfd9bhO1TCR/O5SL117TS8UEx4f9ge7gk/+XVZC8i1268m
 9+V6eJd8QI34nB/EezMnrhvFmn2kC0UMuSldZYQJ2cReLjtapBP2xBtWnxi+Zyyw
 +FjdHwQln7E8UaB/gMqh9KVVOytX4NIUUZEA/78nd4eJaJbLxJ/5ztAxGLB//bXI
 Rr6bjAeOmIfRWS9QWnGzNzHmzp4SSmU+/gdLXyaWlmoVjeut8O+BJXvQRNfswk8K
 JXmk6uZUx6CNheCca2RaM0i6XAArkpOQc/7v7Ul/rSriTxdxAVUghjk0fNrXJGvQ
 kBjewOTUXg8f4xhuPAL3nkWmCh0s0tV3Q0EneAaJuUck5yRkW0bxxKa74h6ji2q8
 8RsNS5Mq0yYiR1gmCqhEmTN6/wDamSpGkHT0k6ukipixjCr5DP2QFP/xT3d7qSwc
 W6msliXgBmY/ZrGoJXy4zjmu5vxKfAes+7JLBDxUKjdItC818qwSPf+nbvVIdJqb
 K4/hcNDuYPKd/8N8YapPjbGfTBk9Xqp74ez4xg5XNIBPS+cE5k/mePletbCzkzC5
 vzjoNgVUmpPGPdDaBk+S4jSEzWUi575aQx590OfdoXsBt4CQVHHk4PEM1Qh5cWnZ
 m6tx1oqfruovU6Gx
 =MtyJ
 -----END PGP SIGNATURE-----

Merge tag 'kbuild-v6.17-2' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild

Pull Kbuild updates from Masahiro Yamada:
 "This is the last pull request from me.

  I'm grateful to have been able to continue as a maintainer for eight
  years. From the next cycle, Nathan and Nicolas will maintain Kbuild.

   - Fix a shortcut key issue in menuconfig

   - Fix missing rebuild of kheaders

   - Sort the symbol dump generated by gendwarfsyms

   - Support zboot extraction in scripts/extract-vmlinux

   - Migrate gconfig to GTK 3

   - Add TAR variable to allow overriding the default tar command

   - Hand over Kbuild maintainership"

* tag 'kbuild-v6.17-2' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild: (92 commits)
  MAINTAINERS: hand over Kbuild maintenance
  kheaders: make it possible to override TAR
  kbuild: userprogs: use correct linker when mixing clang and GNU ld
  kconfig: lxdialog: replace strcpy() with strncpy() in inputbox.c
  kconfig: lxdialog: replace strcpy with snprintf in print_autowrap
  kconfig: gconf: refactor text_insert_help()
  kconfig: gconf: remove unneeded variable in text_insert_msg
  kconfig: gconf: use hyphens in signals
  kconfig: gconf: replace GtkImageMenuItem with GtkMenuItem
  kconfig: gconf: Fix Back button behavior
  kconfig: gconf: fix single view to display dependent symbols correctly
  scripts: add zboot support to extract-vmlinux
  gendwarfksyms: order -T symtypes output by name
  gendwarfksyms: use preferred form of sizeof for allocation
  kconfig: qconf: confine {begin,end}Group to constructor and destructor
  kconfig: qconf: fix ConfigList::updateListAllforAll()
  kconfig: add a function to dump all menu entries in a tree-like format
  kconfig: gconf: show GTK version in About dialog
  kconfig: gconf: replace GtkHPaned and GtkVPaned with GtkPaned
  kconfig: gconf: replace GdkColor with GdkRGBA
  ...
2025-08-06 07:32:52 +03:00
Sharath Chandra Vurukala
1dbf1d590d net: Add locking to protect skb->dev access in ip_output
In ip_output() skb->dev is updated from the skb_dst(skb)->dev
this can become invalid when the interface is unregistered and freed,

Introduced new skb_dst_dev_rcu() function to be used instead of
skb_dst_dev() within rcu_locks in ip_output.This will ensure that
all the skb's associated with the dev being deregistered will
be transnmitted out first, before freeing the dev.

Given that ip_output() is called within an rcu_read_lock()
critical section or from a bottom-half context, it is safe to introduce
an RCU read-side critical section within it.

Multiple panic call stacks were observed when UL traffic was run
in concurrency with device deregistration from different functions,
pasting one sample for reference.

[496733.627565][T13385] Call trace:
[496733.627570][T13385] bpf_prog_ce7c9180c3b128ea_cgroupskb_egres+0x24c/0x7f0
[496733.627581][T13385] __cgroup_bpf_run_filter_skb+0x128/0x498
[496733.627595][T13385] ip_finish_output+0xa4/0xf4
[496733.627605][T13385] ip_output+0x100/0x1a0
[496733.627613][T13385] ip_send_skb+0x68/0x100
[496733.627618][T13385] udp_send_skb+0x1c4/0x384
[496733.627625][T13385] udp_sendmsg+0x7b0/0x898
[496733.627631][T13385] inet_sendmsg+0x5c/0x7c
[496733.627639][T13385] __sys_sendto+0x174/0x1e4
[496733.627647][T13385] __arm64_sys_sendto+0x28/0x3c
[496733.627653][T13385] invoke_syscall+0x58/0x11c
[496733.627662][T13385] el0_svc_common+0x88/0xf4
[496733.627669][T13385] do_el0_svc+0x2c/0xb0
[496733.627676][T13385] el0_svc+0x2c/0xa4
[496733.627683][T13385] el0t_64_sync_handler+0x68/0xb4
[496733.627689][T13385] el0t_64_sync+0x1a4/0x1a8

Changes in v3:
- Replaced WARN_ON() with  WARN_ON_ONCE(), as suggested by Willem de Bruijn.
- Dropped legacy lines mistakenly pulled in from an outdated branch.

Changes in v2:
- Addressed review comments from Eric Dumazet
- Used READ_ONCE() to prevent potential load/store tearing
- Added skb_dst_dev_rcu() and used along with rcu_read_lock() in ip_output

Signed-off-by: Sharath Chandra Vurukala <quic_sharathv@quicinc.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250730105118.GA26100@hu-sharathv-hyd.qualcomm.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-08-01 15:17:52 -07:00
Wang Liang
d46e51f1c7 net: drop UFO packets in udp_rcv_segment()
When sending a packet with virtio_net_hdr to tun device, if the gso_type
in virtio_net_hdr is SKB_GSO_UDP and the gso_size is less than udphdr
size, below crash may happen.

  ------------[ cut here ]------------
  kernel BUG at net/core/skbuff.c:4572!
  Oops: invalid opcode: 0000 [#1] SMP NOPTI
  CPU: 0 UID: 0 PID: 62 Comm: mytest Not tainted 6.16.0-rc7 #203 PREEMPT(voluntary)
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
  RIP: 0010:skb_pull_rcsum+0x8e/0xa0
  Code: 00 00 5b c3 cc cc cc cc 8b 93 88 00 00 00 f7 da e8 37 44 38 00 f7 d8 89 83 88 00 00 00 48 8b 83 c8 00 00 00 5b c3 cc cc cc cc <0f> 0b 0f 0b 66 66 2e 0f 1f 84 00 000
  RSP: 0018:ffffc900001fba38 EFLAGS: 00000297
  RAX: 0000000000000004 RBX: ffff8880040c1000 RCX: ffffc900001fb948
  RDX: ffff888003e6d700 RSI: 0000000000000008 RDI: ffff88800411a062
  RBP: ffff8880040c1000 R08: 0000000000000000 R09: 0000000000000001
  R10: ffff888003606c00 R11: 0000000000000001 R12: 0000000000000000
  R13: ffff888004060900 R14: ffff888004050000 R15: ffff888004060900
  FS:  000000002406d3c0(0000) GS:ffff888084a19000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 0000000020000040 CR3: 0000000004007000 CR4: 00000000000006f0
  Call Trace:
   <TASK>
   udp_queue_rcv_one_skb+0x176/0x4b0 net/ipv4/udp.c:2445
   udp_queue_rcv_skb+0x155/0x1f0 net/ipv4/udp.c:2475
   udp_unicast_rcv_skb+0x71/0x90 net/ipv4/udp.c:2626
   __udp4_lib_rcv+0x433/0xb00 net/ipv4/udp.c:2690
   ip_protocol_deliver_rcu+0xa6/0x160 net/ipv4/ip_input.c:205
   ip_local_deliver_finish+0x72/0x90 net/ipv4/ip_input.c:233
   ip_sublist_rcv_finish+0x5f/0x70 net/ipv4/ip_input.c:579
   ip_sublist_rcv+0x122/0x1b0 net/ipv4/ip_input.c:636
   ip_list_rcv+0xf7/0x130 net/ipv4/ip_input.c:670
   __netif_receive_skb_list_core+0x21d/0x240 net/core/dev.c:6067
   netif_receive_skb_list_internal+0x186/0x2b0 net/core/dev.c:6210
   napi_complete_done+0x78/0x180 net/core/dev.c:6580
   tun_get_user+0xa63/0x1120 drivers/net/tun.c:1909
   tun_chr_write_iter+0x65/0xb0 drivers/net/tun.c:1984
   vfs_write+0x300/0x420 fs/read_write.c:593
   ksys_write+0x60/0xd0 fs/read_write.c:686
   do_syscall_64+0x50/0x1c0 arch/x86/entry/syscall_64.c:63
   </TASK>

To trigger gso segment in udp_queue_rcv_skb(), we should also set option
UDP_ENCAP_ESPINUDP to enable udp_sk(sk)->encap_rcv. When the encap_rcv
hook return 1 in udp_queue_rcv_one_skb(), udp_csum_pull_header() will try
to pull udphdr, but the skb size has been segmented to gso size, which
leads to this crash.

Previous commit cf329aa42b ("udp: cope with UDP GRO packet misdirection")
introduces segmentation in UDP receive path only for GRO, which was never
intended to be used for UFO, so drop UFO packets in udp_rcv_segment().

Link: https://lore.kernel.org/netdev/20250724083005.3918375-1-wangliang74@huawei.com/
Link: https://lore.kernel.org/netdev/20250729123907.3318425-1-wangliang74@huawei.com/
Fixes: cf329aa42b ("udp: cope with UDP GRO packet misdirection")
Suggested-by: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
Signed-off-by: Wang Liang <wangliang74@huawei.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20250730101458.3470788-1-wangliang74@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-08-01 15:14:51 -07:00
Linus Torvalds
d9104cec3e bpf-next-6.17
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE+soXsSLHKoYyzcli6rmadz2vbToFAmiINnEACgkQ6rmadz2v
 bToBnA/9F+A3R6rTwGk4HK3xpfc/nm2Tanl3oRN7S2ub/mskDOtWSIyG6cVFZ0UG
 1fK6IkByyRIpAF/5qhdlw8drRXHkQtGLA0lP2L9llm4X1mHLofB18y9OeLrDE1WN
 KwNP06+IGX9W802lCGSIXOY+VmRscVfXSMokyQt2ilHplKjOnDqJcYkWupi3T2rC
 mz79FY9aEl2YrIcpj9RXz+8nwP49pZBuW2P0IM5PAIj4BJBXShrUp8T1nz94okNe
 NFsnAyRxjWpUT0McEgtA9WvpD9lZqujYD8Qp0KlGZWmI3vNpV5d9S1+dBcEb1n7q
 dyNMkTF3oRrJhhg4VqoHc6fVpzSEoZ9ZxV5Hx4cs+ganH75D4YbdGqx/7mR3DUgH
 MZh6rHF1pGnK7TAm7h5gl3ZRAOkZOaahbe1i01NKo9CEe5fSh3AqMyzJYoyGHRKi
 xDN39eQdWBNA+hm1VkbK2Bv93Rbjrka2Kj+D3sSSO9Bo/u3ntcknr7LW39idKz62
 Q8dkKHcCEtun7gjk0YXPF013y81nEohj1C+52BmJ2l5JitM57xfr6YOaQpu7DPDE
 AJbHx6ASxKdyEETecd0b+cXUPQ349zmRXy0+CDMAGKpBicC0H0mHhL14cwOY1Hfu
 EIpIjmIJGI3JNF6T5kybcQGSBOYebdV0FFgwSllzPvuYt7YsHCs=
 =/O3j
 -----END PGP SIGNATURE-----

Merge tag 'bpf-next-6.17' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next

Pull bpf updates from Alexei Starovoitov:

 - Remove usermode driver (UMD) framework (Thomas Weißschuh)

 - Introduce Strongly Connected Component (SCC) in the verifier to
   detect loops and refine register liveness (Eduard Zingerman)

 - Allow 'void *' cast using bpf_rdonly_cast() and corresponding
   '__arg_untrusted' for global function parameters (Eduard Zingerman)

 - Improve precision for BPF_ADD and BPF_SUB operations in the verifier
   (Harishankar Vishwanathan)

 - Teach the verifier that constant pointer to a map cannot be NULL
   (Ihor Solodrai)

 - Introduce BPF streams for error reporting of various conditions
   detected by BPF runtime (Kumar Kartikeya Dwivedi)

 - Teach the verifier to insert runtime speculation barrier (lfence on
   x86) to mitigate speculative execution instead of rejecting the
   programs (Luis Gerhorst)

 - Various improvements for 'veristat' (Mykyta Yatsenko)

 - For CONFIG_DEBUG_KERNEL config warn on internal verifier errors to
   improve bug detection by syzbot (Paul Chaignon)

 - Support BPF private stack on arm64 (Puranjay Mohan)

 - Introduce bpf_cgroup_read_xattr() kfunc to read xattr of cgroup's
   node (Song Liu)

 - Introduce kfuncs for read-only string opreations (Viktor Malik)

 - Implement show_fdinfo() for bpf_links (Tao Chen)

 - Reduce verifier's stack consumption (Yonghong Song)

 - Implement mprog API for cgroup-bpf programs (Yonghong Song)

* tag 'bpf-next-6.17' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (192 commits)
  selftests/bpf: Migrate fexit_noreturns case into tracing_failure test suite
  selftests/bpf: Add selftest for attaching tracing programs to functions in deny list
  bpf: Add log for attaching tracing programs to functions in deny list
  bpf: Show precise rejected function when attaching fexit/fmod_ret to __noreturn functions
  bpf: Fix various typos in verifier.c comments
  bpf: Add third round of bounds deduction
  selftests/bpf: Test invariants on JSLT crossing sign
  selftests/bpf: Test cross-sign 64bits range refinement
  selftests/bpf: Update reg_bound range refinement logic
  bpf: Improve bounds when s64 crosses sign boundary
  bpf: Simplify bounds refinement from s32
  selftests/bpf: Enable private stack tests for arm64
  bpf, arm64: JIT support for private stack
  bpf: Move bpf_jit_get_prog_name() to core.c
  bpf, arm64: Fix fp initialization for exception boundary
  umd: Remove usermode driver framework
  bpf/preload: Don't select USERMODE_DRIVER
  selftests/bpf: Fix test dynptr/test_dynptr_memset_xdp_chunks failure
  selftests/bpf: Fix test dynptr/test_dynptr_copy_xdp failure
  selftests/bpf: Increase xdp data size for arm64 64K page size
  ...
2025-07-30 09:58:50 -07:00
Linus Torvalds
8be4d31cb8 Networking changes for 6.17.
Core & protocols
 ----------------
 
  - Wrap datapath globals into net_aligned_data, to avoid false sharing.
 
  - Preserve MSG_ZEROCOPY in forwarding (e.g. out of a container).
 
  - Add SO_INQ and SCM_INQ support to AF_UNIX.
 
  - Add SIOCINQ support to AF_VSOCK.
 
  - Add TCP_MAXSEG sockopt to MPTCP.
 
  - Add IPv6 force_forwarding sysctl to enable forwarding per interface.
 
  - Make TCP validation of whether packet fully fits in the receive
    window and the rcv_buf more strict. With increased use of HW
    aggregation a single "packet" can be multiple 100s of kB.
 
  - Add MSG_MORE flag to optimize large TCP transmissions via sockmap,
    improves latency up to 33% for sockmap users.
 
  - Convert TCP send queue handling from tasklet to BH workque.
 
  - Improve BPF iteration over TCP sockets to see each socket exactly once.
 
  - Remove obsolete and unused TCP RFC3517/RFC6675 loss recovery code.
 
  - Support enabling kernel threads for NAPI processing on per-NAPI
    instance basis rather than a whole device. Fully stop the kernel NAPI
    thread when threaded NAPI gets disabled. Previously thread would stick
    around until ifdown due to tricky synchronization.
 
  - Allow multicast routing to take effect on locally-generated packets.
 
  - Add output interface argument for End.X in segment routing.
 
  - MCTP: add support for gateway routing, improve bind() handling.
 
  - Don't require rtnl_lock when fetching an IPv6 neighbor over Netlink.
 
  - Add a new neighbor flag ("extern_valid"), which cedes refresh
    responsibilities to userspace. This is needed for EVPN multi-homing
    where a neighbor entry for a multi-homed host needs to be synced
    across all the VTEPs among which the host is multi-homed.
 
  - Support NUD_PERMANENT for proxy neighbor entries.
 
  - Add a new queuing discipline for IETF RFC9332 DualQ Coupled AQM.
 
  - Add sequence numbers to netconsole messages. Unregister netconsole's
    console when all net targets are removed. Code refactoring.
    Add a number of selftests.
 
  - Align IPSec inbound SA lookup to RFC 4301. Only SPI and protocol
    should be used for an inbound SA lookup.
 
  - Support inspecting ref_tracker state via DebugFS.
 
  - Don't force bonding advertisement frames tx to ~333 ms boundaries.
    Add broadcast_neighbor option to send ARP/ND on all bonded links.
 
  - Allow providing upcall pid for the 'execute' command in openvswitch.
 
  - Remove DCCP support from Netfilter's conntrack.
 
  - Disallow multiple packet duplications in the queuing layer.
 
  - Prevent use of deprecated iptables code on PREEMPT_RT.
 
 Driver API
 ----------
 
  - Support RSS and hashing configuration over ethtool Netlink.
 
  - Add dedicated ethtool callbacks for getting and setting hashing fields.
 
  - Add support for power budget evaluation strategy in PSE /
    Power-over-Ethernet. Generate Netlink events for overcurrent etc.
 
  - Support DPLL phase offset monitoring across all device inputs.
    Support providing clock reference and SYNC over separate DPLL
    inputs.
 
  - Support traffic classes in devlink rate API for bandwidth management.
 
  - Remove rtnl_lock dependency from UDP tunnel port configuration.
 
 Device drivers
 --------------
 
  - Add a new Broadcom driver for 800G Ethernet (bnge).
 
  - Add a standalone driver for Microchip ZL3073x DPLL.
 
  - Remove IBM's NETIUCV device driver.
 
  - Ethernet high-speed NICs:
    - Broadcom (bnxt):
     - support zero-copy Tx of DMABUF memory
     - take page size into account for page pool recycling rings
    - Intel (100G, ice, idpf):
      - idpf: XDP and AF_XDP support preparations
      - idpf: add flow steering
      - add link_down_events statistic
      - clean up the TSPLL code
      - preparations for live VM migration
    - nVidia/Mellanox:
     - support zero-copy Rx/Tx interfaces (DMABUF and io_uring)
     - optimize context memory usage for matchers
     - expose serial numbers in devlink info
     - support PCIe congestion metrics
    - Meta (fbnic):
      - add 25G, 50G, and 100G link modes to phylink
      - support dumping FW logs
    - Marvell/Cavium:
      - support for CN20K generation of the Octeon chips
    - Amazon:
      - add HW clock (without timestamping, just hypervisor time access)
 
  - Ethernet virtual:
    - VirtIO net:
      - support segmentation of UDP-tunnel-encapsulated packets
    - Google (gve):
      - support packet timestamping and clock synchronization
    - Microsoft vNIC:
      - add handler for device-originated servicing events
      - allow dynamic MSI-X vector allocation
      - support Tx bandwidth clamping
 
  - Ethernet NICs consumer, and embedded:
    - AMD:
      - amd-xgbe: hardware timestamping and PTP clock support
    - Broadcom integrated MACs (bcmgenet, bcmasp):
      - use napi_complete_done() return value to support NAPI polling
      - add support for re-starting auto-negotiation
    - Broadcom switches (b53):
      - support BCM5325 switches
      - add bcm63xx EPHY power control
    - Synopsys (stmmac):
      - lots of code refactoring and cleanups
    - TI:
      - icssg-prueth: read firmware-names from device tree
      - icssg: PRP offload support
    - Microchip:
      - lan78xx: convert to PHYLINK for improved PHY and MAC management
      - ksz: add KSZ8463 switch support
    - Intel:
      - support similar queue priority scheme in multi-queue and
        time-sensitive networking (taprio)
      - support packet pre-emption in both
    - RealTek (r8169):
      - enable EEE at 5Gbps on RTL8126
    - Airoha:
      - add PPPoE offload support
      - MDIO bus controller for Airoha AN7583
 
  - Ethernet PHYs:
    - support for the IPQ5018 internal GE PHY
    - micrel KSZ9477 switch-integrated PHYs:
      - add MDI/MDI-X control support
      - add RX error counters
      - add cable test support
      - add Signal Quality Indicator (SQI) reporting
    - dp83tg720: improve reset handling and reduce link recovery time
    - support bcm54811 (and its MII-Lite interface type)
    - air_en8811h: support resume/suspend
    - support PHY counters for QCA807x and QCA808x
    - support WoL for QCA807x
 
  - CAN drivers:
    - rcar_canfd: support for Transceiver Delay Compensation
    - kvaser: report FW versions via devlink dev info
 
  - WiFi:
    - extended regulatory info support (6 GHz)
    - add statistics and beacon monitor for Multi-Link Operation (MLO)
    - support S1G aggregation, improve S1G support
    - add Radio Measurement action fields
    - support per-radio RTS threshold
    - some work around how FIPS affects wifi, which was wrong (RC4 is used
      by TKIP, not only WEP)
    - improvements for unsolicited probe response handling
 
  - WiFi drivers:
    - RealTek (rtw88):
      - IBSS mode for SDIO devices
    - RealTek (rtw89):
      - BT coexistence for MLO/WiFi7
      - concurrent station + P2P support
      - support for USB devices RTL8851BU/RTL8852BU
    - Intel (iwlwifi):
      - use embedded PNVM in (to be released) FW images to fix
        compatibility issues
      - many cleanups (unused FW APIs, PCIe code, WoWLAN)
      - some FIPS interoperability
    - MediaTek (mt76):
      - firmware recovery improvements
      - more MLO work
    - Qualcomm/Atheros (ath12k):
      - fix scan on multi-radio devices
      - more EHT/Wi-Fi 7 features
      - encapsulation/decapsulation offload
    - Broadcom (brcm80211):
      - support SDIO 43751 device
 
  - Bluetooth:
    - hci_event: add support for handling LE BIG Sync Lost event
    - ISO: add socket option to report packet seqnum via CMSG
    - ISO: support SCM_TIMESTAMPING for ISO TS
 
  - Bluetooth drivers:
    - intel_pcie: support Function Level Reset
    - nxpuart: add support for 4M baudrate
    - nxpuart: implement powerup sequence, reset, FW dump, and FW loading
 
 Signed-off-by: Jakub Kicinski <kuba@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmiFgLgACgkQMUZtbf5S
 IrvafxAAnQRwYBoIG+piCILx6z5pRvBGHkmEQ4AQgSCFuq2eO3ubwMFIqEybfma1
 5+QFjUZAV3OgGgKRBS2KGWxtSzdiF+/JGV1VOIN67sX3Mm0a2QgjA4n5CgKL0FPr
 o6BEzjX5XwG1zvGcBNQ5BZ19xUUKjoZQgTtnea8sZ57Fsp5RtRgmYRqoewNvNk/n
 uImh0NFsDVb0UeOpSzC34VD9l1dJvLGdui4zJAjno/vpvmT1DkXjoK419J/r52SS
 X+5WgsfJ6DkjHqVN1tIhhK34yWqBOcwGFZJgEnWHMkFIl2FqRfFKMHyqtfLlVnLA
 mnIpSyz8Sq2AHtx0TlgZ3At/Ri8p5+yYJgHOXcDKyABa8y8Zf4wrycmr6cV9JLuL
 z54nLEVnJuvfDVDVJjsLYdJXyhMpZFq6+uAItdxKaw8Ugp/QqG4QtoRj+XIHz4ZW
 z6OohkCiCzTwEISFK+pSTxPS30eOxq43kCspcvuLiwCCStJBRkRb5GdZA4dm7LA+
 1Od4ADAkHjyrFtBqTyyC2scX8UJ33DlAIpAYyIeS6w9Cj9EXxtp1z33IAAAZ03MW
 jJwIaJuc8bK2fWKMmiG7ucIXjPo4t//KiWlpkwwqLhPbjZgfDAcxq1AC2TLoqHBL
 y4EOgKpHDCMAghSyiFIAn2JprGcEt8dp+11B0JRXIn4Pm/eYDH8=
 =lqbe
 -----END PGP SIGNATURE-----

Merge tag 'net-next-6.17' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next

Pull networking updates from Jakub Kicinski:
 "Core & protocols:

   - Wrap datapath globals into net_aligned_data, to avoid false sharing

   - Preserve MSG_ZEROCOPY in forwarding (e.g. out of a container)

   - Add SO_INQ and SCM_INQ support to AF_UNIX

   - Add SIOCINQ support to AF_VSOCK

   - Add TCP_MAXSEG sockopt to MPTCP

   - Add IPv6 force_forwarding sysctl to enable forwarding per interface

   - Make TCP validation of whether packet fully fits in the receive
     window and the rcv_buf more strict. With increased use of HW
     aggregation a single "packet" can be multiple 100s of kB

   - Add MSG_MORE flag to optimize large TCP transmissions via sockmap,
     improves latency up to 33% for sockmap users

   - Convert TCP send queue handling from tasklet to BH workque

   - Improve BPF iteration over TCP sockets to see each socket exactly
     once

   - Remove obsolete and unused TCP RFC3517/RFC6675 loss recovery code

   - Support enabling kernel threads for NAPI processing on per-NAPI
     instance basis rather than a whole device. Fully stop the kernel
     NAPI thread when threaded NAPI gets disabled. Previously thread
     would stick around until ifdown due to tricky synchronization

   - Allow multicast routing to take effect on locally-generated packets

   - Add output interface argument for End.X in segment routing

   - MCTP: add support for gateway routing, improve bind() handling

   - Don't require rtnl_lock when fetching an IPv6 neighbor over Netlink

   - Add a new neighbor flag ("extern_valid"), which cedes refresh
     responsibilities to userspace. This is needed for EVPN multi-homing
     where a neighbor entry for a multi-homed host needs to be synced
     across all the VTEPs among which the host is multi-homed

   - Support NUD_PERMANENT for proxy neighbor entries

   - Add a new queuing discipline for IETF RFC9332 DualQ Coupled AQM

   - Add sequence numbers to netconsole messages. Unregister
     netconsole's console when all net targets are removed. Code
     refactoring. Add a number of selftests

   - Align IPSec inbound SA lookup to RFC 4301. Only SPI and protocol
     should be used for an inbound SA lookup

   - Support inspecting ref_tracker state via DebugFS

   - Don't force bonding advertisement frames tx to ~333 ms boundaries.
     Add broadcast_neighbor option to send ARP/ND on all bonded links

   - Allow providing upcall pid for the 'execute' command in openvswitch

   - Remove DCCP support from Netfilter's conntrack

   - Disallow multiple packet duplications in the queuing layer

   - Prevent use of deprecated iptables code on PREEMPT_RT

  Driver API:

   - Support RSS and hashing configuration over ethtool Netlink

   - Add dedicated ethtool callbacks for getting and setting hashing
     fields

   - Add support for power budget evaluation strategy in PSE /
     Power-over-Ethernet. Generate Netlink events for overcurrent etc

   - Support DPLL phase offset monitoring across all device inputs.
     Support providing clock reference and SYNC over separate DPLL
     inputs

   - Support traffic classes in devlink rate API for bandwidth
     management

   - Remove rtnl_lock dependency from UDP tunnel port configuration

  Device drivers:

   - Add a new Broadcom driver for 800G Ethernet (bnge)

   - Add a standalone driver for Microchip ZL3073x DPLL

   - Remove IBM's NETIUCV device driver

   - Ethernet high-speed NICs:
      - Broadcom (bnxt):
         - support zero-copy Tx of DMABUF memory
         - take page size into account for page pool recycling rings
      - Intel (100G, ice, idpf):
         - idpf: XDP and AF_XDP support preparations
         - idpf: add flow steering
         - add link_down_events statistic
         - clean up the TSPLL code
         - preparations for live VM migration
      - nVidia/Mellanox:
         - support zero-copy Rx/Tx interfaces (DMABUF and io_uring)
         - optimize context memory usage for matchers
         - expose serial numbers in devlink info
         - support PCIe congestion metrics
      - Meta (fbnic):
         - add 25G, 50G, and 100G link modes to phylink
         - support dumping FW logs
      - Marvell/Cavium:
         - support for CN20K generation of the Octeon chips
      - Amazon:
         - add HW clock (without timestamping, just hypervisor time access)

   - Ethernet virtual:
      - VirtIO net:
         - support segmentation of UDP-tunnel-encapsulated packets
      - Google (gve):
         - support packet timestamping and clock synchronization
      - Microsoft vNIC:
         - add handler for device-originated servicing events
         - allow dynamic MSI-X vector allocation
         - support Tx bandwidth clamping

   - Ethernet NICs consumer, and embedded:
      - AMD:
         - amd-xgbe: hardware timestamping and PTP clock support
      - Broadcom integrated MACs (bcmgenet, bcmasp):
         - use napi_complete_done() return value to support NAPI polling
         - add support for re-starting auto-negotiation
      - Broadcom switches (b53):
         - support BCM5325 switches
         - add bcm63xx EPHY power control
      - Synopsys (stmmac):
         - lots of code refactoring and cleanups
      - TI:
         - icssg-prueth: read firmware-names from device tree
         - icssg: PRP offload support
      - Microchip:
         - lan78xx: convert to PHYLINK for improved PHY and MAC management
         - ksz: add KSZ8463 switch support
      - Intel:
         - support similar queue priority scheme in multi-queue and
           time-sensitive networking (taprio)
         - support packet pre-emption in both
      - RealTek (r8169):
         - enable EEE at 5Gbps on RTL8126
      - Airoha:
         - add PPPoE offload support
         - MDIO bus controller for Airoha AN7583

   - Ethernet PHYs:
      - support for the IPQ5018 internal GE PHY
      - micrel KSZ9477 switch-integrated PHYs:
         - add MDI/MDI-X control support
         - add RX error counters
         - add cable test support
         - add Signal Quality Indicator (SQI) reporting
      - dp83tg720: improve reset handling and reduce link recovery time
      - support bcm54811 (and its MII-Lite interface type)
      - air_en8811h: support resume/suspend
      - support PHY counters for QCA807x and QCA808x
      - support WoL for QCA807x

   - CAN drivers:
      - rcar_canfd: support for Transceiver Delay Compensation
      - kvaser: report FW versions via devlink dev info

   - WiFi:
      - extended regulatory info support (6 GHz)
      - add statistics and beacon monitor for Multi-Link Operation (MLO)
      - support S1G aggregation, improve S1G support
      - add Radio Measurement action fields
      - support per-radio RTS threshold
      - some work around how FIPS affects wifi, which was wrong (RC4 is
        used by TKIP, not only WEP)
      - improvements for unsolicited probe response handling

   - WiFi drivers:
      - RealTek (rtw88):
         - IBSS mode for SDIO devices
      - RealTek (rtw89):
         - BT coexistence for MLO/WiFi7
         - concurrent station + P2P support
         - support for USB devices RTL8851BU/RTL8852BU
      - Intel (iwlwifi):
         - use embedded PNVM in (to be released) FW images to fix
           compatibility issues
         - many cleanups (unused FW APIs, PCIe code, WoWLAN)
         - some FIPS interoperability
      - MediaTek (mt76):
         - firmware recovery improvements
         - more MLO work
      - Qualcomm/Atheros (ath12k):
         - fix scan on multi-radio devices
         - more EHT/Wi-Fi 7 features
         - encapsulation/decapsulation offload
      - Broadcom (brcm80211):
         - support SDIO 43751 device

   - Bluetooth:
      - hci_event: add support for handling LE BIG Sync Lost event
      - ISO: add socket option to report packet seqnum via CMSG
      - ISO: support SCM_TIMESTAMPING for ISO TS

   - Bluetooth drivers:
      - intel_pcie: support Function Level Reset
      - nxpuart: add support for 4M baudrate
      - nxpuart: implement powerup sequence, reset, FW dump, and FW loading"

* tag 'net-next-6.17' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1742 commits)
  dpll: zl3073x: Fix build failure
  selftests: bpf: fix legacy netfilter options
  ipv6: annotate data-races around rt->fib6_nsiblings
  ipv6: fix possible infinite loop in fib6_info_uses_dev()
  ipv6: prevent infinite loop in rt6_nlmsg_size()
  ipv6: add a retry logic in net6_rt_notify()
  vrf: Drop existing dst reference in vrf_ip6_input_dst
  net/sched: taprio: align entry index attr validation with mqprio
  net: fsl_pq_mdio: use dev_err_probe
  selftests: rtnetlink.sh: remove esp4_offload after test
  vsock: remove unnecessary null check in vsock_getname()
  igb: xsk: solve negative overflow of nb_pkts in zerocopy mode
  stmmac: xsk: fix negative overflow of budget in zerocopy mode
  dt-bindings: ieee802154: Convert at86rf230.txt yaml format
  net: dsa: microchip: Disable PTP function of KSZ8463
  net: dsa: microchip: Setup fiber ports for KSZ8463
  net: dsa: microchip: Write switch MAC address differently for KSZ8463
  net: dsa: microchip: Use different registers for KSZ8463
  net: dsa: microchip: Add KSZ8463 switch support to KSZ DSA driver
  dt-bindings: net: dsa: microchip: Add KSZ8463 switch support
  ...
2025-07-30 08:58:55 -07:00
Linus Torvalds
c3018a2c6a for-6.17/io_uring-20250728
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmiHcP8QHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgphu+EACp+MNKBIx7iIM1arbmafEmroaR9Whi5CJu
 8BczAKRVPNhICPXAjr6UOmRuD4nBzPysPNi73xNRdXdnd/dnOL27zMqF/0JCJsd6
 V4Xt7pXcanJ4BoeRDv9P332h6/IVo1ulxqEzhWJt2KKoq4cWbSwlvk18NPOdGIkp
 PcMPARf15d4KbLprdST5Dzrg2uqZ3jZ4Y4zvIMKery56AZCkp5ZrtPFttw4jhTff
 FQzIjY/hXXDFURubv75EHFg7KC9rcZa4CuU40duTHmsXD/EsSsa0eAJMmiVAGjhM
 PXK/70ZOoxLK1zU7zN+iAMUnVL7onoNvo8g9e7/FJNuHV25am8aUx4DzzP3ci+He
 +2C4PpOKx9Egz0oKmC6hl9eMg/muPd/3j8uxQlGOvz38icgeirQB/ebr1KLtBkM9
 pKQmCsGaIzncXWddcTMCayEj3wbvn2lgWSqSgZuwTt1AXqSTxLOmTywCYcXDBdli
 2ejh/Hk45TALnBhiiDWdOT2euh2EEP1ADOZzVsZzeR29OzLqJDRjRnitb40NUU60
 +ny2HOcplIN6aah+QdFwK31FieVKDY/ufjt1yGvwCP+kIxUEDSYi+NRRldmwznxy
 8UZwFyvd8wOxUc+iG/wCK7ccY+MtYyaZAE4ok2QEvQvMUTBJE/LvI4bkd77fx7Sy
 2cMeDukZ/w==
 =EGH5
 -----END PGP SIGNATURE-----

Merge tag 'for-6.17/io_uring-20250728' of git://git.kernel.dk/linux

Pull io_uring updates from Jens Axboe:

 - Optimization to avoid reference counts on non-cloned registered
   buffers. This is how these buffers were handled prior to having
   cloning support, and we can still use that approach as long as the
   buffers haven't been cloned to another ring.

 - Cleanup and improvement for uring_cmd, where btrfs was the only user
   of storing allocated data for the lifetime of the uring_cmd. Clean
   that up so we can get rid of the need to do that.

 - Avoid unnecessary memory copies in uring_cmd usage. This is
   particularly important as a lot of uring_cmd usage necessitates the
   use of 128b SQEs.

 - A few updates for recv multishot, where it's now possible to add
   fairness limits for limiting how much is transferred for each retry
   loop. Additionally, recv multishot now supports an overall cap as
   well, where once reached the multishot recv will terminate. The
   latter is useful for buffer management and juggling many recv streams
   at the same time.

 - Add support for returning the TX timestamps via a new socket command.
   This feature can work in either singleshot or multishot mode, where
   the latter triggers a completion whenever new timestamps are
   available. This is an alternative to using the existing error queue.

 - Add support for an io_uring "mock" file, which is the start of being
   able to do 100% targeted testing in terms of exercising io_uring
   request handling. The idea is to have a file type that can be
   anything the tester would like, and behave exactly how you want it to
   behave in terms of hitting the code paths you want.

 - Improve zcrx by using sgtables to de-duplicate and improve dma
   address handling.

 - Prep work for supporting larger pages for zcrx.

 - Various little improvements and fixes.

* tag 'for-6.17/io_uring-20250728' of git://git.kernel.dk/linux: (42 commits)
  io_uring/zcrx: fix leaking pages on sg init fail
  io_uring/zcrx: don't leak pages on account failure
  io_uring/zcrx: fix null ifq on area destruction
  io_uring: fix breakage in EXPERT menu
  io_uring/cmd: remove struct io_uring_cmd_data
  btrfs/ioctl: store btrfs_uring_encoded_data in io_btrfs_cmd
  io_uring/cmd: introduce IORING_URING_CMD_REISSUE flag
  io_uring/zcrx: account area memory
  io_uring: export io_[un]account_mem
  io_uring/net: Support multishot receive len cap
  io_uring: deduplicate wakeup handling
  io_uring/net: cast min_not_zero() type
  io_uring/poll: cleanup apoll freeing
  io_uring/net: allow multishot receive per-invocation cap
  io_uring/net: move io_sr_msg->retry_flags to io_sr_msg->flags
  io_uring/net: use passed in 'len' in io_recv_buf_select()
  io_uring/zcrx: prepare fallback for larger pages
  io_uring/zcrx: assert area type in io_zcrx_iov_page
  io_uring/zcrx: allocate sgtable for umem areas
  io_uring/zcrx: introduce io_populate_area_dma
  ...
2025-07-28 16:30:12 -07:00
Linus Torvalds
672dcda246 vfs-6.17-rc1.pidfs
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCaINCiQAKCRCRxhvAZXjc
 orltAQDq3y1anYETz5/FD6P2gXY1W5hXdSm3EHHeacQ1JjTXvgEA2g1lWO7J4anf
 oOVE8aSvMow/FOjivLZBYmI65pkYJAE=
 =oDKB
 -----END PGP SIGNATURE-----

Merge tag 'vfs-6.17-rc1.pidfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull pidfs updates from Christian Brauner:

 - persistent info

   Persist exit and coredump information independent of whether anyone
   currently holds a pidfd for the struct pid.

   The current scheme allocated pidfs dentries on-demand repeatedly.
   This scheme is reaching it's limits as it makes it impossible to pin
   information that needs to be available after the task has exited or
   coredumped and that should not be lost simply because the pidfd got
   closed temporarily. The next opener should still see the stashed
   information.

   This is also a prerequisite for supporting extended attributes on
   pidfds to allow attaching meta information to them.

   If someone opens a pidfd for a struct pid a pidfs dentry is allocated
   and stashed in pid->stashed. Once the last pidfd for the struct pid
   is closed the pidfs dentry is released and removed from pid->stashed.

   So if 10 callers create a pidfs dentry for the same struct pid
   sequentially, i.e., each closing the pidfd before the other creates a
   new one then a new pidfs dentry is allocated every time.

   Because multiple tasks acquiring and releasing a pidfd for the same
   struct pid can race with each another a task may still find a valid
   pidfs entry from the previous task in pid->stashed and reuse it. Or
   it might find a dead dentry in there and fail to reuse it and so
   stashes a new pidfs dentry. Multiple tasks may race to stash a new
   pidfs dentry but only one will succeed, the other ones will put their
   dentry.

   The current scheme aims to ensure that a pidfs dentry for a struct
   pid can only be created if the task is still alive or if a pidfs
   dentry already existed before the task was reaped and so exit
   information has been was stashed in the pidfs inode.

   That's great except that it's buggy. If a pidfs dentry is stashed in
   pid->stashed after pidfs_exit() but before __unhash_process() is
   called we will return a pidfd for a reaped task without exit
   information being available.

   The pidfds_pid_valid() check does not guard against this race as it
   doens't sync at all with pidfs_exit(). The pid_has_task() check might
   be successful simply because we're before __unhash_process() but
   after pidfs_exit().

   Introduce a new scheme where the lifetime of information associated
   with a pidfs entry (coredump and exit information) isn't bound to the
   lifetime of the pidfs inode but the struct pid itself.

   The first time a pidfs dentry is allocated for a struct pid a struct
   pidfs_attr will be allocated which will be used to store exit and
   coredump information.

   If all pidfs for the pidfs dentry are closed the dentry and inode can
   be cleaned up but the struct pidfs_attr will stick until the struct
   pid itself is freed. This will ensure minimal memory usage while
   persisting relevant information.

   The new scheme has various advantages. First, it allows to close the
   race where we end up handing out a pidfd for a reaped task for which
   no exit information is available. Second, it minimizes memory usage.
   Third, it allows to remove complex lifetime tracking via dentries
   when registering a struct pid with pidfs. There's no need to get or
   put a reference. Instead, the lifetime of exit and coredump
   information associated with a struct pid is bound to the lifetime of
   struct pid itself.

 - extended attributes

   Now that we have a way to persist information for pidfs dentries we
   can start supporting extended attributes on pidfds. This will allow
   userspace to attach meta information to tasks.

   One natural extension would be to introduce a custom pidfs.* extended
   attribute space and allow for the inheritance of extended attributes
   across fork() and exec().

   The first simple scheme will allow privileged userspace to set
   trusted extended attributes on pidfs inodes.

 - Allow autonomous pidfs file handles

   Various filesystems such as pidfs and drm support opening file
   handles without having to require a file descriptor to identify the
   filesystem. The filesystem are global single instances and can be
   trivially identified solely on the information encoded in the file
   handle.

   This makes it possible to not have to keep or acquire a sentinal file
   descriptor just to pass it to open_by_handle_at() to identify the
   filesystem. That's especially useful when such sentinel file
   descriptor cannot or should not be acquired.

   For pidfs this means a file handle can function as full replacement
   for storing a pid in a file. Instead a file handle can be stored and
   reopened purely based on the file handle.

   Such autonomous file handles can be opened with or without specifying
   a a file descriptor. If no proper file descriptor is used the
   FD_PIDFS_ROOT sentinel must be passed. This allows us to define
   further special negative fd sentinels in the future.

   Userspace can trivially test for support by trying to open the file
   handle with an invalid file descriptor.

 - Allow pidfds for reaped tasks with SCM_PIDFD messages

   This is a logical continuation of the earlier work to create pidfds
   for reaped tasks through the SO_PEERPIDFD socket option merged in
   923ea4d448 ("Merge patch series "net, pidfs: enable handing out
   pidfds for reaped sk->sk_peer_pid"").

 - Two minor fixes:

    * Fold fs_struct->{lock,seq} into a seqlock

    * Don't bother with path_{get,put}() in unix_open_file()

* tag 'vfs-6.17-rc1.pidfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (37 commits)
  don't bother with path_get()/path_put() in unix_open_file()
  fold fs_struct->{lock,seq} into a seqlock
  selftests: net: extend SCM_PIDFD test to cover stale pidfds
  af_unix: enable handing out pidfds for reaped tasks in SCM_PIDFD
  af_unix: stash pidfs dentry when needed
  af_unix/scm: fix whitespace errors
  af_unix: introduce and use scm_replace_pid() helper
  af_unix: introduce unix_skb_to_scm helper
  af_unix: rework unix_maybe_add_creds() to allow sleep
  selftests/pidfd: decode pidfd file handles withou having to specify an fd
  fhandle, pidfs: support open_by_handle_at() purely based on file handle
  uapi/fcntl: add FD_PIDFS_ROOT
  uapi/fcntl: add FD_INVALID
  fcntl/pidfd: redefine PIDFD_SELF_THREAD_GROUP
  uapi/fcntl: mark range as reserved
  fhandle: reflow get_path_anchor()
  pidfs: add pidfs_root_path() helper
  fhandle: rename to get_path_anchor()
  fhandle: hoist copy_from_user() above get_path_from_fd()
  fhandle: raise FILEID_IS_DIR in handle_type
  ...
2025-07-28 14:10:15 -07:00
Jakub Kicinski
c6dc26df6b netfilter pull request 25-07-25
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEjF9xRqF1emXiQiqU1w0aZmrPKyEFAmiDtZkACgkQ1w0aZmrP
 KyGxohAAns0Vyq4zx4VKcaDQQbciakZnHPK28eHOGxNKzh6qX/ybB8feOeAVOR1Q
 RCsVkSej/HiGIOsrWwiOKecx+d/Sf2wSGeJYftPMjk5pL/V3SyMdAUUgQgPtne1o
 LHE3rsu9BLGXt6M1mpn4+DjDQqDbLBVXvi3x/FNXCFJETuEfaL7gcisXKxzeqCtS
 fWqFw0JzQNmHCYsBbUb7CpoowD3QKvSBdIUP+8ciC9qszRfgGfOlCKwkyRwp/uV3
 vh1yLkHvlh9r4oe+PP4fNvTMbsaPS1jecj6xNwRmEEyf0bvBBCl9iNbWe5F9Fqvr
 dbbOkKOHfo0nCYR3AzQFJpM1o81KGQ90JkcrR/hhEII8KD/3VKdvUqFl1WkU6rEP
 rAxkl4lXM7aq11nJp2dClvshSZO/6Fo2byISGfLuVLPnq1/Lo2zZjnmk7StE1bm2
 bWCA+C64CjAt1gqUCC1TZ7XpMtAZDQEdDSSN49/BqpJJncFEhigzph+I3H+LFuE9
 hcUdF4i1UTNlqMTv4uY8ycdmrjRj4SfeXaBXxWjnaRn5Wxfs0GdtmdeiVuBTLylQ
 Y18MM5OMGso1bzgS392jjwzMsommdLcgb7/v8Z6TU9dAjoeS2cYac5JiqilkgiSN
 0nzikhtADCHZ+tgMU2LOc+nK93KQWyaJUtrSR5uEU6tHqe/m6No=
 =mKnX
 -----END PGP SIGNATURE-----

Merge tag 'nf-next-25-07-25' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next

Pablo Neira Ayuso says:

====================
Netfilter/IPVS updates for net-next

The following series contains Netfilter/IPVS updates for net-next:

1) Display netns inode in conntrack table full log, from lvxiafei.

2) Autoload nf_log_syslog in case no logging backend is available,
   from Lance Yang.

3) Three patches to remove unused functions in x_tables, nf_tables and
   conntrack. From Yue Haibing.

4) Exclude LEGACY TABLES on PREEMPT_RT: Add NETFILTER_XTABLES_LEGACY
   to exclude xtables legacy infrastructure.

5) Restore selftests by toggling NETFILTER_XTABLES_LEGACY where needed.
   From Florian Westphal.

6) Use CONFIG_INET_SCTP_DIAG in tools/testing/selftests/net/netfilter/config,
   from Sebastian Andrzej Siewior.

7) Use timer_delete in comment in IPVS codebase, from WangYuli.

8) Dump flowtable information in nfnetlink_hook, this includes an initial
   patch to consolidate common code in helper function, from Phil Sutter.

9) Remove unused arguments in nft_pipapo set backend, from Florian Westphal.

10) Return nft_set_ext instead of boolean in set lookup function,
    from Florian Westphal.

11) Remove indirection in dynamic set infrastructure, also from Florian.

12) Consolidate pipapo_get/lookup, from Florian.

13) Use kvmalloc in nft_pipapop, from Florian Westphal.

14) syzbot reports slab-out-of-bounds in xt_nfacct log message,
    fix from Florian Westphal.

15) Ignored tainted kernels in selftest nft_interface_stress.sh,
    from Phil Sutter.

16) Fix IPVS selftest by disabling rp_filter with ipip tunnel device,
    from Yi Chen.

* tag 'nf-next-25-07-25' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next:
  selftests: netfilter: ipvs.sh: Explicity disable rp_filter on interface tunl0
  selftests: netfilter: Ignore tainted kernels in interface stress test
  netfilter: xt_nfacct: don't assume acct name is null-terminated
  netfilter: nft_set_pipapo: prefer kvmalloc for scratch maps
  netfilter: nft_set_pipapo: merge pipapo_get/lookup
  netfilter: nft_set: remove indirection from update API call
  netfilter: nft_set: remove one argument from lookup and update functions
  netfilter: nft_set_pipapo: remove unused arguments
  netfilter: nfnetlink_hook: Dump flowtable info
  netfilter: nfnetlink: New NFNLA_HOOK_INFO_DESC helper
  ipvs: Rename del_timer in comment in ip_vs_conn_expire_now()
  selftests: netfilter: Enable CONFIG_INET_SCTP_DIAG
  selftests: net: Enable legacy netfilter legacy options.
  netfilter: Exclude LEGACY TABLES on PREEMPT_RT.
  netfilter: conntrack: Remove unused net in nf_conntrack_double_lock()
  netfilter: nf_tables: Remove unused nft_reduce_is_readonly()
  netfilter: x_tables: Remove unused functions xt_{in|out}name()
  netfilter: load nf_log_syslog on enabling nf_conntrack_log_invalid
  netfilter: conntrack: table full detailed log
====================

Link: https://patch.msgid.link/20250725170340.21327-1-pablo@netfilter.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-25 16:37:55 -07:00
Kees Cook
511d10b4c2 sctp: Replace sockaddr with sockaddr_inet in sctp_addr union
As part of the removal of the variably-sized sockaddr for kernel
internals, replace struct sockaddr with sockaddr_inet in the sctp_addr
union.

No binary changes; the union size remains unchanged due to sockaddr_inet
matching the size of sockaddr_in6.

Signed-off-by: Kees Cook <kees@kernel.org>
Link: https://patch.msgid.link/20250722171836.1078436-3-kees@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-25 15:29:58 -07:00
Florian Westphal
531e613121 netfilter: nft_set: remove indirection from update API call
This stems from a time when sets and nft_dynset resided in different kernel
modules.  We can replace this with a direct call.

We could even remove both ->update and ->delete, given its only
supported by rhashtable, but on the off-chance we'll see runtime
add/delete for other types or a new set type keep that as-is for now.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2025-07-25 18:40:23 +02:00
Florian Westphal
17a20e09f0 netfilter: nft_set: remove one argument from lookup and update functions
Return the extension pointer instead of passing it as a function
argument to be filled in by the callee.

As-is, whenever false is returned, the extension pointer is not used.

For all set types, when true is returned, the extension pointer was set
to the matching element.

Only exception: nft_set_bitmap doesn't support extensions.
Return a pointer to a static const empty element extension container.

return false -> return NULL
return true -> return the elements' extension pointer.

This saves one function argument.

Signed-off-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2025-07-25 18:40:16 +02:00
Yue Haibing
bf6788742b netfilter: nf_tables: Remove unused nft_reduce_is_readonly()
Since commit 9e539c5b6d ("netfilter: nf_tables: disable expression
reduction infra") this is unused.

Signed-off-by: Yue Haibing <yuehaibing@huawei.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2025-07-25 18:37:12 +02:00
Lance Yang
e89a680466 netfilter: load nf_log_syslog on enabling nf_conntrack_log_invalid
When no logger is registered, nf_conntrack_log_invalid fails to log invalid
packets, leaving users unaware of actual invalid traffic. Improve this by
loading nf_log_syslog, similar to how 'iptables -I FORWARD 1 -m conntrack
--ctstate INVALID -j LOG' triggers it.

Suggested-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Zi Li <zi.li@linux.dev>
Signed-off-by: Lance Yang <lance.yang@linux.dev>
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2025-07-25 18:35:41 +02:00
Samiullah Khawaja
71c52411c5 net: Create separate gro_flush_normal function
Move multiple copies of same code snippet doing `gro_flush` and
`gro_normal_list` into separate helper function.

Signed-off-by: Samiullah Khawaja <skhawaja@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20250723013031.2911384-2-skhawaja@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-24 18:34:55 -07:00
Jakub Kicinski
d2002ccb47 bluetooth-next pull request for net-next:
core:
 
  - hci_sync: fix double free in 'hci_discovery_filter_clear()'
  - hci_event: Mask data status from LE ext adv reports
  - hci_devcd_dump: fix out-of-bounds via dev_coredumpv
  - ISO: add socket option to report packet seqnum via CMSG
  - hci_event: Add support for handling LE BIG Sync Lost event
  - ISO: Support SCM_TIMESTAMPING for ISO TS
  - hci_core: Add PA_LINK to distinguish BIG sync and PA sync connections
  - hci_sock: Reset cookie to zero in hci_sock_free_cookie()
 
 drivers:
 
  - btusb: Add new VID/PID 0489/e14e for MT7925
  - btusb: Add a new VID/PID 2c7c/7009 for MT7925
  - btusb: Add RTL8852BE device 0x13d3:0x3618
  - btusb: Add support for variant of RTL8851BE (USB ID 13d3:3601)
  - btusb: Add USB ID 3625:010b for TP-LINK Archer TX10UB Nano
  - btusb: QCA: Support downloading custom-made firmwares
  - btusb: Add one more ID 0x28de:0x1401 for Qualcomm WCN6855
  - nxp: add support for supply and reset
  - btnxpuart: Add support for 4M baudrate
  - btnxpuart: Correct the Independent Reset handling after FW dump
  - btnxpuart: Add uevents for FW dump and FW download complete
  - btintel: Define a macro for Intel Reset vendor command
  - btintel_pcie: Support Function level reset
  - btintel_pcie: Add support for device 0x4d76
  - btintel_pcie: Make driver wait for alive interrupt
  - btintel_pcie: Fix Alive Context State Handling
  - hci_qca: Enable ISO data packet RX
 -----BEGIN PGP SIGNATURE-----
 
 iQJNBAABCgA3FiEE7E6oRXp8w05ovYr/9JCA4xAyCykFAmiBMSEZHGx1aXoudm9u
 LmRlbnR6QGludGVsLmNvbQAKCRD0kIDjEDILKSrXD/9hlsXgpNgSGJQGdI07xdKR
 QrBbK/7GU7yur5w4lzW5hepn5inmy8cOw9PKo45HhM50cHu95IAnlUP4wDjwc4Ia
 Yv69+VNFonYFZ0P8Frcuy5RUZPdK/K88bdqoyccRm3eXk+Y2Of/6iv6mjx3Wq/fP
 dN8vvOLlEdsiYDK8g82IQKH28Gt8w4TWGgJ94DT/NgPpLgy2IkVcPt9S/s1O4tXu
 InRxvzHwqX98oEgkxgOclbsVZ8rClZMl94ffvT4C0f9rEZ2RWPqU7GCOJUJhSJHs
 98Rxp4K8H+rlRuMhlwe2bzPs7Y6Ao/mMAN6zxKgEN0LfHXgveQ/GQmCznclMlhjS
 5AX7zdCP9Jpfh/DOCfZdKde2Y5j8rBeIn7/7+IqxFjeEnXY7URxdKN+YMxR5w+fe
 vepX7CsJg5x28k1wTgG4CJ9iUP/AV6AmZrwnJcCF35M964IsgrSFa061AeKYb1OX
 gz4a+rQiLC6hTB2cvyBXgLmZehDu/eGHypCxfFFlxOkHwq0dRzV+W+hQfeel+DkO
 AF3Kcw6xzA0nPovjzLGErxeSG18h7Y58pjHHU11M9lUthVChpG3QBxtlS70xbugl
 pPSc3kQI/oB6UYFOFyrjwzSDPkVkXRGgfG/FbZQDag40edMB6hATnBCEqVkf+L7x
 qtgHU9wLX3uAE1f0aUx5gQ==
 =L038
 -----END PGP SIGNATURE-----

Merge tag 'for-net-next-2025-07-23' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next

Luiz Augusto von Dentz says:

====================
bluetooth-next pull request for net-next:

core:

 - hci_sync: fix double free in 'hci_discovery_filter_clear()'
 - hci_event: Mask data status from LE ext adv reports
 - hci_devcd_dump: fix out-of-bounds via dev_coredumpv
 - ISO: add socket option to report packet seqnum via CMSG
 - hci_event: Add support for handling LE BIG Sync Lost event
 - ISO: Support SCM_TIMESTAMPING for ISO TS
 - hci_core: Add PA_LINK to distinguish BIG sync and PA sync connections
 - hci_sock: Reset cookie to zero in hci_sock_free_cookie()

drivers:

 - btusb: Add new VID/PID 0489/e14e for MT7925
 - btusb: Add a new VID/PID 2c7c/7009 for MT7925
 - btusb: Add RTL8852BE device 0x13d3:0x3618
 - btusb: Add support for variant of RTL8851BE (USB ID 13d3:3601)
 - btusb: Add USB ID 3625:010b for TP-LINK Archer TX10UB Nano
 - btusb: QCA: Support downloading custom-made firmwares
 - btusb: Add one more ID 0x28de:0x1401 for Qualcomm WCN6855
 - nxp: add support for supply and reset
 - btnxpuart: Add support for 4M baudrate
 - btnxpuart: Correct the Independent Reset handling after FW dump
 - btnxpuart: Add uevents for FW dump and FW download complete
 - btintel: Define a macro for Intel Reset vendor command
 - btintel_pcie: Support Function level reset
 - btintel_pcie: Add support for device 0x4d76
 - btintel_pcie: Make driver wait for alive interrupt
 - btintel_pcie: Fix Alive Context State Handling
 - hci_qca: Enable ISO data packet RX

* tag 'for-net-next-2025-07-23' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next: (42 commits)
  Bluetooth: Add PA_LINK to distinguish BIG sync and PA sync connections
  Bluetooth: hci_event: Mask data status from LE ext adv reports
  Bluetooth: btintel_pcie: Fix Alive Context State Handling
  Bluetooth: btintel_pcie: Make driver wait for alive interrupt
  Bluetooth: hci_devcd_dump: fix out-of-bounds via dev_coredumpv
  Bluetooth: hci_sync: fix double free in 'hci_discovery_filter_clear()'
  Bluetooth: btusb: Add one more ID 0x28de:0x1401 for Qualcomm WCN6855
  Bluetooth: btusb: Sort WCN6855 device IDs by VID and PID
  Bluetooth: btusb: QCA: Support downloading custom-made firmwares
  Bluetooth: btnxpuart: Add uevents for FW dump and FW download complete
  Bluetooth: btnxpuart: Correct the Independent Reset handling after FW dump
  Bluetooth: ISO: Support SCM_TIMESTAMPING for ISO TS
  Bluetooth: ISO: add socket option to report packet seqnum via CMSG
  Bluetooth: btintel: Define a macro for Intel Reset vendor command
  Bluetooth: Fix typos in comments
  Bluetooth: RFCOMM: Fix typos in comments
  Bluetooth: aosp: Fix typo in comment
  Bluetooth: hci_bcm4377: Fix typo in comment
  Bluetooth: btrtl: Fix typo in comment
  Bluetooth: btmtk: Fix typo in log string
  ...
====================

Link: https://patch.msgid.link/20250723190233.166823-1-luiz.dentz@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-24 18:12:54 -07:00
Jakub Kicinski
126d85fb04 Another wireless update:
- rtw89:
    - STA+P2P concurrency
    - support for USB devices RTL8851BU/RTL8852BU
  - ath9k: OF support
  - ath12k:
    - more EHT/Wi-Fi 7 features
    - encapsulation/decapsulation offload
  - iwlwifi: some FIPS interoperability
  - brcm80211: support SDIO 43751 device
  - rt2x00: better DT/OF support
  - cfg80211/mac80211:
    - improved S1G support
    - beacon monitor for MLO
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEpeA8sTs3M8SN2hR410qiO8sPaAAFAmiCBJcACgkQ10qiO8sP
 aABs3w/9GiI7NLSicM2ulpaWs92za2RrsXsYH01z+m+dx3wUCupKEXKHh+hwy+EW
 WxJpfMqKURAh4QofTCLK0mwOKLr1bT/gNmXd9tKM3oaDIH/fk6HCteTff8GgmU/1
 zfbd5UMSU1W98WJiS3Ukm9FjZN5i1X/cpPdYMIq0sX3sM9JSmDZL6ToLsknr3sC2
 6CZj3UcjPx89e/ei4ALVjlU8DRhKBgqpzHMBgdnPt8bAbp9tUmyTKF0RqrBUn1Md
 9D1inkA/bJxmIKHslc8PgUEeuBlndCjrLCzlYw2XW8UOcLaHxujqQv7drgEjjlr2
 UTM+Hv6itFsRugS465EwbYLjM3lotmbpSWKR7ZQiSBF16jg0mBscq2mqqOU6Dv+X
 SqxTp9WSYptCtylinz/6h8SsaPr+rGxa00sLopcPUrhWgWagndhmKMcVfQBvnlUE
 JAg9gXkQ0d5GuYDOIdW+7i1NpLADthQpyynihhkAyISgfk+43HGUssTXCfRMr7wc
 iL07j6PFGXYTdSDiuaxnS70qn8jHSlFQ6FKvb7jxTGBR3RD1KtCyoWVStx6vE6Q7
 MTnsBGZZZ0yli6ur5vtVJr6ziwjjkBG+XxxuLvbeFb6+LYbPdpfa9fXj/UFct9el
 U9iBlQ0SBwgw8olh1Y6Xbj0XM9B57UnKLgRZ9GxMPVgQPLTli9k=
 =Dkfh
 -----END PGP SIGNATURE-----

Merge tag 'wireless-next-2025-07-24' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next

Johannes Berg says:

====================
Another wireless update:
 - rtw89:
   - STA+P2P concurrency
   - support for USB devices RTL8851BU/RTL8852BU
 - ath9k: OF support
 - ath12k:
   - more EHT/Wi-Fi 7 features
   - encapsulation/decapsulation offload
 - iwlwifi: some FIPS interoperability
 - brcm80211: support SDIO 43751 device
 - rt2x00: better DT/OF support
 - cfg80211/mac80211:
   - improved S1G support
   - beacon monitor for MLO

* tag 'wireless-next-2025-07-24' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next: (199 commits)
  ssb: use new GPIO line value setter callbacks for the second GPIO chip
  wifi: Fix typos
  wifi: brcmsmac: Use str_true_false() helper
  wifi: brcmfmac: fix EXTSAE WPA3 connection failure due to AUTH TX failure
  wifi: brcm80211: Remove yet more unused functions
  wifi: brcm80211: Remove more unused functions
  wifi: brcm80211: Remove unused functions
  wifi: iwlwifi: Revert "wifi: iwlwifi: remove support of several iwl_ppag_table_cmd versions"
  wifi: iwlwifi: check validity of the FW API range
  wifi: iwlwifi: don't export symbols that we shouldn't
  wifi: iwlwifi: mld: use spec link id and not FW link id
  wifi: iwlwifi: mld: decode EOF bit for AMPDUs
  wifi: iwlwifi: Remove support for rx OMI bandwidth reduction
  wifi: iwlwifi: stop supporting iwl_omi_send_status_notif ver 1
  wifi: iwlwifi: remove SC2F firmware support
  wifi: iwlwifi: mvm: Remove NAN support
  wifi: iwlwifi: mld: avoid outdated reorder buffer head_sn
  wifi: iwlwifi: mvm: avoid outdated reorder buffer head_sn
  wifi: iwlwifi: disable certain features for fips_enabled
  wifi: iwlwifi: mld: support channel survey collection for ACS scans
  ...
====================

Link: https://patch.msgid.link/20250724100349.21564-3-johannes@sipsolutions.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-24 17:25:42 -07:00
Jakub Kicinski
8b5a19b4ff Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR (net-6.16-rc8).

Conflicts:

drivers/net/ethernet/microsoft/mana/gdma_main.c
  9669ddda18 ("net: mana: Fix warnings for missing export.h header inclusion")
  7553911210 ("net: mana: Allocate MSI-X vectors dynamically")
https://lore.kernel.org/20250711130752.23023d98@canb.auug.org.au

Adjacent changes:

drivers/net/ethernet/ti/icssg/icssg_prueth.h
  6e86fb73de ("net: ti: icssg-prueth: Fix buffer allocation for ICSSG")
  ffe8a49091 ("net: ti: icssg-prueth: Read firmware-names from device tree")

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-24 11:10:46 -07:00
Paolo Abeni
291d5dc80e ipsec-2025-07-23
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEH7ZpcWbFyOOp6OJbrB3Eaf9PW7cFAmiAj9QACgkQrB3Eaf9P
 W7dQEhAAj0ccCgKych/IBx4zlK/TSsPN9PnZmylTgpfWQdClCKjNEa80Umf2wWgc
 0j5rA77B2AJSHkRHMW6cWpmZRbicnWDavAtUOB/2lEcw8EDyMcIEEByzBtRkylvb
 S1slLlpAYyezHsSr9uGu6+Meb3llkL2+pnF2CiA02uWI0MaEXaqtGfEqZ4DZX9BK
 xQcb3St6Mr6pBV0sQ/rlY3Gfe/t/gzUnJNQQtRuKcZmBG4Ls1MLpJ6fYU++k+lfX
 my4JehdI2Sg6QujX4HdlBDF1bCzcKpxzJGQVhkjT9OPx25owtoknMezeWdrK3Gdk
 0pwIFs0+dFCc6hi2x6seabSBPNZJzT7978XZ0U2PneGoKlSw36UuAvk2ckAaBS+N
 I2flFQRGAf7XjxNn53lRailG/nxGYUMEvhlM+uGjy1rujDn8NmdZXAwtQUGP25bJ
 DCcXiQp6KUnclIWJ0MjScuFUG++Afn52DgsM+N5Fmjg9U/F1dVqIWjixcOOM4Lx2
 DOpJy9gi3r7JZ0b1K7KMq1FibxFGszaW4YeABNPqlcEjnWiVhPNQjxKm0S84UQdL
 Qh0s1PiAWhBqt4bvdf6h+r8309JWgo704MWxC2tBodDVJoYgugNXbnPihrr6+wO3
 D9SbUJHSdTLOWTWzhlrFEb2oHVHAXc3KZpIQnk+32XaxqcNHpeY=
 =MBQn
 -----END PGP SIGNATURE-----

Merge tag 'ipsec-2025-07-23' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec

Steffen Klassert says:

====================
pull request (net): ipsec 2025-07-23

1) Premption fixes for xfrm_state_find.
   From Sabrina Dubroca.

2) Initialize offload path also for SW IPsec GRO. This fixes a
   performance regression on SW IPsec offload.
   From Leon Romanovsky.

3) Fix IPsec UDP GRO for IKE packets.
   From Tobias Brunner,

4) Fix transport header setting for IPcomp after decompressing.
   From Fernando Fernandez Mancera.

5)  Fix use-after-free when xfrmi_changelink tries to change
    collect_md for a xfrm interface.
    From Eyal Birger .

6) Delete the special IPcomp x->tunnel state along with the state x
   to avoid refcount problems.
   From Sabrina Dubroca.

Please pull or let me know if there are problems.

* tag 'ipsec-2025-07-23' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec:
  Revert "xfrm: destroy xfrm_state synchronously on net exit path"
  xfrm: delete x->tunnel as we delete x
  xfrm: interface: fix use-after-free after changing collect_md xfrm interface
  xfrm: ipcomp: adjust transport header after decompressing
  xfrm: Set transport header to fix UDP GRO handling
  xfrm: always initialize offload path
  xfrm: state: use a consistent pcpu_id in xfrm_state_find
  xfrm: state: initialize state_ptrs earlier in xfrm_state_find
====================

Link: https://patch.msgid.link/20250723075417.3432644-1-steffen.klassert@secunet.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-07-24 12:30:40 +02:00
Koen De Schepper
8f9516daed sched: Add enqueue/dequeue of dualpi2 qdisc
DualPI2 provides L4S-type low latency & loss to traffic that uses a
scalable congestion controller (e.g. TCP-Prague, DCTCP) without
degrading the performance of 'classic' traffic (e.g. Reno,
Cubic etc.). It is to be the reference implementation of IETF RFC9332
DualQ Coupled AQM (https://datatracker.ietf.org/doc/html/rfc9332).

Note that creating two independent queues cannot meet the goal of
DualPI2 mentioned in RFC9332: "...to preserve fairness between
ECN-capable and non-ECN-capable traffic." Further, it could even
lead to starvation of Classic traffic, which is also inconsistent
with the requirements in RFC9332: "...although priority MUST be
bounded in order not to starve Classic traffic." DualPI2 is
designed to maintain approximate per-flow fairness on L-queue and
C-queue by forming a single qdisc using the coupling factor and
scheduler between two queues.

The qdisc provides two queues called low latency and classic. It
classifies packets based on the ECN field in the IP headers. By
default it directs non-ECN and ECT(0) into the classic queue and
ECT(1) and CE into the low latency queue, as per the IETF spec.

Each queue runs its own AQM:
* The classic AQM is called PI2, which is similar to the PIE AQM but
  more responsive and simpler. Classic traffic requires a decent
  target queue (default 15ms for Internet deployment) to fully
  utilize the link and to avoid high drop rates.
* The low latency AQM is, by default, a very shallow ECN marking
  threshold (1ms) similar to that used for DCTCP.

The DualQ isolates the low queuing delay of the Low Latency queue
from the larger delay of the 'Classic' queue. However, from a
bandwidth perspective, flows in either queue will share out the link
capacity as if there was just a single queue. This bandwidth pooling
effect is achieved by coupling together the drop and ECN-marking
probabilities of the two AQMs.

The PI2 AQM has two main parameters in addition to its target delay.
The integral gain factor alpha is used to slowly correct any persistent
standing queue error from the target delay, while the proportional gain
factor beta is used to quickly compensate for queue changes (growth or
shrinkage). Either alpha and beta are given as a parameter, or they can
be calculated by tc from alternative typical and maximum RTT parameters.

Internally, the output of a linear Proportional Integral (PI)
controller is used for both queues. This output is squared to
calculate the drop or ECN-marking probability of the classic queue.
This counterbalances the square-root rate equation of Reno/Cubic,
which is the trick that balances flow rates across the queues. For
the ECN-marking probability of the low latency queue, the output of
the base AQM is multiplied by a coupling factor. This determines the
balance between the flow rates in each queue. The default setting
makes the flow rates roughly equal, which should be generally
applicable.

If DUALPI2 AQM has detected overload (due to excessive non-responsive
traffic in either queue), it will switch to signaling congestion
solely using drop, irrespective of the ECN field. Alternatively, it
can be configured to limit the drop probability and let the queue
grow and eventually overflow (like tail-drop).

GSO splitting in DUALPI2 is configurable from userspace while the
default behavior is to split gso. When running DUALPI2 at unshaped
10gigE with 4 download streams test, splitting gso apart results in
halving the latency with no loss in throughput:

Summary of tcp_4down run 'no_split_gso':
                         avg         median      # data pts
 Ping (ms) ICMP   :       0.53      0.30 ms         350
 TCP download avg :    2326.86       N/A Mbits/s    350
 TCP download sum :    9307.42       N/A Mbits/s    350
 TCP download::1  :    2672.99   2568.73 Mbits/s    350
 TCP download::2  :    2586.96   2570.51 Mbits/s    350
 TCP download::3  :    1786.26   1798.82 Mbits/s    350
 TCP download::4  :    2261.21   2309.49 Mbits/s    350

Summart of tcp_4down run 'split_gso':
                         avg          median      # data pts
 Ping (ms) ICMP   :       0.22      0.23 ms         350
 TCP download avg :    2335.02       N/A Mbits/s    350
 TCP download sum :    9340.09       N/A Mbits/s    350
 TCP download::1  :    2335.30   2334.22 Mbits/s    350
 TCP download::2  :    2334.72   2334.20 Mbits/s    350
 TCP download::3  :    2335.28   2334.58 Mbits/s    350
 TCP download::4  :    2334.79   2334.39 Mbits/s    350

A similar result is observed when running DUALPI2 at unshaped 1gigE
with 1 download stream test:

Summary of tcp_1down run 'no_split_gso':
                         avg         median      # data pts
 Ping (ms) ICMP :         1.13      1.25 ms         350
 TCP download   :       941.41    941.46 Mbits/s    350

Summart of tcp_1down run 'split_gso':
                         avg         median      # data pts
 Ping (ms) ICMP :         0.51      0.55 ms         350
 TCP download   :       941.41    941.45 Mbits/s    350

Additional details can be found in the draft:
  https://datatracker.ietf.org/doc/html/rfc9332

Signed-off-by: Koen De Schepper <koen.de_schepper@nokia-bell-labs.com>
Co-developed-by: Olga Albisser <olga@albisser.org>
Signed-off-by: Olga Albisser <olga@albisser.org>
Co-developed-by: Olivier Tilmans <olivier.tilmans@nokia.com>
Signed-off-by: Olivier Tilmans <olivier.tilmans@nokia.com>
Co-developed-by: Henrik Steen <henrist@henrist.net>
Signed-off-by: Henrik Steen <henrist@henrist.net>
Co-developed-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
Signed-off-by: Bob Briscoe <research@bobbriscoe.net>
Signed-off-by: Ilpo Järvinen <ij@kernel.org>
Acked-by: Dave Taht <dave.taht@gmail.com>
Link: https://patch.msgid.link/20250722095915.24485-4-chia-yu.chang@nokia-bell-labs.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-23 17:52:07 -07:00
Byungchul Park
9dfd871a3e libeth: xdp: access ->pp through netmem_desc instead of page
To eliminate the use of struct page in page pool, the page pool users
should use netmem descriptor and APIs instead.

Make xdp access ->pp through netmem_desc instead of page.

Signed-off-by: Byungchul Park <byungchul@sk.com>
Link: https://patch.msgid.link/20250721021835.63939-13-byungchul@sk.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-23 17:46:56 -07:00
Byungchul Park
89ade7c730 netmem, mlx4: access ->pp_ref_count through netmem_desc instead of page
To eliminate the use of struct page in page pool, the page pool users
should use netmem descriptor and APIs instead.

Make mlx4 access ->pp_ref_count through netmem_desc instead of page.

While at it, add a helper, pp_page_to_nmdesc() and __pp_page_to_nmdesc(),
that can be used to get netmem_desc from page only if it's a pp page.
For now that netmem_desc overlays on page, it can be achieved by just
casting, and use macro and _Generic to cover const casting as well.

Plus, change page_pool_page_is_pp() to check for 'const struct page *'
instead of 'struct page *' since it doesn't modify data and additionally
covers const type.

Signed-off-by: Byungchul Park <byungchul@sk.com>
Link: https://patch.msgid.link/20250721021835.63939-4-byungchul@sk.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-23 17:46:54 -07:00
Byungchul Park
38a436d4e2 netmem: use netmem_desc instead of page to access ->pp in __netmem_get_pp()
To eliminate the use of the page pool fields in struct page, the page
pool code should use netmem descriptor and APIs instead.

However, __netmem_get_pp() still accesses ->pp via struct page.  So
change it to use struct netmem_desc instead, since ->pp no longer will
be available in struct page.

While at it, add a helper, __netmem_to_nmdesc(), that can be used to
unsafely get pointer to netmem_desc backing the netmem_ref, only when
the netmem_ref is always backed by system memory.

Signed-off-by: Byungchul Park <byungchul@sk.com>
Link: https://patch.msgid.link/20250721021835.63939-3-byungchul@sk.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-23 17:46:54 -07:00
Byungchul Park
f3d85c9ee5 netmem: introduce struct netmem_desc mirroring struct page
To simplify struct page, the page pool members of struct page should be
moved to other, allowing these members to be removed from struct page.

Introduce a network memory descriptor to store the members, struct
netmem_desc, and make it union'ed with the existing fields in struct
net_iov, allowing to organize the fields of struct net_iov.

Signed-off-by: Byungchul Park <byungchul@sk.com>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Mina Almasry <almasrymina@google.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Harry Yoo <harry.yoo@oracle.com>
Link: https://patch.msgid.link/20250721021835.63939-2-byungchul@sk.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-23 17:46:54 -07:00
Yang Li
a7bcffc673 Bluetooth: Add PA_LINK to distinguish BIG sync and PA sync connections
Currently, BIS_LINK is used for both BIG sync and PA sync connections,
which makes it impossible to distinguish them when searching for a PA
sync connection.

Adding PA_LINK will make the distinction clearer and simplify future
extensions for PA-related features.

Signed-off-by: Yang Li <yang.li@amlogic.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2025-07-23 10:35:14 -04:00
Chris Down
0cadf8534f Bluetooth: hci_event: Mask data status from LE ext adv reports
The Event_Type field in an LE Extended Advertising Report uses bits 5
and 6 for data status (e.g. truncation or fragmentation), not the PDU
type itself.

The ext_evt_type_to_legacy() function fails to mask these status bits
before evaluation. This causes valid advertisements with status bits set
(e.g. a truncated non-connectable advertisement, which ends up showing
as PDU type 0x40) to be misclassified as unknown and subsequently
dropped. This is okay for most checks which use bitwise AND on the
relevant event type bits, but it doesn't work for non-connectable types,
which are checked with '== LE_EXT_ADV_NON_CONN_IND' (that is, zero).

In terms of behaviour, first the device sends a truncated report:

> HCI Event: LE Meta Event (0x3e) plen 26
      LE Extended Advertising Report (0x0d)
        Entry 0
          Event type: 0x0040
            Data status: Incomplete, data truncated, no more to come
          Address type: Random (0x01)
          Address: 1D:12:46:FA:F8:6E (Non-Resolvable)
          SID: 0x03
          RSSI: -98 dBm (0x9e)
          Data length: 0x00

Then, a few seconds later, it sends the subsequent complete report:

> HCI Event: LE Meta Event (0x3e) plen 122
      LE Extended Advertising Report (0x0d)
        Entry 0
          Event type: 0x0000
            Data status: Complete
          Address type: Random (0x01)
          Address: 1D:12:46:FA:F8:6E (Non-Resolvable)
          SID: 0x03
          RSSI: -97 dBm (0x9f)
          Data length: 0x60
          Service Data: Google (0xfef3)
            Data[92]: ...

These devices often send multiple truncated reports per second.

This patch introduces a PDU type mask to ensure only the relevant bits
are evaluated, allowing for the correct translation of all valid
extended advertising packets.

Fixes: b2cc9761f1 ("Bluetooth: Handle extended ADV PDU types")
Signed-off-by: Chris Down <chris@chrisdown.name>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2025-07-23 10:34:47 -04:00
Arseniy Krasnov
2935e55685 Bluetooth: hci_sync: fix double free in 'hci_discovery_filter_clear()'
Function 'hci_discovery_filter_clear()' frees 'uuids' array and then
sets it to NULL. There is a tiny chance of the following race:

'hci_cmd_sync_work()'

 'update_passive_scan_sync()'

   'hci_update_passive_scan_sync()'

     'hci_discovery_filter_clear()'
       kfree(uuids);

       <-------------------------preempted-------------------------------->
                                           'start_service_discovery()'

                                             'hci_discovery_filter_clear()'
                                               kfree(uuids); // DOUBLE FREE

       <-------------------------preempted-------------------------------->

      uuids = NULL;

To fix it let's add locking around 'kfree()' call and NULL pointer
assignment. Otherwise the following backtrace fires:

[ ] ------------[ cut here ]------------
[ ] kernel BUG at mm/slub.c:547!
[ ] Internal error: Oops - BUG: 00000000f2000800 [#1] PREEMPT SMP
[ ] CPU: 3 UID: 0 PID: 246 Comm: bluetoothd Tainted: G O 6.12.19-kernel #1
[ ] Tainted: [O]=OOT_MODULE
[ ] pstate: 60400005 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[ ] pc : __slab_free+0xf8/0x348
[ ] lr : __slab_free+0x48/0x348
...
[ ] Call trace:
[ ]  __slab_free+0xf8/0x348
[ ]  kfree+0x164/0x27c
[ ]  start_service_discovery+0x1d0/0x2c0
[ ]  hci_sock_sendmsg+0x518/0x924
[ ]  __sock_sendmsg+0x54/0x60
[ ]  sock_write_iter+0x98/0xf8
[ ]  do_iter_readv_writev+0xe4/0x1c8
[ ]  vfs_writev+0x128/0x2b0
[ ]  do_writev+0xfc/0x118
[ ]  __arm64_sys_writev+0x20/0x2c
[ ]  invoke_syscall+0x68/0xf0
[ ]  el0_svc_common.constprop.0+0x40/0xe0
[ ]  do_el0_svc+0x1c/0x28
[ ]  el0_svc+0x30/0xd0
[ ]  el0t_64_sync_handler+0x100/0x12c
[ ]  el0t_64_sync+0x194/0x198
[ ] Code: 8b0002e6 eb17031f 54fffbe1 d503201f (d4210000)
[ ] ---[ end trace 0000000000000000 ]---

Fixes: ad383c2c65 ("Bluetooth: hci_sync: Enable advertising when LL privacy is enabled")
Signed-off-by: Arseniy Krasnov <avkrasnov@salutedevices.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2025-07-23 10:33:30 -04:00
Pauli Virtanen
7565bc5659 Bluetooth: ISO: add socket option to report packet seqnum via CMSG
User applications need a way to track which ISO interval a given SDU
belongs to, to properly detect packet loss. All controllers do not set
timestamps, and it's not guaranteed user application receives all packet
reports (small socket buffer, or controller doesn't send all reports
like Intel AX210 is doing).

Add socket option BT_PKT_SEQNUM that enables reporting of received
packet ISO sequence number in BT_SCM_PKT_SEQNUM CMSG.

Use BT_PKT_SEQNUM == 22 for the socket option, as 21 was used earlier
for a removed experimental feature that never got into mainline.

Signed-off-by: Pauli Virtanen <pav@iki.fi>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2025-07-23 10:31:19 -04:00
Yang Li
be31d11ec9 Bluetooth: Fix spelling mistakes
Correct the misspelling of “estabilished” in the code.

Signed-off-by: Yang Li <yang.li@amlogic.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2025-07-23 10:27:33 -04:00
Yang Li
b2a5f2e1c1 Bluetooth: hci_event: Add support for handling LE BIG Sync Lost event
When the BIS source stops, the controller sends an LE BIG Sync Lost
event (subevent 0x1E). Currently, this event is not handled, causing
the BIS stream to remain active in BlueZ and preventing recovery.

Signed-off-by: Yang Li <yang.li@amlogic.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2025-07-23 10:27:04 -04:00
Yue Haibing
70c672f933 Bluetooth: Remove hci_conn_hash_lookup_state()
Since commit 4aa42119d9 ("Bluetooth: Remove pending ACL connection
attempts") this function is unused.

Signed-off-by: Yue Haibing <yuehaibing@huawei.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2025-07-23 10:24:41 -04:00
Miri Korenblit
69fdb08435 wifi: mac80211: don't require cipher and keylen in gtk rekey
ieee80211_add_gtk_rekey receives a keyconf as an argument, and the
cipher and keylen are taken from there to the new allocated key.
But in rekey, both the cipher and the keylen should be the same as of
the old key, so let ieee80211_add_gtk_rekey find those, so drivers won't
have to fill it in.

Reviewed-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
Link: https://patch.msgid.link/20250721214922.3c5c023bfae9.Ie6594ae2b4b6d5b3d536e642b349046ebfce7a5d@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-07-22 10:43:19 +02:00
Michael-CY Lee
84b62b72b4 wifi: cfg80211/mac80211: report link ID for unexpected frames
The upper layer may require the link ID to properly handle
unexpected frames. For instance, if hostapd, operating as an
AP MLD, receives a data frame from a non-associated STA,
it must send deauthentication to the link on which the STA is
operating.

Signed-off-by: Michael-CY Lee <michael-cy.lee@mediatek.com>
Reviewed-by: Money Wang <money.wang@mediatek.com>
Link: https://patch.msgid.link/20250721065159.1740992-1-michael-cy.lee@mediatek.com
[edit commit message]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-07-21 19:40:59 +02:00
Miri Korenblit
460114eae8 wifi: mac80211: remove ieee80211_remove_key
It is no longer used, remove it.

Reviewed-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
Link: https://patch.msgid.link/20250721091956.e964ceacd85c.Idecab8ef161fa58e000b3969bc936399284b79f0@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-07-21 19:35:17 +02:00
Jesper Dangaard Brouer
a6f190630d net: track pfmemalloc drops via SKB_DROP_REASON_PFMEMALLOC
Add a new SKB drop reason (SKB_DROP_REASON_PFMEMALLOC) to track packets
dropped due to memory pressure. In production environments, we've observed
memory exhaustion reported by memory layer stack traces, but these drops
were not properly tracked in the SKB drop reason infrastructure.

While most network code paths now properly report pfmemalloc drops, some
protocol-specific socket implementations still use sk_filter() without
drop reason tracking:
- Bluetooth L2CAP sockets
- CAIF sockets
- IUCV sockets
- Netlink sockets
- SCTP sockets
- Unix domain sockets

These remaining cases represent less common paths and could be converted
in a follow-up patch if needed. The current implementation provides
significantly improved observability into memory pressure events in the
network stack, especially for key protocols like TCP and UDP, helping to
diagnose problems in production environments.

Reported-by: Matt Fleming <mfleming@cloudflare.com>
Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>
Link: https://patch.msgid.link/175268316579.2407873.11634752355644843509.stgit@firesoul
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-18 16:59:05 -07:00
Alexei Starovoitov
beb1097ec8 Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf after rc6
Cross-merge BPF and other fixes after downstream PR.

No conflicts.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-07-18 12:15:59 -07:00
Lachlan Hodges
bbf93a06d7 wifi: mac80211: support initialising an S1G short beaconing BSS
Introduce the ability to parse the short beacon data and long
beacon period. The long beacon period represents the number of beacon
intervals between each long beacon transmission. Additionally,
as a BSS cannot change its configuration such that short beaconing
is dynamically disabled/enabled without tearing down the interface
- we ensure we have an existing short beacon before performing
the update.

Signed-off-by: Lachlan Hodges <lachlan.hodges@morsemicro.com>
Link: https://patch.msgid.link/20250717074205.312577-3-lachlan.hodges@morsemicro.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-07-18 14:14:43 +02:00
Lachlan Hodges
6624a0af82 wifi: cfg80211: support configuring an S1G short beaconing BSS
S1G short beacons are an optional frame type used in an S1G BSS
that contain a limited set of elements. While they are optional,
they are a fundamental part of S1G that enables significant
power saving.

Expose 2 additional netlink attributes,
NL80211_ATTR_S1G_LONG_BEACON_PERIOD which denotes the number of beacon
intervals between each long beacon and NL80211_ATTR_S1G_SHORT_BEACON
which is a nested attribute containing the short beacon tail and
head. We split them as the long beacon period cannot be updated,
and is only used when initialisng the interface, whereas the short
beacon data can be used to both initialise and update the templates.
This follows how things such as the beacon interval and DTIM period
currently operate.

During the initialisation path, we ensure we have the long beacon
period if the short beacon data is being passed down, whereas
the update path will simply update the template if its sent down.

The short beacon data is validated using the same routines for regular
beacons as they support correctly parsing the short beacon format
while ensuring the frame is well-formed.

Signed-off-by: Lachlan Hodges <lachlan.hodges@morsemicro.com>
Link: https://patch.msgid.link/20250717074205.312577-2-lachlan.hodges@morsemicro.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-07-18 14:14:43 +02:00
Kuniyuki Iwashima
dc2a27e524 neighbour: Update pneigh_entry in pneigh_create().
neigh_add() updates pneigh_entry() found or created by pneigh_create().

This update is serialised by RTNL, but we will remove it.

Let's move the update part to pneigh_create() and make it return errno
instead of a pointer of pneigh_entry.

Now, the pneigh code is RTNL free.

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250716221221.442239-16-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-17 16:25:22 -07:00
Kuniyuki Iwashima
13a936bb99 neighbour: Protect tbl->phash_buckets[] with a dedicated mutex.
tbl->phash_buckets[] is only modified in the slow path by pneigh_create()
and pneigh_delete() under the table lock.

Both of them are called under RTNL, so no extra lock is needed, but we
will remove RTNL from the paths.

pneigh_create() looks up a pneigh_entry, and this part can be lockless,
but it would complicate the logic like

  1. lookup
  2. allocate pengih_entry for GFP_KERNEL
  3. lookup again but under lock
  4. if found, return it after freeing the allocated memory
  5. else, return the new one

Instead, let's add a per-table mutex and run lookup and allocation
under it.

Note that updating pneigh_entry part in neigh_add() is still protected
by RTNL and will be moved to pneigh_create() in the next patch.

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250716221221.442239-15-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-17 16:25:21 -07:00
Kuniyuki Iwashima
dd103c9a53 neighbour: Remove __pneigh_lookup().
__pneigh_lookup() is the lockless version of pneigh_lookup(),
but its only caller pndisc_is_router() holds the table lock and
reads pneigh_netry.flags.

This is because accessing pneigh_entry after pneigh_lookup() was
illegal unless the caller holds RTNL or the table lock.

Now, pneigh_entry is guaranteed to be alive during the RCU critical
section.

Let's call pneigh_lookup() and use READ_ONCE() for n->flags in
pndisc_is_router() and remove __pneigh_lookup().

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250716221221.442239-13-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-17 16:25:21 -07:00
Kuniyuki Iwashima
d539d8fbd8 neighbour: Free pneigh_entry after RCU grace period.
We will convert RTM_GETNEIGH to RCU.

neigh_get() looks up pneigh_entry by pneigh_lookup() and passes
it to pneigh_fill_info().

Then, we must ensure that the entry is alive till pneigh_fill_info()
completes, but read_lock_bh(&tbl->lock) in pneigh_lookup() does not
guarantee that.

Also, we will convert all readers of tbl->phash_buckets[] to RCU.

Let's use call_rcu() to free pneigh_entry and update phash_buckets[]
and ->next by rcu_assign_pointer().

pneigh_ifdown_and_unlock() uses list_head to avoid overwriting
->next and moving RCU iterators to another list.

pndisc_destructor() (only IPv6 ndisc uses this) uses a mutex, so it
is not delayed to call_rcu(), where we cannot sleep.  This is fine
because the mcast code works with RCU and ipv6_dev_mc_dec() frees
mcast objects after RCU grace period.

While at it, we change the return type of pneigh_ifdown_and_unlock()
to void.

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250716221221.442239-8-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-17 16:25:20 -07:00
Kuniyuki Iwashima
d63382aea7 neighbour: Annotate neigh_table.phash_buckets and pneigh_entry.next with __rcu.
The next patch will free pneigh_entry with call_rcu().

Then, we need to annotate neigh_table.phash_buckets[] and
pneigh_entry.next with __rcu.

To make the next patch cleaner, let's annotate the fields in advance.

Currently, all accesses to the fields are under the neigh table lock,
so rcu_dereference_protected() is used with 1 for now, but most of them
(except in pneigh_delete() and pneigh_ifdown_and_unlock()) will be
replaced with rcu_dereference() and rcu_dereference_check().

Note that pneigh_ifdown_and_unlock() changes pneigh_entry.next to a
local list, which is illegal because the RCU iterator could be moved
to another list.  This part will be fixed in the next patch.

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250716221221.442239-7-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-17 16:25:20 -07:00
Kuniyuki Iwashima
e804bd83c1 neighbour: Split pneigh_lookup().
pneigh_lookup() has ASSERT_RTNL() in the middle of the function, which
is confusing.

When called with the last argument, creat, 0, pneigh_lookup() literally
looks up a proxy neighbour entry.  This is the case of the reader path
as the fast path and RTM_GETNEIGH.

pneigh_lookup(), however, creates a pneigh_entry when called with creat 1
from RTM_NEWNEIGH and SIOCSARP, which require RTNL.

Let's split pneigh_lookup() into two functions.

We will convert all the reader paths to RCU, and read_lock_bh(&tbl->lock)
in the new pneigh_lookup() will be dropped.

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250716221221.442239-6-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-17 16:25:20 -07:00
Jakub Kicinski
af2d6148d2 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR (net-6.16-rc7).

Conflicts:

Documentation/netlink/specs/ovpn.yaml
  880d43ca9a ("netlink: specs: clean up spaces in brackets")
  af52020fc5 ("ovpn: reject unexpected netlink attributes")

drivers/net/phy/phy_device.c
  a44312d58e ("net: phy: Don't register LEDs for genphy")
  f0f2b992d8 ("net: phy: Don't register LEDs for genphy")
https://lore.kernel.org/20250710114926.7ec3a64f@kernel.org

drivers/net/wireless/intel/iwlwifi/fw/regulatory.c
drivers/net/wireless/intel/iwlwifi/mld/regulatory.c
  5fde0fcbd7 ("wifi: iwlwifi: mask reserved bits in chan_state_active_bitmap")
  ea045a0de3 ("wifi: iwlwifi: add support for accepting raw DSM tables by firmware")

net/ipv6/mcast.c
  ae3264a25a ("ipv6: mcast: Delay put pmc->idev in mld_del_delrec()")
  a8594c956c ("ipv6: mcast: Avoid a duplicate pointer check in mld_del_delrec()")
https://lore.kernel.org/8cc52891-3653-4b03-a45e-05464fe495cf@kernel.org

No adjacent changes.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-17 11:00:33 -07:00
Jakub Kicinski
a2bbaff681 bluetooth pull request for net:
- hci_sync: fix connectable extended advertising when using static random address
  - hci_core: fix typos in macros
  - hci_core: add missing braces when using macro parameters
  - hci_core: replace 'quirks' integer by 'quirk_flags' bitmap
  - SMP: If an unallowed command is received consider it a failure
  - SMP: Fix using HCI_ERROR_REMOTE_USER_TERM on timeout
  - L2CAP: Fix null-ptr-deref in l2cap_sock_resume_cb()
  - L2CAP: Fix attempting to adjust outgoing MTU
  - btintel: Check if controller is ISO capable on btintel_classify_pkt_type
  - btusb: QCA: Fix downloading wrong NVM for WCN6855 GF variant without board ID
 -----BEGIN PGP SIGNATURE-----
 
 iQJNBAABCgA3FiEE7E6oRXp8w05ovYr/9JCA4xAyCykFAmh5CF4ZHGx1aXoudm9u
 LmRlbnR6QGludGVsLmNvbQAKCRD0kIDjEDILKeeYD/9oqrpfAnF+ZYakvt+W+bJx
 KPBWlXlgVSnPbK9qJPWm8AUaEOz6yyGd728S0QYZ+y5map5TVMWE0n1BYfauUgch
 GmUS/Li44qRVi9ygxS3CqiXHoVFtRiMJd3kx5v3SH8LHUQZakcNsFg4DCfQAufZe
 uJI+/+vccBx8rF+WR3mlhziE0bUosHOLAkqujnuKg/EpVO4xc4zeG6AKK5ihVHgQ
 1SlPp/s6BYz1VcMj9HMsEk6z4iY8WF5bdN1YdzkvRziTYYuFMDpJwI83FtkmmrsG
 v59GwlPMsGNlz25KbapzqGgflydeXSKbigTJQr7LHAaKv4jmqnnAeCOlhkvFuFBm
 snb2Zkkw16w+s/DBQvriBy6D+yiaSwKkZUjNWwGTvyDqAna6Kx44jzT1QpgOSm2p
 d+rxjrNXRjT59wiIo1JsOXpK5Mbbyz5QGXge/RbUO36glh/J2Vs44F1HueZHwSSw
 GGt0jmRTjB8/icbcvnkMVgwnoQEul7bsV95fPOq6CGSuRxYIX7uFXWMM/Wb/1SN7
 QWQyN/P7z5XpZMWFH3SDVx/FhN6G5Pi17OkvaLVSwfKs7jK45gb10Oi6cdypL2rc
 Ed6EkBOIL0ETAqDd4NLGTOeHEpJ3zfxxWqlu5cYrUf4qj7vtXk39ylNmXNNA/4Ci
 lU4vQdGAHYX3BIhaRdqryQ==
 =JuS2
 -----END PGP SIGNATURE-----

Merge tag 'for-net-2025-07-17' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth

Luiz Augusto von Dentz says:

====================
bluetooth pull request for net:

 - hci_sync: fix connectable extended advertising when using static random address
 - hci_core: fix typos in macros
 - hci_core: add missing braces when using macro parameters
 - hci_core: replace 'quirks' integer by 'quirk_flags' bitmap
 - SMP: If an unallowed command is received consider it a failure
 - SMP: Fix using HCI_ERROR_REMOTE_USER_TERM on timeout
 - L2CAP: Fix null-ptr-deref in l2cap_sock_resume_cb()
 - L2CAP: Fix attempting to adjust outgoing MTU
 - btintel: Check if controller is ISO capable on btintel_classify_pkt_type
 - btusb: QCA: Fix downloading wrong NVM for WCN6855 GF variant without board ID

* tag 'for-net-2025-07-17' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth:
  Bluetooth: L2CAP: Fix attempting to adjust outgoing MTU
  Bluetooth: btusb: QCA: Fix downloading wrong NVM for WCN6855 GF variant without board ID
  Bluetooth: hci_dev: replace 'quirks' integer by 'quirk_flags' bitmap
  Bluetooth: hci_core: add missing braces when using macro parameters
  Bluetooth: hci_core: fix typos in macros
  Bluetooth: SMP: Fix using HCI_ERROR_REMOTE_USER_TERM on timeout
  Bluetooth: SMP: If an unallowed command is received consider it a failure
  Bluetooth: btintel: Check if controller is ISO capable on btintel_classify_pkt_type
  Bluetooth: hci_sync: fix connectable extended advertising when using static random address
  Bluetooth: Fix null-ptr-deref in l2cap_sock_resume_cb()
====================

Link: https://patch.msgid.link/20250717142849.537425-1-luiz.dentz@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-17 07:54:49 -07:00
Paolo Abeni
44eb62e1ea Another set of changes, notably:
- cfg80211: fix double-free introduced earlier
  - mac80211: fix RCU iteration in CSA
  - iwlwifi: many cleanups (unused FW APIs, PCIe code, WoWLAN)
  - mac80211: some work around how FIPS affects wifi, which was
              wrong (RC4 is used by TKIP, not only WEP)
  - cfg/mac80211: improvements for unsolicated probe response
                  handling
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEpeA8sTs3M8SN2hR410qiO8sPaAAFAmh4xE8ACgkQ10qiO8sP
 aAANTA/9HWfjfXTvZ3lEZqmOeJdHxd6xw3uHA732NF+KxoKi8vnO8V63WNVxBzCF
 6srSaSi0IkdtnyR/uPqG5AkWej/ouAeDy+FtzoDpP8O6G4e9yjKZBRnvP+O/gkod
 ERM4ouLWsumAIZqSqrSAJHgNTOP0/n5MNiLoKHlcjZmXphL2/lM8Qay7kVmWG+vT
 tqv/LrwqRiTe/a+O9PKaW2sq6z23CBhEqWf6hhFCaEPVQXQJLBc6FTI4008zP/Rt
 aSq672rMwA8ckuVOH0EAX544DMJc736to9XJEuu9EzuOaDZ9FwfqF9nOTMnFDev2
 k9qn8Q5vsx+786A6pXw+sarnxhUxnQCYB00MhMe1e1yoB0lv57hZCD1Z0JTBGXPh
 8ghmbelYBvkLollie6zUUcdf3G98ADUYvHsR0elBgGADWPHvnbn3ORVcv852X6jS
 El1CR3lyvwwHn9USKaR91ESQWFRtLtPG+OmmLRLIG7wdkp0sDjHOpvV7A/7QpFLw
 A18QrN+ORNL3Pja4A4dyIpDRQiTKHIjEEm3dAlVDzSWcPpcws5WlfreC1L71XblA
 kRl+1I7SZ87rp8d6TXXxc8QmIMA+M4OL6M4FMAJgjn8VdyjlaC6lvl1i+tgdaDik
 ISrY/78gnnt93BQmHU4P/syvh+NdpgNN+LgeaikKb1kmjG837Yk=
 =g8VQ
 -----END PGP SIGNATURE-----

Merge tag 'wireless-next-2025-07-17' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next

Johannes Berg says:

====================
Another set of changes, notably:
 - cfg80211: fix double-free introduced earlier
 - mac80211: fix RCU iteration in CSA
 - iwlwifi: many cleanups (unused FW APIs, PCIe code, WoWLAN)
 - mac80211: some work around how FIPS affects wifi, which was
             wrong (RC4 is used by TKIP, not only WEP)
 - cfg/mac80211: improvements for unsolicated probe response
                 handling

* tag 'wireless-next-2025-07-17' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next: (64 commits)
  wifi: cfg80211: fix double free for link_sinfo in nl80211_station_dump()
  wifi: cfg80211: fix off channel operation allowed check for MLO
  wifi: mac80211: use RCU-safe iteration in ieee80211_csa_finish
  wifi: mac80211_hwsim: Update comments in header
  wifi: mac80211: parse unsolicited broadcast probe response data
  wifi: cfg80211: parse attribute to update unsolicited probe response template
  wifi: mac80211: don't use TPE data from assoc response
  wifi: mac80211: handle WLAN_HT_ACTION_NOTIFY_CHANWIDTH async
  wifi: mac80211: simplify __ieee80211_rx_h_amsdu() loop
  wifi: mac80211: don't mark keys for inactive links as uploaded
  wifi: mac80211: only assign chanctx in reconfig
  wifi: mac80211_hwsim: Declare support for AP scanning
  wifi: mac80211: clean up cipher suite handling
  wifi: mac80211: don't send keys to driver when fips_enabled
  wifi: cfg80211: Fix interface type validation
  wifi: mac80211: remove ieee80211_link_unreserve_chanctx() return value
  wifi: mac80211: don't unreserve never reserved chanctx
  mwl8k: Add missing check after DMA map
  wifi: mac80211: make VHT opmode NSS ignore a debug message
  wifi: iwlwifi: remove support of several iwl_ppag_table_cmd versions
  ...
====================

Link: https://patch.msgid.link/20250717094610.20106-47-johannes@sipsolutions.net
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-07-17 15:00:41 +02:00
Paolo Abeni
e49f95dc8c Couple of fixes:
- ath12k performance regression from -rc1
  - cfg80211 counted_by() removal for scan request
    as it doesn't match usage and keeps complaining
  - iwlwifi crash with certain older devices
  - iwlwifi missing an error path unlock
  - iwlwifi compatibility with certain BIOS updates
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEpeA8sTs3M8SN2hR410qiO8sPaAAFAmh4vwsACgkQ10qiO8sP
 aAAv+w/9Fs51/x4KVnUhAK7+WtOjvFAuzQkCsVlMZplO8WnXOzUPWfZ+sOiBrMbR
 ws55b4SDnO82LoeLq4MBxAMJG7P+aWK9hNbMzA6TENaWCgxvTYNEQBijGjplx8pt
 0r2lpn3OP3ALluV09UnCVTVIWFL3pq/jCsGEGHVhjpRUJz9RSwmBrHFM6V7yqrzl
 VIbN8k5U90L1gwQ0wf2RI/KMZwJ85zCVViYSx1wNEpfwDH0pnoy5V1bMCNaCvReh
 Gca8XCmCKYP2/NSQcsoyqnz0RgjKDAsN1a8s/S/HHFea62iOpUctSg8rgc8bLfJn
 bbuwuSx40Ol70kPnSii35ImTO6PRSWXpibOlA3xtjPoLprX99e7xrdm6nSbf03It
 VnQtfhB5fuoBHQuPxLUeLDVl8dT4r4zbS1JQw7feUCP1Rs3IGGDLtBQHZQNE0VIr
 eFkWSa4SHv5mOFrPSuyAsSviRn6O8/Ni84/hXnAf9Lh1hzfPFG4pQ+gzBNf1ZoZI
 lrgNvj1xoDothTb3DzgEQwO5oxd3zFHdtOD/4DN8wH0pSOTm4obQm23b43F6t+CC
 q2E8eD4BurfnBuL1zlLseo6JbSFC7LOsustgarB+q2XSPoTCmJqaOfR24Kq9ZjjN
 OBE+uvTlg+/l2UjZ5lp/07XE3LSy6bWAvsl4Yy02biOiyZ0GuXU=
 =yzt3
 -----END PGP SIGNATURE-----

Merge tag 'wireless-2025-07-17' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless

Johannes Berg says:

====================
Couple of fixes:
 - ath12k performance regression from -rc1
 - cfg80211 counted_by() removal for scan request
   as it doesn't match usage and keeps complaining
 - iwlwifi crash with certain older devices
 - iwlwifi missing an error path unlock
 - iwlwifi compatibility with certain BIOS updates

* tag 'wireless-2025-07-17' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless:
  wifi: iwlwifi: Fix botched indexing conversion
  wifi: cfg80211: remove scan request n_channels counted_by
  wifi: ath12k: Fix packets received in WBM error ring with REO LUT enabled
  wifi: iwlwifi: mask reserved bits in chan_state_active_bitmap
  wifi: iwlwifi: pcie: fix locking on invalid TOP reset
====================

Link: https://patch.msgid.link/20250717091831.18787-5-johannes@sipsolutions.net
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-07-17 14:52:42 +02:00
Florian Westphal
2d72afb340 netfilter: nf_conntrack: fix crash due to removal of uninitialised entry
A crash in conntrack was reported while trying to unlink the conntrack
entry from the hash bucket list:
    [exception RIP: __nf_ct_delete_from_lists+172]
    [..]
 #7 [ff539b5a2b043aa0] nf_ct_delete at ffffffffc124d421 [nf_conntrack]
 #8 [ff539b5a2b043ad0] nf_ct_gc_expired at ffffffffc124d999 [nf_conntrack]
 #9 [ff539b5a2b043ae0] __nf_conntrack_find_get at ffffffffc124efbc [nf_conntrack]
    [..]

The nf_conn struct is marked as allocated from slab but appears to be in
a partially initialised state:

 ct hlist pointer is garbage; looks like the ct hash value
 (hence crash).
 ct->status is equal to IPS_CONFIRMED|IPS_DYING, which is expected
 ct->timeout is 30000 (=30s), which is unexpected.

Everything else looks like normal udp conntrack entry.  If we ignore
ct->status and pretend its 0, the entry matches those that are newly
allocated but not yet inserted into the hash:
  - ct hlist pointers are overloaded and store/cache the raw tuple hash
  - ct->timeout matches the relative time expected for a new udp flow
    rather than the absolute 'jiffies' value.

If it were not for the presence of IPS_CONFIRMED,
__nf_conntrack_find_get() would have skipped the entry.

Theory is that we did hit following race:

cpu x 			cpu y			cpu z
 found entry E		found entry E
 E is expired		<preemption>
 nf_ct_delete()
 return E to rcu slab
					init_conntrack
					E is re-inited,
					ct->status set to 0
					reply tuplehash hnnode.pprev
					stores hash value.

cpu y found E right before it was deleted on cpu x.
E is now re-inited on cpu z.  cpu y was preempted before
checking for expiry and/or confirm bit.

					->refcnt set to 1
					E now owned by skb
					->timeout set to 30000

If cpu y were to resume now, it would observe E as
expired but would skip E due to missing CONFIRMED bit.

					nf_conntrack_confirm gets called
					sets: ct->status |= CONFIRMED
					This is wrong: E is not yet added
					to hashtable.

cpu y resumes, it observes E as expired but CONFIRMED:
			<resumes>
			nf_ct_expired()
			 -> yes (ct->timeout is 30s)
			confirmed bit set.

cpu y will try to delete E from the hashtable:
			nf_ct_delete() -> set DYING bit
			__nf_ct_delete_from_lists

Even this scenario doesn't guarantee a crash:
cpu z still holds the table bucket lock(s) so y blocks:

			wait for spinlock held by z

					CONFIRMED is set but there is no
					guarantee ct will be added to hash:
					"chaintoolong" or "clash resolution"
					logic both skip the insert step.
					reply hnnode.pprev still stores the
					hash value.

					unlocks spinlock
					return NF_DROP
			<unblocks, then
			 crashes on hlist_nulls_del_rcu pprev>

In case CPU z does insert the entry into the hashtable, cpu y will unlink
E again right away but no crash occurs.

Without 'cpu y' race, 'garbage' hlist is of no consequence:
ct refcnt remains at 1, eventually skb will be free'd and E gets
destroyed via: nf_conntrack_put -> nf_conntrack_destroy -> nf_ct_destroy.

To resolve this, move the IPS_CONFIRMED assignment after the table
insertion but before the unlock.

Pablo points out that the confirm-bit-store could be reordered to happen
before hlist add resp. the timeout fixup, so switch to set_bit and
before_atomic memory barrier to prevent this.

It doesn't matter if other CPUs can observe a newly inserted entry right
before the CONFIRMED bit was set:

Such event cannot be distinguished from above "E is the old incarnation"
case: the entry will be skipped.

Also change nf_ct_should_gc() to first check the confirmed bit.

The gc sequence is:
 1. Check if entry has expired, if not skip to next entry
 2. Obtain a reference to the expired entry.
 3. Call nf_ct_should_gc() to double-check step 1.

nf_ct_should_gc() is thus called only for entries that already failed an
expiry check. After this patch, once the confirmed bit check passes
ct->timeout has been altered to reflect the absolute 'best before' date
instead of a relative time.  Step 3 will therefore not remove the entry.

Without this change to nf_ct_should_gc() we could still get this sequence:

 1. Check if entry has expired.
 2. Obtain a reference.
 3. Call nf_ct_should_gc() to double-check step 1:
    4 - entry is still observed as expired
    5 - meanwhile, ct->timeout is corrected to absolute value on other CPU
      and confirm bit gets set
    6 - confirm bit is seen
    7 - valid entry is removed again

First do check 6), then 4) so the gc expiry check always picks up either
confirmed bit unset (entry gets skipped) or expiry re-check failure for
re-inited conntrack objects.

This change cannot be backported to releases before 5.19. Without
commit 8a75a2c174 ("netfilter: conntrack: remove unconfirmed list")
|= IPS_CONFIRMED line cannot be moved without further changes.

Cc: Razvan Cojocaru <rzvncj@gmail.com>
Link: https://lore.kernel.org/netfilter-devel/20250627142758.25664-1-fw@strlen.de/
Link: https://lore.kernel.org/netfilter-devel/4239da15-83ff-4ca4-939d-faef283471bb@gmail.com/
Fixes: 1397af5bfd ("netfilter: conntrack: remove the percpu dying list")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2025-07-17 11:23:33 +02:00
Christian Eggers
6851a0c228 Bluetooth: hci_dev: replace 'quirks' integer by 'quirk_flags' bitmap
The 'quirks' member already ran out of bits on some platforms some time
ago. Replace the integer member by a bitmap in order to have enough bits
in future. Replace raw bit operations by accessor macros.

Fixes: ff26b2dd65 ("Bluetooth: Add quirk for broken READ_VOICE_SETTING")
Fixes: 127881334e ("Bluetooth: Add quirk for broken READ_PAGE_SCAN_TYPE")
Suggested-by: Pauli Virtanen <pav@iki.fi>
Tested-by: Ivan Pravdin <ipravdin.official@gmail.com>
Signed-off-by: Kiran K <kiran.k@intel.com>
Signed-off-by: Christian Eggers <ceggers@arri.de>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2025-07-16 15:37:53 -04:00
Christian Eggers
cdee6a4416 Bluetooth: hci_core: add missing braces when using macro parameters
Macro parameters should always be put into braces when accessing it.

Fixes: 4fc9857ab8 ("Bluetooth: hci_sync: Add check simultaneous roles support")
Signed-off-by: Christian Eggers <ceggers@arri.de>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2025-07-16 15:37:37 -04:00
Christian Eggers
dfef8d87a0 Bluetooth: hci_core: fix typos in macros
The provided macro parameter is named 'dev' (rather than 'hdev', which
may be a variable on the stack where the macro is used).

Fixes: a9a830a676 ("Bluetooth: hci_event: Fix sending HCI_OP_READ_ENC_KEY_SIZE")
Fixes: 6126ffabba ("Bluetooth: Introduce HCI_CONN_FLAG_DEVICE_PRIVACY device flag")
Signed-off-by: Christian Eggers <ceggers@arri.de>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2025-07-16 15:34:06 -04:00
Matt Johnston
3549eb08e5 net: mctp: Allow limiting binds to a peer address
Prior to calling bind() a program may call connect() on a socket to
restrict to a remote peer address.

Using connect() is the normal mechanism to specify a remote network
peer, so we use that here. In MCTP connect() is only used for bound
sockets - send() is not available for MCTP since a tag must be provided
for each message.

The smctp_type must match between connect() and bind() calls.

Signed-off-by: Matt Johnston <matt@codeconstruct.com.au>
Link: https://patch.msgid.link/20250710-mctp-bind-v4-6-8ec2f6460c56@codeconstruct.com.au
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-07-15 12:08:39 +02:00
Matt Johnston
1aeed732f4 net: mctp: Use hashtable for binds
Ensure that a specific EID (remote or local) bind will match in
preference to a MCTP_ADDR_ANY bind.

This adds infrastructure for binding a socket to receive messages from a
specific remote peer address, a future commit will expose an API for
this.

Signed-off-by: Matt Johnston <matt@codeconstruct.com.au>
Link: https://patch.msgid.link/20250710-mctp-bind-v4-5-8ec2f6460c56@codeconstruct.com.au
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-07-15 12:08:39 +02:00
Yuvarani V
c932be7262 wifi: cfg80211: parse attribute to update unsolicited probe response template
At present, the updated unsolicited broadcast probe response template is
not processed during userspace commands such as channel switch or color
change. This leads to an issue where older incorrect unsolicited probe
response is still used during these events.

Add support to parse the netlink attribute and store it so that
mac80211/drivers can use it to set the BSS_CHANGED_UNSOL_BCAST_PROBE_RESP
flag in order to send the updated unsolicited broadcast probe response
templates during these events.

Signed-off-by: Yuvarani V <quic_yuvarani@quicinc.com>
Signed-off-by: Aditya Kumar Singh <aditya.kumar.singh@oss.qualcomm.com>
Link: https://patch.msgid.link/20250710-update_unsol_bcast_probe_resp-v2-1-31aca39d3b30@oss.qualcomm.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-07-15 11:01:21 +02:00
Ilan Peer
14450be233 wifi: cfg80211: Fix interface type validation
Fix a condition that verified valid values of interface types.

Signed-off-by: Ilan Peer <ilan.peer@intel.com>
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
Link: https://patch.msgid.link/20250709233537.7ad199ca5939.I0ac1ff74798bf59a87a57f2e18f2153c308b119b@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-07-15 11:00:38 +02:00
Johannes Berg
444020f4bf wifi: cfg80211: remove scan request n_channels counted_by
This reverts commit e3eac9f32e ("wifi: cfg80211: Annotate struct
cfg80211_scan_request with __counted_by").

This really has been a completely failed experiment. There were
no actual bugs found, and yet at this point we already have four
"fixes" to it, with nothing to show for but code churn, and it
never even made the code any safer.

In all of the cases that ended up getting "fixed", the structure
is also internally inconsistent after the n_channels setting as
the channel list isn't actually filled yet. You cannot scan with
such a structure, that's just wrong. In mac80211, the struct is
also reused multiple times, so initializing it once is no good.

Some previous "fixes" (e.g. one in brcm80211) are also just setting
n_channels before accessing the array, under the assumption that the
code is correct and the array can be accessed, further showing that
the whole thing is just pointless when the allocation count and use
count are not separate.

If we really wanted to fix it, we'd need to separately track the
number of channels allocated and the number of channels currently
used, but given that no bugs were found despite the numerous syzbot
reports, that'd just be a waste of time.

Remove the __counted_by() annotation. We really should also remove
a number of the n_channels settings that are setting up a structure
that's inconsistent, but that can wait.

Reported-by: syzbot+e834e757bd9b3d3e1251@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=e834e757bd9b3d3e1251
Fixes: e3eac9f32e ("wifi: cfg80211: Annotate struct cfg80211_scan_request with __counted_by")
Link: https://patch.msgid.link/20250714142130.9b0bbb7e1f07.I09112ccde72d445e11348fc2bef68942cb2ffc94@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-07-15 10:27:21 +02:00
Eric Dumazet
75dff0584c tcp: add const to tcp_try_rmem_schedule() and sk_rmem_schedule() skb
These functions to not modify the skb, add a const qualifier.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250711114006.480026-7-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-14 18:41:43 -07:00
Eric Dumazet
6c758062c6 tcp: add LINUX_MIB_BEYOND_WINDOW
Add a new SNMP MIB : LINUX_MIB_BEYOND_WINDOW

Incremented when an incoming packet is received beyond the
receiver window.

nstat -az | grep TcpExtBeyondWindow

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250711114006.480026-3-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-14 18:41:42 -07:00
Eric Dumazet
9ca48d616e tcp: do not accept packets beyond window
Currently, TCP accepts incoming packets which might go beyond the
offered RWIN.

Add to tcp_sequence() the validation of packet end sequence.

Add the corresponding check in the fast path.

We relax this new constraint if the receive queue is empty,
to not freeze flows from buggy peers.

Add a new drop reason : SKB_DROP_REASON_TCP_INVALID_END_SEQUENCE.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250711114006.480026-2-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-14 18:41:15 -07:00
Dr. David Alan Gilbert
08a305b2a5 net/x25: Remove unused x25_terminate_link()
x25_terminate_link() has been unused since the last use was removed
in 2020 by:
commit 7eed751b3b ("net/x25: handle additional netdev events")

Remove it.

Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org>
Acked-by: Martin Schiller <ms@dev.tdt.de>
Link: https://patch.msgid.link/20250712205759.278777-1-linux@treblig.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-14 17:19:13 -07:00
Phil Sutter
36a686c078 Revert "netfilter: nf_tables: Add notifications for hook changes"
This reverts commit 465b9ee0ee.

Such notifications fit better into core or nfnetlink_hook code,
following the NFNL_MSG_HOOK_GET message format.

Signed-off-by: Phil Sutter <phil@nwl.cc>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2025-07-14 15:22:47 +02:00
Eric Dumazet
1f376373bd net_sched: act_skbedit: use RCU in tcf_skbedit_dump()
Also storing tcf_action into struct tcf_skbedit_params
makes sure there is no discrepancy in tcf_skbedit_act().

Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250709090204.797558-12-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-11 16:01:17 -07:00
Eric Dumazet
cec7a5c6c6 net_sched: act_police: use RCU in tcf_police_dump()
Also storing tcf_action into struct tcf_police_params
makes sure there is no discrepancy in tcf_police_act().

Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250709090204.797558-11-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-11 16:01:17 -07:00
Eric Dumazet
9d09674657 net_sched: act_pedit: use RCU in tcf_pedit_dump()
Also storing tcf_action into struct tcf_pedit_params
makes sure there is no discrepancy in tcf_pedit_act().

Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250709090204.797558-10-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-11 16:01:17 -07:00
Eric Dumazet
5d28928668 net_sched: act_nat: use RCU in tcf_nat_dump()
Also storing tcf_action into struct tcf_nat_params
makes sure there is no discrepancy in tcf_nat_act().

Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250709090204.797558-9-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-11 16:01:16 -07:00
Eric Dumazet
8151684e33 net_sched: act_mpls: use RCU in tcf_mpls_dump()
Also storing tcf_action into struct tcf_mpls_params
makes sure there is no discrepancy in tcf_mpls_act().

Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250709090204.797558-8-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-11 16:01:16 -07:00
Eric Dumazet
799c94178c net_sched: act_ctinfo: use RCU in tcf_ctinfo_dump()
Also storing tcf_action into struct tcf_ctinfo_params
makes sure there is no discrepancy in tcf_ctinfo_act().

Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250709090204.797558-7-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-11 16:01:16 -07:00
Eric Dumazet
d300335b4e net_sched: act_ctinfo: use atomic64_t for three counters
Commit 21c167aa0b ("net/sched: act_ctinfo: use percpu stats")
missed that stats_dscp_set, stats_dscp_error and stats_cpmark_set
might be written (and read) locklessly.

Use atomic64_t for these three fields, I doubt act_ctinfo is used
heavily on big SMP hosts anyway.

Fixes: 24ec483cec ("net: sched: Introduce act_ctinfo action")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Pedro Tammela <pctammela@mojatatu.com>
Link: https://patch.msgid.link/20250709090204.797558-6-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-11 16:01:16 -07:00
Eric Dumazet
554e66bad8 net_sched: act_ct: use RCU in tcf_ct_dump()
Also storing tcf_action into struct tcf_ct_params
makes sure there is no discrepancy in tcf_ct_act().

Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250709090204.797558-5-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-11 16:01:16 -07:00
Eric Dumazet
ba9dc9c140 net_sched: act_csum: use RCU in tcf_csum_dump()
Also storing tcf_action into struct tcf_csum_params
makes sure there is no discrepancy in tcf_csum_act().

Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250709090204.797558-4-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-11 16:01:16 -07:00
Eric Dumazet
0d75287770 net_sched: act_connmark: use RCU in tcf_connmark_dump()
Also storing tcf_action into struct tcf_connmark_parms
makes sure there is no discrepancy in tcf_connmark_act().

Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250709090204.797558-3-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-11 16:01:15 -07:00
Eric Dumazet
30dbb2d0e1 net_sched: act: annotate data-races in tcf_lastuse_update() and tcf_tm_dump()
tcf_tm_dump() reads fields that can be changed concurrently,
and tcf_lastuse_update() might race against itself.

Add READ_ONCE() and WRITE_ONCE() annotations.

Fetch jiffies once in tcf_tm_dump().

Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250709090204.797558-2-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-11 16:01:15 -07:00
Jakub Kicinski
0cad34fb7c Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR (net-6.16-rc6-2).

No conflicts.

Adjacent changes:

drivers/net/wireless/mediatek/mt76/mt7925/mcu.c
  c701574c54 ("wifi: mt76: mt7925: fix invalid array index in ssid assignment during hw scan")
  b3a431fe2e ("wifi: mt76: mt7925: fix off by one in mt7925_mcu_hw_scan()")

drivers/net/wireless/mediatek/mt76/mt7996/mac.c
  62da647a2b ("wifi: mt76: mt7996: Add MLO support to mt7996_tx_check_aggr()")
  dc66a129ad ("wifi: mt76: add a wrapper for wcid access with validation")

drivers/net/wireless/mediatek/mt76/mt7996/main.c
  3dd6f67c66 ("wifi: mt76: Move RCU section in mt7996_mcu_add_rate_ctrl()")
  8989d8e90f ("wifi: mt76: mt7996: Do not set wcid.sta to 1 in mt7996_mac_sta_event()")

net/mac80211/cfg.c
  58fcb1b428 ("wifi: mac80211: reject VHT opmode for unsupported channel widths")
  037dc18ac3 ("wifi: mac80211: add support for storing station S1G capabilities")

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-11 11:42:38 -07:00
Tao Chen
6e816e1c05 bpf: Remove location field in tcx_link
Use attach_type in bpf_link to replace the location filed, and
remove location field in tcx_link.

Signed-off-by: Tao Chen <chen.dylane@linux.dev>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20250710032038.888700-5-chen.dylane@linux.dev
2025-07-11 11:00:57 -07:00
Jakub Kicinski
0f26870a98 netfilter pull request 25-07-10
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEjF9xRqF1emXiQiqU1w0aZmrPKyEFAmhvEc8ACgkQ1w0aZmrP
 KyGYmBAAxw/EnlVp9sHWAPBNka6KOQECAb1LrnBVjv5nFZK3jHy8ehb/XjW0Q/rR
 ScjTgGJLBVuXMSGfFYBsDezesLYZLs/QK3amsTP9+HtYYGdOxdgvmTPMrUe2mF4W
 He/xrvWpVKTV00vnpVvZe5zbWnpw9iDTXbDecxmArqNVu78LOsRFU20S2MPc86ne
 wbgJohCS/QvhHPsKBUd83niGtKXR8uaFitLDaCgaaqqF4noXfNxIpdxm2O3NIqu1
 2gbEAp3qi1YWHUYHQUxsK7W7JjY8xfd9EFLViRdQCI0aevE01GmwJc4J2oDlZ7Ki
 DqRFEYzpwpxqe0YwihAsB72faXMOsWFODwIiojlAue26W5iVLzu+w7hBUiOBiq6E
 CEBH2OSL47n3s/Va6e1Gb9W2A24x+Qp+EZEZb0Ci0I0MlstOw0aUutoc/P6F9MxI
 v9ApavHkt5/FImeXiTB3j61ZodgduC5WLZ7HCM5EjrCjlZznsJFuxtWSYARyYTUM
 aj3U00KRW85WgeuvWj1aTGTiT7dmTsQVo7XllEpzfXWvVxE0ATwfiLRfUQwtUwGx
 6SecA49s34AwgdUKMynfPtlkxIzraGjhzJmGdTCA1Hi1fw2TUYniASmHRDKVQvUA
 1jYCQucYf+Q1zybosscXepUCtsVSPTZTZE/gQSDCUy6WNvHnv84=
 =zECx
 -----END PGP SIGNATURE-----

Merge tag 'nf-next-25-07-10' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next

Pablo Neira Ayuso says:

====================
Netfilter updates for net-next (v2)

The following series contains an initial small batch of Netfilter
updates for net-next:

1) Remove DCCP conntrack support, keep DCCP matches around in order to
   avoid breakage when loading ruleset, add Kconfig to wrap the code
   so it can be disabled by distributors.

2) Remove buggy code aiming at shrinking netlink deletion event, then
   re-add it correctly in another patch. This is to prevent -stable to
   pick up on a fix that breaks old userspace. From Phil Sutter.

3) Missing WARN_ON_ONCE() to check for lockdep_commit_lock_is_held()
   to uncover bugs. From Fedor Pchelkin.

* tag 'nf-next-25-07-10' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next:
  netfilter: nf_tables: adjust lockdep assertions handling
  netfilter: nf_tables: Reintroduce shortened deletion notifications
  netfilter: nf_tables: Drop dead code from fill_*_info routines
  netfilter: conntrack: remove DCCP protocol support
====================

Link: https://patch.msgid.link/20250710010706.2861281-1-pablo@netfilter.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-10 18:18:41 -07:00
Jakub Kicinski
809f683324 Quite a bit more work, notably:
- mt76: firmware recovery improvements, MLO work
  - iwlwifi: use embedded PNVM in (to be released) FW images
             to fix compatibility issues
  - cfg80211/mac80211: extended regulatory info support (6 GHz)
  - cfg80211: use "faux device" for regulatory
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEpeA8sTs3M8SN2hR410qiO8sPaAAFAmhvskoACgkQ10qiO8sP
 aACdthAAiN1wzi9Hrpm2ULqDecEL553ecvfOPAzlYibF4wiulVEdKJd7nFn03udA
 n2ylEgiANbTb+yEgDjj6IwQ9w5XmYvlLSGO/wIag3UUSWUYuCC5if/JEq56lom+B
 IZFMOFXpvO35jSe1h1v0HuuJvbeZv7KDWMJlA4aSXbLx4Y1juAuQid4/YllUkuXt
 QgAziAhF7laNk+8nfQLQ3N1DQytftiDK32vCJ6kJ7ciEhh8qxwT5aVkmE0Q/iOLg
 na4TdrcRRMQ7kkgODqksGz1nVrr/0AHyMVi/rQ/YaL1uY8fBxvbs0nmjU8G78LkO
 gM/401kySEzgLl6x/xrnaVUupbAtsyYrHYH+4q6Z07UPg1OgYY4G71rJIMp/Hq52
 /iQVMOFmbhoIqFLqlvmh92QqZRH+rMf/KnO9Pnh/wVdxGH74ZP0bFYB02bqdaRie
 uxtO9lHfnuMedET85D747rmgAqrBJ2t3BqAD++LNdqF530eOaWiBkAPeAbTRSgis
 d1BJoDtWL3l0tjAA1ivdUw/fPjqWxffpdDTtzSwRjuFUNE5YPpz3VuKNSurSgvMZ
 DCSvxTHf5DLVrg6b4W/YXickD2kArJLpaCwkxzkpyBzh6wyRvxs+b/oZSQRVeKvL
 C36I3WNDqARfC29ilYUiz/G8kJry8IQaSuLaMP6X92Fa6Hdl7Kg=
 =qkYl
 -----END PGP SIGNATURE-----

Merge tag 'wireless-next-2025-07-10' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next

Johannes Berg says:

====================
Quite a bit more work, notably:
 - mt76: firmware recovery improvements, MLO work
 - iwlwifi: use embedded PNVM in (to be released) FW images
            to fix compatibility issues
 - cfg80211/mac80211: extended regulatory info support (6 GHz)
 - cfg80211: use "faux device" for regulatory

* tag 'wireless-next-2025-07-10' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next: (48 commits)
  wifi: mac80211: don't complete management TX on SAE commit
  wifi: cfg80211/mac80211: implement dot11ExtendedRegInfoSupport
  wifi: mac80211: send extended MLD capa/ops if AP has it
  wifi: mac80211: copy first_part into HW scan
  wifi: cfg80211: add a flag for the first part of a scan
  wifi: mac80211: remove DISALLOW_PUNCTURING_5GHZ code
  wifi: cfg80211: only verify part of Extended MLD Capabilities
  wifi: nl80211: make nl80211_check_scan_flags() type safe
  wifi: cfg80211: hide scan internals
  wifi: mac80211: fix deactivated link CSA
  wifi: mac80211: add mandatory bitrate support for 6 GHz
  wifi: mac80211: remove spurious blank line
  wifi: mac80211: verify state before connection
  wifi: mac80211: avoid weird state in error path
  wifi: iwlwifi: mvm: remove support for iwl_wowlan_info_notif_v4
  wifi: iwlwifi: bump minimum API version in BZ
  wifi: iwlwifi: mvm: remove unneeded argument
  wifi: iwlwifi: mvm: remove MLO GTK rekey code
  wifi: iwlwifi: pcie: rename iwl_pci_gen1_2_probe() argument
  wifi: iwlwifi: match discrete/integrated to fix some names
  ...
====================

Link: https://patch.msgid.link/20250710123113.24878-3-johannes@sipsolutions.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-10 17:24:22 -07:00
Eric Dumazet
18cdb3d982 netfilter: flowtable: account for Ethernet header in nf_flow_pppoe_proto()
syzbot found a potential access to uninit-value in nf_flow_pppoe_proto()

Blamed commit forgot the Ethernet header.

BUG: KMSAN: uninit-value in nf_flow_offload_inet_hook+0x7e4/0x940 net/netfilter/nf_flow_table_inet.c:27
  nf_flow_offload_inet_hook+0x7e4/0x940 net/netfilter/nf_flow_table_inet.c:27
  nf_hook_entry_hookfn include/linux/netfilter.h:157 [inline]
  nf_hook_slow+0xe1/0x3d0 net/netfilter/core.c:623
  nf_hook_ingress include/linux/netfilter_netdev.h:34 [inline]
  nf_ingress net/core/dev.c:5742 [inline]
  __netif_receive_skb_core+0x4aff/0x70c0 net/core/dev.c:5837
  __netif_receive_skb_one_core net/core/dev.c:5975 [inline]
  __netif_receive_skb+0xcc/0xac0 net/core/dev.c:6090
  netif_receive_skb_internal net/core/dev.c:6176 [inline]
  netif_receive_skb+0x57/0x630 net/core/dev.c:6235
  tun_rx_batched+0x1df/0x980 drivers/net/tun.c:1485
  tun_get_user+0x4ee0/0x6b40 drivers/net/tun.c:1938
  tun_chr_write_iter+0x3e9/0x5c0 drivers/net/tun.c:1984
  new_sync_write fs/read_write.c:593 [inline]
  vfs_write+0xb4b/0x1580 fs/read_write.c:686
  ksys_write fs/read_write.c:738 [inline]
  __do_sys_write fs/read_write.c:749 [inline]

Reported-by: syzbot+bf6ed459397e307c3ad2@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/netdev/686bc073.a00a0220.c7b3.0086.GAE@google.com/T/#u
Fixes: 87b3593bed ("netfilter: flowtable: validate pppoe header")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Pablo Neira Ayuso <pablo@netfilter.org>
Link: https://patch.msgid.link/20250707124517.614489-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-10 17:12:28 -07:00
Wang Liang
96698d1898 net: replace ND_PRINTK with dynamic debug
ND_PRINTK with val > 1 only works when the ND_DEBUG was set in compilation
phase. Replace it with dynamic debug. Convert ND_PRINTK with val <= 1 to
net_{err,warn}_ratelimited, and convert the rest to net_dbg_ratelimited.

Suggested-by: Ido Schimmel <idosch@idosch.org>
Signed-off-by: Wang Liang <wangliang74@huawei.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Link: https://patch.msgid.link/20250708033342.1627636-1-wangliang74@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-10 15:27:32 -07:00
Jakub Kicinski
3321e97eab Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR (net-6.16-rc6).

No conflicts.

Adjacent changes:

Documentation/devicetree/bindings/net/allwinner,sun8i-a83t-emac.yaml
  0a12c435a1 ("dt-bindings: net: sun8i-emac: Add A100 EMAC compatible")
  b3603c0466 ("dt-bindings: net: sun8i-emac: Rename A523 EMAC0 to GMAC0")

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-10 10:10:49 -07:00
Jason Xing
45e359be1c net: xsk: introduce XDP_MAX_TX_SKB_BUDGET setsockopt
This patch provides a setsockopt method to let applications leverage to
adjust how many descs to be handled at most in one send syscall. It
mitigates the situation where the default value (32) that is too small
leads to higher frequency of triggering send syscall.

Considering the prosperity/complexity the applications have, there is no
absolutely ideal suggestion fitting all cases. So keep 32 as its default
value like before.

The patch does the following things:
- Add XDP_MAX_TX_SKB_BUDGET socket option.
- Set max_tx_budget to 32 by default in the initialization phase as a
  per-socket granular control.
- Set the range of max_tx_budget as [32, xs->tx->nentries].

The idea behind this comes out of real workloads in production. We use a
user-level stack with xsk support to accelerate sending packets and
minimize triggering syscalls. When the packets are aggregated, it's not
hard to hit the upper bound (namely, 32). The moment user-space stack
fetches the -EAGAIN error number passed from sendto(), it will loop to try
again until all the expected descs from tx ring are sent out to the driver.
Enlarging the XDP_MAX_TX_SKB_BUDGET value contributes to less frequency of
sendto() and higher throughput/PPS.

Here is what I did in production, along with some numbers as follows:
For one application I saw lately, I suggested using 128 as max_tx_budget
because I saw two limitations without changing any default configuration:
1) XDP_MAX_TX_SKB_BUDGET, 2) socket sndbuf which is 212992 decided by
net.core.wmem_default. As to XDP_MAX_TX_SKB_BUDGET, the scenario behind
this was I counted how many descs are transmitted to the driver at one
time of sendto() based on [1] patch and then I calculated the
possibility of hitting the upper bound. Finally I chose 128 as a
suitable value because 1) it covers most of the cases, 2) a higher
number would not bring evident results. After twisting the parameters,
a stable improvement of around 4% for both PPS and throughput and less
resources consumption were found to be observed by strace -c -p xxx:
1) %time was decreased by 7.8%
2) error counter was decreased from 18367 to 572

[1]: https://lore.kernel.org/all/20250619093641.70700-1-kerneljasonxing@gmail.com/

Signed-off-by: Jason Xing <kernelxing@tencent.com>
Acked-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Link: https://patch.msgid.link/20250704160138.48677-1-kerneljasonxing@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-07-10 14:48:29 +02:00
Xiang Mei
dd831ac822 net/sched: sch_qfq: Fix null-deref in agg_dequeue
To prevent a potential crash in agg_dequeue (net/sched/sch_qfq.c)
when cl->qdisc->ops->peek(cl->qdisc) returns NULL, we check the return
value before using it, similar to the existing approach in sch_hfsc.c.

To avoid code duplication, the following changes are made:

1. Changed qdisc_warn_nonwc(include/net/pkt_sched.h) into a static
inline function.

2. Moved qdisc_peek_len from net/sched/sch_hfsc.c to
include/net/pkt_sched.h so that sch_qfq can reuse it.

3. Applied qdisc_peek_len in agg_dequeue to avoid crashing.

Signed-off-by: Xiang Mei <xmei5@asu.edu>
Reviewed-by: Cong Wang <xiyou.wangcong@gmail.com>
Link: https://patch.msgid.link/20250705212143.3982664-1-xmei5@asu.edu
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-07-10 11:08:35 +02:00
Ivan Vecera
de9ccf2296 devlink: Add new "clock_id" generic device param
Add a new device generic parameter to specify clock ID that should
be used by the device for registering DPLL devices and pins.

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Link: https://patch.msgid.link/20250704182202.1641943-5-ivecera@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-09 19:08:52 -07:00
Ivan Vecera
c0ef144695 devlink: Add support for u64 parameters
Only 8, 16 and 32-bit integers are supported for numeric devlink
parameters. The subsequent patch adds support for DPLL clock ID
that is defined as 64-bit number. Add support for u64 parameter
type.

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Link: https://patch.msgid.link/20250704182202.1641943-4-ivecera@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-09 19:08:52 -07:00
Johannes Berg
6b04716cdc wifi: mac80211: don't complete management TX on SAE commit
When SAE commit is sent and received in response, there's no
ordering for the SAE confirm messages. As such, don't call
drivers to stop listening on the channel when the confirm
message is still expected.

This fixes an issue if the local confirm is transmitted later
than the AP's confirm, for iwlwifi (and possibly mt76) the
AP's confirm would then get lost since the device isn't on
the channel at the time the AP transmit the confirm.

For iwlwifi at least, this also improves the overall timing
of the authentication handshake (by about 15ms according to
the report), likely since the session protection won't be
aborted and rescheduled.

Note that even before this, mgd_complete_tx() wasn't always
called for each call to mgd_prepare_tx() (e.g. in the case
of WEP key shared authentication), and the current drivers
that have the complete callback don't seem to mind. Document
this as well though.

Reported-by: Jan Hendrik Farr <kernel@jfarr.cc>
Closes: https://lore.kernel.org/all/aB30Ea2kRG24LINR@archlinux/
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
Link: https://patch.msgid.link/20250609213232.12691580e140.I3f1d3127acabcd58348a110ab11044213cf147d3@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-07-09 11:56:45 +02:00
Benjamin Berg
62c57ebb31 wifi: cfg80211: add a flag for the first part of a scan
When there are no non-6 GHz channels, then the 6 GHz scan is the first
part of a split scan. Add a boolean denoting whether the scan is the
first part of a scan as it might be useful to drivers for internal
bookkeeping. This flag is also set if the scan is not split.

Signed-off-by: Benjamin Berg <benjamin.berg@intel.com>
Reviewed-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
Link: https://patch.msgid.link/20250609213231.07e5a8a452ec.Ibf18f513e507422078fb31b28947e582a20df87a@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-07-09 11:52:36 +02:00
Johannes Berg
984462751d wifi: mac80211: remove DISALLOW_PUNCTURING_5GHZ code
Since iwlwifi was the only driver using this and no
longer does, we can remove all this code.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
Link: https://patch.msgid.link/20250609213231.4dff5fb8890f.Ie531f912b252a0042c18c0734db50c3afe1adfb5@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-07-09 11:52:36 +02:00
Johannes Berg
f0df91b6a7 wifi: cfg80211: hide scan internals
Hide the internal scan fields from mac80211 and drivers, the
'notified' variable is for internal tracking, and the 'info'
is output that's passed to cfg80211_scan_done() and stored
only for delayed userspace notification.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Reviewed-by: Benjamin Berg <benjamin.berg@intel.com>
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
Link: https://patch.msgid.link/20250609213231.6a62e41858e2.I004f66e9c087cc6e6ae4a24951cf470961ee9466@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-07-09 11:52:35 +02:00
Miri Korenblit
be1ba9ed22 wifi: mac80211: avoid weird state in error path
If we get to the error path of ieee80211_prep_connection, for example
because of a FW issue, then ieee80211_vif_set_links is called
with 0.
But the call to drv_change_vif_links from ieee80211_vif_update_links
will probably fail as well, for the same reason.
In this case, the valid_links and active_links bitmaps will be reverted
to the value of the failing connection.
Then, in the next connection, due to the logic of
ieee80211_set_vif_links_bitmaps, valid_links will be set to the ID of
the new connection assoc link, but the active_links will remain with the
ID of the old connection's assoc link.
If those IDs are different, we get into a weird state of valid_links and
active_links being different. One of the consequences of this state is
to call drv_change_vif_links with new_links as 0, since the & operation
between the bitmaps will be 0.

Since a removal of a link should always succeed, ignore the return value
of drv_change_vif_links if it was called to only remove links, which is
the case for the ieee80211_prep_connection's error path.
That way, the bitmaps will not be reverted to have the value from the
failing connection and will have 0, so the next connection will have a
good state.

Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
Reviewed-by: Johannes Berg <johannes.berg@intel.com>
Link: https://patch.msgid.link/20250609213231.ba2011fb435f.Id87ff6dab5e1cf757b54094ac2d714c656165059@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-07-09 11:41:38 +02:00
Kuniyuki Iwashima
df30285b36 af_unix: Introduce SO_INQ.
We have an application that uses almost the same code for TCP and
AF_UNIX (SOCK_STREAM).

TCP can use TCP_INQ, but AF_UNIX doesn't have it and requires an
extra syscall, ioctl(SIOCINQ) or getsockopt(SO_MEMINFO) as an
alternative.

Let's introduce the generic version of TCP_INQ.

If SO_INQ is enabled, recvmsg() will put a cmsg of SCM_INQ that
contains the exact value of ioctl(SIOCINQ).  The cmsg is also
included when msg->msg_get_inq is non-zero to make sockets
io_uring-friendly.

Note that SOCK_CUSTOM_SOCKOPT is flagged only for SOCK_STREAM to
override setsockopt() for SOL_SOCKET.

By having the flag in struct unix_sock, instead of struct sock, we
can later add SO_INQ support for TCP and reuse tcp_sk(sk)->recvmsg_inq.

Note also that supporting custom getsockopt() for SOL_SOCKET will need
preparation for other SOCK_CUSTOM_SOCKOPT users (UDP, vsock, MPTCP).

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20250702223606.1054680-7-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-08 18:05:25 -07:00
Kuniyuki Iwashima
f4e1fb04c1 af_unix: Use cached value for SOCK_STREAM in unix_inq_len().
Compared to TCP, ioctl(SIOCINQ) for AF_UNIX SOCK_STREAM socket is more
expensive, as unix_inq_len() requires iterating through the receive queue
and accumulating skb->len.

Let's cache the value for SOCK_STREAM to a new field during sendmsg()
and recvmsg().

The field is protected by the receive queue lock.

Note that ioctl(SIOCINQ) for SOCK_DGRAM returns the length of the first
skb in the queue.

SOCK_SEQPACKET still requires iterating through the queue because we do
not touch functions shared with unix_dgram_ops.  But, if really needed,
we can support it by switching __skb_try_recv_datagram() to a custom
version.

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20250702223606.1054680-5-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-08 18:05:25 -07:00
Sabrina Dubroca
2a198bbec6 Revert "xfrm: destroy xfrm_state synchronously on net exit path"
This reverts commit f75a2804da.

With all states (whether user or kern) removed from the hashtables
during deletion, there's no need for synchronous destruction of
states. xfrm6_tunnel states still need to have been destroyed (which
will be the case when its last user is deleted (not destroyed)) so
that xfrm6_tunnel_free_spi removes it from the per-netns hashtable
before the netns is destroyed.

This has the benefit of skipping one synchronize_rcu per state (in
__xfrm_state_destroy(sync=true)) when we exit a netns.

Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2025-07-08 13:28:29 +02:00
Sabrina Dubroca
b441cf3f8c xfrm: delete x->tunnel as we delete x
The ipcomp fallback tunnels currently get deleted (from the various
lists and hashtables) as the last user state that needed that fallback
is destroyed (not deleted). If a reference to that user state still
exists, the fallback state will remain on the hashtables/lists,
triggering the WARN in xfrm_state_fini. Because of those remaining
references, the fix in commit f75a2804da ("xfrm: destroy xfrm_state
synchronously on net exit path") is not complete.

We recently fixed one such situation in TCP due to defered freeing of
skbs (commit 9b6412e697 ("tcp: drop secpath at the same time as we
currently drop dst")). This can also happen due to IP reassembly: skbs
with a secpath remain on the reassembly queue until netns
destruction. If we can't guarantee that the queues are flushed by the
time xfrm_state_fini runs, there may still be references to a (user)
xfrm_state, preventing the timely deletion of the corresponding
fallback state.

Instead of chasing each instance of skbs holding a secpath one by one,
this patch fixes the issue directly within xfrm, by deleting the
fallback state as soon as the last user state depending on it has been
deleted. Destruction will still happen when the final reference is
dropped.

A separate lockdep class for the fallback state is required since
we're going to lock x->tunnel while x is locked.

Fixes: 9d4139c769 ("netns xfrm: per-netns xfrm_state_all list")
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2025-07-08 13:28:27 +02:00
Eric Dumazet
84a7d6797e net/sched: acp_api: no longer acquire RTNL in tc_action_net_exit()
tc_action_net_exit() got an rtnl exclusion in commit
a159d3c4b8 ("net_sched: acquire RTNL in tc_action_net_exit()")

Since then, commit 16af606739 ("net: sched: implement reference
counted action release") made this RTNL exclusion obsolete for
most cases.

Only tcf_action_offload_del() might still require it.

Move the rtnl locking into tcf_idrinfo_destroy() when
an offload action is found.

Most netns do not have actions, yet deleting them is adding a lot
of pressure on RTNL, which is for many the most contended mutex
in the kernel.

We are moving to a per-netns 'rtnl', so tc_action_net_exit()
will not be able to grab 'rtnl' a single time for a batch of netns.

Before the patch:

perf probe -a rtnl_lock

perf record -e probe:rtnl_lock -a /bin/bash -c 'unshare -n "/bin/true"; sleep 1'
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.305 MB perf.data (25 samples) ]

After the patch:

perf record -e probe:rtnl_lock -a /bin/bash -c 'unshare -n "/bin/true"; sleep 1'
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.304 MB perf.data (9 samples) ]

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Vlad Buslov <vladbu@nvidia.com>
Cc: Jiri Pirko <jiri@resnulli.us>
Cc: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Link: https://patch.msgid.link/20250702071230.1892674-1-edumazet@google.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-07-08 13:00:24 +02:00
Jeremy Kerr
ad39c12fce net: mctp: add gateway routing support
This change allows for gateway routing, where a route table entry
may reference a routable endpoint (by network and EID), instead of
routing directly to a netdevice.

We add support for a RTM_GATEWAY attribute for netlink route updates,
with an attribute format of:

    struct mctp_fq_addr {
        unsigned int net;
        mctp_eid_t eid;
    }

- we need the net here to uniquely identify the target EID, as we no
longer have the device reference directly (which would provide the net
id in the case of direct routes).

This makes route lookups recursive, as a route lookup that returns a
gateway route must be resolved into a direct route (ie, to a device)
eventually. We provide a limit to the route lookups, to prevent infinite
loop routing.

The route lookup populates a new 'nexthop' field in the dst structure,
which now specifies the key for the neighbour table lookup on device
output, rather than using the packet destination address directly.

Signed-off-by: Jeremy Kerr <jk@codeconstruct.com.au>
Link: https://patch.msgid.link/20250702-dev-forwarding-v5-13-1468191da8a4@codeconstruct.com.au
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-07-08 12:39:24 +02:00
Jeremy Kerr
3007f90ec0 net: mctp: separate cb from direct-addressing routing
Now that we have the dst->haddr populated by sendmsg (when extended
addressing is in use), we no longer need to stash the link-layer address
in the skb->cb.

Instead, only use skb->cb for incoming lladdr data.

While we're at it: remove cb->src, as was never used.

Signed-off-by: Jeremy Kerr <jk@codeconstruct.com.au>
Link: https://patch.msgid.link/20250702-dev-forwarding-v5-4-1468191da8a4@codeconstruct.com.au
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-07-08 12:39:23 +02:00
Jeremy Kerr
269936db5e net: mctp: separate routing database from routing operations
This change adds a struct mctp_dst, representing the result of a routing
lookup. This decouples the struct mctp_route from the actual
implementation of a routing operation.

This will allow for future routing changes which may require more
involved lookup logic, such as gateway routing - which may require
multiple traversals of the routing table.

Since we only use the struct mctp_route at lookup time, we no longer
hold routes over a routing operation, as we only need it to populate the
dst. However, we do hold the dev while the dst is active.

This requires some changes to the route test infrastructure, as we no
longer have a mock route to handle the route output operation, and
transient dsts are created by the routing code, so we can't override
them as easily.

Instead, we use kunit->priv to stash a packet queue, and a custom
dst_output function queues into that packet queue, which we can use for
later expectations.

Signed-off-by: Jeremy Kerr <jk@codeconstruct.com.au>
Link: https://patch.msgid.link/20250702-dev-forwarding-v5-3-1468191da8a4@codeconstruct.com.au
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-07-08 12:39:23 +02:00
Tonghao Zhang
ce7a381697 net: bonding: add broadcast_neighbor option for 802.3ad
Stacking technology is a type of technology used to expand ports on
Ethernet switches. It is widely used as a common access method in
large-scale Internet data center architectures. Years of practice
have proved that stacking technology has advantages and disadvantages
in high-reliability network architecture scenarios. For instance,
in stacking networking arch, conventional switch system upgrades
require multiple stacked devices to restart at the same time.
Therefore, it is inevitable that the business will be interrupted
for a while. It is for this reason that "no-stacking" in data centers
has become a trend. Additionally, when the stacking link connecting
the switches fails or is abnormal, the stack will split. Although it is
not common, it still happens in actual operation. The problem is that
after the split, it is equivalent to two switches with the same
configuration appearing in the network, causing network configuration
conflicts and ultimately interrupting the services carried by the
stacking system.

To improve network stability, "non-stacking" solutions have been
increasingly adopted, particularly by public cloud providers and
tech companies like Alibaba, Tencent, and Didi. "non-stacking" is
a method of mimicing switch stacking that convinces a LACP peer,
bonding in this case, connected to a set of "non-stacked" switches
that all of its ports are connected to a single switch
(i.e., LACP aggregator), as if those switches were stacked. This
enables the LACP peer's ports to aggregate together, and requires
(a) special switch configuration, described in the linked article,
and (b) modifications to the bonding 802.3ad (LACP) mode to send
all ARP/ND packets across all ports of the active aggregator.

Note that, with multiple aggregators, the current broadcast mode
logic will send only packets to the selected aggregator(s).

 +-----------+   +-----------+
 |  switch1  |   |  switch2  |
 +-----------+   +-----------+
         ^           ^
         |           |
      +-----------------+
      |   bond4 lacp    |
      +-----------------+
         |           |
         | NIC1      | NIC2
      +-----------------+
      |     server      |
      +-----------------+

- https://www.ruijie.com/fr-fr/support/tech-gallery/de-stack-data-center-network-architecture/

Cc: Jay Vosburgh <jv@jvosburgh.net>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: Simon Horman <horms@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Andrew Lunn <andrew+netdev@lunn.ch>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: Tonghao Zhang <tonghao@bamaicloud.com>
Signed-off-by: Zengbing Tu <tuzengbing@didiglobal.com>
Link: https://patch.msgid.link/84d0a044514157bb856a10b6d03a1028c4883561.1751031306.git.tonghao@bamaicloud.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-07-08 10:59:41 +02:00
Jakub Kicinski
80852774ba bluetooth pull request for net:
- hci_sync: Fix not disabling advertising instance
  - hci_core: Remove check of BDADDR_ANY in hci_conn_hash_lookup_big_state
  - hci_sync: Fix attempting to send HCI_Disconnect to BIS handle
  - hci_event: Fix not marking Broadcast Sink BIS as connected
 -----BEGIN PGP SIGNATURE-----
 
 iQJNBAABCgA3FiEE7E6oRXp8w05ovYr/9JCA4xAyCykFAmhmpiMZHGx1aXoudm9u
 LmRlbnR6QGludGVsLmNvbQAKCRD0kIDjEDILKTZOD/9ggH1wJ52hD6NeRow2dZHt
 BAppV0qNg7p9PkmYPDTOKNkdASmp5VfdLyCXH0vBUjC7+iutrkA/oq5UmIzAM7Uz
 1HSELyvsLBL3Gj2AxWyElBtuwo3F7cA70veesZKDJJ92VHlCMMXPJugMCrcy0iJn
 sVYENIJKtMnLJMTnnpmBCpRQszQfHGYnzQerbeUlCPoFgn7vsRy+onZ9XiIj6R2K
 HGZVk6MhUnNWrXthwcgwAZ0SRq+5nhxhB4jOSSZ6nNnT9Wt8oloDM7KtYSy976QA
 Ub2LzvR2E8/XtnTeWus5oLL2kAvz1vClS6gpGrvziElnIryq8aYhwKOO7yhRFmDK
 kllpOIPaiDrl1H1nVR4z7t/IK5z/0A0wkTcXXm4WNZkLOj8YmY+BdjPKO39PRxK4
 PpcHzsmI6PQjnLeGRi3a4PYfMgVYRvdO/OKSO9/xphqgE+dQdonH7/Dsvw+QkEw8
 TgFvfKj8bMNjd11Mynd46P45tnQ3HIrTSYGTbdvoU/n8ZFEjATuHtIxRscw0bRmj
 iy4u55JV+U3BAdm2CsRouiiTHnvXt9S2HMyMnemwwxsGpim/JG+IsccYUZd8mvLf
 d+N2LWsq71CpVhRSJ50WJDrCEJdKT71KwHbSYLJl8vGhObV3VpTBZ57eVF98SNjm
 Zqg8JqLJFvCLOhl4RWj/xA==
 =o7T9
 -----END PGP SIGNATURE-----

Merge tag 'for-net-2025-07-03' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth

Luiz Augusto von Dentz says:

====================
bluetooth pull request for net:

 - hci_sync: Fix not disabling advertising instance
 - hci_core: Remove check of BDADDR_ANY in hci_conn_hash_lookup_big_state
 - hci_sync: Fix attempting to send HCI_Disconnect to BIS handle
 - hci_event: Fix not marking Broadcast Sink BIS as connected

* tag 'for-net-2025-07-03' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth:
  Bluetooth: hci_event: Fix not marking Broadcast Sink BIS as connected
  Bluetooth: hci_sync: Fix attempting to send HCI_Disconnect to BIS handle
  Bluetooth: hci_core: Remove check of BDADDR_ANY in hci_conn_hash_lookup_big_state
  Bluetooth: hci_sync: Fix not disabling advertising instance
====================

Link: https://patch.msgid.link/20250703160409.1791514-1-luiz.dentz@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-07 19:01:29 -07:00
Byungchul Park
d8bf56a0ca page_pool: make page_pool_get_dma_addr() just wrap page_pool_get_dma_addr_netmem()
The page pool members in struct page cannot be removed unless it's not
allowed to access any of them via struct page.

Do not access 'page->dma_addr' directly in page_pool_get_dma_addr() but
just wrap page_pool_get_dma_addr_netmem() safely.

Signed-off-by: Byungchul Park <byungchul@sk.com>
Reviewed-by: Mina Almasry <almasrymina@google.com>
Reviewed-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Acked-by: Jesper Dangaard Brouer <hawk@kernel.org>
Link: https://patch.msgid.link/20250702053256.4594-6-byungchul@sk.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-07 18:40:10 -07:00
Byungchul Park
4369d40da2 netmem: use _Generic to cover const casting for page_to_netmem()
The current page_to_netmem() doesn't cover const casting resulting in
trying to cast const struct page * to const netmem_ref fails.

To cover the case, change page_to_netmem() to use macro and _Generic.

Signed-off-by: Byungchul Park <byungchul@sk.com>
Reviewed-by: Mina Almasry <almasrymina@google.com>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://patch.msgid.link/20250702053256.4594-5-byungchul@sk.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-07 18:40:10 -07:00
Stefano Garzarella
1e3b66e326 vsock: fix vsock_proto declaration
From commit 634f1a7110 ("vsock: support sockmap"), `struct proto
vsock_proto`, defined in af_vsock.c, is not static anymore, since it's
used by vsock_bpf.c.

If CONFIG_BPF_SYSCALL is not defined, `make C=2` will print a warning:
    $ make O=build C=2 W=1 net/vmw_vsock/
      ...
      CC [M]  net/vmw_vsock/af_vsock.o
      CHECK   ../net/vmw_vsock/af_vsock.c
    ../net/vmw_vsock/af_vsock.c:123:14: warning: symbol 'vsock_proto' was not declared. Should it be static?

Declare `vsock_proto` regardless of CONFIG_BPF_SYSCALL, since it's defined
in af_vsock.c, which is built regardless of CONFIG_BPF_SYSCALL.

Fixes: 634f1a7110 ("vsock: support sockmap")
Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Link: https://patch.msgid.link/20250703112329.28365-1-sgarzare@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-07 16:55:54 -07:00
Alexander Mikhalitsyn
2b9996417e
af_unix/scm: fix whitespace errors
Fix whitespace/formatting errors.

Cc: linux-kernel@vger.kernel.org
Cc: netdev@vger.kernel.org
Cc: David S. Miller <davem@davemloft.net>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: Simon Horman <horms@kernel.org>
Cc: Leon Romanovsky <leon@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Kuniyuki Iwashima <kuniyu@google.com>
Cc: Lennart Poettering <mzxreary@0pointer.de>
Cc: Luca Boccassi <bluca@debian.org>
Cc: David Rheinsberg <david@readahead.eu>
Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
Link: https://lore.kernel.org/20250703222314.309967-5-aleksandr.mikhalitsyn@canonical.com
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-07-04 09:32:35 +02:00
Luiz Augusto von Dentz
59710a26a2 Bluetooth: hci_core: Remove check of BDADDR_ANY in hci_conn_hash_lookup_big_state
The check for destination to be BDADDR_ANY is no longer necessary with
the introduction of BIS_LINK.

Fixes: 23205562ff ("Bluetooth: separate CIS_LINK and BIS_LINK link types")
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2025-07-03 11:36:54 -04:00
Pablo Neira Ayuso
fd72f265bb netfilter: conntrack: remove DCCP protocol support
The DCCP socket family has now been removed from this tree, see:

  8bb3212be4 ("Merge branch 'net-retire-dccp-socket'")

Remove connection tracking and NAT support for this protocol, this
should not pose a problem because no DCCP traffic is expected to be seen
on the wire.

As for the code for matching on dccp header for iptables and nftables,
mark it as deprecated and keep it in place. Ruleset restoration is an
atomic operation. Without dccp matching support, an astray match on dccp
could break this operation leaving your computer with no policy in
place, so let's follow a more conservative approach for matches.

Add CONFIG_NFT_EXTHDR_DCCP which is set to 'n' by default to deprecate
dccp extension support. Similarly, label CONFIG_NETFILTER_XT_MATCH_DCCP
as deprecated too and also set it to 'n' by default.

Code to match on DCCP protocol from ebtables also remains in place, this
is just a few checks on IPPROTO_DCCP from _check() path which is
exercised when ruleset is loaded. There is another use of IPPROTO_DCCP
from the _check() path in the iptables multiport match. Another check
for IPPROTO_DCCP from the packet in the reject target is also removed.

So let's schedule removal of the dccp matching for a second stage, this
should not interfer with the dccp retirement since this is only matching
on the dccp header.

Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2025-07-03 13:51:39 +02:00
Carolina Jubran
566e8f108f devlink: Extend devlink rate API with traffic classes bandwidth management
Introduce support for specifying relative bandwidth shares between
traffic classes (TC) in the devlink-rate API. This new option allows
users to allocate bandwidth across multiple traffic classes in a
single command.

This feature provides a more granular control over traffic management,
especially for scenarios requiring Enhanced Transmission Selection.

Users can now define a relative bandwidth share for each traffic class.
For example, assigning share values of 20 to TC0 (TCP/UDP) and 80 to TC5
(RoCE) will result in TC0 receiving 20% and TC5 receiving 80% of the
total bandwidth. The actual percentage each class receives depends on
the ratio of its share value to the sum of all shares.

Example:
DEV=pci/0000:08:00.0

$ devlink port function rate add $DEV/vfs_group tx_share 10Gbit \
  tx_max 50Gbit tc-bw 0:20 1:0 2:0 3:0 4:0 5:80 6:0 7:0

$ devlink port function rate set $DEV/vfs_group \
  tc-bw 0:20 1:0 2:0 3:0 4:0 5:20 6:60 7:0

Example usage with ynl:

./tools/net/ynl/cli.py --spec Documentation/netlink/specs/devlink.yaml \
  --do rate-set --json '{
  "bus-name": "pci",
  "dev-name": "0000:08:00.0",
  "port-index": 1,
  "rate-tc-bws": [
    {"rate-tc-index": 0, "rate-tc-bw": 50},
    {"rate-tc-index": 1, "rate-tc-bw": 50},
    {"rate-tc-index": 2, "rate-tc-bw": 0},
    {"rate-tc-index": 3, "rate-tc-bw": 0},
    {"rate-tc-index": 4, "rate-tc-bw": 0},
    {"rate-tc-index": 5, "rate-tc-bw": 0},
    {"rate-tc-index": 6, "rate-tc-bw": 0},
    {"rate-tc-index": 7, "rate-tc-bw": 0}
  ]
}'

./tools/net/ynl/cli.py --spec Documentation/netlink/specs/devlink.yaml \
  --do rate-get --json '{
  "bus-name": "pci",
  "dev-name": "0000:08:00.0",
  "port-index": 1
}'

output for rate-get:
{'bus-name': 'pci',
 'dev-name': '0000:08:00.0',
 'port-index': 1,
 'rate-tc-bws': [{'rate-tc-bw': 50, 'rate-tc-index': 0},
                 {'rate-tc-bw': 50, 'rate-tc-index': 1},
                 {'rate-tc-bw': 0, 'rate-tc-index': 2},
                 {'rate-tc-bw': 0, 'rate-tc-index': 3},
                 {'rate-tc-bw': 0, 'rate-tc-index': 4},
                 {'rate-tc-bw': 0, 'rate-tc-index': 5},
                 {'rate-tc-bw': 0, 'rate-tc-index': 6},
                 {'rate-tc-bw': 0, 'rate-tc-index': 7}],
 'rate-tx-max': 0,
 'rate-tx-priority': 0,
 'rate-tx-share': 0,
 'rate-tx-weight': 0,
 'rate-type': 'leaf'}

Signed-off-by: Carolina Jubran <cjubran@nvidia.com>
Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Mark Bloch <mbloch@nvidia.com>
Link: https://patch.msgid.link/20250629142138.361537-3-mbloch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-02 15:39:05 -07:00
Carolina Jubran
42401c4238 netlink: introduce type-checking attribute iteration for nlmsg
Add the nlmsg_for_each_attr_type() macro to simplify iteration over
attributes of a specific type in a Netlink message.

Convert existing users in vxlan and nfsd to use the new macro.

Suggested-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Carolina Jubran <cjubran@nvidia.com>
Signed-off-by: Mark Bloch <mbloch@nvidia.com>
Link: https://patch.msgid.link/20250629142138.361537-2-mbloch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-02 15:39:04 -07:00
Eric Dumazet
93d1cff35a ipv6: adopt skb_dst_dev() and skb_dst_dev_net[_rcu]() helpers
Use the new helpers as a step to deal with potential dst->dev races.

v2: fix typo in ipv6_rthdr_rcv() (kernel test robot <lkp@intel.com>)

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250630121934.3399505-10-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-02 14:32:30 -07:00
Eric Dumazet
1caf272972 ipv6: adopt dst_dev() helper
Use the new helper as a step to deal with potential dst->dev races.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250630121934.3399505-9-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-02 14:32:30 -07:00
Eric Dumazet
a74fc62eec ipv4: adopt dst_dev, skb_dst_dev and skb_dst_dev_net[_rcu]
Use the new helpers as a first step to deal with
potential dst->dev races.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250630121934.3399505-8-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-02 14:32:30 -07:00
Eric Dumazet
88fe14253e net: dst: add four helpers to annotate data-races around dst->dev
dst->dev is read locklessly in many contexts,
and written in dst_dev_put().

Fixing all the races is going to need many changes.

We probably will have to add full RCU protection.

Add three helpers to ease this painful process.

static inline struct net_device *dst_dev(const struct dst_entry *dst)
{
       return READ_ONCE(dst->dev);
}

static inline struct net_device *skb_dst_dev(const struct sk_buff *skb)
{
       return dst_dev(skb_dst(skb));
}

static inline struct net *skb_dst_dev_net(const struct sk_buff *skb)
{
       return dev_net(skb_dst_dev(skb));
}

static inline struct net *skb_dst_dev_net_rcu(const struct sk_buff *skb)
{
       return dev_net_rcu(skb_dst_dev(skb));
}

Fixes: 4a6ce2b6f2 ("net: introduce a new function dst_dev_put()")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250630121934.3399505-7-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-02 14:32:30 -07:00
Eric Dumazet
2dce8c52a9 net: dst: annotate data-races around dst->output
dst_dev_put() can overwrite dst->output while other
cpus might read this field (for instance from dst_output())

Add READ_ONCE()/WRITE_ONCE() annotations to suppress
potential issues.

We will likely need RCU protection in the future.

Fixes: 4a6ce2b6f2 ("net: introduce a new function dst_dev_put()")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250630121934.3399505-6-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-02 14:32:30 -07:00
Eric Dumazet
f1c5fd3489 net: dst: annotate data-races around dst->input
dst_dev_put() can overwrite dst->input while other
cpus might read this field (for instance from dst_input())

Add READ_ONCE()/WRITE_ONCE() annotations to suppress
potential issues.

We will likely need full RCU protection later.

Fixes: 4a6ce2b6f2 ("net: introduce a new function dst_dev_put()")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250630121934.3399505-5-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-02 14:32:29 -07:00
Eric Dumazet
8f2b2282d0 net: dst: annotate data-races around dst->lastuse
(dst_entry)->lastuse is read and written locklessly,
add corresponding annotations.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250630121934.3399505-4-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-02 14:32:29 -07:00
Eric Dumazet
36229b2cac net: dst: annotate data-races around dst->expires
(dst_entry)->expires is read and written locklessly,
add corresponding annotations.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250630121934.3399505-3-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-02 14:32:29 -07:00
Eric Dumazet
8a402bbe54 net: dst: annotate data-races around dst->obsolete
(dst_entry)->obsolete is read locklessly, add corresponding
annotations.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250630121934.3399505-2-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-02 14:32:29 -07:00
Eric Dumazet
e3d4825124 udp: move udp_memory_allocated into net_aligned_data
____cacheline_aligned_in_smp attribute only makes sure to align
a field to a cache line. It does not prevent the linker to use
the remaining of the cache line for other variables, causing
potential false sharing.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20250630093540.3052835-5-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-02 14:22:02 -07:00
Eric Dumazet
8308133741 tcp: move tcp_memory_allocated into net_aligned_data
____cacheline_aligned_in_smp attribute only makes sure to align
a field to a cache line. It does not prevent the linker to use
the remaining of the cache line for other variables, causing
potential false sharing.

Move tcp_memory_allocated into a dedicated cache line.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20250630093540.3052835-4-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-02 14:22:02 -07:00
Eric Dumazet
998642e999 net: move net_cookie into net_aligned_data
Using per-cpu data for net->net_cookie generation is overkill,
because even busy hosts do not create hundreds of netns per second.

Make sure to put net_cookie in a private cache line to avoid
potential false sharing.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20250630093540.3052835-3-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-02 14:22:02 -07:00
Eric Dumazet
3715b5df09 net: add struct net_aligned_data
This structure will hold networking data that must
consume a full cache line to avoid accidental false sharing.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20250630093540.3052835-2-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-02 14:22:02 -07:00
Haiyang Zhang
fbe346ce9d net: mana: Handle Reset Request from MANA NIC
Upon receiving the Reset Request, pause the connection and clean up
queues, wait for the specified period, then resume the NIC.
In the cleanup phase, the HWC is no longer responding, so set hwc_timeout
to zero to skip waiting on the response.

Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
Link: https://patch.msgid.link/1751055983-29760-1-git-send-email-haiyangz@linux.microsoft.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-01 19:17:53 -07:00
Ido Schimmel
03dc03fa04 neighbor: Add NTF_EXT_VALIDATED flag for externally validated entries
tl;dr
=====

Add a new neighbor flag ("extern_valid") that can be used to indicate to
the kernel that a neighbor entry was learned and determined to be valid
externally. The kernel will not try to remove or invalidate such an
entry, leaving these decisions to the user space control plane. This is
needed for EVPN multi-homing where a neighbor entry for a multi-homed
host needs to be synced across all the VTEPs among which the host is
multi-homed.

Background
==========

In a typical EVPN multi-homing setup each host is multi-homed using a
set of links called ES (Ethernet Segment, i.e., LAG) to multiple leaf
switches (VTEPs). VTEPs that are connected to the same ES are called ES
peers.

When a neighbor entry is learned on a VTEP, it is distributed to both ES
peers and remote VTEPs using EVPN MAC/IP advertisement routes. ES peers
use the neighbor entry when routing traffic towards the multi-homed host
and remote VTEPs use it for ARP/NS suppression.

Motivation
==========

If the ES link between a host and the VTEP on which the neighbor entry
was locally learned goes down, the EVPN MAC/IP advertisement route will
be withdrawn and the neighbor entries will be removed from both ES peers
and remote VTEPs. Routing towards the multi-homed host and ARP/NS
suppression can fail until another ES peer locally learns the neighbor
entry and distributes it via an EVPN MAC/IP advertisement route.

"draft-rbickhart-evpn-ip-mac-proxy-adv-03" [1] suggests avoiding these
intermittent failures by having the ES peers install the neighbor
entries as before, but also injecting EVPN MAC/IP advertisement routes
with a proxy indication. When the previously mentioned ES link goes down
and the original EVPN MAC/IP advertisement route is withdrawn, the ES
peers will not withdraw their neighbor entries, but instead start aging
timers for the proxy indication.

If an ES peer locally learns the neighbor entry (i.e., it becomes
"reachable"), it will restart its aging timer for the entry and emit an
EVPN MAC/IP advertisement route without a proxy indication. An ES peer
will stop its aging timer for the proxy indication if it observes the
removal of the proxy indication from at least one of the ES peers
advertising the entry.

In the event that the aging timer for the proxy indication expired, an
ES peer will withdraw its EVPN MAC/IP advertisement route. If the timer
expired on all ES peers and they all withdrew their proxy
advertisements, the neighbor entry will be completely removed from the
EVPN fabric.

Implementation
==============

In the above scheme, when the control plane (e.g., FRR) advertises a
neighbor entry with a proxy indication, it expects the corresponding
entry in the data plane (i.e., the kernel) to remain valid and not be
removed due to garbage collection or loss of carrier. The control plane
also expects the kernel to notify it if the entry was learned locally
(i.e., became "reachable") so that it will remove the proxy indication
from the EVPN MAC/IP advertisement route. That is why these entries
cannot be programmed with dummy states such as "permanent" or "noarp".

Instead, add a new neighbor flag ("extern_valid") which indicates that
the entry was learned and determined to be valid externally and should
not be removed or invalidated by the kernel. The kernel can probe the
entry and notify user space when it becomes "reachable" (it is initially
installed as "stale"). However, if the kernel does not receive a
confirmation, have it return the entry to the "stale" state instead of
the "failed" state.

In other words, an entry marked with the "extern_valid" flag behaves
like any other dynamically learned entry other than the fact that the
kernel cannot remove or invalidate it.

One can argue that the "extern_valid" flag should not prevent garbage
collection and that instead a neighbor entry should be programmed with
both the "extern_valid" and "extern_learn" flags. There are two reasons
for not doing that:

1. Unclear why a control plane would like to program an entry that the
   kernel cannot invalidate but can completely remove.

2. The "extern_learn" flag is used by FRR for neighbor entries learned
   on remote VTEPs (for ARP/NS suppression) whereas here we are
   concerned with local entries. This distinction is currently irrelevant
   for the kernel, but might be relevant in the future.

Given that the flag only makes sense when the neighbor has a valid
state, reject attempts to add a neighbor with an invalid state and with
this flag set. For example:

 # ip neigh add 192.0.2.1 nud none dev br0.10 extern_valid
 Error: Cannot create externally validated neighbor with an invalid state.
 # ip neigh add 192.0.2.1 lladdr 00:11:22:33:44:55 nud stale dev br0.10 extern_valid
 # ip neigh replace 192.0.2.1 nud failed dev br0.10 extern_valid
 Error: Cannot mark neighbor as externally validated with an invalid state.

The above means that a neighbor cannot be created with the
"extern_valid" flag and flags such as "use" or "managed" as they result
in a neighbor being created with an invalid state ("none") and
immediately getting probed:

 # ip neigh add 192.0.2.1 lladdr 00:11:22:33:44:55 nud stale dev br0.10 extern_valid use
 Error: Cannot create externally validated neighbor with an invalid state.

However, these flags can be used together with "extern_valid" after the
neighbor was created with a valid state:

 # ip neigh add 192.0.2.1 lladdr 00:11:22:33:44:55 nud stale dev br0.10 extern_valid
 # ip neigh replace 192.0.2.1 lladdr 00:11:22:33:44:55 nud stale dev br0.10 extern_valid use

One consequence of preventing the kernel from invalidating a neighbor
entry is that by default it will only try to determine reachability
using unicast probes. This can be changed using the "mcast_resolicit"
sysctl:

 # sysctl net.ipv4.neigh.br0/10.mcast_resolicit
 0
 # tcpdump -nn -e -i br0.10 -Q out arp &
 # ip neigh replace 192.0.2.1 lladdr 00:11:22:33:44:55 nud stale dev br0.10 extern_valid use
 62:50:1d:11:93:6f > 00:11:22:33:44:55, ethertype ARP (0x0806), length 42: Request who-has 192.0.2.1 tell 192.0.2.2, length 28
 62:50:1d:11:93:6f > 00:11:22:33:44:55, ethertype ARP (0x0806), length 42: Request who-has 192.0.2.1 tell 192.0.2.2, length 28
 62:50:1d:11:93:6f > 00:11:22:33:44:55, ethertype ARP (0x0806), length 42: Request who-has 192.0.2.1 tell 192.0.2.2, length 28
 # sysctl -wq net.ipv4.neigh.br0/10.mcast_resolicit=3
 # ip neigh replace 192.0.2.1 lladdr 00:11:22:33:44:55 nud stale dev br0.10 extern_valid use
 62:50:1d:11:93:6f > 00:11:22:33:44:55, ethertype ARP (0x0806), length 42: Request who-has 192.0.2.1 tell 192.0.2.2, length 28
 62:50:1d:11:93:6f > 00:11:22:33:44:55, ethertype ARP (0x0806), length 42: Request who-has 192.0.2.1 tell 192.0.2.2, length 28
 62:50:1d:11:93:6f > 00:11:22:33:44:55, ethertype ARP (0x0806), length 42: Request who-has 192.0.2.1 tell 192.0.2.2, length 28
 62:50:1d:11:93:6f > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 42: Request who-has 192.0.2.1 tell 192.0.2.2, length 28
 62:50:1d:11:93:6f > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 42: Request who-has 192.0.2.1 tell 192.0.2.2, length 28
 62:50:1d:11:93:6f > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 42: Request who-has 192.0.2.1 tell 192.0.2.2, length 28

iproute2 patches can be found here [2].

[1] https://datatracker.ietf.org/doc/html/draft-rbickhart-evpn-ip-mac-proxy-adv-03
[2] https://github.com/idosch/iproute2/tree/submit/extern_valid_v1

Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://patch.msgid.link/20250626073111.244534-2-idosch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-30 18:14:23 -07:00
Eric Dumazet
cf56a98202 tcp: remove inet_rtx_syn_ack()
inet_rtx_syn_ack() is a simple wrapper around tcp_rtx_synack(),
if we move req->num_retrans update.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250626153017.2156274-3-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-27 15:34:19 -07:00
Eric Dumazet
8d68411a12 tcp: remove rtx_syn_ack field
Now inet_rtx_syn_ack() is only used by TCP, it can directly
call tcp_rtx_synack() instead of using an indirect call
to req->rsk_ops->rtx_syn_ack().

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250626153017.2156274-2-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-27 15:34:18 -07:00
Jakub Kicinski
28aa52b618 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR (net-6.16-rc4).

Conflicts:

Documentation/netlink/specs/mptcp_pm.yaml
  9e6dd4c256 ("netlink: specs: mptcp: replace underscores with dashes in names")
  ec362192aa ("netlink: specs: fix up indentation errors")
https://lore.kernel.org/20250626122205.389c2cd4@canb.auug.org.au

Adjacent changes:

Documentation/netlink/specs/fou.yaml
  791a9ed0a4 ("netlink: specs: fou: replace underscores with dashes in names")
  880d43ca9a ("netlink: specs: clean up spaces in brackets")

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-26 10:40:50 -07:00
Yue Haibing
4b70e2a069 net/sched: Remove unused functions
Since commit c54e1d920f ("flow_offload: add ops to tc_action_ops for
flow action setup") these are unused.

Signed-off-by: Yue Haibing <yuehaibing@huawei.com>
Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
Link: https://patch.msgid.link/20250624014327.3686873-1-yuehaibing@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-25 15:28:08 -07:00
Jakub Kicinski
ab4eb6a25d The usual features/cleanups/etc., notably:
- rtw88: IBSS mode for SDIO devices
  - rtw89:
    - BT coex for MLO/WiFi7
    - work on station + P2P concurrency
  - ath: fix W=2 export.h warnings
  - ath12k: fix scan on multi-radio devices
  - cfg80211/mac80211: MLO statistics
  - mac80211: S1G aggregation
  - cfg80211/mac80211: per-radio RTS threshold
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEpeA8sTs3M8SN2hR410qiO8sPaAAFAmhb5KsACgkQ10qiO8sP
 aAC8mQ//RPsycyRQ3XRf3E1OsGV61M1iJJRvdmNtW7GMi6Ztc96xkf1oGHQRH2x1
 lQ6PiDGEG9dvtQNvE2rS1vtgsFAa0pWB4+pD2t+mIs3umwxGI8OhgwgSIY1GYTwe
 Jq9IqsxSFMc38pI4RPVL2TgKJ81W3Ak51DAbCBxP5Lo2vFyFYe6toVc8+88WyPTX
 +hhY5cw2WSDwWiRQ9BPTgs5+LKRnIVh8xmX0I69bUX8zZ0MxU2z0UGzjVQOjywDK
 w89jdqwwbMVMIL8Ti2foIfI8UC4N+KwDeMB7HLJhkTEf8B4VScd8aeSFIOtjz/j0
 V20fegLfJ31DvsXWI5I2ia7DKDnMRA3Kb1XZzsS5qeY7jTolLMUS9c96fw7SUfVd
 P3clV5H1kKretnzxKijUbrG2MgHsuIzX3x6GXWtuZEcVyrovh9wSz08B4/DbfPe7
 PSqVug+GOnH5YV7RpP9jByvY86MT846O/V+MYi+CkrwgMATRyX0jA6EXUVe26kzT
 mL3Xr68aSUKLxkLgiCWSdzQw7e+YF56JbpcbJQOq0Z16CIOw1k84r62kBUFzNB3k
 tG5InbFYAGiNJu2LmhFFZN8grD1TrDYKuzmgBoia48TpI2gwlwcW3kCWfMfty25L
 IJ4p+SAsTvnYpbLesWQHDZ0bouQCKPouB5NUqVHFNhBCIBMo634=
 =16yP
 -----END PGP SIGNATURE-----

Merge tag 'wireless-next-2025-06-25' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next

Johannes Berg says:

====================
The usual features/cleanups/etc., notably:

 - rtw88: IBSS mode for SDIO devices
 - rtw89:
   - BT coex for MLO/WiFi7
   - work on station + P2P concurrency
 - ath: fix W=2 export.h warnings
 - ath12k: fix scan on multi-radio devices
 - cfg80211/mac80211: MLO statistics
 - mac80211: S1G aggregation
 - cfg80211/mac80211: per-radio RTS threshold

* tag 'wireless-next-2025-06-25' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next: (171 commits)
  wifi: iwlwifi: dvm: fix potential overflow in rs_fill_link_cmd()
  iwlwifi: Add missing check for alloc_ordered_workqueue
  wifi: iwlwifi: Fix memory leak in iwl_mvm_init()
  iwlwifi: api: delete repeated words
  iwlwifi: remove unused no_sleep_autoadjust declaration
  iwlwifi: Fix comment typo
  iwlwifi: use DECLARE_BITMAP macro
  iwlwifi: fw: simplify the iwl_fw_dbg_collect_trig()
  wifi: iwlwifi: mld: ftm: fix switch end indentation
  MAINTAINERS: update iwlwifi git link
  wifi: iwlwifi: pcie: fix non-MSIX handshake register
  wifi: iwlwifi: mld: don't exit EMLSR when we shouldn't
  wifi: iwlwifi: move _iwl_trans_set_bits_mask utilities
  wifi: iwlwifi: mld: make iwl_mld_add_all_rekeys void
  wifi: iwlwifi: move iwl_trans_pcie_write_mem to iwl-trans.c
  wifi: iwlwifi: pcie: move iwl_trans_pcie_dump_regs() to utils.c
  wifi: iwlwifi: mld: advertise support for TTLM changes
  wifi: iwlwifi: mld: Block EMLSR when scanning on P2P Device
  wifi: iwlwifi: mld: use the correct struct size for tracing
  wifi: iwlwifi: support RZL platform device ID
  ...
====================

Link: https://patch.msgid.link/20250625120135.41933-55-johannes@sipsolutions.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-25 10:28:13 -07:00
Paolo Abeni
c9e78afa68 udp_tunnel: fix deadlock in udp_tunnel_nic_set_port_priv()
While configuring a vxlan tunnel in a system with a i40e NIC driver, I
observe the following deadlock:

 WARNING: possible recursive locking detected
 6.16.0-rc2.net-next-6.16_92d87230d899+ #13 Tainted: G            E
 --------------------------------------------
 kworker/u256:4/1125 is trying to acquire lock:
 ffff88921ab9c8c8 (&utn->lock){+.+.}-{4:4}, at: i40e_udp_tunnel_set_port (/home/pabeni/net-next/include/net/udp_tunnel.h:343 /home/pabeni/net-next/drivers/net/ethernet/intel/i40e/i40e_main.c:13013) i40e

 but task is already holding lock:
 ffff88921ab9c8c8 (&utn->lock){+.+.}-{4:4}, at: udp_tunnel_nic_device_sync_work (/home/pabeni/net-next/net/ipv4/udp_tunnel_nic.c:739) udp_tunnel

 other info that might help us debug this:
  Possible unsafe locking scenario:

        CPU0
        ----
   lock(&utn->lock);
   lock(&utn->lock);

  *** DEADLOCK ***

  May be due to missing lock nesting notation

 4 locks held by kworker/u256:4/1125:
 #0: ffff8892910ca158 ((wq_completion)udp_tunnel_nic){+.+.}-{0:0}, at: process_one_work (/home/pabeni/net-next/kernel/workqueue.c:3213)
 #1: ffffc900244efd30 ((work_completion)(&utn->work)){+.+.}-{0:0}, at: process_one_work (/home/pabeni/net-next/kernel/workqueue.c:3214)
 #2: ffffffff9a14e290 (rtnl_mutex){+.+.}-{4:4}, at: udp_tunnel_nic_device_sync_work (/home/pabeni/net-next/net/ipv4/udp_tunnel_nic.c:737) udp_tunnel
 #3: ffff88921ab9c8c8 (&utn->lock){+.+.}-{4:4}, at: udp_tunnel_nic_device_sync_work (/home/pabeni/net-next/net/ipv4/udp_tunnel_nic.c:739) udp_tunnel

 stack backtrace:
 Hardware name: Dell Inc. PowerEdge R7525/0YHMCJ, BIOS 2.2.5 04/08/2021
i
 Call Trace:
  <TASK>
 dump_stack_lvl (/home/pabeni/net-next/lib/dump_stack.c:123)
 print_deadlock_bug (/home/pabeni/net-next/kernel/locking/lockdep.c:3047)
 validate_chain (/home/pabeni/net-next/kernel/locking/lockdep.c:3901)
 __lock_acquire (/home/pabeni/net-next/kernel/locking/lockdep.c:5240)
 lock_acquire.part.0 (/home/pabeni/net-next/kernel/locking/lockdep.c:473 /home/pabeni/net-next/kernel/locking/lockdep.c:5873)
 __mutex_lock (/home/pabeni/net-next/kernel/locking/mutex.c:604 /home/pabeni/net-next/kernel/locking/mutex.c:747)
 i40e_udp_tunnel_set_port (/home/pabeni/net-next/include/net/udp_tunnel.h:343 /home/pabeni/net-next/drivers/net/ethernet/intel/i40e/i40e_main.c:13013) i40e
 udp_tunnel_nic_device_sync_by_port (/home/pabeni/net-next/net/ipv4/udp_tunnel_nic.c:230 /home/pabeni/net-next/net/ipv4/udp_tunnel_nic.c:249) udp_tunnel
 __udp_tunnel_nic_device_sync.part.0 (/home/pabeni/net-next/net/ipv4/udp_tunnel_nic.c:292) udp_tunnel
 udp_tunnel_nic_device_sync_work (/home/pabeni/net-next/net/ipv4/udp_tunnel_nic.c:742) udp_tunnel
 process_one_work (/home/pabeni/net-next/kernel/workqueue.c:3243)
 worker_thread (/home/pabeni/net-next/kernel/workqueue.c:3315 /home/pabeni/net-next/kernel/workqueue.c:3402)
 kthread (/home/pabeni/net-next/kernel/kthread.c:464)

AFAICS all the existing callsites of udp_tunnel_nic_set_port_priv() are
already under the utn lock scope, avoid (re-)acquiring it in such a
function.

Fixes: 1ead750109 ("udp_tunnel: remove rtnl_lock dependency")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/95a827621ec78c12d1564ec3209e549774f9657d.1750675978.git.pabeni@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-24 16:31:36 -07:00
Lachlan Hodges
2a8a6b7c4c wifi: mac80211: handle station association response with S1G
Add support for updating the stations S1G capabilities when
an S1G association occurs.

Signed-off-by: Lachlan Hodges <lachlan.hodges@morsemicro.com>
Link: https://patch.msgid.link/20250617080610.756048-3-lachlan.hodges@morsemicro.com
[remove unused S1G_CAP3_MAX_MPDU_LEN_3895/_7791]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-06-24 15:19:28 +02:00
Lachlan Hodges
5ea255673c wifi: cfg80211: support configuration of S1G station capabilities
Currently there is no support for initialising a peers S1G capabilities,
this patch adds support for configuring an S1G station.

Signed-off-by: Lachlan Hodges <lachlan.hodges@morsemicro.com>
Link: https://patch.msgid.link/20250617080610.756048-2-lachlan.hodges@morsemicro.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-06-24 15:19:28 +02:00
Roopni Devanathan
264637941c wifi: cfg80211: Add Support to Set RTS Threshold for each Radio
Currently, setting RTS threshold is based on per-phy basis, i.e., all the
radios present in a wiphy will take RTS threshold value to be the one sent
from userspace. But each radio in a multi-radio wiphy can have different
RTS threshold requirements.

To extend support to set RTS threshold for each radio, get the radio for
which RTS threshold needs to be changed from the user. Use the attribute
in NL - NL80211_ATTR_WIPHY_RADIO_INDEX, to identify the radio of interest.
Create a new structure - wiphy_radio_cfg and add rts_threshold in it as a
u32 value to store RTS threshold of each radio in a wiphy and allocate
memory for it during wiphy register based on the wiphy.n_radio updated by
drivers. Pass radio id received from the user to mac80211 drivers along
with its corresponding RTS threshold.

Signed-off-by: Roopni Devanathan <quic_rdevanat@quicinc.com>
Link: https://patch.msgid.link/20250615082312.619639-3-quic_rdevanat@quicinc.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-06-24 15:19:27 +02:00
Roopni Devanathan
b74947b4f6 wifi: cfg80211/mac80211: Add support to get radio index
Currently, per-radio attributes are set on per-phy basis, i.e., all the
radios present in a wiphy will take attributes values sent from user. But
each radio in a wiphy can get different values from userspace based on
its requirement.

To extend support to set per-radio attributes, add support to get radio
index from userspace. Add an NL attribute - NL80211_ATTR_WIPHY_RADIO_INDEX,
to get user specified radio index for which attributes should be changed.
Pass this to individual drivers, so that the drivers can use this radio
index to change per-radio attributes when necessary. Currently, per-radio
attributes identified are:
NL80211_ATTR_WIPHY_TX_POWER_LEVEL
NL80211_ATTR_WIPHY_ANTENNA_TX
NL80211_ATTR_WIPHY_ANTENNA_RX
NL80211_ATTR_WIPHY_RETRY_SHORT
NL80211_ATTR_WIPHY_RETRY_LONG
NL80211_ATTR_WIPHY_FRAG_THRESHOLD
NL80211_ATTR_WIPHY_RTS_THRESHOLD
NL80211_ATTR_WIPHY_COVERAGE_CLASS
NL80211_ATTR_TXQ_LIMIT
NL80211_ATTR_TXQ_MEMORY_LIMIT
NL80211_ATTR_TXQ_QUANTUM

By default, the radio index is set to -1. This means the attribute should
be treated as a global configuration. If the user has not specified any
index, then the radio index passed to individual drivers would be -1. This
would indicate that the attribute applies to all radios in that wiphy.

Signed-off-by: Roopni Devanathan <quic_rdevanat@quicinc.com>
Link: https://patch.msgid.link/20250615082312.619639-2-quic_rdevanat@quicinc.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-06-24 15:19:27 +02:00
Sarika Sharma
4cb1ce7e25 wifi: mac80211: add link_sta_statistics ops to fill link station statistics
Currently, link station statistics for MLO are filled by mac80211.
But there are some statistics that kept by mac80211 might not be
accurate, so let the driver pre-fill the link statistics. The driver
can fill the values (indicating which field is filled, by setting the
filled bitmapin in link_station structure).
Statistics that driver don't fill are filled by mac80211.

Hence, add link_sta_statistics callback to fill link station statistics
for MLO in sta_set_link_sinfo() by drivers.

Signed-off-by: Sarika Sharma <quic_sarishar@quicinc.com>
Link: https://patch.msgid.link/20250528054420.3050133-11-quic_sarishar@quicinc.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-06-24 15:19:27 +02:00
Sarika Sharma
505991fba9 wifi: mac80211: extend support to fill link level sinfo structure
Currently, sinfo structure is supported to fill information at
deflink( or one of the links) level for station. This has problems
when applied to fetch multi-link(ML) station information.

Hence, if valid_links are present, support filling link_station
structure for each link.

This will be helpful to check the link related statistics during MLO.

Additionally, TXQ stats for pertid are applicable at station level
not at link level. Therefore check link_id is less then 0, before
filling TXQ stats in pertid stats.

Signed-off-by: Sarika Sharma <quic_sarishar@quicinc.com>
Link: https://patch.msgid.link/20250528054420.3050133-9-quic_sarishar@quicinc.com
[fix some indentation]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-06-24 15:19:26 +02:00
Sarika Sharma
49e47223ec wifi: cfg80211: allocate memory for link_station info structure
Currently, station_info structure is passed to fill station statistics
from mac80211/drivers. After NL message send to user space for requested
station statistics, memory for station statistics is freed in cfg80211.
Therefore, memory allocation/free for link station statistics should
also happen in cfg80211 only.

Hence, allocate the memory for link_station structure for all
possible links and free in cfg80211_sinfo_release_content().

Signed-off-by: Sarika Sharma <quic_sarishar@quicinc.com>
Link: https://patch.msgid.link/20250528054420.3050133-6-quic_sarishar@quicinc.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-06-24 15:19:26 +02:00
Sarika Sharma
d2329fff7e wifi: cfg80211: add link_station_info structure to support MLO statistics
Current implementation of NL80211_GET_STATION does not work for
multi-link operation(MLO) since in case of MLO only deflink (or one
of the links) is considered and not all links.

Therefore to support for MLO, add link_station_info structure
to account link level statistics for station.

Additionally, add valid_links in station_info structure to indicate
bitmap of valid links for MLO. This will be helpful to check the link
related statistics during MLO.

Signed-off-by: Sarika Sharma <quic_sarishar@quicinc.com>
Link: https://patch.msgid.link/20250528054420.3050133-3-quic_sarishar@quicinc.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-06-24 15:19:26 +02:00
Sarika Sharma
e581b7fe62 wifi: mac80211: add support towards MLO handling of station statistics
Currently, in supporting API's to fill sinfo structure from sta
structure, is mapped to fill the fields from sta->deflink. However,
for multi-link (ML) station, sinfo structure should be filled from
corresponding link_id.

Therefore, add  link_id as an additional argument in supporting API's
for filling sinfo structure correctly. Link_id is set to -1 for non-ML
station and corresponding link_id for ML stations. In supporting API's
for filling sinfo structure, check for link_id, if link_id < 0, fill
the sinfo structure from sta->deflink, otherwise fill from
sta->link[link_id].

Current, changes are done at the deflink level i.e, pass -1 as link_id.
Actual link_id will be added in subsequent patches to support
station statistics for MLO.

Signed-off-by: Sarika Sharma <quic_sarishar@quicinc.com>
Link: https://patch.msgid.link/20250528054420.3050133-2-quic_sarishar@quicinc.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-06-24 15:19:26 +02:00
Masahiro Yamada
7934a8dd86 module: remove meaningless 'name' parameter from __MODULE_INFO()
The symbol names in the .modinfo section are never used and already
randomized by the __UNIQUE_ID() macro.

Therefore, the second parameter of  __MODULE_INFO() is meaningless
and can be removed to simplify the code.

With this change, the symbol names in the .modinfo section will be
prefixed with __UNIQUE_ID_modinfo, making it clearer that they
originate from MODULE_INFO().

[Before]

  $ objcopy  -j .modinfo vmlinux.o modinfo.o
  $ nm -n modinfo.o | head -n10
  0000000000000000 r __UNIQUE_ID_license560
  0000000000000011 r __UNIQUE_ID_file559
  0000000000000030 r __UNIQUE_ID_description558
  0000000000000074 r __UNIQUE_ID_license580
  000000000000008e r __UNIQUE_ID_file579
  00000000000000bd r __UNIQUE_ID_description578
  00000000000000e6 r __UNIQUE_ID_license581
  00000000000000ff r __UNIQUE_ID_file580
  0000000000000134 r __UNIQUE_ID_description579
  0000000000000179 r __UNIQUE_ID_uncore_no_discover578

[After]

  $ objcopy  -j .modinfo vmlinux.o modinfo.o
  $ nm -n modinfo.o | head -n10
  0000000000000000 r __UNIQUE_ID_modinfo560
  0000000000000011 r __UNIQUE_ID_modinfo559
  0000000000000030 r __UNIQUE_ID_modinfo558
  0000000000000074 r __UNIQUE_ID_modinfo580
  000000000000008e r __UNIQUE_ID_modinfo579
  00000000000000bd r __UNIQUE_ID_modinfo578
  00000000000000e6 r __UNIQUE_ID_modinfo581
  00000000000000ff r __UNIQUE_ID_modinfo580
  0000000000000134 r __UNIQUE_ID_modinfo579
  0000000000000179 r __UNIQUE_ID_modinfo578

Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
Reviewed-by: Petr Pavlu <petr.pavlu@suse.com>
2025-06-24 20:33:31 +09:00
Eric Dumazet
935b67675a net: make sk->sk_rcvtimeo lockless
Followup of commit 285975dd67 ("net: annotate data-races around
sk->sk_{rcv|snd}timeo").

Remove lock_sock()/release_sock() from ksmbd_tcp_rcv_timeout()
and add READ_ONCE()/WRITE_ONCE() where it is needed.

Also SO_RCVTIMEO_OLD and SO_RCVTIMEO_NEW can call sock_set_timeout()
without holding the socket lock.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250620155536.335520-3-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-23 17:05:12 -07:00
Eric Dumazet
3169e36ae1 net: make sk->sk_sndtimeo lockless
Followup of commit 285975dd67 ("net: annotate data-races around
sk->sk_{rcv|snd}timeo").

Remove lock_sock()/release_sock() from sock_set_sndtimeo(),
and add READ_ONCE()/WRITE_ONCE() where it is needed.

Also SO_SNDTIMEO_OLD and SO_SNDTIMEO_NEW can call sock_set_timeout()
without holding the socket lock.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250620155536.335520-2-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-23 17:05:11 -07:00
Eric Dumazet
c51da3f7a1 net: remove sock_i_uid()
Difference between sock_i_uid() and sk_uid() is that
after sock_orphan(), sock_i_uid() returns GLOBAL_ROOT_UID
while sk_uid() returns the last cached sk->sk_uid value.

None of sock_i_uid() callers care about this.

Use sk_uid() which is much faster and inlined.

Note that diag/dump users are calling sock_i_ino() and
can not see the full benefit yet.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Lorenzo Colitti <lorenzo@google.com>
Reviewed-by: Maciej Żenczykowski <maze@google.com>
Link: https://patch.msgid.link/20250620133001.4090592-3-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-23 17:04:03 -07:00
Eric Dumazet
e84a4927a4 net: annotate races around sk->sk_uid
sk->sk_uid can be read while another thread changes its
value in sockfs_setattr().

Add sk_uid(const struct sock *sk) helper to factorize the needed
READ_ONCE() annotations, and add corresponding WRITE_ONCE()
where needed.

Fixes: 86741ec254 ("net: core: Add a UID field to struct sock.")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Lorenzo Colitti <lorenzo@google.com>
Reviewed-by: Maciej Żenczykowski <maze@google.com>
Link: https://patch.msgid.link/20250620133001.4090592-2-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-23 17:04:03 -07:00
Jens Axboe
5be5726e1a Merge branch 'timestamp-for-jens' of https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next into for-6.17/io_uring
Pull networking side timestamp prep patch from Jakub.

* 'timestamp-for-jens' of https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next:
  net: timestamp: add helper returning skb's tx tstamp
2025-06-23 08:59:38 -06:00
Kuniyuki Iwashima
1d6123102e Bluetooth: hci_core: Fix use-after-free in vhci_flush()
syzbot reported use-after-free in vhci_flush() without repro. [0]

From the splat, a thread close()d a vhci file descriptor while
its device was being used by iotcl() on another thread.

Once the last fd refcnt is released, vhci_release() calls
hci_unregister_dev(), hci_free_dev(), and kfree() for struct
vhci_data, which is set to hci_dev->dev->driver_data.

The problem is that there is no synchronisation after unlinking
hdev from hci_dev_list in hci_unregister_dev().  There might be
another thread still accessing the hdev which was fetched before
the unlink operation.

We can use SRCU for such synchronisation.

Let's run hci_dev_reset() under SRCU and wait for its completion
in hci_unregister_dev().

Another option would be to restore hci_dev->destruct(), which was
removed in commit 587ae086f6 ("Bluetooth: Remove unused
hci-destruct cb").  However, this would not be a good solution, as
we should not run hci_unregister_dev() while there are in-flight
ioctl() requests, which could lead to another data-race KCSAN splat.

Note that other drivers seem to have the same problem, for exmaple,
virtbt_remove().

[0]:
BUG: KASAN: slab-use-after-free in skb_queue_empty_lockless include/linux/skbuff.h:1891 [inline]
BUG: KASAN: slab-use-after-free in skb_queue_purge_reason+0x99/0x360 net/core/skbuff.c:3937
Read of size 8 at addr ffff88807cb8d858 by task syz.1.219/6718

CPU: 1 UID: 0 PID: 6718 Comm: syz.1.219 Not tainted 6.16.0-rc1-syzkaller-00196-g08207f42d3ff #0 PREEMPT(full)
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 05/07/2025
Call Trace:
 <TASK>
 dump_stack_lvl+0x189/0x250 lib/dump_stack.c:120
 print_address_description mm/kasan/report.c:408 [inline]
 print_report+0xd2/0x2b0 mm/kasan/report.c:521
 kasan_report+0x118/0x150 mm/kasan/report.c:634
 skb_queue_empty_lockless include/linux/skbuff.h:1891 [inline]
 skb_queue_purge_reason+0x99/0x360 net/core/skbuff.c:3937
 skb_queue_purge include/linux/skbuff.h:3368 [inline]
 vhci_flush+0x44/0x50 drivers/bluetooth/hci_vhci.c:69
 hci_dev_do_reset net/bluetooth/hci_core.c:552 [inline]
 hci_dev_reset+0x420/0x5c0 net/bluetooth/hci_core.c:592
 sock_do_ioctl+0xd9/0x300 net/socket.c:1190
 sock_ioctl+0x576/0x790 net/socket.c:1311
 vfs_ioctl fs/ioctl.c:51 [inline]
 __do_sys_ioctl fs/ioctl.c:907 [inline]
 __se_sys_ioctl+0xf9/0x170 fs/ioctl.c:893
 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
 do_syscall_64+0xfa/0x3b0 arch/x86/entry/syscall_64.c:94
 entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7fcf5b98e929
Code: ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 a8 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007fcf5c7b9038 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
RAX: ffffffffffffffda RBX: 00007fcf5bbb6160 RCX: 00007fcf5b98e929
RDX: 0000000000000000 RSI: 00000000400448cb RDI: 0000000000000009
RBP: 00007fcf5ba10b39 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
R13: 0000000000000000 R14: 00007fcf5bbb6160 R15: 00007ffd6353d528
 </TASK>

Allocated by task 6535:
 kasan_save_stack mm/kasan/common.c:47 [inline]
 kasan_save_track+0x3e/0x80 mm/kasan/common.c:68
 poison_kmalloc_redzone mm/kasan/common.c:377 [inline]
 __kasan_kmalloc+0x93/0xb0 mm/kasan/common.c:394
 kasan_kmalloc include/linux/kasan.h:260 [inline]
 __kmalloc_cache_noprof+0x230/0x3d0 mm/slub.c:4359
 kmalloc_noprof include/linux/slab.h:905 [inline]
 kzalloc_noprof include/linux/slab.h:1039 [inline]
 vhci_open+0x57/0x360 drivers/bluetooth/hci_vhci.c:635
 misc_open+0x2bc/0x330 drivers/char/misc.c:161
 chrdev_open+0x4c9/0x5e0 fs/char_dev.c:414
 do_dentry_open+0xdf0/0x1970 fs/open.c:964
 vfs_open+0x3b/0x340 fs/open.c:1094
 do_open fs/namei.c:3887 [inline]
 path_openat+0x2ee5/0x3830 fs/namei.c:4046
 do_filp_open+0x1fa/0x410 fs/namei.c:4073
 do_sys_openat2+0x121/0x1c0 fs/open.c:1437
 do_sys_open fs/open.c:1452 [inline]
 __do_sys_openat fs/open.c:1468 [inline]
 __se_sys_openat fs/open.c:1463 [inline]
 __x64_sys_openat+0x138/0x170 fs/open.c:1463
 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
 do_syscall_64+0xfa/0x3b0 arch/x86/entry/syscall_64.c:94
 entry_SYSCALL_64_after_hwframe+0x77/0x7f

Freed by task 6535:
 kasan_save_stack mm/kasan/common.c:47 [inline]
 kasan_save_track+0x3e/0x80 mm/kasan/common.c:68
 kasan_save_free_info+0x46/0x50 mm/kasan/generic.c:576
 poison_slab_object mm/kasan/common.c:247 [inline]
 __kasan_slab_free+0x62/0x70 mm/kasan/common.c:264
 kasan_slab_free include/linux/kasan.h:233 [inline]
 slab_free_hook mm/slub.c:2381 [inline]
 slab_free mm/slub.c:4643 [inline]
 kfree+0x18e/0x440 mm/slub.c:4842
 vhci_release+0xbc/0xd0 drivers/bluetooth/hci_vhci.c:671
 __fput+0x44c/0xa70 fs/file_table.c:465
 task_work_run+0x1d1/0x260 kernel/task_work.c:227
 exit_task_work include/linux/task_work.h:40 [inline]
 do_exit+0x6ad/0x22e0 kernel/exit.c:955
 do_group_exit+0x21c/0x2d0 kernel/exit.c:1104
 __do_sys_exit_group kernel/exit.c:1115 [inline]
 __se_sys_exit_group kernel/exit.c:1113 [inline]
 __x64_sys_exit_group+0x3f/0x40 kernel/exit.c:1113
 x64_sys_call+0x21ba/0x21c0 arch/x86/include/generated/asm/syscalls_64.h:232
 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
 do_syscall_64+0xfa/0x3b0 arch/x86/entry/syscall_64.c:94
 entry_SYSCALL_64_after_hwframe+0x77/0x7f

The buggy address belongs to the object at ffff88807cb8d800
 which belongs to the cache kmalloc-1k of size 1024
The buggy address is located 88 bytes inside of
 freed 1024-byte region [ffff88807cb8d800, ffff88807cb8dc00)

Fixes: bf18c7118c ("Bluetooth: vhci: Free driver_data on file release")
Reported-by: syzbot+2faa4825e556199361f9@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=f62d64848fc4c7c30cd6
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Acked-by: Paul Menzel <pmenzel@molgen.mpg.de>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2025-06-23 10:59:29 -04:00
Kavita Kavita
7c598c653a wifi: cfg80211: Add support for link reconfiguration negotiation offload to driver
In the case of SME-in-driver, the driver can internally choose to
update the links based on the AP MLD recommendation and do link
reconfiguration negotiation with AP MLD.
(e.g., After the driver processing the BSS Transition Management request
frame received from the AP MLD with Neighbor Report containing
Multi-Link element with recommended links information chooses to do link
reconfiguration negotiation with AP MLD).

To support this, extend cfg80211_mlo_reconf_add_done() and
NL80211_CMD_ASSOC_MLO_RECONF to indicate added links information for
driver-initiated link reconfiguration requests. For removed links,
the driver indicates links information using the
NL80211_CMD_LINKS_REMOVED event for driver-initiated cases, the same as
supplicant initiated cases.

For the driver-initiated case, cfg80211 will receive link
reconfiguration result asynchronously from driver so holding BSSes of
the accepted add links is needed in the event path. Also, no need of
unhold call for the rejected add link BSSes since there was no hold call
happened previously.

Once the supplicant receives the NL80211_CMD_ASSOC_MLO_RECONF event,
it needs to process the information about newly added links and install
per-link group keys (e.g., GTK/IGTK/BIGTK etc.).

In case of the SME-in-driver, using a vendor interface etc. to notify
the supplicant to initiate a link reconfiguration request and then
supplicant sending command to the cfg80211 can lead to race conditions.
The correct design to avoid this is that the driver indicates the
cfg80211 directly with the results of the link reconfiguration
negotiation.

Signed-off-by: Kavita Kavita <quic_kkavita@quicinc.com>
Link: https://patch.msgid.link/20250604105757.2542-3-quic_kkavita@quicinc.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-06-20 10:46:31 +02:00
Vasanthakumar Thiagarajan
df42bfc96e wifi: cfg80211: Add utility API to get radio index from channel
Add utility API cfg80211_get_radio_idx_by_chan() to retrieve the radio
index corresponding to a given channel in a multi-radio wiphy.

This utility function can be used when we want to check the radio-specific
data for a channel in a multi-radio wiphy. For example, it can help
determine the radio index required to handle a scan request. This index
can then be used to decide whether the scan can proceed without
interfering with ongoing DFS operations on another radio.

Signed-off-by: Vasanthakumar Thiagarajan <vasanthakumar.thiagarajan@oss.qualcomm.com>
Co-developed-by: Raj Kumar Bhagat <quic_rajkbhag@quicinc.com>
Signed-off-by: Raj Kumar Bhagat <quic_rajkbhag@quicinc.com>
Link: https://patch.msgid.link/20250527-mlo-dfs-acs-v2-1-92c2f37c81d9@quicinc.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-06-20 10:44:05 +02:00
Nicolas Escande
c7d78566bb neighbour: add support for NUD_PERMANENT proxy entries
As discussesd before in [0] proxy entries (which are more configuration
than runtime data) should stay when the link (carrier) goes does down.
This is what happens for regular neighbour entries.

So lets fix this by:
  - storing in proxy entries the fact that it was added as NUD_PERMANENT
  - not removing NUD_PERMANENT proxy entries when the carrier goes down
    (same as how it's done in neigh_flush_dev() for regular neigh entries)

[0]: https://lore.kernel.org/netdev/c584ef7e-6897-01f3-5b80-12b53f7b4bf4@kernel.org/

Signed-off-by: Nicolas Escande <nico.escande@gmail.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250617141334.3724863-1-nico.escande@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-19 16:15:34 -07:00
Erni Sri Satya Vennela
ca8ac489ca net: mana: Handle unsupported HWC commands
If any of the HWC commands are not recognized by the
underlying hardware, the hardware returns the response
header status of -1. Log the information using
netdev_info_once to avoid multiple error logs in dmesg.

Signed-off-by: Erni Sri Satya Vennela <ernis@linux.microsoft.com>
Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
Reviewed-by: Shradha Gupta <shradhagupta@linux.microsoft.com>
Reviewed-by: Saurabh Singh Sengar <ssengar@linux.microsoft.com>
Reviewed-by: Dipayaan Roy <dipayanroy@linux.microsoft.com>
Link: https://patch.msgid.link/1750144656-2021-5-git-send-email-ernis@linux.microsoft.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-06-19 15:13:49 +02:00
Erni Sri Satya Vennela
a6d5edf11e net: mana: Add speed support in mana_get_link_ksettings
Allow mana ethtool get_link_ksettings operation to report
the maximum speed supported by the SKU in mbps.

The driver retrieves this information by issuing a
HWC command to the hardware via mana_query_link_cfg(),
which retrieves the SKU's maximum supported speed.

These APIs when invoked on hardware that are older/do
not support these APIs, the speed would be reported as UNKNOWN.

Before:
$ethtool enP30832s1
> Settings for enP30832s1:
        Supported ports: [  ]
        Supported link modes:   Not reported
        Supported pause frame use: No
        Supports auto-negotiation: No
        Supported FEC modes: Not reported
        Advertised link modes:  Not reported
        Advertised pause frame use: No
        Advertised auto-negotiation: No
        Advertised FEC modes: Not reported
        Speed: Unknown!
        Duplex: Full
        Auto-negotiation: off
        Port: Other
        PHYAD: 0
        Transceiver: internal
        Link detected: yes

After:
$ethtool enP30832s1
> Settings for enP30832s1:
        Supported ports: [  ]
        Supported link modes:   Not reported
        Supported pause frame use: No
        Supports auto-negotiation: No
        Supported FEC modes: Not reported
        Advertised link modes:  Not reported
        Advertised pause frame use: No
        Advertised auto-negotiation: No
        Advertised FEC modes: Not reported
        Speed: 16000Mb/s
        Duplex: Full
        Auto-negotiation: off
        Port: Other
        PHYAD: 0
        Transceiver: internal
        Link detected: yes

Signed-off-by: Erni Sri Satya Vennela <ernis@linux.microsoft.com>
Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
Reviewed-by: Shradha Gupta <shradhagupta@linux.microsoft.com>
Reviewed-by: Saurabh Singh Sengar <ssengar@linux.microsoft.com>
Reviewed-by: Long Li <longli@microsoft.com>
Link: https://patch.msgid.link/1750144656-2021-4-git-send-email-ernis@linux.microsoft.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-06-19 15:13:49 +02:00
Erni Sri Satya Vennela
75cabb4693 net: mana: Add support for net_shaper_ops
Introduce support for net_shaper_ops in the MANA driver,
enabling configuration of rate limiting on the MANA NIC.

To apply rate limiting, the driver issues a HWC command via
mana_set_bw_clamp() and updates the corresponding shaper object
in the net_shaper cache. If an error occurs during this process,
the driver restores the previous speed by querying the current link
configuration using mana_query_link_cfg().

The minimum supported bandwidth is 100 Mbps, and only values that are
exact multiples of 100 Mbps are allowed. Any other values are rejected.

To remove a shaper, the driver resets the bandwidth to the maximum
supported by the SKU using mana_set_bw_clamp() and clears the
associated cache entry. If an error occurs during this process,
the shaper details are retained.

On the hardware that does not support these APIs, the net-shaper
calls to set speed would fail.

Set the speed:
./tools/net/ynl/pyynl/cli.py \
 --spec Documentation/netlink/specs/net_shaper.yaml \
 --do set --json '{"ifindex":'$IFINDEX',
		   "handle":{"scope": "netdev", "id":'$ID' },
		   "bw-max": 200000000 }'

Get the shaper details:
./tools/net/ynl/pyynl/cli.py \
 --spec Documentation/netlink/specs/net_shaper.yaml \
 --do get --json '{"ifindex":'$IFINDEX',
		      "handle":{"scope": "netdev", "id":'$ID' }}'

> {'bw-max': 200000000,
> 'handle': {'scope': 'netdev'},
> 'ifindex': $IFINDEX,
> 'metric': 'bps'}

Delete the shaper object:
./tools/net/ynl/pyynl/cli.py \
 --spec Documentation/netlink/specs/net_shaper.yaml \
 --do delete --json '{"ifindex":'$IFINDEX',
		      "handle":{"scope": "netdev","id":'$ID' }}'

Signed-off-by: Erni Sri Satya Vennela <ernis@linux.microsoft.com>
Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
Reviewed-by: Shradha Gupta <shradhagupta@linux.microsoft.com>
Reviewed-by: Saurabh Singh Sengar <ssengar@linux.microsoft.com>
Reviewed-by: Long Li <longli@microsoft.com>
Link: https://patch.msgid.link/1750144656-2021-3-git-send-email-ernis@linux.microsoft.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-06-19 15:13:49 +02:00
David Arinzon
cea465a96a devlink: Add new "enable_phc" generic device param
Add a new device generic parameter to enable/disable the
PHC (PTP Hardware Clock) functionality in the device associated
with the devlink instance.

Signed-off-by: David Arinzon <darinzon@amazon.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Link: https://patch.msgid.link/20250617110545.5659-6-darinzon@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-18 18:57:29 -07:00
Stanislav Fomichev
1ead750109 udp_tunnel: remove rtnl_lock dependency
Drivers that are using ops lock and don't depend on RTNL lock
still need to manage it because udp_tunnel's RTNL dependency.
Introduce new udp_tunnel_nic_lock and use it instead of
rtnl_lock. Drop non-UDP_TUNNEL_NIC_INFO_MAY_SLEEP mode from
udp_tunnel infra (udp_tunnel_nic_device_sync_work needs to
grab udp_tunnel_nic_lock mutex and might sleep).

Cover more places in v4:

- netlink
  - udp_tunnel_notify_add_rx_port (ndo_open)
    - triggers udp_tunnel_nic_device_sync_work
  - udp_tunnel_notify_del_rx_port (ndo_stop)
    - triggers udp_tunnel_nic_device_sync_work
  - udp_tunnel_get_rx_info (__netdev_update_features)
    - triggers NETDEV_UDP_TUNNEL_PUSH_INFO
  - udp_tunnel_drop_rx_info (__netdev_update_features)
    - triggers NETDEV_UDP_TUNNEL_DROP_INFO
  - udp_tunnel_nic_reset_ntf (ndo_open)

- notifiers
  - udp_tunnel_nic_netdevice_event, depending on the event:
    - triggers NETDEV_UDP_TUNNEL_PUSH_INFO
    - triggers NETDEV_UDP_TUNNEL_DROP_INFO

- ethnl_tunnel_info_reply_size
- udp_tunnel_nic_set_port_priv (two intel drivers)

Cc: Michael Chan <michael.chan@broadcom.com>
Suggested-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Stanislav Fomichev <stfomichev@gmail.com>
Link: https://patch.msgid.link/20250616162117.287806-4-stfomichev@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-18 18:53:51 -07:00
Yue Haibing
a33556940b tcp: Remove inet_hashinfo2_free_mod()
DCCP was removed, inet_hashinfo2_free_mod() is unused now.

Signed-off-by: Yue Haibing <yuehaibing@huawei.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250617130613.498659-1-yuehaibing@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-18 18:29:58 -07:00
Jakub Kicinski
189bd9c873 Merge branch '200GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue
Tony Nguyen says:

====================
libeth: add libeth_xdp helper lib

Alexander Lobakin says:

Time to add XDP helpers infra to libeth to greatly simplify adding
XDP to idpf and iavf, as well as improve and extend XDP in ice and
i40e. Any vendor is free to reuse helpers. If this happens, I'm fine
with moving the folder of out intel/.

The helpers greatly simplify building xdp_buff, running a prog,
handling the verdict, implement XDP_TX, .ndo_xdp_xmit, XDP buffer
completion. Same applies to XSk (with XSk xmit instead of
.ndo_xdp_xmit, plus stuff like XSk wakeup).
They are entirely generic with no HW definitions or assumptions.
HW-specific stuff like parsing Rx desc / filling Tx desc is passed
from the driver as inline callbacks.

For now, key assumptions that optimize performance / avoid code
bloat, but might not fit every driver in driver/net/:
 * netmem holding the buffers are always order-0;
 * driver has separate XDP Tx queues, doesn't use stack queues for
   that. For best efficiency, you may want to have nr_cpu_ids XDP
   queues, but less (queue sharing) is also supported;
 * XDP Tx queues are interrupt-less and use "lazy" cleaning only
   when there are less than 1/4 free Tx descriptors of the queue
   size;
 * main target platforms are 64-bit, although 32-bit is also fully
   supported, but the code might be not as optimized for them.

Library code already supports multi-buffer for all kinds of Tx and
both header split and no split for Rx and Tx. Frags can come from
devmem/io_uring etc., direct `struct page *` is used only for header
buffers for which it's always true.
Drivers are free to pass their own Rx hints and XSK xmit hints ops.

XDP_TX and ndo_xdp_xmit use onstack bulk for the frames to be sent
and send them by batches of 16 buffers. This eats ~280 bytes on the
stack, but gives good boosts and allow to greatly optimize the main
sending function leaving it without any error/exception paths.

XSk xmit fills Tx descriptors in the loop unrolled by 8. This was
proven to improve perf on ice and i40e. XDP_TX and ndo_xdp_xmit
doesn't use unrolling as I wasn't able to get any improvements in
those scenenarios from this, while +1 Kb for their sending functions
for nothing doesn't sound reasonable.

XSk wakeup, instead of traditionally used "SW interrupts" provided
by NICs, uses IPI to schedule NAPI on the CPU corresponding to the
given queue pair. It gives better control over CPU distribution and
in general performs way better than "SW interrupts", plus allows us
to not pass any HW-specific callbacks there.

The code is built the way that all callbacks passed from drivers
get inlined; in general, most of hotpath gets inlined. Everything
slow/exception lands to .c files in the libeth folder, doesn't
create copies in the drivers themselves and doesn't overloat
hotpath.
Sure, inlining means that hotpath will be compiled into every driver
that uses the lib, but the core code is written in one place, so no
copying of bugs happens. Fixed once -- works everywhere.

The last commit might look like sorta hack, but it gives really good
boosts and decreases object code size, plus there are checks that
all those wider accesses are fully safe, so I don't feel anything
bad about it.

An example of using libeth_xdp can be found either on my GitHub or
on the mailing lists here ("XDP for idpf"). Macros for building
driver XDP functions lead to that some implementations (XDP_TX,
ndo_xdp_xmit etc.) consist of really only a few lines.

* '200GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue:
  libeth: xdp, xsk: access adjacent u32s as u64 where applicable
  libeth: xsk: add XSkFQ refill and XSk wakeup helpers
  libeth: xsk: add XSk Rx processing support
  libeth: xsk: add XSk xmit functions
  libeth: xsk: add XSk XDP_TX sending helpers
  libeth: xdp: add RSS hash hint and XDP features setup helpers
  libeth: xdp: add templates for building driver-side callbacks
  libeth: xdp: add XDP prog run and verdict result handling
  libeth: xdp: add helpers for preparing/processing &libeth_xdp_buff
  libeth: xdp: add XDPSQ cleanup timers
  libeth: xdp: add XDPSQ locking helpers
  libeth: xdp: add XDPSQE completion helpers
  libeth: xdp: add .ndo_xdp_xmit() helpers
  libeth: xdp: add XDP_TX buffers sending
  libeth: support native XDP and register memory model
  libeth: convert to netmem
  libeth, libie: clean symbol exports up a little
====================

Link: https://patch.msgid.link/20250616201639.710420-1-anthony.l.nguyen@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-17 18:50:57 -07:00
Dragos Tatulea
a202f24b08 page_pool: Add page_pool_dev_alloc_netmems helper
This is the netmem counterpart of page_pool_dev_alloc_pages() which
uses the default GFP flags for RX.

Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Reviewed-by: Mina Almasry <almasrymina@google.com>
Signed-off-by: Mark Bloch <mbloch@nvidia.com>
Link: https://patch.msgid.link/20250616141441.1243044-4-mbloch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-17 18:34:12 -07:00
Dragos Tatulea
c9e1225352 net: Allow const args for of page_to_netmem()
This allows calling page_to_netmem() with a const page * argument.

Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Reviewed-by: Mina Almasry <almasrymina@google.com>
Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
Signed-off-by: Mark Bloch <mbloch@nvidia.com>
Link: https://patch.msgid.link/20250616141441.1243044-2-mbloch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-17 18:34:11 -07:00
Tejun Heo
fd0406e5ca net: tcp: tsq: Convert from tasklet to BH workqueue
The only generic interface to execute asynchronously in the BH context is
tasklet; however, it's marked deprecated and has some design flaws. To
replace tasklets, BH workqueue support was recently added. A BH workqueue
behaves similarly to regular workqueues except that the queued work items
are executed in the BH context.

This patch converts TCP Small Queues implementation from tasklet to BH
workqueue.

Semantically, this is an equivalent conversion and there shouldn't be any
user-visible behavior changes. While workqueue's queueing and execution
paths are a bit heavier than tasklet's, unless the work item is being queued
every packet, the difference hopefully shouldn't matter.

My experience with the networking stack is very limited and this patch
definitely needs attention from someone who actually understands networking.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Cc: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/aFBeJ38AS1ZF3Dq5@slm.duckdns.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-17 18:29:21 -07:00
Petr Machata
f8337efa4f vxlan: Support MC routing in the underlay
Locally-generated MC packets have so far not been subject to MC routing.
Instead an MC-enabled installation would maintain the MC routing tables,
and separately from that the list of interfaces to send packets to as part
of the VXLAN FDB and MDB.

In a previous patch, a ip_mr_output() and ip6_mr_output() routines were
added for IPv4 and IPv6. All locally generated MC traffic is now passed
through these functions. For reasons of backward compatibility, an SKB
(IPCB / IP6CB) flag guards the actual MC routing.

This patch adds logic to set the flag, and the UAPI to enable the behavior.

Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://patch.msgid.link/d899655bb7e9b2521ee8c793e67056b9fd02ba12.1750113335.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-17 18:18:46 -07:00
Petr Machata
f78c75d84f net: ipv6: Add a flags argument to ip6tunnel_xmit(), udp_tunnel6_xmit_skb()
ip6tunnel_xmit() erases the contents of the SKB control block. In order to
be able to set particular IP6CB flags on the SKB, add a corresponding
parameter, and propagate it to udp_tunnel6_xmit_skb() as well.

In one of the following patches, VXLAN driver will use this facility to
mark packets as subject to IPv6 multicast routing.

Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://patch.msgid.link/acb4f9f3e40c3a931236c3af08a720b017fbfbfb.1750113335.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-17 18:18:45 -07:00
Petr Machata
6a7d88ca15 net: ipv6: Make udp_tunnel6_xmit_skb() void
The function always returns zero, thus the return value does not carry any
signal. Just make it void.

Most callers already ignore the return value. However:

- Refold arguments of the call from sctp_v6_xmit() so that they fit into
  the 80-column limit.

- tipc_udp_xmit() initializes err from the return value, but that should
  already be always zero at that point. So there's no practical change, but
  elision of the assignment prompts a couple more tweaks to clean up the
  function.

Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://patch.msgid.link/7facacf9d8ca3ca9391a4aee88160913671b868d.1750113335.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-17 18:18:45 -07:00
Petr Machata
35bec72a24 net: ipv4: Add ip_mr_output()
Multicast routing is today handled in the input path. Locally generated MC
packets don't hit the IPMR code today. Thus if a VXLAN remote address is
multicast, the driver needs to set an OIF during route lookup. Thus MC
routing configuration needs to be kept in sync with the VXLAN FDB and MDB.
Ideally, the VXLAN packets would be routed by the MC routing code instead.

To that end, this patch adds support to route locally generated multicast
packets. The newly-added routines do largely what ip_mr_input() and
ip_mr_forward() do: make an MR cache lookup to find where to send the
packets, and use ip_mc_output() to send each of them. When no cache entry
is found, the packet is punted to the daemon for resolution.

However, an installation that uses a VXLAN underlay netdevice for which it
also has matching MC routes, would get a different routing with this patch.
Previously, the MC packets would be delivered directly to the underlay
port, whereas now they would be MC-routed. In order to avoid this change in
behavior, introduce an IPCB flag. Only if the flag is set will
ip_mr_output() actually engage, otherwise it reverts to ip_mc_output().

This code is based on work by Roopa Prabhu and Nikolay Aleksandrov.

Signed-off-by: Roopa Prabhu <roopa@nvidia.com>
Signed-off-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: Benjamin Poirier <bpoirier@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Link: https://patch.msgid.link/0aadbd49330471c0f758d54afb05eb3b6e3a6b65.1750113335.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-17 18:18:45 -07:00
Petr Machata
e3411e326f net: ipv4: Add a flags argument to iptunnel_xmit(), udp_tunnel_xmit_skb()
iptunnel_xmit() erases the contents of the SKB control block. In order to
be able to set particular IPCB flags on the SKB, add a corresponding
parameter, and propagate it to udp_tunnel_xmit_skb() as well.

In one of the following patches, VXLAN driver will use this facility to
mark packets as subject to IP multicast routing.

Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Acked-by: Antonio Quartulli <antonio@openvpn.net>
Link: https://patch.msgid.link/89c9daf9f2dc088b6b92ccebcc929f51742de91f.1750113335.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-17 18:18:44 -07:00
Mina Almasry
0f66b616b8 netmem: fix netmem comments
Trivial fix to a couple of outdated netmem comments. No code changes,
just more accurately describing current code.

Signed-off-by: Mina Almasry <almasrymina@google.com>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20250615203511.591438-1-almasrymina@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-17 18:00:26 -07:00
Álvaro Fernández Rojas
ef07df397a net: dsa: tag_brcm: add support for legacy FCS tags
Add support for legacy Broadcom FCS tags, which are similar to
DSA_TAG_PROTO_BRCM_LEGACY.
BCM5325 and BCM5365 switches require including the original FCS value and
length, as opposed to BCM63xx switches.
Adding the original FCS value and length to DSA_TAG_PROTO_BRCM_LEGACY would
impact performance of BCM63xx switches, so it's better to create a new tag.

Signed-off-by: Álvaro Fernández Rojas <noltari@gmail.com>
Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com>
Link: https://patch.msgid.link/20250614080000.1884236-3-noltari@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-17 17:52:01 -07:00
Neal Cardwell
db16319efc tcp: remove RFC3517/RFC6675 tcp_clear_retrans_hints_partial()
Now that we have removed the RFC3517/RFC6675 hints,
tcp_clear_retrans_hints_partial() is empty, and can be removed.

Suggested-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Yuchung Cheng <ycheng@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250615001435.2390793-4-ncardwell.sw@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-17 16:19:04 -07:00
Neal Cardwell
ba4618885b tcp: remove RFC3517/RFC6675 hint state: lost_skb_hint, lost_cnt_hint
Now that obsolete RFC3517/RFC6675 TCP loss detection has been removed,
we can remove the somewhat complex and intrusive code to maintain its
hint state: lost_skb_hint and lost_cnt_hint.

This commit makes tcp_clear_retrans_hints_partial() empty. We will
remove tcp_clear_retrans_hints_partial() and its call sites in the
next commit.

Suggested-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Yuchung Cheng <ycheng@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250615001435.2390793-3-ncardwell.sw@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-17 16:19:04 -07:00
Jakub Kicinski
b9ebe0cd5d Merge branch 'io_uring-cmd-for-tx-timestamps'
Pavel Begunkov says:

====================
io_uring cmd for tx timestamps (part)

Apply the networking helpers for the io_uring timestamp API.
====================

Link: https://patch.msgid.link/cover.1750065793.git.asml.silence@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-17 15:24:30 -07:00
Pavel Begunkov
2410251cde net: timestamp: add helper returning skb's tx tstamp
Add a helper function skb_get_tx_timestamp() that returns a tx timestamp
associated with an error queue skb.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/702357dd8936ef4c0d3864441e853bfe3224a677.1750065793.git.asml.silence@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-17 15:24:23 -07:00
Jakub Kicinski
0f0decc777 Merge branch 'shradha_v6.16-rc1' of https://github.com/shradhagupta6/linux
Shradha Gupta says:

====================
Allow dyn MSI-X vector allocation of MANA

In this patchset we want to enable the MANA driver to be able to
allocate MSI-X vectors in PCI dynamically.

The first patch exports pci_msix_prepare_desc() in PCI to be able to
correctly prepare descriptors for dynamically added MSI-X vectors.

The second patch adds the support of dynamic vector allocation in
pci-hyperv PCI controller by enabling the MSI_FLAG_PCI_MSIX_ALLOC_DYN
flag and using the pci_msix_prepare_desc() exported in first patch.

The third patch adds a detailed description of the irq_setup(), to
help understand the function design better.

The fourth patch is a preparation patch for mana changes to support
dynamic IRQ allocation. It contains changes in irq_setup() to allow
skipping first sibling CPU sets, in case certain IRQs are already
affinitized to them.

The fifth patch has the changes in MANA driver to be able to allocate
MSI-X vectors dynamically. If the support does not exist it defaults to
older behavior.

* 'shradha_v6.16-rc1' of https://github.com/shradhagupta6/linux:
  net: mana: Allocate MSI-X vectors dynamically
  net: mana: Allow irq_setup() to skip cpus for affinity
  net: mana: explain irq_setup() algorithm
  PCI: hv: Allow dynamic MSI-X vector allocation
  PCI/MSI: Export pci_msix_prepare_desc() for dynamic MSI-X allocations
====================

Link: https://patch.msgid.link/1749650984-9193-1-git-send-email-shradhagupta@linux.microsoft.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-17 14:44:41 -07:00
Shradha Gupta
7553911210 net: mana: Allocate MSI-X vectors dynamically
Currently, the MANA driver allocates MSI-X vectors statically based on
MANA_MAX_NUM_QUEUES and num_online_cpus() values and in some cases ends
up allocating more vectors than it needs. This is because, by this time
we do not have a HW channel and do not know how many IRQs should be
allocated.

To avoid this, we allocate 1 MSI-X vector during the creation of HWC and
after getting the value supported by hardware, dynamically add the
remaining MSI-X vectors.

Signed-off-by: Shradha Gupta <shradhagupta@linux.microsoft.com>
Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
2025-06-17 06:15:15 +00:00
Haiyang Zhang
7768c5f417 net: mana: Add handler for hardware servicing events
To collaborate with hardware servicing events, upon receiving the special
EQE notification from the HW channel, remove the devices on this bus.
Then, after a waiting period based on the device specs, rescan the parent
bus to recover the devices.

Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
Reviewed-by: Shradha Gupta <shradhagupta@linux.microsoft.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/1749834034-18498-1-git-send-email-haiyangz@linux.microsoft.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-16 15:19:48 -07:00
Alexander Lobakin
80bae9df21 libeth: xdp, xsk: access adjacent u32s as u64 where applicable
On 64-bit systems, writing/reading one u64 is faster than two u32s even
when they're are adjacent in a struct. The compilers won't guarantee
they will combine those; I observed both successful and unsuccessful
attempts with both GCC and Clang, and it's not easy to say what it
depends on.
There's a few places in libeth_xdp winning up to several percent from
combined access (both performance and object code size, especially
when unrolling). Add __LIBETH_WORD_ACCESS and use it there on LE.
Drivers are free to optimize HW-specific callbacks under the same
definition.

Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-06-16 11:40:15 -07:00
Alexander Lobakin
3ced71a8b3 libeth: xsk: add XSkFQ refill and XSk wakeup helpers
XSkFQ refill is pretty generic across the drivers minus FQ descriptor
filling and can easily be unified with one inline callback.
XSk wakeup is usually not, but here, instead of commonly used
"SW interrupts", I picked firing an IPI. In most tests, it showed better
performance; it also provides better control for userspace on which CPU
will handle the xmit, as SW interrupts honor IRQ affinity no matter
which core produces XSk xmit descs (while XDPSQs are associated 1:1
with cores having the same ID).

Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-06-16 11:40:15 -07:00
Alexander Lobakin
5495c58c65 libeth: xsk: add XSk Rx processing support
Add XSk counterparts for preparing XSk &libeth_xdp_buff (adding head and
frags), running the program, and handling the verdict, inc. XDP_PASS.
Shortcuts in comparison with regular Rx: frags and all verdicts except
XDP_REDIRECT are under unlikely() and out of line; no checks for XDP
program presence as it's always true for XSk.

Suggested-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> # optimizations
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-06-16 11:40:15 -07:00
Alexander Lobakin
40e846d122 libeth: xsk: add XSk xmit functions
Reuse core sending functions to send XSk xmit frames.
Both metadata and no metadata pools/driver are supported. libeth_xdp
also provides generic XSk metadata ops, currently with the checksum
offload only and for cases when HW doesn't require supplying L3/L4
checksum offsets. Drivers are free to pass their own ops.
&libeth_xdp_tx_bulk is not used here as it would be redundant;
pool->tx_descs are accessed directly.
Fake "libeth_xsktmo" is needed to hide implementation details from the
drivers when they want to use the generic ops: the original struct is
defined in the same file where dev->xsk_tx_metadata_ops gets set to
avoid duplication of slowpath; at the same time; XSk xmit functions
use local "fast" copy to inline XMO callbacks.
Tx descriptor filling loop is unrolled by 8.

Suggested-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> # optimizations
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-06-16 11:40:15 -07:00
Alexander Lobakin
b3ad8450b4 libeth: xsk: add XSk XDP_TX sending helpers
Add Xsk counterparts for XDP_TX buffer sending and completion.
The same base structures and functions used from the libeth_xdp core,
with adjustments to that XSk Rx always operates on &xdp_buff_xsk for
both head and frags. And unlike regular Rx, here unlikely() are used
for frags, as the header split gives no benefits for XSk Rx, at
least for now.

Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-06-16 11:40:15 -07:00
Alexander Lobakin
576cc5c13d libeth: xdp: add RSS hash hint and XDP features setup helpers
End the XDP section by adding helpers to setup XDP features, flipping
.ndo_xdp_xmit() support at runtime (in case when it's not always on),
and calculating the queue clean/refill threshold.

Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-06-16 11:40:15 -07:00
Alexander Lobakin
1bb635d374 libeth: xdp: add templates for building driver-side callbacks
Defining driver-specific functions to pass to libeth_xdp functions can
induce boilerplates and/or look a bit cryptic with all those layers of
indirection. On the other hand, this indirection is needed to allow
compilers to uninline big functions even when passed to __always_inline
helpers (too much inlining also hurts performance in some cases), plus
to reuse some XDP helpers in XSk code.
Add macros to quickly build them, with the detailed kdoc. They take
names of the actual callbacks for filling a Tx descriptor and other
purely HW-specific things and wrap them appropriately.

LIBETH_XDP_DEFINE_{BEGIN,END}() is needed for GCC 8+ unfortunately to
let the drivers control which functions will be static and which global
without hitting `-Wold-style-declaration`.

Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-06-16 11:40:15 -07:00
Alexander Lobakin
4c805f7ae1 libeth: xdp: add XDP prog run and verdict result handling
Running a prog and handling the verdicts, up to napi_gro_receive()
is also pretty generic code not really differing between vendors
(except for Tx descriptor filling and Rx descriptor parsing).

Define a couple inlines to do that. The inline callbacks a driver
needs to pass is mentioned above: Tx descriptor filling for XDP_TX,
populating skb with the descriptor data for XDP_PASS, finalizing
XDPSQs after the polling loop for XDP_TX (kicking the HW to start
sending).
The populate callback passes only &libeth_xdp_buff assuming buff::desc
pointer is enough, plus you can always get the corresponding Rx queue
structure via container_of(buff::rxq). If not, a driver can extend
the buff with more fields directly on the stack without touching
libeth_xdp definitions.

Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-06-16 11:40:14 -07:00
Alexander Lobakin
3ef2b0192e libeth: xdp: add helpers for preparing/processing &libeth_xdp_buff
Add convenience helpers to build an &xdp_buff. This means: general
initialization before the NAPI loop, adding head, adding frags etc.
libeth_xdp_process_buff() is the same what everybody have in their
drivers:

dma_sync_for_cpu();

if (!frag) {
	add_head();
	prefetch();
} else {
	add_frag();
}

Note that I don't use net_prefetch(), sticking to the original
prefetch(). In none of my tests prefetching 128 bytes yielded better
perf than 64 bytes. That might differ if the headers are huge enough,
but then additional tunneling etc. overhead takes place, you either
way won't win a lot.

&libeth_xdp_stash is for cases when you exit the polling loop without
finishing building the buff. If that happens, you need to store the
buffer in the queue structure until the next loop and then restore it.
It makes no sense to place a whole full &xdp_buff there. Define a
minimal structure, which would store only the fields essential to
restore it.
I was able to pack it into 16 bytes, which is only 8 bytes bigger
than `struct sk_buff *skb` on x64.

Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-06-16 11:40:14 -07:00
Alexander Lobakin
819bbaefed libeth: xdp: add XDPSQ cleanup timers
When XDP Tx queues are not interrupt-driven but use lazy cleaning,
i.e. only when there are less than `threshold` free descriptors left,
we also need cleanup timers to avoid &xdp_buff and &xdp_frame stall
for too long, especially with Page Pool (it warns every about inflight
pages every 60 second).
Let's say we sent 256 frames and don't need to send more, but we clean
only when the number of pending items >= 384. In that case, those 256
will stall until 128 more are sent. For this, add simple helpers to
run a timer which will clean the queue regardless, after 1 second of
the last send.
The timer is triggered when finalizing the queue. As long as there is
regular active traffic, the timer doesn't fire.

Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Reviewed-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-06-16 11:40:14 -07:00
Alexander Lobakin
c4ba6a9b9d libeth: xdp: add XDPSQ locking helpers
Unfortunately, it's not always possible to allocate
max(num_rxqs, nr_cpu_ids) even on hi-end NICs.
To mitigate this, add simple locking helpers to libeth_xdp.
As long as XDPSQs are not shared, the whole functionality is gated
behind a static lock. Otherwise, each bulk flush locks the queue for
the time of cleaning and filling the descriptors.
As long as this particular queue is not used by more than 1 CPU,
the impact is minimal (runtime check for boolean twice per 16+
descriptors).

Suggested-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> # static key
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Reviewed-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-06-16 11:40:14 -07:00
Alexander Lobakin
26ce8eb0bb libeth: xdp: add XDPSQE completion helpers
Similarly to libeth_tx_complete(), add libeth_xdp_complete_tx() to
handle XDP_TX and xmit buffers. Both use bulk return under the hood.

Also add out of line libeth_tx_complete_any() which handles both
regular and XDP frames (if libeth_xdp is loaded), for example,
to call on queue destroy, where we don't need inlining but
convenience.

Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-06-16 11:40:14 -07:00
Alexander Lobakin
084ceda7de libeth: xdp: add .ndo_xdp_xmit() helpers
Add helpers for implementing .ndo_xdp_xmit().
Same as for XDP_TX, accumulate up to 16 DMA-mapped frames on the stack,
then flush. If DMA mapping is failed for some reason, don't try mapping
further frames, but still flush what was already prepared.
DMA address of a head frame is stored in its headroom, assuming it
has enough of it for an 8 (or 4) byte value.
In addition to @prep and @xmit driver callbacks in XDP_TX, xmit also
needs @finalize to kick the XDPSQ after filling.

Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Reviewed-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-06-16 11:40:14 -07:00
Alexander Lobakin
8591c3afe8 libeth: xdp: add XDP_TX buffers sending
Start adding XDP-specific code to libeth, namely handling XDP_TX buffers
(only sending).
The idea is that we accumulate up to 16 buffers on the stack, then,
if either the limit is reached or the polling is finished, flush them
at once with only one XDPSQ cleaning (if needed). The main sending
function will be aware of the sending budget and already have all the
info to send the buffers, so it can't fail.
Drivers need to provide 2 inline callbacks to the main sending function:
for cleaning an XDPSQ and for filling descriptors; the library code
takes care of the rest.
Note that unlike the generic code, multi-buffer support is not wrapped
here with unlikely() to not hurt header split setups.

&libeth_xdp_buff is a simple extension over &xdp_buff which has a direct
pointer to the corresponding Rx descriptor (and, luckily, precisely 1 CL
size and 16-byte alignment on x86_64).

Suggested-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> # xmit logic
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-06-16 11:40:14 -07:00
Alexander Lobakin
35c64b6500 libeth: support native XDP and register memory model
Expand libeth's Page Pool functionality by adding native XDP support.
This means picking the appropriate headroom and DMA direction.
Also, register all the created &page_pools as XDP memory models.
A driver then can call xdp_rxq_info_attach_page_pool() when registering
its RxQ info.

Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-06-16 11:40:14 -07:00
Alexander Lobakin
6ad5ff6e72 libeth: convert to netmem
Back when the libeth Rx core was initially written, devmem was a draft
and netmem_ref didn't exist in the mainline. Now that it's here, make
libeth MP-agnostic before introducing any new code or any new library
users.
When it's known that the created PP/FQ is for header buffers, use faster
"unsafe" underscored netmem <--> virt accessors as netmem_is_net_iov()
is always false in that case, but consumes some cycles (bit test +
true branch).

Reviewed-by: Mina Almasry <almasrymina@google.com>
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-06-16 11:40:14 -07:00
RubenKelevra
b776999bf2 net: pfcp: fix typo in message_priority field name
The field is spelled "message_priprity" in the big-endian bit-field
definition.  Nothing in-tree currently references the member, so the
typo does not break kernel builds, but it is clearly incorrect.

Signed-off-by: RubenKelevra <rubenkelevra@gmail.com>
Link: https://patch.msgid.link/20250612145012.185321-1-rubenkelevra@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-13 18:17:08 -07:00
Jakub Kicinski
535de52801 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR (net-6.16-rc2).

No conflicts or adjacent changes.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-12 10:09:10 -07:00
Jakub Kicinski
d5441acae7 bluetooth pull request for net:
- eir: Fix NULL pointer deference on eir_get_service_data
  - eir: Fix possible crashes on eir_create_adv_data
  - hci_sync: Fix broadcast/PA when using an existing instance
  - ISO: Fix using BT_SK_PA_SYNC to detect BIS sockets
  - ISO: Fix not using bc_sid as advertisement SID
  - MGMT: Fix sparse errors
 -----BEGIN PGP SIGNATURE-----
 
 iQJNBAABCgA3FiEE7E6oRXp8w05ovYr/9JCA4xAyCykFAmhJ66MZHGx1aXoudm9u
 LmRlbnR6QGludGVsLmNvbQAKCRD0kIDjEDILKfp/D/0VTEMF4PiA2eLHIPSwyIHr
 pvpz3nY1WE84lAVL0VKNJalA15dk6TVs3Vxgns62BHLdajBOmYPpuJGXaSERBfLB
 t5eb4nU9rx9F7+SW8zVLNwtnn5bTENNYKQIjfLmslDQQGfOjeaUP5sO/rIcLEiO3
 0rEi55pE4nM6S2wUcmQlhWPC6tr3vIptg4lAz3MWlATDuUnkLjJ3rzEZdkg2kt39
 2VJGNxXEG7sBrwv+coO3ROe54YSOrb+gvd9HOL0vq3MVBcvncCRqc7TuBlYi7/5C
 p+WdEyG26FgS/TzdgMJKuVISQp6kNKulbuRhsnD2XZA3Gik+t+79Ex9haYW+HLDS
 AWQNBm1FgYdCc4LsAxKfwGdvp8wAx1ci1vLNniYVTelyUAc5LosEZ/15DCCyTKdK
 9zXEAfxwn72dLVtryVIRKqDR39QVqsxDSuV9ydgXzPJWwjisHX3AB01EqN5PGjYH
 aspNgMGfYL9zSw6N1LQ+99M+/JLbvLs7b4jui4CbD3EI7nxN0YqOcKlHw7vEje5s
 auU/UEL7DgWOzHTxCcidwATuV79pfx0CRSwsXaPLV1yA9lhS5AYdpBlsRB+wRFbN
 vhpw8dwj/WCM0GVYnG87BU3mriyfNgaERTVA2nLKZXvn+cRkVBUkLwBV3Jpi7vQZ
 cJ22gcrRj7uYotfvyCHv9g==
 =dulg
 -----END PGP SIGNATURE-----

Merge tag 'for-net-2025-06-11' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth

Luiz Augusto von Dentz says:

====================
bluetooth pull request for net:

 - eir: Fix NULL pointer deference on eir_get_service_data
 - eir: Fix possible crashes on eir_create_adv_data
 - hci_sync: Fix broadcast/PA when using an existing instance
 - ISO: Fix using BT_SK_PA_SYNC to detect BIS sockets
 - ISO: Fix not using bc_sid as advertisement SID
 - MGMT: Fix sparse errors

* tag 'for-net-2025-06-11' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth:
  Bluetooth: MGMT: Fix sparse errors
  Bluetooth: ISO: Fix not using bc_sid as advertisement SID
  Bluetooth: ISO: Fix using BT_SK_PA_SYNC to detect BIS sockets
  Bluetooth: eir: Fix possible crashes on eir_create_adv_data
  Bluetooth: hci_sync: Fix broadcast/PA when using an existing instance
  Bluetooth: Fix NULL pointer deference on eir_get_service_data
====================

Link: https://patch.msgid.link/20250611204944.1559356-1-luiz.dentz@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-12 08:13:48 -07:00
Eric Dumazet
adcaa890c7 net_sched: remove qdisc_tree_flush_backlog()
This function is no longer used after the four prior fixes.

Given all prior uses were wrong, it seems better to remove it.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250611111515.1983366-6-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-12 08:05:50 -07:00
Leon Romanovsky
c0f21029f1 xfrm: always initialize offload path
Offload path is used for GRO with SW IPsec, and not just for HW
offload. So initialize it anyway.

Fixes: 585b64f5a6 ("xfrm: delay initialization of offload path till its actually requested")
Reported-by: Sabrina Dubroca <sd@queasysnail.net>
Closes: https://lore.kernel.org/all/aEGW_5HfPqU1rFjl@krikkit
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2025-06-12 07:07:14 +02:00
Luiz Augusto von Dentz
5842c01a9e Bluetooth: ISO: Fix not using bc_sid as advertisement SID
Currently bc_sid is being ignore when acting as Broadcast Source role,
so this fix it by passing the bc_sid and then use it when programming
the PA:

< HCI Command: LE Set Exte.. (0x08|0x0036) plen 25
        Handle: 0x01
        Properties: 0x0000
        Min advertising interval: 140.000 msec (0x00e0)
        Max advertising interval: 140.000 msec (0x00e0)
        Channel map: 37, 38, 39 (0x07)
        Own address type: Random (0x01)
        Peer address type: Public (0x00)
        Peer address: 00:00:00:00:00:00 (OUI 00-00-00)
        Filter policy: Allow Scan Request from Any, Allow Connect Request from Any (0x00)
        TX power: Host has no preference (0x7f)
        Primary PHY: LE 1M (0x01)
        Secondary max skip: 0x00
        Secondary PHY: LE 2M (0x02)
        SID: 0x01
        Scan request notifications: Disabled (0x00)

Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2025-06-11 16:29:55 -04:00
Jakub Kicinski
34355b6712 linux-can-next-for-6.17-20250610
-----BEGIN PGP SIGNATURE-----
 
 iQFHBAABCgAxFiEEn/sM2K9nqF/8FWzzDHRl3/mQkZwFAmhH/swTHG1rbEBwZW5n
 dXRyb25peC5kZQAKCRAMdGXf+ZCRnONCCACa16bTW53gBzmiTxdEgUJ/h+gQuR8G
 Fj+yOYIWNZY/YOExa40ldApu3iB9UAB0D+FOly4Wv5zYDct6yNBxqtZjbkTFMaoi
 3i+SSrRLNtIxgGs1KgJKVPis8mhCqiBL0aGoJDGyRiye6hotECDyQWvlGM3lMGUr
 wdMDQW2xyKOWvm++jXijkUMyKThmI7czlSH8al+JU9KcAO9hiUlGzejdI56KUIMW
 TRlg2QSK9CfIzgUP4RQughbF59/8Xbq3LOidu50xMad2wiOJj0IUHB0h6LoAshnS
 jFy4Ox4Gw5hcmdaEKazjYEtq3nQeZ6wct7jThw02e4D9h0ac2MCVhphk
 =Pt9d
 -----END PGP SIGNATURE-----

Merge tag 'linux-can-next-for-6.17-20250610' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can-next

Marc Kleine-Budde says:

====================
pull-request: can-next 2025-06-10

The first 4 patches are by Vincent Mailhol and prepare the CAN netlink
interface for the introduction of CAN XL configuration.

Geert Uytterhoeven's patch updates the CAN networking documentation.

The last 2 patched are by Davide Caratti and introduce skb drop
reasons in the receive path of several CAN protocols.

* tag 'linux-can-next-for-6.17-20250610' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can-next:
  can: add drop reasons in CAN protocols receive path
  can: add drop reasons in the receive path of AF_CAN
  documentation: networking: can: Document alloc_candev_mqs()
  can: netlink: can_changelink(): rename tdc_mask into fd_tdc_flag_provided
  can: bittiming: rename can_tdc_is_enabled() into can_fd_tdc_is_enabled()
  can: bittiming: rename CAN_CTRLMODE_TDC_MASK into CAN_CTRLMODE_FD_TDC_MASK
  can: netlink: replace tabulation by space in assignment
====================

Link: https://patch.msgid.link/20250610094933.1593081-1-mkl@pengutronix.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-10 15:44:47 -07:00
Michal Luczaj
2660a544fd net: Fix TOCTOU issue in sk_is_readable()
sk->sk_prot->sock_is_readable is a valid function pointer when sk resides
in a sockmap. After the last sk_psock_put() (which usually happens when
socket is removed from sockmap), sk->sk_prot gets restored and
sk->sk_prot->sock_is_readable becomes NULL.

This makes sk_is_readable() racy, if the value of sk->sk_prot is reloaded
after the initial check. Which in turn may lead to a null pointer
dereference.

Ensure the function pointer does not turn NULL after the check.

Fixes: 8934ce2fd0 ("bpf: sockmap redirect ingress support")
Suggested-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Michal Luczaj <mhal@rbox.co>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20250609-skisreadable-toctou-v1-1-d0dfb2d62c37@rbox.co
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-10 15:31:28 -07:00
Gur Stavi
2bc64b89c4 queue_api: add subqueue variant netif_subqueue_sent
Add a new function, netif_subqueue_sent, which is a wrapper for
netdev_tx_sent_queue.

Drivers that use the subqueue variant macros, netif_subqueue_xxx,
identify queue by index and are not required to obtain
struct netdev_queue explicitly.

Such drivers still need to call netdev_tx_sent_queue which is a
counterpart of netif_subqueue_completed_wake. Allowing drivers to use a
subqueue variant for this purpose improves their code consistency by
always referring to queue by its index.

Signed-off-by: Gur Stavi <gur.stavi@huawei.com>
Link: https://patch.msgid.link/909a5c92db49cad39f0954d6cb86775e6480ef4c.1749038081.git.gur.stavi@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-10 15:27:18 -07:00
Willem de Bruijn
561939ed44 net: remove unused sock_enable_timestamps
This function was introduced in commit 783da70e83 ("net: add
sock_enable_timestamps"), with one caller in rxrpc.

That only caller was removed in commit 7903d4438b ("rxrpc: Don't use
received skbuff timestamps").

Signed-off-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
Link: https://patch.msgid.link/20250609153254.3504909-1-willemdebruijn.kernel@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-10 14:43:40 -07:00
Dipayaan Roy
c09ef59e17 net: mana: Expose additional hardware counters for drop and TC via ethtool.
Add support for reporting additional hardware counters for drop and
TC using the ethtool -S interface.

These counters include:

- Aggregate Rx/Tx drop counters
- Per-TC Rx/Tx packet counters
- Per-TC Rx/Tx byte counters
- Per-TC Rx/Tx pause frame counters

The counters are exposed using ethtool_ops->get_ethtool_stats and
ethtool_ops->get_strings. This feature/counters are not available
to all versions of hardware.

Signed-off-by: Dipayaan Roy <dipayanroy@linux.microsoft.com>
Reviewed-by: Subbaraya Sundeep <sbhatta@marvell.com>
Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
Link: https://patch.msgid.link/20250609100103.GA7102@linuxonhyperv3.guj3yctzbm1etfxqx2vob5hsef.xx.internal.cloudapp.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-10 14:23:19 -07:00
Davide Caratti
127c49624a can: add drop reasons in the receive path of AF_CAN
Besides the existing pr_warn_once(), use skb drop reasons in case AF_CAN
layer drops non-conformant CAN{,FD,XL} frames, or conformant frames
received by "wrong" devices, so that it's possible to debug (and count)
such events using existing tracepoints:

| # perf record -e skb:kfree_skb -aR -- ./drv/canfdtest -v -g -l 1 vcan0
| # perf script
| [...]
| canfdtest  1123 [000]  3893.271264: skb:kfree_skb: skbaddr=0xffff975703c9f700 rx_sk=(nil) protocol=12 location=can_rcv+0x4b  reason: CAN_RX_INVALID_FRAME

Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Link: https://patch.msgid.link/20250604160605.1005704-2-dcaratti@redhat.com
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2025-06-10 10:23:30 +02:00
Jakub Kicinski
fdd9ebccfc bluetooth pull request for net:
- MGMT: Fix UAF on mgmt_remove_adv_monitor_complete
  - MGMT: Protect mgmt_pending list with its own lock
  - hci_core: fix list_for_each_entry_rcu usage
  - btintel_pcie: Increase the tx and rx descriptor count
  - btintel_pcie: Reduce driver buffer posting to prevent race condition
  - btintel_pcie: Fix driver not posting maximum rx buffers
 -----BEGIN PGP SIGNATURE-----
 
 iQJNBAABCgA3FiEE7E6oRXp8w05ovYr/9JCA4xAyCykFAmhB65gZHGx1aXoudm9u
 LmRlbnR6QGludGVsLmNvbQAKCRD0kIDjEDILKXuaEACPXWNUOViPFPE85M1Y/VGA
 Hw4uDO9x25XySBk740NRT3qkYS8pWZa8SujQZa0ijqklrggosnz3q7QdwiRow5Cv
 CLqCZiuQDtekXV8K9xa66K8rt2iUxMDnQRzNW32Pe0OW6Xy2RFiYqC7ZVpFomXBj
 2vMj+aNRwbdzvKStEQTxWCISdCkP7XSuOdWS/wnAFyiSThgr4R8PByLQZ9P2J5xj
 KfLBs+QzwHCc1hGbO7odTVqyv+UN3v82aN2fmyusdgBYBJ9ymLMV1gpBm/B4oGI7
 /zXbU9bZWL+uis+pB3k9MQnaytc32v1ODFyqY8Ua1slE4Qzwz7OKB/8TP9MeOO1s
 MzzIYuAK2KJ6C5mxyIBRVMcbdX2GgiwVIXJBWesuqoZc0H1En+eSpoKNzfoX16Ul
 hMc8pCfvpKXaqo9KOJMldr5Yg4iKV83Am7zNUB1ka6TymM8NUx56gbF50tYDlOXY
 TGYpli8OQF4x5/tWRh9AE+DxgYa4sVrDiQncvnSMlmlyBGf/wCczCjaFwRlGM9Wu
 MZPi2zm0lwa1F6T358uOyJRbcFawaV39AGHo37SrCFOvPIKC+c6iTYqLHWLeq6V6
 mXlUn4BrTrt7TUqFpBIUcN0LOOLKgxr7Oa8UAhhCfn8LLsFvryuTEbNtxOqvFLQP
 4ZUyJFMjUnVAr5PMPjyJ3w==
 =VZN1
 -----END PGP SIGNATURE-----

Merge tag 'for-net-2025-06-05' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth

Luiz Augusto von Dentz says:

====================
bluetooth pull request for net:

 - MGMT: Fix UAF on mgmt_remove_adv_monitor_complete
 - MGMT: Protect mgmt_pending list with its own lock
 - hci_core: fix list_for_each_entry_rcu usage
 - btintel_pcie: Increase the tx and rx descriptor count
 - btintel_pcie: Reduce driver buffer posting to prevent race condition
 - btintel_pcie: Fix driver not posting maximum rx buffers

* tag 'for-net-2025-06-05' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth:
  Bluetooth: MGMT: Protect mgmt_pending list with its own lock
  Bluetooth: MGMT: Fix UAF on mgmt_remove_adv_monitor_complete
  Bluetooth: btintel_pcie: Reduce driver buffer posting to prevent race condition
  Bluetooth: btintel_pcie: Increase the tx and rx descriptor count
  Bluetooth: btintel_pcie: Fix driver not posting maximum rx buffers
  Bluetooth: hci_core: fix list_for_each_entry_rcu usage
====================

Link: https://patch.msgid.link/20250605191136.904411-1-luiz.dentz@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-09 15:47:30 -07:00
Linus Torvalds
2c7e4a2663 Including fixes from CAN, wireless, Bluetooth, and Netfilter.
Current release - regressions:
 
  - Revert "kunit: configs: Enable CONFIG_INIT_STACK_ALL_PATTERN
    in all_tests", makes kunit error out if compiler is old
 
  - wifi: iwlwifi: mvm: fix assert on suspend
 
  - rxrpc: fix return from none_validate_challenge()
 
 Current release - new code bugs:
 
  - ovpn: couple of fixes for socket cleanup and UDP-tunnel teardown
 
  - can: kvaser_pciefd: refine error prone echo_skb_max handling logic
 
  - fix net_devmem_bind_dmabuf() stub when DEVMEM not compiled
 
  - eth: airoha: fixes for config / accel in bridge mode
 
 Previous releases - regressions:
 
  - Bluetooth: hci_qca: move the SoC type check to the right place,
    fix GPIO integration
 
  - prevent a NULL deref in rtnl_create_link() after locking changes
 
  - fix udp gso skb_segment after pull from frag_list
 
  - hv_netvsc: fix potential deadlock in netvsc_vf_setxdp()
 
 Previous releases - always broken:
 
  - netfilter:
    - nf_nat: also check reverse tuple to obtain clashing entry
    - nf_set_pipapo_avx2: fix initial map fill (zeroing)
 
  - fix the helper for incremental update of packet checksums after
    modifying the IP address, used by ILA and BPF
 
  - eth: stmmac: prevent div by 0 when clock rate is misconfigured
 
  - eth: ice: fix Tx scheduler handling of XDP and changing queue count
 
  - eth: b53: fix support for the RGMII interface when delays configured
 
 Signed-off-by: Jakub Kicinski <kuba@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmhBv5kACgkQMUZtbf5S
 Irs/DA/+PIh7a33iVcsGIcmWtpnGp+18id1tSLnYGUGx1cW6zxutPD8rb6BsAN84
 KR+XVsbMDUehIa10xPoF2L5mX5YujEiPSkjP8eE2KJKDLGpDtYNOyOWKT21yudnd
 4EVF5JQoEbWHrkHMKF97tla84QLd5fFtgsvejVeZtQYSIDOteNGfra4Jly8iiR+J
 i9k+HdB0CNEKVvvibQZjZ5CrkpmdNPmB9UoJ59bG15q2+vXdzOPm/CCNo//9ZQJB
 I8O40nu16msRRVA9nc2V/Tp98fTk9dnDpTSyWiBlNCut9g9ftx456Ew+tjobMRIT
 yeh+q9+1z3YHjGJB8P1FGmMZWK3tbrwyqjFGqpSjr7juucFok9kxAaRPqrQxga7H
 Yxq3RegeNqukEAV39ZE14TL765Jy+XXF1uTHhNBkUADlNJVKnZygSk78/Ut2nDvQ
 vkfoto+CfKny5qkSbTk8KKv1rZu3xwewoOjlcdkHlOBoouCjPOxTC7yxTZgUZB5c
 yap0jQsedJct4OAA+O7IGLCmf3KrJ0H32HbWEY68mpTEd+4Df5vAWiIi7vmVJmk3
 DX9JWmu5A5yjNMhOEsBQU98gkNw366aA/E8dr+lEfp3AoqDrmdbG3l8+qqhqYnb+
 nnL1sNiQH1griZwQBUROAhrtXnYlYsAsZi+cv23Q0hQiGIvIC2Q=
 =sRQt
 -----END PGP SIGNATURE-----

Merge tag 'net-6.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Pull networking fixes from Jakub Kicinski:
 "Including fixes from CAN, wireless, Bluetooth, and Netfilter.

  Current release - regressions:

   - Revert "kunit: configs: Enable CONFIG_INIT_STACK_ALL_PATTERN in
     all_tests", makes kunit error out if compiler is old

   - wifi: iwlwifi: mvm: fix assert on suspend

   - rxrpc: fix return from none_validate_challenge()

  Current release - new code bugs:

   - ovpn: couple of fixes for socket cleanup and UDP-tunnel teardown

   - can: kvaser_pciefd: refine error prone echo_skb_max handling logic

   - fix net_devmem_bind_dmabuf() stub when DEVMEM not compiled

   - eth: airoha: fixes for config / accel in bridge mode

  Previous releases - regressions:

   - Bluetooth: hci_qca: move the SoC type check to the right place, fix
     GPIO integration

   - prevent a NULL deref in rtnl_create_link() after locking changes

   - fix udp gso skb_segment after pull from frag_list

   - hv_netvsc: fix potential deadlock in netvsc_vf_setxdp()

  Previous releases - always broken:

   - netfilter:
       - nf_nat: also check reverse tuple to obtain clashing entry
       - nf_set_pipapo_avx2: fix initial map fill (zeroing)

   - fix the helper for incremental update of packet checksums after
     modifying the IP address, used by ILA and BPF

   - eth:
       - stmmac: prevent div by 0 when clock rate is misconfigured
       - ice: fix Tx scheduler handling of XDP and changing queue count
       - eth: fix support for the RGMII interface when delays configured"

* tag 'net-6.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (76 commits)
  calipso: unlock rcu before returning -EAFNOSUPPORT
  seg6: Fix validation of nexthop addresses
  net: prevent a NULL deref in rtnl_create_link()
  net: annotate data-races around cleanup_net_task
  selftests: drv-net: tso: make bkg() wait for socat to quit
  selftests: drv-net: tso: fix the GRE device name
  selftests: drv-net: add configs for the TSO test
  wireguard: device: enable threaded NAPI
  netlink: specs: rt-link: decode ip6gre
  netlink: specs: rt-link: add missing byte-order properties
  net: wwan: mhi_wwan_mbim: use correct mux_id for multiplexing
  wifi: cfg80211/mac80211: correctly parse S1G beacon optional elements
  net: dsa: b53: do not touch DLL_IQQD on bcm53115
  net: dsa: b53: allow RGMII for bcm63xx RGMII ports
  net: dsa: b53: do not configure bcm63xx's IMP port interface
  net: dsa: b53: do not enable RGMII delay on bcm63xx
  net: dsa: b53: do not enable EEE on bcm63xx
  net: ti: icssg-prueth: Fix swapped TX stats for MII interfaces.
  selftests: netfilter: nft_nat.sh: add test for reverse clash with nat
  netfilter: nf_nat: also check reverse tuple to obtain clashing entry
  ...
2025-06-05 12:34:55 -07:00
Luiz Augusto von Dentz
6fe26f694c Bluetooth: MGMT: Protect mgmt_pending list with its own lock
This uses a mutex to protect from concurrent access of mgmt_pending
list which can cause crashes like:

==================================================================
BUG: KASAN: slab-use-after-free in hci_sock_get_channel+0x60/0x68 net/bluetooth/hci_sock.c:91
Read of size 2 at addr ffff0000c48885b2 by task syz.4.334/7318

CPU: 0 UID: 0 PID: 7318 Comm: syz.4.334 Not tainted 6.15.0-rc7-syzkaller-g187899f4124a #0 PREEMPT
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 02/12/2025
Call trace:
 show_stack+0x2c/0x3c arch/arm64/kernel/stacktrace.c:466 (C)
 __dump_stack+0x30/0x40 lib/dump_stack.c:94
 dump_stack_lvl+0xd8/0x12c lib/dump_stack.c:120
 print_address_description+0xa8/0x254 mm/kasan/report.c:408
 print_report+0x68/0x84 mm/kasan/report.c:521
 kasan_report+0xb0/0x110 mm/kasan/report.c:634
 __asan_report_load2_noabort+0x20/0x2c mm/kasan/report_generic.c:379
 hci_sock_get_channel+0x60/0x68 net/bluetooth/hci_sock.c:91
 mgmt_pending_find+0x7c/0x140 net/bluetooth/mgmt_util.c:223
 pending_find net/bluetooth/mgmt.c:947 [inline]
 remove_adv_monitor+0x44/0x1a4 net/bluetooth/mgmt.c:5445
 hci_mgmt_cmd+0x780/0xc00 net/bluetooth/hci_sock.c:1712
 hci_sock_sendmsg+0x544/0xbb0 net/bluetooth/hci_sock.c:1832
 sock_sendmsg_nosec net/socket.c:712 [inline]
 __sock_sendmsg net/socket.c:727 [inline]
 sock_write_iter+0x25c/0x378 net/socket.c:1131
 new_sync_write fs/read_write.c:591 [inline]
 vfs_write+0x62c/0x97c fs/read_write.c:684
 ksys_write+0x120/0x210 fs/read_write.c:736
 __do_sys_write fs/read_write.c:747 [inline]
 __se_sys_write fs/read_write.c:744 [inline]
 __arm64_sys_write+0x7c/0x90 fs/read_write.c:744
 __invoke_syscall arch/arm64/kernel/syscall.c:35 [inline]
 invoke_syscall+0x98/0x2b8 arch/arm64/kernel/syscall.c:49
 el0_svc_common+0x130/0x23c arch/arm64/kernel/syscall.c:132
 do_el0_svc+0x48/0x58 arch/arm64/kernel/syscall.c:151
 el0_svc+0x58/0x17c arch/arm64/kernel/entry-common.c:767
 el0t_64_sync_handler+0x78/0x108 arch/arm64/kernel/entry-common.c:786
 el0t_64_sync+0x198/0x19c arch/arm64/kernel/entry.S:600

Allocated by task 7037:
 kasan_save_stack mm/kasan/common.c:47 [inline]
 kasan_save_track+0x40/0x78 mm/kasan/common.c:68
 kasan_save_alloc_info+0x44/0x54 mm/kasan/generic.c:562
 poison_kmalloc_redzone mm/kasan/common.c:377 [inline]
 __kasan_kmalloc+0x9c/0xb4 mm/kasan/common.c:394
 kasan_kmalloc include/linux/kasan.h:260 [inline]
 __do_kmalloc_node mm/slub.c:4327 [inline]
 __kmalloc_noprof+0x2fc/0x4c8 mm/slub.c:4339
 kmalloc_noprof include/linux/slab.h:909 [inline]
 sk_prot_alloc+0xc4/0x1f0 net/core/sock.c:2198
 sk_alloc+0x44/0x3ac net/core/sock.c:2254
 bt_sock_alloc+0x4c/0x300 net/bluetooth/af_bluetooth.c:148
 hci_sock_create+0xa8/0x194 net/bluetooth/hci_sock.c:2202
 bt_sock_create+0x14c/0x24c net/bluetooth/af_bluetooth.c:132
 __sock_create+0x43c/0x91c net/socket.c:1541
 sock_create net/socket.c:1599 [inline]
 __sys_socket_create net/socket.c:1636 [inline]
 __sys_socket+0xd4/0x1c0 net/socket.c:1683
 __do_sys_socket net/socket.c:1697 [inline]
 __se_sys_socket net/socket.c:1695 [inline]
 __arm64_sys_socket+0x7c/0x94 net/socket.c:1695
 __invoke_syscall arch/arm64/kernel/syscall.c:35 [inline]
 invoke_syscall+0x98/0x2b8 arch/arm64/kernel/syscall.c:49
 el0_svc_common+0x130/0x23c arch/arm64/kernel/syscall.c:132
 do_el0_svc+0x48/0x58 arch/arm64/kernel/syscall.c:151
 el0_svc+0x58/0x17c arch/arm64/kernel/entry-common.c:767
 el0t_64_sync_handler+0x78/0x108 arch/arm64/kernel/entry-common.c:786
 el0t_64_sync+0x198/0x19c arch/arm64/kernel/entry.S:600

Freed by task 6607:
 kasan_save_stack mm/kasan/common.c:47 [inline]
 kasan_save_track+0x40/0x78 mm/kasan/common.c:68
 kasan_save_free_info+0x58/0x70 mm/kasan/generic.c:576
 poison_slab_object mm/kasan/common.c:247 [inline]
 __kasan_slab_free+0x68/0x88 mm/kasan/common.c:264
 kasan_slab_free include/linux/kasan.h:233 [inline]
 slab_free_hook mm/slub.c:2380 [inline]
 slab_free mm/slub.c:4642 [inline]
 kfree+0x17c/0x474 mm/slub.c:4841
 sk_prot_free net/core/sock.c:2237 [inline]
 __sk_destruct+0x4f4/0x760 net/core/sock.c:2332
 sk_destruct net/core/sock.c:2360 [inline]
 __sk_free+0x320/0x430 net/core/sock.c:2371
 sk_free+0x60/0xc8 net/core/sock.c:2382
 sock_put include/net/sock.h:1944 [inline]
 mgmt_pending_free+0x88/0x118 net/bluetooth/mgmt_util.c:290
 mgmt_pending_remove+0xec/0x104 net/bluetooth/mgmt_util.c:298
 mgmt_set_powered_complete+0x418/0x5cc net/bluetooth/mgmt.c:1355
 hci_cmd_sync_work+0x204/0x33c net/bluetooth/hci_sync.c:334
 process_one_work+0x7e8/0x156c kernel/workqueue.c:3238
 process_scheduled_works kernel/workqueue.c:3319 [inline]
 worker_thread+0x958/0xed8 kernel/workqueue.c:3400
 kthread+0x5fc/0x75c kernel/kthread.c:464
 ret_from_fork+0x10/0x20 arch/arm64/kernel/entry.S:847

Fixes: a380b6cff1 ("Bluetooth: Add generic mgmt helper API")
Closes: https://syzkaller.appspot.com/bug?extid=0a7039d5d9986ff4ecec
Closes: https://syzkaller.appspot.com/bug?extid=cc0cc52e7f43dc9e6df1
Reported-by: syzbot+0a7039d5d9986ff4ecec@syzkaller.appspotmail.com
Tested-by: syzbot+0a7039d5d9986ff4ecec@syzkaller.appspotmail.com
Tested-by: syzbot+cc0cc52e7f43dc9e6df1@syzkaller.appspotmail.com
Signed-off-by: Dmitry Antipov <dmantipov@yandex.ru>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2025-06-05 14:54:57 -04:00
Luiz Augusto von Dentz
e6ed54e86a Bluetooth: MGMT: Fix UAF on mgmt_remove_adv_monitor_complete
This reworks MGMT_OP_REMOVE_ADV_MONITOR to not use mgmt_pending_add to
avoid crashes like bellow:

==================================================================
BUG: KASAN: slab-use-after-free in mgmt_remove_adv_monitor_complete+0xe5/0x540 net/bluetooth/mgmt.c:5406
Read of size 8 at addr ffff88801c53f318 by task kworker/u5:5/5341

CPU: 0 UID: 0 PID: 5341 Comm: kworker/u5:5 Not tainted 6.15.0-syzkaller-10402-g4cb6c8af8591 #0 PREEMPT(full)
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2~bpo12+1 04/01/2014
Workqueue: hci0 hci_cmd_sync_work
Call Trace:
 <TASK>
 dump_stack_lvl+0x189/0x250 lib/dump_stack.c:120
 print_address_description mm/kasan/report.c:408 [inline]
 print_report+0xd2/0x2b0 mm/kasan/report.c:521
 kasan_report+0x118/0x150 mm/kasan/report.c:634
 mgmt_remove_adv_monitor_complete+0xe5/0x540 net/bluetooth/mgmt.c:5406
 hci_cmd_sync_work+0x261/0x3a0 net/bluetooth/hci_sync.c:334
 process_one_work kernel/workqueue.c:3238 [inline]
 process_scheduled_works+0xade/0x17b0 kernel/workqueue.c:3321
 worker_thread+0x8a0/0xda0 kernel/workqueue.c:3402
 kthread+0x711/0x8a0 kernel/kthread.c:464
 ret_from_fork+0x3fc/0x770 arch/x86/kernel/process.c:148
 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245
 </TASK>

Allocated by task 5987:
 kasan_save_stack mm/kasan/common.c:47 [inline]
 kasan_save_track+0x3e/0x80 mm/kasan/common.c:68
 poison_kmalloc_redzone mm/kasan/common.c:377 [inline]
 __kasan_kmalloc+0x93/0xb0 mm/kasan/common.c:394
 kasan_kmalloc include/linux/kasan.h:260 [inline]
 __kmalloc_cache_noprof+0x230/0x3d0 mm/slub.c:4358
 kmalloc_noprof include/linux/slab.h:905 [inline]
 kzalloc_noprof include/linux/slab.h:1039 [inline]
 mgmt_pending_new+0x65/0x240 net/bluetooth/mgmt_util.c:252
 mgmt_pending_add+0x34/0x120 net/bluetooth/mgmt_util.c:279
 remove_adv_monitor+0x103/0x1b0 net/bluetooth/mgmt.c:5454
 hci_mgmt_cmd+0x9c9/0xef0 net/bluetooth/hci_sock.c:1719
 hci_sock_sendmsg+0x6ca/0xef0 net/bluetooth/hci_sock.c:1839
 sock_sendmsg_nosec net/socket.c:712 [inline]
 __sock_sendmsg+0x219/0x270 net/socket.c:727
 sock_write_iter+0x258/0x330 net/socket.c:1131
 new_sync_write fs/read_write.c:593 [inline]
 vfs_write+0x548/0xa90 fs/read_write.c:686
 ksys_write+0x145/0x250 fs/read_write.c:738
 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
 do_syscall_64+0xfa/0x3b0 arch/x86/entry/syscall_64.c:94
 entry_SYSCALL_64_after_hwframe+0x77/0x7f

Freed by task 5989:
 kasan_save_stack mm/kasan/common.c:47 [inline]
 kasan_save_track+0x3e/0x80 mm/kasan/common.c:68
 kasan_save_free_info+0x46/0x50 mm/kasan/generic.c:576
 poison_slab_object mm/kasan/common.c:247 [inline]
 __kasan_slab_free+0x62/0x70 mm/kasan/common.c:264
 kasan_slab_free include/linux/kasan.h:233 [inline]
 slab_free_hook mm/slub.c:2380 [inline]
 slab_free mm/slub.c:4642 [inline]
 kfree+0x18e/0x440 mm/slub.c:4841
 mgmt_pending_foreach+0xc9/0x120 net/bluetooth/mgmt_util.c:242
 mgmt_index_removed+0x10d/0x2f0 net/bluetooth/mgmt.c:9366
 hci_sock_bind+0xbe9/0x1000 net/bluetooth/hci_sock.c:1314
 __sys_bind_socket net/socket.c:1810 [inline]
 __sys_bind+0x2c3/0x3e0 net/socket.c:1841
 __do_sys_bind net/socket.c:1846 [inline]
 __se_sys_bind net/socket.c:1844 [inline]
 __x64_sys_bind+0x7a/0x90 net/socket.c:1844
 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
 do_syscall_64+0xfa/0x3b0 arch/x86/entry/syscall_64.c:94
 entry_SYSCALL_64_after_hwframe+0x77/0x7f

Fixes: 66bd095ab5 ("Bluetooth: advmon offload MSFT remove monitor")
Closes: https://syzkaller.appspot.com/bug?extid=feb0dc579bbe30a13190
Reported-by: syzbot+feb0dc579bbe30a13190@syzkaller.appspotmail.com
Tested-by: syzbot+feb0dc579bbe30a13190@syzkaller.appspotmail.com
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2025-06-05 14:54:35 -04:00
Paul Chaignon
6043b794c7 net: Fix checksum update for ILA adj-transport
During ILA address translations, the L4 checksums can be handled in
different ways. One of them, adj-transport, consist in parsing the
transport layer and updating any found checksum. This logic relies on
inet_proto_csum_replace_by_diff and produces an incorrect skb->csum when
in state CHECKSUM_COMPLETE.

This bug can be reproduced with a simple ILA to SIR mapping, assuming
packets are received with CHECKSUM_COMPLETE:

  $ ip a show dev eth0
  14: eth0@if15: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
      link/ether 62:ae:35:9e:0f:8d brd ff:ff:ff:ff:ff:ff link-netnsid 0
      inet6 3333:0:0:1::c078/64 scope global
         valid_lft forever preferred_lft forever
      inet6 fd00:10:244:1::c078/128 scope global nodad
         valid_lft forever preferred_lft forever
      inet6 fe80::60ae:35ff:fe9e:f8d/64 scope link proto kernel_ll
         valid_lft forever preferred_lft forever
  $ ip ila add loc_match fd00:10:244:1 loc 3333:0:0:1 \
      csum-mode adj-transport ident-type luid dev eth0

Then I hit [fd00:10:244:1::c078]:8000 with a server listening only on
[3333:0:0:1::c078]:8000. With the bug, the SYN packet is dropped with
SKB_DROP_REASON_TCP_CSUM after inet_proto_csum_replace_by_diff changed
skb->csum. The translation and drop are visible on pwru [1] traces:

  IFACE   TUPLE                                                        FUNC
  eth0:9  [fd00:10:244:3::3d8]:51420->[fd00:10:244:1::c078]:8000(tcp)  ipv6_rcv
  eth0:9  [fd00:10:244:3::3d8]:51420->[fd00:10:244:1::c078]:8000(tcp)  ip6_rcv_core
  eth0:9  [fd00:10:244:3::3d8]:51420->[fd00:10:244:1::c078]:8000(tcp)  nf_hook_slow
  eth0:9  [fd00:10:244:3::3d8]:51420->[fd00:10:244:1::c078]:8000(tcp)  inet_proto_csum_replace_by_diff
  eth0:9  [fd00:10:244:3::3d8]:51420->[3333:0:0:1::c078]:8000(tcp)     tcp_v6_early_demux
  eth0:9  [fd00:10:244:3::3d8]:51420->[3333:0:0:1::c078]:8000(tcp)     ip6_route_input
  eth0:9  [fd00:10:244:3::3d8]:51420->[3333:0:0:1::c078]:8000(tcp)     ip6_input
  eth0:9  [fd00:10:244:3::3d8]:51420->[3333:0:0:1::c078]:8000(tcp)     ip6_input_finish
  eth0:9  [fd00:10:244:3::3d8]:51420->[3333:0:0:1::c078]:8000(tcp)     ip6_protocol_deliver_rcu
  eth0:9  [fd00:10:244:3::3d8]:51420->[3333:0:0:1::c078]:8000(tcp)     raw6_local_deliver
  eth0:9  [fd00:10:244:3::3d8]:51420->[3333:0:0:1::c078]:8000(tcp)     ipv6_raw_deliver
  eth0:9  [fd00:10:244:3::3d8]:51420->[3333:0:0:1::c078]:8000(tcp)     tcp_v6_rcv
  eth0:9  [fd00:10:244:3::3d8]:51420->[3333:0:0:1::c078]:8000(tcp)     __skb_checksum_complete
  eth0:9  [fd00:10:244:3::3d8]:51420->[3333:0:0:1::c078]:8000(tcp)     kfree_skb_reason(SKB_DROP_REASON_TCP_CSUM)
  eth0:9  [fd00:10:244:3::3d8]:51420->[3333:0:0:1::c078]:8000(tcp)     skb_release_head_state
  eth0:9  [fd00:10:244:3::3d8]:51420->[3333:0:0:1::c078]:8000(tcp)     skb_release_data
  eth0:9  [fd00:10:244:3::3d8]:51420->[3333:0:0:1::c078]:8000(tcp)     skb_free_head
  eth0:9  [fd00:10:244:3::3d8]:51420->[3333:0:0:1::c078]:8000(tcp)     kfree_skbmem

This is happening because inet_proto_csum_replace_by_diff is updating
skb->csum when it shouldn't. The L4 checksum is updated such that it
"cancels" the IPv6 address change in terms of checksum computation, so
the impact on skb->csum is null.

Note this would be different for an IPv4 packet since three fields
would be updated: the IPv4 address, the IP checksum, and the L4
checksum. Two would cancel each other and skb->csum would still need
to be updated to take the L4 checksum change into account.

This patch fixes it by passing an ipv6 flag to
inet_proto_csum_replace_by_diff, to skip the skb->csum update if we're
in the IPv6 case. Note the behavior of the only other user of
inet_proto_csum_replace_by_diff, the BPF subsystem, is left as is in
this patch and fixed in the subsequent patch.

With the fix, using the reproduction from above, I can confirm
skb->csum is not touched by inet_proto_csum_replace_by_diff and the TCP
SYN proceeds to the application after the ILA translation.

Link: https://github.com/cilium/pwru [1]
Fixes: 65d7ab8de5 ("net: Identifier Locator Addressing module")
Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://patch.msgid.link/b5539869e3550d46068504feb02d37653d939c0b.1748509484.git.paul.chaignon@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-30 19:53:51 -07:00
Linus Torvalds
dd91b5e1d6 RDMA v6.16 merge window pull request
Usual collection of driver fixes:
 
 - Small bug fixes and cleansup in hfi, hns, rxe, mlx5, mana siw
 
 - Further ODP functionality in rxe
 
 - Remote access MRs in mana, along with more page sizes
 
 - Improve CM scalability with a rwlock around the agent
 
 - More trace points for hns
 
 - ODP hmm conversion to the new two step dma API
 
 - Support the ethernet HW device in mana as well as the RNIC
 
 - Cleanups:
  * Use secs_to_jiffies() when appropriate
  * Use ERR_CAST() instead of naked casts
  * Don't use %pK in printk
  * Unusued functions removed
  * Allocation type matching
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRRRCHOFoQz/8F5bUaFwuHvBreFYQUCaDm95gAKCRCFwuHvBreF
 YXJxAQCZ+p+mxt0rTeVI2j6YQ26thuvb/tH0Upu8epgdQ3T/ZgD/YOHBC6OrXWJa
 Uz6BTiyz/xiyMtJLTD4kEiG2o74J1gE=
 =DNQC
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma

Pull rdma updates from Jason Gunthorpe:
 "Usual collection of driver fixes:

   - Small bug fixes and cleansup in hfi, hns, rxe, mlx5, mana siw

   - Further ODP functionality in rxe

   - Remote access MRs in mana, along with more page sizes

   - Improve CM scalability with a rwlock around the agent

   - More trace points for hns

   - ODP hmm conversion to the new two step dma API

   - Support the ethernet HW device in mana as well as the RNIC

   - Cleanups:
       - Use secs_to_jiffies() when appropriate
       - Use ERR_CAST() instead of naked casts
       - Don't use %pK in printk
       - Unusued functions removed
       - Allocation type matching"

* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (57 commits)
  RDMA/cma: Fix hang when cma_netevent_callback fails to queue_work
  RDMA/bnxt_re: Support extended stats for Thor2 VF
  RDMA/hns: Fix endian issue in trace events
  RDMA/mlx5: Avoid flexible array warning
  IB/cm: Remove dead code and adjust naming
  RDMA/core: Avoid hmm_dma_map_alloc() for virtual DMA devices
  RDMA/rxe: Break endless pagefault loop for RO pages
  RDMA/bnxt_re: Fix return code of bnxt_re_configure_cc
  RDMA/bnxt_re: Fix missing error handling for tx_queue
  RDMA/bnxt_re: Fix incorrect display of inactivity_cp in debugfs output
  RDMA/mlx5: Add support for 200Gbps per lane speeds
  RDMA/mlx5: Remove the redundant MLX5_IB_STAGE_UAR stage
  RDMA/iwcm: Fix use-after-free of work objects after cm_id destruction
  net: mana: Add support for auxiliary device servicing events
  RDMA/mana_ib: unify mana_ib functions to support any gdma device
  RDMA/mana_ib: Add support of mana_ib for RNIC and ETH nic
  net: mana: Probe rdma device in mana driver
  RDMA/siw: replace redundant ternary operator with just rv
  RDMA/umem: Separate implicit ODP initialization from explicit ODP
  RDMA/core: Convert UMEM ODP DMA mapping to caching IOVA and page linkage
  ...
2025-05-30 10:18:56 -07:00
Haiyang Zhang
290e5d3c49 net: mana: Add support for Multi Vports on Bare metal
To support Multi Vports on Bare metal, increase the device config response
version. And, skip the register HW vport, and register filter steps, when
the Bare metal hostmode is set.

Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
Link: https://patch.msgid.link/1747671636-5810-1-git-send-email-haiyangz@microsoft.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-05-28 08:30:46 +02:00
Christoph Hellwig
33f1b3677a sctp: mark sctp_do_peeloff static
sctp_do_peeloff is only used inside of net/sctp/socket.c,
so mark it static.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Xin Long <lucien.xin@gmail.com>
Link: https://patch.msgid.link/20250526054745.2329201-1-hch@lst.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-27 18:18:55 -07:00
Michal Luczaj
5ec40864aa vsock: Move lingering logic to af_vsock core
Lingering should be transport-independent in the long run. In preparation
for supporting other transports, as well as the linger on shutdown(), move
code to core.

Generalize by querying vsock_transport::unsent_bytes(), guard against the
callback being unimplemented. Do not pass sk_lingertime explicitly. Pull
SOCK_LINGER check into vsock_linger().

Flatten the function. Remove the nested block by inverting the condition:
return early on !timeout.

Suggested-by: Stefano Garzarella <sgarzare@redhat.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Michal Luczaj <mhal@rbox.co>
Link: https://patch.msgid.link/20250522-vsock-linger-v6-2-2ad00b0e447e@rbox.co
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-05-27 11:05:21 +02:00
Jason Gunthorpe
ef2233850e Linux 6.15
-----BEGIN PGP SIGNATURE-----
 
 iQFSBAABCgA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAmgzoyMeHHRvcnZhbGRz
 QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiG0cEIAJrO2lKaFN4fbv6G
 FQTHQF1soicGpak3yY9u1o5LCqEIzjW2ScxcKG+dl7FcXsaZYcyg4HNzxbV9l/rr
 Ck2qZh3CCkVem0/nEsOJwYbNYKnq+pM5h1jIwn/LUkRuV55s5K5oRHzRj673BEj5
 BLaRFivZ1t4eM64EqbU1ut11/VEAkr2GcB01forHDeuWwoa3p6DfmALo7X/U43Vg
 FN2hp/3PPfiU6PwoCxQlmMpHNFkoZOHpi8P8Qm+mu0MQI12QrUC1Riib4EkrwEEv
 a28F4Au+TIjLceRdi6Ss/rhTC71usQIQ2OnnmHBUeYgdwHRXHgfewhtQDUKTU0MR
 OwKECbY=
 =skuS
 -----END PGP SIGNATURE-----

Merge tag 'v6.15' into rdma.git for-next

Following patches need the RDMA rc branch since we are past the RC cycle
now.

Merge conflicts resolved based on Linux-next:

- For RXE odp changes keep for-next version and fixup new places that
  need to call is_odp_mr()
  https://lore.kernel.org/r/20250422143019.500201bd@canb.auug.org.au
  https://lore.kernel.org/r/20250514122455.3593b083@canb.auug.org.au

- irdma is keeping the while/kfree bugfix from -rc and the pf/cdev_info
  change from for-next
  https://lore.kernel.org/r/20250513130630.280ee6c5@canb.auug.org.au

Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2025-05-26 15:33:52 -03:00
Paolo Abeni
f5b60d6a57 netfilter pull request 25-05-23
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEjF9xRqF1emXiQiqU1w0aZmrPKyEFAmgwd00ACgkQ1w0aZmrP
 KyEfwA//RXQ3i8PCa7lKHxDRhVzG3rEvgXRmiXeNd+JjzsCnybBb7+wRf3dtBGWT
 +1s44Utx1JqosWxCVBulqYC5bqSC66789l5X2jhYJmUZxRrbcsqPngwnIrjb/XeK
 ZJM62wiRhkBQED7yZLGy+y4VHQiG8CEMt16AOQHk863aruWv1tT7up90CTtzA545
 4GF/grU3FC0PsoTLwzWyvqsWK+9uk3Y4Tifp5hU3w6uRD9EjX5tHCZlXXSqOF5gu
 KT26OYsePYXhJVZIwDf2oVLGi0EVTPB9IFxZSNgLqyXqu2ILAb9OwRNVTNfTP7Pg
 1RWJWmgqvRNs9OM2ecifYgQf/AfvCL0Cja1BJOjmvtICuGegrYH7G5YYQsMl9CoE
 7jBoTzpToSASat5+dwoz81Bvzh447dYxRE2VmbxmRTTWToQYS1KGBPc9e3u/n5Rr
 ruh8tRZ3/R0Fy+YLDkrJst3grh5RLITbuyu4ElJMArPU50mLTVYxKd6nA3BqwB5G
 1GmLfCzvQH3e6PKz6CNke1AytVDy/wLTXtcbLnze2Muaj4AqhtOe5Q8ypnOO0Vyk
 PsJ6U3rm2asd3GE9+AIx8gZBv8yCu1w9CiwLK8ybT2NETb2dEnqPgWeDyT7rpcaD
 sQOPsBE1q/TEp9gofbYCHBm5E2mX9UP7Q6EHCTekrI97xLq8Q2M=
 =fBhd
 -----END PGP SIGNATURE-----

Merge tag 'nf-next-25-05-23' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next

Pablo Neira Ayuso says:

====================
Netfilter updates for net-next

The following batch contains Netfilter updates for net-next,
specifically 26 patches: 5 patches adding/updating selftests,
4 fixes, 3 PREEMPT_RT fixes, and 14 patches to enhance nf_tables):

1) Improve selftest coverage for pipapo 4 bit group format, from
   Florian Westphal.

2) Fix incorrect dependencies when compiling a kernel without
   legacy ip{6}tables support, also from Florian.

3) Two patches to fix nft_fib vrf issues, including selftest updates
   to improve coverage, also from Florian Westphal.

4) Fix incorrect nesting in nft_tunnel's GENEVE support, from
   Fernando F. Mancera.

5) Three patches to fix PREEMPT_RT issues with nf_dup infrastructure
   and nft_inner to match in inner headers, from Sebastian Andrzej Siewior.

6) Integrate conntrack information into nft trace infrastructure,
   from Florian Westphal.

7) A series of 13 patches to allow to specify wildcard netdevice in
   netdev basechain and flowtables, eg.

   table netdev filter {
       chain ingress {
           type filter hook ingress devices = { eth0, eth1, vlan* } priority 0; policy accept;
       }
   }

   This also allows for runtime hook registration on NETDEV_{UN}REGISTER
   event, from Phil Sutter.

netfilter pull request 25-05-23

* tag 'nf-next-25-05-23' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next: (26 commits)
  selftests: netfilter: Torture nftables netdev hooks
  netfilter: nf_tables: Add notifications for hook changes
  netfilter: nf_tables: Support wildcard netdev hook specs
  netfilter: nf_tables: Sort labels in nft_netdev_hook_alloc()
  netfilter: nf_tables: Handle NETDEV_CHANGENAME events
  netfilter: nf_tables: Wrap netdev notifiers
  netfilter: nf_tables: Respect NETDEV_REGISTER events
  netfilter: nf_tables: Prepare for handling NETDEV_REGISTER events
  netfilter: nf_tables: Have a list of nf_hook_ops in nft_hook
  netfilter: nf_tables: Pass nf_hook_ops to nft_unregister_flowtable_hook()
  netfilter: nf_tables: Introduce nft_register_flowtable_ops()
  netfilter: nf_tables: Introduce nft_hook_find_ops{,_rcu}()
  netfilter: nf_tables: Introduce functions freeing nft_hook objects
  netfilter: nf_tables: add packets conntrack state to debug trace info
  netfilter: conntrack: make nf_conntrack_id callable without a module dependency
  netfilter: nf_dup_netdev: Move the recursion counter struct netdev_xmit
  netfilter: nft_inner: Use nested-BH locking for nft_pcpu_tun_ctx
  netfilter: nf_dup{4, 6}: Move duplication check to task_struct
  netfilter: nft_tunnel: fix geneve_opt dump
  selftests: netfilter: nft_fib.sh: add type and oif tests with and without VRFs
  ...
====================

Link: https://patch.msgid.link/20250523132712.458507-1-pablo@netfilter.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-05-26 18:53:41 +02:00
Paolo Abeni
fdb061195f ipsec-next-2025-05-23
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEH7ZpcWbFyOOp6OJbrB3Eaf9PW7cFAmgwJa4ACgkQrB3Eaf9P
 W7d34A//V3NukN6UNAUKd+MbH80eXCEbNSNIuVUstfr0S71qTCxovLX58u+oQztb
 43mx/NsnF38TzNFWVyVzF4vcr/n0DS/Da3P5pJEjoewIYSDrz/WfOum6VpVIUsZ/
 kLCDZlIoX/fBPFZDPHMmsDXDemAdrtr8CuK72NUH10vKDuGKSUG0NElqDieDBEsA
 y/fqgBsyxQXi9cMdRxf+DLDK/hzqyaJmVj8B8WEcFtYXJ4RE6+jfLgAaTE6J7V5W
 fYACTu/IcdtgEEm2U7wlow66oIjqqGReuWUzV9zHGJNCB9+da6L4dbGtzlRmOPdn
 kI1PIALFWT2HbKnJOJJbaThO6zES1rMOm3PsWt7iVewCT8HuhAa9kDV0xzdcLQE1
 +REfo8dXW9f5hRUrSuqpJFUArkCHWHLhQEcmTHaF0b2RveC/hd9rOyKIfae+fgIP
 5uLU2DpwafDgw5UCjsQTLyQ5M6icO8wFgM7vKAUJWyI1Pck1ktf7Ic6+KQRNjWiv
 Q7ImwpSdLH2bZpIbIKDnIcyZg3CMBIQ88cdsYi0+ckgDQ0hMf6ZrXRseXKRe0P/M
 gKgBOoXIJBF7niJQTDqHjsmnYGvvhZysIJNQLf4BZFYOeF5L9OduP6ywqMe5pFKt
 QAsJSZw/+SibheLEYQAzvyLD6VdMXaxeOAHlPylRRpl9vEX0l04=
 =GRVJ
 -----END PGP SIGNATURE-----

Merge tag 'ipsec-next-2025-05-23' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next

Steffen Klassert says:

====================
1) Remove some unnecessary strscpy_pad() size arguments.
   From Thorsten Blum.

2) Correct use of xso.real_dev on bonding offloads.
   Patchset from Cosmin Ratiu.

3) Add hardware offload configuration to XFRM_MSG_MIGRATE.
   From Chiachang Wang.

4) Refactor migration setup during cloning. This was
   done after the clone was created. Now it is done
   in the cloning function itself.
   From Chiachang Wang.

5) Validate assignment of maximal possible SEQ number.
   Prevent from setting to the maximum sequrnce number
   as this would cause for traffic drop.
   From Leon Romanovsky.

6) Prevent configuration of interface index when offload
   is used. Hardware can't handle this case.i
   From Leon Romanovsky.

7) Always use kfree_sensitive() for SA secret zeroization.
   From Zilin Guan.

ipsec-next-2025-05-23

* tag 'ipsec-next-2025-05-23' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next:
  xfrm: use kfree_sensitive() for SA secret zeroization
  xfrm: prevent configuration of interface index when offload is used
  xfrm: validate assignment of maximal possible SEQ number
  xfrm: Refactor migration setup during the cloning process
  xfrm: Migrate offload configuration
  bonding: Fix multiple long standing offload races
  bonding: Mark active offloaded xfrm_states
  xfrm: Add explicit dev to .xdo_dev_state_{add,delete,free}
  xfrm: Remove unneeded device check from validate_xmit_xfrm
  xfrm: Use xdo.dev instead of xdo.real_dev
  net/mlx5: Avoid using xso.real_dev unnecessarily
  xfrm: Remove unnecessary strscpy_pad() size arguments
====================

Link: https://patch.msgid.link/20250523075611.3723340-1-steffen.klassert@secunet.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-05-26 18:32:48 +02:00
Qiu Yutan
e45b7196df net: neigh: use kfree_skb_reason() in neigh_resolve_output() and neigh_connected_output()
Replace kfree_skb() used in neigh_resolve_output() and
neigh_connected_output() with kfree_skb_reason().

Following new skb drop reason is added:
/* failed to fill the device hard header */
SKB_DROP_REASON_NEIGH_HH_FILLFAIL

Signed-off-by: Qiu Yutan <qiu.yutan@zte.com.cn>
Signed-off-by: Jiang Kun <jiang.kun2@zte.com.cn>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Xu Xin <xu.xin16@zte.com.cn>
Signed-off-by: David S. Miller <davem@davemloft.net>
2025-05-26 10:03:13 +01:00
Phil Sutter
465b9ee0ee netfilter: nf_tables: Add notifications for hook changes
Notify user space if netdev hooks are updated due to netdev add/remove
events. Send minimal notification messages by introducing
NFT_MSG_NEWDEV/DELDEV message types describing a single device only.

Upon NETDEV_CHANGENAME, the callback has no information about the
interface's old name. To provide a clear message to user space, include
the hook's stored interface name in the notification.

Signed-off-by: Phil Sutter <phil@nwl.cc>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2025-05-23 13:57:14 +02:00
Phil Sutter
73319a8ee1 netfilter: nf_tables: Have a list of nf_hook_ops in nft_hook
Supporting a 1:n relationship between nft_hook and nf_hook_ops is
convenient since a chain's or flowtable's nft_hooks may remain in place
despite matching interfaces disappearing. This stabilizes ruleset dumps
in that regard and opens the possibility to claim newly added interfaces
which match the spec. Also it prepares for wildcard interface specs
since these will potentially match multiple interfaces.

All spots dealing with hook registration are updated to handle a list of
multiple nf_hook_ops, but nft_netdev_hook_alloc() only adds a single
item for now to retain the old behaviour. The only expected functional
change here is how vanishing interfaces are handled: Instead of dropping
the respective nft_hook, only the matching nf_hook_ops are dropped.

To safely remove individual ops from the list in netdev handlers, an
rcu_head is added to struct nf_hook_ops so kfree_rcu() may be used.
There is at least nft_flowtable_find_dev() which may be iterating
through the list at the same time.

Signed-off-by: Phil Sutter <phil@nwl.cc>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2025-05-23 13:57:13 +02:00
Phil Sutter
e225376d78 netfilter: nf_tables: Introduce nft_hook_find_ops{,_rcu}()
Also a pretty dull wrapper around the hook->ops.dev comparison for now.
Will search the embedded nf_hook_ops list in future. The ugly cast to
eliminate the const qualifier will vanish then, too.

Since this future list will be RCU-protected, also introduce an _rcu()
variant here.

Signed-off-by: Phil Sutter <phil@nwl.cc>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2025-05-23 13:57:12 +02:00
Florian Westphal
9a119669fb netfilter: nf_tables: nft_fib: consistent l3mdev handling
fib has two modes:
1. Obtain output device according to source or destination address
2. Obtain the type of the address, e.g. local, unicast, multicast.

'fib daddr type' should return 'local' if the address is configured
in this netns or unicast otherwise.

'fib daddr . iif type' should return 'local' if the address is configured
on the input interface or unicast otherwise, i.e. more restrictive.

However, if the interface is part of a VRF, then 'fib daddr type'
returns unicast even if the address is configured on the incoming
interface.

This is broken for both ipv4 and ipv6.

In the ipv4 case, inet_dev_addr_type must only be used if the
'iif' or 'oif' (strict mode) was requested.

Else inet_addr_type_dev_table() needs to be used and the correct
dev argument must be passed as well so the correct fib (vrf) table
is used.

In the ipv6 case, the bug is similar, without strict mode, dev is NULL
so .flowi6_l3mdev will be set to 0.

Add a new 'nft_fib_l3mdev_master_ifindex_rcu()' helper and use that
to init the .l3mdev structure member.

For ipv6, use it from nft_fib6_flowi_init() which gets called from
both the 'type' and the 'route' mode eval functions.

This provides consistent behaviour for all modes for both ipv4 and ipv6:
If strict matching is requested, the input respectively output device
of the netfilter hooks is used.

Otherwise, use skb->dev to obtain the l3mdev ifindex.

Without this, most type checks in updated nft_fib.sh selftest fail:

  FAIL: did not find veth0 . 10.9.9.1 . local in fibtype4
  FAIL: did not find veth0 . dead:1::1 . local in fibtype6
  FAIL: did not find veth0 . dead:9::1 . local in fibtype6
  FAIL: did not find tvrf . 10.0.1.1 . local in fibtype4
  FAIL: did not find tvrf . 10.9.9.1 . local in fibtype4
  FAIL: did not find tvrf . dead:1::1 . local in fibtype6
  FAIL: did not find tvrf . dead:9::1 . local in fibtype6
  FAIL: fib expression address types match (iif in vrf)

(fib errounously returns 'unicast' for all of them, even
 though all of these addresses are local to the vrf).

Fixes: f6d0cbcf09 ("netfilter: nf_tables: add fib expression")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2025-05-23 13:57:09 +02:00
Kuniyuki Iwashima
77cbe1a6d8 af_unix: Introduce SO_PASSRIGHTS.
As long as recvmsg() or recvmmsg() is used with cmsg, it is not
possible to avoid receiving file descriptors via SCM_RIGHTS.

This behaviour has occasionally been flagged as problematic, as
it can be (ab)used to trigger DoS during close(), for example, by
passing a FUSE-controlled fd or a hung NFS fd.

For instance, as noted on the uAPI Group page [0], an untrusted peer
could send a file descriptor pointing to a hung NFS mount and then
close it.  Once the receiver calls recvmsg() with msg_control, the
descriptor is automatically installed, and then the responsibility
for the final close() now falls on the receiver, which may result
in blocking the process for a long time.

Regarding this, systemd calls cmsg_close_all() [1] after each
recvmsg() to close() unwanted file descriptors sent via SCM_RIGHTS.

However, this cannot work around the issue at all, because the final
fput() may still occur on the receiver's side once sendmsg() with
SCM_RIGHTS succeeds.  Also, even filtering by LSM at recvmsg() does
not work for the same reason.

Thus, we need a better way to refuse SCM_RIGHTS at sendmsg().

Let's introduce SO_PASSRIGHTS to disable SCM_RIGHTS.

Note that this option is enabled by default for backward
compatibility.

Link: https://uapi-group.org/kernel-features/#disabling-reception-of-scm_rights-for-af_unix-sockets #[0]
Link: https://github.com/systemd/systemd/blob/v257.5/src/basic/fd-util.c#L612-L628 #[1]
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2025-05-23 10:24:18 +01:00
Kuniyuki Iwashima
0e81cfd971 af_unix: Move SOCK_PASS{CRED,PIDFD,SEC} to struct sock.
As explained in the next patch, SO_PASSRIGHTS would have a problem
if we assigned a corresponding bit to socket->flags, so it must be
managed in struct sock.

Mixing socket->flags and sk->sk_flags for similar options will look
confusing, and sk->sk_flags does not have enough space on 32bit system.

Also, as mentioned in commit 16e5726269 ("af_unix: dont send
SCM_CREDENTIALS by default"), SOCK_PASSCRED and SOCK_PASSPID handling
is known to be slow, and managing the flags in struct socket cannot
avoid that for embryo sockets.

Let's move SOCK_PASS{CRED,PIDFD,SEC} to struct sock.

While at it, other SOCK_XXX flags in net.h are grouped as enum.

Note that assign_bit() was atomic, so the writer side is moved down
after lock_sock() in setsockopt(), but the bit is only read once
in sendmsg() and recvmsg(), so lock_sock() is not needed there.

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2025-05-23 10:24:18 +01:00
Kuniyuki Iwashima
7d8d93fdde net: Restrict SO_PASS{CRED,PIDFD,SEC} to AF_{UNIX,NETLINK,BLUETOOTH}.
SCM_CREDENTIALS and SCM_SECURITY can be recv()ed by calling
scm_recv() or scm_recv_unix(), and SCM_PIDFD is only used by
scm_recv_unix().

scm_recv() is called from AF_NETLINK and AF_BLUETOOTH.

scm_recv_unix() is literally called from AF_UNIX.

Let's restrict SO_PASSCRED and SO_PASSSEC to such sockets and
SO_PASSPIDFD to AF_UNIX only.

Later, SOCK_PASS{CRED,PIDFD,SEC} will be moved to struct sock
and united with another field.

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2025-05-23 10:24:18 +01:00
Kuniyuki Iwashima
38b95d588f scm: Move scm_recv() from scm.h to scm.c.
scm_recv() has been placed in scm.h since the pre-git era for no
particular reason (I think), which makes the file really fragile.

For example, when you move SOCK_PASSCRED from include/linux/net.h to
enum sock_flags in include/net/sock.h, you will see weird build failure
due to terrible dependency.

To avoid the build failure in the future, let's move scm_recv(_unix())?
and its callees to scm.c.

Note that only scm_recv() needs to be exported for Bluetooth.

scm_send() should be moved to scm.c too, but I'll revisit later.

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2025-05-23 10:24:18 +01:00
Jakub Kicinski
ea15e04626 Lots of new things, notably:
* ath12k: monitor mode for WCN7850, better 6 GHz regulatory
  * brcmfmac: SAE for some Cypress devices
  * iwlwifi: rework device configuration
  * mac80211: scan improvements with MLO
  * mt76: EHT improvements, new device IDs
  * rtw88: throughput improvements
  * rtw89: MLO, STA/P2P concurrency improvements, SAR
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEpeA8sTs3M8SN2hR410qiO8sPaAAFAmgvVpAACgkQ10qiO8sP
 aADeJg//dShJQPKKUw7s4qY9y0lr1kimFw7cKE1vhHAq0eyQE8VP/05sj7XkeLdO
 2MDFCmWmTRZW1Av925xhicEhdiggxdOaT3n3RQ82y+Vjx7+6BsqqRE0YVRmK28vM
 MhUQocSzbd+Gh75wd4ti8G8dDPRJ9sbLTlZhIqPXMth2Ljl9EklMNzOlhzfo8N8+
 TgZ8oJx0EZ2n+sObtI5US27rNiPzLCtAM10Nl03F5yxfSk7gh3UpLHFhmu7384Nx
 56qqMwsmHHQSaRudg1ls8p30ztwve8/zHkOM6UeVksbb7CS2GHoPoVFtJUWBYmn9
 Ckd/XNItniRmIbsABgOyybawJV7EKZAWHclffeICQc526VMZWxeD9xukxQZSykiu
 3YXbHbPUkaCi3MlC3arc8SNQpW2l/BQvrC0SHqds4r/h/j4yUbA0wLs0OwqNXXwh
 NFoXnPTlkhMjNcX0W0t/A+EzXt/EsGKjBasiWC/tVZG9gHpMWKO3G8kLKmDyN/i9
 NsUh7E7zJTBjYjS2Bhm4xGmSy3DdgKSkBV2d4qCG/0LoKAW2eIRw97DGyn0pfVlA
 BAmio94xDJ3AM565WeySmWi/ZqvuDgQy2rd3J1ji0F/QDhdwqUdIHXy9C3VbN9zB
 TegAIgDjnqpOLzCn2P2FzWZlXwcXsxG13XMvqr2DfhBZtNUmRsw=
 =NXiI
 -----END PGP SIGNATURE-----

Merge tag 'wireless-next-2025-05-22' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next

Johannes Berg says:

====================
Lots of new things, notably:
 * ath12k: monitor mode for WCN7850, better 6 GHz regulatory
 * brcmfmac: SAE for some Cypress devices
 * iwlwifi: rework device configuration
 * mac80211: scan improvements with MLO
 * mt76: EHT improvements, new device IDs
 * rtw88: throughput improvements
 * rtw89: MLO, STA/P2P concurrency improvements, SAR

* tag 'wireless-next-2025-05-22' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next: (389 commits)
  wifi: mt76: mt7925: add rfkill_poll for hardware rfkill
  wifi: mt76: support power delta calculation for 5 TX paths
  wifi: mt76: fix available_antennas setting
  wifi: mt76: mt7996: fix RX buffer size of MCU event
  wifi: mt76: mt7996: change max beacon size
  wifi: mt76: mt7996: fix invalid NSS setting when TX path differs from NSS
  wifi: mt76: mt7996: drop fragments with multicast or broadcast RA
  wifi: mt76: mt7996: set EHT max ampdu length capability
  wifi: mt76: mt7996: fix beamformee SS field
  wifi: mt76: remove capability of partial bandwidth UL MU-MIMO
  wifi: mt76: mt7925: add test mode support
  wifi: mt76: mt7925: extend MCU support for testmode
  wifi: mt76: mt7925: ensure all MCU commands wait for response
  wifi: mt76: mt7925: refine the sniffer commnad
  wifi: mt76: mt7925: prevent multiple scan commands
  wifi: mt76: mt7915: Fix null-ptr-deref in mt7915_mmio_wed_init()
  wifi: mt76: mt7996: Fix null-ptr-deref in mt7996_mmio_wed_init()
  wifi: mt76: mt7925: add RNR scan support for 6GHz
  wifi: mt76: add mt76_connac_mcu_build_rnr_scan_param routine
  wifi: mt76: scan: Fix 'mlink' dereferenced before IS_ERR_OR_NULL check
  ...
====================

Link: https://patch.msgid.link/20250522165501.189958-50-johannes@sipsolutions.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-22 14:05:18 -07:00
Jakub Kicinski
43a1ce8f42 bluetooth-next pull request for net-next:
core:
 
  - Add support for SIOCETHTOOL ETHTOOL_GET_TS_INFO
  - Separate CIS_LINK and BIS_LINK link types
  - Introduce HCI Driver protocol
 
 drivers:
 
  - btintel_pcie: Do not generate coredump for diagnostic events
  - btusb: Add HCI Drv commands for configuring altsetting
  - btusb: Add RTL8851BE device 0x0bda:0xb850
  - btusb: Add new VID/PID 13d3/3584 for MT7922
  - btusb: Add new VID/PID 13d3/3630 and 13d3/3613 for MT7925
  - btnxpuart: Implement host-wakeup feature
 -----BEGIN PGP SIGNATURE-----
 
 iQJNBAABCgA3FiEE7E6oRXp8w05ovYr/9JCA4xAyCykFAmgvWiwZHGx1aXoudm9u
 LmRlbnR6QGludGVsLmNvbQAKCRD0kIDjEDILKezlD/4uKp4yrCPAO/tO0FFvh752
 7oVmBzqe6GDunl+Isz6/GSWc5sD0OVdhMg7QL+zhi3hjluyGh9N3rUE9Qw/Q3h8Q
 JkMXWAVNHq+Dr88RqLVro335D2XP8mgiTLEKwSDh5Fdip3xOz+itoQZI5wYqriQg
 exNU1l04ZzrwMWicAJULvFFPz9q/556cUq0x9k7OJ6GaHOmUQ0Y7BPMFAQ0/uHAA
 8Y9qiXlJQKzeYDz9rUvAf6Gd+21k0cAU4QSYt+ZDLGBAuH0iK4zgu56uiHadVLRb
 bm5hlO/lrUD7Hw/swSJ2wZYMKpPINPP6Cr2kpC66kmXZYWx7YJaQQCN8GEtwbEVh
 t3q9Y7zQXjppQ/tIG/WJuWlZ84DiWsm5na3k/q61LfihQ5VPL96RtlJKXD492Dxz
 vFXRFN5F6lMcDP5Ujji6S8O0H5P1bDz9XbITcGHxEDjAbOnThBND7g+10mmZ1MRw
 GWQTnnsrYaU+gaUdj9Nr5o7kPp2KSXvGkG8F407RDvF2fjbbwTNEQgkt7vF9CbPN
 KkJAwnPM+JhSuxGaIVcKpoKJ2gZA/fNXjr9d6hD6v/U+SksNsMovxtdZMaL4Mx/n
 gV8W7RwhUNeJ8NvneHJ12bRhtb7x/IYJsQ6ARgDTenNlSxd56uc3Zt7nmDdIvUfF
 BuEJDucjnJLnCsrUj6MNuw==
 =ZrGM
 -----END PGP SIGNATURE-----

Merge tag 'for-net-next-2025-05-22' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next

Luiz Augusto von Dentz says:

====================
bluetooth-next pull request for net-next:

core:
 - Add support for SIOCETHTOOL ETHTOOL_GET_TS_INFO
 - Separate CIS_LINK and BIS_LINK link types
 - Introduce HCI Driver protocol

drivers:
 - btintel_pcie: Do not generate coredump for diagnostic events
 - btusb: Add HCI Drv commands for configuring altsetting
 - btusb: Add RTL8851BE device 0x0bda:0xb850
 - btusb: Add new VID/PID 13d3/3584 for MT7922
 - btusb: Add new VID/PID 13d3/3630 and 13d3/3613 for MT7925
 - btnxpuart: Implement host-wakeup feature

* tag 'for-net-next-2025-05-22' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next: (23 commits)
  Bluetooth: btintel: Check dsbr size from EFI variable
  Bluetooth: MGMT: iterate over mesh commands in mgmt_mesh_foreach()
  Bluetooth: btusb: Add new VID/PID 13d3/3584 for MT7922
  Bluetooth: btusb: use skb_pull to avoid unsafe access in QCA dump handling
  Bluetooth: L2CAP: Fix not checking l2cap_chan security level
  Bluetooth: separate CIS_LINK and BIS_LINK link types
  Bluetooth: btusb: Add new VID/PID 13d3/3630 for MT7925
  Bluetooth: add support for SIOCETHTOOL ETHTOOL_GET_TS_INFO
  Bluetooth: btintel_pcie: Dump debug registers on error
  Bluetooth: ISO: Fix getpeername not returning sockaddr_iso_bc fields
  Bluetooth: ISO: Fix not using SID from adv report
  Revert "Bluetooth: btusb: add sysfs attribute to control USB alt setting"
  Revert "Bluetooth: btusb: Configure altsetting for HCI_USER_CHANNEL"
  Bluetooth: btusb: Add HCI Drv commands for configuring altsetting
  Bluetooth: Introduce HCI Driver protocol
  Bluetooth: btnxpuart: Implement host-wakeup feature
  dt-bindings: net: bluetooth: nxp: Add support for host-wakeup
  Bluetooth: btusb: Add RTL8851BE device 0x0bda:0xb850
  Bluetooth: hci_uart: Remove unnecessary NULL check before release_firmware()
  Bluetooth: btmtksdio: Fix wakeup source leaks on device unbind
  ...
====================

Link: https://patch.msgid.link/20250522171048.3307873-1-luiz.dentz@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-22 13:46:13 -07:00
Jakub Kicinski
33e1b1b399 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR (net-6.15-rc8).

Conflicts:
  80f2ab46c2 ("irdma: free iwdev->rf after removing MSI-X")
  4bcc063939 ("ice, irdma: fix an off by one in error handling code")
  c24a65b6a2 ("iidc/ice/irdma: Update IDC to support multiple consumers")
https://lore.kernel.org/20250513130630.280ee6c5@canb.auug.org.au

No extra adjacent changes.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-22 09:42:41 -07:00
Paolo Abeni
bd2ec34d00 ipsec-2025-05-21
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEH7ZpcWbFyOOp6OJbrB3Eaf9PW7cFAmgtYgEACgkQrB3Eaf9P
 W7e1ag//da84UIRyJwMfDO4Y3MXDNPslNSDuq0HuvwRtdLIBLFtwitSzU1uhKsxY
 yn5v7RSsxvp6lXW2RT+Ycor2qZ/mGHJsHcVfG7m0YjxH6unw7yzjqn5LNNzRbYN4
 NcD8P0skuX6d80EFPUB3Hsnmdj1VKR62OsWyk3rAPb4CLBVKJt9OsseVfN4bn1R0
 TaZSIkdh5EDGYXTBKb49jc8LFfQo7+uVg/AjtZ/2ZsWt+Qgw3XevTIcwLokH00rt
 GzXcLjC1g+b6TeVncOuD1oiNJUtQVGYV23t2yQlk9k2HFzCdNnq0YM9pzawwiI+l
 icBV2X/QFjhdCRkvJRF4dkXq/4tnnEmYoY/1vSOoWR9VmY2u8Lr3VRiDD/h0gYJT
 KXd8YPMtZLDnLgmH+DwWbv4vdLtHvQTmB8XFzb/4VN6Ikucenry3loJsUsLnS+Je
 t1/7unLrg9yyJC6UPzweqjAx+6VgZvem/M5kejIVxHpk+Wg2dXGZ2jz4fsVuZYPB
 dMLj1h1MLn4gOt2b/bdI2do0C+p2R1axrTNw+RiqwCrb1h5Ey+7RAhWyXyaHUEs3
 1brMAgOcvdbaaeSIpoHJ8eJx/PgRxDrxRnUC3HjCGPNApYQXC3FM3POk7wwJ9C0i
 odlHrq+yOdzLCZyU+YKdR1q3kPq9AWpUSmc4Olg359OQ9IxDGQw=
 =bgyq
 -----END PGP SIGNATURE-----

Merge tag 'ipsec-2025-05-21' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec

Steffen Klassert says:

====================
pull request (net): ipsec 2025-05-21

1) Fix some missing kfree_skb in the error paths of espintcp.
   From Sabrina Dubroca.

2) Fix a reference leak in espintcp.
   From Sabrina Dubroca.

3) Fix UDP GRO handling for ESPINUDP.
   From Tobias Brunner.

4) Fix ipcomp truesize computation on the receive path.
   From Sabrina Dubroca.

5) Sanitize marks before policy/state insertation.
   From Paul Chaignon.

* tag 'ipsec-2025-05-21' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec:
  xfrm: Sanitize marks before insert
  xfrm: ipcomp: fix truesize computation on receive
  xfrm: Fix UDP GRO handling for some corner cases
  espintcp: remove encap socket caching to avoid reference leak
  espintcp: fix skb leaks
====================

Link: https://patch.msgid.link/20250521054348.4057269-1-steffen.klassert@secunet.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-05-22 11:49:53 +02:00
Eric Biggers
70c96c7cb9 net: fold __skb_checksum() into skb_checksum()
Now that the only remaining caller of __skb_checksum() is
skb_checksum(), fold __skb_checksum() into skb_checksum().  This makes
struct skb_checksum_ops unnecessary, so remove that too and simply do
the "regular" net checksum.  It also makes the wrapper functions
csum_partial_ext() and csum_block_add_ext() unnecessary, so remove those
too and just use the underlying functions.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://patch.msgid.link/20250519175012.36581-7-ebiggers@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-21 15:40:16 -07:00
Eric Biggers
99de9d4022 sctp: use skb_crc32c() instead of __skb_checksum()
Make sctp_compute_cksum() just use the new function skb_crc32c(),
instead of calling __skb_checksum() with a skb_checksum_ops struct that
does CRC32C.  This is faster and simpler.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://patch.msgid.link/20250519175012.36581-6-ebiggers@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-21 15:40:16 -07:00
Pauli Virtanen
23205562ff Bluetooth: separate CIS_LINK and BIS_LINK link types
Use separate link type id for unicast and broadcast ISO connections.
These connection types are handled with separate HCI commands, socket
API is different, and hci_conn has union fields that are different in
the two cases, so they shall not be mixed up.

Currently in most places it is attempted to distinguish ucast by
bacmp(&c->dst, BDADDR_ANY) but it is wrong as dst is set for bcast sink
hci_conn in iso_conn_ready(). Additionally checking sync_handle might be
OK, but depends on details of bcast conn configuration flow.

To avoid complicating it, use separate link types.

Fixes: f764a6c2c1 ("Bluetooth: ISO: Add broadcast support")
Signed-off-by: Pauli Virtanen <pav@iki.fi>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2025-05-21 10:29:28 -04:00
Pauli Virtanen
dd0ccf8580 Bluetooth: add support for SIOCETHTOOL ETHTOOL_GET_TS_INFO
Bluetooth needs some way for user to get supported so_timestamping flags
for the different socket types.

Use SIOCETHTOOL API for this purpose. As hci_dev is not associated with
struct net_device, the existing implementation can't be reused, so we
add a small one here.

Add support (only) for ETHTOOL_GET_TS_INFO command. The API differs
slightly from netdev in that the result depends also on socket type.

Signed-off-by: Pauli Virtanen <pav@iki.fi>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2025-05-21 10:28:51 -04:00
Hsin-chen Chuang
04425292a6 Bluetooth: Introduce HCI Driver protocol
Although commit 75ddcd5ad4 ("Bluetooth: btusb: Configure altsetting
for HCI_USER_CHANNEL") has enabled the HCI_USER_CHANNEL user to send out
SCO data through USB Bluetooth chips, it's observed that with the patch
HFP is flaky on most of the existing USB Bluetooth controllers: Intel
chips sometimes send out no packet for Transparent codec; MTK chips may
generate SCO data with a wrong handle for CVSD codec; RTK could split
the data with a wrong packet size for Transparent codec; ... etc.

To address the issue above one needs to reset the altsetting back to
zero when there is no active SCO connection, which is the same as the
BlueZ behavior, and another benefit is the bus doesn't need to reserve
bandwidth when no SCO connection.

This patch adds the infrastructure that allow the user space program to
talk to Bluetooth drivers directly:
- Define the new packet type HCI_DRV_PKT which is specifically used for
  communication between the user space program and the Bluetooth drviers
- hci_send_frame intercepts the packets and invokes drivers' HCI Drv
  callbacks (so far only defined for btusb)
- 2 kinds of events to user space: Command Status and Command Complete,
  the former simply returns the status while the later may contain
  additional response data.

Cc: chromeos-bluetooth-upstreaming@chromium.org
Fixes: b16b327edb ("Bluetooth: btusb: add sysfs attribute to control USB alt setting")
Signed-off-by: Hsin-chen Chuang <chharry@chromium.org>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2025-05-21 10:28:07 -04:00
Bert Karwatzki
d7500fbfb1 wifi: check if socket flags are valid
Checking the SOCK_WIFI_STATUS flag bit in sk_flags may give wrong results
since sk_flags are part of a union and the union is used otherwise. Add
sk_requests_wifi_status() which checks if sk is non-NULL, sk is a full
socket (so flags are valid) and checks the flag bit.

Fixes: 76a853f86c ("wifi: free SKBTX_WIFI_STATUS skb tx_flags flag")
Suggested-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Bert Karwatzki <spasswolf@web.de>
Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
Link: https://patch.msgid.link/20250520223430.6875-1-spasswolf@web.de
[edit commit message, fix indentation]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-05-21 09:26:22 +02:00
Kuniyuki Iwashima
f0a56c17e6 inet: Remove rtnl_is_held arg of lwtunnel_valid_encap_type(_attr)?().
Commit f130a0cc1b ("inet: fix lwtunnel_valid_encap_type() lock
imbalance") added the rtnl_is_held argument as a temporary fix while
I'm converting nexthop and IPv6 routing table to per-netns RTNL or RCU.

Now all callers of lwtunnel_valid_encap_type() do not hold RTNL.

Let's remove the argument.

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250516022759.44392-3-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-20 19:18:24 -07:00
Michael Chan
aed031da7e bnxt_en: Fix netdev locking in ULP IRQ functions
netdev_lock is already held when calling bnxt_ulp_irq_stop() and
bnxt_ulp_irq_restart().  When converting rtnl_lock to netdev_lock,
the original code was rtnl_dereference() to indicate that rtnl_lock
was already held.  rcu_dereference_protected() is the correct
conversion after replacing rtnl_lock with netdev_lock.

Add a new helper netdev_lock_dereference() similar to
rtnl_dereference().

Fixes: 004b500801 ("eth: bnxt: remove most dependencies on RTNL")
Reviewed-by: Andy Gospodarek <andrew.gospodarek@broadcom.com>
Reviewed-by: Pavan Chebbi <pavan.chebbi@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250519204130.3097027-2-michael.chan@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-20 18:52:11 -07:00
Jakub Kicinski
4c2bd7913f net: let lockdep compare instance locks
AFAIU always returning -1 from lockdep's compare function
basically disables checking of dependencies between given
locks. Try to be a little more precise about what guarantees
that instance locks won't deadlock.

Right now we only nest them under protection of rtnl_lock.
Mostly in unregister_netdevice_many() and dev_close_many().

Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250517200810.466531-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-20 18:14:33 -07:00
Gur Stavi
84b21e61eb queue_api: reduce risk of name collision over txq
Rename local variable in macros from txq to _txq.
When macro parameter get_desc is expended it is likely to have a txq
token that refers to a different txq variable at the caller's site.

Signed-off-by: Gur Stavi <gur.stavi@huawei.com>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Link: https://patch.msgid.link/95b60d218f004308486d92ed17c8cc6f28bac09d.1747559621.git.gur.stavi@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-19 20:09:02 -07:00
Eric Dumazet
9cd5ef0b8c net: rfs: add sock_rps_delete_flow() helper
RFS can exhibit lower performance for workloads using short-lived
flows and a small set of 4-tuple.

This is often the case for load-testers, using a pair of hosts,
if the server has a single listener port.

Typical use case :

Server : tcp_crr -T128 -F1000 -6 -U -l30 -R 14250
Client : tcp_crr -T128 -F1000 -6 -U -l30 -c -H server | grep local_throughput

This is because RFS global hash table contains stale information,
when the same RSS key is recycled for another socket and another cpu.

Make sure to undo the changes and go back to initial state when
a flow is disconnected.

Performance of the above test is increased by 22 %,
going from 372604 transactions per second to 457773.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Octavian Purdila <tavip@google.com>
Reviewed-by: Neal Cardwell <ncardwell@google.com>
Link: https://patch.msgid.link/20250515100354.3339920-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-16 16:03:48 -07:00
Jakub Kicinski
bebd7b2626 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR (net-6.15-rc7).

Conflicts:

tools/testing/selftests/drivers/net/hw/ncdevmem.c
  97c4e094a4 ("tests/ncdevmem: Fix double-free of queue array")
  2f1a805f32 ("selftests: ncdevmem: Implement devmem TCP TX")
https://lore.kernel.org/20250514122900.1e77d62d@canb.auug.org.au

Adjacent changes:

net/core/devmem.c
net/core/devmem.h
  0afc44d8cd ("net: devmem: fix kernel panic when netlink socket close after module unload")
  bd61848900 ("net: devmem: Implement TX path")

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-15 11:28:30 -07:00
Mina Almasry
383faec0fd net: enable driver support for netmem TX
Drivers need to make sure not to pass netmem dma-addrs to the
dma-mapping API in order to support netmem TX.

Add helpers and netmem_dma_*() helpers that enables special handling of
netmem dma-addrs that drivers can use.

Document in netmem.rst what drivers need to do to support netmem TX.

Signed-off-by: Mina Almasry <almasrymina@google.com>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20250508004830.4100853-7-almasrymina@google.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-05-13 11:12:49 +02:00
Mina Almasry
bd61848900 net: devmem: Implement TX path
Augment dmabuf binding to be able to handle TX. Additional to all the RX
binding, we also create tx_vec needed for the TX path.

Provide API for sendmsg to be able to send dmabufs bound to this device:

- Provide a new dmabuf_tx_cmsg which includes the dmabuf to send from.
- MSG_ZEROCOPY with SCM_DEVMEM_DMABUF cmsg indicates send from dma-buf.

Devmem is uncopyable, so piggyback off the existing MSG_ZEROCOPY
implementation, while disabling instances where MSG_ZEROCOPY falls back
to copying.

We additionally pipe the binding down to the new
zerocopy_fill_skb_from_devmem which fills a TX skb with net_iov netmems
instead of the traditional page netmems.

We also special case skb_frag_dma_map to return the dma-address of these
dmabuf net_iovs instead of attempting to map pages.

The TX path may release the dmabuf in a context where we cannot wait.
This happens when the user unbinds a TX dmabuf while there are still
references to its netmems in the TX path. In that case, the netmems will
be put_netmem'd from a context where we can't unmap the dmabuf, Resolve
this by making __net_devmem_dmabuf_binding_free schedule_work'd.

Based on work by Stanislav Fomichev <sdf@fomichev.me>. A lot of the meat
of the implementation came from devmem TCP RFC v1[1], which included the
TX path, but Stan did all the rebasing on top of netmem/net_iov.

Cc: Stanislav Fomichev <sdf@fomichev.me>
Signed-off-by: Kaiyuan Zhang <kaiyuanz@google.com>
Signed-off-by: Mina Almasry <almasrymina@google.com>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20250508004830.4100853-5-almasrymina@google.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-05-13 11:12:48 +02:00
Mina Almasry
e9f3d61db5 net: add get_netmem/put_netmem support
Currently net_iovs support only pp ref counts, and do not support a
page ref equivalent.

This is fine for the RX path as net_iovs are used exclusively with the
pp and only pp refcounting is needed there. The TX path however does not
use pp ref counts, thus, support for get_page/put_page equivalent is
needed for netmem.

Support get_netmem/put_netmem. Check the type of the netmem before
passing it to page or net_iov specific code to obtain a page ref
equivalent.

For dmabuf net_iovs, we obtain a ref on the underlying binding. This
ensures the entire binding doesn't disappear until all the net_iovs have
been put_netmem'ed. We do not need to track the refcount of individual
dmabuf net_iovs as we don't allocate/free them from a pool similar to
what the buddy allocator does for pages.

This code is written to be extensible by other net_iov implementers.
get_netmem/put_netmem will check the type of the netmem and route it to
the correct helper:

pages -> [get|put]_page()
dmabuf net_iovs -> net_devmem_[get|put]_net_iov()
new net_iovs ->	new helpers

Signed-off-by: Mina Almasry <almasrymina@google.com>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20250508004830.4100853-3-almasrymina@google.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-05-13 11:12:48 +02:00
Mina Almasry
03e96b8c11 netmem: add niov->type attribute to distinguish different net_iov types
Later patches in the series adds TX net_iovs where there is no pp
associated, so we can't rely on niov->pp->mp_ops to tell what is the
type of the net_iov.

Add a type enum to the net_iov which tells us the net_iov type.

Signed-off-by: Mina Almasry <almasrymina@google.com>
Link: https://patch.msgid.link/20250508004830.4100853-2-almasrymina@google.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-05-13 11:12:48 +02:00
Jakub Kicinski
a96876057b netlink: fix policy dump for int with validation callback
Recent devlink change added validation of an integer value
via NLA_POLICY_VALIDATE_FN, for sparse enums. Handle this
in policy dump. We can't extract any info out of the callback,
so report only the type.

Fixes: 429ac62114 ("devlink: define enum for attr types of dynamic attributes")
Reported-by: syzbot+01eb26848144516e7f0a@syzkaller.appspotmail.com
Link: https://patch.msgid.link/20250509212751.1905149-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-12 18:50:09 -07:00
Shiraz Saleem
505cc26bca net: mana: Add support for auxiliary device servicing events
Handle soc servicing events which require the rdma auxiliary device resources to
be cleaned up during a suspend, and re-initialized during a resume.

Signed-off-by: Shiraz Saleem <shirazsaleem@microsoft.com>
Signed-off-by: Konstantin Taranov <kotaranov@microsoft.com>
Link: https://patch.msgid.link/1746633545-17653-5-git-send-email-kotaranov@linux.microsoft.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-05-12 07:31:48 -04:00
Konstantin Taranov
ced82fce77 net: mana: Probe rdma device in mana driver
Initialize gdma device for rdma inside mana module.
For each gdma device, initialize an auxiliary ib device.

Signed-off-by: Konstantin Taranov <kotaranov@microsoft.com>
Link: https://patch.msgid.link/1746633545-17653-2-git-send-email-kotaranov@linux.microsoft.com
Reviewed-by: Long Li <longli@microsoft.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-05-12 06:44:52 -04:00
Vladimir Oltean
6c14058edf net: dsa: convert to ndo_hwtstamp_get() and ndo_hwtstamp_set()
New timestamping API was introduced in commit 66f7223039 ("net: add
NDOs for configuring hardware timestamping") from kernel v6.6. It is
time to convert DSA to the new API, so that the ndo_eth_ioctl() path can
be removed completely.

Move the ds->ops->port_hwtstamp_get() and ds->ops->port_hwtstamp_set()
calls from dsa_user_ioctl() to dsa_user_hwtstamp_get() and
dsa_user_hwtstamp_set().

Due to the fact that the underlying ifreq type changes to
kernel_hwtstamp_config, the drivers and the Ocelot switchdev front-end,
all hooked up directly or indirectly, must also be converted all at once.

The conversion also updates the comment from dsa_port_supports_hwtstamp(),
which is no longer true because kernel_hwtstamp_config is kernel memory
and does not need copy_to_user(). I've deliberated whether it is
necessary to also update "err != -EOPNOTSUPP" to a more general "!err",
but all drivers now either return 0 or -EOPNOTSUPP.

The existing logic from the ocelot_ioctl() function, to avoid
configuring timestamping if the PHY supports the operation, is obsoleted
by more advanced core logic in dev_set_hwtstamp_phylib().

This is only a partial preparation for proper PHY timestamping support.
None of these switch driver currently sets up PTP traps for PHY
timestamping, so setting dev->see_all_hwtstamp_requests is not yet
necessary and the conversion is relatively trivial.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Tested-by: Vladimir Oltean <vladimir.oltean@nxp.com> # felix, sja1105, mv88e6xxx
Reviewed-by: Vadim Fedorenko <vadim.fedorenko@linux.dev>
Link: https://patch.msgid.link/20250508095236.887789-1-vladimir.oltean@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-09 16:34:09 -07:00
Cong Wang
2d3cbfd6d5 net_sched: Flush gso_skb list too during ->change()
Previously, when reducing a qdisc's limit via the ->change() operation, only
the main skb queue was trimmed, potentially leaving packets in the gso_skb
list. This could result in NULL pointer dereference when we only check
sch->limit against sch->q.qlen.

This patch introduces a new helper, qdisc_dequeue_internal(), which ensures
both the gso_skb list and the main queue are properly flushed when trimming
excess packets. All relevant qdiscs (codel, fq, fq_codel, fq_pie, hhf, pie)
are updated to use this helper in their ->change() routines.

Fixes: 76e3cc126b ("codel: Controlled Delay AQM")
Fixes: 4b549a2ef4 ("fq_codel: Fair Queue Codel AQM")
Fixes: afe4fd0624 ("pkt_sched: fq: Fair Queue packet scheduler")
Fixes: ec97ecf1eb ("net: sched: add Flow Queue PIE packet scheduler")
Fixes: 10239edf86 ("net-qdisc-hhf: Heavy-Hitter Filter (HHF) qdisc")
Fixes: d4b36210c2 ("net: pkt_sched: PIE AQM scheme")
Reported-by: Will <willsroot@protonmail.com>
Reported-by: Savy <savy@syst3mfailure.io>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2025-05-09 12:34:38 +01:00
Jakub Kicinski
ea9a83d7f3 bluetooth pull request for net:
- MGMT: Fix MGMT_OP_ADD_DEVICE invalid device flags
  - hci_event: Fix not using key encryption size when its known
 -----BEGIN PGP SIGNATURE-----
 
 iQJNBAABCgA3FiEE7E6oRXp8w05ovYr/9JCA4xAyCykFAmgcyKYZHGx1aXoudm9u
 LmRlbnR6QGludGVsLmNvbQAKCRD0kIDjEDILKWSqEACGdIL6oDrfUmE2mFUxlT2C
 eUSe1RDvXi63pFCxv4a1/JrnAICfIPbuKhvkOhf9g3JoQXjqdtX5REjEPcCndyBg
 sSyGGfxcaoYLsSWQY1nSb2p/bttk1R3LfPT78QhKPDsh63FkvRw77P++8anzG3T1
 wqgjKU9XZb7zYmMElYCqYbnd0K2PymJtD+Oml1srBRdOpz1ex3s6Qj4Cy6cg07Vb
 SlxqWMnWZG0wfnusAFZ48zwYu/7/LQQnSJ6rbHSfrKLQnizNVvFTtsW+xrIfQ2m6
 G7/WOX2LVwtP3XMe8VIDV0kdDtkDFtHq0KuNToAtt4DS582RHw0AVe+Xc2x5FFKA
 rmcukZLvg6tv/DM/PM5zJdvZW4M+r1IOnBSZvFMAdYb4Af04DHjI1k1ow9On6O3f
 oJCeRZ4LoOREljR+xdO/Ewn207za0wGR7IlDXFVpeGEOnLcSAxvqUl6biL3cEQpA
 bb97I7fjqwSPqpAGsMV0uhULTJxT7QPhF05rL3bpSzAzZoDpRFhc070TLBBuQvW1
 zQmf+SCeg72vDtv1vsekyoXWfM+SpwlsNYuStPBztrKZgBGeRKjtZc2/L6dLPKbo
 bATO6Dhidm+outsF59YaqORFteON4yLAXSryLw6uVNIT3ApyUspNICyn4zi0tbYY
 8N4rjTclIk5zV1z8zCJ7Lg==
 =uyVh
 -----END PGP SIGNATURE-----

Merge tag 'for-net-2025-05-08' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth

Luiz Augusto von Dentz says:

====================
bluetooth pull request for net:

 - MGMT: Fix MGMT_OP_ADD_DEVICE invalid device flags
 - hci_event: Fix not using key encryption size when its known

* tag 'for-net-2025-05-08' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth:
  Bluetooth: hci_event: Fix not using key encryption size when its known
  Bluetooth: MGMT: Fix MGMT_OP_ADD_DEVICE invalid device flags
====================

Link: https://patch.msgid.link/20250508150927.385675-1-luiz.dentz@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-08 18:38:00 -07:00
Mohan Kumar G
22c64f37e1 wifi: mac80211: Update MCS15 support in link_conf
As per IEEE 802.11be-2024 - 9.4.2.321, EHT operation element
contains MCS15 Disable subfield as the sixth bit, which is set when
MCS15 support is not enabled.

Get MCS15 support from EHT operation params and add it in link_conf
so that driver can use this value to know if EHT-MCS 15 reception
is enabled.

Co-developed-by: Dhanavandhana Kannan <quic_dhanavan1@quicinc.com>
Signed-off-by: Dhanavandhana Kannan <quic_dhanavan1@quicinc.com>
Signed-off-by: Mohan Kumar G <quic_mkumarg@quicinc.com>
Link: https://patch.msgid.link/20250505152836.3266829-1-quic_mkumarg@quicinc.com
[remove pointless !! for bool assignment]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-05-09 00:05:11 +02:00
Jakub Kicinski
6b02fd7799 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR (net-6.15-rc6).

No conflicts.

Adjacent changes:

net/core/dev.c:
  08e9f2d584 ("net: Lock netdevices during dev_shutdown")
  a82dc19db1 ("net: avoid potential race between netdev_get_by_index_lock() and netns switch")

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-08 08:59:02 -07:00
Luiz Augusto von Dentz
c82b6357a5 Bluetooth: hci_event: Fix not using key encryption size when its known
This fixes the regression introduced by 50c1241e6a8a ("Bluetooth: l2cap:
Check encryption key size on incoming connection") introduced a check for
l2cap_check_enc_key_size which checks for hcon->enc_key_size which may
not be initialized if HCI_OP_READ_ENC_KEY_SIZE is still pending.

If the key encryption size is known, due previously reading it using
HCI_OP_READ_ENC_KEY_SIZE, then store it as part of link_key/smp_ltk
structures so the next time the encryption is changed their values are
used as conn->enc_key_size thus avoiding the racing against
HCI_OP_READ_ENC_KEY_SIZE.

Now that the enc_size is stored as part of key the information the code
then attempts to check that there is no downgrade of security if
HCI_OP_READ_ENC_KEY_SIZE returns a value smaller than what has been
previously stored.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=220061
Link: https://bugzilla.kernel.org/show_bug.cgi?id=220063
Fixes: 522e9ed157 ("Bluetooth: l2cap: Check encryption key size on incoming connection")
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2025-05-08 10:24:15 -04:00
Jakub Kicinski
23fa6a23d9 net: export a helper for adding up queue stats
Older drivers and drivers with lower queue counts often have a static
array of queues, rather than allocating structs for each queue on demand.
Add a helper for adding up qstats from a queue range. Expectation is
that driver will pass a queue range [netdev->real_num_*x_queues, MAX).
It was tempting to always use num_*x_queues as the end, but virtio
seems to clamp its queue count after allocating the netdev. And this
way we can trivaly reuse the helper for [0, real_..).

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Link: https://patch.msgid.link/20250507003221.823267-2-kuba@kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-05-08 11:56:12 +02:00
Jakub Kicinski
9daaf19786 wireless features, notably
* stack
    - free SKBTX_WIFI_STATUS flag
    - fixes for VLAN multicast in multi-link
    - improve codel parameters (revert some old twiddling)
  * ath12k
    - Enable AHB support for IPQ5332.
    - Add monitor interface support to QCN9274.
    - Add MLO support to WCN7850.
    - Add 802.11d scan offload support to WCN7850.
  * ath11k
    - Restore hibernation support
  * iwlwifi
    - EMLSR on two 5 GHz links
  * mwifiex
    - cleanups/refactoring
 
 along with many other small features/cleanups
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEpeA8sTs3M8SN2hR410qiO8sPaAAFAmgaSmcACgkQ10qiO8sP
 aAA4hA//f/nfeLTAnhON53mDlqxa55/2bw9XSH7pOIOasVBWxmuYxhWfn5uiZluI
 zOGlBO7vtJYvPrVHEHSuPWMNCQ+ieL2ShRSP5BBfy3KBYdD4gKKAd95WoiXmRVp4
 d13OYtF9msFbVXOZYMyxMHmzIrWlBQIokjOSjqjSBeYRnD8U0GRemiSecugWo/qI
 bE7xcZuTgKBy+gr7242017DcUjWBdWcsp1C6C+COZm/KrSihQ0SQ4PIcOZgPsZjl
 COGextZltbWW56qnlp6QC394V+Vhah+Owcz3Qqz9zZ7hzJnuPo+DnpPMShhRGruL
 /IgqKhzcuye5UUJJl8nD768x6ebClchkcBC+A/hfjk5UYVl/oZxA1Bw5fC2O+5VU
 ycDMHr1Qu/yEE2rbwIJPGNOZ7NisqJFF07CPFuygKjBGNnp7I5S7H6UJsKRi5MuZ
 0CXHiFMhuHgWLmDFauIa66XI1JpqIzQgbZSjqVYFKRqwEz3yRKwihwDzZy1m6pDQ
 NhMFedQznFkMSJ+m020IOxxXYPy98PHcus+TZWL4os0SJTfEycEUWJZnlJ47Bb25
 Z9IF1OER2I4giMKxUVKRoDq0SGStbZODwqxrVCRD261Um9ybbfDdRbDeDOmvY+Gu
 jEmE6tscWbRfDvCS2M+xIg+MAcVHJq5gO4S8RopMSJlFHY/T/uE=
 =s9mS
 -----END PGP SIGNATURE-----

Merge tag 'wireless-next-2025-05-06' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next

Johannes Berg says:

====================
wireless features, notably

 * stack
   - free SKBTX_WIFI_STATUS flag
   - fixes for VLAN multicast in multi-link
   - improve codel parameters (revert some old twiddling)
 * ath12k
   - Enable AHB support for IPQ5332.
   - Add monitor interface support to QCN9274.
   - Add MLO support to WCN7850.
   - Add 802.11d scan offload support to WCN7850.
 * ath11k
   - Restore hibernation support
 * iwlwifi
   - EMLSR on two 5 GHz links
 * mwifiex
   - cleanups/refactoring

along with many other small features/cleanups

* tag 'wireless-next-2025-05-06' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next: (177 commits)
  Revert "wifi: iwlwifi: clean up config macro"
  wifi: iwlwifi: move phy_filters to fw_runtime
  wifi: iwlwifi: pcie: make sure to lock rxq->read
  wifi: iwlwifi: add definitions for iwl_mac_power_cmd version 2
  wifi: iwlwifi: clean up config macro
  wifi: iwlwifi: mld: simplify iwl_mld_rx_fill_status()
  wifi: iwlwifi: mld: rx: simplify channel handling
  wifi: iwlwifi: clean up band in RX metadata
  wifi: iwlwifi: mld: skip unknown FW channel load values
  wifi: iwlwifi: define API for external FSEQ images
  wifi: iwlwifi: mld: allow EMLSR on separated 5 GHz subbands
  wifi: iwlwifi: mld: use cfg80211_chandef_get_width()
  wifi: iwlwifi: mld: fix iwl_mld_emlsr_disallowed_with_link() return
  wifi: iwlwifi: mld: clarify variable type
  wifi: iwlwifi: pcie: add support for the reset handshake in MSI
  wifi: mac80211_hwsim: Prevent tsf from setting if beacon is disabled
  wifi: mac80211: restructure tx profile retrieval for MLO MBSSID
  wifi: nl80211: add link id of transmitted profile for MLO MBSSID
  wifi: ieee80211: Add helpers to fetch EMLSR delay and timeout values
  wifi: mac80211: update ML STA with EML capabilities
  ...

====================

Link: https://patch.msgid.link/20250506174656.119970-3-johannes@sipsolutions.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-06 19:04:42 -07:00
Jiri Pirko
f9e78932ea devlink: avoid param type value translations
Assign DEVLINK_PARAM_TYPE_* enum values to DEVLINK_VAR_ATTR_TYPE_* to
ensure the same values are used internally and in UAPI. Benefit from
that by removing the value translations.

Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Link: https://patch.msgid.link/20250505114513.53370-4-jiri@resnulli.us
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-06 18:21:11 -07:00
Dr. David Alan Gilbert
ac8f09b921 sctp: Remove unused sctp_assoc_del_peer and sctp_chunk_iif
sctp_assoc_del_peer() last use was removed in 2015 by
commit 73e6742027 ("sctp: Do not try to search for the transport twice")
which now uses rm_peer instead of del_peer.

sctp_chunk_iif() last use was removed in 2016 by
commit 1f45f78f8e ("sctp: allow GSO frags to access the chunk too")

Remove them.

Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org>
Acked-by: Xin Long <lucien.xin@gmail.com>
Link: https://patch.msgid.link/20250501233815.99832-1-linux@treblig.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-05 16:51:12 -07:00
Dr. David Alan Gilbert
320a66f840 strparser: Remove unused __strp_unpause
The last use of __strp_unpause() was removed in 2022 by
commit 84c61fe1a7 ("tls: rx: do not use the standard strparser")

Remove it.

Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250501002402.308843-1-linux@treblig.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-05 16:48:12 -07:00
Jakub Kicinski
337079d31f Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR (net-6.15-rc5).

No conflicts or adjacent changes.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-01 15:11:38 -07:00
Willem de Bruijn
65e9024643 ip: load balance tcp connections to single dst addr and port
Load balance new TCP connections across nexthops also when they
connect to the same service at a single remote address and port.

This affects only port-based multipath hashing:
fib_multipath_hash_policy 1 or 3.

Local connections must choose both a source address and port when
connecting to a remote service, in ip_route_connect. This
"chicken-and-egg problem" (commit 2d7192d6cb ("ipv4: Sanitize and
simplify ip_route_{connect,newports}()")) is resolved by first
selecting a source address, by looking up a route using the zero
wildcard source port and address.

As a result multiple connections to the same destination address and
port have no entropy in fib_multipath_hash.

This is not a problem when forwarding, as skb-based hashing has a
4-tuple. Nor when establishing UDP connections, as autobind there
selects a port before reaching ip_route_connect.

Load balance also TCP, by using a random port in fib_multipath_hash.
Port assignment in inet_hash_connect is not atomic with
ip_route_connect. Thus ports are unpredictable, effectively random.

Implementation details:

Do not actually pass a random fl4_sport, as that affects not only
hashing, but routing more broadly, and can match a source port based
policy route, which existing wildcard port 0 will not. Instead,
define a new wildcard flowi flag that is used only for hashing.

Selecting a random source is equivalent to just selecting a random
hash entirely. But for code clarity, follow the normal 4-tuple hash
process and only update this field.

fib_multipath_hash can be reached with zero sport from other code
paths, so explicitly pass this flowi flag, rather than trying to infer
this case in the function itself.

Signed-off-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Link: https://patch.msgid.link/20250424143549.669426-3-willemdebruijn.kernel@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-04-29 16:22:25 +02:00
Willem de Bruijn
32607a332c ipv4: prefer multipath nexthop that matches source address
With multipath routes, try to ensure that packets leave on the device
that is associated with the source address.

Avoid the following tcpdump example:

    veth0 Out IP 10.1.0.2.38640 > 10.2.0.3.8000: Flags [S]
    veth1 Out IP 10.1.0.2.38648 > 10.2.0.3.8000: Flags [S]

Which can happen easily with the most straightforward setup:

    ip addr add 10.0.0.1/24 dev veth0
    ip addr add 10.1.0.1/24 dev veth1

    ip route add 10.2.0.3 nexthop via 10.0.0.2 dev veth0 \
    			  nexthop via 10.1.0.2 dev veth1

This is apparently considered WAI, based on the comment in
ip_route_output_key_hash_rcu:

    * 2. Moreover, we are allowed to send packets with saddr
    *    of another iface. --ANK

It may be ok for some uses of multipath, but not all. For instance,
when using two ISPs, a router may drop packets with unknown source.

The behavior occurs because tcp_v4_connect makes three route
lookups when establishing a connection:

1. ip_route_connect calls to select a source address, with saddr zero.
2. ip_route_connect calls again now that saddr and daddr are known.
3. ip_route_newports calls again after a source port is also chosen.

With a route with multiple nexthops, each lookup may make a different
choice depending on available entropy to fib_select_multipath. So it
is possible for 1 to select the saddr from the first entry, but 3 to
select the second entry. Leading to the above situation.

Address this by preferring a match that matches the flowi4 saddr. This
will make 2 and 3 make the same choice as 1. Continue to update the
backup choice until a choice that matches saddr is found.

Do this in fib_select_multipath itself, rather than passing an fl4_oif
constraint, to avoid changing non-multipath route selection. Commit
e6b45241c5 ("ipv4: reset flowi parameters on route connect") shows
how that may cause regressions.

Also read ipv4.sysctl_fib_multipath_use_neigh only once. No need to
refresh in the loop.

This does not happen in IPv6, which performs only one lookup.

Signed-off-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Link: https://patch.msgid.link/20250424143549.669426-2-willemdebruijn.kernel@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-04-29 16:22:25 +02:00
Jesper Dangaard Brouer
34dd0fecaa net: sched: generalize check for no-queue qdisc on TX queue
The "noqueue" qdisc can either be directly attached, or get default
attached if net_device priv_flags has IFF_NO_QUEUE. In both cases, the
allocated Qdisc structure gets it's enqueue function pointer reset to
NULL by noqueue_init() via noqueue_qdisc_ops.

This is a common case for software virtual net_devices. For these devices
with no-queue, the transmission path in __dev_queue_xmit() will bypass
the qdisc layer. Directly invoking device drivers ndo_start_xmit (via
dev_hard_start_xmit).  In this mode the device driver is not allowed to
ask for packets to be queued (either via returning NETDEV_TX_BUSY or
stopping the TXQ).

The simplest and most reliable way to identify this no-queue case is by
checking if enqueue == NULL.

The vrf driver currently open-codes this check (!qdisc->enqueue). While
functionally correct, this low-level detail is better encapsulated in a
dedicated helper for clarity and long-term maintainability.

To make this behavior more explicit and reusable, this patch introduce a
new helper: qdisc_txq_has_no_queue(). Helper will also be used by the
veth driver in the next patch, which introduces optional qdisc-based
backpressure.

This is a non-functional change.

Reviewed-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>
Link: https://patch.msgid.link/174559293172.827981.7583862632045264175.stgit@firesoul
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-28 14:06:58 -07:00
Luiz Augusto von Dentz
024421cf39 Bluetooth: hci_conn: Fix not setting timeout for BIG Create Sync
BIG Create Sync requires the command to just generates a status so this
makes use of __hci_cmd_sync_status_sk to wait for
HCI_EVT_LE_BIG_SYNC_ESTABLISHED, also because of this chance it is not
longer necessary to use a custom method to serialize the process of
creating the BIG sync since the cmd_work_sync itself ensures only one
command would be pending which now awaits for
HCI_EVT_LE_BIG_SYNC_ESTABLISHED before proceeding to next connection.

Fixes: 42ecf19471 ("Bluetooth: ISO: Do not emit LE BIG Create Sync if previous is pending")
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2025-04-25 15:03:19 -04:00
Luiz Augusto von Dentz
6d0417e4e1 Bluetooth: hci_conn: Fix not setting conn_timeout for Broadcast Receiver
Broadcast Receiver requires creating PA sync but the command just
generates a status so this makes use of __hci_cmd_sync_status_sk to wait
for HCI_EV_LE_PA_SYNC_ESTABLISHED, also because of this chance it is not
longer necessary to use a custom method to serialize the process of
creating the PA sync since the cmd_work_sync itself ensures only one
command would be pending which now awaits for
HCI_EV_LE_PA_SYNC_ESTABLISHED before proceeding to next connection.

Fixes: 4a5e0ba686 ("Bluetooth: ISO: Do not emit LE PA Create Sync if previous is pending")
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2025-04-25 15:03:19 -04:00
e.kubanski
bf20af0790 xsk: Fix offset calculation in unaligned mode
Bring back previous offset calculation behaviour
in AF_XDP unaligned umem mode.

In unaligned mode, upper 16 bits should contain
data offset, lower 48 bits should contain
only specific chunk location without offset.

Remove pool->headroom duplication into 48bit address.

Signed-off-by: Eryk Kubanski <e.kubanski@partner.samsung.com>
Fixes: bea14124ba ("xsk: Get rid of xdp_buff_xsk::orig_addr")
Acked-by: Magnus Karlsson <magnus.karlsson@intel.com>
Link: https://patch.msgid.link/20250416112925.7501-1-e.kubanski@partner.samsung.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-24 17:11:52 -07:00
e.kubanski
a1356ac774 xsk: Fix race condition in AF_XDP generic RX path
Move rx_lock from xsk_socket to xsk_buff_pool.
Fix synchronization for shared umem mode in
generic RX path where multiple sockets share
single xsk_buff_pool.

RX queue is exclusive to xsk_socket, while FILL
queue can be shared between multiple sockets.
This could result in race condition where two
CPU cores access RX path of two different sockets
sharing the same umem.

Protect both queues by acquiring spinlock in shared
xsk_buff_pool.

Lock contention may be minimized in the future by some
per-thread FQ buffering.

It's safe and necessary to move spin_lock_bh(rx_lock)
after xsk_rcv_check():
* xs->pool and spinlock_init is synchronized by
  xsk_bind() -> xsk_is_bound() memory barriers.
* xsk_rcv_check() may return true at the moment
  of xsk_release() or xsk_unbind_dev(),
  however this will not cause any data races or
  race conditions. xsk_unbind_dev() removes xdp
  socket from all maps and waits for completion
  of all outstanding rx operations. Packets in
  RX path will either complete safely or drop.

Signed-off-by: Eryk Kubanski <e.kubanski@partner.samsung.com>
Fixes: bf0bdd1343 ("xdp: fix race on generic receive path")
Acked-by: Magnus Karlsson <magnus.karlsson@intel.com>
Link: https://patch.msgid.link/20250416101908.10919-1-e.kubanski@partner.samsung.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-24 17:11:33 -07:00
Dr. David Alan Gilbert
39144062ea rxrpc: Remove deadcode
Remove three functions that are no longer used.

rxrpc_get_txbuf() last use was removed by 2020's
commit 5e6ef4f101 ("rxrpc: Make the I/O thread take over the call and
local processor work")

rxrpc_kernel_get_epoch() last use was removed by 2020's
commit 44746355cc ("afs: Don't get epoch from a server because it may be
ambiguous")

rxrpc_kernel_set_max_life() last use was removed by 2023's
commit db099c625b ("rxrpc: Fix timeout of a call that hasn't yet been
granted a channel")

Both of the rxrpc_kernel_* functions were documented.  Remove that
documentation as well as the code.

Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org>
Acked-by: David Howells <dhowells@redhat.com>
Link: https://patch.msgid.link/20250422235147.146460-1-linux@treblig.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-24 17:03:45 -07:00
Kuniyuki Iwashima
081efd1832 ipv6: Protect nh->f6i_list with spinlock and flag.
We will get rid of RTNL from RTM_NEWROUTE and SIOCADDRT.

Then, we may be going to add a route tied to a dying nexthop.

The nexthop itself is not freed during the RCU grace period, but
if we link a route after __remove_nexthop_fib() is called for the
nexthop, the route will be leaked.

To avoid the race between IPv6 route addition under RCU vs nexthop
deletion under RTNL, let's add a dead flag and protect it and
nh->f6i_list with a spinlock.

__remove_nexthop_fib() acquires the nexthop's spinlock and sets false
to nh->dead, then calls ip6_del_rt() for the linked route one by one
without the spinlock because fib6_purge_rt() acquires it later.

While adding an IPv6 route, fib6_add() acquires the nexthop lock and
checks the dead flag just before inserting the route.

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250418000443.43734-15-kuniyu@amazon.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-04-24 09:29:56 +02:00
Kuniyuki Iwashima
accb46b56b ipv6: Defer fib6_purge_rt() in fib6_add_rt2node() to fib6_add().
The next patch adds per-nexthop spinlock which protects nh->f6i_list.

When rt->nh is not NULL, fib6_add_rt2node() will be called under the lock.
fib6_add_rt2node() could call fib6_purge_rt() for another route, which
could holds another nexthop lock.

Then, deadlock could happen between two nexthops.

Let's defer fib6_purge_rt() after fib6_add_rt2node().

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Link: https://patch.msgid.link/20250418000443.43734-14-kuniyu@amazon.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-04-24 09:29:56 +02:00
Kuniyuki Iwashima
834d97843e ipv6: Protect fib6_link_table() with spinlock.
We will get rid of RTNL from RTM_NEWROUTE and SIOCADDRT.

If the request specifies a new table ID, fib6_new_table() is
called to create a new routing table.

Two concurrent requests could specify the same table ID, so we
need a lock to protect net->ipv6.fib_table_hash[h].

Let's add a spinlock to protect the hash bucket linkage.

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Link: https://patch.msgid.link/20250418000443.43734-13-kuniyu@amazon.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-04-24 09:29:56 +02:00
Rameshkumar Sundaram
f600832794 wifi: mac80211: restructure tx profile retrieval for MLO MBSSID
For MBSSID, each vif (struct ieee80211_vif) stores another vif
pointer for the transmitting profile of MBSSID set. This won't
suffice for MLO as there may be multiple links, each of which can
be part of different MBSSID sets. Hence the information needs to
be stored per-link. Additionally, the transmitted profile itself
may be part of an MLD hence storing vif will not suffice either.
Fix MLO by storing an instance of struct ieee80211_bss_conf
for each link.

Modify following operations to reflect the above structure updates:
- channel switch completion
- BSS color change completion
- Removing nontransmitted links in ieee80211_stop_mbssid()
- drivers retrieving the transmitted link for beacon templates.

Signed-off-by: Rameshkumar Sundaram <rameshkumar.sundaram@oss.qualcomm.com>
Co-developed-by: Muna Sinada <muna.sinada@oss.qualcomm.com>
Signed-off-by: Muna Sinada <muna.sinada@oss.qualcomm.com>
Co-developed-by: Aloka Dixit <aloka.dixit@oss.qualcomm.com>
Signed-off-by: Aloka Dixit <aloka.dixit@oss.qualcomm.com>
Link: https://patch.msgid.link/20250408184501.3715887-3-aloka.dixit@oss.qualcomm.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-04-23 18:03:53 +02:00
Rameshkumar Sundaram
37523c3c47 wifi: nl80211: add link id of transmitted profile for MLO MBSSID
During non-transmitted (nontx) profile configuration, interface
index of the transmitted (tx) profile is used to retrieve the
wireless device (wdev) associated with it. With MLO, this 'wdev'
may be part of an MLD with more than one link, hence only
interface index is not sufficient anymore to retrieve the correct
tx profile. Add a new attribute to configure link id of tx profile.

Signed-off-by: Rameshkumar Sundaram <rameshkumar.sundaram@oss.qualcomm.com>
Co-developed-by: Muna Sinada <muna.sinada@oss.qualcomm.com>
Signed-off-by: Muna Sinada <muna.sinada@oss.qualcomm.com>
Co-developed-by: Aloka Dixit <aloka.dixit@oss.qualcomm.com>
Signed-off-by: Aloka Dixit <aloka.dixit@oss.qualcomm.com>
Link: https://patch.msgid.link/20250408184501.3715887-2-aloka.dixit@oss.qualcomm.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-04-23 18:03:30 +02:00
Ramasamy Kaliappan
14e0f59a88 wifi: mac80211: update ML STA with EML capabilities
When an AP and Non-AP MLD operates in EMLSR mode, EML capabilities
advertised during Association contains information such as EMLSR
transition delay, padding delay and transition timeout values.

Save the EML capabilities information that is received during station
addition and capabilities update in ieee80211_sta so that drivers can use
it for triggering EMLSR operation.

Signed-off-by: Ramasamy Kaliappan <quic_rkaliapp@quicinc.com>
Signed-off-by: Rameshkumar Sundaram <quic_ramess@quicinc.com>
Link: https://patch.msgid.link/20250327051320.3253783-3-quic_ramess@quicinc.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-04-23 17:03:05 +02:00
Ramasamy Kaliappan
53160d0edf wifi: cfg80211: Add support to get EMLSR capabilities of non-AP MLD
The Enhanced multi-link single-radio (EMLSR) operation allows a non-AP MLD
with multiple receive chains to listen on one or more EMLSR links when the
corresponding non-AP STA(s) affiliated with the non-AP MLD is (are) in
the awake state. [IEEE 802.11be-2024, (35.3.17 Enhanced multi-link
single-radio (EMLSR) operation)]

An MLD which intends to enable EMLSR operations will set the EML
Capabilities Present subfield to 1 and shall set the EMLSR Support
subfield in the Common Info field of the Basic Multi-Link element to 1 in
all Management frames that include the Basic Multi-Link element except
Authentication frames. EML capabilities contains information such as
EML Transition timeout, Padding delay and Transition delay. These fields
needs to updated to drivers to trigger EMLSR operation and to transmit and
receive initial control frame and data frames.

Add support to receive EML Capabilities subfield that non-AP MLD
advertises during (re)association request and send it to underlying
drivers during ADD/SET station.

Signed-off-by: Ramasamy Kaliappan <quic_rkaliapp@quicinc.com>
Signed-off-by: Rameshkumar Sundaram <quic_ramess@quicinc.com>
Link: https://patch.msgid.link/20250327051320.3253783-2-quic_ramess@quicinc.com
[accept EMLSR capabilities only for unassoc AP STA]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-04-23 17:02:44 +02:00
Toke Høiland-Jørgensen
4876376988 Revert "mac80211: Dynamically set CoDel parameters per station"
This reverts commit 484a54c2e5. The CoDel
parameter change essentially disables CoDel on slow stations, with some
questionable assumptions, as Dave pointed out in [0]. Quoting from
there:

  But here are my pithy comments as to why this part of mac80211 is so
  wrong...

   static void sta_update_codel_params(struct sta_info *sta, u32 thr)
   {
  -       if (thr && thr < STA_SLOW_THRESHOLD * sta->local->num_sta) {

  1) sta->local->num_sta is the number of associated, rather than
  active, stations. "Active" stations in the last 50ms or so, might have
  been a better thing to use, but as most people have far more than that
  associated, we end up with really lousy codel parameters, all the
  time. Mistake numero uno!

  2) The STA_SLOW_THRESHOLD was completely arbitrary in 2016.

  -               sta->cparams.target = MS2TIME(50);

  This, by itself, was probably not too bad. 30ms might have been
  better, at the time, when we were battling powersave etc, but 20ms was
  enough, really, to cover most scenarios, even where we had low rate
  2Ghz multicast to cope with. Even then, codel has a hard time finding
  any sane drop rate at all, with a target this high.

  -               sta->cparams.interval = MS2TIME(300);

  But this was horrible, a total mistake, that is leading to codel being
  completely ineffective in almost any scenario on clients or APS.
  100ms, even 80ms, here, would be vastly better than this insanity. I'm
  seeing 5+seconds of delay accumulated in a bunch of otherwise happily
  fq-ing APs....

  100ms of observed jitter during a flow is enough. Certainly (in 2016)
  there were interactions with powersave that I did not understand, and
  still don't, but if you are transmitting in the first place, powersave
  shouldn't be a problemmmm.....

  -               sta->cparams.ecn = false;

  At the time we were pretty nervous about ecn, I'm kind of sanguine
  about it now, and reliably indicating ecn seems better than turning it
  off for any reason.

  [...]

  In production, on p2p wireless, I've had 8ms and 80ms for target and
  interval for years now, and it works great.

I think Dave's arguments above are basically sound on the face of it,
and various experimentation with tighter CoDel parameters in the OpenWrt
community have show promising results[1]. So I don't think there's any
reason to keep this parameter fiddling; hence this revert.

[0] https://lore.kernel.org/linux-wireless/CAA93jw6NJ2cmLmMauz0xAgC2MGbBq6n0ZiZzAdkK0u4b+O2yXg@mail.gmail.com/
[1] https://forum.openwrt.org/t/reducing-multiplexing-latencies-still-further-in-wifi/133605/130

Suggested-By: Dave Taht <dave.taht@gmail.com>
In-memory-of: Dave Taht <dave.taht@gmail.com>
Signed-off-by: Toke Høiland-Jørgensen <toke@toke.dk>
Link: https://patch.msgid.link/20250403183930.197716-1-toke@toke.dk
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-04-23 15:22:11 +02:00
Johannes Berg
996c15bd30 wifi: cfg80211/mac80211: remove more 5/10 MHz code
We still have ieee80211_chandef_rate_flags() and all that,
but all the users seem pretty much broken (deflink, etc.)
Remove all the code. It's been two years since last anyone
even vaguely entertained the notion of looking at this and
fixing it.

Link: https://patch.msgid.link/20250329221419.c31da7ae8c84.I1a3a4b6008134d66ca75a5bdfc004f4594da8145@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-04-23 15:21:32 +02:00
Johannes Berg
76a853f86c wifi: free SKBTX_WIFI_STATUS skb tx_flags flag
Jason mentioned at netdevconf that we've run out of tx_flags in
the skb_shinfo(). Gain one bit back by removing the wifi bit.

We can do that because the only userspace application for it
(hostapd) doesn't change the setting on the socket, it just
uses different sockets, and normally doesn't even use this any
more, sending the frames over nl80211 instead.

Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
Link: https://patch.msgid.link/20250313134942.52ff54a140ec.If390bbdc46904cf451256ba989d7a056c457af6e@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-04-23 15:09:25 +02:00
Joshua Washington
0e0a7e3719 xdp: create locked/unlocked instances of xdp redirect target setters
Commit 03df156dd3 ("xdp: double protect netdev->xdp_flags with
netdev->lock") introduces the netdev lock to xdp_set_features_flag().
The change includes a _locked version of the method, as it is possible
for a driver to have already acquired the netdev lock before calling
this helper. However, the same applies to
xdp_features_(set|clear)_redirect_flags(), which ends up calling the
unlocked version of xdp_set_features_flags() leading to deadlocks in
GVE, which grabs the netdev lock as part of its suspend, reset, and
shutdown processes:

[  833.265543] WARNING: possible recursive locking detected
[  833.270949] 6.15.0-rc1 #6 Tainted: G            E
[  833.276271] --------------------------------------------
[  833.281681] systemd-shutdow/1 is trying to acquire lock:
[  833.287090] ffff949d2b148c68 (&dev->lock){+.+.}-{4:4}, at: xdp_set_features_flag+0x29/0x90
[  833.295470]
[  833.295470] but task is already holding lock:
[  833.301400] ffff949d2b148c68 (&dev->lock){+.+.}-{4:4}, at: gve_shutdown+0x44/0x90 [gve]
[  833.309508]
[  833.309508] other info that might help us debug this:
[  833.316130]  Possible unsafe locking scenario:
[  833.316130]
[  833.322142]        CPU0
[  833.324681]        ----
[  833.327220]   lock(&dev->lock);
[  833.330455]   lock(&dev->lock);
[  833.333689]
[  833.333689]  *** DEADLOCK ***
[  833.333689]
[  833.339701]  May be due to missing lock nesting notation
[  833.339701]
[  833.346582] 5 locks held by systemd-shutdow/1:
[  833.351205]  #0: ffffffffa9c89130 (system_transition_mutex){+.+.}-{4:4}, at: __se_sys_reboot+0xe6/0x210
[  833.360695]  #1: ffff93b399e5c1b8 (&dev->mutex){....}-{4:4}, at: device_shutdown+0xb4/0x1f0
[  833.369144]  #2: ffff949d19a471b8 (&dev->mutex){....}-{4:4}, at: device_shutdown+0xc2/0x1f0
[  833.377603]  #3: ffffffffa9eca050 (rtnl_mutex){+.+.}-{4:4}, at: gve_shutdown+0x33/0x90 [gve]
[  833.386138]  #4: ffff949d2b148c68 (&dev->lock){+.+.}-{4:4}, at: gve_shutdown+0x44/0x90 [gve]

Introduce xdp_features_(set|clear)_redirect_target_locked() versions
which assume that the netdev lock has already been acquired before
setting the XDP feature flag and update GVE to use the locked version.

Fixes: 03df156dd3 ("xdp: double protect netdev->xdp_flags with netdev->lock")
Tested-by: Mina Almasry <almasrymina@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Harshitha Ramamurthy <hramamurthy@google.com>
Signed-off-by: Joshua Washington <joshwash@google.com>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Acked-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://patch.msgid.link/20250422011643.3509287-1-joshwash@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-22 19:57:56 -07:00
Dr. David Alan Gilbert
45bd443bfd net: 802: Remove unused p8022 code
p8022.c defines two external functions, register_8022_client()
and unregister_8022_client(), the last use of which was removed in
2018 by
commit 7a2e838d28 ("staging: ipx: delete it from the tree")

Remove the p8022.c file, it's corresponding header, and glue
surrounding it.  There was one place the header was included in vlan.c
but it didn't use the functions it declared.

There was a comment in net/802/Makefile about checking
against net/core/Makefile, but that's at least 20 years old and
there's no sign of net/core/Makefile mentioning it.

Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org>
Link: https://patch.msgid.link/20250418011519.145320-1-linux@treblig.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-22 07:04:02 -07:00
Ido Schimmel
1f763fa808 vxlan: Convert FDB table to rhashtable
FDB entries are currently stored in a hash table with a fixed number of
buckets (256), resulting in performance degradation as the number of
entries grows. Solve this by converting the driver to use rhashtable
which maintains more or less constant performance regardless of the
number of entries.

Measured transmitted packets per second using a single pktgen thread
with varying number of entries when the transmitted packet always hits
the default entry (worst case):

Number of entries | Improvement
------------------|------------
1k                | +1.12%
4k                | +9.22%
16k               | +55%
64k               | +585%
256k              | +2460%

In addition, the change reduces the size of the VXLAN device structure
from 2584 bytes to 672 bytes.

Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Link: https://patch.msgid.link/20250415121143.345227-16-idosch@nvidia.com
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-04-22 11:11:16 +02:00
Ido Schimmel
8d45673d2d vxlan: Add a linked list of FDB entries
Currently, FDB entries are stored in a hash table with a fixed number of
buckets. The table is used for both lookups and entry traversal.
Subsequent patches will convert the table to rhashtable which is not
suitable for entry traversal.

In preparation for this conversion, add FDB entries to a linked list.
Subsequent patches will convert the driver to use this list when
traversing entries during dump, flush, etc.

Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Link: https://patch.msgid.link/20250415121143.345227-8-idosch@nvidia.com
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-04-22 11:11:15 +02:00
Ido Schimmel
094adad913 vxlan: Use a single lock to protect the FDB table
Currently, the VXLAN driver stores FDB entries in a hash table with a
fixed number of buckets (256). Subsequent patches are going to convert
this table to rhashtable with a linked list for entry traversal, as
rhashtable is more scalable.

In preparation for this conversion, move from a per-bucket spin lock to
a single spin lock that protects the entire FDB table.

The per-bucket spin locks were introduced by commit fe1e0713bb
("vxlan: Use FDB_HASH_SIZE hash_locks to reduce contention") citing
"huge contention when inserting/deleting vxlan_fdbs into the fdb_head".

It is not clear from the commit message which code path was holding the
spin lock for long periods of time, but the obvious suspect is the FDB
cleanup routine (vxlan_cleanup()) that periodically traverses the entire
table in order to delete aged-out entries.

This will be solved by subsequent patches that will convert the FDB
cleanup routine to traverse the linked list of FDB entries using RCU,
only acquiring the spin lock when deleting an aged-out entry.

The change reduces the size of the VXLAN device structure from 3600
bytes to 2576 bytes.

Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Link: https://patch.msgid.link/20250415121143.345227-7-idosch@nvidia.com
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-04-22 11:11:15 +02:00
Konstantin Taranov
f1652d76f4 RDMA/mana_ib: Add support of 4M, 1G, and 2G pages
Check PF capability flag whether the 4M, 1G, and 2G pages are
supported. Add these pages sizes to mana_ib, if supported.

Define possible page sizes in enum gdma_page_type and
remove unused enum atb_page_size.

Signed-off-by: Konstantin Taranov <kotaranov@microsoft.com>
Link: https://patch.msgid.link/1744621234-26114-4-git-send-email-kotaranov@linux.microsoft.com
Reviewed-by: Long Li <longli@microsoft.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-04-20 06:36:26 -04:00
Konstantin Taranov
8f49682d94 RDMA/mana_ib: support of the zero based MRs
Add IB_ZERO_BASED to the valid flags and use
the corresponding MR creation request for the zero
based memory.

Signed-off-by: Konstantin Taranov <kotaranov@microsoft.com>
Link: https://patch.msgid.link/1744621234-26114-3-git-send-email-kotaranov@linux.microsoft.com
Reviewed-by: Long Li <longli@microsoft.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-04-20 06:32:35 -04:00
Zijun Hu
2b905deb43 net: Delete the outer () duplicated of macro SOCK_SKB_CB_OFFSET definition
For macro SOCK_SKB_CB_OFFSET definition, Delete the outer () duplicated.

Signed-off-by: Zijun Hu <quic_zijuhu@quicinc.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250416-fix_net-v1-1-d544c9f3f169@quicinc.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-17 19:04:36 -07:00
Jakub Kicinski
22cbc1ee26 netdev: fix the locking for netdev notifications
Kuniyuki reports that the assert for netdev lock fires when
there are netdev event listeners (otherwise we skip the netlink
event generation).

Correct the locking when coming from the notifier.

The NETDEV_XDP_FEAT_CHANGE notifier is already fully locked,
it's the documentation that's incorrect.

Fixes: 99e44f39a8 ("netdev: depend on netdev->lock for xdp features")
Reported-by: syzkaller <syzkaller@googlegroups.com>
Reported-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://lore.kernel.org/20250410171019.62128-1-kuniyu@amazon.com
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20250416030447.1077551-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-17 18:55:14 -07:00
Jakub Kicinski
240ce924d2 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR (net-6.15-rc3).

No conflicts. Adjacent changes:

tools/net/ynl/pyynl/ynl_gen_c.py
  4d07bbf2d4 ("tools: ynl-gen: don't declare loop iterator in place")
  7e8ba0c7de ("tools: ynl: don't use genlmsghdr in classic netlink")

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-17 12:26:50 -07:00
Chiachang Wang
ab244a394c xfrm: Migrate offload configuration
Add hardware offload configuration to XFRM_MSG_MIGRATE
using an option netlink attribute XFRMA_OFFLOAD_DEV.

In the existing xfrm_state_migrate(), the xfrm_init_state()
is called assuming no hardware offload by default. Even the
original xfrm_state is configured with offload, the setting will
be reset. If the device is configured with hardware offload,
it's reasonable to allow the device to maintain its hardware
offload mode. But the device will end up with offload disabled
after receiving a migration event when the device migrates the
connection from one netdev to another one.

The devices that support migration may work with different
underlying networks, such as mobile devices. The hardware setting
should be forwarded to the different netdev based on the
migration configuration. This change provides the capability
for user space to migrate from one netdev to another.

Test: Tested with kernel test in the Android tree located
      in https://android.googlesource.com/kernel/tests/
      The xfrm_tunnel_test.py under the tests folder in
      particular.
Signed-off-by: Chiachang Wang <chiachangwang@google.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2025-04-17 11:00:03 +02:00
Taehee Yoo
cd1fafe7da eth: bnxt: add support rx side device memory TCP
Currently, bnxt_en driver satisfies the requirements of the Device
memory TCP, which is HDS.
So, it implements rx-side Device memory TCP for bnxt_en driver.
It requires only converting the page API to netmem API.
`struct page` of agg rings are changed to `netmem_ref netmem` and
corresponding functions are changed to a variant of netmem API.

It also passes PP_FLAG_ALLOW_UNREADABLE_NETMEM flag to a parameter of
page_pool.
The netmem will be activated only when a user requests devmem TCP.

When netmem is activated, received data is unreadable and netmem is
disabled, received data is readable.
But drivers don't need to handle both cases because netmem core API will
handle it properly.
So, using proper netmem API is enough for drivers.

Device memory TCP can be tested with
tools/testing/selftests/drivers/net/hw/ncdevmem.
This is tested with BCM57504-N425G and firmware version 232.0.155.8/pkg
232.1.132.8.

Reviewed-by: Mina Almasry <almasrymina@google.com>
Tested-by: David Wei <dw@davidwei.uk>
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Link: https://patch.msgid.link/20250415052458.1260575-1-ap420073@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-16 18:17:57 -07:00
Cosmin Ratiu
d2fddbd347 bonding: Fix multiple long standing offload races
Refactor the bonding ipsec offload operations to fix a number of
long-standing control plane races between state migration and user
deletion and a few other issues.

xfrm state deletion can happen concurrently with
bond_change_active_slave() operation. This manifests itself as a
bond_ipsec_del_sa() call with x->lock held, followed by a
bond_ipsec_free_sa() a bit later from a wq. The alternate path of
these calls coming from xfrm_dev_state_flush() can't happen, as that
needs the RTNL lock and bond_change_active_slave() already holds it.

1. bond_ipsec_del_sa_all() might call xdo_dev_state_delete() a second
   time on an xfrm state that was concurrently killed. This is bad.
2. bond_ipsec_add_sa_all() can add a state on the new device, but
   pending bond_ipsec_free_sa() calls from the old device will then hit
   the WARN_ON() and then, worse, call xdo_dev_state_free() on the new
   device without a corresponding xdo_dev_state_delete().
3. Resolve a sleeping in atomic context introduced by the mentioned
   "Fixes" commit.

bond_ipsec_del_sa_all() and bond_ipsec_add_sa_all() now acquire x->lock
and check for x->km.state to help with problems 1 and 2. And since
xso.real_dev is now a private pointer managed by the bonding driver in
xfrm state, make better use of it to fully fix problems 1 and 2. In
bond_ipsec_del_sa_all(), set xso.real_dev to NULL while holding both the
mutex and x->lock, which makes sure that neither bond_ipsec_del_sa() nor
bond_ipsec_free_sa() could run concurrently.

Fix problem 3 by moving the list cleanup (which requires the mutex) from
bond_ipsec_del_sa() (called from atomic context) to bond_ipsec_free_sa()

Finally, simplify bond_ipsec_del_sa() and bond_ipsec_free_sa() by using
xso->real_dev directly, since it's now protected by locks and can be
trusted to always reflect the offload device.

Fixes: 2aeeef906d ("bonding: change ipsec_lock from spin lock to mutex")
Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Reviewed-by: Hangbin Liu <liuhangbin@gmail.com>
Tested-by: Hangbin Liu <liuhangbin@gmail.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2025-04-16 11:02:49 +02:00
Cosmin Ratiu
43eca05b6a xfrm: Add explicit dev to .xdo_dev_state_{add,delete,free}
Previously, device driver IPSec offload implementations would fall into
two categories:
1. Those that used xso.dev to determine the offload device.
2. Those that used xso.real_dev to determine the offload device.

The first category didn't work with bonding while the second did.
In a non-bonding setup the two pointers are the same.

This commit adds explicit pointers for the offload netdevice to
.xdo_dev_state_add() / .xdo_dev_state_delete() / .xdo_dev_state_free()
which eliminates the confusion and allows drivers from the first
category to work with bonding.

xso.real_dev now becomes a private pointer managed by the bonding
driver.

Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2025-04-16 11:01:41 +02:00
Ido Schimmel
2d300ce0b7 net: fib_rules: Fix iif / oif matching on L3 master device
Before commit 40867d74c3 ("net: Add l3mdev index to flow struct and
avoid oif reset for port devices") it was possible to use FIB rules to
match on a L3 domain. This was done by having a FIB rule match on iif /
oif being a L3 master device. It worked because prior to the FIB rule
lookup the iif / oif fields in the flow structure were reset to the
index of the L3 master device to which the input / output device was
enslaved to.

The above scheme made it impossible to match on the original input /
output device. Therefore, cited commit stopped overwriting the iif / oif
fields in the flow structure and instead stored the index of the
enslaving L3 master device in a new field ('flowi_l3mdev') in the flow
structure.

While the change enabled new use cases, it broke the original use case
of matching on a L3 domain. Fix this by interpreting the iif / oif
matching on a L3 master device as a match against the L3 domain. In
other words, if the iif / oif in the FIB rule points to a L3 master
device, compare the provided index against 'flowi_l3mdev' rather than
'flowi_{i,o}if'.

Before cited commit, a FIB rule that matched on 'iif vrf1' would only
match incoming traffic from devices enslaved to 'vrf1'. With the
proposed change (i.e., comparing against 'flowi_l3mdev'), the rule would
also match traffic originating from a socket bound to 'vrf1'. Avoid that
by adding a new flow flag ('FLOWI_FLAG_L3MDEV_OIF') that indicates if
the L3 domain was derived from the output interface or the input
interface (when not set) and take this flag into account when evaluating
the FIB rule against the flow structure.

Avoid unnecessary checks in the data path by detecting that a rule
matches on a L3 master device when the rule is installed and marking it
as such.

Tested using the following script [1].

Output before 40867d74c3 (v5.4.291):

default dev dummy1 table 100 scope link
default dev dummy1 table 200 scope link

Output after 40867d74c374:

default dev dummy1 table 300 scope link
default dev dummy1 table 300 scope link

Output with this patch:

default dev dummy1 table 100 scope link
default dev dummy1 table 200 scope link

[1]
 #!/bin/bash

 ip link add name vrf1 up type vrf table 10
 ip link add name dummy1 up master vrf1 type dummy

 sysctl -wq net.ipv4.conf.all.forwarding=1
 sysctl -wq net.ipv4.conf.all.rp_filter=0

 ip route add table 100 default dev dummy1
 ip route add table 200 default dev dummy1
 ip route add table 300 default dev dummy1

 ip rule add prio 0 oif vrf1 table 100
 ip rule add prio 1 iif vrf1 table 200
 ip rule add prio 2 table 300

 ip route get 192.0.2.1 oif dummy1 fibmatch
 ip route get 192.0.2.1 iif dummy1 from 198.51.100.1 fibmatch

Fixes: 40867d74c3 ("net: Add l3mdev index to flow struct and avoid oif reset for port devices")
Reported-by: hanhuihui <hanhuihui5@huawei.com>
Closes: https://lore.kernel.org/netdev/ec671c4f821a4d63904d0da15d604b75@huawei.com/
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Acked-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/20250414172022.242991-2-idosch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-15 17:54:56 -07:00
Breno Leitao
95d06e92a4 netlink: Introduce nlmsg_payload helper
Create a new helper function, nlmsg_payload(), to simplify checking and
retrieving Netlink message payloads.

This reduces boilerplate code for users who need to verify the message
length before accessing its data.

Suggested-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Breno Leitao <leitao@debian.org>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250414-nlmsg-v2-1-3d90cb42c6af@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-15 08:28:53 -07:00
Matthieu Baerts (NGI0)
6e83166dd8 mptcp: sched: remove mptcp_sched_data
This is a follow-up of commit b68b106b0f ("mptcp: sched: reduce size
for unused data"), now removing the mptcp_sched_data structure.

Now is a good time to do that, because the previously mentioned WIP work
has been updated, no longer depending on this structure.

Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20250413-net-next-mptcp-sched-mib-sft-misc-v2-1-0f83a4350150@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-15 08:21:46 -07:00
David Howells
d03539d5c2 rxrpc: Display security params in the afs_cb_call tracepoint
Make the afs_cb_call tracepoint display some security parameters to make
debugging easier.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: Simon Horman <horms@kernel.org>
cc: linux-afs@lists.infradead.org
Link: https://patch.msgid.link/20250411095303.2316168-12-dhowells@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-14 17:36:42 -07:00
David Howells
5800b1cf3f rxrpc: Allow CHALLENGEs to the passed to the app for a RESPONSE
Allow the app to request that CHALLENGEs be passed to it through an
out-of-band queue that allows recvmsg() to pick it up so that the app can
add data to it with sendmsg().

This will allow the application (AFS or userspace) to interact with the
process if it wants to and put values into user-defined fields.  This will
be used by AFS when talking to a fileserver to supply that fileserver with
a crypto key by which callback RPCs can be encrypted (ie. notifications
from the fileserver to the client).

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: Simon Horman <horms@kernel.org>
cc: linux-afs@lists.infradead.org
Link: https://patch.msgid.link/20250411095303.2316168-5-dhowells@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-14 17:36:41 -07:00
David Howells
23738cc804 rxrpc: Pull out certain app callback funcs into an ops table
A number of functions separately furnish an AF_RXRPC socket with callback
function pointers into a kernel app (such as the AFS filesystem) that is
using it.  Replace most of these with an ops table for the entire socket.
This makes it easier to add more callback functions.

Note that the call incoming data processing callback is retaind as that
gets set to different things, depending on the type of op.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: Simon Horman <horms@kernel.org>
cc: linux-afs@lists.infradead.org
Link: https://patch.msgid.link/20250411095303.2316168-3-dhowells@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-14 17:36:41 -07:00
Kuniyuki Iwashima
c57a9c5035 net: Remove ->exit_batch_rtnl().
There are no ->exit_batch_rtnl() users remaining.

Let's remove the hook.

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Link: https://patch.msgid.link/20250411205258.63164-15-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-14 17:08:45 -07:00
Kuniyuki Iwashima
a967e01e2a ipv4: ip_tunnel: Convert ip_tunnel_delete_nets() callers to ->exit_rtnl().
ip_tunnel_delete_nets() iterates the dying netns list and performs the
same operations for each.

Let's export ip_tunnel_destroy() as ip_tunnel_delete_net() and call it
from ->exit_rtnl().

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Link: https://patch.msgid.link/20250411205258.63164-7-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-14 17:08:42 -07:00
Kuniyuki Iwashima
7a60d91c69 net: Add ->exit_rtnl() hook to struct pernet_operations.
struct pernet_operations provides two batching hooks; ->exit_batch()
and ->exit_batch_rtnl().

The batching variant is beneficial if ->exit() meets any of the
following conditions:

  1) ->exit() repeatedly acquires a global lock for each netns

  2) ->exit() has a time-consuming operation that can be factored
     out (e.g. synchronize_rcu(), smp_mb(), etc)

  3) ->exit() does not need to repeat the same iterations for each
     netns (e.g. inet_twsk_purge())

Currently, none of the ->exit_batch_rtnl() functions satisfy any of
the above conditions because RTNL is factored out and held by the
caller and all of these functions iterate over the dying netns list.

Also, we want to hold per-netns RTNL there but avoid spreading
__rtnl_net_lock() across multiple locations.

Let's add ->exit_rtnl() hook and run it under __rtnl_net_lock().

The following patches will convert all ->exit_batch_rtnl() users
to ->exit_rtnl().

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Link: https://patch.msgid.link/20250411205258.63164-4-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-14 17:08:41 -07:00
Toke Høiland-Jørgensen
ee62ce7a1d page_pool: Track DMA-mapped pages and unmap them when destroying the pool
When enabling DMA mapping in page_pool, pages are kept DMA mapped until
they are released from the pool, to avoid the overhead of re-mapping the
pages every time they are used. This causes resource leaks and/or
crashes when there are pages still outstanding while the device is torn
down, because page_pool will attempt an unmap through a non-existent DMA
device on the subsequent page return.

To fix this, implement a simple tracking of outstanding DMA-mapped pages
in page pool using an xarray. This was first suggested by Mina[0], and
turns out to be fairly straight forward: We simply store pointers to
pages directly in the xarray with xa_alloc() when they are first DMA
mapped, and remove them from the array on unmap. Then, when a page pool
is torn down, it can simply walk the xarray and unmap all pages still
present there before returning, which also allows us to get rid of the
get/put_device() calls in page_pool. Using xa_cmpxchg(), no additional
synchronisation is needed, as a page will only ever be unmapped once.

To avoid having to walk the entire xarray on unmap to find the page
reference, we stash the ID assigned by xa_alloc() into the page
structure itself, using the upper bits of the pp_magic field. This
requires a couple of defines to avoid conflicting with the
POINTER_POISON_DELTA define, but this is all evaluated at compile-time,
so does not affect run-time performance. The bitmap calculations in this
patch gives the following number of bits for different architectures:

- 23 bits on 32-bit architectures
- 21 bits on PPC64 (because of the definition of ILLEGAL_POINTER_VALUE)
- 32 bits on other 64-bit architectures

Stashing a value into the unused bits of pp_magic does have the effect
that it can make the value stored there lie outside the unmappable
range (as governed by the mmap_min_addr sysctl), for architectures that
don't define ILLEGAL_POINTER_VALUE. This means that if one of the
pointers that is aliased to the pp_magic field (such as page->lru.next)
is dereferenced while the page is owned by page_pool, that could lead to
a dereference into userspace, which is a security concern. The risk of
this is mitigated by the fact that (a) we always clear pp_magic before
releasing a page from page_pool, and (b) this would need a
use-after-free bug for struct page, which can have many other risks
since page->lru.next is used as a generic list pointer in multiple
places in the kernel. As such, with this patch we take the position that
this risk is negligible in practice. For more discussion, see[1].

Since all the tracking added in this patch is performed on DMA
map/unmap, no additional code is needed in the fast path, meaning the
performance overhead of this tracking is negligible there. A
micro-benchmark shows that the total overhead of the tracking itself is
about 400 ns (39 cycles(tsc) 395.218 ns; sum for both map and unmap[2]).
Since this cost is only paid on DMA map and unmap, it seems like an
acceptable cost to fix the late unmap issue. Further optimisation can
narrow the cases where this cost is paid (for instance by eliding the
tracking when DMA map/unmap is a no-op).

The extra memory needed to track the pages is neatly encapsulated inside
xarray, which uses the 'struct xa_node' structure to track items. This
structure is 576 bytes long, with slots for 64 items, meaning that a
full node occurs only 9 bytes of overhead per slot it tracks (in
practice, it probably won't be this efficient, but in any case it should
be an acceptable overhead).

[0] https://lore.kernel.org/all/CAHS8izPg7B5DwKfSuzz-iOop_YRbk3Sd6Y4rX7KBG9DcVJcyWg@mail.gmail.com/
[1] https://lore.kernel.org/r/20250320023202.GA25514@openwall.com
[2] https://lore.kernel.org/r/ae07144c-9295-4c9d-a400-153bb689fe9e@huawei.com

Reported-by: Yonglong Liu <liuyonglong@huawei.com>
Closes: https://lore.kernel.org/r/8743264a-9700-4227-a556-5f931c720211@huawei.com
Fixes: ff7d6b27f8 ("page_pool: refurbish version of page_pool code")
Suggested-by: Mina Almasry <almasrymina@google.com>
Reviewed-by: Mina Almasry <almasrymina@google.com>
Reviewed-by: Jesper Dangaard Brouer <hawk@kernel.org>
Tested-by: Jesper Dangaard Brouer <hawk@kernel.org>
Tested-by: Qiuling Ren <qren@redhat.com>
Tested-by: Yuying Ma <yuma@redhat.com>
Tested-by: Yonglong Liu <liuyonglong@huawei.com>
Acked-by: Jesper Dangaard Brouer <hawk@kernel.org>
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://patch.msgid.link/20250409-page-pool-track-dma-v9-2-6a9ef2e0cba8@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-14 16:30:29 -07:00
Paolo Abeni
c26c192c3d udp: properly deal with xfrm encap and ADDRFORM
UDP GRO accounting assumes that the GRO receive callback is always
set when the UDP tunnel is enabled, but syzkaller proved otherwise,
leading tot the following splat:

WARNING: CPU: 0 PID: 5837 at net/ipv4/udp_offload.c:123 udp_tunnel_update_gro_rcv+0x28d/0x4c0 net/ipv4/udp_offload.c:123
Modules linked in:
CPU: 0 UID: 0 PID: 5837 Comm: syz-executor850 Not tainted 6.14.0-syzkaller-13320-g420aabef3ab5 #0 PREEMPT(full)
Hardware name: Google Compute Engine/Google Compute Engine, BIOS Google 02/12/2025
RIP: 0010:udp_tunnel_update_gro_rcv+0x28d/0x4c0 net/ipv4/udp_offload.c:123
Code: 00 00 e8 c6 5a 2f f7 48 c1 e5 04 48 8d b5 20 53 c7 9a ba 10
      00 00 00 4c 89 ff e8 ce 87 99 f7 e9 ce 00 00 00 e8 a4 5a 2f
      f7 90 <0f> 0b 90 e9 de fd ff ff bf 01 00 00 00 89 ee e8 cf
      5e 2f f7 85 ed
RSP: 0018:ffffc90003effa88 EFLAGS: 00010293
RAX: ffffffff8a93fc9c RBX: 0000000000000000 RCX: ffff8880306f9e00
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
RBP: 0000000000000000 R08: ffffffff8a93fabe R09: 1ffffffff20bfb2e
R10: dffffc0000000000 R11: fffffbfff20bfb2f R12: ffff88814ef21738
R13: dffffc0000000000 R14: ffff88814ef21778 R15: 1ffff11029de42ef
FS:  0000000000000000(0000) GS:ffff888124f96000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f04eec760d0 CR3: 000000000eb38000 CR4: 00000000003526f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 <TASK>
 udp_tunnel_cleanup_gro include/net/udp_tunnel.h:205 [inline]
 udpv6_destroy_sock+0x212/0x270 net/ipv6/udp.c:1829
 sk_common_release+0x71/0x2e0 net/core/sock.c:3896
 inet_release+0x17d/0x200 net/ipv4/af_inet.c:435
 __sock_release net/socket.c:647 [inline]
 sock_close+0xbc/0x240 net/socket.c:1391
 __fput+0x3e9/0x9f0 fs/file_table.c:465
 task_work_run+0x251/0x310 kernel/task_work.c:227
 exit_task_work include/linux/task_work.h:40 [inline]
 do_exit+0xa11/0x27f0 kernel/exit.c:953
 do_group_exit+0x207/0x2c0 kernel/exit.c:1102
 __do_sys_exit_group kernel/exit.c:1113 [inline]
 __se_sys_exit_group kernel/exit.c:1111 [inline]
 __x64_sys_exit_group+0x3f/0x40 kernel/exit.c:1111
 x64_sys_call+0x26c3/0x26d0 arch/x86/include/generated/asm/syscalls_64.h:232
 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
 do_syscall_64+0xf3/0x230 arch/x86/entry/syscall_64.c:94
 entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7f04eebfac79
Code: Unable to access opcode bytes at 0x7f04eebfac4f.
RSP: 002b:00007fffdcaa34a8 EFLAGS: 00000246 ORIG_RAX: 00000000000000e7
RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f04eebfac79
RDX: 000000000000003c RSI: 00000000000000e7 RDI: 0000000000000000
RBP: 00007f04eec75270 R08: ffffffffffffffb8 R09: 00007fffdcaa36c8
R10: 0000200000000000 R11: 0000000000000246 R12: 00007f04eec75270
R13: 0000000000000000 R14: 00007f04eec75cc0 R15: 00007f04eebcca70

Address the issue moving the accounting hook into
setup_udp_tunnel_sock() and set_xfrm_gro_udp_encap_rcv(), where
the GRO callback is actually set.

set_xfrm_gro_udp_encap_rcv() is prone to races with IPV6_ADDRFORM,
run the relevant setsockopt under the socket lock to ensure using
consistent values of sk_family and up->encap_type.

Refactor the GRO callback selection code, to make it clear that
the function pointer is always initialized.

Reported-by: syzbot+8c469a2260132cd095c1@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=8c469a2260132cd095c1
Fixes: 172bf009c1 ("xfrm: Support GRO for IPv4 ESP in UDP encapsulation")
Fixes: 5d7f5b2f6b ("udp_tunnel: use static call for GRO hooks when possible")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Link: https://patch.msgid.link/92bcdb6899145a9a387c8fa9e3ca656642a43634.1744228733.git.pabeni@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-14 14:29:01 -07:00
Sabrina Dubroca
028363685b espintcp: remove encap socket caching to avoid reference leak
The current scheme for caching the encap socket can lead to reference
leaks when we try to delete the netns.

The reference chain is: xfrm_state -> enacp_sk -> netns

Since the encap socket is a userspace socket, it holds a reference on
the netns. If we delete the espintcp state (through flush or
individual delete) before removing the netns, the reference on the
socket is dropped and the netns is correctly deleted. Otherwise, the
netns may not be reachable anymore (if all processes within the ns
have terminated), so we cannot delete the xfrm state to drop its
reference on the socket.

This patch results in a small (~2% in my tests) performance
regression.

A GC-type mechanism could be added for the socket cache, to clear
references if the state hasn't been used "recently", but it's a lot
more complex than just not caching the socket.

Fixes: e27cca96cd ("xfrm: add espintcp (RFC 8229)")
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2025-04-14 11:59:17 +02:00
Kuniyuki Iwashima
235bd9d21f tcp: Rename tcp_or_dccp_get_hashinfo().
DCCP was removed, so tcp_or_dccp_get_hashinfo() should be renamed.

Let's rename it to tcp_get_hashinfo().

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250410023921.11307-5-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-11 18:58:11 -07:00
Kuniyuki Iwashima
22d6c9eebf net: Unexport shared functions for DCCP.
DCCP was removed, so many inet functions no longer need to
be exported.

Let's unexport or use EXPORT_IPV6_MOD() for such functions.

sk_free_unlock_clone() is inlined in sk_clone_lock() as it's
the only caller.

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250410023921.11307-4-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-11 18:58:11 -07:00
Kuniyuki Iwashima
2a63dd0edf net: Retire DCCP socket.
DCCP was orphaned in 2021 by commit 054c4610bd ("MAINTAINERS: dccp:
move Gerrit Renker to CREDITS"), which noted that the last maintainer
had been inactive for five years.

In recent years, it has become a playground for syzbot, and most changes
to DCCP have been odd bug fixes triggered by syzbot.  Apart from that,
the only changes have been driven by treewide or networking API updates
or adjustments related to TCP.

Thus, in 2023, we announced we would remove DCCP in 2025 via commit
b144fcaf46 ("dccp: Print deprecation notice.").

Since then, only one individual has contacted the netdev mailing list. [0]

There is ongoing research for Multipath DCCP.  The repository is hosted
on GitHub [1], and development is not taking place through the upstream
community.  While the repository is published under the GPLv2 license,
the scheduling part remains proprietary, with a LICENSE file [2] stating:

  "This is not Open Source software."

The researcher mentioned a plan to address the licensing issue, upstream
the patches, and step up as a maintainer, but there has been no further
communication since then.

Maintaining DCCP for a decade without any real users has become a burden.

Therefore, it's time to remove it.

Removing DCCP will also provide significant benefits to TCP.  It allows
us to freely reorganize the layout of struct inet_connection_sock, which
is currently shared with DCCP, and optimize it to reduce the number of
cachelines accessed in the TCP fast path.

Note that we keep DCCP netfilter modules as requested.  [3]

Link: https://lore.kernel.org/netdev/20230710182253.81446-1-kuniyu@amazon.com/T/#u #[0]
Link: https://github.com/telekom/mp-dccp #[1]
Link: https://github.com/telekom/mp-dccp/blob/mpdccp_v03_k5.10/net/dccp/non_gpl_scheduler/LICENSE #[2]
Link: https://lore.kernel.org/netdev/Z_VQ0KlCRkqYWXa-@calendula/ #[3]
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Acked-by: Paul Moore <paul@paul-moore.com> (LSM and SELinux)
Acked-by: Casey Schaufler <casey@schaufler-ca.com>
Link: https://patch.msgid.link/20250410023921.11307-3-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-11 18:58:10 -07:00
Jiayuan Chen
c449d5f3a3 tcp: add LINUX_MIB_PAWS_TW_REJECTED counter
When TCP is in TIME_WAIT state, PAWS verification uses
LINUX_PAWSESTABREJECTED, which is ambiguous and cannot be distinguished
from other PAWS verification processes.

We added a new counter, like the existing PAWS_OLD_ACK one.

Also we update the doc with previously missing PAWS_OLD_ACK.

usage:
'''
nstat -az | grep PAWSTimewait
TcpExtPAWSTimewait              1                  0.0
'''

Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250409112614.16153-3-jiayuan.chen@linux.dev
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-10 18:29:26 -07:00
Jiayuan Chen
0427141112 tcp: add TCP_RFC7323_TW_PAWS drop reason
Devices in the networking path, such as firewalls, NATs, or routers, which
can perform SNAT or DNAT, use addresses from their own limited address
pools to masquerade the source address during forwarding, causing PAWS
verification to fail more easily.

Currently, packet loss statistics for PAWS can only be viewed through MIB,
which is a global metric and cannot be precisely obtained through tracing
to get the specific 4-tuple of the dropped packet. In the past, we had to
use kprobe ret to retrieve relevant skb information from
tcp_timewait_state_process().

We add a drop_reason pointer, similar to what previous commit does:
commit e34100c2ec ("tcp: add a drop_reason pointer to tcp_check_req()")

This commit addresses the PAWSESTABREJECTED case and also sets the
corresponding drop reason.

We use 'pwru' to test.

Before this commit:
''''
./pwru 'port 9999'
2025/04/07 13:40:19 Listening for events..
TUPLE                                        FUNC
172.31.75.115:12345->172.31.75.114:9999(tcp) sk_skb_reason_drop(SKB_DROP_REASON_NOT_SPECIFIED)
'''

After this commit:
'''
./pwru 'port 9999'
2025/04/07 13:51:34 Listening for events..
TUPLE                                        FUNC
172.31.75.115:12345->172.31.75.114:9999(tcp) sk_skb_reason_drop(SKB_DROP_REASON_TCP_RFC7323_TW_PAWS)
'''

Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250409112614.16153-2-jiayuan.chen@linux.dev
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-10 18:29:26 -07:00
Jakub Kicinski
cb7103298d Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR (net-6.15-rc2).

Conflict:

Documentation/networking/netdevices.rst
net/core/lock_debug.c
  04efcee6ef ("net: hold instance lock during NETDEV_CHANGE")
  03df156dd3 ("xdp: double protect netdev->xdp_flags with netdev->lock")

No adjacent changes.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-10 16:51:07 -07:00
Linus Torvalds
ab59a86056 Including fixes from netfilter.
Current release - regressions:
 
   - core: hold instance lock during NETDEV_CHANGE
 
   - rtnetlink: fix bad unlock balance in do_setlink().
 
   - ipv6:
     - fix null-ptr-deref in addrconf_add_ifaddr().
     - align behavior across nexthops during path selection
 
 Previous releases - regressions:
 
   - sctp: prevent transport UaF in sendmsg
 
   - mptcp: only inc MPJoinAckHMacFailure for HMAC failures
 
 Previous releases - always broken:
 
   - sched:
     - make ->qlen_notify() idempotent
     - ensure sufficient space when sending filter netlink notifications
     - sch_sfq: really don't allow 1 packet limit
 
   - netfilter: fix incorrect avx2 match of 5th field octet
 
   - tls: explicitly disallow disconnect
 
   - eth: octeontx2-pf: fix VF root node parent queue priority
 
 Signed-off-by: Paolo Abeni <pabeni@redhat.com>
 -----BEGIN PGP SIGNATURE-----
 
 iQJGBAABCAAwFiEEg1AjqC77wbdLX2LbKSR5jcyPE6QFAmf3xusSHHBhYmVuaUBy
 ZWRoYXQuY29tAAoJECkkeY3MjxOkud0P/iWOQB0oj0nvxl2ionPzgJEPduxuF0V6
 YPyDBUzLC7Gq6NmTcdDlNJt8fE6UmKUIneghUm9Ss7LRpKv0/TPvorKMSK44Zt53
 a5q49JeoI0TvnnhJesdHjiF31hrInqZmcX8OjSH8Q/SCKuy7rsgzao0vjvhd7lxm
 wA6LlWnJO1Pf991nNpbjUSoAZ7CMNlEIewGkdq0+6UADC7D9VagKTgIkFKw1BvRw
 2Eb2pzvdO9Pj02+l/mjdRhUzMZlr+FG+WBqXk5oKR0YZ2t3CS4O9/UUBoAn775tM
 gCfzepNuAUXGX0I6h+DANCNuswWuG/IvYTdhy+hRWblYeCkILU60E8eVMlh7tpII
 fUd5GSRhX1NpGNHUlDG/4b6IcjMO3ebtce2cm2Y9t2CUe7EqB0HZyvTczNroTxip
 KXrXcCBuEkzxXCZhaN/CrBu8Piu8vJk/rMH5ha1khce9CkmYY+m9ruvsYjZmPI+/
 P/SFkRdb/yV/SIOmay8FCJsy60t4FOtLnlDDrnygq4Q/9a7VwafebVpKS1fbTELG
 ZTiELN3/PN2GUnfREf0DVLPfn9sdqrMZaclLOJpp/Zi1/RZpo52WHceXJShiu9pe
 A8B+3SuPgOaLfhwyqiHlWm5moc9kNF26vlrWfFjK1GrJdxMisYwQoWD5eHfQFhDX
 UaxlAmndwZa9
 =wkGz
 -----END PGP SIGNATURE-----

Merge tag 'net-6.15-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Pull networking fixes from Paolo Abeni:
 "Including fixes from netfilter.

  Current release - regressions:

    - core: hold instance lock during NETDEV_CHANGE

    - rtnetlink: fix bad unlock balance in do_setlink()

    - ipv6:
       - fix null-ptr-deref in addrconf_add_ifaddr()
       - align behavior across nexthops during path selection

  Previous releases - regressions:

    - sctp: prevent transport UaF in sendmsg

    - mptcp: only inc MPJoinAckHMacFailure for HMAC failures

  Previous releases - always broken:

    - sched:
       - make ->qlen_notify() idempotent
       - ensure sufficient space when sending filter netlink notifications
       - sch_sfq: really don't allow 1 packet limit

    - netfilter: fix incorrect avx2 match of 5th field octet

    - tls: explicitly disallow disconnect

    - eth: octeontx2-pf: fix VF root node parent queue priority"

* tag 'net-6.15-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (38 commits)
  ethtool: cmis_cdb: Fix incorrect read / write length extension
  selftests: netfilter: add test case for recent mismatch bug
  nft_set_pipapo: fix incorrect avx2 match of 5th field octet
  net: ppp: Add bound checking for skb data on ppp_sync_txmung
  net: Fix null-ptr-deref by sock_lock_init_class_and_name() and rmmod.
  ipv6: Align behavior across nexthops during path selection
  net: phy: allow MDIO bus PM ops to start/stop state machine for phylink-controlled PHY
  net: phy: move phy_link_change() prior to mdio_bus_phy_may_suspend()
  selftests/tc-testing: sfq: check that a derived limit of 1 is rejected
  net_sched: sch_sfq: move the limit validation
  net_sched: sch_sfq: use a temporary work area for validating configuration
  net: libwx: handle page_pool_dev_alloc_pages error
  selftests: mptcp: validate MPJoin HMacFailure counters
  mptcp: only inc MPJoinAckHMacFailure for HMAC failures
  rtnetlink: Fix bad unlock balance in do_setlink().
  net: ethtool: Don't call .cleanup_data when prepare_data fails
  tc: Ensure we have enough buffer space when sending filter netlink notifications
  net: libwx: Fix the wrong Rx descriptor field
  octeontx2-pf: qos: fix VF root node parent queue index
  selftests: tls: check that disconnect does nothing
  ...
2025-04-10 08:52:18 -07:00
Kuniyuki Iwashima
0bb2f7a1ad net: Fix null-ptr-deref by sock_lock_init_class_and_name() and rmmod.
When I ran the repro [0] and waited a few seconds, I observed two
LOCKDEP splats: a warning immediately followed by a null-ptr-deref. [1]

Reproduction Steps:

  1) Mount CIFS
  2) Add an iptables rule to drop incoming FIN packets for CIFS
  3) Unmount CIFS
  4) Unload the CIFS module
  5) Remove the iptables rule

At step 3), the CIFS module calls sock_release() for the underlying
TCP socket, and it returns quickly.  However, the socket remains in
FIN_WAIT_1 because incoming FIN packets are dropped.

At this point, the module's refcnt is 0 while the socket is still
alive, so the following rmmod command succeeds.

  # ss -tan
  State      Recv-Q Send-Q Local Address:Port  Peer Address:Port
  FIN-WAIT-1 0      477        10.0.2.15:51062   10.0.0.137:445

  # lsmod | grep cifs
  cifs                 1159168  0

This highlights a discrepancy between the lifetime of the CIFS module
and the underlying TCP socket.  Even after CIFS calls sock_release()
and it returns, the TCP socket does not die immediately in order to
close the connection gracefully.

While this is generally fine, it causes an issue with LOCKDEP because
CIFS assigns a different lock class to the TCP socket's sk->sk_lock
using sock_lock_init_class_and_name().

Once an incoming packet is processed for the socket or a timer fires,
sk->sk_lock is acquired.

Then, LOCKDEP checks the lock context in check_wait_context(), where
hlock_class() is called to retrieve the lock class.  However, since
the module has already been unloaded, hlock_class() logs a warning
and returns NULL, triggering the null-ptr-deref.

If LOCKDEP is enabled, we must ensure that a module calling
sock_lock_init_class_and_name() (CIFS, NFS, etc) cannot be unloaded
while such a socket is still alive to prevent this issue.

Let's hold the module reference in sock_lock_init_class_and_name()
and release it when the socket is freed in sk_prot_free().

Note that sock_lock_init() clears sk->sk_owner for svc_create_socket()
that calls sock_lock_init_class_and_name() for a listening socket,
which clones a socket by sk_clone_lock() without GFP_ZERO.

[0]:
CIFS_SERVER="10.0.0.137"
CIFS_PATH="//${CIFS_SERVER}/Users/Administrator/Desktop/CIFS_TEST"
DEV="enp0s3"
CRED="/root/WindowsCredential.txt"

MNT=$(mktemp -d /tmp/XXXXXX)
mount -t cifs ${CIFS_PATH} ${MNT} -o vers=3.0,credentials=${CRED},cache=none,echo_interval=1

iptables -A INPUT -s ${CIFS_SERVER} -j DROP

for i in $(seq 10);
do
    umount ${MNT}
    rmmod cifs
    sleep 1
done

rm -r ${MNT}

iptables -D INPUT -s ${CIFS_SERVER} -j DROP

[1]:
DEBUG_LOCKS_WARN_ON(1)
WARNING: CPU: 10 PID: 0 at kernel/locking/lockdep.c:234 hlock_class (kernel/locking/lockdep.c:234 kernel/locking/lockdep.c:223)
Modules linked in: cifs_arc4 nls_ucs2_utils cifs_md4 [last unloaded: cifs]
CPU: 10 UID: 0 PID: 0 Comm: swapper/10 Not tainted 6.14.0 #36
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
RIP: 0010:hlock_class (kernel/locking/lockdep.c:234 kernel/locking/lockdep.c:223)
...
Call Trace:
 <IRQ>
 __lock_acquire (kernel/locking/lockdep.c:4853 kernel/locking/lockdep.c:5178)
 lock_acquire (kernel/locking/lockdep.c:469 kernel/locking/lockdep.c:5853 kernel/locking/lockdep.c:5816)
 _raw_spin_lock_nested (kernel/locking/spinlock.c:379)
 tcp_v4_rcv (./include/linux/skbuff.h:1678 ./include/net/tcp.h:2547 net/ipv4/tcp_ipv4.c:2350)
...

BUG: kernel NULL pointer dereference, address: 00000000000000c4
 PF: supervisor read access in kernel mode
 PF: error_code(0x0000) - not-present page
PGD 0
Oops: Oops: 0000 [#1] PREEMPT SMP NOPTI
CPU: 10 UID: 0 PID: 0 Comm: swapper/10 Tainted: G        W          6.14.0 #36
Tainted: [W]=WARN
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
RIP: 0010:__lock_acquire (kernel/locking/lockdep.c:4852 kernel/locking/lockdep.c:5178)
Code: 15 41 09 c7 41 8b 44 24 20 25 ff 1f 00 00 41 09 c7 8b 84 24 a0 00 00 00 45 89 7c 24 20 41 89 44 24 24 e8 e1 bc ff ff 4c 89 e7 <44> 0f b6 b8 c4 00 00 00 e8 d1 bc ff ff 0f b6 80 c5 00 00 00 88 44
RSP: 0018:ffa0000000468a10 EFLAGS: 00010046
RAX: 0000000000000000 RBX: ff1100010091cc38 RCX: 0000000000000027
RDX: ff1100081f09ca48 RSI: 0000000000000001 RDI: ff1100010091cc88
RBP: ff1100010091c200 R08: ff1100083fe6e228 R09: 00000000ffffbfff
R10: ff1100081eca0000 R11: ff1100083fe10dc0 R12: ff1100010091cc88
R13: 0000000000000001 R14: 0000000000000000 R15: 00000000000424b1
FS:  0000000000000000(0000) GS:ff1100081f080000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00000000000000c4 CR3: 0000000002c4a003 CR4: 0000000000771ef0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
PKRU: 55555554
Call Trace:
 <IRQ>
 lock_acquire (kernel/locking/lockdep.c:469 kernel/locking/lockdep.c:5853 kernel/locking/lockdep.c:5816)
 _raw_spin_lock_nested (kernel/locking/spinlock.c:379)
 tcp_v4_rcv (./include/linux/skbuff.h:1678 ./include/net/tcp.h:2547 net/ipv4/tcp_ipv4.c:2350)
 ip_protocol_deliver_rcu (net/ipv4/ip_input.c:205 (discriminator 1))
 ip_local_deliver_finish (./include/linux/rcupdate.h:878 net/ipv4/ip_input.c:234)
 ip_sublist_rcv_finish (net/ipv4/ip_input.c:576)
 ip_list_rcv_finish (net/ipv4/ip_input.c:628)
 ip_list_rcv (net/ipv4/ip_input.c:670)
 __netif_receive_skb_list_core (net/core/dev.c:5939 net/core/dev.c:5986)
 netif_receive_skb_list_internal (net/core/dev.c:6040 net/core/dev.c:6129)
 napi_complete_done (./include/linux/list.h:37 ./include/net/gro.h:519 ./include/net/gro.h:514 net/core/dev.c:6496)
 e1000_clean (drivers/net/ethernet/intel/e1000/e1000_main.c:3815)
 __napi_poll.constprop.0 (net/core/dev.c:7191)
 net_rx_action (net/core/dev.c:7262 net/core/dev.c:7382)
 handle_softirqs (kernel/softirq.c:561)
 __irq_exit_rcu (kernel/softirq.c:596 kernel/softirq.c:435 kernel/softirq.c:662)
 irq_exit_rcu (kernel/softirq.c:680)
 common_interrupt (arch/x86/kernel/irq.c:280 (discriminator 14))
  </IRQ>
 <TASK>
 asm_common_interrupt (./arch/x86/include/asm/idtentry.h:693)
RIP: 0010:default_idle (./arch/x86/include/asm/irqflags.h:37 ./arch/x86/include/asm/irqflags.h:92 arch/x86/kernel/process.c:744)
Code: 4c 01 c7 4c 29 c2 e9 72 ff ff ff 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa eb 07 0f 00 2d c3 2b 15 00 fb f4 <fa> c3 cc cc cc cc 66 66 2e 0f 1f 84 00 00 00 00 00 90 90 90 90 90
RSP: 0018:ffa00000000ffee8 EFLAGS: 00000202
RAX: 000000000000640b RBX: ff1100010091c200 RCX: 0000000000061aa4
RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffffff812f30c5
RBP: 000000000000000a R08: 0000000000000001 R09: 0000000000000000
R10: 0000000000000001 R11: 0000000000000002 R12: 0000000000000000
R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
 ? do_idle (kernel/sched/idle.c:186 kernel/sched/idle.c:325)
 default_idle_call (./include/linux/cpuidle.h:143 kernel/sched/idle.c:118)
 do_idle (kernel/sched/idle.c:186 kernel/sched/idle.c:325)
 cpu_startup_entry (kernel/sched/idle.c:422 (discriminator 1))
 start_secondary (arch/x86/kernel/smpboot.c:315)
 common_startup_64 (arch/x86/kernel/head_64.S:421)
 </TASK>
Modules linked in: cifs_arc4 nls_ucs2_utils cifs_md4 [last unloaded: cifs]
CR2: 00000000000000c4

Fixes: ed07536ed6 ("[PATCH] lockdep: annotate nfs/nfsd in-kernel sockets")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Cc: stable@vger.kernel.org
Link: https://patch.msgid.link/20250407163313.22682-1-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-09 19:11:55 -07:00
Jakub Kicinski
ce7b149474 netdev: depend on netdev->lock for qstats in ops locked drivers
We mostly needed rtnl_lock in qstat to make sure the queue count
is stable while we work. For "ops locked" drivers the instance
lock protects the queue count, so we don't have to take rtnl_lock.

For currently ops-locked drivers: netdevsim and bnxt need
the protection from netdev going down while we dump, which
instance lock provides. gve doesn't care.

Reviewed-by: Joe Damato <jdamato@fastly.com>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20250408195956.412733-9-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-09 17:01:52 -07:00
Jakub Kicinski
03df156dd3 xdp: double protect netdev->xdp_flags with netdev->lock
Protect xdp_features with netdev->lock. This way pure readers
no longer have to take rtnl_lock to access the field.

This includes calling NETDEV_XDP_FEAT_CHANGE under the lock.
Looks like that's fine for bonding, the only "real" listener,
it's the same as ethtool feature change.

In terms of normal drivers - only GVE need special consideration
(other drivers don't use instance lock or don't support XDP).
It calls xdp_set_features_flag() helper from gve_init_priv() which
in turn is called from gve_reset_recovery() (locked), or prior
to netdev registration. So switch to _locked.

Reviewed-by: Joe Damato <jdamato@fastly.com>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Acked-by: Harshitha Ramamurthy <hramamurthy@google.com>
Acked-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://patch.msgid.link/20250408195956.412733-6-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-09 17:01:52 -07:00
Jakub Kicinski
4ec9031cbe netdev: add "ops compat locking" helpers
Add helpers to "lock a netdev in a backward-compatible way",
which for ops-locked netdevs will mean take the instance lock.
For drivers which haven't opted into the ops locking we'll take
rtnl_lock.

The scoped foreach is dropping and re-taking the lock for each
device, even if prev and next are both under rtnl_lock.
I hope that's fine since we expect that netdev nl to be mostly
supported by modern drivers, and modern drivers should also
opt into the instance locking.

Note that these helpers are mostly needed for queue related state,
because drivers modify queue config in their ops in a non-atomic
way. Or differently put, queue changes don't have a clear-cut API
like NAPI configuration. Any state that can should just use the
instance lock directly, not the "compat" hacks.

Reviewed-by: Joe Damato <jdamato@fastly.com>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20250408195956.412733-4-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-09 17:01:51 -07:00
Jakub Kicinski
606048cbd8 net: designate XSK pool pointers in queues as "ops protected"
Read accesses go via xsk_get_pool_from_qid(), the call coming
from the core and gve look safe (other "ops locked" drivers
don't support XSK).

Write accesses go via xsk_reg_pool_at_qid() and xsk_clear_pool_at_qid().
Former is already under the ops lock, latter is not (both coming from
the workqueue via xp_clear_dev() and NETDEV_UNREGISTER via xsk_notifier()).

Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Signed-off-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20250408195956.412733-3-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-09 17:01:51 -07:00
Paolo Abeni
5d7f5b2f6b udp_tunnel: use static call for GRO hooks when possible
It's quite common to have a single UDP tunnel type active in the
whole system. In such a case we can replace the indirect call for
the UDP tunnel GRO callback with a static call.

Add the related accounting in the control path and switch to static
call when possible. To keep the code simple use a static array for
the registered tunnel types, and size such array based on the kernel
config.

Note that there are valid kernel configurations leading to
UDP_MAX_TUNNEL_TYPES == 0 even with IS_ENABLED(CONFIG_NET_UDP_TUNNEL),
Explicitly skip the accounting in such a case, to avoid compile warning
when accessing "udp_tunnel_gro_types".

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/53d156cdfddcc9678449e873cc83e68fa1582653.1744040675.git.pabeni@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-08 18:19:45 -07:00
Paolo Abeni
a36283e2b6 udp_tunnel: create a fastpath GRO lookup.
Most UDP tunnels bind a socket to a local port, with ANY address, no
peer and no interface index specified.
Additionally it's quite common to have a single tunnel device per
namespace.

Track in each namespace the UDP tunnel socket respecting the above.
When only a single one is present, store a reference in the netns.

When such reference is not NULL, UDP tunnel GRO lookup just need to
match the incoming packet destination port vs the socket local port.

The tunnel socket never sets the reuse[port] flag[s]. When bound to no
address and interface, no other socket can exist in the same netns
matching the specified local port.

Matching packets with non-local destination addresses will be
aggregated, and eventually segmented as needed - no behavior changes
intended.

Restrict the optimization to kernel sockets only: it covers all the
relevant use-cases, and user-space owned sockets could be disconnected
and rebound after setup_udp_tunnel_sock(), breaking the uniqueness
assumption

Note that the UDP tunnel socket reference is stored into struct
netns_ipv4 for both IPv4 and IPv6 tunnels. That is intentional to keep
all the fastpath-related netns fields in the same struct and allow
cacheline-based optimization. Currently both the IPv4 and IPv6 socket
pointer share the same cacheline as the `udp_table` field.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/41d16bc8d1257d567f9344c445b4ae0b4a91ede4.1744040675.git.pabeni@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-08 18:19:41 -07:00
Eric Dumazet
0a7de4a8f8 net: rps: remove kfree_rcu_mightsleep() use
Add an rcu_head to sd_flow_limit and rps_sock_flow_table structs
to use the more conventional and predictable k[v]free_rcu().

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20250407163602.170356-5-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-08 12:30:55 -07:00
Ricardo Cañuelo Navarro
f1a69a940d sctp: detect and prevent references to a freed transport in sendmsg
sctp_sendmsg() re-uses associations and transports when possible by
doing a lookup based on the socket endpoint and the message destination
address, and then sctp_sendmsg_to_asoc() sets the selected transport in
all the message chunks to be sent.

There's a possible race condition if another thread triggers the removal
of that selected transport, for instance, by explicitly unbinding an
address with setsockopt(SCTP_SOCKOPT_BINDX_REM), after the chunks have
been set up and before the message is sent. This can happen if the send
buffer is full, during the period when the sender thread temporarily
releases the socket lock in sctp_wait_for_sndbuf().

This causes the access to the transport data in
sctp_outq_select_transport(), when the association outqueue is flushed,
to result in a use-after-free read.

This change avoids this scenario by having sctp_transport_free() signal
the freeing of the transport, tagging it as "dead". In order to do this,
the patch restores the "dead" bit in struct sctp_transport, which was
removed in
commit 47faa1e4c5 ("sctp: remove the dead field of sctp_transport").

Then, in the scenario where the sender thread has released the socket
lock in sctp_wait_for_sndbuf(), the bit is checked again after
re-acquiring the socket lock to detect the deletion. This is done while
holding a reference to the transport to prevent it from being freed in
the process.

If the transport was deleted while the socket lock was relinquished,
sctp_sendmsg_to_asoc() will return -EAGAIN to let userspace retry the
send.

The bug was found by a private syzbot instance (see the error report [1]
and the C reproducer that triggers it [2]).

Link: https://people.igalia.com/rcn/kernel_logs/20250402__KASAN_slab-use-after-free_Read_in_sctp_outq_select_transport.txt [1]
Link: https://people.igalia.com/rcn/kernel_logs/20250402__KASAN_slab-use-after-free_Read_in_sctp_outq_select_transport__repro.c [2]
Cc: stable@vger.kernel.org
Fixes: df132eff46 ("sctp: clear the transport of some out_chunk_list chunks in sctp_assoc_rm_peer")
Suggested-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: Ricardo Cañuelo Navarro <rcn@igalia.com>
Acked-by: Xin Long <lucien.xin@gmail.com>
Link: https://patch.msgid.link/20250404-kasan_slab-use-after-free_read_in_sctp_outq_select_transport__20250404-v1-1-5ce4a0b78ef2@igalia.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-04-08 11:34:06 +02:00
Thomas Gleixner
8fa7292fee treewide: Switch/rename to timer_delete[_sync]()
timer_delete[_sync]() replaces del_timer[_sync](). Convert the whole tree
over and remove the historical wrapper inlines.

Conversion was done with coccinelle plus manual fixups where necessary.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2025-04-05 10:30:12 +02:00
Jakub Kicinski
ec304b70d4 net: move mp dev config validation to __net_mp_open_rxq()
devmem code performs a number of safety checks to avoid having
to reimplement all of them in the drivers. Move those to
__net_mp_open_rxq() and reuse that function for binding to make
sure that io_uring ZC also benefits from them.

While at it rename the queue ID variable to rxq_idx in
__net_mp_open_rxq(), we touch most of the relevant lines.

The XArray insertion is reordered after the netdev_rx_queue_restart()
call, otherwise we'd need to duplicate the queue index check
or risk inserting an invalid pointer. The XArray allocation
failures should be extremely rare.

Reviewed-by: Mina Almasry <almasrymina@google.com>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Fixes: 6e18ed929d ("net: add helpers for setting a memory provider on an rx queue")
Link: https://patch.msgid.link/20250403013405.2827250-2-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-04 07:35:38 -07:00
Stanislav Fomichev
1901066aab netdevsim: add dummy device notifiers
In order to exercise and verify notifiers' locking assumptions,
register dummy notifiers (via register_netdevice_notifier_dev_net).
Share notifier event handler that enforces the assumptions with
lock_debug.c (rename and export rtnl_net_debug_event as
netdev_debug_event). Add ops lock asserts to netdev_debug_event.

Signed-off-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20250401163452.622454-6-sdf@fomichev.me
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-03 15:32:08 -07:00
Stanislav Fomichev
8965c160b8 net: use netif_disable_lro in ipv6_add_dev
ipv6_add_dev might call dev_disable_lro which unconditionally grabs
instance lock, so it will deadlock during NETDEV_REGISTER. Switch
to netif_disable_lro.

Make sure all callers hold the instance lock as well.

Cc: Cosmin Ratiu <cratiu@nvidia.com>
Fixes: ad7c7b2172 ("net: hold netdev instance lock during sysfs operations")
Signed-off-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20250401163452.622454-4-sdf@fomichev.me
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-03 15:32:08 -07:00
Linus Torvalds
acc4d5ff0b Rather tiny PR, mostly so that we can get into our trees your fix
to the x86 Makefile.
 
 Current release - regressions:
 
  - Revert "tcp: avoid atomic operations on sk->sk_rmem_alloc",
    error queue accounting was missed
 
 Current release - new code bugs:
 
  - 5 fixes for the netdevice instance locking work
 
 Previous releases - regressions:
 
  - usbnet: restore usb%d name exception for local mac addresses
 
 Previous releases - always broken:
 
  - rtnetlink: allocate vfinfo size for VF GUIDs when supported,
    avoid spurious GET_LINK failures
 
  - eth: mana: Switch to page pool for jumbo frames
 
  - phy: broadcom: Correct BCM5221 PHY model detection
 
 Misc:
 
  - selftests: drv-net: replace helpers for referring to other files
 
 Signed-off-by: Jakub Kicinski <kuba@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmfrLiQACgkQMUZtbf5S
 IrvDsg//VH5VmkMau/TUF1WpwA03Wb/X0UqtM72lufVCMGYCogqjHZzs9popfhuu
 CCi4ZiBofEJHfN9WYJoumUL8aOdux0Q875o7h5gvfqKCokkUbJDk3W32QiLj1RqZ
 L14TorNOyS9Tg9m60pLa/6H3WS0kduhUGLuLzgTmOtT/YVJrOGmBtk/tRXFAKclH
 f3CPucvZGS5Fr4HfUc3yWXiocjibDJnv+jE+xW6S6YHpghvsK7anUvqIGTqQV8Vp
 s6AuvFULGO00968hFHcO0N8f8MnaCCJr2bcHCOXyjssdEbdvDOqzhFN4KhxWEPbK
 kCl3rLkPdkXo+ekOC7gIxXXaKVz3IVm28pegtEws8fda/iLuqZTp+BKo7kdrf3Iy
 br0rP/iK3eFN0M1XpVUIbmEuJ6VamztCzK88uvDKdI+Ol3GLAfy9v5NmkbzJI+aE
 cw+SyE6NgbeDeHBxvOu2F7G2sWMBkTEGaHMNXCv7I/VAvQsbk48onTVnpA+GlFD5
 vFqMxiZHLBUfFOfUcHxmw8KAkZ44pc1xEkpw/4s8GZOfq+1oWz2LrSQ8M8Hjs2VN
 NTuE44OOsBDPOJ54iDAIOr5jyF+ZDeWEuPbvWbHGri80gII+2iF2788N2ToyAkyR
 R0t4VtJwXF+8D6IxTG41OIvAuq9Zc8AqM6O7VnOyFhZtKqezBTI=
 =BDoM
 -----END PGP SIGNATURE-----

Merge tag 'net-6.15-rc0' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Pull networking fixes from Jakub Kicinski:
 "Rather tiny pull request, mostly so that we can get into our trees
  your fix to the x86 Makefile.

  Current release - regressions:

   - Revert "tcp: avoid atomic operations on sk->sk_rmem_alloc", error
     queue accounting was missed

  Current release - new code bugs:

   - 5 fixes for the netdevice instance locking work

  Previous releases - regressions:

   - usbnet: restore usb%d name exception for local mac addresses

  Previous releases - always broken:

   - rtnetlink: allocate vfinfo size for VF GUIDs when supported, avoid
     spurious GET_LINK failures

   - eth: mana: Switch to page pool for jumbo frames

   - phy: broadcom: Correct BCM5221 PHY model detection

  Misc:

   - selftests: drv-net: replace helpers for referring to other files"

* tag 'net-6.15-rc0' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (22 commits)
  Revert "tcp: avoid atomic operations on sk->sk_rmem_alloc"
  bnxt_en: bring back rtnl lock in bnxt_shutdown
  eth: gve: add missing netdev locks on reset and shutdown paths
  selftests: mptcp: ignore mptcp_diag binary
  selftests: mptcp: close fd_in before returning in main_loop
  selftests: mptcp: fix incorrect fd checks in main_loop
  mptcp: fix NULL pointer in can_accept_new_subflow
  octeontx2-af: Free NIX_AF_INT_VEC_GEN irq
  octeontx2-af: Fix mbox INTR handler when num VFs > 64
  net: fix use-after-free in the netdev_nl_sock_priv_destroy()
  selftests: net: use Path helpers in ping
  selftests: net: use the dummy bpf from net/lib
  selftests: drv-net: replace the rpath helper with Path objects
  net: lapbether: use netdev_lockdep_set_classes() helper
  net: phy: broadcom: Correct BCM5221 PHY model detection
  net: usb: usbnet: restore usb%d name exception for local mac addresses
  net/mlx5e: SHAMPO, Make reserved size independent of page size
  net: mana: Switch to page pool for jumbo frames
  MAINTAINERS: Add dedicated entries for phy_link_topology
  net: move replay logic to tc_modify_qdisc
  ...
2025-04-01 20:00:51 -07:00
Linus Torvalds
eb0ece1602 - The 6 patch series "Enable strict percpu address space checks" from
Uros Bizjak uses x86 named address space qualifiers to provide
   compile-time checking of percpu area accesses.
 
   This has caused a small amount of fallout - two or three issues were
   reported.  In all cases the calling code was founf to be incorrect.
 
 - The 4 patch series "Some cleanup for memcg" from Chen Ridong
   implements some relatively monir cleanups for the memcontrol code.
 
 - The 17 patch series "mm: fixes for device-exclusive entries (hmm)"
   from David Hildenbrand fixes a boatload of issues which David found then
   using device-exclusive PTE entries when THP is enabled.  More work is
   needed, but this makes thins better - our own HMM selftests now succeed.
 
 - The 2 patch series "mm: zswap: remove z3fold and zbud" from Yosry
   Ahmed remove the z3fold and zbud implementations.  They have been
   deprecated for half a year and nobody has complained.
 
 - The 5 patch series "mm: further simplify VMA merge operation" from
   Lorenzo Stoakes implements numerous simplifications in this area.  No
   runtime effects are anticipated.
 
 - The 4 patch series "mm/madvise: remove redundant mmap_lock operations
   from process_madvise()" from SeongJae Park rationalizes the locking in
   the madvise() implementation.  Performance gains of 20-25% were observed
   in one MADV_DONTNEED microbenchmark.
 
 - The 12 patch series "Tiny cleanup and improvements about SWAP code"
   from Baoquan He contains a number of touchups to issues which Baoquan
   noticed when working on the swap code.
 
 - The 2 patch series "mm: kmemleak: Usability improvements" from Catalin
   Marinas implements a couple of improvements to the kmemleak user-visible
   output.
 
 - The 2 patch series "mm/damon/paddr: fix large folios access and
   schemes handling" from Usama Arif provides a couple of fixes for DAMON's
   handling of large folios.
 
 - The 3 patch series "mm/damon/core: fix wrong and/or useless
   damos_walk() behaviors" from SeongJae Park fixes a few issues with the
   accuracy of kdamond's walking of DAMON regions.
 
 - The 3 patch series "expose mapping wrprotect, fix fb_defio use" from
   Lorenzo Stoakes changes the interaction between framebuffer deferred-io
   and core MM.  No functional changes are anticipated - this is
   preparatory work for the future removal of page structure fields.
 
 - The 4 patch series "mm/damon: add support for hugepage_size DAMOS
   filter" from Usama Arif adds a DAMOS filter which permits the filtering
   by huge page sizes.
 
 - The 4 patch series "mm: permit guard regions for file-backed/shmem
   mappings" from Lorenzo Stoakes extends the guard region feature from its
   present "anon mappings only" state.  The feature now covers shmem and
   file-backed mappings.
 
 - The 4 patch series "mm: batched unmap lazyfree large folios during
   reclamation" from Barry Song cleans up and speeds up the unmapping for
   pte-mapped large folios.
 
 - The 18 patch series "reimplement per-vma lock as a refcount" from
   Suren Baghdasaryan puts the vm_lock back into the vma.  Our reasons for
   pulling it out were largely bogus and that change made the code more
   messy.  This patchset provides small (0-10%) improvements on one
   microbenchmark.
 
 - The 5 patch series "Docs/mm/damon: misc DAMOS filters documentation
   fixes and improves" from SeongJae Park does some maintenance work on the
   DAMON docs.
 
 - The 27 patch series "hugetlb/CMA improvements for large systems" from
   Frank van der Linden addresses a pile of issues which have been observed
   when using CMA on large machines.
 
 - The 2 patch series "mm/damon: introduce DAMOS filter type for unmapped
   pages" from SeongJae Park enables users of DMAON/DAMOS to filter my the
   page's mapped/unmapped status.
 
 - The 19 patch series "zsmalloc/zram: there be preemption" from Sergey
   Senozhatsky teaches zram to run its compression and decompression
   operations preemptibly.
 
 - The 12 patch series "selftests/mm: Some cleanups from trying to run
   them" from Brendan Jackman fixes a pile of unrelated issues which
   Brendan encountered while runnimg our selftests.
 
 - The 2 patch series "fs/proc/task_mmu: add guard region bit to pagemap"
   from Lorenzo Stoakes permits userspace to use /proc/pid/pagemap to
   determine whether a particular page is a guard page.
 
 - The 7 patch series "mm, swap: remove swap slot cache" from Kairui Song
   removes the swap slot cache from the allocation path - it simply wasn't
   being effective.
 
 - The 5 patch series "mm: cleanups for device-exclusive entries (hmm)"
   from David Hildenbrand implements a number of unrelated cleanups in this
   code.
 
 - The 5 patch series "mm: Rework generic PTDUMP configs" from Anshuman
   Khandual implements a number of preparatoty cleanups to the
   GENERIC_PTDUMP Kconfig logic.
 
 - The 8 patch series "mm/damon: auto-tune aggregation interval" from
   SeongJae Park implements a feedback-driven automatic tuning feature for
   DAMON's aggregation interval tuning.
 
 - The 5 patch series "Fix lazy mmu mode" from Ryan Roberts fixes some
   issues in powerpc, sparc and x86 lazy MMU implementations.  Ryan did
   this in preparation for implementing lazy mmu mode for arm64 to optimize
   vmalloc.
 
 - The 2 patch series "mm/page_alloc: Some clarifications for migratetype
   fallback" from Brendan Jackman reworks some commentary to make the code
   easier to follow.
 
 - The 3 patch series "page_counter cleanup and size reduction" from
   Shakeel Butt cleans up the page_counter code and fixes a size increase
   which we accidentally added late last year.
 
 - The 3 patch series "Add a command line option that enables control of
   how many threads should be used to allocate huge pages" from Thomas
   Prescher does that.  It allows the careful operator to significantly
   reduce boot time by tuning the parallalization of huge page
   initialization.
 
 - The 3 patch series "Fix calculations in trace_balance_dirty_pages()
   for cgwb" from Tang Yizhou fixes the tracing output from the dirty page
   balancing code.
 
 - The 9 patch series "mm/damon: make allow filters after reject filters
   useful and intuitive" from SeongJae Park improves the handling of allow
   and reject filters.  Behaviour is made more consistent and the
   documention is updated accordingly.
 
 - The 5 patch series "Switch zswap to object read/write APIs" from Yosry
   Ahmed updates zswap to the new object read/write APIs and thus permits
   the removal of some legacy code from zpool and zsmalloc.
 
 - The 6 patch series "Some trivial cleanups for shmem" from Baolin Wang
   does as it claims.
 
 - The 20 patch series "fs/dax: Fix ZONE_DEVICE page reference counts"
   from Alistair Popple regularizes the weird ZONE_DEVICE page refcount
   handling in DAX, permittig the removal of a number of special-case
   checks.
 
 - The 4 patch series "refactor mremap and fix bug" from Lorenzo Stoakes
   is a preparatoty refactoring and cleanup of the mremap() code.
 
 - The 20 patch series "mm: MM owner tracking for large folios (!hugetlb)
   + CONFIG_NO_PAGE_MAPCOUNT" from David Hildenbrand reworks the manner in
   which we determine whether a large folio is known to be mapped
   exclusively into a single MM.
 
 - The 8 patch series "mm/damon: add sysfs dirs for managing DAMOS
   filters based on handling layers" from SeongJae Park adds a couple of
   new sysfs directories to ease the management of DAMON/DAMOS filters.
 
 - The 13 patch series "arch, mm: reduce code duplication in mem_init()"
   from Mike Rapoport consolidates many per-arch implementations of
   mem_init() into code generic code, where that is practical.
 
 - The 13 patch series "mm/damon/sysfs: commit parameters online via
   damon_call()" from SeongJae Park continues the cleaning up of sysfs
   access to DAMON internal data.
 
 - The 3 patch series "mm: page_ext: Introduce new iteration API" from
   Luiz Capitulino reworks the page_ext initialization to fix a boot-time
   crash which was observed with an unusual combination of compile and
   cmdline options.
 
 - The 8 patch series "Buddy allocator like (or non-uniform) folio split"
   from Zi Yan reworks the code to split a folio into smaller folios.  The
   main benefit is lessened memory consumption: fewer post-split folios are
   generated.
 
 - The 2 patch series "Minimize xa_node allocation during xarry split"
   from Zi Yan reduces the number of xarray xa_nodes which are generated
   during an xarray split.
 
 - The 2 patch series "drivers/base/memory: Two cleanups" from Gavin Shan
   performs some maintenance work on the drivers/base/memory code.
 
 - The 3 patch series "Add tracepoints for lowmem reserves, watermarks
   and totalreserve_pages" from Martin Liu adds some more tracepoints to
   the page allocator code.
 
 - The 4 patch series "mm/madvise: cleanup requests validations and
   classifications" from SeongJae Park cleans up some warts which SeongJae
   observed during his earlier madvise work.
 
 - The 3 patch series "mm/hwpoison: Fix regressions in memory failure
   handling" from Shuai Xue addresses two quite serious regressions which
   Shuai has observed in the memory-failure implementation.
 
 - The 5 patch series "mm: reliable huge page allocator" from Johannes
   Weiner makes huge page allocations cheaper and more reliable by reducing
   fragmentation.
 
 - The 5 patch series "Minor memcg cleanups & prep for memdescs" from
   Matthew Wilcox is preparatory work for the future implementation of
   memdescs.
 
 - The 4 patch series "track memory used by balloon drivers" from Nico
   Pache introduces a way to track memory used by our various balloon
   drivers.
 
 - The 2 patch series "mm/damon: introduce DAMOS filter type for active
   pages" from Nhat Pham permits users to filter for active/inactive pages,
   separately for file and anon pages.
 
 - The 2 patch series "Adding Proactive Memory Reclaim Statistics" from
   Hao Jia separates the proactive reclaim statistics from the direct
   reclaim statistics.
 
 - The 2 patch series "mm/vmscan: don't try to reclaim hwpoison folio"
   from Jinjiang Tu fixes our handling of hwpoisoned pages within the
   reclaim code.
 -----BEGIN PGP SIGNATURE-----
 
 iHQEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZ+nZaAAKCRDdBJ7gKXxA
 jsOWAPiP4r7CJHMZRK4eyJOkvS1a1r+TsIarrFZtjwvf/GIfAQCEG+JDxVfUaUSF
 Ee93qSSLR1BkNdDw+931Pu0mXfbnBw==
 =Pn2K
 -----END PGP SIGNATURE-----

Merge tag 'mm-stable-2025-03-30-16-52' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull MM updates from Andrew Morton:

 - The series "Enable strict percpu address space checks" from Uros
   Bizjak uses x86 named address space qualifiers to provide
   compile-time checking of percpu area accesses.

   This has caused a small amount of fallout - two or three issues were
   reported. In all cases the calling code was found to be incorrect.

 - The series "Some cleanup for memcg" from Chen Ridong implements some
   relatively monir cleanups for the memcontrol code.

 - The series "mm: fixes for device-exclusive entries (hmm)" from David
   Hildenbrand fixes a boatload of issues which David found then using
   device-exclusive PTE entries when THP is enabled. More work is
   needed, but this makes thins better - our own HMM selftests now
   succeed.

 - The series "mm: zswap: remove z3fold and zbud" from Yosry Ahmed
   remove the z3fold and zbud implementations. They have been deprecated
   for half a year and nobody has complained.

 - The series "mm: further simplify VMA merge operation" from Lorenzo
   Stoakes implements numerous simplifications in this area. No runtime
   effects are anticipated.

 - The series "mm/madvise: remove redundant mmap_lock operations from
   process_madvise()" from SeongJae Park rationalizes the locking in the
   madvise() implementation. Performance gains of 20-25% were observed
   in one MADV_DONTNEED microbenchmark.

 - The series "Tiny cleanup and improvements about SWAP code" from
   Baoquan He contains a number of touchups to issues which Baoquan
   noticed when working on the swap code.

 - The series "mm: kmemleak: Usability improvements" from Catalin
   Marinas implements a couple of improvements to the kmemleak
   user-visible output.

 - The series "mm/damon/paddr: fix large folios access and schemes
   handling" from Usama Arif provides a couple of fixes for DAMON's
   handling of large folios.

 - The series "mm/damon/core: fix wrong and/or useless damos_walk()
   behaviors" from SeongJae Park fixes a few issues with the accuracy of
   kdamond's walking of DAMON regions.

 - The series "expose mapping wrprotect, fix fb_defio use" from Lorenzo
   Stoakes changes the interaction between framebuffer deferred-io and
   core MM. No functional changes are anticipated - this is preparatory
   work for the future removal of page structure fields.

 - The series "mm/damon: add support for hugepage_size DAMOS filter"
   from Usama Arif adds a DAMOS filter which permits the filtering by
   huge page sizes.

 - The series "mm: permit guard regions for file-backed/shmem mappings"
   from Lorenzo Stoakes extends the guard region feature from its
   present "anon mappings only" state. The feature now covers shmem and
   file-backed mappings.

 - The series "mm: batched unmap lazyfree large folios during
   reclamation" from Barry Song cleans up and speeds up the unmapping
   for pte-mapped large folios.

 - The series "reimplement per-vma lock as a refcount" from Suren
   Baghdasaryan puts the vm_lock back into the vma. Our reasons for
   pulling it out were largely bogus and that change made the code more
   messy. This patchset provides small (0-10%) improvements on one
   microbenchmark.

 - The series "Docs/mm/damon: misc DAMOS filters documentation fixes and
   improves" from SeongJae Park does some maintenance work on the DAMON
   docs.

 - The series "hugetlb/CMA improvements for large systems" from Frank
   van der Linden addresses a pile of issues which have been observed
   when using CMA on large machines.

 - The series "mm/damon: introduce DAMOS filter type for unmapped pages"
   from SeongJae Park enables users of DMAON/DAMOS to filter my the
   page's mapped/unmapped status.

 - The series "zsmalloc/zram: there be preemption" from Sergey
   Senozhatsky teaches zram to run its compression and decompression
   operations preemptibly.

 - The series "selftests/mm: Some cleanups from trying to run them" from
   Brendan Jackman fixes a pile of unrelated issues which Brendan
   encountered while runnimg our selftests.

 - The series "fs/proc/task_mmu: add guard region bit to pagemap" from
   Lorenzo Stoakes permits userspace to use /proc/pid/pagemap to
   determine whether a particular page is a guard page.

 - The series "mm, swap: remove swap slot cache" from Kairui Song
   removes the swap slot cache from the allocation path - it simply
   wasn't being effective.

 - The series "mm: cleanups for device-exclusive entries (hmm)" from
   David Hildenbrand implements a number of unrelated cleanups in this
   code.

 - The series "mm: Rework generic PTDUMP configs" from Anshuman Khandual
   implements a number of preparatoty cleanups to the GENERIC_PTDUMP
   Kconfig logic.

 - The series "mm/damon: auto-tune aggregation interval" from SeongJae
   Park implements a feedback-driven automatic tuning feature for
   DAMON's aggregation interval tuning.

 - The series "Fix lazy mmu mode" from Ryan Roberts fixes some issues in
   powerpc, sparc and x86 lazy MMU implementations. Ryan did this in
   preparation for implementing lazy mmu mode for arm64 to optimize
   vmalloc.

 - The series "mm/page_alloc: Some clarifications for migratetype
   fallback" from Brendan Jackman reworks some commentary to make the
   code easier to follow.

 - The series "page_counter cleanup and size reduction" from Shakeel
   Butt cleans up the page_counter code and fixes a size increase which
   we accidentally added late last year.

 - The series "Add a command line option that enables control of how
   many threads should be used to allocate huge pages" from Thomas
   Prescher does that. It allows the careful operator to significantly
   reduce boot time by tuning the parallalization of huge page
   initialization.

 - The series "Fix calculations in trace_balance_dirty_pages() for cgwb"
   from Tang Yizhou fixes the tracing output from the dirty page
   balancing code.

 - The series "mm/damon: make allow filters after reject filters useful
   and intuitive" from SeongJae Park improves the handling of allow and
   reject filters. Behaviour is made more consistent and the documention
   is updated accordingly.

 - The series "Switch zswap to object read/write APIs" from Yosry Ahmed
   updates zswap to the new object read/write APIs and thus permits the
   removal of some legacy code from zpool and zsmalloc.

 - The series "Some trivial cleanups for shmem" from Baolin Wang does as
   it claims.

 - The series "fs/dax: Fix ZONE_DEVICE page reference counts" from
   Alistair Popple regularizes the weird ZONE_DEVICE page refcount
   handling in DAX, permittig the removal of a number of special-case
   checks.

 - The series "refactor mremap and fix bug" from Lorenzo Stoakes is a
   preparatoty refactoring and cleanup of the mremap() code.

 - The series "mm: MM owner tracking for large folios (!hugetlb) +
   CONFIG_NO_PAGE_MAPCOUNT" from David Hildenbrand reworks the manner in
   which we determine whether a large folio is known to be mapped
   exclusively into a single MM.

 - The series "mm/damon: add sysfs dirs for managing DAMOS filters based
   on handling layers" from SeongJae Park adds a couple of new sysfs
   directories to ease the management of DAMON/DAMOS filters.

 - The series "arch, mm: reduce code duplication in mem_init()" from
   Mike Rapoport consolidates many per-arch implementations of
   mem_init() into code generic code, where that is practical.

 - The series "mm/damon/sysfs: commit parameters online via
   damon_call()" from SeongJae Park continues the cleaning up of sysfs
   access to DAMON internal data.

 - The series "mm: page_ext: Introduce new iteration API" from Luiz
   Capitulino reworks the page_ext initialization to fix a boot-time
   crash which was observed with an unusual combination of compile and
   cmdline options.

 - The series "Buddy allocator like (or non-uniform) folio split" from
   Zi Yan reworks the code to split a folio into smaller folios. The
   main benefit is lessened memory consumption: fewer post-split folios
   are generated.

 - The series "Minimize xa_node allocation during xarry split" from Zi
   Yan reduces the number of xarray xa_nodes which are generated during
   an xarray split.

 - The series "drivers/base/memory: Two cleanups" from Gavin Shan
   performs some maintenance work on the drivers/base/memory code.

 - The series "Add tracepoints for lowmem reserves, watermarks and
   totalreserve_pages" from Martin Liu adds some more tracepoints to the
   page allocator code.

 - The series "mm/madvise: cleanup requests validations and
   classifications" from SeongJae Park cleans up some warts which
   SeongJae observed during his earlier madvise work.

 - The series "mm/hwpoison: Fix regressions in memory failure handling"
   from Shuai Xue addresses two quite serious regressions which Shuai
   has observed in the memory-failure implementation.

 - The series "mm: reliable huge page allocator" from Johannes Weiner
   makes huge page allocations cheaper and more reliable by reducing
   fragmentation.

 - The series "Minor memcg cleanups & prep for memdescs" from Matthew
   Wilcox is preparatory work for the future implementation of memdescs.

 - The series "track memory used by balloon drivers" from Nico Pache
   introduces a way to track memory used by our various balloon drivers.

 - The series "mm/damon: introduce DAMOS filter type for active pages"
   from Nhat Pham permits users to filter for active/inactive pages,
   separately for file and anon pages.

 - The series "Adding Proactive Memory Reclaim Statistics" from Hao Jia
   separates the proactive reclaim statistics from the direct reclaim
   statistics.

 - The series "mm/vmscan: don't try to reclaim hwpoison folio" from
   Jinjiang Tu fixes our handling of hwpoisoned pages within the reclaim
   code.

* tag 'mm-stable-2025-03-30-16-52' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (431 commits)
  mm/page_alloc: remove unnecessary __maybe_unused in order_to_pindex()
  x86/mm: restore early initialization of high_memory for 32-bits
  mm/vmscan: don't try to reclaim hwpoison folio
  mm/hwpoison: introduce folio_contain_hwpoisoned_page() helper
  cgroup: docs: add pswpin and pswpout items in cgroup v2 doc
  mm: vmscan: split proactive reclaim statistics from direct reclaim statistics
  selftests/mm: speed up split_huge_page_test
  selftests/mm: uffd-unit-tests support for hugepages > 2M
  docs/mm/damon/design: document active DAMOS filter type
  mm/damon: implement a new DAMOS filter type for active pages
  fs/dax: don't disassociate zero page entries
  MM documentation: add "Unaccepted" meminfo entry
  selftests/mm: add commentary about 9pfs bugs
  fork: use __vmalloc_node() for stack allocation
  docs/mm: Physical Memory: Populate the "Zones" section
  xen: balloon: update the NR_BALLOON_PAGES state
  hv_balloon: update the NR_BALLOON_PAGES state
  balloon_compaction: update the NR_BALLOON_PAGES state
  meminfo: add a per node counter for balloon drivers
  mm: remove references to folio in __memcg_kmem_uncharge_page()
  ...
2025-04-01 09:29:18 -07:00
Eric Dumazet
f278b6d5bb Revert "tcp: avoid atomic operations on sk->sk_rmem_alloc"
This reverts commit 0de2a5c4b8.

I forgot that a TCP socket could receive messages in its error queue.

sock_queue_err_skb() can be called without socket lock being held,
and changes sk->sk_rmem_alloc.

The fact that skbs in error queue are limited by sk->sk_rcvbuf
means that error messages can be dropped if socket receive
queues are full, which is an orthogonal issue.

In future kernels, we could use a separate sk->sk_error_mem_alloc
counter specifically for the error queue.

Fixes: 0de2a5c4b8 ("tcp: avoid atomic operations on sk->sk_rmem_alloc")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250331075946.31960-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-31 16:53:54 -07:00
Linus Torvalds
092e335082 RDMA v6.15 merge window pull request
- Usual minor updates and fixes for bnxt_re, hfi1, rxe, mana, iser, mlx5,
   vmw_pvrdma, hns
 
 - Make rxe work on tun devices
 
 - mana gains more standard verbs as it moves toward supporting in-kernel
   verbs
 
 - DMABUF support for mana
 
 - Fix page size calculations when memory registration exceeds 4G
 
 - On Demand Paging support for rxe
 
 - mlx5 support for RDMA TRANSPORT flow tables and a new ucap mechanism to
   access control use of them
 
 - Optional RDMA_TX/RX counters per QP in mlx5
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRRRCHOFoQz/8F5bUaFwuHvBreFYQUCZ+ap4gAKCRCFwuHvBreF
 YaFHAP9wyeZCZIbnWaGcbNdbsmkEgy7aTVILRHf1NA7VSJ211gD9Ha60E+mkwtvA
 i7IJ49R2BdqzKaO9oTutj2Lw+8rABwQ=
 =qXhh
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma

Pull rdma updates from Jason Gunthorpe:

 - Usual minor updates and fixes for bnxt_re, hfi1, rxe, mana, iser,
   mlx5, vmw_pvrdma, hns

 - Make rxe work on tun devices

 - mana gains more standard verbs as it moves toward supporting
   in-kernel verbs

 - DMABUF support for mana

 - Fix page size calculations when memory registration exceeds 4G

 - On Demand Paging support for rxe

 - mlx5 support for RDMA TRANSPORT flow tables and a new ucap mechanism
   to access control use of them

 - Optional RDMA_TX/RX counters per QP in mlx5

* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (73 commits)
  IB/mad: Check available slots before posting receive WRs
  RDMA/mana_ib: Fix integer overflow during queue creation
  RDMA/mlx5: Fix calculation of total invalidated pages
  RDMA/mlx5: Fix mlx5_poll_one() cur_qp update flow
  RDMA/mlx5: Fix page_size variable overflow
  RDMA/mlx5: Drop access_flags from _mlx5_mr_cache_alloc()
  RDMA/mlx5: Fix cache entry update on dereg error
  RDMA/mlx5: Fix MR cache initialization error flow
  RDMA/mlx5: Support optional-counters binding for QPs
  RDMA/mlx5: Compile fs.c regardless of INFINIBAND_USER_ACCESS config
  RDMA/core: Pass port to counter bind/unbind operations
  RDMA/core: Add support to optional-counters binding configuration
  RDMA/core: Create and destroy rdma_counter using rdma_zalloc_drv_obj()
  RDMA/mlx5: Add optional counters for RDMA_TX/RX_packets/bytes
  RDMA/core: Fix use-after-free when rename device name
  RDMA/bnxt_re: Support perf management counters
  RDMA/rxe: Fix incorrect return value of rxe_odp_atomic_op()
  RDMA/uverbs: Propagate errors from rdma_lookup_get_uobject()
  RDMA/mana_ib: Handle net event for pointing to the current netdev
  net: mana: Change the function signature of mana_get_primary_netdev_rcu
  ...
2025-03-29 11:12:28 -07:00
Linus Torvalds
e5e0e6bebe This update includes the following changes:
API:
 
 - Remove legacy compression interface.
 - Improve scatterwalk API.
 - Add request chaining to ahash and acomp.
 - Add virtual address support to ahash and acomp.
 - Add folio support to acomp.
 - Remove NULL dst support from acomp.
 
 Algorithms:
 
 - Library options are fuly hidden (selected by kernel users only).
 - Add Kerberos5 algorithms.
 - Add VAES-based ctr(aes) on x86.
 - Ensure LZO respects output buffer length on compression.
 - Remove obsolete SIMD fallback code path from arm/ghash-ce.
 
 Drivers:
 
 - Add support for PCI device 0x1134 in ccp.
 - Add support for rk3588's standalone TRNG in rockchip.
 - Add Inside Secure SafeXcel EIP-93 crypto engine support in eip93.
 - Fix bugs in tegra uncovered by multi-threaded self-test.
 - Fix corner cases in hisilicon/sec2.
 
 Others:
 
 - Add SG_MITER_LOCAL to sg miter.
 - Convert ubifs, hibernate and xfrm_ipcomp from legacy API to acomp.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEn51F/lCuNhUwmDeSxycdCkmxi6cFAmfiQ9kACgkQxycdCkmx
 i6fFZg/9GWjC1FLEV66vNlYAIzFGwzwWdFGyQzXyP235Cphhm4qt9gx7P91N6Lvc
 pplVjNEeZHoP8lMw+AIeGc2cRhIwsvn8C+HA3tCBOoC1qSe8T9t7KHAgiRGd/0iz
 UrzVBFLYlR9i4tc0T5peyQwSctv8DfjWzduTmI3Ts8i7OQcfeVVgj3sGfWam7kjF
 1GJWIQH7aPzT8cwFtk8gAK1insuPPZelT1Ppl9kUeZe0XUibrP7Gb5G9simxXAyi
 B+nLCaJYS6Hc1f47cfR/qyZSeYQN35KTVrEoKb1pTYXfEtMv6W9fIvQVLJRYsqpH
 RUBdDJUseE+WckR6glX9USrh+Fv9d+HfsTXh1fhpApKU5sQJ7pDbUm4ge8p6htNG
 MIszbJPdqajYveRLuPUjFlUXaqomos8eT6BZA+RLHm1cogzEOm+5bjspbfRNAVPj
 x9KiDu5lXNiFj02v/MkLKUe3bnGIyVQnZNi7Rn0Rpxjv95tIjVpksZWMPJarxUC6
 5zdyM2I5X0Z9+teBpbfWyqfzSbAs/KpzV8S/xNvWDUT6NlpYGBeNXrCDTXcwJLAh
 PRW0w1EJUwsZbPi8GEh5jNzo/YK1cGsUKrihKv7YgqSSopMLI8e/WVr8nKZMVDFA
 O+6F6ec5lR7KsOIMGUqrBGFU1ccAeaLLvLK3H5J8//gMMg82Uik=
 =aQNt
 -----END PGP SIGNATURE-----

Merge tag 'v6.15-p1' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6

Pull crypto updates from Herbert Xu:
 "API:
   - Remove legacy compression interface
   - Improve scatterwalk API
   - Add request chaining to ahash and acomp
   - Add virtual address support to ahash and acomp
   - Add folio support to acomp
   - Remove NULL dst support from acomp

  Algorithms:
   - Library options are fuly hidden (selected by kernel users only)
   - Add Kerberos5 algorithms
   - Add VAES-based ctr(aes) on x86
   - Ensure LZO respects output buffer length on compression
   - Remove obsolete SIMD fallback code path from arm/ghash-ce

  Drivers:
   - Add support for PCI device 0x1134 in ccp
   - Add support for rk3588's standalone TRNG in rockchip
   - Add Inside Secure SafeXcel EIP-93 crypto engine support in eip93
   - Fix bugs in tegra uncovered by multi-threaded self-test
   - Fix corner cases in hisilicon/sec2

  Others:
   - Add SG_MITER_LOCAL to sg miter
   - Convert ubifs, hibernate and xfrm_ipcomp from legacy API to acomp"

* tag 'v6.15-p1' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: (187 commits)
  crypto: testmgr - Add multibuffer acomp testing
  crypto: acomp - Fix synchronous acomp chaining fallback
  crypto: testmgr - Add multibuffer hash testing
  crypto: hash - Fix synchronous ahash chaining fallback
  crypto: arm/ghash-ce - Remove SIMD fallback code path
  crypto: essiv - Replace memcpy() + NUL-termination with strscpy()
  crypto: api - Call crypto_alg_put in crypto_unregister_alg
  crypto: scompress - Fix incorrect stream freeing
  crypto: lib/chacha - remove unused arch-specific init support
  crypto: remove obsolete 'comp' compression API
  crypto: compress_null - drop obsolete 'comp' implementation
  crypto: cavium/zip - drop obsolete 'comp' implementation
  crypto: zstd - drop obsolete 'comp' implementation
  crypto: lzo - drop obsolete 'comp' implementation
  crypto: lzo-rle - drop obsolete 'comp' implementation
  crypto: lz4hc - drop obsolete 'comp' implementation
  crypto: lz4 - drop obsolete 'comp' implementation
  crypto: deflate - drop obsolete 'comp' implementation
  crypto: 842 - drop obsolete 'comp' implementation
  crypto: nx - Migrate to scomp API
  ...
2025-03-29 10:01:55 -07:00
Linus Torvalds
1a9239bb42 Networking changes for 6.15.
Core & protocols
 ----------------
 
  - Continue Netlink conversions to per-namespace RTNL lock
    (IPv4 routing, routing rules, routing next hops, ARP ioctls).
 
  - Continue extending the use of netdev instance locks. As a driver
    opt-in protect queue operations and (in due course) ethtool
    operations with the instance lock and not RTNL lock.
 
  - Support collecting TCP timestamps (data submitted, sent, acked)
    in BPF, allowing for transparent (to the application) and lower
    overhead tracking of TCP RPC performance.
 
  - Tweak existing networking Rx zero-copy infra to support zero-copy
    Rx via io_uring.
 
  - Optimize MPTCP performance in single subflow mode by 29%.
 
  - Enable GRO on packets which went thru XDP CPU redirect (were queued
    for processing on a different CPU). Improving TCP stream performance
    up to 2x.
 
  - Improve performance of contended connect() by 200% by searching
    for an available 4-tuple under RCU rather than a spin lock.
    Bring an additional 229% improvement by tweaking hash distribution.
 
  - Avoid unconditionally touching sk_tsflags on RX, improving
    performance under UDP flood by as much as 10%.
 
  - Avoid skb_clone() dance in ping_rcv() to improve performance under
    ping flood.
 
  - Avoid FIB lookup in netfilter if socket is available, 20% perf win.
 
  - Rework network device creation (in-kernel) API to more clearly
    identify network namespaces and their roles.
    There are up to 4 namespace roles but we used to have just 2 netns
    pointer arguments, interpreted differently based on context.
 
  - Use sysfs_break_active_protection() instead of trylock to avoid
    deadlocks between unregistering objects and sysfs access.
 
  - Add a new sysctl and sockopt for capping max retransmit timeout
    in TCP.
 
  - Support masking port and DSCP in routing rule matches.
 
  - Support dumping IPv4 multicast addresses with RTM_GETMULTICAST.
 
  - Support specifying at what time packet should be sent on AF_XDP
    sockets.
 
  - Expose TCP ULP diagnostic info (for TLS and MPTCP) to non-admin users.
 
  - Add Netlink YAML spec for WiFi (nl80211) and conntrack.
 
  - Introduce EXPORT_IPV6_MOD() and EXPORT_IPV6_MOD_GPL() for symbols
    which only need to be exported when IPv6 support is built as a module.
 
  - Age FDB entries based on Rx not Tx traffic in VxLAN, similar
    to normal bridging.
 
  - Allow users to specify source port range for GENEVE tunnels.
 
  - netconsole: allow attaching kernel release, CPU ID and task name
    to messages as metadata
 
 Driver API
 ----------
 
  - Continue rework / fixing of Energy Efficient Ethernet (EEE) across
    the SW layers. Delegate the responsibilities to phylink where possible.
    Improve its handling in phylib.
 
  - Support symmetric OR-XOR RSS hashing algorithm.
 
  - Support tracking and preserving IRQ affinity by NAPI itself.
 
  - Support loopback mode speed selection for interface selftests.
 
 Device drivers
 --------------
 
  - Remove the IBM LCS driver for s390.
 
  - Remove the sb1000 cable modem driver.
 
  - Add support for SFP module access over SMBus.
 
  - Add MCTP transport driver for MCTP-over-USB.
 
  - Enable XDP metadata support in multiple drivers.
 
  - Ethernet high-speed NICs:
    - Broadcom (bnxt):
      - add PCIe TLP Processing Hints (TPH) support for new AMD platforms
      - support dumping RoCE queue state for debug
      - opt into instance locking
    - Intel (100G, ice, idpf):
      - ice: rework MSI-X IRQ management and distribution
      - ice: support for E830 devices
      - iavf: add support for Rx timestamping
      - iavf: opt into instance locking
    - nVidia/Mellanox:
      - mlx4: use page pool memory allocator for Rx
      - mlx5: support for one PTP device per hardware clock
      - mlx5: support for 200Gbps per-lane link modes
      - mlx5: move IPSec policy check after decryption
    - AMD/Solarflare:
      - support FW flashing via devlink
    - Cisco (enic):
      - use page pool memory allocator for Rx
      - enable 32, 64 byte CQEs
      - get max rx/tx ring size from the device
    - Meta (fbnic):
      - support flow steering and RSS configuration
      - report queue stats
      - support TCP segmentation
      - support IRQ coalescing
      - support ring size configuration
    - Marvell/Cavium:
      - support AF_XDP
    - Wangxun:
      - support for PTP clock and timestamping
    - Huawei (hibmcge):
      - checksum offload
      - add more statistics
 
  - Ethernet virtual:
    - VirtIO net:
      - aggressively suppress Tx completions, improve perf by 96% with
        1 CPU and 55% with 2 CPUs
      - expose NAPI to IRQ mapping and persist NAPI settings
    - Google (gve):
      - support XDP in DQO RDA Queue Format
      - opt into instance locking
    - Microsoft vNIC:
      - support BIG TCP
 
  - Ethernet NICs consumer, and embedded:
    - Synopsys (stmmac):
      - cleanup Tx and Tx clock setting and other link-focused cleanups
      - enable SGMII and 2500BASEX mode switching for Intel platforms
      - support Sophgo SG2044
    - Broadcom switches (b53):
      - support for BCM53101
    - TI:
      - iep: add perout configuration support
      - icssg: support XDP
    - Cadence (macb):
      - implement BQL
    - Xilinx (axinet):
      - support dynamic IRQ moderation and changing coalescing at runtime
      - implement BQL
      - report standard stats
    - MediaTek:
      - support phylink managed EEE
    - Intel:
      - igc: don't restart the interface on every XDP program change
    - RealTek (r8169):
      - support reading registers of internal PHYs directly
      - increase max jumbo packet size on RTL8125/RTL8126
    - Airoha:
      - support for RISC-V NPU packet processing unit
      - enable scatter-gather and support MTU up to 9kB
    - Tehuti (tn40xx):
      - support cards with TN4010 MAC and an Aquantia AQR105 PHY
 
  - Ethernet PHYs:
    - support for TJA1102S, TJA1121
    - dp83tg720: add randomized polling intervals for link detection
    - dp83822: support changing the transmit amplitude voltage
    - support for LEDs on 88q2xxx
 
  - CAN:
    - canxl: support Remote Request Substitution bit access
    - flexcan: add S32G2/S32G3 SoC
 
  - WiFi:
    - remove cooked monitor support
    - strict mode for better AP testing
    - basic EPCS support
    - OMI RX bandwidth reduction support
    - batman-adv: add support for jumbo frames
 
  - WiFi drivers:
    - RealTek (rtw88):
      - support RTL8814AE and RTL8814AU
    - RealTek (rtw89):
      - switch using wiphy_lock and wiphy_work
      - add BB context to manipulate two PHY as preparation of MLO
      - improve BT-coexistence mechanism to play A2DP smoothly
    - Intel (iwlwifi):
      - add new iwlmld sub-driver for latest HW/FW combinations
    - MediaTek (mt76):
      - preparation for mt7996 Multi-Link Operation (MLO) support
    - Qualcomm/Atheros (ath12k):
      - continued work on MLO
    - Silabs (wfx):
      - Wake-on-WLAN support
 
  - Bluetooth:
    - add support for skb TX SND/COMPLETION timestamping
    - hci_core: enable buffer flow control for SCO/eSCO
    - coredump: log devcd dumps into the monitor
 
  - Bluetooth drivers:
    - intel: add support to configure TX power
    - nxp: handle bootloader error during cmd5 and cmd7
 
 Signed-off-by: Jakub Kicinski <kuba@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmfkLC8ACgkQMUZtbf5S
 Irsb5g/+L7oKOf0ALbaV9kxFsoz8AymZfAW9i/27F07omGJGpks8oX6j6rQLgIRO
 OQOFcp7XEdDh1+jh82gHVuPrw2/6lchLtW8ARtzdiQKFr5DRjrsbtua6GRc8iBqA
 DIRCBFoV2HuMkF39Vr09HMa9AZAT7QR2RLsRGpSq8E8Z8xxKz0X7oujs10PFpMTE
 IVKhTrVrk+NDot/IU2hzVpnpup+0ld+T2/ZaBklJGcU8uDffImsqNepHRyCG5UC3
 xz74Ju23MAj24Gct+og0yFUooF+lUltKyVm0FYCDCY3bASTwgY01NR3kEH/0NQvM
 cywLzd/ngHm/SMD2ggVAHkjZUieiIVHdaZ53dgjDeBOQoVP6p0dgUK7EumXX8Mx4
 8ReR2UiGoYRPaq9c4o+IjG4K027MwVK2p+mF1a6MLa+20XcyMbev8FIRbbHtC/V4
 z5/FsOAxcuICWkA1hU9bODrrGzIqemmdRgKG8sGuTJCt/kYGAn72/TCATGNSaCJ0
 00n2jN1aepa7wtywHJ5MhVzxN9iQX7+geUHXz0BI+lK4e1Pmk+vjGksymb9ai2fk
 eQAUV9ekub6q68/J16scD7XeOUM37bTLiMBQeIF8UtZBOJscKiS71zn9QP9Twwxv
 P2pm01RDZUI+z5ZX3hc12Pm1vjRHaAh9S1JpAw/pTOVlQ+mAJEM=
 =XY0S
 -----END PGP SIGNATURE-----

Merge tag 'net-next-6.15' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next

Pull networking updates from Jakub Kicinski:
 "Core & protocols:

   - Continue Netlink conversions to per-namespace RTNL lock
     (IPv4 routing, routing rules, routing next hops, ARP ioctls)

   - Continue extending the use of netdev instance locks. As a driver
     opt-in protect queue operations and (in due course) ethtool
     operations with the instance lock and not RTNL lock.

   - Support collecting TCP timestamps (data submitted, sent, acked) in
     BPF, allowing for transparent (to the application) and lower
     overhead tracking of TCP RPC performance.

   - Tweak existing networking Rx zero-copy infra to support zero-copy
     Rx via io_uring.

   - Optimize MPTCP performance in single subflow mode by 29%.

   - Enable GRO on packets which went thru XDP CPU redirect (were queued
     for processing on a different CPU). Improving TCP stream
     performance up to 2x.

   - Improve performance of contended connect() by 200% by searching for
     an available 4-tuple under RCU rather than a spin lock. Bring an
     additional 229% improvement by tweaking hash distribution.

   - Avoid unconditionally touching sk_tsflags on RX, improving
     performance under UDP flood by as much as 10%.

   - Avoid skb_clone() dance in ping_rcv() to improve performance under
     ping flood.

   - Avoid FIB lookup in netfilter if socket is available, 20% perf win.

   - Rework network device creation (in-kernel) API to more clearly
     identify network namespaces and their roles. There are up to 4
     namespace roles but we used to have just 2 netns pointer arguments,
     interpreted differently based on context.

   - Use sysfs_break_active_protection() instead of trylock to avoid
     deadlocks between unregistering objects and sysfs access.

   - Add a new sysctl and sockopt for capping max retransmit timeout in
     TCP.

   - Support masking port and DSCP in routing rule matches.

   - Support dumping IPv4 multicast addresses with RTM_GETMULTICAST.

   - Support specifying at what time packet should be sent on AF_XDP
     sockets.

   - Expose TCP ULP diagnostic info (for TLS and MPTCP) to non-admin
     users.

   - Add Netlink YAML spec for WiFi (nl80211) and conntrack.

   - Introduce EXPORT_IPV6_MOD() and EXPORT_IPV6_MOD_GPL() for symbols
     which only need to be exported when IPv6 support is built as a
     module.

   - Age FDB entries based on Rx not Tx traffic in VxLAN, similar to
     normal bridging.

   - Allow users to specify source port range for GENEVE tunnels.

   - netconsole: allow attaching kernel release, CPU ID and task name to
     messages as metadata

  Driver API:

   - Continue rework / fixing of Energy Efficient Ethernet (EEE) across
     the SW layers. Delegate the responsibilities to phylink where
     possible. Improve its handling in phylib.

   - Support symmetric OR-XOR RSS hashing algorithm.

   - Support tracking and preserving IRQ affinity by NAPI itself.

   - Support loopback mode speed selection for interface selftests.

  Device drivers:

   - Remove the IBM LCS driver for s390

   - Remove the sb1000 cable modem driver

   - Add support for SFP module access over SMBus

   - Add MCTP transport driver for MCTP-over-USB

   - Enable XDP metadata support in multiple drivers

   - Ethernet high-speed NICs:
      - Broadcom (bnxt):
         - add PCIe TLP Processing Hints (TPH) support for new AMD
           platforms
         - support dumping RoCE queue state for debug
         - opt into instance locking
      - Intel (100G, ice, idpf):
         - ice: rework MSI-X IRQ management and distribution
         - ice: support for E830 devices
         - iavf: add support for Rx timestamping
         - iavf: opt into instance locking
      - nVidia/Mellanox:
         - mlx4: use page pool memory allocator for Rx
         - mlx5: support for one PTP device per hardware clock
         - mlx5: support for 200Gbps per-lane link modes
         - mlx5: move IPSec policy check after decryption
      - AMD/Solarflare:
         - support FW flashing via devlink
      - Cisco (enic):
         - use page pool memory allocator for Rx
         - enable 32, 64 byte CQEs
         - get max rx/tx ring size from the device
      - Meta (fbnic):
         - support flow steering and RSS configuration
         - report queue stats
         - support TCP segmentation
         - support IRQ coalescing
         - support ring size configuration
      - Marvell/Cavium:
         - support AF_XDP
      - Wangxun:
         - support for PTP clock and timestamping
      - Huawei (hibmcge):
         - checksum offload
         - add more statistics

   - Ethernet virtual:
      - VirtIO net:
         - aggressively suppress Tx completions, improve perf by 96%
           with 1 CPU and 55% with 2 CPUs
         - expose NAPI to IRQ mapping and persist NAPI settings
      - Google (gve):
         - support XDP in DQO RDA Queue Format
         - opt into instance locking
      - Microsoft vNIC:
         - support BIG TCP

   - Ethernet NICs consumer, and embedded:
      - Synopsys (stmmac):
         - cleanup Tx and Tx clock setting and other link-focused
           cleanups
         - enable SGMII and 2500BASEX mode switching for Intel platforms
         - support Sophgo SG2044
      - Broadcom switches (b53):
         - support for BCM53101
      - TI:
         - iep: add perout configuration support
         - icssg: support XDP
      - Cadence (macb):
         - implement BQL
      - Xilinx (axinet):
         - support dynamic IRQ moderation and changing coalescing at
           runtime
         - implement BQL
         - report standard stats
      - MediaTek:
         - support phylink managed EEE
      - Intel:
         - igc: don't restart the interface on every XDP program change
      - RealTek (r8169):
         - support reading registers of internal PHYs directly
         - increase max jumbo packet size on RTL8125/RTL8126
      - Airoha:
         - support for RISC-V NPU packet processing unit
         - enable scatter-gather and support MTU up to 9kB
      - Tehuti (tn40xx):
         - support cards with TN4010 MAC and an Aquantia AQR105 PHY

   - Ethernet PHYs:
      - support for TJA1102S, TJA1121
      - dp83tg720: add randomized polling intervals for link detection
      - dp83822: support changing the transmit amplitude voltage
      - support for LEDs on 88q2xxx

   - CAN:
      - canxl: support Remote Request Substitution bit access
      - flexcan: add S32G2/S32G3 SoC

   - WiFi:
      - remove cooked monitor support
      - strict mode for better AP testing
      - basic EPCS support
      - OMI RX bandwidth reduction support
      - batman-adv: add support for jumbo frames

   - WiFi drivers:
      - RealTek (rtw88):
         - support RTL8814AE and RTL8814AU
      - RealTek (rtw89):
         - switch using wiphy_lock and wiphy_work
         - add BB context to manipulate two PHY as preparation of MLO
         - improve BT-coexistence mechanism to play A2DP smoothly
      - Intel (iwlwifi):
         - add new iwlmld sub-driver for latest HW/FW combinations
      - MediaTek (mt76):
         - preparation for mt7996 Multi-Link Operation (MLO) support
      - Qualcomm/Atheros (ath12k):
         - continued work on MLO
      - Silabs (wfx):
         - Wake-on-WLAN support

   - Bluetooth:
      - add support for skb TX SND/COMPLETION timestamping
      - hci_core: enable buffer flow control for SCO/eSCO
      - coredump: log devcd dumps into the monitor

   - Bluetooth drivers:
      - intel: add support to configure TX power
      - nxp: handle bootloader error during cmd5 and cmd7"

* tag 'net-next-6.15' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1681 commits)
  unix: fix up for "apparmor: add fine grained af_unix mediation"
  mctp: Fix incorrect tx flow invalidation condition in mctp-i2c
  net: usb: asix: ax88772: Increase phy_name size
  net: phy: Introduce PHY_ID_SIZE — minimum size for PHY ID string
  net: libwx: fix Tx L4 checksum
  net: libwx: fix Tx descriptor content for some tunnel packets
  atm: Fix NULL pointer dereference
  net: tn40xx: add pci-id of the aqr105-based Tehuti TN4010 cards
  net: tn40xx: prepare tn40xx driver to find phy of the TN9510 card
  net: tn40xx: create swnode for mdio and aqr105 phy and add to mdiobus
  net: phy: aquantia: add essential functions to aqr105 driver
  net: phy: aquantia: search for firmware-name in fwnode
  net: phy: aquantia: add probe function to aqr105 for firmware loading
  net: phy: Add swnode support to mdiobus_scan
  gve: add XDP DROP and PASS support for DQ
  gve: update XDP allocation path support RX buffer posting
  gve: merge packet buffer size fields
  gve: update GQ RX to use buf_size
  gve: introduce config-based allocation for XDP
  gve: remove xdp_xsk_done and xdp_xsk_wakeup statistics
  ...
2025-03-26 21:48:21 -07:00
Jakub Kicinski
023b1e9d26 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Merge in late fixes to prepare for the 6.15 net-next PR.

No conflicts, adjacent changes:

drivers/net/ethernet/broadcom/bnxt/bnxt.c
  919f9f497d ("eth: bnxt: fix out-of-range access of vnic_info array")
  fe96d717d3 ("bnxt_en: Extend queue stop/start for TX rings")

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-26 09:32:10 -07:00
Stephen Rothwell
705094f655 unix: fix up for "apparmor: add fine grained af_unix mediation"
After merging the apparmor tree, today's linux-next build (x86_64
allmodconfig) failed like this:

security/apparmor/af_unix.c: In function 'unix_state_double_lock':
security/apparmor/af_unix.c:627:17: error: implicit declaration of function 'unix_state_lock'; did you mean 'unix_state_double_lock'? [-Wimplicit-function-declaration]
  627 |                 unix_state_lock(sk1);
      |                 ^~~~~~~~~~~~~~~
      |                 unix_state_double_lock
security/apparmor/af_unix.c: In function 'unix_state_double_unlock':
security/apparmor/af_unix.c:642:17: error: implicit declaration of function 'unix_state_unlock'; did you mean 'unix_state_double_lock'? [-Wimplicit-function-declaration]
  642 |                 unix_state_unlock(sk1);
      |                 ^~~~~~~~~~~~~~~~~
      |                 unix_state_double_lock

Caused by commit

  c05e705812 ("apparmor: add fine grained af_unix mediation")

interacting with commit

  84960bf240 ("af_unix: Move internal definitions to net/unix/.")

from the net-next tree.

Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Link: https://patch.msgid.link/20250326150148.72d9138d@canb.auug.org.au
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-26 09:31:18 -07:00
Linus Torvalds
ee6740fd34 CRC updates for 6.15
Another set of improvements to the kernel's CRC (cyclic redundancy
 check) code:
 
 - Rework the CRC64 library functions to be directly optimized, like what
   I did last cycle for the CRC32 and CRC-T10DIF library functions.
 
 - Rewrite the x86 PCLMULQDQ-optimized CRC code, and add VPCLMULQDQ
   support and acceleration for crc64_be and crc64_nvme.
 
 - Rewrite the riscv Zbc-optimized CRC code, and add acceleration for
   crc_t10dif, crc64_be, and crc64_nvme.
 
 - Remove crc_t10dif and crc64_rocksoft from the crypto API, since they
   are no longer needed there.
 
 - Rename crc64_rocksoft to crc64_nvme, as the old name was incorrect.
 
 - Add kunit test cases for crc64_nvme and crc7.
 
 - Eliminate redundant functions for calculating the Castagnoli CRC32,
   settling on just crc32c().
 
 - Remove unnecessary prompts from some of the CRC kconfig options.
 
 - Further optimize the x86 crc32c code.
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQSacvsUNc7UX4ntmEPzXCl4vpKOKwUCZ+CGGhQcZWJpZ2dlcnNA
 Z29vZ2xlLmNvbQAKCRDzXCl4vpKOK3wRAP4tbnzawUmlIHIF0hleoADXehUgAhMt
 NZn15mGvyiuwIQEA8W9qvnLdFXZkdxhxAEvDDFjyrRauL6eGtr/GvCx4AQY=
 =wmKG
 -----END PGP SIGNATURE-----

Merge tag 'crc-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiggers/linux

Pull CRC updates from Eric Biggers:
 "Another set of improvements to the kernel's CRC (cyclic redundancy
  check) code:

   - Rework the CRC64 library functions to be directly optimized, like
     what I did last cycle for the CRC32 and CRC-T10DIF library
     functions

   - Rewrite the x86 PCLMULQDQ-optimized CRC code, and add VPCLMULQDQ
     support and acceleration for crc64_be and crc64_nvme

   - Rewrite the riscv Zbc-optimized CRC code, and add acceleration for
     crc_t10dif, crc64_be, and crc64_nvme

   - Remove crc_t10dif and crc64_rocksoft from the crypto API, since
     they are no longer needed there

   - Rename crc64_rocksoft to crc64_nvme, as the old name was incorrect

   - Add kunit test cases for crc64_nvme and crc7

   - Eliminate redundant functions for calculating the Castagnoli CRC32,
     settling on just crc32c()

   - Remove unnecessary prompts from some of the CRC kconfig options

   - Further optimize the x86 crc32c code"

* tag 'crc-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiggers/linux: (36 commits)
  x86/crc: drop the avx10_256 functions and rename avx10_512 to avx512
  lib/crc: remove unnecessary prompt for CONFIG_CRC64
  lib/crc: remove unnecessary prompt for CONFIG_LIBCRC32C
  lib/crc: remove unnecessary prompt for CONFIG_CRC8
  lib/crc: remove unnecessary prompt for CONFIG_CRC7
  lib/crc: remove unnecessary prompt for CONFIG_CRC4
  lib/crc7: unexport crc7_be_syndrome_table
  lib/crc_kunit.c: update comment in crc_benchmark()
  lib/crc_kunit.c: add test and benchmark for crc7_be()
  x86/crc32: optimize tail handling for crc32c short inputs
  riscv/crc64: add Zbc optimized CRC64 functions
  riscv/crc-t10dif: add Zbc optimized CRC-T10DIF function
  riscv/crc32: reimplement the CRC32 functions using new template
  riscv/crc: add "template" for Zbc optimized CRC functions
  x86/crc: add ANNOTATE_NOENDBR to suppress objtool warnings
  x86/crc32: improve crc32c_arch() code generation with clang
  x86/crc64: implement crc64_be and crc64_nvme using new template
  x86/crc-t10dif: implement crc_t10dif using new template
  x86/crc32: implement crc32_le using new template
  x86/crc: add "template" for [V]PCLMULQDQ based CRC functions
  ...
2025-03-25 18:33:04 -07:00
Jakub Kicinski
4f74a45c6b bluetooth-next pull request for net-next:
core:
 
  - Add support for skb TX SND/COMPLETION timestamping
  - hci_core: Enable buffer flow control for SCO/eSCO
  - coredump: Log devcd dumps into the monitor
 
  drivers:
 
  - btusb: Add 2 HWIDs for MT7922
  - btusb: Fix regression in the initialization of fake Bluetooth controllers
  - btusb: Add 14 USB device IDs for Qualcomm WCN785x
  - btintel: Add support for Intel Scorpius Peak
  - btintel: Add support to configure TX power
  - btintel: Add DSBR support for ScP
  - btintel_pcie: Add device id of Whale Peak
  - btintel_pcie: Setup buffers for firmware traces
  - btintel_pcie: Read hardware exception data
  - btintel_pcie: Add support for device coredump
  - btintel_pcie: Trigger device coredump on hardware exception
  - btnxpuart: Support for controller wakeup gpio config
  - btnxpuart: Add support to set BD address
  - btnxpuart: Add correct bootloader error codes
  - btnxpuart: Handle bootloader error during cmd5 and cmd7
  - btnxpuart: Fix kernel panic during FW release
  - qca: add WCN3950 support
  - hci_qca: use the power sequencer for wcn6750
  - btmtksdio: Prevent enabling interrupts after IRQ handler removal
 -----BEGIN PGP SIGNATURE-----
 
 iQJNBAABCAA3FiEE7E6oRXp8w05ovYr/9JCA4xAyCykFAmfjA3AZHGx1aXoudm9u
 LmRlbnR6QGludGVsLmNvbQAKCRD0kIDjEDILKeWWD/0QZYBrNP9QSTyYTeNlYhCC
 Lw/n7n3+LhxOqIu+tOGS7UplTqR3p4WQGu2g+XV9wSu4dvLplZGn/40XtiXJA0+r
 VsXivQ4IR/Vjd8sNLLixfmuH4g4CbMSblQvECD3/5wFTeSH8T6/gyts/WV/LDYOJ
 jp5kG6HCFHMv7RaiaHdZ0Pe4c1xJ/Ek8TnrW4G/kPBaLzm+lhjRkfzxCx8cO0k0H
 mpoheUohLtUSgfLf49826t7rp3HDuX9db2hiGXQfKSrL2milwufKNaMFTmbuU2Uq
 IyAwR1CEdSsKlcpbnVNF05r94sjf8NjuBD3YWxB9OfVXq9aymJYZdGh54XT5nF3f
 ccD1WNnOoTsXDbEAnuVx+EYrJAI70e7vE4m8XbpxJcxhFoSIQCmtDoXUY199LiEG
 El7DlwouNFESmOIj6gSK7ogUEhmcQA0AJxBVpdxnGBAcQMn1hrGSb75VGg7rul8K
 HcBtf5j4gxZum2fzrufeY1+eBxfSRjNFIsnutAr53gEtidPZYlmiRyAuqQJMo2sO
 0hlpC6gQGQJ5HiTR5Jbo9u+OC1oeHHXNN5IpNrd6J65n0pqCRldDsoXCerEJsOou
 cqWzb9lfQCrjRnWRxUWOLukdYRgBH15wuORU+Z7/qVhqOr49RvARnzq7AjGSenTS
 YKb2N+NGlqpxG+vQscFJUA==
 =XCdn
 -----END PGP SIGNATURE-----

Merge tag 'for-net-next-2025-03-25' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next

Luiz Augusto von Dentz says:

====================
bluetooth-next pull request for net-next:

core:

 - Add support for skb TX SND/COMPLETION timestamping
 - hci_core: Enable buffer flow control for SCO/eSCO
 - coredump: Log devcd dumps into the monitor

 drivers:

 - btusb: Add 2 HWIDs for MT7922
 - btusb: Fix regression in the initialization of fake Bluetooth controllers
 - btusb: Add 14 USB device IDs for Qualcomm WCN785x
 - btintel: Add support for Intel Scorpius Peak
 - btintel: Add support to configure TX power
 - btintel: Add DSBR support for ScP
 - btintel_pcie: Add device id of Whale Peak
 - btintel_pcie: Setup buffers for firmware traces
 - btintel_pcie: Read hardware exception data
 - btintel_pcie: Add support for device coredump
 - btintel_pcie: Trigger device coredump on hardware exception
 - btnxpuart: Support for controller wakeup gpio config
 - btnxpuart: Add support to set BD address
 - btnxpuart: Add correct bootloader error codes
 - btnxpuart: Handle bootloader error during cmd5 and cmd7
 - btnxpuart: Fix kernel panic during FW release
 - qca: add WCN3950 support
 - hci_qca: use the power sequencer for wcn6750
 - btmtksdio: Prevent enabling interrupts after IRQ handler removal

* tag 'for-net-next-2025-03-25' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next: (53 commits)
  Bluetooth: MGMT: Add LL Privacy Setting
  Bluetooth: hci_event: Fix handling of HCI_EV_LE_DIRECT_ADV_REPORT
  Bluetooth: btnxpuart: Fix kernel panic during FW release
  Bluetooth: btnxpuart: Handle bootloader error during cmd5 and cmd7
  Bluetooth: btnxpuart: Add correct bootloader error codes
  t blameBluetooth: btintel: Fix leading white space
  Bluetooth: btintel: Add support to configure TX power
  Bluetooth: btmtksdio: Prevent enabling interrupts after IRQ handler removal
  Bluetooth: btmtk: Remove the resetting step before downloading the fw
  Bluetooth: SCO: add TX timestamping
  Bluetooth: L2CAP: add TX timestamping
  Bluetooth: ISO: add TX timestamping
  Bluetooth: add support for skb TX SND/COMPLETION timestamping
  net-timestamp: COMPLETION timestamp on packet tx completion
  HCI: coredump: Log devcd dumps into the monitor
  Bluetooth: HCI: Add definition of hci_rp_remote_name_req_cancel
  Bluetooth: hci_vhci: Mark Sync Flow Control as supported
  Bluetooth: hci_core: Enable buffer flow control for SCO/eSCO
  Bluetooth: btintel_pci: Fix build warning
  Bluetooth: btintel_pcie: Trigger device coredump on hardware exception
  ...
====================

Link: https://patch.msgid.link/20250325192925.2497890-1-luiz.dentz@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-25 14:00:48 -07:00
Luiz Augusto von Dentz
eed14eb510 Bluetooth: MGMT: Add LL Privacy Setting
This adds LL Privacy (bit 22) to Read Controller Information so the likes
of bluetoothd(1) can detect when the controller supports it or not.

Fixes: e209e5ccc5 ("Bluetooth: MGMT: Mark LL Privacy as stable")
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2025-03-25 15:22:49 -04:00
Eric Dumazet
f1e30061e8 tcp/dccp: remove icsk->icsk_ack.timeout
icsk->icsk_ack.timeout can be replaced by icsk->csk_delack_timer.expires

This saves 8 bytes in TCP/DCCP sockets and helps for better cache locality.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250324203607.703850-3-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-25 10:34:33 -07:00
Eric Dumazet
a7c428ee8f tcp/dccp: remove icsk->icsk_timeout
icsk->icsk_timeout can be replaced by icsk->icsk_retransmit_timer.expires

This saves 8 bytes in TCP/DCCP sockets and helps for better cache locality.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250324203607.703850-2-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-25 10:34:33 -07:00
Jakub Kicinski
310ae9eb26 net: designate queue -> napi linking as "ops protected"
netdev netlink is the only reader of netdev_{,rx_}queue->napi,
and it already holds netdev->lock. Switch protection of
the writes to netdev->lock to "ops protected".

The expectation will be now that accessing queue->napi
will require netdev->lock for "ops locked" drivers, and
rtnl_lock for all other drivers.

Current "ops locked" drivers don't require any changes.
gve and netdevsim use _locked() helpers right next to
netif_queue_set_napi() so they must be holding the instance
lock. iavf doesn't call it. bnxt is a bit messy but all paths
seem locked.

Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20250324224537.248800-7-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-25 10:06:49 -07:00
Jakub Kicinski
4b702f8b72 net: explain "protection types" for the instance lock
Try to define some terminology for which fields are protected
by which lock and how. Some fields are protected by both rtnl_lock
and instance lock which is hard to talk about without having
a "key phrase" to refer to a particular protection scheme.

"ops protected" fields are defined later in the series, one by one.

Add ASSERT_RTNL() to netdev_ops_assert_locked() for drivers
not other instance protection of ops. Hopefully it's not too
confusion that netdev_lock_ops() does not match the lock which
netdev_ops_assert_locked() will assert, exactly. The noun "ops"
is in a different place in the name, so I think it's acceptable...

Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20250324224537.248800-5-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-25 10:06:44 -07:00
Jakub Kicinski
e2f81e8f4d net: constify dev pointer in misc instance lock helpers
lockdep asserts and predicates can operate on const pointers.
In the future this will let us add asserts in functions
which operate on const pointers like dev_get_min_mp_channel_count().

Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20250324224537.248800-4-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-25 10:04:49 -07:00
Pauli Virtanen
11770f41b8 Bluetooth: L2CAP: add TX timestamping
Support TX timestamping in L2CAP sockets.

Support MSG_ERRQUEUE recvmsg.

For other than SOCK_STREAM L2CAP sockets, if a packet from sendmsg() is
fragmented, only the first ACL fragment is timestamped.

For SOCK_STREAM L2CAP sockets, use the bytestream convention and
timestamp the last fragment and count bytes in tskey.

Timestamps are not generated in the Enhanced Retransmission mode, as
meaning of COMPLETION stamp is unclear if L2CAP layer retransmits.

Signed-off-by: Pauli Virtanen <pav@iki.fi>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2025-03-25 12:50:35 -04:00
Pauli Virtanen
d415ba2882 Bluetooth: ISO: add TX timestamping
Add BT_SCM_ERROR socket CMSG type.

Support TX timestamping in ISO sockets.

Support MSG_ERRQUEUE in ISO recvmsg.

If a packet from sendmsg() is fragmented, only the first ACL fragment is
timestamped.

Signed-off-by: Pauli Virtanen <pav@iki.fi>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2025-03-25 12:50:07 -04:00
Pauli Virtanen
134f4b39df Bluetooth: add support for skb TX SND/COMPLETION timestamping
Support enabling TX timestamping for some skbs, and track them until
packet completion. Generate software SCM_TSTAMP_COMPLETION when getting
completion report from hardware.

Generate software SCM_TSTAMP_SND before sending to driver. Sending from
driver requires changes in the driver API, and drivers mostly are going
to send the skb immediately.

Make the default situation with no COMPLETION TX timestamping more
efficient by only counting packets in the queue when there is nothing to
track.  When there is something to track, we need to make clones, since
the driver may modify sent skbs.

The tx_q queue length is bounded by the hdev flow control, which will
not send new packets before it has got completion reports for old ones.

Signed-off-by: Pauli Virtanen <pav@iki.fi>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2025-03-25 12:49:38 -04:00
Wentao Guan
e8c00f5433 Bluetooth: HCI: Add definition of hci_rp_remote_name_req_cancel
Return Parameters is not only status, also bdaddr:

BLUETOOTH CORE SPECIFICATION Version 5.4 | Vol 4, Part E
page 1870:
BLUETOOTH CORE SPECIFICATION Version 5.0 | Vol 2, Part E
page 802:

Return parameters:
  Status:
  Size: 1 octet
  BD_ADDR:
  Size: 6 octets

Note that it also fixes the warning:
"Bluetooth: hci0: unexpected cc 0x041a length: 7 > 1"

Fixes: c8992cffbe ("Bluetooth: hci_event: Use of a function table to handle Command Complete")
Signed-off-by: Wentao Guan <guanwentao@uniontech.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2025-03-25 12:47:23 -04:00
Luiz Augusto von Dentz
1321845352 Bluetooth: hci_core: Enable buffer flow control for SCO/eSCO
This enables buffer flow control for SCO/eSCO
(see: Bluetooth Core 6.0 spec: 6.22. Synchronous Flow Control Enable),
recently this has caused the following problem and is actually a nice
addition for the likes of Socket TX complete:

< HCI Command: Read Buffer Size (0x04|0x0005) plen 0
> HCI Event: Command Complete (0x0e) plen 11
      Read Buffer Size (0x04|0x0005) ncmd 1
        Status: Success (0x00)
        ACL MTU: 1021 ACL max packet: 5
        SCO MTU: 240  SCO max packet: 8
...
< SCO Data TX: Handle 257 flags 0x00 dlen 120
< SCO Data TX: Handle 257 flags 0x00 dlen 120
< SCO Data TX: Handle 257 flags 0x00 dlen 120
< SCO Data TX: Handle 257 flags 0x00 dlen 120
< SCO Data TX: Handle 257 flags 0x00 dlen 120
< SCO Data TX: Handle 257 flags 0x00 dlen 120
< SCO Data TX: Handle 257 flags 0x00 dlen 120
< SCO Data TX: Handle 257 flags 0x00 dlen 120
< SCO Data TX: Handle 257 flags 0x00 dlen 120
> HCI Event: Hardware Error (0x10) plen 1
        Code: 0x0a

To fix the code will now attempt to enable buffer flow control when
HCI_QUIRK_SYNC_FLOWCTL_SUPPORTED is set by the driver:

< HCI Command: Write Sync Fl.. (0x03|0x002f) plen 1
        Flow control: Enabled (0x01)
> HCI Event: Command Complete (0x0e) plen 4
      Write Sync Flow Control Enable (0x03|0x002f) ncmd 1
        Status: Success (0x00)

On success then HCI_SCO_FLOWCTL would be set which indicates sco_cnt
shall be used for flow contro.

Fixes: 7fedd3bb6b ("Bluetooth: Prioritize SCO traffic")
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Tested-by: Pauli Virtanen <pav@iki.fi>
2025-03-25 12:46:40 -04:00
Pedro Nishiyama
127881334e Bluetooth: Add quirk for broken READ_PAGE_SCAN_TYPE
Some fake controllers cannot be initialized because they return a smaller
report than expected for READ_PAGE_SCAN_TYPE.

Signed-off-by: Pedro Nishiyama <nishiyama.pedro@gmail.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2025-03-25 12:43:51 -04:00
Pedro Nishiyama
ff26b2dd65 Bluetooth: Add quirk for broken READ_VOICE_SETTING
Some fake controllers cannot be initialized because they return a smaller
report than expected for READ_VOICE_SETTING.

Signed-off-by: Pedro Nishiyama <nishiyama.pedro@gmail.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2025-03-25 12:43:25 -04:00
Easwar Hariharan
c9d84da18d Bluetooth: L2CAP: convert timeouts to secs_to_jiffies()
Commit b35108a51c ("jiffies: Define secs_to_jiffies()") introduced
secs_to_jiffies().  As the value here is a multiple of 1000, use
secs_to_jiffies() instead of msecs_to_jiffies() for readability.

Signed-off-by: Easwar Hariharan <eahariha@linux.microsoft.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2025-03-25 12:40:46 -04:00
Dr. David Alan Gilbert
60bfe8a7dc Bluetooth: MGMT: Remove unused mgmt_*_discovery_complete
mgmt_start_discovery_complete() and mgmt_stop_discovery_complete() last
uses were removed in 2022 by
commit ec2904c259 ("Bluetooth: Remove dead code from hci_request.c")

Remove them.

Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2025-03-25 12:31:02 -04:00
Jakub Kicinski
1f6154227b Revert "udp_tunnel: GRO optimizations"
Revert "udp_tunnel: use static call for GRO hooks when possible"
This reverts commit 311b36574c.

Revert "udp_tunnel: create a fastpath GRO lookup."
This reverts commit 8d4880db37.

There are multiple small issues with the series. In the interest
of unblocking the merge window let's opt for a revert.

Link: https://lore.kernel.org/cover.1742557254.git.pabeni@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-25 09:15:07 -07:00
Jakub Kicinski
586b7b3ebb ipsec-next-2025-03-24
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEH7ZpcWbFyOOp6OJbrB3Eaf9PW7cFAmfg9oUACgkQrB3Eaf9P
 W7d8eRAAhUdQztUoVjNfRBScD34EHEBq8ruFMgbHcXkkdhg223CUaTYKqz3dYE1E
 04OCSwe7yxFFSs7CKR4dRMSzcx7uH/ISPprk45g+wwoIzBCYOWjLBzS5qmamtSYu
 E3sI/QZaVUU7mFDg5n3sr5nkCNz+LtTL2dUbOgi5AOqY9h7LZRzLBEB1RCVOioVO
 14vDRBJ5RGIvmNx3pd0sRhB4BySUw1GQVTYLSFaL/V2JZcRu3yGhKT8q637fdKcS
 blvdf5q4CYyu5i3p9LHhsg33D8fYUotBVFtzw3QaM1G9MdyhMF4WUbeZCa2kkk7J
 N7j2N0WDjP7crKQkPbm9ATQiruF2zvPcYJqgIHKuNhWY35KNIFARWxbP8KGJJFUI
 EqUUKuT0lUi39hULmm5YEqCGWLNQldTBOOZL1fccRj2DZelx4uhQsnt5B2yimUCD
 7QJGH0Vx6NJAutuCgTXQtmD0qO2XYi50XLmh4K+HZ1M3DkSGohcYSz+aj8IIY/wT
 +yT55uRoONqDbh9kOLHab0WWE6R+XJyY5rKnz6XpzyTPh0OwoxrbVXJF045NYIpB
 tSl9ykVpymvWNT56nsfXEOm5THZhdvL+uDS/QV4rtF9En4TEJU0dGbhESvriuu2j
 g6EAQpYJqL9OwA5aKVZCGlHf7siji7Z/uc57DADZ8mtmDXVMqco=
 =9RMg
 -----END PGP SIGNATURE-----

Merge tag 'ipsec-next-2025-03-24' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next

Steffen Klassert says:

====================
pull request (net-next): ipsec-next 2025-03-24

1) Prevent setting high order sequence number bits input in
   non-ESN mode. From Leon Romanovsky.

2) Support PMTU handling in tunnel mode for packet offload.
   From Leon Romanovsky.

3) Make xfrm_state_lookup_byaddr lockless.
   From Florian Westphal.

4) Remove unnecessary NULL check in xfrm_lookup_with_ifid().
   From Dan Carpenter.

* tag 'ipsec-next-2025-03-24' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next:
  xfrm: Remove unnecessary NULL check in xfrm_lookup_with_ifid()
  xfrm: state: make xfrm_state_lookup_byaddr lockless
  xfrm: check for PMTU in tunnel mode for packet offload
  xfrm: provide common xdo_dev_offload_ok callback implementation
  xfrm: rely on XFRM offload
  xfrm: simplify SA initialization routine
  xfrm: delay initialization of offload path till its actually requested
  xfrm: prevent high SEQ input in non-ESN mode
====================

Link: https://patch.msgid.link/20250324061855.4116819-1-steffen.klassert@secunet.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-25 08:50:10 -07:00
Jakub Kicinski
00a25cca0d netfilter pull request 25-03-23
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEjF9xRqF1emXiQiqU1w0aZmrPKyEFAmff3F0ACgkQ1w0aZmrP
 KyGZDQ/7BzMeVJNjr8gRAfYtjqt+NWGr1vf6Tz8GDsGEHfgqeFrX/GyRMZi90kOV
 YMB8K+HiIou5yJtY2ZWgSkGsId8aK7fHMmN5KnP4l0XL6bcwi7yP/sck2m+KdN9k
 cApi/iVMVtJ4/+4MPrD6rgPcsDonj+wHwMQ3WItGNgenYDTOnEmqeEL7AK6HGTAg
 kTUmjVnyws+9UllNRzgJ/67OVewzPWy8imixFl1H+ZEfM0rTuNtr0zzl6rttXIU2
 w6FK6Kw3WBZYYfLelLLmtZ2UoxqVD90Y6DOPip1mMjj95jrJPSedsZfUsZivDTNn
 JOIn/zLtwGjJ2hO/2rFxEEoeiqG79Fskg7fGzQ5mxVtJ1/otDc53WMHjNtQQpYNz
 3xpPrwVOdCNQvorDLoDL2cInoc91ZADyJGFmLAou5NQdMbAWKsGKXEQolEiG0JEh
 hmWlrzkY5cns/dSGeZDAZvyhpVSF8dnClUP2BsPU3vVYN2MbCEBH10dwOkWcUhiq
 kj+1sNPnxkDiy054e708N3w0OKToHwtgJkfpEENxtI7dtCj/6sz9JHaN77RPiuzf
 aCIyjrhlUslkB6q5bLznyGoiQTaqzjOVWIGPcMKNT7XbElmhxIUMh3U05SktdlXz
 F9m1jIvThxPKj492i8ZEDjZQ9iBCEYm5KmnRD89aW+zf4UPbSYg=
 =b5n2
 -----END PGP SIGNATURE-----

Merge tag 'nf-next-25-03-23' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next

Pablo Neira Ayuso says:

====================
Netfilter updates for net-next

The following batch contains Netfilter updates for net-next:

1) Use kvmalloc in xt_hashlimit, from Denis Kirjanov.

2) Tighten nf_conntrack sysctl accepted values for nf_conntrack_max
   and nf_ct_expect_max, from Nicolas Bouchinet.

3) Avoid lookup in nft_fib if socket is available, from Florian Westphal.

4) Initialize struct lsm_context in nfnetlink_queue to avoid
   hypothetical ENOMEM errors, Chenyuan Yang.

5) Use strscpy() instead of _pad when initializing xtables table name,
   kzalloc is already used to initialized the table memory area.
   From Thorsten Blum.

6) Missing socket lookup by conntrack information for IPv6 traffic
   in nft_socket, there is a similar chunk in IPv4, this was never
   added when IPv6 NAT was introduced. From Maxim Mikityanskiy.

7) Fix clang issues with nf_tables CONFIG_MITIGATION_RETPOLINE,
   from WangYuli.

* tag 'nf-next-25-03-23' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next:
  netfilter: nf_tables: Only use nf_skip_indirect_calls() when MITIGATION_RETPOLINE
  netfilter: socket: Lookup orig tuple for IPv6 SNAT
  netfilter: xtables: Use strscpy() instead of strscpy_pad()
  netfilter: nfnetlink_queue: Initialize ctx to avoid memory allocation error
  netfilter: fib: avoid lookup if socket is available
  netfilter: conntrack: Bound nf_conntrack sysctl writes
  netfilter: xt_hashlimit: replace vmalloc calls with kvmalloc
====================

Link: https://patch.msgid.link/20250323100922.59983-1-pablo@netfilter.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-25 08:29:13 -07:00
Eric Dumazet
f3483c8e1d net: rfs: hash function change
RFS is using two kinds of hash tables.

First one is controlled by /proc/sys/net/core/rps_sock_flow_entries = 2^N
and using the N low order bits of the l4 hash is good enough.

Then each RX queue has its own hash table, controlled by
/sys/class/net/eth1/queues/rx-$q/rps_flow_cnt = 2^X

Current hash function, using the X low order bits is suboptimal,
because RSS is usually using Func(hash) = (hash % power_of_two);

For example, with 32 RX queues, 6 low order bits have no entropy
for a given queue.

Switch this hash function to hash_32(hash, log) to increase
chances to use all possible slots and reduce collisions.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Tom Herbert <tom@herbertland.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250321171309.634100-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-25 08:24:13 -07:00
Jakub Kicinski
1952e19c02 More features for 6.15, major changes:
* cfg80211/mac80211: fix and enable link reconfiguration
  * rtw88: support RTL8814AE/RTL8814AU
  * mt7996: preparations for MLO
  * ath12k: continued work on MLO
  * iwlwifi: add new iwlmld sub-driver/op-mode for
    some current and future devices
  * wfx: wowlan support
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEpeA8sTs3M8SN2hR410qiO8sPaAAFAmfcE18ACgkQ10qiO8sP
 aAAW4RAAlu0ehfMYvDmUifEo2q7NXVjou1D7u3gGNlwSR/Shrftn2k47n+tThARb
 25ZutatgPlt4b+RjzeoSa724xLlDOQAG7e36tV7xZNDgiWRK6VrF+JVzFxin7cgQ
 e3yXcgf5luLiUNWocCZXc2Vr4MGVrSFcqLAFZXuQPDzVfU8ClKObUsJkEX59PsFv
 Sob5b4VzwAyBXUlUzMjSWxdAsUV04p0NBfAv0HlEuc+XboHwMZDkXJokrPz9YCpz
 TiT+hySDxeFyeXADiXQsYUAVo8LwNUNlehqEc+yzXsuZb6daCiGQiZW0wN15uxCb
 vrfYawJO05QsWtnmG6gg91gh6FwFnd0XIPjrveqzhsAqbA+AJpHqPT9lUBe3K3C2
 PObDpz5NLRnGhO62Rt0au/wkoXGLe3sro5fOy1G0v2YMEvTIvEOByxEtlh0124zM
 C9s28KRV3PqZ4+8YmTjfc/enTVzh8vFFiyj7b//xtclBbFJZezJ1iDTHj1gEuIiQ
 /Npjvu7zXMYxmtGbUfTxoZ4AxzO7wk2j6oD3b+prI3DrthRjzq9n1j7bnmlAukyw
 miFm5f0ss0TrS+dOnE8CysNz8EkYJGDjtoGh9U3Bej+q2HSYK1YwcL2BeW+v28qV
 2rqyOQbftxXHWs0n2aEzkzGwnTYKCY7kR1+CMh11GQ7+5vKBVss=
 =HSDM
 -----END PGP SIGNATURE-----

Merge tag 'wireless-next-2025-03-20' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next

Johannes Berg says:

====================
More features for 6.15, major changes:
 * cfg80211/mac80211: fix and enable link reconfiguration
 * rtw88: support RTL8814AE/RTL8814AU
 * mt7996: preparations for MLO
 * ath12k: continued work on MLO
 * iwlwifi: add new iwlmld sub-driver/op-mode for
   some current and future devices
 * wfx: wowlan support

* tag 'wireless-next-2025-03-20' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next: (311 commits)
  wifi: mt76: mt7996: fix locking in mt7996_mac_sta_rc_work()
  wifi: mt76: mt76x2u: add TP-Link TL-WDN6200 ID to device table
  wifi: mt76: mt792x: re-register CHANCTX_STA_CSA only for the mt7921 series
  wifi: mt76: mt7996: Update mt7996_tx to MLO support
  wifi: mt76: mt7996: rework mt7996_ampdu_action to support MLO
  wifi: mt76: mt7996: rework set/get_tsf callabcks to support MLO
  wifi: mt76: mt7996: set vif default link_id adding/removing vif links
  wifi: mt76: mt7996: rework mt7996_mcu_beacon_inband_discov to support MLO
  wifi: mt76: mt7996: rework mt7996_mcu_add_obss_spr to support MLO
  wifi: mt76: mt7996: rework mt7996_net_fill_forward_path to support MLO
  wifi: mt76: mt7996: rework mt7996_update_mu_group to support MLO
  wifi: mt76: mt7996: rework mt7996_mac_sta_poll to support MLO
  wifi: mt76: mt7996: rework mt7996_mac_sta_rc_work to support MLO
  wifi: mt76: mt7996: remove mt7996_mac_enable_rtscts()
  wifi: mt76: mt7996: rework mt7996_sta_hw_queue_read to support MLO
  wifi: mt76: mt7996: rework mt7996_set_hw_key to support MLO
  wifi: mt76: mt7996: Add mt7996_sta_link to mt7996_mcu_add_bss_info signature
  wifi: mt76: mt7996: rework mt7996_sta_set_4addr and mt7996_sta_set_decap_offload to support MLO
  wifi: mt76: mt7996: rework mt7996_rx_get_wcid to support MLO
  wifi: mt76: mt7996: Rely on wcid_to_sta in mt7996_mac_add_txs_skb()
  ...
====================

Link: https://patch.msgid.link/20250320131106.33266-3-johannes@sipsolutions.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-25 08:04:13 -07:00
Wang Liang
094ee6017e bonding: check xdp prog when set bond mode
Following operations can trigger a warning[1]:

    ip netns add ns1
    ip netns exec ns1 ip link add bond0 type bond mode balance-rr
    ip netns exec ns1 ip link set dev bond0 xdp obj af_xdp_kern.o sec xdp
    ip netns exec ns1 ip link set bond0 type bond mode broadcast
    ip netns del ns1

When delete the namespace, dev_xdp_uninstall() is called to remove xdp
program on bond dev, and bond_xdp_set() will check the bond mode. If bond
mode is changed after attaching xdp program, the warning may occur.

Some bond modes (broadcast, etc.) do not support native xdp. Set bond mode
with xdp program attached is not good. Add check for xdp program when set
bond mode.

    [1]
    ------------[ cut here ]------------
    WARNING: CPU: 0 PID: 11 at net/core/dev.c:9912 unregister_netdevice_many_notify+0x8d9/0x930
    Modules linked in:
    CPU: 0 UID: 0 PID: 11 Comm: kworker/u4:0 Not tainted 6.14.0-rc4 #107
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.15.0-0-g2dd4b9b3f840-prebuilt.qemu.org 04/01/2014
    Workqueue: netns cleanup_net
    RIP: 0010:unregister_netdevice_many_notify+0x8d9/0x930
    Code: 00 00 48 c7 c6 6f e3 a2 82 48 c7 c7 d0 b3 96 82 e8 9c 10 3e ...
    RSP: 0018:ffffc90000063d80 EFLAGS: 00000282
    RAX: 00000000ffffffa1 RBX: ffff888004959000 RCX: 00000000ffffdfff
    RDX: 0000000000000000 RSI: 00000000ffffffea RDI: ffffc90000063b48
    RBP: ffffc90000063e28 R08: ffffffff82d39b28 R09: 0000000000009ffb
    R10: 0000000000000175 R11: ffffffff82d09b40 R12: ffff8880049598e8
    R13: 0000000000000001 R14: dead000000000100 R15: ffffc90000045000
    FS:  0000000000000000(0000) GS:ffff888007a00000(0000) knlGS:0000000000000000
    CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 000000000d406b60 CR3: 000000000483e000 CR4: 00000000000006f0
    Call Trace:
     <TASK>
     ? __warn+0x83/0x130
     ? unregister_netdevice_many_notify+0x8d9/0x930
     ? report_bug+0x18e/0x1a0
     ? handle_bug+0x54/0x90
     ? exc_invalid_op+0x18/0x70
     ? asm_exc_invalid_op+0x1a/0x20
     ? unregister_netdevice_many_notify+0x8d9/0x930
     ? bond_net_exit_batch_rtnl+0x5c/0x90
     cleanup_net+0x237/0x3d0
     process_one_work+0x163/0x390
     worker_thread+0x293/0x3b0
     ? __pfx_worker_thread+0x10/0x10
     kthread+0xec/0x1e0
     ? __pfx_kthread+0x10/0x10
     ? __pfx_kthread+0x10/0x10
     ret_from_fork+0x2f/0x50
     ? __pfx_kthread+0x10/0x10
     ret_from_fork_asm+0x1a/0x30
     </TASK>
    ---[ end trace 0000000000000000 ]---

Fixes: 9e2ee5c7e7 ("net, bonding: Add XDP support to the bonding driver")
Signed-off-by: Wang Liang <wangliang74@huawei.com>
Acked-by: Jussi Maki <joamaki@gmail.com>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://patch.msgid.link/20250321044852.1086551-1-wangliang74@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-25 08:00:09 -07:00
Eric Dumazet
0de2a5c4b8 tcp: avoid atomic operations on sk->sk_rmem_alloc
TCP uses generic skb_set_owner_r() and sock_rfree()
for received packets, with socket lock being owned.

Switch to private versions, avoiding two atomic operations
per packet.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250320121604.3342831-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-25 07:37:16 -07:00
Eric Dumazet
b709857ecb ipv6: fix _DEVADD() and _DEVUPD() macros
ip6_rcv_core() is using:

	__IP6_ADD_STATS(net, idev,
			IPSTATS_MIB_NOECTPKTS +
				(ipv6_get_dsfield(hdr) & INET_ECN_MASK),
			max_t(unsigned short, 1, skb_shinfo(skb)->gso_segs));

This is currently evaluating both expressions twice.

Fix _DEVADD() and _DEVUPD() macros to evaluate their arguments once.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250319212516.2385451-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-25 07:31:24 -07:00
Kuniyuki Iwashima
3056172a26 af_unix: Explicitly include headers for non-pointer struct fields.
include/net/af_unix.h indirectly includes some definitions for structs.

Let's include such headers explicitly.

  linux/atomic.h   : scm_stat.nr_fds
  linux/net.h      : unix_sock.peer_wq
  linux/path.h     : unix_sock.path
  linux/spinlock.h : unix_sock.lock
  linux/wait.h     : unix_sock.peer_wake
  uapi/linux/un.h  : unix_address.name[]

linux/socket.h is removed as the structs there are not used directly,
and linux/un.h is clarified with uapi as un.h only exists under
include/uapi.

While at it, duplicate headers are removed from .c files.

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250318034934.86708-4-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-25 04:30:07 -07:00
Kuniyuki Iwashima
84960bf240 af_unix: Move internal definitions to net/unix/.
net/af_unix.h is included by core and some LSMs, but most definitions
need not be.

Let's move struct unix_{vertex,edge} to net/unix/garbage.c and other
definitions to net/unix/af_unix.h.

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Joe Damato <jdamato@fastly.com>
Link: https://patch.msgid.link/20250318034934.86708-3-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-25 04:30:07 -07:00
Kuniyuki Iwashima
f9af583a2c af_unix: Sort headers.
This is a prep patch to make the following changes cleaner.

No functional change intended.

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Joe Damato <jdamato@fastly.com>
Link: https://patch.msgid.link/20250318034934.86708-2-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-25 04:30:07 -07:00
Jason Xing
f38805c5d2 tcp: support TCP_RTO_MIN_US for set/getsockopt use
Support adjusting/reading RTO MIN for socket level by using set/getsockopt().

This new option has the same effect as TCP_BPF_RTO_MIN, which means it
doesn't affect RTAX_RTO_MIN usage (by using ip route...). Considering that
bpf option was implemented before this patch, so we need to use a standalone
new option for pure tcp set/getsockopt() use.

When the socket is created, its icsk_rto_min is set to the default
value that is controlled by sysctl_tcp_rto_min_us. Then if application
calls setsockopt() with TCP_RTO_MIN_US flag to pass a valid value, then
icsk_rto_min will be overridden in jiffies unit.

This patch adds WRITE_ONCE/READ_ONCE to avoid data-race around
icsk_rto_min.

Signed-off-by: Jason Xing <kerneljasonxing@gmail.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250317120314.41404-2-kerneljasonxing@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-25 04:27:19 -07:00
Paolo Abeni
c353e8983e net: introduce per netns packet chains
Currently network taps unbound to any interface are linked in the
global ptype_all list, affecting the performance in all the network
namespaces.

Add per netns ptypes chains, so that in the mentioned case only
the netns owning the packet socket(s) is affected.

While at that drop the global ptype_all list: no in kernel user
registers a tap on "any" type without specifying either the target
device or the target namespace (and IMHO doing that would not make
any sense).

Note that this adds a conditional in the fast path (to check for
per netns ptype_specific list) and increases the dataset size by
a cacheline (owing the per netns lists).

Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Eric Dumazet <edumaze@google.com>
Link: https://patch.msgid.link/ae405f98875ee87f8150c460ad162de7e466f8a7.1742494826.git.pabeni@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-24 13:58:22 -07:00
Linus Torvalds
9483c37e2d vfs-6.15-rc1.afs
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZ90sDQAKCRCRxhvAZXjc
 ooUnAQCaXv5U0GaEwkCcW78vw/dk7jyFG5LlrGUMvZV8MBSvuAEAsaPvU2uM6ZNf
 743B8zOopOeX3Nwy8UKRcHk1nO1m5AY=
 =NoZe
 -----END PGP SIGNATURE-----

Merge tag 'vfs-6.15-rc1.afs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull vfs afs updates from Christian Brauner:
 "This contains the work for afs for this cycle:

   - Fix an occasional hang that's only really encountered when
     rmmod'ing the kafs module

   - Remove the "-o autocell" mount option. This is obsolete with the
     dynamic root and removing it makes the next patch slightly easier

   - Change how the dynamic root mount is constructed. Currently, the
     root directory is (de)populated when it is (un)mounted if there are
     cells already configured and, further, pairs of automount points
     have to be created/removed each time a cell is added/deleted

     This is changed so that readdir on the root dir lists all the known
     cell automount pairs plus the @cell symlinks and the inodes and
     dentries are constructed by lookup on demand. This simplifies the
     cell management code

   - A few improvements to the afs_volume and afs_server tracepoints

   - Pass trace info into the afs_lookup_cell() function to allow the
     trace log to indicate the purpose of the lookup

   - Remove the 'net' parameter from afs_unuse_cell() as it's
     superfluous

   - In rxrpc, allow a kernel app (such as kafs) to store a word of
     information on rxrpc_peer records

   - Use the information stored on the rxrpc_peer record to point to the
     afs_server record. This allows the server address lookup to be done
     away with

   - Simplify the afs_server ref/activity accounting to make each one
     self-contained and not garbage collected from the cell management
     work item

   - Simplify the afs_cell ref/activity accounting to make each one of
     these also self-contained and not driven by a central management
     work item

     The current code was intended to make it such that a single timer
     for the namespace and one work item per cell could do all the work
     required to maintain these records. This, however, made for some
     sequencing problems when cleaning up these records. Further, the
     attempt to pass refs along with timers and work items made getting
     it right rather tricky when the timer or work item already had a
     ref attached and now a ref had to be got rid of"

* tag 'vfs-6.15-rc1.afs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  afs: Simplify cell record handling
  afs: Fix afs_server ref accounting
  afs: Use the per-peer app data provided by rxrpc
  rxrpc: Allow the app to store private data on peer structs
  afs: Drop the net parameter from afs_unuse_cell()
  afs: Make afs_lookup_cell() take a trace note
  afs: Improve server refcount/active count tracing
  afs: Improve afs_volume tracing to display a debug ID
  afs: Change dynroot to create contents on demand
  afs: Remove the "autocell" mount option
2025-03-24 13:15:16 -07:00
Kuniyuki Iwashima
66034f78a5 tcp/dccp: Remove inet_connection_sock_af_ops.addr2sockaddr().
inet_connection_sock_af_ops.addr2sockaddr() hasn't been used at all
in the git era.

  $ git grep addr2sockaddr $(git rev-list HEAD | tail -n 1)

Let's remove it.

Note that there was a 4 bytes hole after sockaddr_len and now it's
6 bytes, so the binary layout is not changed.

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250318060112.3729-1-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-24 12:10:13 -07:00
Eric Dumazet
1937a0be28 tcp: move icsk_clean_acked to a better location
As a followup of my presentation in Zagreb for netdev 0x19:

icsk_clean_acked is only used by TCP when/if CONFIG_TLS_DEVICE
is enabled from tcp_ack().

Rename it to tcp_clean_acked, move it to tcp_sock structure
in the tcp_sock_read_rx for better cache locality in TCP
fast path.

Define this field only when CONFIG_TLS_DEVICE is enabled
saving 8 bytes on configs not using it.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250317085313.2023214-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-24 09:55:18 -07:00
Murad Masimov
2f6efbabce ax25: Remove broken autobind
Binding AX25 socket by using the autobind feature leads to memory leaks
in ax25_connect() and also refcount leaks in ax25_release(). Memory
leak was detected with kmemleak:

================================================================
unreferenced object 0xffff8880253cd680 (size 96):
backtrace:
__kmalloc_node_track_caller_noprof (./include/linux/kmemleak.h:43)
kmemdup_noprof (mm/util.c:136)
ax25_rt_autobind (net/ax25/ax25_route.c:428)
ax25_connect (net/ax25/af_ax25.c:1282)
__sys_connect_file (net/socket.c:2045)
__sys_connect (net/socket.c:2064)
__x64_sys_connect (net/socket.c:2067)
do_syscall_64 (arch/x86/entry/common.c:52 arch/x86/entry/common.c:83)
entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:130)
================================================================

When socket is bound, refcounts must be incremented the way it is done
in ax25_bind() and ax25_setsockopt() (SO_BINDTODEVICE). In case of
autobind, the refcounts are not incremented.

This bug leads to the following issue reported by Syzkaller:

================================================================
ax25_connect(): syz-executor318 uses autobind, please contact jreuter@yaina.de
------------[ cut here ]------------
refcount_t: decrement hit 0; leaking memory.
WARNING: CPU: 0 PID: 5317 at lib/refcount.c:31 refcount_warn_saturate+0xfa/0x1d0 lib/refcount.c:31
Modules linked in:
CPU: 0 UID: 0 PID: 5317 Comm: syz-executor318 Not tainted 6.14.0-rc4-syzkaller-00278-gece144f151ac #0
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2~bpo12+1 04/01/2014
RIP: 0010:refcount_warn_saturate+0xfa/0x1d0 lib/refcount.c:31
...
Call Trace:
 <TASK>
 __refcount_dec include/linux/refcount.h:336 [inline]
 refcount_dec include/linux/refcount.h:351 [inline]
 ref_tracker_free+0x6af/0x7e0 lib/ref_tracker.c:236
 netdev_tracker_free include/linux/netdevice.h:4302 [inline]
 netdev_put include/linux/netdevice.h:4319 [inline]
 ax25_release+0x368/0x960 net/ax25/af_ax25.c:1080
 __sock_release net/socket.c:647 [inline]
 sock_close+0xbc/0x240 net/socket.c:1398
 __fput+0x3e9/0x9f0 fs/file_table.c:464
 __do_sys_close fs/open.c:1580 [inline]
 __se_sys_close fs/open.c:1565 [inline]
 __x64_sys_close+0x7f/0x110 fs/open.c:1565
 do_syscall_x64 arch/x86/entry/common.c:52 [inline]
 do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83
 entry_SYSCALL_64_after_hwframe+0x77/0x7f
 ...
 </TASK>
================================================================

Considering the issues above and the comments left in the code that say:
"check if we can remove this feature. It is broken."; "autobinding in this
may or may not work"; - it is better to completely remove this feature than
to fix it because it is broken and leads to various kinds of memory bugs.

Now calling connect() without first binding socket will result in an
error (-EINVAL). Userspace software that relies on the autobind feature
might get broken. However, this feature does not seem widely used with
this specific driver as it was not reliable at any point of time, and it
is already broken anyway. E.g. ax25-tools and ax25-apps packages for
popular distributions do not use the autobind feature for AF_AX25.

Found by Linux Verification Center (linuxtesting.org) with Syzkaller.

Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Reported-by: syzbot+33841dc6aa3e1d86b78a@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=33841dc6aa3e1d86b78a
Signed-off-by: Murad Masimov <m.masimov@mt-integration.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>
2025-03-24 10:26:53 +00:00
Herbert Xu
a6984aa806 net: mctp: Remove unnecessary cast in mctp_cb
The void * cast in mctp_cb is unnecessary as it's already been done
at the start of the function.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Acked-by: Jeremy Kerr <jk@codeconstruct.com.au>

Link: https://patch.msgid.link/Z9PwOQeBSYlgZlHq@gondor.apana.org.au
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-03-21 18:18:12 +01:00
Herbert Xu
eb2953d269 xfrm: ipcomp: Use crypto_acomp interface
Replace the legacy comperssion interface with the new acomp
interface.  This is the first user to make full user of the
asynchronous nature of acomp by plugging into the existing xfrm
resume interface.

As a result of SG support by acomp, the linear scratch buffer
in ipcomp can be removed.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2025-03-21 17:36:49 +08:00
Florian Westphal
eaaff9b670 netfilter: fib: avoid lookup if socket is available
In case the fib match is used from the input hook we can avoid the fib
lookup if early demux assigned a socket for us: check that the input
interface matches sk-cached one.

Rework the existing 'lo bypass' logic to first check sk, then
for loopback interface type to elide the fib lookup.

This speeds up fib matching a little, before:
93.08 GBit/s (no rules at all)
75.1  GBit/s ("fib saddr . iif oif missing drop" in prerouting)
75.62 GBit/s ("fib saddr . iif oif missing drop" in input)

After:
92.48 GBit/s (no rules at all)
75.62 GBit/s (fib rule in prerouting)
90.37 GBit/s (fib rule in input).

Numbers for the 'no rules' and 'prerouting' are expected to
closely match in-between runs, the 3rd/input test case exercises the
the 'avoid lookup if cached ifindex in sk matches' case.

Test used iperf3 via veth interface, lo can't be used due to existing
loopback test.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2025-03-21 10:12:15 +01:00
Paolo Abeni
f491593394 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR (net-6.14-rc8).

Conflict:

tools/testing/selftests/net/Makefile
  03544faad7 ("selftest: net: add proc_net_pktgen")
  3ed61b8938 ("selftests: net: test for lwtunnel dst ref loops")

tools/testing/selftests/net/config:
  85cb3711ac ("selftests: net: Add test cases for link and peer netns")
  3ed61b8938 ("selftests: net: test for lwtunnel dst ref loops")

Adjacent commits:

tools/testing/selftests/net/Makefile
  c935af429e ("selftests: net: add support for testing SO_RCVMARK and SO_RCVPRIORITY")
  355d940f4d ("Revert "selftests: Add IPv6 link-local address generation tests for GRE devices."")

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-03-20 21:38:01 +01:00
Geliang Tang
fa3ee9dd80 mptcp: sysctl: add available_path_managers
Similarly to net.mptcp.available_schedulers, this patch adds a new one
net.mptcp.available_path_managers to list the available path managers.

Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn>
Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250313-net-next-mptcp-pm-ops-intro-v1-11-f4e4a88efc50@kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-03-20 10:14:49 +01:00
Geliang Tang
1305b0c22e mptcp: pm: define struct mptcp_pm_ops
In order to allow users to develop their own BPF-based path manager,
this patch defines a struct ops "mptcp_pm_ops" for an MPTCP path
manager, which contains a set of interfaces. Currently only init()
and release() interfaces are included, subsequent patches will add
others step by step.

Add a set of functions to register, unregister, find and validate a
given path manager struct ops.

"list" is used to add this path manager to mptcp_pm_list list when
it is registered. "name" is used to identify this path manager.
mptcp_pm_find() uses "name" to find a path manager on the list.

mptcp_pm_unregister is not used in this set, but will be invoked in
.unreg of struct bpf_struct_ops. mptcp_pm_validate() will be invoked
in .validate of struct bpf_struct_ops. That's why they are exported.

Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn>
Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250313-net-next-mptcp-pm-ops-intro-v1-6-f4e4a88efc50@kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-03-20 10:14:48 +01:00
Paolo Abeni
a0aff75e15 bluetooth pull request for net:
- hci_event: Fix connection regression between LE and non-LE adapters
  - Fix error code in chan_alloc_skb_cb()
 -----BEGIN PGP SIGNATURE-----
 
 iQJNBAABCAA3FiEE7E6oRXp8w05ovYr/9JCA4xAyCykFAmfUWpAZHGx1aXoudm9u
 LmRlbnR6QGludGVsLmNvbQAKCRD0kIDjEDILKVf5D/9sOWdQHyD5O3JUmG9Z9RP5
 CWsuJMHIqNlx+DWo//rxjD5iyzMozIutW8DS7zEW+Ndct++hn2VnM2BVqmLN5c81
 QqTRPJzULOTz1wpckCZ5dfpEAliXL57YUrtrsefWNEVGooBSNblkCvNjsa1Xq6BQ
 FmQ49avxxs4lxNBxA5vLqNBuWKqH1oeH6Q2shL350iX8qYiCk1+1cozA4LvDP+lE
 2qiOirYjRaarql7rcU+SC9PSt+cVQ/0OCRsOu2n0dbrjRn5KAWsq2+zEh36dP0Ks
 x+tmyYQWJk6Cyb+z/N2VN2NWpIIl9OiY3PSOCfNVD43MZ/RIk8QZgW+UjapbXShO
 1dDFOA4LLI5c2gNg4Px/lnNTd7bBU0smrv0AQ+CD/9zlDd0TJtbvtFCx/rYPEH8n
 sY2tzplfYeBLbbrHeaN0RHR0GQW1IYHcNTjctNeFcJBewKAAX2m7JnylWsG/ZCw9
 qW+c5qv6L+sGCHi92AQrxc0GU3/Pdl6t1YUow0nHFPu+7Elb3T/KOkaIWRauiAA1
 C6cKkp+iMyTb0Q6beh3s3FgF24MmEIBSwGAUrworvQm5+tzC5l75cv7mSQtwGMR0
 8V7x0e77zaTZ9ncgvloAwu5M472TvM/HdVkG/oODH5ab19kTiUOPHN2CtNeLpFo4
 pI8fKTzFtNSQISpVyARw5A==
 =+VUv
 -----END PGP SIGNATURE-----

Merge tag 'for-net-2025-03-14' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth

Luiz Augusto von Dentz says:

====================
bluetooth pull request for net:

 - hci_event: Fix connection regression between LE and non-LE adapters
 - Fix error code in chan_alloc_skb_cb()

* tag 'for-net-2025-03-14' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth:
  Bluetooth: hci_event: Fix connection regression between LE and non-LE adapters
  Bluetooth: Fix error code in chan_alloc_skb_cb()
====================

Link: https://patch.msgid.link/20250314163847.110069-1-luiz.dentz@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-03-19 19:44:05 +01:00
Eric Dumazet
15492700ac tcp: cache RTAX_QUICKACK metric in a hot cache line
tcp_in_quickack_mode() is called from input path for small packets.

It calls __sk_dst_get() which reads sk->sk_dst_cache which has been
put in sock_read_tx group (for good reasons).

Then dst_metric(dst, RTAX_QUICKACK) also needs extra cache line misses.

Cache RTAX_QUICKACK in icsk->icsk_ack.dst_quick_ack to no longer pull
these cache lines for the cases a delayed ACK is scheduled.

After this patch TCP receive path does not longer access sock_read_tx
group.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
Reviewed-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250312083907.1931644-1-edumazet@google.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-03-18 13:44:59 +01:00
Eric Dumazet
eb0dfc0ef1 inet: frags: change inet_frag_kill() to defer refcount updates
In the following patch, we no longer assume inet_frag_kill()
callers own a reference.

Consuming two refcounts from inet_frag_kill() would lead in UAF.

Propagate the pointer to the refs that will be consumed later
by the final inet_frag_putn() call.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250312082250.1803501-4-edumazet@google.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-03-18 13:18:36 +01:00
Eric Dumazet
ae2d90355a inet: frags: add inet_frag_putn() helper
inet_frag_putn() can release multiple references
in one step.

Use it in inet_frags_free_cb().

Replace inet_frag_put(X) with inet_frag_putn(X, 1)

Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250312082250.1803501-2-edumazet@google.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-03-18 13:18:36 +01:00
Paolo Abeni
311b36574c udp_tunnel: use static call for GRO hooks when possible
It's quite common to have a single UDP tunnel type active in the
whole system. In such a case we can replace the indirect call for
the UDP tunnel GRO callback with a static call.

Add the related accounting in the control path and switch to static
call when possible. To keep the code simple use a static array for
the registered tunnel types, and size such array based on the kernel
config.

Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/6fd1f9c7651151493ecab174e7b8386a1534170d.1741718157.git.pabeni@redhat.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-03-18 11:40:30 +01:00
Paolo Abeni
8d4880db37 udp_tunnel: create a fastpath GRO lookup.
Most UDP tunnels bind a socket to a local port, with ANY address, no
peer and no interface index specified.
Additionally it's quite common to have a single tunnel device per
namespace.

Track in each namespace the UDP tunnel socket respecting the above.
When only a single one is present, store a reference in the netns.

When such reference is not NULL, UDP tunnel GRO lookup just need to
match the incoming packet destination port vs the socket local port.

The tunnel socket never sets the reuse[port] flag[s]. When bound to no
address and interface, no other socket can exist in the same netns
matching the specified local port.

Matching packets with non-local destination addresses will be
aggregated, and eventually segmented as needed - no behavior changes
intended.

Note that the UDP tunnel socket reference is stored into struct
netns_ipv4 for both IPv4 and IPv6 tunnels. That is intentional to keep
all the fastpath-related netns fields in the same struct and allow
cacheline-based optimization. Currently both the IPv4 and IPv6 socket
pointer share the same cacheline as the `udp_table` field.

Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/4d5c319c4471161829f50cb8436841de81a5edae.1741718157.git.pabeni@redhat.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-03-18 11:40:26 +01:00
Haiyang Zhang
2fc8a34662 net: mana: Support holes in device list reply msg
According to GDMA protocol, holes (zeros) are allowed at the beginning
or middle of the gdma_list_devices_resp message. The existing code
cannot properly handle this, and may miss some devices in the list.

To fix, scan the entire list until the num_of_devs are found, or until
the end of the list.

Cc: stable@vger.kernel.org
Fixes: ca9c54d2d6 ("net: mana: Add a driver for Microsoft Azure Network Adapter (MANA)")
Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
Reviewed-by: Long Li <longli@microsoft.com>
Reviewed-by: Shradha Gupta <shradhagupta@microsoft.com>
Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>
Link: https://patch.msgid.link/1741723974-1534-1-git-send-email-haiyangz@microsoft.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-03-18 11:32:15 +01:00
Johannes Berg
c924c5e9b8 Merge net-next/main to resolve conflicts
There are a few conflicts between the work that went
into wireless and that's here now, resolve them.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-03-18 09:46:36 +01:00
Ilpo Järvinen
9866884ce8 tcp: Pass flags to __tcp_send_ack
Accurate ECN needs to send custom flags to handle IP-ECN
field reflection during handshake.

Signed-off-by: Ilpo Järvinen <ij@kernel.org>
Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2025-03-17 13:56:38 +00:00
Ilpo Järvinen
4618e195f9 tcp: add new TCP_TW_ACK_OOW state and allow ECN bits in TOS
ECN bits in TOS are always cleared when sending in ACKs in TW. Clearing
them is problematic for TCP flows that used Accurate ECN because ECN bits
decide which service queue the packet is placed into (L4S vs Classic).
Effectively, TW ACKs are always downgraded from L4S to Classic queue
which might impact, e.g., delay the ACK will experience on the path
compared with the other packets of the flow.

Change the TW ACK sending code to differentiate:
- In tcp_v4_send_reset(), commit ba9e04a7dd ("ip: fix tos reflection
  in ack and reset packets") cleans ECN bits for TW reset and this is
  not affected.
- In tcp_v4_timewait_ack(), ECN bits for all TW ACKs are cleaned. But now
  only ECN bits of ACKs for oow data or paws_reject are cleaned, and ECN
  bits of other ACKs will not be cleaned.
- In tcp_v4_reqsk_send_ack(), commit 66b13d99d9 ("ipv4: tcp: fix TOS
  value in ACK messages sent from TIME_WAIT") did not clean ECN bits of
  ACKs for oow data or paws_reject. But now the ECN bits rae cleaned for
  these ACKs.

Signed-off-by: Ilpo Järvinen <ij@kernel.org>
Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2025-03-17 13:56:17 +00:00
Ilpo Järvinen
041fb11d51 tcp: helpers for ECN mode handling
Create helpers for TCP ECN modes. No functional changes.

Signed-off-by: Ilpo Järvinen <ij@kernel.org>
Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2025-03-17 13:54:11 +00:00
Ilpo Järvinen
2c2f08d31d tcp: extend TCP flags to allow AE bit/ACE field
With AccECN, there's one additional TCP flag to be used (AE)
and ACE field that overloads the definition of AE, CWR, and
ECE flags. As tcp_flags was previously only 1 byte, the
byte-order stuff needs to be added to it's handling.

Signed-off-by: Ilpo Järvinen <ij@kernel.org>
Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2025-03-17 13:49:46 +00:00
Chia-Yu Chang
0114a91da6 tcp: use BIT() macro in include/net/tcp.h
Use BIT() macro for TCP flags field and TCP congestion control
flags that will be used by the congestion control algorithm.

No functional changes.

Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
Reviewed-by: Ilpo Järvinen <ij@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2025-03-17 13:49:13 +00:00
Uros Bizjak
8a3c392388 percpu: use TYPEOF_UNQUAL() in variable declarations
Use TYPEOF_UNQUAL() to declare variables as a corresponding type without
named address space qualifier to avoid "`__seg_gs' specified for auto
variable `var'" errors.

Link: https://lkml.kernel.org/r/20250127160709.80604-4-ubizjak@gmail.com
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Acked-by: Nadav Amit <nadav.amit@gmail.com>
Acked-by: Christoph Lameter <cl@linux.com>
Cc: Dennis Zhou <dennis@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Kent Overstreet <kent.overstreet@linux.dev>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Will Deacon <will@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16 22:05:53 -07:00
Paolo Abeni
941defcea7 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR (net-6.14-rc6).

Conflicts:

tools/testing/selftests/drivers/net/ping.py
  75cc19c8ff ("selftests: drv-net: add xdp cases for ping.py")
  de94e86974 ("selftests: drv-net: store addresses in dict indexed by ipver")
https://lore.kernel.org/netdev/20250311115758.17a1d414@canb.auug.org.au/

net/core/devmem.c
  a70f891e0f ("net: devmem: do not WARN conditionally after netdev_rx_queue_restart()")
  1d22d3060b ("net: drop rtnl_lock for queue_mgmt operations")
https://lore.kernel.org/netdev/20250313114929.43744df1@canb.auug.org.au/

Adjacent changes:

tools/testing/selftests/net/Makefile
  6f50175cca ("selftests: Add IPv6 link-local address generation tests for GRE devices.")
  2e5584e0f9 ("selftests/net: expand cmsg_ipv6.sh with ipv4")

drivers/net/ethernet/broadcom/bnxt/bnxt.c
  661958552e ("eth: bnxt: do not use BNXT_VNIC_NTUPLE unconditionally in queue restart logic")
  fe96d717d3 ("bnxt_en: Extend queue stop/start for TX rings")

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-03-13 23:08:11 +01:00
Arkadiusz Bokowy
f6685a96c8 Bluetooth: hci_event: Fix connection regression between LE and non-LE adapters
Due to a typo during defining HCI errors it is not possible to connect
LE-capable device with BR/EDR only adapter. The connection is terminated
by the LE adapter because the invalid LL params error code is treated
as unsupported remote feature.

Fixes: 79c0868ad6 ("Bluetooth: hci_event: Use HCI error defines instead of magic values")
Signed-off-by: Arkadiusz Bokowy <arkadiusz.bokowy@gmail.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2025-03-13 16:43:39 -04:00
Long Li
a8445cfec1 net: mana: Change the function signature of mana_get_primary_netdev_rcu
Change mana_get_primary_netdev_rcu() to mana_get_primary_netdev(), and
return the ndev with refcount held. The caller is responsible for dropping
the refcount.

Also drop the check for IFF_SLAVE as it is not necessary if the upper
device is present.

Signed-off-by: Long Li <longli@microsoft.com>
Link: https://patch.msgid.link/1741821332-9392-1-git-send-email-longli@linuxonhyperv.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-03-13 08:03:02 -04:00
Stanislav Fomichev
10eef096be net: add granular lock for the netdev netlink socket
As we move away from rtnl_lock for queue ops, introduce
per-netdev_nl_sock lock.

Signed-off-by: Stanislav Fomichev <sdf@fomichev.me>
Reviewed-by: Mina Almasry <almasrymina@google.com>
Link: https://patch.msgid.link/20250311144026.4154277-3-sdf@fomichev.me
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-12 13:32:35 -07:00
Stanislav Fomichev
b6b67141d6 net: create netdev_nl_sock to wrap bindings list
No functional changes. Next patches will add more granular locking
to netdev_nl_sock.

Signed-off-by: Stanislav Fomichev <sdf@fomichev.me>
Reviewed-by: Mina Almasry <almasrymina@google.com>
Link: https://patch.msgid.link/20250311144026.4154277-2-sdf@fomichev.me
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-12 13:32:35 -07:00
Johannes Berg
b5c1622762 wifi: cfg80211: expose cfg80211_chandef_get_width()
This can be just a trivial inline, to simplify some code.
Expose it, and also use it in util.c where it wasn't
previously available.

Reviewed-by: Miriam Rachel Korenblit <miriam.rachel.korenblit@intel.com>
Link: https://patch.msgid.link/20250311122534.c5c3b4af9a74.Ib25cf60f634dc359961182113214e5cdc3504e9c@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-03-12 09:50:24 +01:00
Ilan Peer
e16caea706 wifi: cfg80211: Update the link address when a link is added
When links are added, update the wireless device link addresses based
on the information provided by the driver.

Signed-off-by: Ilan Peer <ilan.peer@intel.com>
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
Link: https://patch.msgid.link/20250308225541.d694a9125aba.I79b010ea9aab47893e4f22c266362fde30b7f9ac@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-03-11 10:53:10 +01:00
Johannes Berg
cf12d3d71e wifi: cfg80211: improve supported_selector documentation
Improve the documentation for supported BSS selectors to make it more
precise.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Reviewed-by: Ilan Peer <ilan.peer@intel.com>
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
Link: https://patch.msgid.link/20250308225541.ba402ff47314.I502b56111b62ea0be174ae76bd03684ae1d4aefb@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-03-11 10:53:10 +01:00
Anjaneyulu
cf4bd16088 wifi: cfg80211: allow IR in 20 MHz configurations
Some regulatory bodies doesn't allow IR (initiate radioation) on a
specific subband, but allows it for channels with a bandwidth of 20 MHz.
Add a channel flag that indicates that, and consider it in
cfg80211_reg_check_beaconing.

While on it, fix the kernel doc of enum nl80211_reg_rule_flags and
change it to use BIT().

Signed-off-by: Anjaneyulu <pagadala.yesu.anjaneyulu@intel.com>
Co-developed-by: Somashekhar Puttagangaiah <somashekhar.puttagangaiah@intel.com>
Signed-off-by: Somashekhar Puttagangaiah <somashekhar.puttagangaiah@intel.com>
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
Link: https://patch.msgid.link/20250308225541.d3ab352a73ff.I8a8f79e1c9eb74936929463960ee2a324712fe51@changeid
[fix typo]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-03-11 10:53:01 +01:00
Johannes Berg
969241371f wifi: cfg80211: allow setting extended MLD capa/ops
Some extended MLD capabilities and operations bits (currently
the "BTM MLD Recommendataion For Multiple APs Support" bit)
may depend on userspace capabilities. Allow userspace to pass
the values for this field that it supports to the association
and link reconfiguration operations.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Reviewed-by: Ilan Peer <ilan.peer@intel.com>
Link: https://patch.msgid.link/20250308225541.bd52078b5f65.I4dd8f53b0030db7ea87a2e0920989e7e2c7b5345@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-03-11 10:51:59 +01:00
Johannes Berg
a096a8602f wifi: cfg80211: move link reconfig parameters into a struct
Add a new struct cfg80211_ml_reconf_req to collect the link
reconfiguration parameters.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
Link: https://patch.msgid.link/20250308225541.0cf299c1fdd0.Id1a3b1092dc52d0d3731a8798522fdf2e052bf0b@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-03-11 10:51:58 +01:00
David Howells
f3a123b254 rxrpc: Allow the app to store private data on peer structs
Provide a way for the application (e.g. the afs filesystem) to store
private data on the rxrpc_peer structs for later retrieval via the call
object.

This will allow afs to store a pointer to the afs_server object on the
rxrpc_peer struct, thereby obviating the need for afs to keep lookup tables
by which it can associate an incoming call with server that transmitted it.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: Jakub Kicinski <kuba@kernel.org>
cc: "David S. Miller" <davem@davemloft.net>
cc: Eric Dumazet <edumazet@google.com>
cc: Paolo Abeni <pabeni@redhat.com>
cc: Simon Horman <horms@kernel.org>
cc: linux-afs@lists.infradead.org
cc: linux-fsdevel@vger.kernel.org
cc: netdev@vger.kernel.org
Link: https://lore.kernel.org/r/20250224234154.2014840-13-dhowells@redhat.com/ # v1
Link: https://lore.kernel.org/r/20250310094206.801057-9-dhowells@redhat.com/ # v4
2025-03-10 09:47:15 +00:00
Jakub Kicinski
8ef890df40 net: move misc netdev_lock flavors to a separate header
Move the more esoteric helpers for netdev instance lock to
a dedicated header. This avoids growing netdevice.h to infinity
and makes rebuilding the kernel much faster (after touching
the header with the helpers).

The main netdev_lock() / netdev_unlock() functions are used
in static inlines in netdevice.h and will probably be used
most commonly, so keep them in netdevice.h.

Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20250307183006.2312761-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-08 09:06:50 -08:00
Matthieu Baerts (NGI0)
0d7336f8f0 tcp: ulp: diag: more info without CAP_NET_ADMIN
When introduced in commit 61723b3932 ("tcp: ulp: add functions to dump
ulp-specific information"), the whole ULP diag info has been exported
only if the requester had CAP_NET_ADMIN.

It looks like not everything is sensitive, and some info can be exported
to all users in order to ease the debugging from the userspace side
without requiring additional capabilities. Each layer should then decide
what can be exposed to everybody. The 'net_admin' boolean is then passed
to the different layers.

On kTLS side, it looks like there is nothing sensitive there: version,
cipher type, tx/rx user config type, plus some flags. So, only some
metadata about the configuration, no cryptographic info like keys, etc.
Then, everything can be exported to all users.

On MPTCP side, that's different. The MPTCP-related sequence numbers per
subflow should certainly not be exposed to everybody. For example, the
DSS mapping and ssn_offset would give all users on the system access to
narrow ranges of values for the subflow TCP sequence numbers and
MPTCP-level DSNs, and then ease packet injection. The TCP diag interface
doesn't expose the TCP sequence numbers for TCP sockets, so best to do
the same here. The rest -- token, IDs, flags -- can be exported to
everybody.

Acked-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20250306-net-next-tcp-ulp-diag-net-admin-v1-2-06afdd860fc9@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-07 19:39:53 -08:00
Luiz Augusto von Dentz
ab6ab707a4 Revert "Bluetooth: hci_core: Fix sleeping function called from invalid context"
This reverts commit 4d94f05558 which has
problems (see [1]) and is no longer needed since 581dd2dc16
("Bluetooth: hci_event: Fix using rcu_read_(un)lock while iterating")
has reworked the code where the original bug has been found.

[1] Link: https://lore.kernel.org/linux-bluetooth/877c55ci1r.wl-tiwai@suse.de/T/#t
Fixes: 4d94f05558 ("Bluetooth: hci_core: Fix sleeping function called from invalid context")
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2025-03-07 13:03:05 -05:00
Eric Dumazet
d4438ce68b inet: call inet6_ehashfn() once from inet6_hash_connect()
inet6_ehashfn() being called from __inet6_check_established()
has a big impact on performance, as shown in the Tested section.

After prior patch, we can compute the hash for port 0
from inet6_hash_connect(), and derive each hash in
__inet_hash_connect() from this initial hash:

hash(saddr, lport, daddr, dport) == hash(saddr, 0, daddr, dport) + lport

Apply the same principle for __inet_check_established(),
although inet_ehashfn() has a smaller cost.

Tested:

Server: ulimit -n 40000; neper/tcp_crr -T 200 -F 30000 -6 --nolog
Client: ulimit -n 40000; neper/tcp_crr -T 200 -F 30000 -6 --nolog -c -H server

Before this patch:

  utime_start=0.286131
  utime_end=4.378886
  stime_start=11.952556
  stime_end=1991.655533
  num_transactions=1446830
  latency_min=0.001061085
  latency_max=12.075275028
  latency_mean=0.376375302
  latency_stddev=1.361969596
  num_samples=306383
  throughput=151866.56

perf top:

 50.01%  [kernel]       [k] __inet6_check_established
 20.65%  [kernel]       [k] __inet_hash_connect
 15.81%  [kernel]       [k] inet6_ehashfn
  2.92%  [kernel]       [k] rcu_all_qs
  2.34%  [kernel]       [k] __cond_resched
  0.50%  [kernel]       [k] _raw_spin_lock
  0.34%  [kernel]       [k] sched_balance_trigger
  0.24%  [kernel]       [k] queued_spin_lock_slowpath

After this patch:

  utime_start=0.315047
  utime_end=9.257617
  stime_start=7.041489
  stime_end=1923.688387
  num_transactions=3057968
  latency_min=0.003041375
  latency_max=7.056589232
  latency_mean=0.141075048    # Better latency metrics
  latency_stddev=0.526900516
  num_samples=312996
  throughput=320677.21        # 111 % increase, and 229 % for the series

perf top: inet6_ehashfn is no longer seen.

 39.67%  [kernel]       [k] __inet_hash_connect
 37.06%  [kernel]       [k] __inet6_check_established
  4.79%  [kernel]       [k] rcu_all_qs
  3.82%  [kernel]       [k] __cond_resched
  1.76%  [kernel]       [k] sched_balance_domains
  0.82%  [kernel]       [k] _raw_spin_lock
  0.81%  [kernel]       [k] sched_balance_rq
  0.81%  [kernel]       [k] sched_balance_trigger
  0.76%  [kernel]       [k] queued_spin_lock_slowpath

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Tested-by: Jason Xing <kerneljasonxing@gmail.com>
Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
Link: https://patch.msgid.link/20250305034550.879255-3-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-06 15:26:02 -08:00
Florian Westphal
fb8286562e netfilter: nf_tables: make destruction work queue pernet
The call to flush_work before tearing down a table from the netlink
notifier was supposed to make sure that all earlier updates (e.g. rule
add) that might reference that table have been processed.

Unfortunately, flush_work() waits for the last queued instance.
This could be an instance that is different from the one that we must
wait for.

This is because transactions are protected with a pernet mutex, but the
work item is global, so holding the transaction mutex doesn't prevent
another netns from queueing more work.

Make the work item pernet so that flush_work() will wait for all
transactions queued from this netns.

A welcome side effect is that we no longer need to wait for transaction
objects from foreign netns.

The gc work queue is still global.  This seems to be ok because nft_set
structures are reference counted and each container structure owns a
reference on the net namespace.

The destroy_list is still protected by a global spinlock rather than
pernet one but the hold time is very short anyway.

v2: call cancel_work_sync before reaping the remaining tables (Pablo).

Fixes: 9f6958ba2e ("netfilter: nf_tables: unconditionally flush pending work before notifier")
Reported-by: syzbot+5d8c5789c8cb076b2c25@syzkaller.appspotmail.com
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2025-03-06 13:35:54 +01:00
Eric Dumazet
f130a0cc1b inet: fix lwtunnel_valid_encap_type() lock imbalance
After blamed commit rtm_to_fib_config() now calls
lwtunnel_valid_encap_type{_attr}() without RTNL held,
triggering an unlock balance in __rtnl_unlock,
as reported by syzbot [1]

IPv6 and rtm_to_nh_config() are not yet converted.

Add a temporary @rtnl_is_held parameter to lwtunnel_valid_encap_type()
and lwtunnel_valid_encap_type_attr().

While we are at it replace the two rcu_dereference()
in lwtunnel_valid_encap_type() with more appropriate
rcu_access_pointer().

[1]
syz-executor245/5836 is trying to release lock (rtnl_mutex) at:
 [<ffffffff89d0e38c>] __rtnl_unlock+0x6c/0xf0 net/core/rtnetlink.c:142
but there are no more locks to release!

other info that might help us debug this:
no locks held by syz-executor245/5836.

stack backtrace:
CPU: 0 UID: 0 PID: 5836 Comm: syz-executor245 Not tainted 6.14.0-rc4-syzkaller-00873-g3424291dd242 #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 02/12/2025
Call Trace:
 <TASK>
  __dump_stack lib/dump_stack.c:94 [inline]
  dump_stack_lvl+0x241/0x360 lib/dump_stack.c:120
  print_unlock_imbalance_bug+0x25b/0x2d0 kernel/locking/lockdep.c:5289
  __lock_release kernel/locking/lockdep.c:5518 [inline]
  lock_release+0x47e/0xa30 kernel/locking/lockdep.c:5872
  __mutex_unlock_slowpath+0xec/0x800 kernel/locking/mutex.c:891
  __rtnl_unlock+0x6c/0xf0 net/core/rtnetlink.c:142
  lwtunnel_valid_encap_type+0x38a/0x5f0 net/core/lwtunnel.c:169
  lwtunnel_valid_encap_type_attr+0x113/0x270 net/core/lwtunnel.c:209
  rtm_to_fib_config+0x949/0x14e0 net/ipv4/fib_frontend.c:808
  inet_rtm_newroute+0xf6/0x2a0 net/ipv4/fib_frontend.c:917
  rtnetlink_rcv_msg+0x791/0xcf0 net/core/rtnetlink.c:6919
  netlink_rcv_skb+0x206/0x480 net/netlink/af_netlink.c:2534
  netlink_unicast_kernel net/netlink/af_netlink.c:1313 [inline]
  netlink_unicast+0x7f6/0x990 net/netlink/af_netlink.c:1339
  netlink_sendmsg+0x8de/0xcb0 net/netlink/af_netlink.c:1883
  sock_sendmsg_nosec net/socket.c:709 [inline]

Fixes: 1dd2af7963 ("ipv4: fib: Convert RTM_NEWROUTE and RTM_DELROUTE to per-netns RTNL.")
Reported-by: syzbot+3f18ef0f7df107a3f6a0@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/netdev/67c6f87a.050a0220.38b91b.0147.GAE@google.com/T/#u
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250304125918.2763514-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-05 19:16:56 -08:00
Eric Dumazet
86c2bc293b tcp: use RCU lookup in __inet_hash_connect()
When __inet_hash_connect() has to try many 4-tuples before
finding an available one, we see a high spinlock cost from
the many spin_lock_bh(&head->lock) performed in its loop.

This patch adds an RCU lookup to avoid the spinlock cost.

check_established() gets a new @rcu_lookup argument.
First reason is to not make any changes while head->lock
is not held.
Second reason is to not make this RCU lookup a second time
after the spinlock has been acquired.

Tested:

Server:

ulimit -n 40000; neper/tcp_crr -T 200 -F 30000 -6 --nolog

Client:

ulimit -n 40000; neper/tcp_crr -T 200 -F 30000 -6 --nolog -c -H server

Before series:

  utime_start=0.288582
  utime_end=1.548707
  stime_start=20.637138
  stime_end=2002.489845
  num_transactions=484453
  latency_min=0.156279245
  latency_max=20.922042756
  latency_mean=1.546521274
  latency_stddev=3.936005194
  num_samples=312537
  throughput=47426.00

perf top on the client:

 49.54%  [kernel]       [k] _raw_spin_lock
 25.87%  [kernel]       [k] _raw_spin_lock_bh
  5.97%  [kernel]       [k] queued_spin_lock_slowpath
  5.67%  [kernel]       [k] __inet_hash_connect
  3.53%  [kernel]       [k] __inet6_check_established
  3.48%  [kernel]       [k] inet6_ehashfn
  0.64%  [kernel]       [k] rcu_all_qs

After this series:

  utime_start=0.271607
  utime_end=3.847111
  stime_start=18.407684
  stime_end=1997.485557
  num_transactions=1350742
  latency_min=0.014131929
  latency_max=17.895073144
  latency_mean=0.505675853  # Nice reduction of latency metrics
  latency_stddev=2.125164772
  num_samples=307884
  throughput=139866.80      # 190 % increase

perf top on client:

 56.86%  [kernel]       [k] __inet6_check_established
 17.96%  [kernel]       [k] __inet_hash_connect
 13.88%  [kernel]       [k] inet6_ehashfn
  2.52%  [kernel]       [k] rcu_all_qs
  2.01%  [kernel]       [k] __cond_resched
  0.41%  [kernel]       [k] _raw_spin_lock

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
Tested-by: Jason Xing <kerneljasonxing@gmail.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250302124237.3913746-5-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-04 17:46:27 -08:00
Eric Dumazet
d186f405fd tcp: add RCU management to inet_bind_bucket
Add RCU protection to inet_bind_bucket structure.

- Add rcu_head field to the structure definition.

- Use kfree_rcu() at destroy time, and remove inet_bind_bucket_destroy()
  first argument.

- Use hlist_del_rcu() and hlist_add_head_rcu() methods.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250302124237.3913746-4-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-04 17:46:26 -08:00
Jakub Kicinski
71f8992e34 First 6.15 material:
* cfg80211/mac80211
    - remove cooked monitor support
    - strict mode for better AP testing
    - basic EPCS support
    - OMI RX bandwidth reduction support
  * rtw88
    - preparation for RTL8814AU support
  * rtw89
    - use wiphy_lock/wiphy_work
    - preparations for MLO
    - BT-Coex improvements
    - regulatory support in firmware files
  * iwlwifi
    - preparations for the new iwlmld sub-driver
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEpeA8sTs3M8SN2hR410qiO8sPaAAFAmfG9tQACgkQ10qiO8sP
 aAAE1g/9Fg6vrTY1OX0at1iXL5AMqtfAVUJffHGz8SyWJNzKgP0pLtXUTi1Ta2b4
 TmHG0+4fMcwM0nigL0cg/F6vEE5V7qM30b7WMhIKpBejcCklINEIYuB1a6CwfDNF
 7QtE/NA9/oKBy2JftYpm+4mdtlKdwBdkpz3MKE8/BpB1HdmG3IS3/z9+t4jkMOv+
 hRxZKRlqpW98uimc5JjkjMGyt/e4QyHiUt0Z7Ly3dzm0PajvmRy4Idkl3TFMi4Yp
 GgKFTix9/7G4iYFoyK5iNiqfesnzqY6jaI3sxv6h6AG/Yz8EOkBNNnKovPC0lAo/
 ZpkhU63XN10SZXxeSDSiz4tWx3uBxpN8PIRvuaEr+KW7caCjMYXAuB5uE+KvREA3
 hkM8zB5IkT54bLaHJYG6TGQoWxRnzoEDjURBHkEG9DfJs2ixKxdLfe/Rc1lvLqt4
 t5ejc+UQsjICZ31LhrZkNbfmKLtQfFCcBj3ZUHz9RzO1AXmsqudeRcl0j3rDjsqY
 AdYKFEKGCKeBlhr+Cng9VqzJa7gCmgVgkXYjGxJgbL0n3u6oV/S08/7Ssbzf7RXY
 ulFqxEjaexn/vtUHVwMOTxyDttftOfB3Y1IrNUJj3OXWpuXK247OQN6QHM9XWtVy
 xzBOW3g9N0L3EUM4PVmZTGU4SdtuwNxvZbSrToBfnjVu0rESAT4=
 =KYXQ
 -----END PGP SIGNATURE-----

Merge tag 'wireless-next-2025-03-04-v2' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next

Johannes Berg says:

====================
First 6.15 material:
 * cfg80211/mac80211
   - remove cooked monitor support
   - strict mode for better AP testing
   - basic EPCS support
   - OMI RX bandwidth reduction support
 * rtw88
   - preparation for RTL8814AU support
 * rtw89
   - use wiphy_lock/wiphy_work
   - preparations for MLO
   - BT-Coex improvements
   - regulatory support in firmware files
 * iwlwifi
   - preparations for the new iwlmld sub-driver

* tag 'wireless-next-2025-03-04-v2' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next: (128 commits)
  wifi: iwlwifi: remove mld/roc.c
  wifi: mac80211: refactor populating mesh related fields in sinfo
  wifi: cfg80211: reorg sinfo structure elements for mesh
  wifi: iwlwifi: Fix spelling mistake "Increate" -> "Increase"
  wifi: iwlwifi: add Debug Host Command APIs
  wifi: iwlwifi: add IWL_MAX_NUM_IGTKS macro
  wifi: iwlwifi: add OMI bandwidth reduction APIs
  wifi: iwlwifi: remove mvm prefix from iwl_mvm_d3_end_notif
  wifi: iwlwifi: remember if the UATS table was read successfully
  wifi: iwlwifi: export iwl_get_lari_config_bitmap
  wifi: iwlwifi: add support for external 32 KHz clock
  wifi: iwlwifi: mld: add a debug level for EHT prints
  wifi: iwlwifi: mld: add a debug level for PTP prints
  wifi: iwlwifi: remove mvm prefix from iwl_mvm_esr_mode_notif
  wifi: iwlwifi: use 0xff instead of 0xffffffff for invalid
  wifi: iwlwifi: location api cleanup
  wifi: cfg80211: expose update timestamp to drivers
  wifi: mac80211: add ieee80211_iter_chan_contexts_mtx
  wifi: mac80211: fix integer overflow in hwmp_route_info_get()
  wifi: mac80211: Fix possible integer promotion issue
  ...
====================

Link: https://patch.msgid.link/20250304125605.127914-3-johannes@sipsolutions.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-04 08:50:42 -08:00
Geliang Tang
456cc675b6 sock: add sock_kmemdup helper
This patch adds the sock version of kmemdup() helper, named sock_kmemdup(),
to duplicate the input "src" memory block using the socket's option memory
buffer.

Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Acked-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/f828077394c7d1f3560123497348b438c875b510.1740735165.git.tanggeliang@kylinos.cn
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-03 17:16:34 -08:00
Eric Dumazet
e7b9ecce56 tcp: convert to dev_net_rcu()
TCP uses of dev_net() are under RCU protection, change them
to dev_net_rcu() to get LOCKDEP support.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250301201424.2046477-4-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-03 15:44:19 -08:00
Eric Dumazet
a11a791ca8 tcp: add four drop reasons to tcp_check_req()
Use two existing drop reasons in tcp_check_req():

- TCP_RFC7323_PAWS

- TCP_OVERWINDOW

Add two new ones:

- TCP_RFC7323_TSECR (corresponds to LINUX_MIB_TSECRREJECTED)

- TCP_LISTEN_OVERFLOW (when a listener accept queue is full)

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250301201424.2046477-3-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-03 15:44:19 -08:00
Eric Dumazet
e34100c2ec tcp: add a drop_reason pointer to tcp_check_req()
We want to add new drop reasons for packets dropped in 3WHS in the
following patches.

tcp_rcv_state_process() has to set reason to TCP_FASTOPEN,
because tcp_check_req() will conditionally overwrite the drop_reason.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250301201424.2046477-2-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-03 15:44:19 -08:00
Kuniyuki Iwashima
9f7f3ebeba ipv4: fib: Namespacify fib_info hash tables.
We will convert RTM_NEWROUTE and RTM_DELROUTE to per-netns RTNL.
Then, we need to have per-netns hash tables for struct fib_info.

Let's allocate the hash tables per netns.

fib_info_hash, fib_info_hash_bits, and fib_info_cnt are now moved
to struct netns_ipv4 and accessed with net->ipv4.fib_XXX.

Also, the netns checks are removed from fib_find_info_nh() and
fib_find_info().

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/20250228042328.96624-9-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-03 15:04:10 -08:00
Kuniyuki Iwashima
cfc47029fa ipv4: fib: Allocate fib_info_hash[] during netns initialisation.
We will allocate fib_info_hash[] and fib_info_laddrhash[] for each netns.

Currently, fib_info_hash[] is allocated when the first route is added.

Let's move the first allocation to a new __net_init function.

Note that we must call fib4_semantics_exit() in fib_net_exit_batch()
because ->exit() is called earlier than ->exit_batch().

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/20250228042328.96624-4-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-03 15:04:09 -08:00
Sarika Sharma
23ff5f6f23 wifi: cfg80211: reorg sinfo structure elements for mesh
Currently, as multi-link operation(MLO) is not supported for mesh,
reorganize the sinfo structure for mesh-specific fields and embed
mesh related NL attributes together in organized view.
This will allow for the simplified reorganization of sinfo structure
for link level in a subsequent patch to add support for MLO station
statistics.
No functionality changes added.

Pahole summary before the reorg of sinfo structure:
 - size: 256, cachelines: 4, members: 50
 - sum members: 239, holes: 4, sum holes: 17
 - paddings: 2, sum paddings: 2
 - forced alignments: 1, forced holes: 1, sum forced holes: 1

Pahole summary after the reorg of sinfo structure:
 - size: 248, cachelines: 4, members: 50
 - sum members: 239, holes: 4, sum holes: 9
 - paddings: 2, sum paddings: 2
 - forced alignments: 1, last cacheline: 56 bytes

Signed-off-by: Sarika Sharma <quic_sarishar@quicinc.com>
Link: https://patch.msgid.link/20250213171632.1646538-2-quic_sarishar@quicinc.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-02-28 14:08:59 +01:00
Jakub Kicinski
357660d759 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR (net-6.14-rc5).

Conflicts:

drivers/net/ethernet/cadence/macb_main.c
  fa52f15c74 ("net: cadence: macb: Synchronize stats calculations")
  75696dd0fd ("net: cadence: macb: Convert to get_stats64")
https://lore.kernel.org/20250224125848.68ee63e5@canb.auug.org.au

Adjacent changes:

drivers/net/ethernet/intel/ice/ice_sriov.c
  79990cf5e7 ("ice: Fix deinitializing VF in error path")
  a203163274 ("ice: simplify VF MSI-X managing")

net/ipv4/tcp.c
  18912c5206 ("tcp: devmem: don't write truncated dmabuf CMSGs to userspace")
  297d389e9e ("net: prefix devmem specific helpers")

net/mptcp/subflow.c
  8668860b0a ("mptcp: reset when MPTCP opts are dropped after join")
  c3349a22c2 ("mptcp: consolidate subflow cleanup")

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-27 10:20:58 -08:00
Linus Torvalds
1e15510b71 Including fixes from bluetooth. We didn't get netfilter or wireless PRs
this week, so next week's PR is probably going to be bigger. A healthy
 dose of fixes for bugs introduced in the current release nonetheless.
 
 Current release - regressions:
 
  - Bluetooth: always allow SCO packets for user channel
 
  - af_unix: fix memory leak in unix_dgram_sendmsg()
 
  - rxrpc:
    - remove redundant peer->mtu_lock causing lockdep splats
    - fix spinlock flavor issues with the peer record hash
 
  - eth: iavf: fix circular lock dependency with netdev_lock
 
  - net: use rtnl_net_dev_lock() in register_netdevice_notifier_dev_net()
    RDMA driver register notifier after the device
 
 Current release - new code bugs:
 
  - ethtool: fix ioctl confusing drivers about desired HDS user config
 
  - eth: ixgbe: fix media cage present detection for E610 device
 
 Previous releases - regressions:
 
  - loopback: avoid sending IP packets without an Ethernet header
 
  - mptcp: reset connection when MPTCP opts are dropped after join
 
 Previous releases - always broken:
 
  - net: better track kernel sockets lifetime
 
  - ipv6: fix dst ref loop on input in seg6 and rpl lw tunnels
 
  - phy: qca807x: use right value from DTS for DAC_DSP_BIAS_CURRENT
 
  - eth: enetc: number of error handling fixes
 
  - dsa: rtl8366rb: reshuffle the code to fix config / build issue
    with LED support
 
 Signed-off-by: Jakub Kicinski <kuba@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmfAj8MACgkQMUZtbf5S
 IrtoTRAAj0XNWXGWZdOuVub0xhtjsPLoZktux4AzsELqaynextkJW6w9pG5qVrWu
 UZt3a3bC7u6+JoTgb+GQVhyjuuVjv6NOSuLK3FS+NePW8ijhLP5oTg6eD0MQS60Z
 wa9yQx3yL1Kvb6b80Go/3WgRX9V6Rx8zlROAl/gOlZ9NKB0rSVqnueZGPjGZJf1a
 ayyXsmzRykshbr5Ic0e+b74hFP3DGxVgHjIob1C4kk/Q+WOfQKnm3C3fnZ/R2QcS
 7B7kSk9WokvNwk3hJc7ZtFxJbrQKSSuRI8nCD93hBjTn76yJjlPicJ9b6HJoGhE/
 Pwt7fBnDCCA00x6ejD3OrurR+/80PbPtyvNtgMMTD49wSwxQpQ6YpTMInnodCzAV
 NvIhkkXBprI0kiTT4dDpNoeFMKD3i07etKpvMfEoDzZR7vgUsj6aClSmuxILeU9a
 crFC4Vp5SgyU1/lUPDiG4dfbd8s4hfM4bZ+d0zAtth3/rQA7/EA6dLqbRXXWX7h5
 Gl6egKWPsSl+WUgFjpBjYfhqrQsc06hxaCh0SQYH6SnS3i+PlMU2uRJYZMLQ66rX
 QsSQOyqCEHwd1qnrLedg9rCniv+DzOJf+qh+H0eY9WhuOay+8T52OHLxpRjSHxBo
 SCP+qQxSX0qhH5DtUiOV50Fwg19UhJJyWd0COfv5SIGm/I1dUOY=
 =+Ci7
 -----END PGP SIGNATURE-----

Merge tag 'net-6.14-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Pull networking fixes from Jakub Kicinski:
 "Including fixes from bluetooth.

  We didn't get netfilter or wireless PRs this week, so next week's PR
  is probably going to be bigger. A healthy dose of fixes for bugs
  introduced in the current release nonetheless.

  Current release - regressions:

   - Bluetooth: always allow SCO packets for user channel

   - af_unix: fix memory leak in unix_dgram_sendmsg()

   - rxrpc:
       - remove redundant peer->mtu_lock causing lockdep splats
       - fix spinlock flavor issues with the peer record hash

   - eth: iavf: fix circular lock dependency with netdev_lock

   - net: use rtnl_net_dev_lock() in
     register_netdevice_notifier_dev_net() RDMA driver register notifier
     after the device

  Current release - new code bugs:

   - ethtool: fix ioctl confusing drivers about desired HDS user config

   - eth: ixgbe: fix media cage present detection for E610 device

  Previous releases - regressions:

   - loopback: avoid sending IP packets without an Ethernet header

   - mptcp: reset connection when MPTCP opts are dropped after join

  Previous releases - always broken:

   - net: better track kernel sockets lifetime

   - ipv6: fix dst ref loop on input in seg6 and rpl lw tunnels

   - phy: qca807x: use right value from DTS for DAC_DSP_BIAS_CURRENT

   - eth: enetc: number of error handling fixes

   - dsa: rtl8366rb: reshuffle the code to fix config / build issue with
     LED support"

* tag 'net-6.14-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (53 commits)
  net: ti: icss-iep: Reject perout generation request
  idpf: fix checksums set in idpf_rx_rsc()
  selftests: drv-net: Check if combined-count exists
  net: ipv6: fix dst ref loop on input in rpl lwt
  net: ipv6: fix dst ref loop on input in seg6 lwt
  usbnet: gl620a: fix endpoint checking in genelink_bind()
  net/mlx5: IRQ, Fix null string in debug print
  net/mlx5: Restore missing trace event when enabling vport QoS
  net/mlx5: Fix vport QoS cleanup on error
  net: mvpp2: cls: Fixed Non IP flow, with vlan tag flow defination.
  af_unix: Fix memory leak in unix_dgram_sendmsg()
  net: Handle napi_schedule() calls from non-interrupt
  net: Clear old fragment checksum value in napi_reuse_skb
  gve: unlink old napi when stopping a queue using queue API
  net: Use rtnl_net_dev_lock() in register_netdevice_notifier_dev_net().
  tcp: Defer ts_recent changes until req is owned
  net: enetc: fix the off-by-one issue in enetc_map_tx_tso_buffs()
  net: enetc: remove the mm_lock from the ENETC v4 driver
  net: enetc: add missing enetc4_link_deinit()
  net: enetc: update UDP checksum when updating originTimestamp field
  ...
2025-02-27 09:32:42 -08:00
Alexander Lobakin
b696d289c0 xdp: remove xdp_alloc_skb_bulk()
The only user was veth, which now uses napi_skb_cache_get_bulk().
It's now preferred over a direct allocation and is exported as
well, so remove this one.

Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-02-27 14:03:52 +01:00
Alexander Lobakin
388d31417c net: gro: expose GRO init/cleanup to use outside of NAPI
Make GRO init and cleanup functions global to be able to use GRO
without a NAPI instance. Taking into account already global gro_flush(),
it's now fully usable standalone.
New functions are not exported, since they're not supposed to be used
outside of the kernel core code.

Tested-by: Daniel Xu <dxu@dxuuu.xyz>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-02-27 14:03:14 +01:00
Alexander Lobakin
291515c764 net: gro: decouple GRO from the NAPI layer
In fact, these two are not tied closely to each other. The only
requirements to GRO are to use it in the BH context and have some
sane limits on the packet batches, e.g. NAPI has a limit of its
budget (64/8/etc.).
Move purely GRO fields into a new structure, &gro_node. Embed it
into &napi_struct and adjust all the references.
gro_node::cached_napi_id is effectively the same as
napi_struct::napi_id, but to be used on GRO hotpath to mark skbs.
napi_struct::napi_id is now a fully control path field.

Three Ethernet drivers use napi_gro_flush() not really meant to be
exported, so move it to <net/gro.h> and add that include there.
napi_gro_receive() is used in more than 100 drivers, keep it
in <linux/netdevice.h>.
This does not make GRO ready to use outside of the NAPI context
yet.

Tested-by: Daniel Xu <dxu@dxuuu.xyz>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-02-27 14:03:14 +01:00
Benjamin Berg
ceaad3c435 wifi: cfg80211: expose update timestamp to drivers
This information is exposed to userspace but not drivers. Make this
field public so that drivers are also able to access it. The information
is for example useful for link selection to determine whether the BSS
corresponding to an MLO link has been seen in a recent scan.

Signed-off-by: Benjamin Berg <benjamin.berg@intel.com>
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
Link: https://patch.msgid.link/20250212082137.b682ee7aebc8.I0f7cca9effa2b1cee79f4f2eb8b549c99b4e0571@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-02-26 15:48:54 +01:00
Miri Korenblit
ebba23e077 wifi: mac80211: add ieee80211_iter_chan_contexts_mtx
Add a chanctx iterator that can be called from a wiphy-locked context.

Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
Link: https://patch.msgid.link/20250212082137.d85eef3024de.Icda0616416c5fd4b2cbf892bdab2476f26e644ec@changeid
[fix kernel-doc]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-02-26 15:48:47 +01:00
Leon Romanovsky
230804a893 Merge branch 'mlx5-next' into wip/leon-for-next
This is merge of shared branch between RDMA and net-next trees.

* mlx5-next: (550 commits)
  net/mlx5: Change POOL_NEXT_SIZE define value and make it global
  net/mlx5: Add new health syndrome error and crr bit offset
  Linux 6.14-rc3
  ...

Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-02-25 04:04:46 -05:00
Geliang Tang
9771a96a7a mptcp: sched: split get_subflow interface into two
get_retrans() interface of the burst packet scheduler invokes a sleeping
function mptcp_pm_subflow_chk_stale(), which calls __lock_sock_fast().
So get_retrans() interface should be set with BPF_F_SLEEPABLE flag in
BPF. But get_send() interface of this scheduler can't be set with
BPF_F_SLEEPABLE flag since it's invoked in ack_update_msk() under mptcp
data lock.

So this patch has to split get_subflow() interface of packet scheduer into
two interfaces: get_send() and get_retrans(). Then we can set get_retrans()
interface alone with BPF_F_SLEEPABLE flag.

Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn>
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20250221-net-next-mptcp-pm-misc-cleanup-3-v1-8-2b70ab1cee79@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-24 18:23:44 -08:00
Eric Dumazet
5c70eb5c59 net: better track kernel sockets lifetime
While kernel sockets are dismantled during pernet_operations->exit(),
their freeing can be delayed by any tx packets still held in qdisc
or device queues, due to skb_set_owner_w() prior calls.

This then trigger the following warning from ref_tracker_dir_exit() [1]

To fix this, make sure that kernel sockets own a reference on net->passive.

Add sk_net_refcnt_upgrade() helper, used whenever a kernel socket
is converted to a refcounted one.

[1]

[  136.263918][   T35] ref_tracker: net notrefcnt@ffff8880638f01e0 has 1/2 users at
[  136.263918][   T35]      sk_alloc+0x2b3/0x370
[  136.263918][   T35]      inet6_create+0x6ce/0x10f0
[  136.263918][   T35]      __sock_create+0x4c0/0xa30
[  136.263918][   T35]      inet_ctl_sock_create+0xc2/0x250
[  136.263918][   T35]      igmp6_net_init+0x39/0x390
[  136.263918][   T35]      ops_init+0x31e/0x590
[  136.263918][   T35]      setup_net+0x287/0x9e0
[  136.263918][   T35]      copy_net_ns+0x33f/0x570
[  136.263918][   T35]      create_new_namespaces+0x425/0x7b0
[  136.263918][   T35]      unshare_nsproxy_namespaces+0x124/0x180
[  136.263918][   T35]      ksys_unshare+0x57d/0xa70
[  136.263918][   T35]      __x64_sys_unshare+0x38/0x40
[  136.263918][   T35]      do_syscall_64+0xf3/0x230
[  136.263918][   T35]      entry_SYSCALL_64_after_hwframe+0x77/0x7f
[  136.263918][   T35]
[  136.343488][   T35] ref_tracker: net notrefcnt@ffff8880638f01e0 has 1/2 users at
[  136.343488][   T35]      sk_alloc+0x2b3/0x370
[  136.343488][   T35]      inet6_create+0x6ce/0x10f0
[  136.343488][   T35]      __sock_create+0x4c0/0xa30
[  136.343488][   T35]      inet_ctl_sock_create+0xc2/0x250
[  136.343488][   T35]      ndisc_net_init+0xa7/0x2b0
[  136.343488][   T35]      ops_init+0x31e/0x590
[  136.343488][   T35]      setup_net+0x287/0x9e0
[  136.343488][   T35]      copy_net_ns+0x33f/0x570
[  136.343488][   T35]      create_new_namespaces+0x425/0x7b0
[  136.343488][   T35]      unshare_nsproxy_namespaces+0x124/0x180
[  136.343488][   T35]      ksys_unshare+0x57d/0xa70
[  136.343488][   T35]      __x64_sys_unshare+0x38/0x40
[  136.343488][   T35]      do_syscall_64+0xf3/0x230
[  136.343488][   T35]      entry_SYSCALL_64_after_hwframe+0x77/0x7f

Fixes: 0cafd77dcd ("net: add a refcount tracker for kernel sockets")
Reported-by: syzbot+30a19e01a97420719891@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/netdev/67b72aeb.050a0220.14d86d.0283.GAE@google.com/T/#u
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250220131854.4048077-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-21 16:00:58 -08:00
Jakub Kicinski
e87700965a bpf-next-for-netdev
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQQ6NaUOruQGUkvPdG4raS+Z+3y5EwUCZ7ffOQAKCRAraS+Z+3y5
 EzVHAP9h/QkeYoOZW9gul08I8vFiZsFe/lbOSLJWxeVfxb9JhgD/cMqby3qAxQK6
 lsdNQ9jYG2232Wym89ag7fvTBK15Wg4=
 =gkN2
 -----END PGP SIGNATURE-----

Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next

Martin KaFai Lau says:

====================
pull-request: bpf-next 2025-02-20

We've added 19 non-merge commits during the last 8 day(s) which contain
a total of 35 files changed, 1126 insertions(+), 53 deletions(-).

The main changes are:

1) Add TCP_RTO_MAX_MS support to bpf_set/getsockopt, from Jason Xing

2) Add network TX timestamping support to BPF sock_ops, from Jason Xing

3) Add TX metadata Launch Time support, from Song Yoong Siang

* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next:
  igc: Add launch time support to XDP ZC
  igc: Refactor empty frame insertion for launch time support
  net: stmmac: Add launch time support to XDP ZC
  selftests/bpf: Add launch time request to xdp_hw_metadata
  xsk: Add launch time hardware offload support to XDP Tx metadata
  selftests/bpf: Add simple bpf tests in the tx path for timestamping feature
  bpf: Support selective sampling for bpf timestamping
  bpf: Add BPF_SOCK_OPS_TSTAMP_SENDMSG_CB callback
  bpf: Add BPF_SOCK_OPS_TSTAMP_ACK_CB callback
  bpf: Add BPF_SOCK_OPS_TSTAMP_SND_HW_CB callback
  bpf: Add BPF_SOCK_OPS_TSTAMP_SND_SW_CB callback
  bpf: Add BPF_SOCK_OPS_TSTAMP_SCHED_CB callback
  net-timestamp: Prepare for isolating two modes of SO_TIMESTAMPING
  bpf: Disable unsafe helpers in TX timestamping callbacks
  bpf: Prevent unsafe access to the sock fields in the BPF timestamping callback
  bpf: Prepare the sock_ops ctx and call bpf prog for TX timestamping
  bpf: Add networking timestamping support to bpf_get/setsockopt()
  selftests/bpf: Add rto max for bpf_setsockopt test
  bpf: Support TCP_RTO_MAX_MS for bpf_setsockopt
====================

Link: https://patch.msgid.link/20250221022104.386462-1-martin.lau@linux.dev
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-21 15:59:47 -08:00
Xiao Liang
9c0fc091dc rtnetlink: Remove "net" from newlink params
Now that devices have been converted to use the specific netns instead
of ambiguous "net", let's remove it from newlink parameters.

Signed-off-by: Xiao Liang <shaw.leon@gmail.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250219125039.18024-11-shaw.leon@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-21 15:28:03 -08:00
Xiao Liang
eacb116053 net: ip_tunnel: Use link netns in newlink() of rtnl_link_ops
When link_net is set, use it as link netns instead of dev_net(). This
prepares for rtnetlink core to create device in target netns directly,
in which case the two namespaces may be different.

Convert common ip_tunnel_newlink() to accept an extra link netns
argument.

Signed-off-by: Xiao Liang <shaw.leon@gmail.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250219125039.18024-7-shaw.leon@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-21 15:28:02 -08:00
Xiao Liang
cf517ac16a net: Use link/peer netns in newlink() of rtnl_link_ops
Add two helper functions - rtnl_newlink_link_net() and
rtnl_newlink_peer_net() for netns fallback logic. Peer netns falls back
to link netns, and link netns falls back to source netns.

Convert the use of params->net in netdevice drivers to one of the helper
functions for clarity.

Signed-off-by: Xiao Liang <shaw.leon@gmail.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250219125039.18024-4-shaw.leon@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-21 15:28:02 -08:00
Xiao Liang
69c7be1b90 rtnetlink: Pack newlink() params into struct
There are 4 net namespaces involved when creating links:

 - source netns - where the netlink socket resides,
 - target netns - where to put the device being created,
 - link netns - netns associated with the device (backend),
 - peer netns - netns of peer device.

Currently, two nets are passed to newlink() callback - "src_net"
parameter and "dev_net" (implicitly in net_device). They are set as
follows, depending on netlink attributes in the request.

 +------------+-------------------+---------+---------+
 | peer netns | IFLA_LINK_NETNSID | src_net | dev_net |
 +------------+-------------------+---------+---------+
 |            | absent            | source  | target  |
 | absent     +-------------------+---------+---------+
 |            | present           | link    | link    |
 +------------+-------------------+---------+---------+
 |            | absent            | peer    | target  |
 | present    +-------------------+---------+---------+
 |            | present           | peer    | link    |
 +------------+-------------------+---------+---------+

When IFLA_LINK_NETNSID is present, the device is created in link netns
first and then moved to target netns. This has some side effects,
including extra ifindex allocation, ifname validation and link events.
These could be avoided if we create it in target netns from
the beginning.

On the other hand, the meaning of src_net parameter is ambiguous. It
varies depending on how parameters are passed. It is the effective
link (or peer netns) by design, but some drivers ignore it and use
dev_net instead.

To provide more netns context for drivers, this patch packs existing
newlink() parameters, along with the source netns, link netns and peer
netns, into a struct. The old "src_net" is renamed to "net" to avoid
confusion with real source netns, and will be deprecated later. The use
of src_net are converted to params->net trivially.

Signed-off-by: Xiao Liang <shaw.leon@gmail.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250219125039.18024-3-shaw.leon@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-21 15:28:02 -08:00
Leon Romanovsky
ca70c104e1 xfrm: check for PMTU in tunnel mode for packet offload
In tunnel mode, for the packet offload, there were no PMTU signaling
to the upper level about need to fragment the packet. As a solution,
call to already existing xfrm[4|6]_tunnel_check_size() to perform that.

Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2025-02-21 08:08:15 +01:00
Leon Romanovsky
b6ccf61aa4 xfrm: simplify SA initialization routine
SA replay mode is initialized differently for user-space and
kernel-space users, but the call to xfrm_init_replay() existed in
common path with boolean protection. That caused to situation where
we have two different function orders.

So let's rewrite the SA initialization flow to have same order for
both in-kernel and user-space callers.

Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2025-02-21 08:08:15 +01:00
Leon Romanovsky
585b64f5a6 xfrm: delay initialization of offload path till its actually requested
XFRM offload path is probed even if offload isn't needed at all. Let's
make sure that x->type_offload pointer stays NULL for such path to
reduce ambiguity.

Fixes: 9d389d7f84 ("xfrm: Add a xfrm type offload.")
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2025-02-21 08:08:14 +01:00
Linus Torvalds
319fc77f8f BPF fixes:
- Fix a soft-lockup in BPF arena_map_free on 64k page size
   kernels (Alan Maguire)
 
 - Fix a missing allocation failure check in BPF verifier's
   acquire_lock_state (Kumar Kartikeya Dwivedi)
 
 - Fix a NULL-pointer dereference in trace_kfree_skb by adding
   kfree_skb to the raw_tp_null_args set (Kuniyuki Iwashima)
 
 - Fix a deadlock when freeing BPF cgroup storage (Abel Wu)
 
 - Fix a syzbot-reported deadlock when holding BPF map's
   freeze_mutex (Andrii Nakryiko)
 
 - Fix a use-after-free issue in bpf_test_init when
   eth_skb_pkt_type is accessing skb data not containing an
   Ethernet header (Shigeru Yoshida)
 
 - Fix skipping non-existing keys in generic_map_lookup_batch
   (Yan Zhai)
 
 - Several BPF sockmap fixes to address incorrect TCP copied_seq
   calculations, which prevented correct data reads from recv(2)
   in user space (Jiayuan Chen)
 
 - Two fixes for BPF map lookup nullness elision (Daniel Xu)
 
 - Fix a NULL-pointer dereference from vmlinux BTF lookup in
   bpf_sk_storage_tracing_allowed (Jared Kangas)
 
 Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
 -----BEGIN PGP SIGNATURE-----
 
 iIsEABYKADMWIQTFp0I1jqZrAX+hPRXbK58LschIgwUCZ7evlRUcZGFuaWVsQGlv
 Z2VhcmJveC5uZXQACgkQ2yufC7HISIPwHgD/dTvM00Ck4Q73fPivyT7tcqxeXJlD
 D6ggzWl/SG9LAbwA/2/cSgAM9Jm1g7ddvn/S9QaDYOs5GmFl6urq6krs+tYD
 =FCs9
 -----END PGP SIGNATURE-----

Merge tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf

Pull BPF fixes from Daniel Borkmann:

 - Fix a soft-lockup in BPF arena_map_free on 64k page size kernels
   (Alan Maguire)

 - Fix a missing allocation failure check in BPF verifier's
   acquire_lock_state (Kumar Kartikeya Dwivedi)

 - Fix a NULL-pointer dereference in trace_kfree_skb by adding kfree_skb
   to the raw_tp_null_args set (Kuniyuki Iwashima)

 - Fix a deadlock when freeing BPF cgroup storage (Abel Wu)

 - Fix a syzbot-reported deadlock when holding BPF map's freeze_mutex
   (Andrii Nakryiko)

 - Fix a use-after-free issue in bpf_test_init when eth_skb_pkt_type is
   accessing skb data not containing an Ethernet header (Shigeru
   Yoshida)

 - Fix skipping non-existing keys in generic_map_lookup_batch (Yan Zhai)

 - Several BPF sockmap fixes to address incorrect TCP copied_seq
   calculations, which prevented correct data reads from recv(2) in user
   space (Jiayuan Chen)

 - Two fixes for BPF map lookup nullness elision (Daniel Xu)

 - Fix a NULL-pointer dereference from vmlinux BTF lookup in
   bpf_sk_storage_tracing_allowed (Jared Kangas)

* tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
  selftests: bpf: test batch lookup on array of maps with holes
  bpf: skip non exist keys in generic_map_lookup_batch
  bpf: Handle allocation failure in acquire_lock_state
  bpf: verifier: Disambiguate get_constant_map_key() errors
  bpf: selftests: Test constant key extraction on irrelevant maps
  bpf: verifier: Do not extract constant map keys for irrelevant maps
  bpf: Fix softlockup in arena_map_free on 64k page kernel
  net: Add rx_skb of kfree_skb to raw_tp_null_args[].
  bpf: Fix deadlock when freeing cgroup storage
  selftests/bpf: Add strparser test for bpf
  selftests/bpf: Fix invalid flag of recv()
  bpf: Disable non stream socket for strparser
  bpf: Fix wrong copied_seq calculation
  strparser: Add read_sock callback
  bpf: avoid holding freeze_mutex during mmap operation
  bpf: unify VM_WRITE vs VM_MAYWRITE use in BPF map mmaping logic
  selftests/bpf: Adjust data size to have ETH_HLEN
  bpf, test_run: Fix use-after-free issue in eth_skb_pkt_type()
  bpf: Remove unnecessary BTF lookups in bpf_sk_storage_tracing_allowed
2025-02-20 15:37:17 -08:00
Song Yoong Siang
ca4419f15a xsk: Add launch time hardware offload support to XDP Tx metadata
Extend the XDP Tx metadata framework so that user can requests launch time
hardware offload, where the Ethernet device will schedule the packet for
transmission at a pre-determined time called launch time. The value of
launch time is communicated from user space to Ethernet driver via
launch_time field of struct xsk_tx_metadata.

Suggested-by: Stanislav Fomichev <sdf@fomichev.me>
Signed-off-by: Song Yoong Siang <yoong.siang.song@intel.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Link: https://patch.msgid.link/20250216093430.957880-2-yoong.siang.song@intel.com
2025-02-20 15:13:45 -08:00
Jason Xing
b3b81e6b00 bpf: Add BPF_SOCK_OPS_TSTAMP_ACK_CB callback
Support the ACK case for bpf timestamping.

Add a new sock_ops callback, BPF_SOCK_OPS_TSTAMP_ACK_CB. This
callback will occur at the same timestamping point as the user
space's SCM_TSTAMP_ACK. The BPF program can use it to get the
same SCM_TSTAMP_ACK timestamp without modifying the user-space
application.

This patch extends txstamp_ack to two bits: 1 stands for
SO_TIMESTAMPING mode, 2 bpf extension.

Signed-off-by: Jason Xing <kerneljasonxing@gmail.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20250220072940.99994-10-kerneljasonxing@gmail.com
2025-02-20 14:29:43 -08:00
Jason Xing
fd93eaffb3 bpf: Prevent unsafe access to the sock fields in the BPF timestamping callback
The subsequent patch will implement BPF TX timestamping. It will
call the sockops BPF program without holding the sock lock.

This breaks the current assumption that all sock ops programs will
hold the sock lock. The sock's fields of the uapi's bpf_sock_ops
requires this assumption.

To address this, a new "u8 is_locked_tcp_sock;" field is added. This
patch sets it in the current sock_ops callbacks. The "is_fullsock"
test is then replaced by the "is_locked_tcp_sock" test during
sock_ops_convert_ctx_access().

The new TX timestamping callbacks added in the subsequent patch will
not have this set. This will prevent unsafe access from the new
timestamping callbacks.

Potentially, we could allow read-only access. However, this would
require identifying which callback is read-safe-only and also requires
additional BPF instruction rewrites in the covert_ctx. Since the BPF
program can always read everything from a socket (e.g., by using
bpf_core_cast), this patch keeps it simple and disables all read
and write access to any socket fields through the bpf_sock_ops
UAPI from the new TX timestamping callback.

Moreover, note that some of the fields in bpf_sock_ops are specific
to tcp_sock, and sock_ops currently only supports tcp_sock. In
the future, UDP timestamping will be added, which will also break
this assumption. The same idea used in this patch will be reused.
Considering that the current sock_ops only supports tcp_sock, the
variable is named is_locked_"tcp"_sock.

Signed-off-by: Jason Xing <kerneljasonxing@gmail.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://patch.msgid.link/20250220072940.99994-4-kerneljasonxing@gmail.com
2025-02-20 14:29:02 -08:00
Jason Xing
df600f3b1d bpf: Prepare the sock_ops ctx and call bpf prog for TX timestamping
This patch introduces a new bpf_skops_tx_timestamping() function
that prepares the "struct bpf_sock_ops" ctx and then executes the
sockops BPF program.

The subsequent patch will utilize bpf_skops_tx_timestamping() at
the existing TX timestamping kernel callbacks (__sk_tstamp_tx
specifically) to call the sockops BPF program. Later, four callback
points to report information to user space based on this patch will
be introduced.

Signed-off-by: Jason Xing <kerneljasonxing@gmail.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://patch.msgid.link/20250220072940.99994-3-kerneljasonxing@gmail.com
2025-02-20 14:28:54 -08:00
Jason Xing
24e82b7c04 bpf: Add networking timestamping support to bpf_get/setsockopt()
The new SK_BPF_CB_FLAGS and new SK_BPF_CB_TX_TIMESTAMPING are
added to bpf_get/setsockopt. The later patches will implement the
BPF networking timestamping. The BPF program will use
bpf_setsockopt(SK_BPF_CB_FLAGS, SK_BPF_CB_TX_TIMESTAMPING) to
enable the BPF networking timestamping on a socket.

Signed-off-by: Jason Xing <kerneljasonxing@gmail.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20250220072940.99994-2-kerneljasonxing@gmail.com
2025-02-20 14:28:37 -08:00
Gal Pressman
bb5e62f2d5 net: Add options as a flexible array to struct ip_tunnel_info
Remove the hidden assumption that options are allocated at the end of
the struct, and teach the compiler about them using a flexible array.

With this, we can revert the unsafe_memcpy() call we have in
tun_dst_unclone() [1], and resolve the false field-spanning write
warning caused by the memcpy() in ip_tunnel_info_opts_set().

The layout of struct ip_tunnel_info remains the same with this patch.
Before this patch, there was an implicit padding at the end of the
struct, options would be written at 'info + 1' which is after the
padding.
This will remain the same as this patch explicitly aligns 'options'.
The alignment is needed as the options are later casted to different
structs, and might result in unaligned memory access.

Pahole output before this patch:
struct ip_tunnel_info {
    struct ip_tunnel_key       key;                  /*     0    64 */

    /* XXX last struct has 1 byte of padding */

    /* --- cacheline 1 boundary (64 bytes) --- */
    struct ip_tunnel_encap     encap;                /*    64     8 */
    struct dst_cache           dst_cache;            /*    72    16 */
    u8                         options_len;          /*    88     1 */
    u8                         mode;                 /*    89     1 */

    /* size: 96, cachelines: 2, members: 5 */
    /* padding: 6 */
    /* paddings: 1, sum paddings: 1 */
    /* last cacheline: 32 bytes */
};

Pahole output after this patch:
struct ip_tunnel_info {
    struct ip_tunnel_key       key;                  /*     0    64 */

    /* XXX last struct has 1 byte of padding */

    /* --- cacheline 1 boundary (64 bytes) --- */
    struct ip_tunnel_encap     encap;                /*    64     8 */
    struct dst_cache           dst_cache;            /*    72    16 */
    u8                         options_len;          /*    88     1 */
    u8                         mode;                 /*    89     1 */

    /* XXX 6 bytes hole, try to pack */

    u8                         options[] __attribute__((__aligned__(16))); /*    96     0 */

    /* size: 96, cachelines: 2, members: 6 */
    /* sum members: 90, holes: 1, sum holes: 6 */
    /* paddings: 1, sum paddings: 1 */
    /* forced alignments: 1, forced holes: 1, sum forced holes: 6 */
    /* last cacheline: 32 bytes */
} __attribute__((__aligned__(16)));

[1] Commit 13cfd6a6d7 ("net: Silence false field-spanning write warning in metadata_dst memcpy")

Link: https://lore.kernel.org/all/53D1D353-B8F6-4ADC-8F29-8C48A7C9C6F1@kernel.org/
Suggested-by: Kees Cook <kees@kernel.org>
Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Gal Pressman <gal@nvidia.com>
Reviewed-by: Kees Cook <kees@kernel.org>
Link: https://patch.msgid.link/20250219143256.370277-3-gal@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-20 13:17:16 -08:00
Gal Pressman
ba3fa6e8c1 ip_tunnel: Use ip_tunnel_info() helper instead of 'info + 1'
Tunnel options should not be accessed directly, use the ip_tunnel_info()
accessor instead.

Signed-off-by: Gal Pressman <gal@nvidia.com>
Reviewed-by: Kees Cook <kees@kernel.org>
Link: https://patch.msgid.link/20250219143256.370277-2-gal@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-20 13:17:16 -08:00
Jakub Kicinski
5d6ba5ab85 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR (net-6.14-rc4).

No conflicts or adjacent changes.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-20 10:37:30 -08:00
Paolo Abeni
14ad6ed30a net: allow small head cache usage with large MAX_SKB_FRAGS values
Sabrina reported the following splat:

    WARNING: CPU: 0 PID: 1 at net/core/dev.c:6935 netif_napi_add_weight_locked+0x8f2/0xba0
    Modules linked in:
    CPU: 0 UID: 0 PID: 1 Comm: swapper/0 Not tainted 6.14.0-rc1-net-00092-g011b03359038 #996
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Arch Linux 1.16.3-1-1 04/01/2014
    RIP: 0010:netif_napi_add_weight_locked+0x8f2/0xba0
    Code: e8 c3 e6 6a fe 48 83 c4 28 5b 5d 41 5c 41 5d 41 5e 41 5f c3 cc cc cc cc c7 44 24 10 ff ff ff ff e9 8f fb ff ff e8 9e e6 6a fe <0f> 0b e9 d3 fe ff ff e8 92 e6 6a fe 48 8b 04 24 be ff ff ff ff 48
    RSP: 0000:ffffc9000001fc60 EFLAGS: 00010293
    RAX: 0000000000000000 RBX: ffff88806ce48128 RCX: 1ffff11001664b9e
    RDX: ffff888008f00040 RSI: ffffffff8317ca42 RDI: ffff88800b325cb6
    RBP: ffff88800b325c40 R08: 0000000000000001 R09: ffffed100167502c
    R10: ffff88800b3a8163 R11: 0000000000000000 R12: ffff88800ac1c168
    R13: ffff88800ac1c168 R14: ffff88800ac1c168 R15: 0000000000000007
    FS:  0000000000000000(0000) GS:ffff88806ce00000(0000) knlGS:0000000000000000
    CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: ffff888008201000 CR3: 0000000004c94001 CR4: 0000000000370ef0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    Call Trace:
    <TASK>
    gro_cells_init+0x1ba/0x270
    xfrm_input_init+0x4b/0x2a0
    xfrm_init+0x38/0x50
    ip_rt_init+0x2d7/0x350
    ip_init+0xf/0x20
    inet_init+0x406/0x590
    do_one_initcall+0x9d/0x2e0
    do_initcalls+0x23b/0x280
    kernel_init_freeable+0x445/0x490
    kernel_init+0x20/0x1d0
    ret_from_fork+0x46/0x80
    ret_from_fork_asm+0x1a/0x30
    </TASK>
    irq event stamp: 584330
    hardirqs last  enabled at (584338): [<ffffffff8168bf87>] __up_console_sem+0x77/0xb0
    hardirqs last disabled at (584345): [<ffffffff8168bf6c>] __up_console_sem+0x5c/0xb0
    softirqs last  enabled at (583242): [<ffffffff833ee96d>] netlink_insert+0x14d/0x470
    softirqs last disabled at (583754): [<ffffffff8317c8cd>] netif_napi_add_weight_locked+0x77d/0xba0

on kernel built with MAX_SKB_FRAGS=45, where SKB_WITH_OVERHEAD(1024)
is smaller than GRO_MAX_HEAD.

Such built additionally contains the revert of the single page frag cache
so that napi_get_frags() ends up using the page frag allocator, triggering
the splat.

Note that the underlying issue is independent from the mentioned
revert; address it ensuring that the small head cache will fit either TCP
and GRO allocation and updating napi_alloc_skb() and __netdev_alloc_skb()
to select kmalloc() usage for any allocation fitting such cache.

Reported-by: Sabrina Dubroca <sd@queasysnail.net>
Suggested-by: Eric Dumazet <edumazet@google.com>
Fixes: 3948b05950 ("net: introduce a config option to tweak MAX_SKB_FRAGS")
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-02-20 10:53:17 +01:00
Sabrina Dubroca
9b6412e697 tcp: drop secpath at the same time as we currently drop dst
Xiumei reported hitting the WARN in xfrm6_tunnel_net_exit while
running tests that boil down to:
 - create a pair of netns
 - run a basic TCP test over ipcomp6
 - delete the pair of netns

The xfrm_state found on spi_byaddr was not deleted at the time we
delete the netns, because we still have a reference on it. This
lingering reference comes from a secpath (which holds a ref on the
xfrm_state), which is still attached to an skb. This skb is not
leaked, it ends up on sk_receive_queue and then gets defer-free'd by
skb_attempt_defer_free.

The problem happens when we defer freeing an skb (push it on one CPU's
defer_list), and don't flush that list before the netns is deleted. In
that case, we still have a reference on the xfrm_state that we don't
expect at this point.

We already drop the skb's dst in the TCP receive path when it's no
longer needed, so let's also drop the secpath. At this point,
tcp_filter has already called into the LSM hooks that may require the
secpath, so it should not be needed anymore. However, in some of those
places, the MPTCP extension has just been attached to the skb, so we
cannot simply drop all extensions.

Fixes: 68822bdf76 ("net: generalize skb freeing deferral to per-cpu lists")
Reported-by: Xiumei Mu <xmu@redhat.com>
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/5055ba8f8f72bdcb602faa299faca73c280b7735.1739743613.git.sd@queasysnail.net
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-02-20 09:24:08 +01:00
Paolo Abeni
c8802ded46 net: dismiss sk_forward_alloc_get()
After the previous patch we can remove the forward_alloc_get
proto callback, basically reverting commit 292e6077b0 ("net: introduce
sk_forward_alloc_get()") and commit 66d58f046c ("net: use
sk_forward_alloc_get() in sk_get_meminfo()").

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Acked-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20250218-net-next-mptcp-rx-path-refactor-v1-5-4a47d90d7998@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-19 19:05:28 -08:00
Ido Schimmel
79a4e21584 ipv4: fib_rules: Add port mask matching
Extend IPv4 FIB rules to match on source and destination ports using a
mask. Note that the mask is only set when not matching on a range.

Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Guillaume Nault <gnault@redhat.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/20250217134109.311176-4-idosch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-19 18:43:38 -08:00
Ido Schimmel
da7665947b net: fib_rules: Add port mask support
Add support for configuring and deleting rules that match on source and
destination ports using a mask as well as support for dumping such rules
to user space.

Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Guillaume Nault <gnault@redhat.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/20250217134109.311176-3-idosch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-19 18:43:38 -08:00
Kuniyuki Iwashima
e57a632021 net: Add net_passive_inc() and net_passive_dec().
net_drop_ns() is NULL when CONFIG_NET_NS is disabled.

The next patch introduces a function that increments
and decrements net->passive.

As a prep, let's rename and export net_free() to
net_passive_dec() and add net_passive_inc().

Suggested-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/netdev/CANn89i+oUCt2VGvrbrweniTendZFEh+nwS=uonc004-aPkWy-Q@mail.gmail.com/
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250217191129.19967-2-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-18 18:33:29 -08:00
Willem de Bruijn
5cd2f78886 ipv6: initialize inet socket cookies with sockcm_init
Avoid open coding the same logic.

Signed-off-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/20250214222720.3205500-8-willemdebruijn.kernel@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-18 18:27:20 -08:00
Willem de Bruijn
096208592b ipv6: replace ipcm6_init calls with ipcm6_init_sk
This initializes tclass and dontfrag before cmsg parsing, removing the
need for explicit checks against -1 in each caller.

Leave hlimit set to -1, because its full initialization
(in ip6_sk_dst_hoplimit) requires more state (dst, flowi6, ..).

This also prepares for calling sockcm_init in a follow-on patch.

Signed-off-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/20250214222720.3205500-7-willemdebruijn.kernel@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-18 18:27:20 -08:00
Willem de Bruijn
9329b58395 ipv4: remove get_rttos
Initialize the ip cookie tos field when initializing the cookie, in
ipcm_init_sk.

The existing code inverts the standard pattern for initializing cookie
fields. Default is to initialize the field from the sk, then possibly
overwrite that when parsing cmsgs (the unlikely case).

This field inverts that, setting the field to an illegal value and
after cmsg parsing checking whether the value is still illegal and
thus should be overridden.

Be careful to always apply mask INET_DSCP_MASK, as before.

Signed-off-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/20250214222720.3205500-5-willemdebruijn.kernel@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-18 18:27:19 -08:00
Willem de Bruijn
94788792f3 ipv4: initialize inet socket cookies with sockcm_init
Avoid open coding the same logic.

Signed-off-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/20250214222720.3205500-4-willemdebruijn.kernel@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-18 18:27:19 -08:00
Willem de Bruijn
6ad861519a net: initialize mark in sockcm_init
Avoid open coding initialization of sockcm fields.
Avoid reading the sk_priority field twice.

This ensures all callers, existing and future, will correctly try a
cmsg passed mark before sk_mark.

This patch extends support for cmsg mark to:
packet_spkt and packet_tpacket and net/can/raw.c.

This patch extends support for cmsg priority to:
packet_spkt and packet_tpacket.

Signed-off-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/20250214222720.3205500-3-willemdebruijn.kernel@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-18 18:27:19 -08:00
Jakub Kicinski
0f375d90c4 Merge branch '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue
Tony Nguyen says:

====================
ice, iavf: Add support for Rx timestamping

Mateusz Polchlopek says:

Initially, during VF creation it registers the PTP clock in
the system and negotiates with PF it's capabilities. In the
meantime the PF enables the Flexible Descriptor for VF.
Only this type of descriptor allows to receive Rx timestamps.

Enabling virtual clock would be possible, though it would probably
perform poorly due to the lack of direct time access.

Enable timestamping should be done using userspace tools, e.g.
hwstamp_ctl -i $VF -r 14

In order to report the timestamps to userspace, the VF extends
timestamp to 40b.

To support this feature the flexible descriptors and PTP part
in iavf driver have been introduced.

* '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue:
  iavf: add support for Rx timestamps to hotpath
  iavf: handle set and get timestamps ops
  iavf: Implement checking DD desc field
  iavf: refactor iavf_clean_rx_irq to support legacy and flex descriptors
  iavf: define Rx descriptors as qwords
  libeth: move idpf_rx_csum_decoded and idpf_rx_extracted
  iavf: periodically cache PHC time
  iavf: add support for indirect access to PHC time
  iavf: add initial framework for registering PTP clock
  iavf: negotiate PTP capabilities
  iavf: add support for negotiating flexible RXDID format
  virtchnl: add enumeration for the rxdid format
  ice: support Rx timestamp on flex descriptor
  virtchnl: add support for enabling PTP on iAVF
====================

Link: https://patch.msgid.link/20250214192739.1175740-1-anthony.l.nguyen@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-17 17:09:44 -08:00
Joe Damato
a127c18462 netlink: Add nla_put_empty_nest helper
Creating empty nests is helpful when the exact attributes to be exposed
in the future are not known. Encapsulate the logic in a helper.

Signed-off-by: Joe Damato <jdamato@fastly.com>
Suggested-by: Jakub Kicinski <kuba@kernel.org>
Link: https://patch.msgid.link/20250214211255.14194-2-jdamato@fastly.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-17 16:46:03 -08:00
Stefano Jordhani
b9d752105e net: use napi_id_valid helper
In commit 6597e8d358 ("netdev-genl: Elide napi_id when not present"),
napi_id_valid function was added. Use the helper to refactor open-coded
checks in the source.

Suggested-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Stefano Jordhani <sjordhani@gmail.com>
Reviewed-by: Joe Damato <jdamato@fastly.com>
Reviewed-by: Jens Axboe <axboe@kernel.dk> # for iouring
Link: https://patch.msgid.link/20250214181801.931-1-sjordhani@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-17 16:43:04 -08:00
Eric Dumazet
54568a84c9 net: introduce EXPORT_IPV6_MOD() and EXPORT_IPV6_MOD_GPL()
We have many EXPORT_SYMBOL(x) in networking tree because IPv6
can be built as a module.

CONFIG_IPV6=y is becoming the norm.

Define a EXPORT_IPV6_MOD(x) which only exports x
for modular IPv6.

Same principle applies to EXPORT_IPV6_MOD_GPL()

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Mateusz Polchlopek <mateusz.polchlopek@intel.com>
Link: https://patch.msgid.link/20250212132418.1524422-2-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-14 13:09:38 -08:00
Mateusz Polchlopek
ce5cf4af7c libeth: move idpf_rx_csum_decoded and idpf_rx_extracted
Structs idpf_rx_csum_decoded and idpf_rx_extracted are used both in
idpf and iavf Intel drivers. Change the prefix from idpf_* to libeth_*
and move mentioned structs to libeth's rx.h header file.

Adjust usage in idpf driver.

Suggested-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Tested-by: Rafal Romanowski <rafal.romanowski@intel.com>
Signed-off-by: Mateusz Polchlopek <mateusz.polchlopek@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-02-14 10:58:08 -08:00
Jakub Kicinski
7a7e019713 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR (net-6.14-rc3).

No conflicts or adjacent changes.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-13 12:43:30 -08:00
Luiz Augusto von Dentz
ab4eedb790 Bluetooth: L2CAP: Fix corrupted list in hci_chan_del
This fixes the following trace by reworking the locking of l2cap_conn
so instead of only locking when changing the chan_l list this promotes
chan_lock to a general lock of l2cap_conn so whenever it is being held
it would prevents the likes of l2cap_conn_del to run:

list_del corruption, ffff888021297e00->prev is LIST_POISON2 (dead000000000122)
------------[ cut here ]------------
kernel BUG at lib/list_debug.c:61!
Oops: invalid opcode: 0000 [#1] PREEMPT SMP KASAN PTI
CPU: 1 UID: 0 PID: 5896 Comm: syz-executor213 Not tainted 6.14.0-rc1-next-20250204-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 12/27/2024
RIP: 0010:__list_del_entry_valid_or_report+0x12c/0x190 lib/list_debug.c:59
Code: 8c 4c 89 fe 48 89 da e8 32 8c 37 fc 90 0f 0b 48 89 df e8 27 9f 14 fd 48 c7 c7 a0 c0 60 8c 4c 89 fe 48 89 da e8 15 8c 37 fc 90 <0f> 0b 4c 89 e7 e8 0a 9f 14 fd 42 80 3c 2b 00 74 08 4c 89 e7 e8 cb
RSP: 0018:ffffc90003f6f998 EFLAGS: 00010246
RAX: 000000000000004e RBX: dead000000000122 RCX: 01454d423f7fbf00
RDX: 0000000000000000 RSI: 0000000080000000 RDI: 0000000000000000
RBP: dffffc0000000000 R08: ffffffff819f077c R09: 1ffff920007eded0
R10: dffffc0000000000 R11: fffff520007eded1 R12: dead000000000122
R13: dffffc0000000000 R14: ffff8880352248d8 R15: ffff888021297e00
FS:  00007f7ace6686c0(0000) GS:ffff8880b8700000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f7aceeeb1d0 CR3: 000000003527c000 CR4: 00000000003526f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 <TASK>
 __list_del_entry_valid include/linux/list.h:124 [inline]
 __list_del_entry include/linux/list.h:215 [inline]
 list_del_rcu include/linux/rculist.h:168 [inline]
 hci_chan_del+0x70/0x1b0 net/bluetooth/hci_conn.c:2858
 l2cap_conn_free net/bluetooth/l2cap_core.c:1816 [inline]
 kref_put include/linux/kref.h:65 [inline]
 l2cap_conn_put+0x70/0xe0 net/bluetooth/l2cap_core.c:1830
 l2cap_sock_shutdown+0xa8a/0x1020 net/bluetooth/l2cap_sock.c:1377
 l2cap_sock_release+0x79/0x1d0 net/bluetooth/l2cap_sock.c:1416
 __sock_release net/socket.c:642 [inline]
 sock_close+0xbc/0x240 net/socket.c:1393
 __fput+0x3e9/0x9f0 fs/file_table.c:448
 task_work_run+0x24f/0x310 kernel/task_work.c:227
 ptrace_notify+0x2d2/0x380 kernel/signal.c:2522
 ptrace_report_syscall include/linux/ptrace.h:415 [inline]
 ptrace_report_syscall_exit include/linux/ptrace.h:477 [inline]
 syscall_exit_work+0xc7/0x1d0 kernel/entry/common.c:173
 syscall_exit_to_user_mode_prepare kernel/entry/common.c:200 [inline]
 __syscall_exit_to_user_mode_work kernel/entry/common.c:205 [inline]
 syscall_exit_to_user_mode+0x24a/0x340 kernel/entry/common.c:218
 do_syscall_64+0x100/0x230 arch/x86/entry/common.c:89
 entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7f7aceeaf449
Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 41 19 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b0 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007f7ace668218 EFLAGS: 00000246 ORIG_RAX: 000000000000002a
RAX: fffffffffffffffc RBX: 00007f7acef39328 RCX: 00007f7aceeaf449
RDX: 000000000000000e RSI: 0000000020000100 RDI: 0000000000000004
RBP: 00007f7acef39320 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000003
R13: 0000000000000004 R14: 00007f7ace668670 R15: 000000000000000b
 </TASK>
Modules linked in:
---[ end trace 0000000000000000 ]---
RIP: 0010:__list_del_entry_valid_or_report+0x12c/0x190 lib/list_debug.c:59
Code: 8c 4c 89 fe 48 89 da e8 32 8c 37 fc 90 0f 0b 48 89 df e8 27 9f 14 fd 48 c7 c7 a0 c0 60 8c 4c 89 fe 48 89 da e8 15 8c 37 fc 90 <0f> 0b 4c 89 e7 e8 0a 9f 14 fd 42 80 3c 2b 00 74 08 4c 89 e7 e8 cb
RSP: 0018:ffffc90003f6f998 EFLAGS: 00010246
RAX: 000000000000004e RBX: dead000000000122 RCX: 01454d423f7fbf00
RDX: 0000000000000000 RSI: 0000000080000000 RDI: 0000000000000000
RBP: dffffc0000000000 R08: ffffffff819f077c R09: 1ffff920007eded0
R10: dffffc0000000000 R11: fffff520007eded1 R12: dead000000000122
R13: dffffc0000000000 R14: ffff8880352248d8 R15: ffff888021297e00
FS:  00007f7ace6686c0(0000) GS:ffff8880b8600000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f7acef05b08 CR3: 000000003527c000 CR4: 00000000003526f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400

Reported-by: syzbot+10bd8fe6741eedd2be2e@syzkaller.appspotmail.com
Tested-by: syzbot+10bd8fe6741eedd2be2e@syzkaller.appspotmail.com
Fixes: b4f82f9ed4 ("Bluetooth: L2CAP: Fix slab-use-after-free Read in l2cap_send_cmd")
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
2025-02-13 11:15:37 -05:00
Paolo Abeni
f0e70409b7 net: avoid unconditionally touching sk_tsflags on RX
After commit 5d4cc87414 ("net: reorganize "struct sock" fields"),
the sk_tsflags field shares the same cacheline with sk_forward_alloc.

The UDP protocol does not acquire the sock lock in the RX path;
forward allocations are protected via the receive queue spinlock;
additionally udp_recvmsg() calls sock_recv_cmsgs() unconditionally
touching sk_tsflags on each packet reception.

Due to the above, under high packet rate traffic, when the BH and the
user-space process run on different CPUs, UDP packet reception
experiences a cache miss while accessing sk_tsflags.

The receive path doesn't strictly need to access the problematic field;
change sock_set_timestamping() to maintain the relevant information
in a newly allocated sk_flags bit, so that sock_recv_cmsgs() can
take decisions accessing the latter field only.

With this patch applied, on an AMD epic server with i40e NICs, I
measured a 10% performance improvement for small packets UDP flood
performance tests - possibly a larger delta could be observed with more
recent H/W.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/dbd18c8a1171549f8249ac5a8b30b1b5ec88a425.1739294057.git.pabeni@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-12 19:37:19 -08:00
Jakub Kicinski
34eea78a11 net: report csum_complete via qstats
Commit 13c7c941e7 ("netdev: add qstat for csum complete") reserved
the entry for csum complete in the qstats uAPI. Start reporting this
value now that we have a driver which needs it.

Reviewed-by: Joe Damato <jdamato@fastly.com>
Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Link: https://patch.msgid.link/20250211181356.580800-2-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-12 16:37:35 -08:00
Eric Dumazet
1280c26228 tcp: add tcp_rto_max_ms sysctl
Previous patch added a TCP_RTO_MAX_MS socket option
to tune a TCP socket max RTO value.

Many setups prefer to change a per netns sysctl.

This patch adds /proc/sys/net/ipv4/tcp_rto_max_ms

Its initial value is 120000 (120 seconds).

Keep in mind that a decrease of tcp_rto_max_ms
means shorter overall timeouts, unless tcp_retries2
sysctl is increased.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
Reviewed-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-02-11 13:08:00 +01:00
Eric Dumazet
54a378f434 tcp: add the ability to control max RTO
Currently, TCP stack uses a constant (120 seconds)
to limit the RTO value exponential growth.

Some applications want to set a lower value.

Add TCP_RTO_MAX_MS socket option to set a value (in ms)
between 1 and 120 seconds.

It is discouraged to change the socket rto max on a live
socket, as it might lead to unexpected disconnects.

Following patch is adding a netns sysctl to control the
default value at socket creation time.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
Reviewed-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-02-11 13:08:00 +01:00
Eric Dumazet
7baa030155 tcp: add a @pace_delay parameter to tcp_reset_xmit_timer()
We want to factorize calls to inet_csk_reset_xmit_timer(),
to ease TCP_RTO_MAX change.

Current users want to add tcp_pacing_delay(sk)
to the timeout.

Remaining calls to inet_csk_reset_xmit_timer()
do not add the pacing delay. Following patch
will convert them, passing false for @pace_delay.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
Reviewed-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-02-11 13:07:59 +01:00
Eric Dumazet
0fed463777 tcp: remove tcp_reset_xmit_timer() @max_when argument
All callers use TCP_RTO_MAX, we can factorize this constant,
becoming a variable soon.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
Reviewed-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-02-11 13:07:59 +01:00
Emmanuel Grumbach
8c60179b64 wifi: mac80211: set ieee80211_prep_tx_info::link_id upon Auth Rx
This will be used by the low level driver.
Note that link_id  will be 0 in case of a non-MLO authentication.
Also fix a call-site of mgd_prepare_tx() where the link_id was not
populated.

Update the documentation to reflect the current state
ieee80211_prep_tx_info::link_id is also available in mgd_complete_tx().

Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
Link: https://patch.msgid.link/20250205110958.6a590f189ce5.I1fc5c0da26b143f5b07191eb592f01f7083d55ae@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-02-11 11:59:07 +01:00
Johannes Berg
3ad4fce66e wifi: mac80211: add strict mode disabling workarounds
Add a strict mode where we disable certain workarounds and have
additional checks such as, for now, that VHT capabilities from
association response match those from beacon/probe response. We
can extend the checks in the future.

Make it an opt-in setting by the driver so it can be set there
in some driver-specific way, for example. Also allow setting
this one hw flag through the hwflags debugfs, by writing a new
strict=0 or strict=1 value.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Reviewed-by: Ilan Peer <ilan.peer@intel.com>
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
Link: https://patch.msgid.link/20250205110958.5cecb0469479.I4a69617dc60ba0d6308416ffbc3102cfd08ba068@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-02-11 11:59:06 +01:00
Ilan Peer
de86c5f608 wifi: mac80211: Add support for EPCS configuration
Add support for configuring EPCS state:

- When EPCS is enabled, send an EPCS enable request action frame
  to the AP. When the AP replies with EPCS enable response, enable
  EPCS by applying the QoS parameters provided by the AP. Do so for
  all the valid MLD links. Once EPCS is enabled, support processing
  of unsolicited EPCS enable response frames.
- When EPCS is disabled, send an EPCS teardown request to the AP
  and apply the QoS parameters as obtained from the last received
  beacons. Do so for all the valid links.

Signed-off-by: Ilan Peer <ilan.peer@intel.com>
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
Link: https://patch.msgid.link/20250205110958.7a90afd7e140.I3f602d65f5c1fd849d6c70b12307dda33aa91ccb@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-02-11 11:59:06 +01:00
Alexander Wetzel
286e696770 wifi: mac80211: Drop cooked monitor support
Hostapd switched from cooked monitor interfaces to nl80211 Dec 2011.
Drop support for the outdated cooked monitor interfaces and fix
creating the virtual monitor interfaces in the following cases:

 1) We have one non-monitor and one monitor interface with
    %MONITOR_FLAG_ACTIVE enabled and then delete the non-monitor
    interface.

 2) We only have monitor interfaces enabled on resume while at least one
    has %MONITOR_FLAG_ACTIVE set.

Signed-off-by: Alexander Wetzel <Alexander@wetzel-home.de>
Link: https://patch.msgid.link/20250204111352.7004-2-Alexander@wetzel-home.de
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-02-11 11:58:17 +01:00
Alexander Wetzel
be22179cfb wifi: nl80211/cfg80211: Stop supporting cooked monitor
Unconditionally start to refuse creating cooked monitor interfaces to
phase them out.

There is no feature flag for drivers to opt-in for cooked monitor and
all known users are using/preferring the modern API since the hostapd
release 1.0 in May 2012.

Signed-off-by: Alexander Wetzel <Alexander@wetzel-home.de>
Link: https://patch.msgid.link/20250204111352.7004-1-Alexander@wetzel-home.de
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-02-11 11:58:17 +01:00