Convert struct proto pre_connect(), connect(), bind(), and bind_add()
callback function prototypes from struct sockaddr to struct sockaddr_unsized.
This does not change per-implementation use of sockaddr for passing around
an arbitrarily sized sockaddr struct. Those will be addressed in future
patches.
Additionally removes the no longer referenced struct sockaddr from
include/net/inet_common.h.
No binary changes expected.
Signed-off-by: Kees Cook <kees@kernel.org>
Link: https://patch.msgid.link/20251104002617.2752303-5-kees@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Update all struct proto_ops connect() callback function prototypes from
"struct sockaddr *" to "struct sockaddr_unsized *" to avoid lying to the
compiler about object sizes. Calls into struct proto handlers gain casts
that will be removed in the struct proto conversion patch.
No binary changes expected.
Signed-off-by: Kees Cook <kees@kernel.org>
Link: https://patch.msgid.link/20251104002617.2752303-3-kees@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Update all struct proto_ops bind() callback function prototypes from
"struct sockaddr *" to "struct sockaddr_unsized *" to avoid lying to the
compiler about object sizes. Calls into struct proto handlers gain casts
that will be removed in the struct proto conversion patch.
No binary changes expected.
Signed-off-by: Kees Cook <kees@kernel.org>
Link: https://patch.msgid.link/20251104002617.2752303-2-kees@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
- Split cq_lock into two smaller locks: cq_prod_lock and
cq_cached_prod_lock
- Avoid disabling/enabling interrupts in the hot xmit path
In either xsk_cq_cancel_locked() or xsk_cq_reserve_locked() function,
the race condition is only between multiple xsks sharing the same
pool. They are all in the process context rather than interrupt context,
so now the small lock named cq_cached_prod_lock can be used without
handling interrupts.
While cq_cached_prod_lock ensures the exclusive modification of
@cached_prod, cq_prod_lock in xsk_cq_submit_addr_locked() only cares
about @producer and corresponding @desc. Both of them don't necessarily
be consistent with @cached_prod protected by cq_cached_prod_lock.
That's the reason why the previous big lock can be split into two
smaller ones. Please note that SPSC rule is all about the global state
of producer and consumer that can affect both layers instead of local
or cached ones.
Frequently disabling and enabling interrupt are very time consuming
in some cases, especially in a per-descriptor granularity, which now
can be avoided after this optimization, even when the pool is shared by
multiple xsks.
With this patch, the performance number[1] could go from 1,872,565 pps
to 1,961,009 pps. It's a minor rise of around 5%.
[1]: taskset -c 1 ./xdpsock -i enp2s0f1 -q 0 -t -S -s 64
Signed-off-by: Jason Xing <kernelxing@tencent.com>
Acked-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Link: https://patch.msgid.link/20251030000646.18859-3-kerneljasonxing@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
MPLS (re)uses RTNL to protect net->mpls.platform_label,
but the lock does not need to be RTNL at all.
Let's protect net->mpls.platform_label with a dedicated
per-netns mutex.
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Guillaume Nault <gnault@redhat.com>
Link: https://patch.msgid.link/20251029173344.2934622-13-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
rcu_dereference_rtnl() does not clearly tell whether the caller
is under RCU or RTNL.
Let's add in6_dev_rcu() to make it easy to remove __in6_dev_get()
in the future.
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Guillaume Nault <gnault@redhat.com>
Link: https://patch.msgid.link/20251029173344.2934622-5-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Convert 9p to the new mount API. This patch consolidates all parsing
into fs/9p/v9fs.c, which stores all results into a filesystem context
which can be passed to the various transports as needed.
Some of the parsing helper functions such as get_cache_mode() have been
eliminated in favor of using the new mount API's enum param type,
for simplicity.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Message-ID: <20251010214222.1347785-5-sandeen@redhat.com>
[ Dominique: handled source explicitly as per follow-up discussion ]
Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>
This patch creates a new v9fs_context structure which includes
new p9_session_opts and p9_client_opts structures, as well as
re-using the existing p9_fd_opts and p9_rdma_opts to store options
during parsing. The new structure will be used in the next
commit to pass all parsed options to the appropriate transports.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Message-ID: <20251010214222.1347785-4-sandeen@redhat.com>
Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>
With the new mount API all option parsing will need to happen
in fs/v9fs.c, so move some existing data structures and macros
to header files to facilitate this. Rename some to reflect
the transport they are used for (rdma, fd, etc), for clarity.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Message-ID: <20251010214222.1347785-3-sandeen@redhat.com>
Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>
'->def' is only ever used as a true/false flag
Reported-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Message-ID: <20251103-v9fs_trans_def_bool-v1-1-f33dc7ed9e81@codewreck.org>
Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>
While developing a 9P server (https://github.com/Barre/ZeroFS) and
testing it under high-load, I was running into allocation failures.
The failures occur even with plenty of free memory available because
kmalloc requires contiguous physical memory.
This results in errors like:
ls: page allocation failure: order:7, mode:0x40c40(GFP_NOFS|__GFP_COMP)
This patch introduces a transport capability flag (supports_vmalloc)
that indicates whether a transport can work with vmalloc'd buffers
(non-physically contiguous memory). Transports requiring DMA should
leave this flag as false.
The fd-based transports (tcp, unix, fd) set this flag to true, and
p9_fcall_init will use kvmalloc instead of kmalloc for these
transports. This allows the allocator to fall back to vmalloc when
contiguous physical memory is not available.
Additionally, if kmem_cache_alloc fails, the code falls back to
kvmalloc for transports that support it.
Signed-off-by: Pierre Barre <pierre@barre.sh>
Reviewed-by: Christian Schoenebeck <linux_oss@crudebyte.com>
Message-ID: <d2017c29-11fb-44a5-bd0f-4204329bbefb@app.fastmail.com>
Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>
Handle the NIC hardware link state events received from the HW
channel, then set the proper link state accordingly.
And, add a feature bit, GDMA_DRV_CAP_FLAG_1_HW_VPORT_LINK_AWARE,
to inform the NIC hardware this handler exists.
Our MANA NIC only sends out the link state down/up messages
when we need to let the VM rerun DHCP client and change IP
address. So, add netif_carrier_on() in the probe(), let the NIC
show the right initial state in /sys/class/net/ethX/operstate.
Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
Link: https://patch.msgid.link/1761770601-16920-1-git-send-email-haiyangz@linux.microsoft.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
- ath10k: revert a patch that had caused issues on some devices
- cfg80211/mac80211: use hrtimers for some things where the
precise timing matters
- zd1211rw: fix a long-standing potential leak
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEpeA8sTs3M8SN2hR410qiO8sPaAAFAmkDQiwACgkQ10qiO8sP
aAD8mg//RrRLyJIrVzC/c5TtiVT/ziwyMPMe6l6aRBSN+H/15hX+QfsDt29x+BgX
CyN9io7Tix3Szshos3XPSrWjKrRy2JYDDGQbVW1npxzAs/E4O5rhETm3SWymHkJh
geF3W9NDbF7OQIzm6jBp7/i5Im97LU7Y6c9Pd6FxPucKBvFahvxqyDvrD2MAa6/j
RDr9de7l7X02xymU8Z2h4qNjXQ1fbLWio1UMr+xPxPvPmXsR4aYfGb3xESryHS1y
a2Xnui1Apzv0Bxf0j0CrsqPoZmSSMhhfB5NMclTsZluPZySRmFygoPt5fjoeaw/Q
8Ga4nkzUW2ZMZ9CJiB8qFX8dSQ+k658AyVY0XAQkZFey4e0kPwwOG/f1xBoAeIW5
VcvLbxIZ2Ao3K2qAg5f2bfPyjcp1t+I4iFb/WKzTzX5EcuHcxB5IfoqYfIAkLfgP
ada9iCJzrVv0+aBCbpKvDm+BAnV/H1sQVRtPQKH09+P8Dylj/klVRYemBMzG9zH2
zSrkllJrESKUI7Y0fDszt4G9HmWuX9+y5eKhHrLwYkUutqIJsdkW9hbe+EQGUIke
bOL5nH5hhQDaOa5hpoTQRUKYZb+4tbttKCa9Bii+1M1NgPlQq22Ud4wBdGWqvfqF
dFHx7vmbVAqrTl8RzXtpLj/UVbgVNhGJAIbD6IAc3DyZC4BvZJs=
=3oN9
-----END PGP SIGNATURE-----
Merge tag 'wireless-2025-10-30' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless
Johannes Berg says:
====================
Couple of new fixes:
- ath10k: revert a patch that had caused issues on some devices
- cfg80211/mac80211: use hrtimers for some things where the
precise timing matters
- zd1211rw: fix a long-standing potential leak
* tag 'wireless-2025-10-30' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless:
wifi: zd1211rw: fix potential memory leak in __zd_usb_enable_rx()
wifi: mac80211: use wiphy_hrtimer_work for csa.switch_work
wifi: mac80211: use wiphy_hrtimer_work for ml_reconf_work
wifi: mac80211: use wiphy_hrtimer_work for ttlm_work
wifi: cfg80211: add an hrtimer based delayed work item
Revert "wifi: ath10k: avoid unnecessary wait for service ready message"
====================
Link: https://patch.msgid.link/20251030104919.12871-3-johannes@sipsolutions.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
In the parse_adv_monitor_pattern() function, the value of
the 'length' variable is currently limited to HCI_MAX_EXT_AD_LENGTH(251).
The size of the 'value' array in the mgmt_adv_pattern structure is 31.
If the value of 'pattern[i].length' is set in the user space
and exceeds 31, the 'patterns[i].value' array can be accessed
out of bound when copied.
Increasing the size of the 'value' array in
the 'mgmt_adv_pattern' structure will break the userspace.
Considering this, and to avoid OOB access revert the limits for 'offset'
and 'length' back to the value of HCI_MAX_AD_LENGTH.
Found by InfoTeCS on behalf of Linux Verification Center
(linuxtesting.org) with SVACE.
Fixes: db08722fc7 ("Bluetooth: hci_core: Fix missing instances using HCI_MAX_AD_LENGTH")
Cc: stable@vger.kernel.org
Signed-off-by: Ilia Gavrilov <Ilia.Gavrilov@infotecs.ru>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
-----BEGIN PGP SIGNATURE-----
iQJBBAABCAArFiEEgKkgxbID4Gn1hq6fcJGo2a1f9gAFAmkDUZINHGZ3QHN0cmxl
bi5kZQAKCRBwkajZrV/2AK4uD/9NS2Q+uqGlyAnsKrmetRBuhCMRdJaHnrniD5+N
xnsPrVpyL3EdioOehoXZNkbJ2JXTz/xl4bqZ4Zx4Z6vCVpkws2et5aU9lvm39n9s
yIhSkthuI0buRYrTzsHO21bHBPQU4gpqOXvbcFotm3QOj2wF2NaNthrDhyWUjtiY
tPmkSEWu/mjbjMEt1VU3ec3rb3fbe4i6tMW+Y4bTwH7dAopk7Rhw26NexGVUY8fO
+BzenYi3XTktOrevYnlAJXmsavJjFkSsRffe76PtewNNY2WTCY+st341IXmOVASL
OzL+9qvY3xIyevClUKx/WEKPGZae+V0Buxq6U6SijDLgbDF19A7R6zMId/ILsexm
zuAnAdhXdP1gRl4cHAs/AQGD72gqXpVaIRVJE7NsRz3+rvOXzvYRvBhIVQAdvg6v
pnR8PQg30jHPmPPRruarwH6O4x4jGptX3ASQtIf9Fii37vbDhZmJIt3j2OCaEuFP
09ImQSwjvizD+aXvf+GtsWD/ZS2PNhE6m0jpJKI26grngODSpKPfNA54uwHmzXDM
+1Zy+b6AxFjLi5HoAu+Cq7a+ktFKYVCRsunfovzVdvM+63AXoP9e2aRqWnoJUxvb
sB4eAccoDE1W0CDDktz36bVt92ZTz4zmPO7HoT/oNyvkH8ye6tYpOPfXVWrUirkQ
MNFqmw==
=woNE
-----END PGP SIGNATURE-----
Merge tag 'nf-next-25-10-30' of https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next
Florian Westphal says:
====================
netfilter: updates for net-next
1) Convert nf_tables 'nft_set_iter' usage to use C99 struct
initialization, from Fernando Fernandez Mancera.
2) Disallow nf_conntrack_max=0. This was an (undocumented)
historic inheritance from ip_conntrack (ipv4 only nf_conntrack
predecessor). Doing so will simplify future changes to make
this pernet-tuneable.
3) Fix a typo in conntrack.h comment, from Weibiao Tu.
* tag 'nf-next-25-10-30' of https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next:
netfilter: fix typo in nf_conntrack_l4proto.h comment
netfilter: conntrack: disable 0 value for conntrack_max setting
netfilter: nf_tables: use C99 struct initializer for nft_set_iter
====================
Link: https://patch.msgid.link/20251030121954.29175-1-fw@strlen.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Think SMC_WR_BUF_CNT_SEND := SMC_WR_BUF_CNT used in send context and
SMC_WR_BUF_CNT_RECV := 3 * SMC_WR_BUF_CNT used in recv context. Those
get replaced with lgr->max_send_wr and lgr->max_recv_wr respective.
Please note that although with the default sysctl values
qp_attr.cap.max_send_wr == qp_attr.cap.max_recv_wr is maintained but
can not be assumed to be generally true any more. I see no downside to
that, but my confidence level is rather modest.
Signed-off-by: Halil Pasic <pasic@linux.ibm.com>
Reviewed-by: Sidraya Jayagond <sidraya@linux.ibm.com>
Reviewed-by: Dust Li <dust.li@linux.alibaba.com>
Tested-by: Mahanta Jambigi <mjambigi@linux.ibm.com>
Link: https://patch.msgid.link/20251027224856.2970019-2-pasic@linux.ibm.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
In the comment for nf_conntrack_l4proto.h, the word "nfnetink" was
incorrectly spelled. It has been corrected to "nfnetlink".
Fixes a typo to enhance readability and ensure consistency.
Signed-off-by: caivive (Weibiao Tu) <cavivie@gmail.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
The GSO segmentation functions for ESP tunnel mode
(xfrm4_tunnel_gso_segment and xfrm6_tunnel_gso_segment) were
determining the inner packet's L2 protocol type by checking the static
x->inner_mode.family field from the xfrm state.
This is unreliable. In tunnel mode, the state's actual inner family
could be defined by x->inner_mode.family or by
x->inner_mode_iaf.family. Checking only the former can lead to a
mismatch with the actual packet being processed, causing GSO to create
segments with the wrong L2 header type.
This patch fixes the bug by deriving the inner mode directly from the
packet's inner protocol stored in XFRM_MODE_SKB_CB(skb)->protocol.
Instead of replicating the code, this patch modifies the
xfrm_ip2inner_mode helper function. It now correctly returns
&x->inner_mode if the selector family (x->sel.family) is already
specified, thereby handling both specific and AF_UNSPEC cases
appropriately.
With this change, ESP GSO can use xfrm_ip2inner_mode to get the
correct inner mode. It doesn't affect existing callers, as the updated
logic now mirrors the checks they were already performing externally.
Fixes: 26dbd66eab ("esp: choose the correct inner protocol for GSO on inter address family tunnels")
Signed-off-by: Jianbo Liu <jianbol@nvidia.com>
Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
mac80211 already reports some basic information in the radiotap header
with the known fields declared by the driver. However, drivers may want
to report more accurate information and in that case the full VHT
radiotap structure needs to be provided.
Add a new RX_FLAG_RADIOTAP_VHT which is set when the VHT information
should be pulled from the skb. Update the code to fill in the VHT fields
to only do so when requested by the driver or if the information has not
yet been set. This way the driver can fully control the information if
it chooses so.
Signed-off-by: Benjamin Berg <benjamin.berg@intel.com>
Reviewed-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
Link: https://patch.msgid.link/20251027142118.0bad1c307a21.I2cf285c20a822698039603f2af00ed9c548f2ee0@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
When building drivers/net/ethernet/intel/idpf/xsk.c for ARCH=arm with
CONFIG_CFI=y using a version of LLVM prior to 22.0.0, there is a
BUILD_BUG_ON failure:
$ cat arch/arm/configs/repro.config
CONFIG_BPF_SYSCALL=y
CONFIG_CFI=y
CONFIG_IDPF=y
CONFIG_XDP_SOCKETS=y
$ make -skj"$(nproc)" ARCH=arm LLVM=1 clean defconfig repro.config drivers/net/ethernet/intel/idpf/xsk.o
In file included from drivers/net/ethernet/intel/idpf/xsk.c:4:
include/net/libeth/xsk.h:205:2: error: call to '__compiletime_assert_728' declared with 'error' attribute: BUILD_BUG_ON failed: !__builtin_constant_p(tmo == libeth_xsktmo)
205 | BUILD_BUG_ON(!__builtin_constant_p(tmo == libeth_xsktmo));
| ^
...
libeth_xdp_tx_xmit_bulk() indirectly calls libeth_xsk_xmit_fill_buf()
but these functions are marked as __always_inline so that the compiler
can turn these indirect calls into direct ones and see that the tmo
parameter to __libeth_xsk_xmit_fill_buf_md() is ultimately libeth_xsktmo
from idpf_xsk_xmit().
Unfortunately, the generic kCFI pass in LLVM expands the kCFI bundles
from the indirect calls in libeth_xdp_tx_xmit_bulk() in such a way that
later optimizations cannot turn these calls into direct ones, making the
BUILD_BUG_ON fail because it cannot be proved at compile time that tmo
is libeth_xsktmo.
Disable the generic kCFI pass for libeth_xdp_tx_xmit_bulk() to ensure
these indirect calls can always be turned into direct calls to avoid
this error.
Closes: https://github.com/ClangBuiltLinux/linux/issues/2124
Fixes: 9705d6552f ("idpf: implement Rx path for AF_XDP")
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Acked-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Link: https://patch.msgid.link/20251025-idpf-fix-arm-kcfi-build-error-v1-3-ec57221153ae@kernel.org
Signed-off-by: Kees Cook <kees@kernel.org>
When a netdev issues a RX async resync request for a TLS connection,
the TLS module handles it by logging record headers and attempting to
match them to the tcp_sn provided by the device. If a match is found,
the TLS module approves the tcp_sn for resynchronization.
While waiting for a device response, the TLS module also increments
rcd_delta each time a new TLS record is received, tracking the distance
from the original resync request.
However, if the device response is delayed or fails (e.g due to
unstable connection and device getting out of tracking, hardware
errors, resource exhaustion etc.), the TLS module keeps logging and
incrementing, which can lead to a WARN() when rcd_delta exceeds the
threshold.
To address this, introduce tls_offload_rx_resync_async_request_cancel()
to explicitly cancel resync requests when a device response failure is
detected. Call this helper also as a final safeguard when rcd_delta
crosses its threshold, as reaching this point implies that earlier
cancellation did not occur.
Signed-off-by: Shahar Shitrit <shshitrit@nvidia.com>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/1761508983-937977-3-git-send-email-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Update tls_offload_rx_resync_async_request_start() and
tls_offload_rx_resync_async_request_end() to get a struct
tls_offload_resync_async parameter directly, rather than
extracting it from struct sock.
This change aligns the function signatures with the upcoming
tls_offload_rx_resync_async_request_cancel() helper, which
will be introduced in a subsequent patch.
Signed-off-by: Shahar Shitrit <shshitrit@nvidia.com>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/1761508983-937977-2-git-send-email-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Add the ability to append the incoming IP interface information to
ICMPv6 error messages in accordance with RFC 5837 and RFC 4884. This is
required for more meaningful traceroute results in unnumbered networks.
The feature is disabled by default and controlled via a new sysctl
("net.ipv6.icmp.errors_extension_mask") which accepts a bitmask of ICMP
extensions to append to ICMP error messages. Currently, only a single
value is supported, but the interface and the implementation should be
able to support more extensions, if needed.
Clone the skb and copy the relevant data portions before modifying the
skb as the caller of icmp6_send() still owns the skb after the function
returns. This should be fine since by default ICMP error messages are
rate limited to 1000 per second and no more than 1 per second per
specific host.
Trim or pad the packet to 128 bytes before appending the ICMP extension
structure in order to be compatible with legacy applications that assume
that the ICMP extension structure always starts at this offset (the
minimum length specified by RFC 4884).
Since commit 20e1954fe2 ("ipv6: RFC 4884 partial support for SIT/GRE
tunnels") it is possible for icmp6_send() to be called with an skb that
already contains ICMP extensions. This can happen when we receive an
ICMPv4 message with extensions from a tunnel and translate it to an
ICMPv6 message towards an IPv6 host in the overlay network. I could not
find an RFC that supports this behavior, but it makes sense to not
overwrite the original extensions that were appended to the packet.
Therefore, avoid appending extensions if the length field in the
provided ICMPv6 header is already filled.
Export netdev_copy_name() using EXPORT_IPV6_MOD_GPL() to make it
available to IPv6 when it is built as a module.
Reviewed-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20251027082232.232571-3-idosch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Add the ability to append the incoming IP interface information to
ICMPv4 error messages in accordance with RFC 5837 and RFC 4884. This is
required for more meaningful traceroute results in unnumbered networks.
The feature is disabled by default and controlled via a new sysctl
("net.ipv4.icmp_errors_extension_mask") which accepts a bitmask of ICMP
extensions to append to ICMP error messages. Currently, only a single
value is supported, but the interface and the implementation should be
able to support more extensions, if needed.
Clone the skb and copy the relevant data portions before modifying the
skb as the caller of __icmp_send() still owns the skb after the function
returns. This should be fine since by default ICMP error messages are
rate limited to 1000 per second and no more than 1 per second per
specific host.
Trim or pad the packet to 128 bytes before appending the ICMP extension
structure in order to be compatible with legacy applications that assume
that the ICMP extension structure always starts at this offset (the
minimum length specified by RFC 4884).
Reviewed-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20251027082232.232571-2-idosch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This patch has no functional change, and prepares the following one.
tcp_rcvbuf_grow() will need to have access to tp->rcvq_space.space
old and new values.
Change mptcp_rcvbuf_grow() in a similar way.
Signed-off-by: Eric Dumazet <edumazet@google.com>
[ Moved 'oldval' declaration to the next patch to avoid warnings at
build time. ]
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Reviewed-by: Neal Cardwell <ncardwell@google.com>
Link: https://patch.msgid.link/20251028-net-tcp-recv-autotune-v3-3-74b43ba4c84c@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
'struct sctp_sched_ops' is not modified in these drivers.
Constifying this structure moves some data to a read-only section, so
increases overall security, especially when the structure holds some
function pointers.
On a x86_64, with allmodconfig, as an example:
Before:
======
text data bss dec hex filename
8019 568 0 8587 218b net/sctp/stream_sched_fc.o
After:
=====
text data bss dec hex filename
8275 312 0 8587 218b net/sctp/stream_sched_fc.o
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Link: https://patch.msgid.link/dce03527eb7b7cc8a3c26d5cdac12bafe3350135.1761377890.git.christophe.jaillet@wanadoo.fr
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Remove the NET_IOV_MAX workaround from the net_iov_type enum. This entry
was previously added to force the enum size to unsigned long to satisfy
the NET_IOV_ASSERT_OFFSET static assertions.
After commit f3d85c9ee5 ("netmem: introduce struct netmem_desc
mirroring struct page") this approach became unnecessary by placing the
net_iov_type after the netmem_desc. Placing the net_iov_type after
netmem_desc results in the net_iov_type size having no effect on the
position or layout of the fields that mirror the struct page.
The layout before this patch:
struct net_iov {
union {
struct netmem_desc desc; /* 0 48 */
struct {
long unsigned int _flags; /* 0 8 */
long unsigned int pp_magic; /* 8 8 */
struct page_pool * pp; /* 16 8 */
long unsigned int _pp_mapping_pad; /* 24 8 */
long unsigned int dma_addr; /* 32 8 */
atomic_long_t pp_ref_count; /* 40 8 */
}; /* 0 48 */
}; /* 0 48 */
struct net_iov_area * owner; /* 48 8 */
enum net_iov_type type; /* 56 8 */
/* size: 64, cachelines: 1, members: 3 */
};
The layout after this patch:
struct net_iov {
union {
struct netmem_desc desc; /* 0 48 */
struct {
long unsigned int _flags; /* 0 8 */
long unsigned int pp_magic; /* 8 8 */
struct page_pool * pp; /* 16 8 */
long unsigned int _pp_mapping_pad; /* 24 8 */
long unsigned int dma_addr; /* 32 8 */
atomic_long_t pp_ref_count; /* 40 8 */
}; /* 0 48 */
}; /* 0 48 */
struct net_iov_area * owner; /* 48 8 */
enum net_iov_type type; /* 56 4 */
/* size: 64, cachelines: 1, members: 3 */
/* padding: 4 */
};
Signed-off-by: Bobby Eshleman <bobbyeshleman@meta.com>
Reviewed-by: Mina Almasry <almasrymina@google.com>
Link: https://patch.msgid.link/20251024-b4-devmem-remove-niov-max-v1-1-ba72c68bc869@meta.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The normal timer mechanism assume that timeout further in the future
need a lower accuracy. As an example, the granularity for a timer
scheduled 4096 ms in the future on a 1000 Hz system is already 512 ms.
This granularity is perfectly sufficient for e.g. timeouts, but there
are other types of events that will happen at a future point in time and
require a higher accuracy.
Add a new wiphy_hrtimer_work type that uses an hrtimer internally. The
API is almost identical to the existing wiphy_delayed_work and it can be
used as a drop-in replacement after minor adjustments. The work will be
scheduled relative to the current time with a slack of 1 millisecond.
CC: stable@vger.kernel.org # 6.4+
Signed-off-by: Benjamin Berg <benjamin.berg@intel.com>
Reviewed-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
Link: https://patch.msgid.link/20251028125710.7f13a2adc5eb.I01b5af0363869864b0580d9c2a1770bafab69566@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Now, sctp_accept() and sctp_do_peeloff() use sk_clone(), and
we no longer need sctp_copy_sock() and sctp_copy_descendant().
Let's remove them.
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Acked-by: Xin Long <lucien.xin@gmail.com>
Link: https://patch.msgid.link/20251023231751.4168390-9-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
sctp_accept() will use sk_clone_lock(), but it will be called
with the parent socket locked, and sctp_migrate() acquires the
child lock later.
Let's add no lock version of sk_clone_lock().
Note that lockdep complains if we simply use bh_lock_sock_nested().
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Xin Long <lucien.xin@gmail.com>
Link: https://patch.msgid.link/20251023231751.4168390-5-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
During a handshake, an endpoint may specify a maximum record size limit.
Currently, the kernel defaults to TLS_MAX_PAYLOAD_SIZE (16KB) for the
maximum record size. Meaning that, the outgoing records from the kernel
can exceed a lower size negotiated during the handshake. In such a case,
the TLS endpoint must send a fatal "record_overflow" alert [1], and
thus the record is discarded.
Upcoming Western Digital NVMe-TCP hardware controllers implement TLS
support. For these devices, supporting TLS record size negotiation is
necessary because the maximum TLS record size supported by the controller
is less than the default 16KB currently used by the kernel.
Currently, there is no way to inform the kernel of such a limit. This patch
adds support to a new setsockopt() option `TLS_TX_MAX_PAYLOAD_LEN` that
allows for setting the maximum plaintext fragment size. Once set, outgoing
records are no larger than the size specified. This option can be used to
specify the record size limit.
[1] https://www.rfc-editor.org/rfc/rfc8449
Signed-off-by: Wilfred Mallawa <wilfred.mallawa@wdc.com>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Link: https://patch.msgid.link/20251022001937.20155-1-wilfred.opensource@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
In multi-radio wiphy architecture, where a single wiphy can have
multiple radios tied to it, radio specific configuration parameters
and global wiphy parameters are maintained for the entire physical
device and common to all radios. But, each radio in a wiphy can have
different values for each radio configuration parameter, like RTS
threshold. With the current debugfs directory structure, the values
of global wiphy configuration parameters can be viewed, but, values
of individual radio configuration parameters cannot be viewed, as
radio specific configuration parameters are not maintained, separately.
To address this, in addition to maintaining global wiphy configuration
parameters common to all radios, create separate debugfs directories
for each radio in a wiphy to maintain parameters corresponding to that
radio in this directory.
In implementation, maintain a dentry structure in wiphy_radio_cfg, a
structure containing radio configurations of a wiphy. This struct is
maintained to denote per-radio configurations of a wiphy. Create
separate directories representing each radio within phy#X directory in
debugfs during wiphy registration.
Sample directory structure with this change:
ls /sys/kernel/debug/ieee80211/phy0/radio
radio0/ radio1/ radio2/
Signed-off-by: Roopni Devanathan <quic_rdevanat@quicinc.com>
Link: https://patch.msgid.link/20251024044649.483557-2-quic_rdevanat@quicinc.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
In multi-radio devices, it is possible to have an MLD AP and a monitor
interface active at the same time. In such cases, monitor mode may not
be able to specify a fixed channel and could end up capturing frames
from all radios, including those outside the intended frequency bands.
This patch adds frequency validation for monitor mode. Received frames
are now only processed if their frequency fall within the allowed ranges
of the radios specified by the interface's radio_mask.
This prevents monitor mode from capturing frames outside the supported radio.
Signed-off-by: Ryder Lee <ryder.lee@mediatek.com>
Link: https://patch.msgid.link/700b8284e845d96654eb98431f8eeb5a81503862.1758647858.git.ryder.lee@mediatek.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Only neigh_for_each() and neigh_seq_start/stop() are on the
reader side of neigh_table.lock.
Let's convert rwlock to the plain spinlock.
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20251022054004.2514876-6-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
NEIGH_VAR() is read locklessly in the fast path, and IPv6 ndisc uses
NEIGH_VAR_SET() locklessly.
The next patch will convert neightbl_dump_info() to RCU.
Let's annotate accesses to neigh_param with READ_ONCE() and WRITE_ONCE().
Note that ndisc_ifinfo_sysctl_change() uses &NEIGH_VAR() and we cannot
use '&' with READ_ONCE(), so NEIGH_VAR_PTR() is introduced.
Note also that NEIGH_VAR_INIT() does not need WRITE_ONCE() as it is before
parms is published. Also, the only user hippi_neigh_setup_dev() is no
longer called since commit e3804cbebb ("net: remove COMPAT_NET_DEV_OPS"),
which looks wrong, but probably no one uses HIPPI and RoadRunner.
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20251022054004.2514876-3-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Periodic advertising enabled flag cannot be tracked by the enabled
flag since advertising and periodic advertising each can be
enabled/disabled separately from one another causing the states to be
inconsistent when for example an advertising set is disabled its
enabled flag is set to false which is then used for periodic which has
not being disabled.
Fixes: eca0ae4aea ("Bluetooth: Add initial implementation of BIS connections")
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
This reverts commit c9d84da18d. It
replaces in L2CAP calls to msecs_to_jiffies() to secs_to_jiffies()
and updates the constants accordingly. But the constants are also
used in LCAP Configure Request and L2CAP Configure Response which
expect values in milliseconds.
This may prevent correct usage of L2CAP channel.
To fix it, keep those constants in milliseconds and so revert this
change.
Fixes: c9d84da18d ("Bluetooth: L2CAP: convert timeouts to secs_to_jiffies()")
Signed-off-by: Frédéric Danis <frederic.danis@collabora.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
There is a BUG: KASAN: stack-out-of-bounds in set_mesh_sync due to
memcpy from badly declared on-stack flexible array.
Another crash is in set_mesh_complete() due to double list_del via
mgmt_pending_valid + mgmt_pending_remove.
Use DEFINE_FLEX to declare the flexible array right, and don't memcpy
outside bounds.
As mgmt_pending_valid removes the cmd from list, use mgmt_pending_free,
and also report status on error.
Fixes: 302a1f674c ("Bluetooth: MGMT: Fix possible UAFs")
Signed-off-by: Pauli Virtanen <pav@iki.fi>
Reviewed-by: Paul Menzel <pmenzel@molgen.mpg.de>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
This fixes the state tracking of advertisement set/instance 0x00 which
is considered a legacy instance and is not tracked individually by
adv_instances list, previously it was assumed that hci_dev itself would
track it via HCI_LE_ADV but that is a global state not specifc to
instance 0x00, so to fix it a new flag is introduced that only tracks the
state of instance 0x00.
Fixes: 1488af7b8b ("Bluetooth: hci_sync: Fix hci_resume_advertising_sync")
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Add support for Motorcomm YT921x tags, which includes a proper
configurable ethertype field (default to 0x9988).
Signed-off-by: David Yang <mmyangfl@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/20251017060859.326450-3-mmyangfl@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
UDP TX packets destructor is sock_wfree().
It suffers from a cache line bouncing in sock_def_write_space_wfree().
Instead of reading sk->sk_wmem_alloc after we just did an atomic RMW
on it, use __refcount_sub_and_test() to get the old value for free,
and pass the new value to sock_def_write_space_wfree().
Add __sock_writeable() helper.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20251017133712.2842665-1-edumazet@google.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Correct multiple kernel-doc warnings in nl802154.h:
- Fix a typo on one enum name to avoid a kernel-doc warning.
- Drop 2 enum descriptions that are no longer needed.
- Mark 2 internal enums as "private:" so that kernel-doc is not needed
for them.
Warning: nl802154.h:239 Enum value 'NL802154_CAP_ATTR_MAX_MAXBE' not described in enum 'nl802154_wpan_phy_capability_attr'
Warning: nl802154.h:239 Excess enum value '%NL802154_CAP_ATTR_MIN_CCA_ED_LEVEL' description in 'nl802154_wpan_phy_capability_attr'
Warning: nl802154.h:239 Excess enum value '%NL802154_CAP_ATTR_MAX_CCA_ED_LEVEL' description in 'nl802154_wpan_phy_capability_attr'
Warning: nl802154.h:369 Enum value '__NL802154_CCA_OPT_ATTR_AFTER_LAST' not described in enum 'nl802154_cca_opts'
Warning: nl802154.h:369 Enum value 'NL802154_CCA_OPT_ATTR_MAX' not described in enum 'nl802154_cca_opts'
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20251016035917.1148012-1-rdunlap@infradead.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQQ6NaUOruQGUkvPdG4raS+Z+3y5EwUCaPFVxwAKCRAraS+Z+3y5
E1yDAP98nxKLBFb9auLj8AV1zUNl4lKEpOIhH7ceXaDT7ucsagEAwvSbGAN5sun/
M0+sOpKEZwF8JBoRU8L61uYPYSu7XAk=
=2K9O
-----END PGP SIGNATURE-----
Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Martin KaFai Lau says:
====================
pull-request: bpf-next 2025-10-16
We've added 6 non-merge commits during the last 1 day(s) which contain
a total of 18 files changed, 577 insertions(+), 38 deletions(-).
The main changes are:
1) Bypass the global per-protocol memory accounting either by setting
a netns sysctl or using bpf_setsockopt in a bpf program,
from Kuniyuki Iwashima.
* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next:
selftests/bpf: Add test for sk->sk_bypass_prot_mem.
bpf: Introduce SK_BPF_BYPASS_PROT_MEM.
bpf: Support bpf_setsockopt() for BPF_CGROUP_INET_SOCK_CREATE.
net: Introduce net.core.bypass_prot_mem sysctl.
net: Allow opt-out from global protocol memory accounting.
tcp: Save lock_sock() for memcg in inet_csk_accept().
====================
Link: https://patch.msgid.link/20251016204539.773707-1-martin.lau@linux.dev
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Make tcp-md5 use the MD5 library API (added in 6.18) instead of the
crypto_ahash API. This is much simpler and also more efficient:
- The library API just operates on struct md5_ctx. Just allocate this
struct on the stack instead of using a pool of pre-allocated
crypto_ahash and ahash_request objects.
- The library API accepts standard pointers and doesn't require
scatterlists. So, for hashing the headers just use an on-stack buffer
instead of a pool of pre-allocated kmalloc'ed scratch buffers.
- The library API never fails. Therefore, checking for MD5 hashing
errors is no longer necessary. Update tcp_v4_md5_hash_skb(),
tcp_v6_md5_hash_skb(), tcp_v4_md5_hash_hdr(), tcp_v6_md5_hash_hdr(),
tcp_md5_hash_key(), tcp_sock_af_ops::calc_md5_hash, and
tcp_request_sock_ops::calc_md5_hash to return void instead of int.
- The library API provides direct access to the MD5 code, eliminating
unnecessary overhead such as indirect function calls and scatterlist
management. Microbenchmarks of tcp_v4_md5_hash_skb() on x86_64 show a
speedup from 7518 to 7041 cycles (6% fewer) with skb->len == 1440, or
from 1020 to 678 cycles (33% fewer) with skb->len == 140.
Since tcp_sigpool_hash_skb_data() can no longer be used, add a function
tcp_md5_hash_skb_data() which is specialized to MD5. Of course, to the
extent that this duplicates any code, it's well worth it.
To preserve the existing behavior of TCP-MD5 support being disabled when
the kernel is booted with "fips=1", make tcp_md5_do_add() check
fips_enabled itself. Previously it relied on the error from
crypto_alloc_ahash("md5") being bubbled up. I don't know for sure that
this is actually needed, but this preserves the existing behavior.
Tested with bidirectional TCP-MD5, both IPv4 and IPv6, between a kernel
that includes this commit and a kernel that doesn't include this commit.
(Side note: please don't use TCP-MD5! It's cryptographically weak. But
as long as Linux supports it, it might as well be implemented properly.)
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
Link: https://patch.msgid.link/20251014215836.115616-1-ebiggers@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Since ehash lookups are lockless, if one CPU performs a lookup while
another concurrently deletes and inserts (removing reqsk and inserting sk),
the lookup may fail to find the socket, an RST may be sent.
The call trace map is drawn as follows:
CPU 0 CPU 1
----- -----
inet_ehash_insert()
spin_lock()
sk_nulls_del_node_init_rcu(osk)
__inet_lookup_established()
(lookup failed)
__sk_nulls_add_node_rcu(sk, list)
spin_unlock()
As both deletion and insertion operate on the same ehash chain, this patch
introduces a new sk_nulls_replace_node_init_rcu() helper functions to
implement atomic replacement.
Fixes: 5e0724d027 ("tcp/dccp: fix hashdance race for passive sessions")
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Jiayuan Chen <jiayuan.chen@linux.dev>
Signed-off-by: Xuanqiang Luo <luoxuanqiang@kylinos.cn>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20251015020236.431822-3-xuanqiang.luo@linux.dev
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
In {tcp6,udp6,raw6}_sock, struct ipv6_pinfo is always placed at
the beginning of a new cache line because
1. __alignof__(struct tcp_sock) is 64 due to ____cacheline_aligned
of __cacheline_group_begin(tcp_sock_write_tx)
2. __alignof__(struct udp_sock) is 64 due to ____cacheline_aligned
of struct numa_drop_counters
3. in raw6_sock, struct numa_drop_counters is placed before
struct ipv6_pinfo
. struct ipv6_pinfo is 136 bytes, but the last cache line is
only used by ipv6_fl_list:
$ pahole -C ipv6_pinfo vmlinux
struct ipv6_pinfo {
...
/* --- cacheline 2 boundary (128 bytes) --- */
struct ipv6_fl_socklist * ipv6_fl_list; /* 128 8 */
/* size: 136, cachelines: 3, members: 23 */
Let's move ipv6_fl_list from struct ipv6_pinfo to struct inet_sock
to save a full cache line for {tcp6,udp6,raw6}_sock.
Now, struct ipv6_pinfo is 128 bytes, and {tcp6,udp6,raw6}_sock have
64 bytes less, while {tcp,udp,raw}_sock retain the same size.
Before:
# grep -E "^(RAW|UDP[^L\-]|TCP)" /proc/slabinfo | awk '{print $1, "\t", $4}'
RAWv6 1408
UDPv6 1472
TCPv6 2560
RAW 1152
UDP 1280
TCP 2368
After:
# grep -E "^(RAW|UDP[^L\-]|TCP)" /proc/slabinfo | awk '{print $1, "\t", $4}'
RAWv6 1344
UDPv6 1408
TCPv6 2496
RAW 1152
UDP 1280
TCP 2368
Also, ipv6_fl_list and inet_flags (SNDFLOW bit) are placed in the
same cache line.
$ pahole -C inet_sock vmlinux
...
/* --- cacheline 11 boundary (704 bytes) was 56 bytes ago --- */
struct ipv6_pinfo * pinet6; /* 760 8 */
/* --- cacheline 12 boundary (768 bytes) --- */
struct ipv6_fl_socklist * ipv6_fl_list; /* 768 8 */
unsigned long inet_flags; /* 776 8 */
Doc churn is due to the insufficient Type column (only 1 space short).
Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20251014224210.2964778-1-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Remove busylock spinlock and use a lockless list (llist)
to reduce spinlock contention to the minimum.
Idea is that only one cpu might spin on the qdisc spinlock,
while others simply add their skb in the llist.
After this patch, we get a 300 % improvement on heavy TX workloads.
- Sending twice the number of packets per second.
- While consuming 50 % less cycles.
Note that this also allows in the future to submit batches
to various qdisc->enqueue() methods.
Tested:
- Dual Intel(R) Xeon(R) 6985P-C (480 hyper threads).
- 100Gbit NIC, 30 TX queues with FQ packet scheduler.
- echo 64 >/sys/kernel/slab/skbuff_small_head/cpu_partial (avoid contention in mm)
- 240 concurrent "netperf -t UDP_STREAM -- -m 120 -n"
Before:
16 Mpps (41 Mpps if each thread is pinned to a different cpu)
vmstat 2 5
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
243 0 0 2368988672 51036 1100852 0 0 146 1 242 60 0 9 91 0 0
244 0 0 2368988672 51036 1100852 0 0 536 10 487745 14718 0 52 48 0 0
244 0 0 2368988672 51036 1100852 0 0 512 0 503067 46033 0 52 48 0 0
244 0 0 2368988672 51036 1100852 0 0 512 0 494807 12107 0 52 48 0 0
244 0 0 2368988672 51036 1100852 0 0 702 26 492845 10110 0 52 48 0 0
Lock contention (1 second sample taken on 8 cores)
perf lock record -C0-7 sleep 1; perf lock contention
contended total wait max wait avg wait type caller
442111 6.79 s 162.47 ms 15.35 us spinlock dev_hard_start_xmit+0xcd
5961 9.57 ms 8.12 us 1.60 us spinlock __dev_queue_xmit+0x3a0
244 560.63 us 7.63 us 2.30 us spinlock do_softirq+0x5b
13 25.09 us 3.21 us 1.93 us spinlock net_tx_action+0xf8
If netperf threads are pinned, spinlock stress is very high.
perf lock record -C0-7 sleep 1; perf lock contention
contended total wait max wait avg wait type caller
964508 7.10 s 147.25 ms 7.36 us spinlock dev_hard_start_xmit+0xcd
201 268.05 us 4.65 us 1.33 us spinlock __dev_queue_xmit+0x3a0
12 26.05 us 3.84 us 2.17 us spinlock do_softirq+0x5b
@__dev_queue_xmit_ns:
[256, 512) 21 | |
[512, 1K) 631 | |
[1K, 2K) 27328 |@ |
[2K, 4K) 265392 |@@@@@@@@@@@@@@@@ |
[4K, 8K) 417543 |@@@@@@@@@@@@@@@@@@@@@@@@@@ |
[8K, 16K) 826292 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[16K, 32K) 733822 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
[32K, 64K) 19055 |@ |
[64K, 128K) 17240 |@ |
[128K, 256K) 25633 |@ |
[256K, 512K) 4 | |
After:
29 Mpps (57 Mpps if each thread is pinned to a different cpu)
vmstat 2 5
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
78 0 0 2369573632 32896 1350988 0 0 22 0 331 254 0 8 92 0 0
75 0 0 2369573632 32896 1350988 0 0 22 50 425713 280199 0 23 76 0 0
104 0 0 2369573632 32896 1350988 0 0 290 0 430238 298247 0 23 76 0 0
86 0 0 2369573632 32896 1350988 0 0 132 0 428019 291865 0 24 76 0 0
90 0 0 2369573632 32896 1350988 0 0 502 0 422498 278672 0 23 76 0 0
perf lock record -C0-7 sleep 1; perf lock contention
contended total wait max wait avg wait type caller
2524 116.15 ms 486.61 us 46.02 us spinlock __dev_queue_xmit+0x55b
5821 107.18 ms 371.67 us 18.41 us spinlock dev_hard_start_xmit+0xcd
2377 9.73 ms 35.86 us 4.09 us spinlock ___slab_alloc+0x4e0
923 5.74 ms 20.91 us 6.22 us spinlock ___slab_alloc+0x5c9
121 3.42 ms 193.05 us 28.24 us spinlock net_tx_action+0xf8
6 564.33 us 167.60 us 94.05 us spinlock do_softirq+0x5b
If netperf threads are pinned (~54 Mpps)
perf lock record -C0-7 sleep 1; perf lock contention
32907 316.98 ms 195.98 us 9.63 us spinlock dev_hard_start_xmit+0xcd
4507 61.83 ms 212.73 us 13.72 us spinlock __dev_queue_xmit+0x554
2781 23.53 ms 40.03 us 8.46 us spinlock ___slab_alloc+0x5c9
3554 18.94 ms 34.69 us 5.33 us spinlock ___slab_alloc+0x4e0
233 9.09 ms 215.70 us 38.99 us spinlock do_softirq+0x5b
153 930.66 us 48.67 us 6.08 us spinlock net_tx_action+0xfd
84 331.10 us 14.22 us 3.94 us spinlock ___slab_alloc+0x5c9
140 323.71 us 9.94 us 2.31 us spinlock ___slab_alloc+0x4e0
@__dev_queue_xmit_ns:
[128, 256) 1539830 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
[256, 512) 2299558 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[512, 1K) 483936 |@@@@@@@@@@ |
[1K, 2K) 265345 |@@@@@@ |
[2K, 4K) 145463 |@@@ |
[4K, 8K) 54571 |@ |
[8K, 16K) 10270 | |
[16K, 32K) 9385 | |
[32K, 64K) 7749 | |
[64K, 128K) 26799 | |
[128K, 256K) 2665 | |
[256K, 512K) 665 | |
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Tested-by: Jamal Hadi Salim <jhs@mojatatu.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Link: https://patch.msgid.link/20251014171907.3554413-7-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Replace state2 field with a boolean.
Move it to a hole between qstats and state so that
we shrink Qdisc by a full cache line.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Tested-by: Jamal Hadi Salim <jhs@mojatatu.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Link: https://patch.msgid.link/20251014171907.3554413-6-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This reverts commits 0f022d32c3
and 44180feacc.
Prior patch in this series implemented loop detection
in act_mirred, we can remove q->owner to save some cycles
in the fast path.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Reviewed-by: Victor Nogueira <victor@mojatatu.com>
Tested-by: Jamal Hadi Salim <jhs@mojatatu.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Link: https://patch.msgid.link/20251014171907.3554413-5-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
If a socket has sk->sk_bypass_prot_mem flagged, the socket opts out
of the global protocol memory accounting.
Let's control the flag by a new sysctl knob.
The flag is written once during socket(2) and is inherited to child
sockets.
Tested with a script that creates local socket pairs and send()s a
bunch of data without recv()ing.
Setup:
# mkdir /sys/fs/cgroup/test
# echo $$ >> /sys/fs/cgroup/test/cgroup.procs
# sysctl -q net.ipv4.tcp_mem="1000 1000 1000"
# ulimit -n 524288
Without net.core.bypass_prot_mem, charged to tcp_mem & memcg
# python3 pressure.py &
# cat /sys/fs/cgroup/test/memory.stat | grep sock
sock 22642688 <-------------------------------------- charged to memcg
# cat /proc/net/sockstat| grep TCP
TCP: inuse 2006 orphan 0 tw 0 alloc 2008 mem 5376 <-- charged to tcp_mem
# ss -tn | head -n 5
State Recv-Q Send-Q Local Address:Port Peer Address:Port
ESTAB 2000 0 127.0.0.1:34479 127.0.0.1:53188
ESTAB 2000 0 127.0.0.1:34479 127.0.0.1:49972
ESTAB 2000 0 127.0.0.1:34479 127.0.0.1:53868
ESTAB 2000 0 127.0.0.1:34479 127.0.0.1:53554
# nstat | grep Pressure || echo no pressure
TcpExtTCPMemoryPressures 1 0.0
With net.core.bypass_prot_mem=1, charged to memcg only:
# sysctl -q net.core.bypass_prot_mem=1
# python3 pressure.py &
# cat /sys/fs/cgroup/test/memory.stat | grep sock
sock 2757468160 <------------------------------------ charged to memcg
# cat /proc/net/sockstat | grep TCP
TCP: inuse 2006 orphan 0 tw 0 alloc 2008 mem 0 <- NOT charged to tcp_mem
# ss -tn | head -n 5
State Recv-Q Send-Q Local Address:Port Peer Address:Port
ESTAB 111000 0 127.0.0.1:36019 127.0.0.1:49026
ESTAB 110000 0 127.0.0.1:36019 127.0.0.1:45630
ESTAB 110000 0 127.0.0.1:36019 127.0.0.1:44870
ESTAB 111000 0 127.0.0.1:36019 127.0.0.1:45274
# nstat | grep Pressure || echo no pressure
no pressure
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
Link: https://patch.msgid.link/20251014235604.3057003-4-kuniyu@google.com
Some protocols (e.g., TCP, UDP) implement memory accounting for socket
buffers and charge memory to per-protocol global counters pointed to by
sk->sk_proto->memory_allocated.
Sometimes, system processes do not want that limitation. For a similar
purpose, there is SO_RESERVE_MEM for sockets under memcg.
Also, by opting out of the per-protocol accounting, sockets under memcg
can avoid paying costs for two orthogonal memory accounting mechanisms.
A microbenchmark result is in the subsequent bpf patch.
Let's allow opt-out from the per-protocol memory accounting if
sk->sk_bypass_prot_mem is true.
sk->sk_bypass_prot_mem and sk->sk_prot are placed in the same cache
line, and sk_has_account() always fetches sk->sk_prot before accessing
sk->sk_bypass_prot_mem, so there is no extra cache miss for this patch.
The following patches will set sk->sk_bypass_prot_mem to true, and
then, the per-protocol memory accounting will be skipped.
Note that this does NOT disable memcg, but rather the per-protocol one.
Another option not to use the hole in struct sock_common is create
sk_prot variants like tcp_prot_bypass, but this would complicate
SOCKMAP logic, tcp_bpf_prots etc.
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
Link: https://patch.msgid.link/20251014235604.3057003-3-kuniyu@google.com
sk->sk_refcnt has been converted to refcount_t in 2017.
__sock_put(sk) being refcount_dec(&sk->sk_refcnt), it will complain
loudly if the current refcnt is 1 (or less) in a non racy way.
We can remove four WARN_ON() in favor of the generic refcount_dec()
check.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Xuanqiang Luo<luoxuanqiang@kylinos.cn>
Link: https://patch.msgid.link/20251014140605.2982703-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This is a followup of commit 726e9e8b94 ("tcp: refine
skb->ooo_okay setting") and of prior commit in this series
("net: control skb->ooo_okay from skb_set_owner_w()")
skb->ooo_okay might never be set for bulk flows that always
have at least one skb in a qdisc queue of NIC queue,
especially if TX completion is delayed because of a stressed cpu.
The so-called "strange attractors" has caused many performance
issues (see for instance 9b462d02d6 ("tcp: TCP Small Queues
and strange attractors")), we need to do better.
We have tried very hard to avoid reorders because TCP was
not dealing with them nicely a decade ago.
Use the new net.core.txq_reselection_ms sysctl to let
flows follow XPS and select a more efficient queue.
After this patch, we no longer have to make sure threads
are pinned to cpus, they now can be migrated without
adding too much spinlock/qdisc/TX completion pressure anymore.
TX completion part was problematic, because it added false sharing
on various socket fields, but also added false sharing and spinlock
contention in mm layers. Calling skb_orphan() from ndo_start_xmit()
is not an option unfortunately.
Note for later:
1) move sk->sk_tx_queue_mapping closer
to sk_tx_queue_mapping_jiffies for better cache locality.
2) Study if 9b462d02d6 ("tcp: TCP Small Queues
and strange attractors") could be revised.
Tested:
Used a host with 32 TX queues, shared by groups of 8 cores.
XPS setup :
echo ff >/sys/class/net/eth1/queue/tx-0/xps_cpus
echo ff00 >/sys/class/net/eth1/queue/tx-1/xps_cpus
echo ff0000 >/sys/class/net/eth1/queue/tx-2/xps_cpus
echo ff000000 >/sys/class/net/eth1/queue/tx-3/xps_cpus
echo ff,00000000 >/sys/class/net/eth1/queue/tx-4/xps_cpus
echo ff00,00000000 >/sys/class/net/eth1/queue/tx-5/xps_cpus
echo ff0000,00000000 >/sys/class/net/eth1/queue/tx-6/xps_cpus
echo ff000000,00000000 >/sys/class/net/eth1/queue/tx-7/xps_cpus
...
Launched a tcp_stream with 15 threads and 1000 flows, initially affined to core 0-15
taskset -c 0-15 tcp_stream -T15 -F1000 -l1000 -c -H target_host
Checked that only queues 0 and 1 are used as instructed by XPS :
tc -s qdisc show dev eth1|grep backlog|grep -v "backlog 0b 0p"
backlog 123489410b 1890p
backlog 69809026b 1064p
backlog 52401054b 805p
Then force each thread to run on cpu 1,9,17,25,33,41,49,57,65,73,81,89,97,105,113,121
C=1;PID=`pidof tcp_stream`;for P in `ls /proc/$PID/task`; do taskset -pc $C $P; C=$(($C + 8));done
Set txq_reselection_ms to 1000
echo 1000 > /proc/sys/net/core/txq_reselection_ms
Check that the flows have migrated nicely:
tc -s qdisc show dev eth1|grep backlog|grep -v "backlog 0b 0p"
backlog 130508314b 1916p
backlog 8584380b 126p
backlog 8584380b 126p
backlog 8379990b 123p
backlog 8584380b 126p
backlog 8487484b 125p
backlog 8584380b 126p
backlog 8448120b 124p
backlog 8584380b 126p
backlog 8720640b 128p
backlog 8856900b 130p
backlog 8584380b 126p
backlog 8652510b 127p
backlog 8448120b 124p
backlog 8516250b 125p
backlog 7834950b 115p
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20251013152234.842065-5-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Add a new sysctl to control how often a queue reselection
can happen even if a flow has a persistent queue of skbs
in a Qdisc or NIC queue.
A value of zero means the feature is disabled.
Default is 1000 (1 second).
This sysctl is used in the following patch.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20251013152234.842065-4-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
sk->sk_wmem_alloc is initialized to 1, and sk_wmem_alloc_get()
takes care of this initial value.
Add SK_WMEM_ALLOC_BIAS define to not spread this magic value.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20251013152234.842065-2-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Some applications uses TCP_TX_DELAY socket option after TCP flow
is established.
Some metrics need to be updated, otherwise TCP might take time to
adapt to the new (emulated) RTT.
This patch adjusts tp->srtt_us, tp->rtt_min, icsk_rto
and sk->sk_pacing_rate.
This is best effort, and for instance icsk_rto is reset
without taking backoff into account.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20251013145926.833198-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Now that we have struct netmem_desc, it'd better access the pp fields
via struct netmem_desc rather than struct net_iov.
Introduce netmem_to_nmdesc() for safely converting netmem_ref to
netmem_desc regardless of the type underneath e.i. netmem_desc, net_iov.
While at it, remove __netmem_clear_lsb() and make netmem_to_nmdesc()
used instead.
Suggested-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Byungchul Park <byungchul@sk.com>
Reviewed-by: Mina Almasry <almasrymina@google.com>
Link: https://patch.msgid.link/20251013044133.69472-1-byungchul@sk.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Similarly to ipv4 tunnel, ipv6 version updates dev->needed_headroom, too.
While ipv4 tunnel headroom adjustment growth was limited in
commit 5ae1e9922b ("net: ip_tunnel: prevent perpetual headroom growth"),
ipv6 tunnel yet increases the headroom without any ceiling.
Reflect ipv4 tunnel headroom adjustment limit on ipv6 version.
Credits to Francesco Ruggeri, who was originally debugging this issue
and wrote local Arista-specific patch and a reproducer.
Fixes: 8eb30be035 ("ipv6: Create ip6_tnl_xmit")
Cc: Florian Westphal <fw@strlen.de>
Cc: Francesco Ruggeri <fruggeri05@gmail.com>
Signed-off-by: Dmitry Safonov <dima@arista.com>
Link: https://patch.msgid.link/20251009-ip6_tunnel-headroom-v2-1-8e4dbd8f7e35@arista.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Rx path may be passing around unreferenced sockets, which means
that skb_set_owner_edemux() may not set skb->sk and PSP will crash:
KASAN: null-ptr-deref in range [0x0000000000000010-0x0000000000000017]
RIP: 0010:psp_reply_set_decrypted (./include/net/psp/functions.h:132 net/psp/psp_sock.c:287)
tcp_v6_send_response.constprop.0 (net/ipv6/tcp_ipv6.c:979)
tcp_v6_send_reset (net/ipv6/tcp_ipv6.c:1140 (discriminator 1))
tcp_v6_do_rcv (net/ipv6/tcp_ipv6.c:1683)
tcp_v6_rcv (net/ipv6/tcp_ipv6.c:1912)
Fixes: 659a2899a5 ("tcp: add datapath logic for PSP with inline key exchange")
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20251001022426.2592750-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Core & protocols
----------------
- Improve drop account scalability on NUMA hosts for RAW and UDP sockets
and the backlog, almost doubling the Pps capacity under DoS.
- Optimize the UDP RX performance under stress, reducing contention,
revisiting the binary layout of the involved data structs and
implementing NUMA-aware locking. This improves UDP RX performance by
an additional 50%, even more under extreme conditions.
- Add support for PSP encryption of TCP connections; this mechanism has
some similarities with IPsec and TLS, but offers superior HW offloads
capabilities.
- Ongoing work to support Accurate ECN for TCP. AccECN allows more than
one congestion notification signal per RTT and is a building block for
Low Latency, Low Loss, and Scalable Throughput (L4S).
- Reorganize the TCP socket binary layout for data locality, reducing
the number of touched cachelines in the fastpath.
- Refactor skb deferral free to better scale on large multi-NUMA hosts,
this improves TCP and UDP RX performances significantly on such HW.
- Increase the default socket memory buffer limits from 256K to 4M to
better fit modern link speeds.
- Improve handling of setups with a large number of nexthop, making dump
operating scaling linearly and avoiding unneeded synchronize_rcu() on
delete.
- Improve bridge handling of VLAN FDB, storing a single entry per bridge
instead of one entry per port; this makes the dump order of magnitude
faster on large switches.
- Restore IP ID correctly for encapsulated packets at GSO segmentation
time, allowing GRO to merge packets in more scenarios.
- Improve netfilter matching performance on large sets.
- Improve MPTCP receive path performance by leveraging recently
introduced core infrastructure (skb deferral free) and adopting recent
TCP autotuning changes.
- Allow bridges to redirect to a backup port when the bridge port is
administratively down.
- Introduce MPTCP 'laminar' endpoint that con be used only once per
connection and simplify common MPTCP setups.
- Add RCU safety to dst->dev, closing a lot of possible races.
- A significant crypto library API for SCTP, MPTCP and IPv6 SR, reducing
code duplication.
- Supports pulling data from an skb frag into the linear area of an XDP
buffer.
Things we sprinkled into general kernel code
--------------------------------------------
- Generate netlink documentation from YAML using an integrated
YAML parser.
Driver API
----------
- Support using IPv6 Flow Label in Rx hash computation and RSS queue
selection.
- Introduce API for fetching the DMA device for a given queue, allowing
TCP zerocopy RX on more H/W setups.
- Make XDP helpers compatible with unreadable memory, allowing more
easily building DevMem-enabled drivers with a unified XDP/skbs
datapath.
- Add a new dedicated ethtool callback enabling drivers to provide the
number of RX rings directly, improving efficiency and clarity in RX
ring queries and RSS configuration.
- Introduce a burst period for the health reporter, allowing better
handling of multiple errors due to the same root cause.
- Support for DPLL phase offset exponential moving average, controlling
the average smoothing factor.
Device drivers
--------------
- Add a new Huawei driver for 3rd gen NIC (hinic3).
- Add a new SpacemiT driver for K1 ethernet MAC.
- Add a generic abstraction for shared memory communication devices
(dibps)
- Ethernet high-speed NICs:
- nVidia/Mellanox:
- Use multiple per-queue doorbell, to avoid MMIO contention issues
- support adjacent functions, allowing them to delegate their
SR-IOV VFs to sibling PFs
- support RSS for IPSec offload
- support exposing raw cycle counters in PTP and mlx5
- support for disabling host PFs.
- Intel (100G, ice, idpf):
- ice: support for SRIOV VFs over an Active-Active link aggregate
- ice: support for firmware logging via debugfs
- ice: support for Earliest TxTime First (ETF) hardware offload
- idpf: support basic XDP functionalities and XSk
- Broadcom (bnxt):
- support Hyper-V VF ID
- dynamic SRIOV resource allocations for RoCE
- Meta (fbnic):
- support queue API, zero-copy Rx and Tx
- support basic XDP functionalities
- devlink health support for FW crashes and OTP mem corruptions
- expand hardware stats coverage to FEC, PHY, and Pause
- Wangxun:
- support ethtool coalesce options
- support for multiple RSS contexts
- Ethernet virtual:
- Macsec:
- replace custom netlink attribute checks with policy-level checks
- Bonding:
- support aggregator selection based on port priority
- Microsoft vNIC:
- use page pool fragments for RX buffers instead of full pages to
improve memory efficiency
- Ethernet NICs consumer, and embedded:
- Qualcomm: support Ethernet function for IPQ9574 SoC
- Airoha: implement wlan offloading via NPU
- Freescale
- enetc: add NETC timer PTP driver and add PTP support
- fec: enable the Jumbo frame support for i.MX8QM
- Renesas (R-Car S4): support HW offloading for layer 2 switching
- support for RZ/{T2H, N2H} SoCs
- Cadence (macb): support TAPRIO traffic scheduling
- TI:
- support for Gigabit ICSS ethernet SoC (icssm-prueth)
- Synopsys (stmmac): a lot of cleanups
- Ethernet PHYs:
- Support 10g-qxgmi phy-mode for AQR412C, Felix DSA and Lynx PCS
driver
- Support bcm63268 GPHY power control
- Support for Micrel lan8842 PHY and PTP
- Support for Aquantia AQR412 and AQR115
- CAN:
- a large CAN-XL preparation work
- reorganize raw_sock and uniqframe struct to minimize memory usage
- rcar_canfd: update the CAN-FD handling
- WiFi:
- extended Neighbor Awareness Networking (NAN) support
- S1G channel representation cleanup
- improve S1G support
- WiFi drivers:
- Intel (iwlwifi):
- major refactor and cleanup
- Broadcom (brcm80211):
- support for AP isolation
- RealTek (rtw88/89) rtw88/89:
- preparation work for RTL8922DE support
- MediaTek (mt76):
- HW restart improvements
- MLO support
- Qualcomm/Atheros (ath10k_
- GTK rekey fixes
- Bluetooth drivers:
- btusb: support for several new IDs for MT7925
- btintel: support for BlazarIW core
- btintel_pcie: support for _suspend() / _resume()
- btintel_pcie: support for Scorpious, Panther Lake-H484 IDs
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
-----BEGIN PGP SIGNATURE-----
iQJGBAABCAAwFiEEg1AjqC77wbdLX2LbKSR5jcyPE6QFAmjdBdsSHHBhYmVuaUBy
ZWRoYXQuY29tAAoJECkkeY3MjxOkWnAP/2Jz0ExnYMDxLxkkKqMte+3JNeF/KNjB
zVIslYN/5ekkd1TMXyDLFlWHqUOUMSF9+cK5ZRMHG6/lzOueSzFuuuDFsWs6Kf2f
q7Q9MzlXhR9++YCsdES1uS5x3PPjMInzo2ZivMFa6fUDLLFSzeAOKL9k+RS0EggU
VlXv2Wkm73R0O6KAssgDsHke9cnNz+F0DzhQ0S3qkyZF9tS5NrDeUl7fZ47XZgwb
ZuXdEzKmTTepo2XvZGxJoF2D7nekWFlHhLjEPpDJtET19nwhukCry41/FplJwlKR
dWsYkqLkrOEQKFQ+q++/5c3BgFOG+IrQLbP5bGQF1tZRMl4Dqp9PDxK5fKJfccbS
0VY3Y2qWOSYejT2Ji2kEHR5E5rPyFm3Y5A4Rnpz0yEHj14vL2v4zf7CZRkCyyDfD
doIZXFGkM0+N7QeXLEN833cje9zjaXuP9GAE+2bb+wAWAZAqof4KX8JgHh+y5Rwm
pvUtvFxmEtntlMwNBap8aT3FquGtfZncU8pzAE5kvWjuMvyF0NVWiE5rB2kSQM/X
NLmUdvDyjwwJqthXAJG0fl+a0mNJ/kOAqSOKJDJrfKDGWa+ovwY0iY06SpK0wIbO
Wz7tpMk5MSlYXW8xWMlbyhvvU6T9xuoQ2KV4QTdMxc6Ir3sNX6YkQr+gjQjxB0gx
ST2QF6lZeWFh
=w2Kz
-----END PGP SIGNATURE-----
Merge tag 'net-next-6.18' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next
Pull networking updates from Paolo Abeni:
"Core & protocols:
- Improve drop account scalability on NUMA hosts for RAW and UDP
sockets and the backlog, almost doubling the Pps capacity under DoS
- Optimize the UDP RX performance under stress, reducing contention,
revisiting the binary layout of the involved data structs and
implementing NUMA-aware locking. This improves UDP RX performance
by an additional 50%, even more under extreme conditions
- Add support for PSP encryption of TCP connections; this mechanism
has some similarities with IPsec and TLS, but offers superior HW
offloads capabilities
- Ongoing work to support Accurate ECN for TCP. AccECN allows more
than one congestion notification signal per RTT and is a building
block for Low Latency, Low Loss, and Scalable Throughput (L4S)
- Reorganize the TCP socket binary layout for data locality, reducing
the number of touched cachelines in the fastpath
- Refactor skb deferral free to better scale on large multi-NUMA
hosts, this improves TCP and UDP RX performances significantly on
such HW
- Increase the default socket memory buffer limits from 256K to 4M to
better fit modern link speeds
- Improve handling of setups with a large number of nexthop, making
dump operating scaling linearly and avoiding unneeded
synchronize_rcu() on delete
- Improve bridge handling of VLAN FDB, storing a single entry per
bridge instead of one entry per port; this makes the dump order of
magnitude faster on large switches
- Restore IP ID correctly for encapsulated packets at GSO
segmentation time, allowing GRO to merge packets in more scenarios
- Improve netfilter matching performance on large sets
- Improve MPTCP receive path performance by leveraging recently
introduced core infrastructure (skb deferral free) and adopting
recent TCP autotuning changes
- Allow bridges to redirect to a backup port when the bridge port is
administratively down
- Introduce MPTCP 'laminar' endpoint that con be used only once per
connection and simplify common MPTCP setups
- Add RCU safety to dst->dev, closing a lot of possible races
- A significant crypto library API for SCTP, MPTCP and IPv6 SR,
reducing code duplication
- Supports pulling data from an skb frag into the linear area of an
XDP buffer
Things we sprinkled into general kernel code:
- Generate netlink documentation from YAML using an integrated YAML
parser
Driver API:
- Support using IPv6 Flow Label in Rx hash computation and RSS queue
selection
- Introduce API for fetching the DMA device for a given queue,
allowing TCP zerocopy RX on more H/W setups
- Make XDP helpers compatible with unreadable memory, allowing more
easily building DevMem-enabled drivers with a unified XDP/skbs
datapath
- Add a new dedicated ethtool callback enabling drivers to provide
the number of RX rings directly, improving efficiency and clarity
in RX ring queries and RSS configuration
- Introduce a burst period for the health reporter, allowing better
handling of multiple errors due to the same root cause
- Support for DPLL phase offset exponential moving average,
controlling the average smoothing factor
Device drivers:
- Add a new Huawei driver for 3rd gen NIC (hinic3)
- Add a new SpacemiT driver for K1 ethernet MAC
- Add a generic abstraction for shared memory communication
devices (dibps)
- Ethernet high-speed NICs:
- nVidia/Mellanox:
- Use multiple per-queue doorbell, to avoid MMIO contention
issues
- support adjacent functions, allowing them to delegate their
SR-IOV VFs to sibling PFs
- support RSS for IPSec offload
- support exposing raw cycle counters in PTP and mlx5
- support for disabling host PFs.
- Intel (100G, ice, idpf):
- ice: support for SRIOV VFs over an Active-Active link
aggregate
- ice: support for firmware logging via debugfs
- ice: support for Earliest TxTime First (ETF) hardware offload
- idpf: support basic XDP functionalities and XSk
- Broadcom (bnxt):
- support Hyper-V VF ID
- dynamic SRIOV resource allocations for RoCE
- Meta (fbnic):
- support queue API, zero-copy Rx and Tx
- support basic XDP functionalities
- devlink health support for FW crashes and OTP mem corruptions
- expand hardware stats coverage to FEC, PHY, and Pause
- Wangxun:
- support ethtool coalesce options
- support for multiple RSS contexts
- Ethernet virtual:
- Macsec:
- replace custom netlink attribute checks with policy-level
checks
- Bonding:
- support aggregator selection based on port priority
- Microsoft vNIC:
- use page pool fragments for RX buffers instead of full pages
to improve memory efficiency
- Ethernet NICs consumer, and embedded:
- Qualcomm: support Ethernet function for IPQ9574 SoC
- Airoha: implement wlan offloading via NPU
- Freescale
- enetc: add NETC timer PTP driver and add PTP support
- fec: enable the Jumbo frame support for i.MX8QM
- Renesas (R-Car S4):
- support HW offloading for layer 2 switching
- support for RZ/{T2H, N2H} SoCs
- Cadence (macb): support TAPRIO traffic scheduling
- TI:
- support for Gigabit ICSS ethernet SoC (icssm-prueth)
- Synopsys (stmmac): a lot of cleanups
- Ethernet PHYs:
- Support 10g-qxgmi phy-mode for AQR412C, Felix DSA and Lynx PCS
driver
- Support bcm63268 GPHY power control
- Support for Micrel lan8842 PHY and PTP
- Support for Aquantia AQR412 and AQR115
- CAN:
- a large CAN-XL preparation work
- reorganize raw_sock and uniqframe struct to minimize memory
usage
- rcar_canfd: update the CAN-FD handling
- WiFi:
- extended Neighbor Awareness Networking (NAN) support
- S1G channel representation cleanup
- improve S1G support
- WiFi drivers:
- Intel (iwlwifi):
- major refactor and cleanup
- Broadcom (brcm80211):
- support for AP isolation
- RealTek (rtw88/89) rtw88/89:
- preparation work for RTL8922DE support
- MediaTek (mt76):
- HW restart improvements
- MLO support
- Qualcomm/Atheros (ath10k):
- GTK rekey fixes
- Bluetooth drivers:
- btusb: support for several new IDs for MT7925
- btintel: support for BlazarIW core
- btintel_pcie: support for _suspend() / _resume()
- btintel_pcie: support for Scorpious, Panther Lake-H484 IDs"
* tag 'net-next-6.18' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1536 commits)
net: stmmac: Add support for Allwinner A523 GMAC200
dt-bindings: net: sun8i-emac: Add A523 GMAC200 compatible
Revert "Documentation: net: add flow control guide and document ethtool API"
octeontx2-pf: fix bitmap leak
octeontx2-vf: fix bitmap leak
net/mlx5e: Use extack in set rxfh callback
net/mlx5e: Introduce mlx5e_rss_params for RSS configuration
net/mlx5e: Introduce mlx5e_rss_init_params
net/mlx5e: Remove unused mdev param from RSS indir init
net/mlx5: Improve QoS error messages with actual depth values
net/mlx5e: Prevent entering switchdev mode with inconsistent netns
net/mlx5: HWS, Generalize complex matchers
net/mlx5: Improve write-combining test reliability for ARM64 Grace CPUs
selftests/net: add tcp_port_share to .gitignore
Revert "net/mlx5e: Update and set Xon/Xoff upon MTU set"
net: add NUMA awareness to skb_attempt_defer_free()
net: use llist for sd->defer_list
net: make softnet_data.defer_count an atomic
selftests: drv-net: psp: add tests for destroying devices
selftests: drv-net: psp: add test for auto-adjusting TCP MSS
...
Cross-merge networking fixes after downstream PR (net-6.17-rc8).
Conflicts:
tools/testing/selftests/drivers/net/bonding/Makefile
87951b5664 selftests: bonding: add test for passive LACP mode
c2377f1763 selftests: bonding: add test for LACP actor port priority
Adjacent changes:
drivers/net/ethernet/cadence/macb.h
fca3dc859b net: macb: remove illusion about TBQPH/RBQPH being per-queue
89934dbf16 net: macb: Add TAPRIO traffic scheduling support
drivers/net/ethernet/cadence/macb_main.c
fca3dc859b net: macb: remove illusion about TBQPH/RBQPH being per-queue
89934dbf16 net: macb: Add TAPRIO traffic scheduling support
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE+soXsSLHKoYyzcli6rmadz2vbToFAmjZH40ACgkQ6rmadz2v
bTrG7w//X/5CyDoKIYJCqynYRdMtfqYuCe8Jhud4p5++iBVqkDyS6Y8EFLqZVyg/
UHTqaSE4Nz8/pma0WSjhUYn6Chs1AeH+Rw/g109SovE/YGkek2KNwY3o2hDrtPMX
+oD0my8qF2HLKgEyteXXyZ5Ju+AaF92JFiGko4/wNTX8O99F9nyz2pTkrctS9Vl9
VwuTxrEXpmhqrhP3WCxkfNfcbs9HP+AALpgOXZKdMI6T4KI0N1gnJ0ZWJbiXZ8oT
tug0MTPkNRidYMl0wHY2LZ6ZG8Q3a7Sgc+M0xFzaHGvGlJbBg1HjsDMtT6j34CrG
TIVJ/O8F6EJzAnQ5Hio0FJk8IIgMRgvng5Kd5GXidU+mE6zokTyHIHOXitYkBQNH
Hk+lGA7+E2cYqUqKvB5PFoyo+jlucuIH7YwrQlyGfqz+98n65xCgZKcmdVXr0hdB
9v3WmwJFtVIoPErUvBC3KRANQYhFk4eVk1eiGV/20+eIVyUuNbX6wqSWSA9uEXLy
n5fm/vlk4RjZmrPZHxcJ0dsl9LTF1VvQQHkgoC1Sz/Cc+jA6k4I+ECVHAqEbk36p
1TUF52yPOD2ViaJKkj+962JaaaXlUn6+Dq7f1GMP6VuyHjz4gsI3mOo4XarqNdWd
c7TnYmlGO/cGwqd4DdbmWiF1DDsrBcBzdbC8+FgffxQHLPXGzUg=
=LeQi
-----END PGP SIGNATURE-----
Merge tag 'bpf-next-6.18' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Pull bpf updates from Alexei Starovoitov:
- Support pulling non-linear xdp data with bpf_xdp_pull_data() kfunc
(Amery Hung)
Applied as a stable branch in bpf-next and net-next trees.
- Support reading skb metadata via bpf_dynptr (Jakub Sitnicki)
Also a stable branch in bpf-next and net-next trees.
- Enforce expected_attach_type for tailcall compatibility (Daniel
Borkmann)
- Replace path-sensitive with path-insensitive live stack analysis in
the verifier (Eduard Zingerman)
This is a significant change in the verification logic. More details,
motivation, long term plans are in the cover letter/merge commit.
- Support signed BPF programs (KP Singh)
This is another major feature that took years to materialize.
Algorithm details are in the cover letter/marge commit
- Add support for may_goto instruction to s390 JIT (Ilya Leoshkevich)
- Add support for may_goto instruction to arm64 JIT (Puranjay Mohan)
- Fix USDT SIB argument handling in libbpf (Jiawei Zhao)
- Allow uprobe-bpf program to change context registers (Jiri Olsa)
- Support signed loads from BPF arena (Kumar Kartikeya Dwivedi and
Puranjay Mohan)
- Allow access to union arguments in tracing programs (Leon Hwang)
- Optimize rcu_read_lock() + migrate_disable() combination where it's
used in BPF subsystem (Menglong Dong)
- Introduce bpf_task_work_schedule*() kfuncs to schedule deferred
execution of BPF callback in the context of a specific task using the
kernel’s task_work infrastructure (Mykyta Yatsenko)
- Enforce RCU protection for KF_RCU_PROTECTED kfuncs (Kumar Kartikeya
Dwivedi)
- Add stress test for rqspinlock in NMI (Kumar Kartikeya Dwivedi)
- Improve the precision of tnum multiplier verifier operation
(Nandakumar Edamana)
- Use tnums to improve is_branch_taken() logic (Paul Chaignon)
- Add support for atomic operations in arena in riscv JIT (Pu Lehui)
- Report arena faults to BPF error stream (Puranjay Mohan)
- Search for tracefs at /sys/kernel/tracing first in bpftool (Quentin
Monnet)
- Add bpf_strcasecmp() kfunc (Rong Tao)
- Support lookup_and_delete_elem command in BPF_MAP_STACK_TRACE (Tao
Chen)
* tag 'bpf-next-6.18' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (197 commits)
libbpf: Replace AF_ALG with open coded SHA-256
selftests/bpf: Add stress test for rqspinlock in NMI
selftests/bpf: Add test case for different expected_attach_type
bpf: Enforce expected_attach_type for tailcall compatibility
bpftool: Remove duplicate string.h header
bpf: Remove duplicate crypto/sha2.h header
libbpf: Fix error when st-prefix_ops and ops from differ btf
selftests/bpf: Test changing packet data from kfunc
selftests/bpf: Add stacktrace map lookup_and_delete_elem test case
selftests/bpf: Refactor stacktrace_map case with skeleton
bpf: Add lookup_and_delete_elem for BPF_MAP_STACK_TRACE
selftests/bpf: Fix flaky bpf_cookie selftest
selftests/bpf: Test changing packet data from global functions with a kfunc
bpf: Emit struct bpf_xdp_sock type in vmlinux BTF
selftests/bpf: Task_work selftest cleanup fixes
MAINTAINERS: Delete inactive maintainers from AF_XDP
bpf: Mark kfuncs as __noclone
selftests/bpf: Add kprobe multi write ctx attach test
selftests/bpf: Add kprobe write ctx attach test
selftests/bpf: Add uprobe context ip register change test
...
Instead of sharing sd->defer_list & sd->defer_count with
many cpus, add one pair for each NUMA node.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250928084934.3266948-4-edumazet@google.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
The active-backup bonding mode supports XFRM ESP offload. However, when
a bond is added using command like `ip link add bond0 type bond mode 1
miimon 100`, the `ethtool -k` command shows that the XFRM ESP offload is
disabled. This occurs because, in bond_newlink(), we change bond link
first and register bond device later. So the XFRM feature update in
bond_option_mode_set() is not called as the bond device is not yet
registered, leading to the offload feature not being set successfully.
To resolve this issue, we can modify the code order in bond_newlink() to
ensure that the bond device is registered first before changing the bond
link parameters. This change will allow the XFRM ESP offload feature to be
correctly enabled.
Fixes: 007ab53455 ("bonding: fix feature flag setting at init time")
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Link: https://patch.msgid.link/20250925023304.472186-1-liuhangbin@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
This reverts commit 4effb335b5.
This was a benefit for UDP flood case, which was later greatly improved
with commits 6471658dc6 ("udp: use skb_attempt_defer_free()")
and b650bf0977 ("udp: remove busylock and add per NUMA queues").
Apparently blamed commit added a regression for RAW sockets, possibly
because they do not use the dual RX queue strategy that UDP has.
sock_queue_rcv_skb_reason() and RAW recvmsg() compete for sk_receive_buf
and sk_rmem_alloc changes, and them being in the same
cache line reduce performance.
Fixes: 4effb335b5 ("net: group sk_backlog and sk_receive_queue")
Reported-by: kernel test robot <oliver.sang@intel.com>
Closes: https://lore.kernel.org/oe-lkp/202509281326.f605b4eb-lkp@intel.com
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Cc: David Ahern <dsahern@kernel.org>
Cc: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250929182112.824154-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
To leverage the auto-tuning improvements brought by commit 2da35e4b4d
("Merge branch 'tcp-receive-side-improvements'"), the MPTCP stack need
to access the mentioned helper.
Acked-by: Geliang Tang <geliang@kernel.org>
Acked-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20250927-net-next-mptcp-rcv-path-imp-v1-2-5da266aa9c1a@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCaNZQgQAKCRCRxhvAZXjc
oiFXAQCpbLvkWbld9wLgxUBhq+q+kw5NvGxzpvqIhXwJB9F9YAEA44/Wevln4xGx
+kRUbP+xlRQqenIYs2dLzVHzAwAdfQ4=
=EO4Y
-----END PGP SIGNATURE-----
Merge tag 'namespace-6.18-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull namespace updates from Christian Brauner:
"This contains a larger set of changes around the generic namespace
infrastructure of the kernel.
Each specific namespace type (net, cgroup, mnt, ...) embedds a struct
ns_common which carries the reference count of the namespace and so
on.
We open-coded and cargo-culted so many quirks for each namespace type
that it just wasn't scalable anymore. So given there's a bunch of new
changes coming in that area I've started cleaning all of this up.
The core change is to make it possible to correctly initialize every
namespace uniformly and derive the correct initialization settings
from the type of the namespace such as namespace operations, namespace
type and so on. This leaves the new ns_common_init() function with a
single parameter which is the specific namespace type which derives
the correct parameters statically. This also means the compiler will
yell as soon as someone does something remotely fishy.
The ns_common_init() addition also allows us to remove ns_alloc_inum()
and drops any special-casing of the initial network namespace in the
network namespace initialization code that Linus complained about.
Another part is reworking the reference counting. The reference
counting was open-coded and copy-pasted for each namespace type even
though they all followed the same rules. This also removes all open
accesses to the reference count and makes it private and only uses a
very small set of dedicated helpers to manipulate them just like we do
for e.g., files.
In addition this generalizes the mount namespace iteration
infrastructure introduced a few cycles ago. As reminder, the vfs makes
it possible to iterate sequentially and bidirectionally through all
mount namespaces on the system or all mount namespaces that the caller
holds privilege over. This allow userspace to iterate over all mounts
in all mount namespaces using the listmount() and statmount() system
call.
Each mount namespace has a unique identifier for the lifetime of the
systems that is exposed to userspace. The network namespace also has a
unique identifier working exactly the same way. This extends the
concept to all other namespace types.
The new nstree type makes it possible to lookup namespaces purely by
their identifier and to walk the namespace list sequentially and
bidirectionally for all namespace types, allowing userspace to iterate
through all namespaces. Looking up namespaces in the namespace tree
works completely locklessly.
This also means we can move the mount namespace onto the generic
infrastructure and remove a bunch of code and members from struct
mnt_namespace itself.
There's a bunch of stuff coming on top of this in the future but for
now this uses the generic namespace tree to extend a concept
introduced first for pidfs a few cycles ago. For a while now we have
supported pidfs file handles for pidfds. This has proven to be very
useful.
This extends the concept to cover namespaces as well. It is possible
to encode and decode namespace file handles using the common
name_to_handle_at() and open_by_handle_at() apis.
As with pidfs file handles, namespace file handles are exhaustive,
meaning it is not required to actually hold a reference to nsfs in
able to decode aka open_by_handle_at() a namespace file handle.
Instead the FD_NSFS_ROOT constant can be passed which will let the
kernel grab a reference to the root of nsfs internally and thus decode
the file handle.
Namespaces file descriptors can already be derived from pidfds which
means they aren't subject to overmount protection bugs. IOW, it's
irrelevant if the caller would not have access to an appropriate
/proc/<pid>/ns/ directory as they could always just derive the
namespace based on a pidfd already.
It has the same advantage as pidfds. It's possible to reliably and for
the lifetime of the system refer to a namespace without pinning any
resources and to compare them trivially.
Permission checking is kept simple. If the caller is located in the
namespace the file handle refers to they are able to open it otherwise
they must hold privilege over the owning namespace of the relevant
namespace.
The namespace file handle layout is exposed as uapi and has a stable
and extensible format. For now it simply contains the namespace
identifier, the namespace type, and the inode number. The stable
format means that userspace may construct its own namespace file
handles without going through name_to_handle_at() as they are already
allowed for pidfs and cgroup file handles"
* tag 'namespace-6.18-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (65 commits)
ns: drop assert
ns: move ns type into struct ns_common
nstree: make struct ns_tree private
ns: add ns_debug()
ns: simplify ns_common_init() further
cgroup: add missing ns_common include
ns: use inode initializer for initial namespaces
selftests/namespaces: verify initial namespace inode numbers
ns: rename to __ns_ref
nsfs: port to ns_ref_*() helpers
net: port to ns_ref_*() helpers
uts: port to ns_ref_*() helpers
ipv4: use check_net()
net: use check_net()
net-sysfs: use check_net()
user: port to ns_ref_*() helpers
time: port to ns_ref_*() helpers
pid: port to ns_ref_*() helpers
ipc: port to ns_ref_*() helpers
cgroup: port to ns_ref_*() helpers
...
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCaNZgMQAKCRCRxhvAZXjc
ornXAP954dZjz+OJw6lJLCf0j9TXJOczGHvK3oW5ZD9KnqtTdwEA7p1A6WMOKJyl
8VtTgCS0yNt8QlznUnsSDfVm0jXVGAY=
=tUXG
-----END PGP SIGNATURE-----
Merge tag 'kernel-6.18-rc1.clone3' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull copy_process updates from Christian Brauner:
"This contains the changes to enable support for clone3() on nios2
which apparently is still a thing.
The more exciting part of this is that it cleans up the inconsistency
in how the 64-bit flag argument is passed from copy_process() into the
various other copy_*() helpers"
[ Fixed up rv ltl_monitor 32-bit support as per Sasha Levin in the merge ]
* tag 'kernel-6.18-rc1.clone3' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
nios2: implement architecture-specific portion of sys_clone3
arch: copy_thread: pass clone_flags as u64
copy_process: pass clone_flags as u64 across calltree
copy_sighand: Handle architectures where sizeof(unsigned long) < sizeof(u64)
-Wflex-array-member-not-at-end was introduced in GCC-14, and we are
getting ready to enable it, globally.
Use the __struct_group() helper to fix 31 instances of the following
type of warnings:
30 net/bluetooth/mgmt_config.c:16:33: warning: structure containing a flexible array member is not at the end of another structure [-Wflex-array-member-not-at-end]
1 net/bluetooth/mgmt_config.c:22:33: warning: structure containing a flexible array member is not at the end of another structure [-Wflex-array-member-not-at-end]
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
When enabling debug via CONFIG_BT_FEATURE_DEBUG include function and
line information by default otherwise it is hard to make any sense of
which function the logs comes from.
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
This attempts to detect if an ISO link has been waiting for an ISO
buffer for longer than the maximum allowed transport latency then
proceed to use hci_link_tx_to which prints an error and disconnects.
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
This aligns the usage of socket sk_sndtimeo as conn_timeout when
initiating a connection and then use it when scheduling the
resulting HCI command, similar to what has been done in bf98feea5b
("Bluetooth: hci_conn: Always use sk_timeo as conn_timeout").
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Add the __counted_by_le() compiler attribute to the flexible array
member 'supported_commands' to improve access bounds-checking via
CONFIG_UBSAN_BOUNDS and CONFIG_FORTIFY_SOURCE.
Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
- mt76: MLO support, HW restart improvements
- rtw88/89: small features, prep for RTL8922DE support
- ath10k: GTK rekey fixes
- cfg80211/mac80211:
- additions for more NAN support
- S1G channel representation cleanup
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEpeA8sTs3M8SN2hR410qiO8sPaAAFAmjU+BcACgkQ10qiO8sP
aADryQ//aarXIxRw6irA9d0IuiDdCAnTpYGSnbSCzWi4I51lnL5VU9bH0BynGHMu
VSuQokuG6Fp+rMD0CPVAhIlTj2TxueItSN1Dl0aLMzKXHSNFd9QmhyLc6XL+ei0M
d8bw1UEjdWxjfEUQHocqP8uCgLAZqOwIlJhvs4HGBrkwNSoMMinjltQMtPNdnqX6
lXHKok717I3kTERBLSVE6m+1OkLizz1inswmvkccJM5JBQ2Y0iCkRewEsXJxfL50
V1LAKgZOy15OPBja3o+tUSkZBYu7GHGirWmColZ8mO0nVOdWaH42XtsCaVA/UMOk
FCeVetfbPlFlQ5huo4rvPDPqUKOjAkVIp10dal/1+mdCvvlyKAnhYslqxxdlS+cY
uwHV9/GGFZkUj2cbx0ARq16BD0Co0tGfEbjeb0dkltjuekfH1jEpC0l4LjM/VDYX
F+UJXeVBe1926g6RjP5Akj7NdGtz8/ND9bo+hpKW7EA5milzwFpDGTSXUAqiBomk
KpGqjeVvLlXmwaAZtZldePEYrZvYGtJmR8mVZA/B6RzVerN1p1ttE+wQgCCLCbXm
vwYLOPYJqBLJCXP1Xr0VIAGJELO4Z/0nu/Cil9jE8CDx/PSysOWRwWa+V5OowdTB
dxC2GHKnEq1Bc2qPQJPeT+bV1NVQXMzaT/kTQ1+vPaKX2P0oRRo=
=3HjM
-----END PGP SIGNATURE-----
Merge tag 'wireless-next-2025-09-25' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next
Johannes Berg says:
====================
Quite a bit more things, including pull requests from drivers:
- mt76: MLO support, HW restart improvements
- rtw88/89: small features, prep for RTL8922DE support
- ath10k: GTK rekey fixes
- cfg80211/mac80211:
- additions for more NAN support
- S1G channel representation cleanup
* tag 'wireless-next-2025-09-25' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next: (167 commits)
wifi: libertas: add WQ_UNBOUND to alloc_workqueue users
Revert "wifi: libertas: WQ_PERCPU added to alloc_workqueue users"
wifi: libertas: WQ_PERCPU added to alloc_workqueue users
wifi: cfg80211: fix width unit in cfg80211_radio_chandef_valid()
wifi: ath11k: HAL SRNG: don't deinitialize and re-initialize again
wifi: ath12k: enforce CPU endian format for all QMI data
wifi: ath12k: Use 1KB Cache Flush Command for QoS TID Descriptors
wifi: ath12k: Fix flush cache failure during RX queue update
wifi: ath12k: Add Retry Mechanism for REO RX Queue Update Failures
wifi: ath12k: Refactor REO command to use ath12k_dp_rx_tid_rxq
wifi: ath12k: Refactor RX TID buffer cleanup into helper function
wifi: ath12k: Refactor RX TID deletion handling into helper function
wifi: ath12k: Increase DP_REO_CMD_RING_SIZE to 256
wifi: cfg80211: remove IEEE80211_CHAN_{1,2,4,8,16}MHZ flags
wifi: rtw89: avoid circular locking dependency in ser_state_run()
wifi: rtw89: fix leak in rtw89_core_send_nullfunc()
wifi: rtw89: avoid possible TX wait initialization race
wifi: rtw89: fix use-after-free in rtw89_core_tx_kick_off_and_wait()
wifi: ath12k: Fix peer lookup in ath12k_dp_mon_rx_deliver_msdu()
wifi: mac80211: fix Rx packet handling when pubsta information is not available
...
====================
Link: https://patch.msgid.link/20250925232341.4544-3-johannes@sipsolutions.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Currently, packets with fixed IDs will be merged only if their
don't-fragment bit is set. This restriction is unnecessary since
packets without the don't-fragment bit will be forwarded as-is even
if they were merged together. The merged packets will be segmented
into their original forms before being forwarded, either by GSO or
by TSO. The IDs will also remain identical unless NETIF_F_TSO_MANGLEID
is set, in which case the IDs can become incrementing, which is also fine.
Clean up the code by removing the unnecessary don't-fragment checks.
Signed-off-by: Richard Gobert <richardbgobert@gmail.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20250923085908.4687-5-richardbgobert@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Only merge encapsulated packets if their outer IDs are either
incrementing or fixed, just like for inner IDs and IDs of non-encapsulated
packets.
Add another ip_fixedid bit for a total of two bits: one for outer IDs (and
for unencapsulated packets) and one for inner IDs.
This commit preserves the current behavior of GSO where only the IDs of the
inner-most headers are restored correctly.
Signed-off-by: Richard Gobert <richardbgobert@gmail.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20250923085908.4687-3-richardbgobert@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Remove is_ipv6 from napi_gro_cb and use sk->sk_family instead.
This frees up space for another ip_fixedid bit that will be added
in the next commit.
udp_sock_create always creates either a AF_INET or a AF_INET6 socket,
so using sk->sk_family is reliable. In IPv6-FOU, cfg->ipv6_v6only is
always enabled.
Signed-off-by: Richard Gobert <richardbgobert@gmail.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20250923085908.4687-2-richardbgobert@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQQ6NaUOruQGUkvPdG4raS+Z+3y5EwUCaNNwBQAKCRAraS+Z+3y5
E8heAQDdJTR9rwAL7gD79cldlHP5PTmjyidLIoFG/efaGSbN1AD9EdvrykDU4xOG
aGaO8TooGUZf7vAL8tIFuMeydYvi/gM=
=Qu4T
-----END PGP SIGNATURE-----
Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Martin KaFai Lau says:
====================
pull-request: bpf-next 2025-09-23
We've added 9 non-merge commits during the last 33 day(s) which contain
a total of 10 files changed, 480 insertions(+), 53 deletions(-).
The main changes are:
1) A new bpf_xdp_pull_data kfunc that supports pulling data from
a frag into the linear area of a xdp_buff, from Amery Hung.
This includes changes in the xdp_native.bpf.c selftest, which
Nimrod's future work depends on.
It is a merge from a stable branch 'xdp_pull_data' which has
also been merged to bpf-next.
There is a conflict with recent changes in 'include/net/xdp.h'
in the net-next tree that will need to be resolved.
2) A compiler warning fix when CONFIG_NET=n in the recent dynptr
skb_meta support, from Jakub Sitnicki.
* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next:
selftests: drv-net: Pull data before parsing headers
selftests/bpf: Test bpf_xdp_pull_data
bpf: Support specifying linear xdp packet data size for BPF_PROG_TEST_RUN
bpf: Make variables in bpf_prog_test_run_xdp less confusing
bpf: Clear packet pointers after changing packet data in kfuncs
bpf: Support pulling non-linear xdp data
bpf: Allow bpf_xdp_shrink_data to shrink a frag from head and tail
bpf: Clear pfmemalloc flag when freeing all fragments
bpf: Return an error pointer for skb metadata when CONFIG_NET=n
====================
Link: https://patch.msgid.link/20250924050303.2466356-1-martin.lau@linux.dev
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
busylock was protecting UDP sockets against packet floods,
but unfortunately was not protecting the host itself.
Under stress, many cpus could spin while acquiring the busylock,
and NIC had to drop packets. Or packets would be dropped
in cpu backlog if RPS/RFS were in place.
This patch replaces the busylock by intermediate
lockless queues. (One queue per NUMA node).
This means that fewer number of cpus have to acquire
the UDP receive queue lock.
Most of the cpus can either:
- immediately drop the packet.
- or queue it in their NUMA aware lockless queue.
Then one of the cpu is chosen to process this lockless queue
in a batch.
The batch only contains packets that were cooked on the same
NUMA node, thus with very limited latency impact.
Tested:
DDOS targeting a victim UDP socket, on a platform with 6 NUMA nodes
(Intel(R) Xeon(R) 6985P-C)
Before:
nstat -n ; sleep 1 ; nstat | grep Udp
Udp6InDatagrams 1004179 0.0
Udp6InErrors 3117 0.0
Udp6RcvbufErrors 3117 0.0
After:
nstat -n ; sleep 1 ; nstat | grep Udp
Udp6InDatagrams 1116633 0.0
Udp6InErrors 14197275 0.0
Udp6RcvbufErrors 14197275 0.0
We can see this host can now proces 14.2 M more packets per second
while under attack, and the victim socket can receive 11 % more
packets.
I used a small bpftrace program measuring time (in us) spent in
__udp_enqueue_schedule_skb().
Before:
@udp_enqueue_us[398]:
[0] 24901 |@@@ |
[1] 63512 |@@@@@@@@@ |
[2, 4) 344827 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[4, 8) 244673 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
[8, 16) 54022 |@@@@@@@@ |
[16, 32) 222134 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
[32, 64) 232042 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
[64, 128) 4219 | |
[128, 256) 188 | |
After:
@udp_enqueue_us[398]:
[0] 5608855 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[1] 1111277 |@@@@@@@@@@ |
[2, 4) 501439 |@@@@ |
[4, 8) 102921 | |
[8, 16) 29895 | |
[16, 32) 43500 | |
[32, 64) 31552 | |
[64, 128) 979 | |
[128, 256) 13 | |
Note that the remaining bottleneck for this platform is in
udp_drops_inc() because we limited struct numa_drop_counters
to only two nodes so far.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250922104240.2182559-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Move skb_frag_t adjustment into bpf_xdp_shrink_data() and extend its
functionality to be able to shrink an xdp fragment from both head and
tail. In a later patch, bpf_xdp_pull_data() will reuse it to shrink an
xdp fragment from head.
Additionally, in bpf_xdp_frags_shrink_tail(), breaking the loop when
bpf_xdp_shrink_data() returns false (i.e., not releasing the current
fragment) is not necessary as the loop condition, offset > 0, has the
same effect. Remove the else branch to simplify the code.
Signed-off-by: Amery Hung <ameryhung@gmail.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Reviewed-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Link: https://patch.msgid.link/20250922233356.3356453-3-ameryhung@gmail.com
It is possible for bpf_xdp_adjust_tail() to free all fragments. The
kfunc currently clears the XDP_FLAGS_HAS_FRAGS bit, but not
XDP_FLAGS_FRAGS_PF_MEMALLOC. So far, this has not caused a issue when
building sk_buff from xdp_buff since all readers of xdp_buff->flags
use the flag only when there are fragments. Clear the
XDP_FLAGS_FRAGS_PF_MEMALLOC bit as well to make the flags correct.
Signed-off-by: Amery Hung <ameryhung@gmail.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Reviewed-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Link: https://patch.msgid.link/20250922233356.3356453-2-ameryhung@gmail.com
Add defines for all event types and subtypes an ism device is known to
produce as it can be helpful for debugging purposes.
Introduces a generic 'struct dibs_event' and adopt ism device driver
and smc-d client accordingly. Tolerate and ignore other type and subtype
values to enable future device extensions.
SMC-D and ISM are now independent.
struct ism_dev can be moved to drivers/s390/net/ism.h.
Note that in smc, the term 'ism' is still used. Future patches could
replace that with 'dibs' or 'smc-d' as appropriate.
Signed-off-by: Julian Ruess <julianr@linux.ibm.com>
Co-developed-by: Alexandra Winter <wintera@linux.ibm.com>
Signed-off-by: Alexandra Winter <wintera@linux.ibm.com>
Link: https://patch.msgid.link/20250918110500.1731261-15-wintera@linux.ibm.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Use struct dibs_dmb instead of struct smc_dmb and move the corresponding
client tables to dibs_dev. Leave driver specific implementation details
like sba in the device drivers.
Register and unregister dmbs via dibs_dev_ops. A dmb is dedicated to a
single client, but a dibs device can have dmbs for more than one client.
Trigger dibs clients via dibs_client_ops->handle_irq(), when data is
received into a dmb. For dibs_loopback replace scheduling an smcd receive
tasklet with calling dibs_client_ops->handle_irq().
For loopback devices attach_dmb(), detach_dmb() and move_data() need to
access the dmb tables, so move those to dibs_dev_ops in this patch as well.
Remove remaining definitions of smc_loopback as they are no longer
required, now that everything is in dibs_loopback.
Note that struct ism_client and struct ism_dev are still required in smc
until a follow-on patch moves event handling to dibs. (Loopback does not
use events).
Signed-off-by: Alexandra Winter <wintera@linux.ibm.com>
Link: https://patch.msgid.link/20250918110500.1731261-14-wintera@linux.ibm.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Provide the dibs_dev_ops->query_remote_gid() in ism and dibs_loopback
dibs_devices. And call it in smc dibs_client.
Reviewed-by: Julian Ruess <julianr@linux.ibm.com>
Signed-off-by: Alexandra Winter <wintera@linux.ibm.com>
Link: https://patch.msgid.link/20250918110500.1731261-13-wintera@linux.ibm.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
It can be debated how much benefit definition of vlan ids for dibs devices
brings, as the dmbs are accessible only by a single peer anyhow. But ism
provides vlan support and smcd exploits it, so move it to dibs layer as an
optional feature.
smcd_loopback simply ignores all vlan settings, do the same in
dibs_loopback.
SMC-D and ISM have a method to use the invalid VLAN ID 1FFF
(ISM_RESERVED_VLANID), to indicate that both communication peers support
routable SMC-Dv2. Tolerate it in dibs, but move it to SMC only.
Signed-off-by: Alexandra Winter <wintera@linux.ibm.com>
Link: https://patch.msgid.link/20250918110500.1731261-12-wintera@linux.ibm.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Define a uuid_t GID attribute to identify a dibs device.
SMC uses 64 Bit and 128 Bit Global Identifiers (GIDs) per device, that
need to be sent via the SMC protocol. Because the smc code uses integers,
network endianness and host endianness need to be considered. Avoid this
in the dibs layer by using uuid_t byte arrays. Future patches could change
SMC to use uuid_t. For now conversion helper functions are introduced.
ISM devices provide 64 Bit GIDs. Map them to dibs uuid_t GIDs like this:
_________________________________________
| 64 Bit ISM-vPCI GID | 00000000_00000000 |
-----------------------------------------
If interpreted as UUID [1], this would be interpreted as the UIID variant,
that is reserved for NCS backward compatibility. So it will not collide
with UUIDs that were generated according to the standard.
smc_loopback already uses version 4 UUIDs as 128 Bit GIDs, move that to
dibs loopback. A temporary change to smc_lo_query_rgid() is required,
that will be moved to dibs_loopback with a follow-on patch.
Provide gid of a dibs device as sysfs read-only attribute.
Link: https://datatracker.ietf.org/doc/html/rfc4122 [1]
Signed-off-by: Alexandra Winter <wintera@linux.ibm.com>
Reviewed-by: Julian Ruess <julianr@linux.ibm.com>
Reviewed-by: Mahanta Jambigi <mjambigi@linux.ibm.com>
Link: https://patch.msgid.link/20250918110500.1731261-11-wintera@linux.ibm.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Move struct device from ism_dev and smc_lo_dev to dibs_dev, and define a
corresponding release function. Free ism_dev in ism_remove() and smc_lo_dev
in smc_lo_dev_remove().
Replace smcd->ops->get_dev(smcd) by using dibs->dev directly.
An alternative design would be to embed dibs_dev as a field in ism_dev and
do the same for other dibs device driver specific structs. However that
would have the disadvantage that each dibs device driver needs to allocate
dibs_dev and each dibs device driver needs a different device release
function. The advantage would be that ism_dev and other device driver
specific structs would be covered by device reference counts.
Signed-off-by: Julian Ruess <julianr@linux.ibm.com>
Co-developed-by: Alexandra Winter <wintera@linux.ibm.com>
Signed-off-by: Alexandra Winter <wintera@linux.ibm.com>
Reviewed-by: Mahanta Jambigi <mjambigi@linux.ibm.com>
Link: https://patch.msgid.link/20250918110500.1731261-9-wintera@linux.ibm.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Move the device add() and remove() functions from ism_client to
dibs_client_ops and call add_dev()/del_dev() for ism devices and
dibs_loopback devices. dibs_client_ops->add_dev() = smcd_register_dev() for
the smc_dibs_client. This is the first step to handle ism and loopback
devices alike (as dibs devices) in the smc dibs client.
Define dibs_dev->ops and move smcd_ops->get_chid to
dibs_dev_ops->get_fabric_id() for ism and loopback devices. See below for
why this needs to be in the same patch as dibs_client_ops->add_dev().
The following changes contain intermediate steps, that will be obsoleted by
follow-on patches, once more functionality has been moved to dibs:
Use different smcd_ops and max_dmbs for ism and loopback. Follow-on patches
will change SMC-D to directly use dibs_ops instead of smcd_ops.
In smcd_register_dev() it is now necessary to identify a dibs_loopback
device before smcd_dev and smcd_ops->get_chid() are available. So provide
dibs_dev_ops->get_fabric_id() in this patch and evaluate it in
smc_ism_is_loopback().
Call smc_loopback_init() in smcd_register_dev() and call
smc_loopback_exit() in smcd_unregister_dev() to handle the functionality
that is still in smc_loopback. Follow-on patches will move all smc_loopback
code to dibs_loopback.
In smcd_[un]register_dev() use only ism device name, this will be replaced
by dibs device name by a follow-on patch.
End of changes with intermediate parts.
Allocate an smcd event workqueue for all dibs devices, although
dibs_loopback does not generate events.
Use kernel memory instead of devres memory for smcd_dev and smcd->conn.
Since commit a72178cfe8 ("net/smc: Fix dependency of SMC on ISM") an ism
device and its driver can have a longer lifetime than the smc module, so
smc should not rely on devres to free its resources [1]. It is now the
responsibility of the smc client to free smcd and smcd->conn for all dibs
devices, ism devices as well as loopback. Call client->ops->del_dev() for
all existing dibs devices in dibs_unregister_client(), so all device
related structures can be freed in the client.
When dibs_unregister_client() is called in the context of smc_exit() or
smc_core_reboot_event(), these functions have already called
smc_lgrs_shutdown() which calls smc_smcd_terminate_all(smcd) and sets
going_away. This is done a second time in smcd_unregister_dev(). This is
analogous to how smcr is handled in these functions, by calling first
smc_lgrs_shutdown() and then smc_ib_unregister_client() >
smc_ib_remove_dev(), so leave it that way. It may be worth investigating,
whether smc_lgrs_shutdown() is still required or useful.
Remove CONFIG_SMC_LO. CONFIG_DIBS_LO now controls whether a dibs loopback
device exists or not.
Link: https://www.kernel.org/doc/Documentation/driver-model/devres.txt [1]
Signed-off-by: Alexandra Winter <wintera@linux.ibm.com>
Reviewed-by: Mahanta Jambigi <mjambigi@linux.ibm.com>
Link: https://patch.msgid.link/20250918110500.1731261-8-wintera@linux.ibm.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
smcd_buf_free() calls smc_ism_unregister_dmb(lgr->smcd, buf_desc) and
then unconditionally frees buf_desc.
Remove the cleaning up of fields of buf_desc in
smc_ism_unregister_dmb(), because it is not helpful.
This removes the only usage of ISM_ERROR from the smc module. So move it
to drivers/s390/net/ism.h.
Signed-off-by: Alexandra Winter <wintera@linux.ibm.com>
Reviewed-by: Mahanta Jambigi <mjambigi@linux.ibm.com>
Reviewed-by: Dust Li <dust.li@linux.alibaba.com>
Link: https://patch.msgid.link/20250918110500.1731261-2-wintera@linux.ibm.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Today, once an inet_bind_bucket enters a state where fastreuse >= 0 or
fastreuseport >= 0 after a socket is explicitly bound to a port, it remains
in that state until all sockets are removed and the bucket is destroyed.
In this state, the bucket is skipped during ephemeral port selection in
connect(). For applications using a reduced ephemeral port
range (IP_LOCAL_PORT_RANGE socket option), this can cause faster port
exhaustion since blocked buckets are excluded from reuse.
The reason the bucket state isn't updated on port release is unclear.
Possibly a performance trade-off to avoid scanning bucket owners, or just
an oversight.
Fix it by recalculating the bucket state when a socket releases a port. To
limit overhead, each inet_bind2_bucket stores its own (fastreuse,
fastreuseport) state. On port release, only the relevant port-addr bucket
is scanned, and the overall state is derived from these.
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250917-update-bind-bucket-state-on-unhash-v5-1-57168b661b47@cloudflare.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
synflood_warned had to be u32 for xchg(), but ensuring
atomicity is not really needed.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250919204856.2977245-9-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
sk->sk_sndbuf is read-mostly in tx path, so move it from
sock_write_tx group to more appropriate sock_read_tx.
sk->sk_err_soft was not identified previously, but
is used from tcp_ack().
Move it to sock_write_tx group for better cache locality.
Also change tcp_ack() to clear sk->sk_err_soft only if needed.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250919204856.2977245-3-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
sk_uid and sk_protocol are read from inet6_csk_route_socket()
for each TCP transmit.
Also read from udpv6_sendmsg(), udp_sendmsg() and others.
Move them to sock_read_tx for better cache locality.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250919204856.2977245-2-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
inet_hash() and inet6_hash() are exactly the same.
Also, we do not need to export inet6_hash().
Let's consolidate the two into __inet_hash() and rename it to inet_hash().
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250919083706.1863217-3-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
__inet_hash() is called from inet_hash() and inet6_hash with osk NULL.
Let's remove the 2nd arg from __inet_hash().
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250919083706.1863217-2-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
These were used by S1G for older chandef representation, but
are no longer needed. Clean them up, even if we can't drop
them from the userspace API entirely.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
udp_v4_early_demux already returns drop reasons as it either returns 0
or ip_mc_validate_source, which itself returns drop reasons. Its return
value is also already used as a drop reason itself.
Makes this explicit by making it return drop reasons.
Signed-off-by: Antoine Tenart <atenart@kernel.org>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/20250915091958.15382-2-atenart@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
assoc_drv_spc is the size of psp_assoc.drv_data[].
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250918192539.1587586-1-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Using flags to check sk_state only makes sense to check for a subset
of states in parallel e.g. sk_fullsock(). We are not doing that
here. Compare for individual states directly.
Signed-off-by: Daniel Zahka <daniel.zahka@gmail.com>
Link: https://patch.msgid.link/20250918155205.2197603-4-daniel.zahka@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
It is weird to cast to a timewait_sock before checking sk_state, even
if the use is after such a check. Remove the tw local variable, and
use inet_twsk() directly in the timewait branch.
Signed-off-by: Daniel Zahka <daniel.zahka@gmail.com>
Link: https://patch.msgid.link/20250918155205.2197603-3-daniel.zahka@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Every namespace type has a container_of(ns, <ns_type>, ns) static inline
function that is currently not exposed in the header. So we have a bunch
of places that open-code it via container_of(). Move it to the headers
so we can use it directly.
Reviewed-by: Aleksa Sarai <cyphar@cyphar.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Currently, the S1G channelisation implementation differs from that of
VHT, which is the PHY that S1G is based on. The major difference between
the clock rate is 1/10th of VHT. However how their channelisation is
represented within cfg80211 and mac80211 vastly differ.
To rectify this, remove the use of IEEE80211_CHAN_1/2/4.. flags that were
previously used to indicate the control channel width, however it should be
implied that the control channels are 1MHz in the case of S1G. Additionally,
introduce the invert - being IEEE80211_CHAN_NO_4/8/16MHz - that imply
the control channel may not be used for a certain bandwidth. With these
new flags, we can perform regulatory and chandef validation just as we would
for VHT.
To deal with the notion that S1G PHYs may contain a 2MHz primary channel,
introduce a new variable, s1g_primary_2mhz, which indicates whether we are
operating on a 2MHz primary channel. In this case, the chandef::chan points to
the 1MHz primary channel pointed to by the primary channel location. Alongside
this, introduce some new helper routines that can extract the sibling 1MHz
channel. The sibling being the alternate 1MHz primary subchannel within the
2MHz primary channel that is not pointed to by chandef::chan.
Furthermore, due to unique restrictions imposed on S1G PHYs, introduce
a new flag, IEEE80211_CHAN_S1G_NO_PRIMARY, which states that the 1MHz channel
cannot be used as a primary channel. This is assumed to be set by vendors
as it is hardware and regdom specific, When we validate a 2MHz primary channel,
we need to ensure both 1MHz subchannels do not contain this flag. If one or
both of the 1MHz subchannels contain this flag then the 2MHz primary is not
permitted for use as a primary channel.
Properly integrate S1G channel validation such that it is implemented
according with other PHY types such as VHT. Additionally, implement a new
S1G-specific regulatory flag to allow cfg80211 to understand specific
vendor requirements for S1G PHYs.
Signed-off-by: Arien Judge <arien.judge@morsemicro.com>
Signed-off-by: Andrew Pope <andrew.pope@morsemicro.com>
Signed-off-by: Lachlan Hodges <lachlan.hodges@morsemicro.com>
Link: https://patch.msgid.link/20250918051913.500781-2-lachlan.hodges@morsemicro.com
[remove redundant NL80211_ATTR_S1G_PRIMARY_2MHZ check]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Add support for sending management frame over a NAN Device
interface:
- Declare support for the supported management frames types.
- Since action frame transmissions over a NAN Device interface
do not necessarily require a channel configuration, e.g., they
can be transmitted during DW, modify the Tx path to avoid
accessing channel information for NAN Device interface.
- In addition modify the points in the Tx path logic to account
for cases that a band is not specified in the Tx information.
Signed-off-by: Ilan Peer <ilan.peer@intel.com>
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
Link: https://patch.msgid.link/20250908140015.23b160089228.I65a58af753bcbcfb5c4ad8ef372d546f889725ba@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
When the driver indicates that the device has joined
a cluster, store the cluster ID. This is needed for data
path operations, e.g., filtering received frames etc.
Signed-off-by: Ilan Peer <ilan.peer@intel.com>
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
Link: https://patch.msgid.link/20250908140015.63e9fef2a3aa.I6c858185c9e71f84bd2c5174d7ee45902b4391c3@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
The drivers should notify upper layers and user space when a NAN device
joins a cluster. This is needed, for example, to set the correct addr3
in SDF frames. Add API to report cluster join event.
Signed-off-by: Andrei Otcheretianski <andrei.otcheretianski@intel.com>
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
Link: https://patch.msgid.link/20250908140015.ad27b7b6e4d9.I70b213a2a49f18d1ba2ad325e67e8eff51cc7a1f@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
This notification will be used by the device to inform user space
about upcoming DW. When received, user space will be able to prepare
multicast Service Discovery Frames (SDFs) to be transmitted during the
next DW using %NL80211_CMD_FRAME command on the NAN management interface.
The device/driver will take care to transmit the frames in the correct
timing. This allows to implement a synchronized Discovery Engine (DE)
in user space, if the device doesn't support DE offload.
Note that this notification can be sent before the actual DW starts as
long as the driver/device handles the actual timing of the SDF
transmission.
Signed-off-by: Andrei Otcheretianski <andrei.otcheretianski@intel.com>
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
Link: https://patch.msgid.link/20250908140015.0e1d15031bab.I5b1721e61b63910452b3c5cdcdc1e94cb094d4c9@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Current NAN APIs have only basic configuration for master
preference and operating bands. Add and parse additional parameters
which provide more control over NAN synchronization. The newly added
attributes allow to publish additional NAN attributes and vendor
elements in NAN beacons, control scan and discovery beacons
periodicity, enable/disable DW notifications etc.
Signed-off-by: Andrei Otcheretianski <andrei.otcheretianski@intel.com>
tested: Miriam Rachel Korenblit <miriam.rachel.korenblit@intel.com>
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
Link: https://patch.msgid.link/20250908140015.a4779492bf8e.I375feb919bd72358173766b9fe10010c40796b33@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Cross-merge networking fixes after downstream PR (net-6.17-rc7).
No conflicts.
Adjacent changes:
drivers/net/ethernet/mellanox/mlx5/core/en/fs.h
9536fbe10c ("net/mlx5e: Add PSP steering in local NIC RX")
7601a0a462 ("net/mlx5e: Add a miss level for ipsec crypto offload")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Andrei Vagin reported that blamed commit broke CRIU.
Indeed, while we want to keep sk_uid unchanged when a socket
is cloned, we want to clear sk->sk_ino.
Otherwise, sock_diag might report multiple sockets sharing
the same inode number.
Move the clearing part from sock_orphan() to sk_set_socket(sk, NULL),
called both from sock_orphan() and sk_clone_lock().
Fixes: 5d6b58c932 ("net: lockless sock_i_ino()")
Closes: https://lore.kernel.org/netdev/aMhX-VnXkYDpKd9V@google.com/
Closes: https://github.com/checkpoint-restore/criu/issues/2744
Reported-by: Andrei Vagin <avagin@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Andrei Vagin <avagin@google.com>
Link: https://patch.msgid.link/20250917135337.1736101-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Create psp_dev_rcv(), which drivers can call to psp decapsulate and attach
a psp_skb_ext to an skb.
psp_dev_rcv() only supports what the PSP architecture specification
refers to as "transport mode" packets, where the L3 header is either
IPv6 or IPv4.
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Raed Salem <raeds@nvidia.com>
Signed-off-by: Rahul Rameshbabu <rrameshbabu@nvidia.com>
Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com>
Co-developed-by: Daniel Zahka <daniel.zahka@gmail.com>
Signed-off-by: Daniel Zahka <daniel.zahka@gmail.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250917000954.859376-18-daniel.zahka@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Create a new function psp_encapsulate(), which takes a TCP packet and
PSP encapsulates it according to the "Transport Mode Packet Format"
section of the PSP Architecture Specification.
psp_encapsulate() does not push a PSP trailer onto the skb. Both IPv6
and IPv4 are supported. Virtualization cookie is not included.
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Raed Salem <raeds@nvidia.com>
Signed-off-by: Rahul Rameshbabu <rrameshbabu@nvidia.com>
Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com>
Co-developed-by: Daniel Zahka <daniel.zahka@gmail.com>
Signed-off-by: Daniel Zahka <daniel.zahka@gmail.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250917000954.859376-14-daniel.zahka@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
There is a (somewhat theoretical in absence of multi-host support)
possibility that another entity will rotate the key and we won't
know. This may lead to accepting packets with matching SPI but
which used different crypto keys than we expected.
The PSP Architecture specification mentions that an implementation
should track device key generation when device keys are managed by the
NIC. Some PSP implementations may opt to include this key generation
state in decryption metadata each time a device key is used to decrypt
a packet. If that is the case, that key generation counter can also be
used when policy checking a decrypted skb against a psp_assoc. This is
an optional feature that is not explicitly part of the PSP spec, but
can provide additional security in the case where an attacker may have
the ability to force key rotations faster than rekeying can occur.
Since we're tracking "key generations" more explicitly now,
maintain different lists for associations from different generations.
This way we can catch stale associations (the user space should
listen to rotation notifications and change the keys).
Drivers can "opt out" of generation tracking by setting
the generation value to 0.
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Daniel Zahka <daniel.zahka@gmail.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250917000954.859376-11-daniel.zahka@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
PSP eats 40B of header space. Adjust MSS appropriately.
We can either modify tcp_mtu_to_mss() / tcp_mss_to_mtu()
or reuse icsk_ext_hdr_len. The former option is more TCP
specific and has runtime overhead. The latter is a bit
of a hack as PSP is not an ext_hdr. If one squints hard
enough, UDP encap is just a more practical version of
IPv6 exthdr, so go with the latter. Happy to change.
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Daniel Zahka <daniel.zahka@gmail.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250917000954.859376-10-daniel.zahka@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Add the ability to install PSP Rx and Tx crypto keys on TCP
connections. Netlink ops are provided for both operations.
Rx side combines allocating a new Rx key and installing it
on the socket. Theoretically these are separate actions,
but in practice they will always be used one after the
other. We can add distinct "alloc" and "install" ops later.
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Co-developed-by: Daniel Zahka <daniel.zahka@gmail.com>
Signed-off-by: Daniel Zahka <daniel.zahka@gmail.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250917000954.859376-9-daniel.zahka@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Provide a callback to validate skb's originating from tcp timewait
socks before passing to the device layer. Full socks have a
sk_validate_xmit_skb member for checking that a device is capable of
performing offloads required for transmitting an skb. With psp, tcp
timewait socks will inherit the crypto state from their corresponding
full socks. Any ACKs or RSTs that originate from a tcp timewait sock
carrying psp state should be psp encapsulated.
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Daniel Zahka <daniel.zahka@gmail.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250917000954.859376-8-daniel.zahka@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Move definition of sk_validate_xmit_skb() from net/core/sock.c to
net/core/dev.c.
This change is in preparation of the next patch, where
sk_validate_xmit_skb() will need to cast sk to a tcp_timewait_sock *,
and access member fields. Including linux/tcp.h from linux/sock.h
creates a circular dependency, and dev.c is the only current call site
of this function.
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Daniel Zahka <daniel.zahka@gmail.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250917000954.859376-7-daniel.zahka@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Rotating the device key is a key part of the PSP protocol design.
Some external daemon needs to do it once a day, or so.
Add a netlink op to perform this operation.
Add a notification group for informing users that key has been
rotated and they should rekey (next rotation will cut them off).
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Daniel Zahka <daniel.zahka@gmail.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250917000954.859376-6-daniel.zahka@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Add validation points and state propagation to support PSP key
exchange inline, on TCP connections. The expectation is that
application will use some well established mechanism like TLS
handshake to establish a secure channel over the connection and
if both endpoints are PSP-capable - exchange and install PSP keys.
Because the connection can existing in PSP-unsecured and PSP-secured
state we need to make sure that there are no race conditions or
retransmission leaks.
On Tx - mark packets with the skb->decrypted bit when PSP key
is at the enqueue time. Drivers should only encrypt packets with
this bit set. This prevents retransmissions getting encrypted when
original transmission was not. Similarly to TLS, we'll use
sk->sk_validate_xmit_skb to make sure PSP skbs can't "escape"
via a PSP-unaware device without being encrypted.
On Rx - validation is done under socket lock. This moves the validation
point later than xfrm, for example. Please see the documentation patch
for more details on the flow of securing a connection, but for
the purpose of this patch what's important is that we want to
enforce the invariant that once connection is secured any skb
in the receive queue has been encrypted with PSP.
Add GRO and coalescing checks to prevent PSP authenticated data from
being combined with cleartext data, or data with non-matching PSP
state. On Rx, check skb's with psp_skb_coalesce_diff() at points
before psp_sk_rx_policy_check(). After skb's are policy checked and on
the socket receive queue, skb_cmp_decrypted() is sufficient for
checking for coalescable PSP state. On Tx, tcp_write_collapse_fence()
should be called when transitioning a socket into PSP Tx state to
prevent data sent as cleartext from being coalesced with PSP
encapsulated data.
This change only adds the validation points, for ease of review.
Subsequent change will add the ability to install keys, and flesh
the enforcement logic out
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Co-developed-by: Daniel Zahka <daniel.zahka@gmail.com>
Signed-off-by: Daniel Zahka <daniel.zahka@gmail.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250917000954.859376-5-daniel.zahka@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Add pointers to psp data structures to core networking structs,
and an SKB extension to carry the PSP information from the drivers
to the socket layer.
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Co-developed-by: Daniel Zahka <daniel.zahka@gmail.com>
Signed-off-by: Daniel Zahka <daniel.zahka@gmail.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250917000954.859376-4-daniel.zahka@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Add a netlink family for PSP and allow drivers to register support.
The "PSP device" is its own object. This allows us to perform more
flexible reference counting / lifetime control than if PSP information
was part of net_device. In the future we should also be able
to "delegate" PSP access to software devices, such as *vlan, veth
or netkit more easily.
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Daniel Zahka <daniel.zahka@gmail.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250917000954.859376-3-daniel.zahka@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
While having all spinlocks packed into an array was a space saver,
this also caused NUMA imbalance and hash collisions.
UDPv6 socket size becomes 1600 after this patch.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250916160951.541279-10-edumazet@google.com
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Generic sk_drops_inc() reads sk->sk_drop_counters.
We know the precise location for UDP sockets.
Move sk_drop_counters out of sock_read_rxtx
so that sock_write_rxtx starts at a cache line boundary.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250916160951.541279-9-edumazet@google.com
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
ipv6_pinfo.daddr_cache is either NULL or &sk->sk_v6_daddr
We do not need 8 bytes, a boolean is enough.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250916160951.541279-3-edumazet@google.com
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
ipv6_pinfo.saddr_cache is either NULL or &np->saddr.
We do not need 8 bytes, a boolean is enough.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250916160951.541279-2-edumazet@google.com
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
The AccECN option ceb/cep heuristic algorithm is from AccECN spec
Appendix A.2.2 to mitigate against false ACE field overflows. Armed
with ceb delta from option, delivered bytes, and delivered packets it
is possible to estimate how many times ACE field wrapped.
This calculation is necessary only if more than one wrap is possible.
Without SACK, delivered bytes and packets are not always trustworthy in
which case TCP falls back to the simpler no-or-all wraps ceb algorithm.
Signed-off-by: Ilpo Järvinen <ij@kernel.org>
Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250916082434.100722-10-chia-yu.chang@nokia-bell-labs.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
AccECN option may fail in various way, handle these:
- Attempt to negotiate the use of AccECN on the 1st retransmitted SYN
- From the 2nd retransmitted SYN, stop AccECN negotiation
- Remove option from SYN/ACK rexmits to handle blackholes
- If no option arrives in SYN/ACK, assume Option is not usable
- If an option arrives later, re-enabled
- If option is zeroed, disable AccECN option processing
This patch use existing padding bits in tcp_request_sock and
holes in tcp_sock without increasing the size.
Signed-off-by: Ilpo Järvinen <ij@kernel.org>
Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250916082434.100722-9-chia-yu.chang@nokia-bell-labs.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Instead of sending the option in every ACK, limit sending to
those ACKs where the option is necessary:
- Handshake
- "Change-triggered ACK" + the ACK following it. The
2nd ACK is necessary to unambiguously indicate which
of the ECN byte counters in increasing. The first
ACK has two counters increasing due to the ecnfield
edge.
- ACKs with CE to allow CEP delta validations to take
advantage of the option.
- Force option to be sent every at least once per 2^22
bytes. The check is done using the bit edges of the
byte counters (avoids need for extra variables).
- AccECN option beacon to send a few times per RTT even if
nothing in the ECN state requires that. The default is 3
times per RTT, and its period can be set via
sysctl_tcp_ecn_option_beacon.
Below are the pahole outcomes before and after this patch,
in which the group size of tcp_sock_write_tx is increased
from 89 to 97 due to the new u64 accecn_opt_tstamp member:
[BEFORE THIS PATCH]
struct tcp_sock {
[...]
u64 tcp_wstamp_ns; /* 2488 8 */
struct list_head tsorted_sent_queue; /* 2496 16 */
[...]
__cacheline_group_end__tcp_sock_write_tx[0]; /* 2521 0 */
__cacheline_group_begin__tcp_sock_write_txrx[0]; /* 2521 0 */
u8 nonagle:4; /* 2521: 0 1 */
u8 rate_app_limited:1; /* 2521: 4 1 */
/* XXX 3 bits hole, try to pack */
/* Force alignment to the next boundary: */
u8 :0;
u8 received_ce_pending:4;/* 2522: 0 1 */
u8 unused2:4; /* 2522: 4 1 */
u8 accecn_minlen:2; /* 2523: 0 1 */
u8 est_ecnfield:2; /* 2523: 2 1 */
u8 unused3:4; /* 2523: 4 1 */
[...]
__cacheline_group_end__tcp_sock_write_txrx[0]; /* 2628 0 */
[...]
/* size: 3200, cachelines: 50, members: 171 */
}
[AFTER THIS PATCH]
struct tcp_sock {
[...]
u64 tcp_wstamp_ns; /* 2488 8 */
u64 accecn_opt_tstamp; /* 2596 8 */
struct list_head tsorted_sent_queue; /* 2504 16 */
[...]
__cacheline_group_end__tcp_sock_write_tx[0]; /* 2529 0 */
__cacheline_group_begin__tcp_sock_write_txrx[0]; /* 2529 0 */
u8 nonagle:4; /* 2529: 0 1 */
u8 rate_app_limited:1; /* 2529: 4 1 */
/* XXX 3 bits hole, try to pack */
/* Force alignment to the next boundary: */
u8 :0;
u8 received_ce_pending:4;/* 2530: 0 1 */
u8 unused2:4; /* 2530: 4 1 */
u8 accecn_minlen:2; /* 2531: 0 1 */
u8 est_ecnfield:2; /* 2531: 2 1 */
u8 accecn_opt_demand:2; /* 2531: 4 1 */
u8 prev_ecnfield:2; /* 2531: 6 1 */
[...]
__cacheline_group_end__tcp_sock_write_txrx[0]; /* 2636 0 */
[...]
/* size: 3200, cachelines: 50, members: 173 */
}
Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
Co-developed-by: Ilpo Järvinen <ij@kernel.org>
Signed-off-by: Ilpo Järvinen <ij@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250916082434.100722-8-chia-yu.chang@nokia-bell-labs.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
The Accurate ECN allows echoing back the sum of bytes for
each IP ECN field value in the received packets using
AccECN option. This change implements AccECN option tx & rx
side processing without option send control related features
that are added by a later change.
Based on specification:
https://tools.ietf.org/id/draft-ietf-tcpm-accurate-ecn-28.txt
(Some features of the spec will be added in the later changes
rather than in this one).
A full-length AccECN option is always attempted but if it does
not fit, the minimum length is selected based on the counters
that have changed since the last update. The AccECN option
(with 24-bit fields) often ends in odd sizes so the option
write code tries to take advantage of some nop used to pad
the other TCP options.
The delivered_ecn_bytes pairs with received_ecn_bytes similar
to how delivered_ce pairs with received_ce. In contrast to
ACE field, however, the option is not always available to update
delivered_ecn_bytes. For ACK w/o AccECN option, the delivered
bytes calculated based on the cumulative ACK+SACK information
are assigned to one of the counters using an estimation
heuristic to select the most likely ECN byte counter. Any
estimation error is corrected when the next AccECN option
arrives. It may occur that the heuristic gets too confused
when there are enough different byte counter deltas between
ACKs with the AccECN option in which case the heuristic just
gives up on updating the counters for a while.
tcp_ecn_option sysctl can be used to select option sending
mode for AccECN: TCP_ECN_OPTION_DISABLED, TCP_ECN_OPTION_MINIMUM,
and TCP_ECN_OPTION_FULL.
This patch increases the size of tcp_info struct, as there is
no existing holes for new u32 variables. Below are the pahole
outcomes before and after this patch:
[BEFORE THIS PATCH]
struct tcp_info {
[...]
__u32 tcpi_total_rto_time; /* 244 4 */
/* size: 248, cachelines: 4, members: 61 */
}
[AFTER THIS PATCH]
struct tcp_info {
[...]
__u32 tcpi_total_rto_time; /* 244 4 */
__u32 tcpi_received_ce; /* 248 4 */
__u32 tcpi_delivered_e1_bytes; /* 252 4 */
__u32 tcpi_delivered_e0_bytes; /* 256 4 */
__u32 tcpi_delivered_ce_bytes; /* 260 4 */
__u32 tcpi_received_e1_bytes; /* 264 4 */
__u32 tcpi_received_e0_bytes; /* 268 4 */
__u32 tcpi_received_ce_bytes; /* 272 4 */
/* size: 280, cachelines: 5, members: 68 */
}
This patch uses the existing 1-byte holes in the tcp_sock_write_txrx
group for new u8 members, but adds a 4-byte hole in tcp_sock_write_rx
group after the new u32 delivered_ecn_bytes[3] member. Therefore, the
group size of tcp_sock_write_rx is increased from 96 to 112. Below
are the pahole outcomes before and after this patch:
[BEFORE THIS PATCH]
struct tcp_sock {
[...]
u8 received_ce_pending:4; /* 2522: 0 1 */
u8 unused2:4; /* 2522: 4 1 */
/* XXX 1 byte hole, try to pack */
[...]
u32 rcv_rtt_last_tsecr; /* 2668 4 */
[...]
__cacheline_group_end__tcp_sock_write_rx[0]; /* 2728 0 */
[...]
/* size: 3200, cachelines: 50, members: 167 */
}
[AFTER THIS PATCH]
struct tcp_sock {
[...]
u8 received_ce_pending:4;/* 2522: 0 1 */
u8 unused2:4; /* 2522: 4 1 */
u8 accecn_minlen:2; /* 2523: 0 1 */
u8 est_ecnfield:2; /* 2523: 2 1 */
u8 unused3:4; /* 2523: 4 1 */
[...]
u32 rcv_rtt_last_tsecr; /* 2668 4 */
u32 delivered_ecn_bytes[3];/* 2672 12 */
/* XXX 4 bytes hole, try to pack */
[...]
__cacheline_group_end__tcp_sock_write_rx[0]; /* 2744 0 */
[...]
/* size: 3200, cachelines: 50, members: 171 */
}
Signed-off-by: Ilpo Järvinen <ij@kernel.org>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Co-developed-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250916082434.100722-7-chia-yu.chang@nokia-bell-labs.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
These three byte counters track IP ECN field payload byte sums for
all arriving (acceptable) packets for ECT0, ECT1, and CE. The
AccECN option (added by a later patch in the series) echoes these
counters back to sender side; therefore, it is placed within the
group of tcp_sock_write_txrx.
Below are the pahole outcomes before and after this patch, in which
the group size of tcp_sock_write_txrx is increased from 95 + 4 to
107 + 4 and an extra 4-byte hole is created but will be exploited
in later patches:
[BEFORE THIS PATCH]
struct tcp_sock {
[...]
u32 delivered_ce; /* 2576 4 */
u32 received_ce; /* 2580 4 */
u32 app_limited; /* 2584 4 */
u32 rcv_wnd; /* 2588 4 */
struct tcp_options_received rx_opt; /* 2592 24 */
__cacheline_group_end__tcp_sock_write_txrx[0]; /* 2616 0 */
[...]
/* size: 3200, cachelines: 50, members: 166 */
}
[AFTER THIS PATCH]
struct tcp_sock {
[...]
u32 delivered_ce; /* 2576 4 */
u32 received_ce; /* 2580 4 */
u32 received_ecn_bytes[3];/* 2584 12 */
u32 app_limited; /* 2596 4 */
u32 rcv_wnd; /* 2600 4 */
struct tcp_options_received rx_opt; /* 2604 24 */
__cacheline_group_end__tcp_sock_write_txrx[0]; /* 2628 0 */
/* XXX 4 bytes hole, try to pack */
[...]
/* size: 3200, cachelines: 50, members: 167 */
}
Signed-off-by: Ilpo Järvinen <ij@kernel.org>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Co-developed-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250916082434.100722-4-chia-yu.chang@nokia-bell-labs.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Accurate ECN negotiation parts based on the specification:
https://tools.ietf.org/id/draft-ietf-tcpm-accurate-ecn-28.txt
Accurate ECN is negotiated using ECE, CWR and AE flags in the
TCP header. TCP falls back into using RFC3168 ECN if one of the
ends supports only RFC3168-style ECN.
The AccECN negotiation includes reflecting IP ECN field value
seen in SYN and SYNACK back using the same bits as negotiation
to allow responding to SYN CE marks and to detect ECN field
mangling. CE marks should not occur currently because SYN=1
segments are sent with Non-ECT in IP ECN field (but proposal
exists to remove this restriction).
Reflecting SYN IP ECN field in SYNACK is relatively simple.
Reflecting SYNACK IP ECN field in the final/third ACK of
the handshake is more challenging. Linux TCP code is not well
prepared for using the final/third ACK a signalling channel
which makes things somewhat complicated here.
tcp_ecn sysctl can be used to select the highest ECN variant
(Accurate ECN, ECN, No ECN) that is attemped to be negotiated and
requested for incoming connection and outgoing connection:
TCP_ECN_IN_NOECN_OUT_NOECN, TCP_ECN_IN_ECN_OUT_ECN,
TCP_ECN_IN_ECN_OUT_NOECN, TCP_ECN_IN_ACCECN_OUT_ACCECN,
TCP_ECN_IN_ACCECN_OUT_ECN, and TCP_ECN_IN_ACCECN_OUT_NOECN.
After this patch, the size of tcp_request_sock remains unchanged
and no new holes are added. Below are the pahole outcomes before
and after this patch:
[BEFORE THIS PATCH]
struct tcp_request_sock {
[...]
u32 rcv_nxt; /* 352 4 */
u8 syn_tos; /* 356 1 */
/* size: 360, cachelines: 6, members: 16 */
}
[AFTER THIS PATCH]
struct tcp_request_sock {
[...]
u32 rcv_nxt; /* 352 4 */
u8 syn_tos; /* 356 1 */
bool accecn_ok; /* 357 1 */
u8 syn_ect_snt:2; /* 358: 0 1 */
u8 syn_ect_rcv:2; /* 358: 2 1 */
u8 accecn_fail_mode:4; /* 358: 4 1 */
/* size: 360, cachelines: 6, members: 20 */
}
After this patch, the size of tcp_sock remains unchanged and no new
holes are added. Also, 4 bits of the existing 2-byte hole are exploited.
Below are the pahole outcomes before and after this patch:
[BEFORE THIS PATCH]
struct tcp_sock {
[...]
u8 dup_ack_counter:2; /* 2761: 0 1 */
u8 tlp_retrans:1; /* 2761: 2 1 */
u8 unused:5; /* 2761: 3 1 */
u8 thin_lto:1; /* 2762: 0 1 */
u8 fastopen_connect:1; /* 2762: 1 1 */
u8 fastopen_no_cookie:1; /* 2762: 2 1 */
u8 fastopen_client_fail:2; /* 2762: 3 1 */
u8 frto:1; /* 2762: 5 1 */
/* XXX 2 bits hole, try to pack */
[...]
u8 keepalive_probes; /* 2765 1 */
/* XXX 2 bytes hole, try to pack */
[...]
/* size: 3200, cachelines: 50, members: 164 */
}
[AFTER THIS PATCH]
struct tcp_sock {
[...]
u8 dup_ack_counter:2; /* 2761: 0 1 */
u8 tlp_retrans:1; /* 2761: 2 1 */
u8 syn_ect_snt:2; /* 2761: 3 1 */
u8 syn_ect_rcv:2; /* 2761: 5 1 */
u8 thin_lto:1; /* 2761: 7 1 */
u8 fastopen_connect:1; /* 2762: 0 1 */
u8 fastopen_no_cookie:1; /* 2762: 1 1 */
u8 fastopen_client_fail:2; /* 2762: 2 1 */
u8 frto:1; /* 2762: 4 1 */
/* XXX 3 bits hole, try to pack */
[...]
u8 keepalive_probes; /* 2765 1 */
u8 accecn_fail_mode:4; /* 2766: 0 1 */
/* XXX 4 bits hole, try to pack */
/* XXX 1 byte hole, try to pack */
[...]
/* size: 3200, cachelines: 50, members: 166 */
}
Signed-off-by: Ilpo Järvinen <ij@kernel.org>
Co-developed-by: Olivier Tilmans <olivier.tilmans@nokia.com>
Signed-off-by: Olivier Tilmans <olivier.tilmans@nokia.com>
Co-developed-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250916082434.100722-3-chia-yu.chang@nokia-bell-labs.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
This change implements Accurate ECN without negotiation and
AccECN Option (that will be added by later changes). Based on
AccECN specifications:
https://tools.ietf.org/id/draft-ietf-tcpm-accurate-ecn-28.txt
Accurate ECN allows feeding back the number of CE (congestion
experienced) marks accurately to the sender in contrast to
RFC3168 ECN that can only signal one marks-seen-yes/no per RTT.
Congestion control algorithms can take advantage of the accurate
ECN information to fine-tune their congestion response to avoid
drastic rate reduction when only mild congestion is encountered.
With Accurate ECN, tp->received_ce (r.cep in AccECN spec) keeps
track of how many segments have arrived with a CE mark. Accurate
ECN uses ACE field (ECE, CWR, AE) to communicate the value back
to the sender which updates tp->delivered_ce (s.cep) based on the
feedback. This signalling channel is lossy when ACE field overflow
occurs.
Conservative strategy is selected here to deal with the ACE
overflow, however, some strategies using the AccECN option later
in the overall patchset mitigate against false overflows detected.
The ACE field values on the wire are offset by
TCP_ACCECN_CEP_INIT_OFFSET. Delivered_ce/received_ce count the
real CE marks rather than forcing all downstream users to adapt
to the wire offset.
This patch uses the first 1-byte hole and the last 4-byte hole of
the tcp_sock_write_txrx for 'received_ce_pending' and 'received_ce'.
Also, the group size of tcp_sock_write_txrx is increased from
91 + 4 to 95 + 4 due to the new u32 received_ce member. Below are
the trimmed pahole outcomes before and after this patch.
[BEFORE THIS PATCH]
struct tcp_sock {
[...]
__cacheline_group_begin__tcp_sock_write_txrx[0]; /* 2521 0 */
u8 nonagle:4; /* 2521: 0 1 */
u8 rate_app_limited:1; /* 2521: 4 1 */
/* XXX 3 bits hole, try to pack */
/* XXX 2 bytes hole, try to pack */
[...]
u32 delivered_ce; /* 2576 4 */
u32 app_limited; /* 2580 4 */
u32 rcv_wnd; /* 2684 4 */
struct tcp_options_received rx_opt; /* 2688 24 */
__cacheline_group_end__tcp_sock_write_txrx[0]; /* 2612 0 */
/* XXX 4 bytes hole, try to pack */
[...]
/* size: 3200, cachelines: 50, members: 161 */
}
[AFTER THIS PATCH]
struct tcp_sock {
[...]
__cacheline_group_begin__tcp_sock_write_txrx[0]; /* 2521 0 */
u8 nonagle:4; /* 2521: 0 1 */
u8 rate_app_limited:1; /* 2521: 4 1 */
/* XXX 3 bits hole, try to pack */
/* Force alignment to the next boundary: */
u8 :0;
u8 received_ce_pending:4;/* 2522: 0 1 */
u8 unused2:4; /* 2522: 4 1 */
/* XXX 1 byte hole, try to pack */
[...]
u32 delivered_ce; /* 2576 4 */
u32 received_ce; /* 2580 4 */
u32 app_limited; /* 2584 4 */
u32 rcv_wnd; /* 2588 4 */
struct tcp_options_received rx_opt; /* 2592 24 */
__cacheline_group_end__tcp_sock_write_txrx[0]; /* 2616 0 */
[...]
/* size: 3200, cachelines: 50, members: 164 */
}
Signed-off-by: Ilpo Järvinen <ij@kernel.org>
Co-developed-by: Olivier Tilmans <olivier.tilmans@nokia.com>
Signed-off-by: Olivier Tilmans <olivier.tilmans@nokia.com>
Co-developed-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250916082434.100722-2-chia-yu.chang@nokia-bell-labs.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
This parameter can be used by drivers to configure a different number of
doorbells.
Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com>
Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The following patches will modify ECN helpers and add AccECN herlpers,
and this patch moves the existing ones into a separated include file.
No functional changes.
Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250911110642.87529-5-chia-yu.chang@nokia-bell-labs.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The following patch will use tcp_ecn_mode_accecn(),
TCP_ACCECN_CEP_INIT_OFFSET, TCP_ACCECN_CEP_ACE_MASK in
__tcp_fast_path_on() to make new flag for AccECN.
No functional changes.
Signed-off-by: Ilpo Järvinen <ij@kernel.org>
Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250911110642.87529-3-chia-yu.chang@nokia-bell-labs.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Both OVS and TC flower allow extracting and matching on the DF bit of
the outer IP header via OVS_TUNNEL_KEY_ATTR_DONT_FRAGMENT in the
OVS_KEY_ATTR_TUNNEL and TCA_FLOWER_KEY_FLAGS_TUNNEL_DONT_FRAGMENT in
the TCA_FLOWER_KEY_ENC_FLAGS respectively. Flow dissector extracts
this information as FLOW_DIS_F_TUNNEL_DONT_FRAGMENT from the tunnel
info key.
However, the IP_TUNNEL_DONT_FRAGMENT_BIT in the tunnel key is never
actually set, because the tunneling code doesn't actually extract it
from the IP header. OAM and CRIT_OPT are extracted by the tunnel
implementation code, same code also sets the KEY flag, if present.
UDP tunnel core takes care of setting the CSUM flag if the checksum
is present in the UDP header, but the DONT_FRAGMENT is not handled at
any layer.
Fix that by checking the bit and setting the corresponding flag while
populating the tunnel info in the IP layer where it belongs.
Not using __assign_bit as we don't really need to clear the bit in a
just initialized field. It also doesn't seem like using __assign_bit
will make the code look better.
Clearly, users didn't rely on this functionality for anything very
important until now. The reason why this doesn't break OVS logic is
that it only matches on what kernel previously parsed out and if kernel
consistently reports this bit as zero, OVS will only match on it to be
zero, which sort of works. But it is still a bug that the uAPI reports
and allows matching on the field that is not actually checked in the
packet. And this is causing misleading -df reporting in OVS datapath
flows, while the tunnel traffic actually has the bit set in most cases.
This may also cause issues if a hardware properly implements support
for tunnel flag matching as it will disagree with the implementation
in a software path of TC flower.
Fixes: 7d5437c709 ("openvswitch: Add tunneling interface.")
Fixes: 1d17568e74 ("net/sched: cls_flower: add support for matching tunnel control flags")
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Link: https://patch.msgid.link/20250909165440.229890-2-i.maximets@ovn.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
During recent testing with the netem qdisc to inject delays into TCP
traffic, we observed that our CLS BPF program failed to function correctly
due to incorrect classid retrieval from task_get_classid(). The issue
manifests in the following call stack:
bpf_get_cgroup_classid+5
cls_bpf_classify+507
__tcf_classify+90
tcf_classify+217
__dev_queue_xmit+798
bond_dev_queue_xmit+43
__bond_start_xmit+211
bond_start_xmit+70
dev_hard_start_xmit+142
sch_direct_xmit+161
__qdisc_run+102 <<<<< Issue location
__dev_xmit_skb+1015
__dev_queue_xmit+637
neigh_hh_output+159
ip_finish_output2+461
__ip_finish_output+183
ip_finish_output+41
ip_output+120
ip_local_out+94
__ip_queue_xmit+394
ip_queue_xmit+21
__tcp_transmit_skb+2169
tcp_write_xmit+959
__tcp_push_pending_frames+55
tcp_push+264
tcp_sendmsg_locked+661
tcp_sendmsg+45
inet_sendmsg+67
sock_sendmsg+98
sock_write_iter+147
vfs_write+786
ksys_write+181
__x64_sys_write+25
do_syscall_64+56
entry_SYSCALL_64_after_hwframe+100
The problem occurs when multiple tasks share a single qdisc. In such cases,
__qdisc_run() may transmit skbs created by different tasks. Consequently,
task_get_classid() retrieves an incorrect classid since it references the
current task's context rather than the skb's originating task.
Given that dev_queue_xmit() always executes with bh disabled, we can use
softirq_count() instead to obtain the correct classid.
The simple steps to reproduce this issue:
1. Add network delay to the network interface:
such as: tc qdisc add dev bond0 root netem delay 1.5ms
2. Build two distinct net_cls cgroups, each with a network-intensive task
3. Initiate parallel TCP streams from both tasks to external servers.
Under this specific condition, the issue reliably occurs. The kernel
eventually dequeues an SKB that originated from Task-A while executing in
the context of Task-B.
It is worth noting that it will change the established behavior for a
slightly different scenario:
<sock S is created by task A>
<class ID for task A is changed>
<skb is created by sock S xmit and classified>
prior to this patch the skb will be classified with the 'new' task A
classid, now with the old/original one. The bpf_get_cgroup_classid_curr()
function is a more appropriate choice for this case.
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Thomas Graf <tgraf@suug.ch>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://patch.msgid.link/20250902062933.30087-1-laoar.shao@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Hosts under DOS attack can suffer from false sharing
in enqueue_to_backlog() : atomic_inc(&sd->dropped).
This is because sd->dropped can be touched from many cpus,
possibly residing on different NUMA nodes.
Generalize the sk_drop_counters infrastucture
added in commit c51613fa27 ("net: add sk->sk_drop_counters")
and use it to replace softnet_data.dropped
with NUMA friendly softnet_data.drop_counters.
This adds 64 bytes per cpu, maybe more in the future
if we increase the number of counters (currently 2)
per 'struct numa_drop_counters'.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250909121942.1202585-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Now that the destruction of info/keys is delayed until the socket
destructor, it's safe to use kfree() without an RCU callback.
The socket is in TCP_CLOSE state either because it never left it,
or it's already closed and the refcounter is zero. In any way,
no one can discover it anymore, it's safe to release memory
straight away.
Similar thing was possible for twsk already.
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Signed-off-by: Dmitry Safonov <dima@arista.com>
Link: https://patch.msgid.link/20250909-b4-tcp-ao-md5-rst-finwait2-v5-2-9ffaaaf8b236@arista.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Currently there are a couple of minor issues with destroying the keys
tcp_v4_destroy_sock():
1. The socket is yet in TCP bind buckets, making it reachable for
incoming segments [on another CPU core], potentially available to send
late FIN/ACK/RST replies.
2. There is at least one code path, where tcp_done() is called before
sending RST [kudos to Bob for investigation]. This is a case of
a server, that finished sending its data and just called close().
The socket is in TCP_FIN_WAIT2 and has RCV_SHUTDOWN (set by
__tcp_close())
tcp_v4_do_rcv()/tcp_v6_do_rcv()
tcp_rcv_state_process() /* LINUX_MIB_TCPABORTONDATA */
tcp_reset()
tcp_done_with_error()
tcp_done()
inet_csk_destroy_sock() /* Destroys AO/MD5 keys */
/* tcp_rcv_state_process() returns SKB_DROP_REASON_TCP_ABORT_ON_DATA */
tcp_v4_send_reset() /* Sends an unsigned RST segment */
tcpdump:
> 22:53:15.399377 00:00:b2:1f:00:00 > 00:00:01:01:00:00, ethertype IPv4 (0x0800), length 74: (tos 0x0, ttl 64, id 33929, offset 0, flags [DF], proto TCP (6), length 60)
> 1.0.0.1.34567 > 1.0.0.2.49848: Flags [F.], seq 2185658590, ack 3969644355, win 502, options [nop,nop,md5 valid], length 0
> 22:53:15.399396 00:00:01:01:00:00 > 00:00:b2:1f:00:00, ethertype IPv4 (0x0800), length 86: (tos 0x0, ttl 64, id 51951, offset 0, flags [DF], proto TCP (6), length 72)
> 1.0.0.2.49848 > 1.0.0.1.34567: Flags [.], seq 3969644375, ack 2185658591, win 128, options [nop,nop,md5 valid,nop,nop,sack 1 {2185658590:2185658591}], length 0
> 22:53:16.429588 00:00:b2:1f:00:00 > 00:00:01:01:00:00, ethertype IPv4 (0x0800), length 60: (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto TCP (6), length 40)
> 1.0.0.1.34567 > 1.0.0.2.49848: Flags [R], seq 2185658590, win 0, length 0
> 22:53:16.664725 00:00:b2:1f:00:00 > 00:00:01:01:00:00, ethertype IPv4 (0x0800), length 74: (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto TCP (6), length 60)
> 1.0.0.1.34567 > 1.0.0.2.49848: Flags [R], seq 2185658591, win 0, options [nop,nop,md5 valid], length 0
> 22:53:17.289832 00:00:b2:1f:00:00 > 00:00:01:01:00:00, ethertype IPv4 (0x0800), length 74: (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto TCP (6), length 60)
> 1.0.0.1.34567 > 1.0.0.2.49848: Flags [R], seq 2185658591, win 0, options [nop,nop,md5 valid], length 0
Note the signed RSTs later in the dump - those are sent by the server
when the fin-wait socket gets removed from hash buckets, by
the listener socket.
Instead of destroying AO/MD5 info and their keys in inet_csk_destroy_sock(),
slightly delay it until the actual socket .sk_destruct(). As shutdown'ed
socket can yet send non-data replies, they should be signed in order for
the peer to process them. Now it also matches how AO/MD5 gets destructed
for TIME-WAIT sockets (in tcp_twsk_destructor()).
This seems optimal for TCP-MD5, while for TCP-AO it seems to have an
open problem: once RST get sent and socket gets actually destructed,
there is no information on the initial sequence numbers. So, in case
this last RST gets lost in the network, the server's listener socket
won't be able to properly sign another RST. Nothing in RFC 1122
prescribes keeping any local state after non-graceful reset.
Luckily, BGP are known to use keep alive(s).
While the issue is quite minor/cosmetic, these days monitoring network
counters is a common practice and getting invalid signed segments from
a trusted BGP peer can get customers worried.
Investigated-by: Bob Gilligan <gilligan@arista.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Signed-off-by: Dmitry Safonov <dima@arista.com>
Link: https://patch.msgid.link/20250909-b4-tcp-ao-md5-rst-finwait2-v5-1-9ffaaaf8b236@arista.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
- iwlwifi: major cleanups/rework
- brcmfmac: gets AP isolation support
- mac80211: gets more S1G support
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEpeA8sTs3M8SN2hR410qiO8sPaAAFAmjCn20ACgkQ10qiO8sP
aADlog//eLH6dmMJSF5d3HCCXzpABholMM7SgFmeL7zz2JVhMe+TD9sdM972A9/d
aSTPhFvj3SESiayohbziKD2Fk2EWLYh35MVxP1VQG/pBMdXGryReWxXuMc+/iep8
wZJjvR7b5dwXc7xWi4BDYxRj4Z+3IMFulmwXMJl9pfBG8l/M2vRd1ihvpLyXwbZ1
k67rcmFltjvIh4DpGdxypO0onuTIgLycGF84YhyFrjeOpB75tM3knzCQMjnHZYo5
YL295cTPfpLY+7iyMhrFv5ootWm5eLfGX+jXA5EmEojOwinZbiN6AB4xDeMWC3fz
rxk5ESo/wGEbJpLiV+CVkkptHYubGk7pA+EPSMQg94+HV4WL+GekZ9/5bRKLZ/Xv
BXNkIL71KxhVd9DiDG5WomgnO2hDpk0zxpMADgZNEPkdgDRcl1Kh1oDw7S/4bp58
V/Ea349wRInesxK10BdJwtsJ0khDg0gaN6f+OG8L7dRw4mSl9akk2e0KVhHoOxYT
rBbK4QKt77308TjlYqzJ+SUFhrb6PhPNgRMFx2zjSgH1eaHONVEM2zYOscQGNgqR
Rfz8d6LUXx+aR87aOT31vUWqjl1KgwqbICiI6KWaQpcKTVK61WVFUiksddK6XyB7
drwoPFekBEjVXOVYaQtyPG3qEGEOVCEjooHXFPQJ+SJeR1sMHCc=
=Nlkd
-----END PGP SIGNATURE-----
Merge tag 'wireless-next-2025-09-11' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next
Johannes Berg says:
====================
Plenty of things going on, notably:
- iwlwifi: major cleanups/rework
- brcmfmac: gets AP isolation support
- mac80211: gets more S1G support
* tag 'wireless-next-2025-09-11' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next: (94 commits)
wifi: mwifiex: fix endianness handling in mwifiex_send_rgpower_table
wifi: cfg80211: Remove the redundant wiphy_dev
wifi: mac80211: fix incorrect comment
wifi: cfg80211: update the time stamps in hidden ssid
wifi: mac80211: Fix HE capabilities element check
wifi: mac80211: add tx_handlers_drop statistics to ethtool
wifi: mac80211: fix reporting of all valid links in sta_set_sinfo()
wifi: iwlwifi: mld: CHANNEL_SURVEY_NOTIF is always supported
wifi: iwlwifi: mld: remove support of iwl_esr_mode_notif version 1
wifi: iwlwifi: mld: remove support from of sta cmd version 1
wifi: iwlwifi: mld: remove support of roc cmd version 5
wifi: iwlwifi: mld: remove support of mac cmd ver 2
wifi: iwlwifi: mld: don't consider phy cmd version 5
wifi: iwlwifi: implement wowlan status notification API update
wifi: iwlwifi: fw: Add ASUS to PPAG and TAS list
wifi: iwlwifi: add kunit tests for nvm parse
wifi: iwlwifi: api: add a flag to iwl_link_ctx_modify_flags
wifi: iwlwifi: pcie: move ltr_enabled to the specific transport
wifi: iwlwifi: pcie: move pm_support to the specific transport
wifi: iwlwifi: rename iwl_finish_nic_init
...
====================
Link: https://patch.msgid.link/20250911100854.20445-3-johannes@sipsolutions.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Cross-merge networking fixes after downstream PR (net-6.17-rc6).
Conflicts:
net/netfilter/nft_set_pipapo.c
net/netfilter/nft_set_pipapo_avx2.c
c4eaca2e10 ("netfilter: nft_set_pipapo: don't check genbit from packetpath lookups")
84c1da7b38 ("netfilter: nft_set_pipapo: use avx2 algorithm for insertions too")
Only trivial adjacent changes (in a doc and a Makefile).
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
__icmp_send() is used to generate ICMP error messages in response to
various situations such as MTU errors (i.e., "Fragmentation Required")
and too many hops (i.e., "Time Exceeded").
The skb that generated the error does not necessarily come from the IPv4
layer and does not always have a valid IPv4 control block in skb->cb.
Therefore, commit 9ef6b42ad6 ("net: Add __icmp_send helper.") changed
the function to take the IP options structure as argument instead of
deriving it from the skb's control block. Some callers of this function
such as icmp_send() pass the IP options structure from the skb's control
block as in these call paths the control block is known to be valid, but
other callers simply pass a zeroed structure.
A subsequent patch will need __icmp_send() to access more information
from the IPv4 control block (specifically, the ifindex of the input
interface). As a preparation for this change, change the function to
take the IPv4 control block structure as an argument instead of the IP
options structure. This makes the function similar to its IPv6
counterpart that already takes the IPv6 control block structure as an
argument.
No functional changes intended.
Reviewed-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250908073238.119240-3-idosch@nvidia.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
We don't expect frags with unreadable memory to be presented
to XDP programs today, but the XDP helpers are designed to be
usable whether XDP is enabled or not. Support handling frags
with unreadable memory.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Link: https://patch.msgid.link/20250905221539.2930285-3-kuba@kernel.org
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
xdp_update_skb_shared_info() needs to update skb state which
was maintained in xdp_buff / frame. Pass full flags into it,
instead of breaking it out bit by bit. We will need to add
a bit for unreadable frags (even tho XDP doesn't support
those the driver paths may be common), at which point almost
all call sites would become:
xdp_update_skb_shared_info(skb, num_frags,
sinfo->xdp_frags_size,
MY_PAGE_SIZE * num_frags,
xdp_buff_is_frag_pfmemalloc(xdp),
xdp_buff_is_frag_unreadable(xdp));
Keep a helper for accessing the flags, in case we need to
transform them somehow in the future (e.g. to cover up xdp_buff
vs xdp_frame differences).
While we are touching call callers - rename the helper to
xdp_update_skb_frags_info(), previous name may have implied that
it's shinfo that's updated. We are updating flags in struct sk_buff
based on frags that got attched.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Acked-by: Jesper Dangaard Brouer <hawk@kernel.org>
Link: https://patch.msgid.link/20250905221539.2930285-2-kuba@kernel.org
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
This function was added for retpoline mitigation and is replaced by a
static inline helper if mitigations are not enabled.
Enable this helper function unconditionally so next patch can add a lookup
restart mechanism to fix possible false negatives while transactions are
in progress.
Adding lookup restarts in nft_lookup_eval doesn't work as nft_objref would
then need the same copypaste loop.
This patch is separate to ease review of the actual bug fix.
Suggested-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Florian Westphal <fw@strlen.de>
This will soon be read from packet path around same time as the gencursor.
Both gencursor and base_seq get incremented almost at the same time, so
it makes sense to place them in the same structure.
This doesn't increase struct net size on 64bit due to padding.
Signed-off-by: Florian Westphal <fw@strlen.de>
NICs are typically configured with total_vfs=0, forcing users to rely
on external tools to enable SR-IOV (a widely used and essential feature).
Add total_vfs parameter to devlink for SR-IOV max VF configurability.
Enables standard kernel tools to manage SR-IOV, addressing the need for
flexible VF configuration.
Signed-off-by: Vlad Dumitrescu <vdumitrescu@nvidia.com>
Tested-by: Kamal Heib <kheib@redhat.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250907012953.301746-2-saeed@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Tony Nguyen says:
====================
idpf: add XDP support
Alexander Lobakin says:
Add XDP support (w/o XSk for now) to the idpf driver using the libeth_xdp
sublib. All possible verdicts, .ndo_xdp_xmit(), multi-buffer etc. are here.
In general, nothing outstanding comparing to ice, except performance --
let's say, up to 2x for .ndo_xdp_xmit() on certain platforms and
scenarios.
idpf doesn't support VLAN Rx offload, so only the hash hint is
available for now.
Patches 1-7 are prereqs, without which XDP would either not work at all
or work slower/worse/...
* '200GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue:
idpf: add XDP RSS hash hint
idpf: add support for .ndo_xdp_xmit()
idpf: add support for XDP on Rx
idpf: use generic functions to build xdp_buff and skb
idpf: implement XDP_SETUP_PROG in ndo_bpf for splitq
idpf: prepare structures to support XDP
idpf: add support for nointerrupt queues
idpf: remove SW marker handling from NAPI
idpf: add 4-byte completion descriptor definition
idpf: link NAPIs to queues
idpf: use a saner limit for default number of queues to allocate
idpf: fix Rx descriptor ready check barrier in splitq
xdp, libeth: make the xdp_init_buff() micro-optimization generic
====================
Link: https://patch.msgid.link/20250908195748.1707057-1-anthony.l.nguyen@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Add a new ad_select policy 'port_priority' that uses the per-port
actor priority values (set via ad_actor_port_prio) to determine
aggregator selection.
This allows administrators to influence which ports are preferred
for aggregation by assigning different priority values, providing
more flexible load balancing control in LACP configurations.
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Link: https://patch.msgid.link/20250902064501.360822-3-liuhangbin@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Introduce a new netlink attribute 'actor_port_prio' to allow setting
the LACP actor port priority on a per-slave basis. This extends the
existing bonding infrastructure to support more granular control over
LACP negotiations.
The priority value is embedded in LACPDU packets and will be used by
subsequent patches to influence aggregator selection policies.
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Link: https://patch.msgid.link/20250902064501.360822-2-liuhangbin@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
No more user of SNMP_MIB_SENTINEL, we can remove it.
Also remove snmp_get_cpu_field[64]_batch() helpers.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Link: https://patch.msgid.link/20250905165813.1470708-10-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Use ARRAY_SIZE(), so that we know the limit at compile time.
Following patch needs this preliminary change.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Link: https://patch.msgid.link/20250905165813.1470708-3-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Often times the compilers are not able to expand two consecutive 32-bit
writes into one 64-bit on the corresponding architectures. This applies
to xdp_init_buff() called for every received frame (or at least once
per each 64 frames when the frag size is fixed).
Move the not-so-pretty hack from libeth_xdp straight to xdp_init_buff(),
but using a proper union around ::frame_sz and ::flags.
The optimization is limited to LE architectures due to the structure
layout.
One simple example from idpf with the XDP series applied (Clang 22-git,
CONFIG_CC_OPTIMIZE_FOR_PERFORMANCE => -O2):
add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-27 (-27)
Function old new delta
idpf_vport_splitq_napi_poll 5076 5049 -27
The perf difference with XDP_DROP is around +0.8-1% which I see as more
than satisfying.
Suggested-by: Simon Horman <horms@kernel.org>
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Tested-by: Ramu R <ramu.r@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
The so-called fullmac devices rely on firmware functionality and/or API to
change BSS parameters. Today there are limited drivers supporting the
nl80211 primitive, but they only handle a subset of the bss parameters
passed if any. The mac80211 driver does handle all parameters and stores
their configured values. Some of the BSS parameters were already conditional
by wiphy->features. For these the wiphy->bss_param_support and wiphy->features
fields are silently aligned in wiphy_register(). Maybe better to issue a warning
instead when they are misaligned.
Signed-off-by: Arend van Spriel <arend.vanspriel@broadcom.com>
Link: https://patch.msgid.link/20250817190435.1495094-2-arend.vanspriel@broadcom.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Add new attributes to support EHT MCS/NSS Tx rates and EHT GI/LTF.
Parse EHT fixed MCS/NSS Tx rates and EHT GI/LTF values passed by the
userspace, validate and add as part of cfg80211_bitrate_mask.
MCS mask is constructed by new function, eht_build_mcs_mask(). Max NSS
supported for MCS rates of 7, 9, 11 and 13 is utilized to set MCS
bitmask for each NSS. MCS rates 14, and 15 if supported, are set only
for NSS = 0.
Co-developed-by: Aloka Dixit <aloka.dixit@oss.qualcomm.com>
Signed-off-by: Aloka Dixit <aloka.dixit@oss.qualcomm.com>
Signed-off-by: Muna Sinada <muna.sinada@oss.qualcomm.com>
Link: https://patch.msgid.link/20250815213011.2704803-1-muna.sinada@oss.qualcomm.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
If a valid radio index is not found, the function returns -ENOENT. If the
channel argument itself is invalid, it returns -EINVAL. However, since the
caller only checks for < 0, the distinction between these error codes is
not utilized much. Also, handling these two distinct error codes throughout
the codebase adds complexity, as both cases must be addressed separately. A
subsequent change aims to simplify this by using a single error code for
all invalid cases, making error handling more consistent and streamlined.
To support this change, update the return value to -EINVAL when a valid
radio index is not found. This is still appropriate because, even if the
channel argument is structurally valid, the absence of a corresponding
radio index implies that the argument is effectively invalid—otherwise, a
valid index would have been found.
Signed-off-by: Aditya Kumar Singh <aditya.kumar.singh@oss.qualcomm.com>
Link: https://patch.msgid.link/20250812-fix_scan_ap_flag_requirement_during_mlo-v4-1-383ffb6da213@oss.qualcomm.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
mlx5 pokes into the rxq state to check if the queue has a memory
provider, and therefore whether it may produce unreadable mem.
Add a helper for doing this in the page pool API. fbnic will want
a similar thing (tho, for a slightly different reason).
Reviewed-by: Mina Almasry <almasrymina@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Link: https://patch.msgid.link/20250901211214.1027927-11-kuba@kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
An exchange with a NFC target must complete within NCI_DATA_TIMEOUT.
A delay of 700 ms is not sufficient for cryptographic operations on smart
cards. CardOS 6.0 may need up to 1.3 seconds to perform 256-bit ECDH
or 3072-bit RSA. To prevent brute-force attacks, passports and similar
documents introduce even longer delays into access control protocols
(BAC/PACE).
The timeout should be higher, but not too much. The expiration allows
us to detect that a NFC target has disappeared.
Signed-off-by: Juraj Šarinay <juraj@sarinay.com>
Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Link: https://patch.msgid.link/20250902113630.62393-1-juraj@sarinay.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Followup of commit c51da3f7a1 ("net: remove sock_i_uid()")
A recent syzbot report was the trigger for this change.
Over the years, we had many problems caused by the
read_lock[_bh](&sk->sk_callback_lock) in sock_i_uid().
We could fix smc_diag_dump_proto() or make a more radical move:
Instead of waiting for new syzbot reports, cache the socket
inode number in sk->sk_ino, so that we no longer
need to acquire sk->sk_callback_lock in sock_i_ino().
This makes socket dumps faster (one less cache line miss,
and two atomic ops avoided).
Prior art:
commit 25a9c8a443 ("netlink: Add __sock_i_ino() for __netlink_diag_dump().")
commit 4f9bf2a2f5 ("tcp: Don't acquire inet_listen_hashbucket::lock with disabled BH.")
commit efc3dbc374 ("rds: Make rds_sock_lock BH rather than IRQ safe.")
Fixes: d2d6422f8b ("x86: Allow to enable PREEMPT_RT.")
Reported-by: syzbot+50603c05bbdf4dfdaffa@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/netdev/68b73804.050a0220.3db4df.01d8.GAE@google.com/T/#u
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Link: https://patch.msgid.link/20250902183603.740428-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJBBAABCAArFiEEgKkgxbID4Gn1hq6fcJGo2a1f9gAFAmi28PMNHGZ3QHN0cmxl
bi5kZQAKCRBwkajZrV/2AJ7LD/9vb6dwLS7u7+wrksZUdDdkIvIN7TOeiN0JJ/xd
XOAiABE2LVwoAyA7zHDdQKjkB+3kJwhSQcS3RO+qn3E77JGlADRWxYI4tg5FfWMC
WrDGSVW5vU92611s2i8GvLdCeZ7u7/qoMMAIjlnS8XHBq+6SHMacmwbH/W/4pMKR
58XFW+fYOJF4q9ATH9F6DXAFys9ZC873dv2rUw7/jXiAcHggwpWFlPVPbimWcrT/
u+z07preK9jALaYj5ctUsRznwqd0x7AwSsLI/e6VgPtXFhdTPdsH//aWwkMlX1Uw
eQ0i9XoarYzoA5wnYaB2t5e0b0haAll+rxJRM3ISavizmBx6BPHOyerpe5J3kmO5
H5A8Low+KB/zJkiovTY9A5uLzpXaScS00kD5a6obAPFfCm64kwctBTmV64+zPfYo
/36/zze/mpfjI8FMdyCZfOZgSfQ5KeCZg7FMi6uNZMg0PoMkW5yCrnjXX2Oaz2Rj
TukUGXg6lFRMBjK4jiCy/HXusn+Xuv1FrHwYwAvYvtbOxwbUZkVo1h65f3Fng/vU
vAtWfw6Gh+I9AKWiTyC2FRm0rZ26K+p/8OrErGIJmsGbcqXcYFZE0bqvfhIDzMWg
VzykeGbK51q3s4nFR9C/7wcprDTaQCi3/sobu4ussetkeDMU3d9aEsxkOxNPkz1j
dxJT/w==
=wamJ
-----END PGP SIGNATURE-----
Merge tag 'nf-next-25-09-02' of https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next
Florian Westphal says:
====================
netfilter: updates for net-next
1) prefer vmalloc_array in ebtables, from Qianfeng Rong.
2) Use csum_replace4 instead of open-coding it, from Christophe Leroy.
3+4) Get rid of GFP_ATOMIC in transaction object allocations, those
cause silly failures with large sets under memory pressure, from
myself.
5) Remove test for AVX cpu feature in nftables pipapo set type,
testing for AVX2 feature is sufficient.
6) Unexport a few function in nf_reject infra: no external callers.
7) Extend payload offset to u16, this was restricted to values <=255
so far, from Fernando Fernandez Mancera.
* tag 'nf-next-25-09-02' of https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next:
netfilter: nft_payload: extend offset to 65535 bytes
netfilter: nf_reject: remove unneeded exports
netfilter: nft_set_pipapo: remove redundant test for avx feature bit
netfilter: nf_tables: all transaction allocations can now sleep
netfilter: nf_tables: allow iter callbacks to sleep
netfilter: nft_payload: Use csum_replace4() instead of opencoding
netfilter: ebtables: Use vmalloc_array() to improve code
====================
Link: https://patch.msgid.link/20250902133549.15945-1-fw@strlen.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
In this context "not that ..." should properly be "note that ...".
Signed-off-by: Asbjørn Sloth Tønnesen <ast@fiberby.net>
Reviewed-by: Donald Hunter <donald.hunter@gmail.com>
Link: https://patch.msgid.link/20250902154640.759815-4-ast@fiberby.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Add a helper to check if RFS is needed or not. Allows to make the code a
bit cleaner and the next patch to have MPTCP use this helper to decide
whether or not to iterate over the subflows.
tun_flow_update() was calling sock_rps_record_flow_hash() regardless of
the state of rfs_needed. This was not really a bug as sock_flow_table
simply ends up being NULL and thus everything will be fine.
This commit here thus also implicitly makes tun_flow_update() respect
the state of rfs_needed.
Suggested-by: Matthieu Baerts <matttbe@kernel.org>
Signed-off-by: Christoph Paasch <cpaasch@openai.com>
Acked-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20250902-net-next-mptcp-misc-feat-6-18-v2-3-fa02bb3188b1@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
tcfa_qstats is currently only used to hold drops and overlimits counters.
tcf_action_inc_drop_qstats() and tcf_action_inc_overlimit_qstats()
currently acquire a->tcfa_lock to increment these counters.
Switch to two atomic_t to get lock-free accounting.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Jamal Hadi Salim <jhs@mojatatu.com>
Link: https://patch.msgid.link/20250901093141.2093176-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Remove the implementation of use_carrier, the link monitoring
method that utilizes ethtool or ioctl to determine the link state of an
interface in a bond. Bonding will always behaves as if use_carrier=1,
which relies on netif_carrier_ok() to determine the link state of
interfaces.
To avoid acquiring RTNL many times per second, bonding inspects
link state under RCU, but not under RTNL. However, ethtool
implementations in drivers may sleep, and therefore this strategy is
unsuitable for use with calls into driver ethtool functions.
The use_carrier option was introduced in 2003, to provide
backwards compatibility for network device drivers that did not support
the then-new netif_carrier_ok/on/off system. Device drivers are now
expected to support netif_carrier_*, and the use_carrier backwards
compatibility logic is no longer necessary.
The option itself remains, but when queried always returns 1,
and may only be set to 1.
Link: https://lore.kernel.org/000000000000eb54bf061cfd666a@google.com
Link: https://lore.kernel.org/20240718122017.d2e33aaac43a.I10ab9c9ded97163aef4e4de10985cd8f7de60d28@changeid
Signed-off-by: Jay Vosburgh <jv@jvosburgh.net>
Reported-by: syzbot+b8c48ea38ca27d150063@syzkaller.appspotmail.com
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/2029487.1756512517@famine
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
In some situations 255 bytes offset is not enough to match or manipulate
the desired packet field. Increase the offset limit to 65535 or U16_MAX.
In addition, the nla policy maximum value is not set anymore as it is
limited to s16. Instead, the maximum value is checked during the payload
expression initialization function.
Tested with the nft command line tool.
table ip filter {
chain output {
@nh,2040,8 set 0xff
@nh,524280,8 set 0xff
@nh,524280,8 0xff
@nh,2040,8 0xff
}
}
Signed-off-by: Fernando Fernandez Mancera <fmancera@suse.de>
Signed-off-by: Florian Westphal <fw@strlen.de>
Quoting Sven Auhagen:
we do see on occasions that we get the following error message, more so on
x86 systems than on arm64:
Error: Could not process rule: Cannot allocate memory delete table inet filter
It is not a consistent error and does not happen all the time.
We are on Kernel 6.6.80, seems to me like we have something along the lines
of the nf_tables: allow clone callbacks to sleep problem using GFP_ATOMIC.
As hinted at by Sven, this is because of GFP_ATOMIC allocations during
set flush.
When set is flushed, all elements are deactivated. This triggers a set
walk and each element gets added to the transaction list.
The rbtree and rhashtable sets don't allow the iter callback to sleep:
rbtree walk acquires read side of an rwlock with bh disabled, rhashtable
walk happens with rcu read lock held.
Rbtree is simple enough to resolve:
When the walk context is ITER_READ, no change is needed (the iter
callback must not deactivate elements; we're not in a transaction).
When the iter type is ITER_UPDATE, the rwlock isn't needed because the
caller holds the transaction mutex, this prevents any and all changes to
the ruleset, including add/remove of set elements.
Rhashtable is slightly more complex.
When the iter type is ITER_READ, no change is needed, like rbtree.
For ITER_UPDATE, we hold transaction mutex which prevents elements from
getting free'd, even outside of rcu read lock section.
So build a temporary list of all elements while doing the rcu iteration
and then call the iterator in a second pass.
The disadvantage is the need to iterate twice, but this cost comes with
the benefit to allow the iter callback to use GFP_KERNEL allocations in
a followup patch.
The new list based logic makes it necessary to catch recursive calls to
the same set earlier.
Such walk -> iter -> walk recursion for the same set can happen during
ruleset validation in case userspace gave us a bogus (cyclic) ruleset
where verdict map m jumps to chain that sooner or later also calls
"vmap @m".
Before the new ->in_update_walk test, the ruleset is rejected because the
infinite recursion causes ctx->level to exceed the allowed maximum.
But with the new logic added here, elements would get skipped:
nft_rhash_walk_update would see elements that are on the walk_list of
an older stack frame.
As all recursive calls into same map results in -EMLINK, we can avoid this
problem by using the new in_update_walk flag and reject immediately.
Next patch converts the problematic GFP_ATOMIC allocations.
Reported-by: Sven Auhagen <Sven.Auhagen@belden.com>
Closes: https://lore.kernel.org/netfilter-devel/BY1PR18MB5874110CAFF1ED098D0BC4E7E07BA@BY1PR18MB5874.namprd18.prod.outlook.com/
Signed-off-by: Florian Westphal <fw@strlen.de>
Provide isolation between netns for ping idents.
Randomize initial ping_port_rover value at netns creation.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/20250829153054.474201-4-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
There is no point in keeping ping_hash().
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Yue Haibing <yuehaibing@huawei.com>
Link: https://patch.msgid.link/20250829153054.474201-3-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
TCP tracks the number of orphaned (SOCK_DEAD but not yet destructed)
sockets in tcp_orphan_count.
In some code that was shared with DCCP, tcp_orphan_count is referenced
via sk->sk_prot->orphan_count.
Let's reference tcp_orphan_count directly.
inet_csk_prepare_for_destroy_sock() is moved to inet_connection_sock.c
due to header dependency.
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250829215641.711664-1-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
With the introduction of clone3 in commit 7f192e3cd3 ("fork: add
clone3") the effective bit width of clone_flags on all architectures was
increased from 32-bit to 64-bit, with a new type of u64 for the flags.
However, for most consumers of clone_flags the interface was not
changed from the previous type of unsigned long.
While this works fine as long as none of the new 64-bit flag bits
(CLONE_CLEAR_SIGHAND and CLONE_INTO_CGROUP) are evaluated, this is still
undesirable in terms of the principle of least surprise.
Thus, this commit fixes all relevant interfaces of callees to
sys_clone3/copy_process (excluding the architecture-specific
copy_thread) to consistently pass clone_flags as u64, so that
no truncation to 32-bit integers occurs on 32-bit architectures.
Signed-off-by: Simon Schuster <schuster.simon@siemens-energy.com>
Link: https://lore.kernel.org/20250901-nios2-implement-clone3-v2-2-53fcf5577d57@siemens-energy.com
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Use RCU to protect accesses to dst->dev from sk_setup_caps()
and sk_dst_gso_max_size().
Also use dst_dev_rcu() in ip6_dst_mtu_maybe_forward(),
and ip_dst_mtu_maybe_forward().
ip4_dst_hoplimit() can use dst_dev_net_rcu().
Fixes: 4a6ce2b6f2 ("net: introduce a new function dst_dev_put()")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/20250828195823.3958522-6-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Followup of commit 88fe14253e ("net: dst: add four helpers
to annotate data-races around dst->dev").
We want to gradually add explicit RCU protection to dst->dev,
including lockdep support.
Add an union to alias dst->dev_rcu and dst->dev.
Add dst_dev_net_rcu() helper.
Fixes: 4a6ce2b6f2 ("net: introduce a new function dst_dev_put()")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/20250828195823.3958522-2-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Cross-merge networking fixes after downstream PR (net-6.17-rc4).
No conflicts.
Adjacent changes:
drivers/net/ethernet/intel/idpf/idpf_txrx.c
02614eee26 ("idpf: do not linearize big TSO packets")
6c4e684802 ("idpf: remove obsolete stashing code")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Also storing tcf_action into struct tcf_skbmod_params
makes sure there is no discrepancy in tcf_skbmod_act().
No longer block BH in tcf_skbmod_init() when acquiring tcf_lock.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250827125349.3505302-5-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Also storing tcf_action into struct tcf_tunnel_key_params
makes sure there is no discrepancy in tunnel_key_act().
No longer block BH in tunnel_key_init() when acquiring tcf_lock.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250827125349.3505302-4-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Also storing tcf_action into struct tcf_vlan_params
makes sure there is no discrepancy in tcf_vlan_act().
No longer block BH in tcf_vlan_init() when acquiring tcf_lock.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250827125349.3505302-3-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
For zerocopy (io_uring, devmem), there is an assumption that the
parent device can do DMA. However that is not always the case:
- Scalable Function netdevs [1] have the DMA device in the grandparent.
- For Multi-PF netdevs [2] queues can be associated to different DMA
devices.
This patch introduces the a queue based interface for allowing drivers
to expose a different DMA device for zerocopy.
[1] Documentation/networking/device_drivers/ethernet/mellanox/mlx5/switchdev.rst
[2] Documentation/networking/multi-pf-netdev.rst
Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Mina Almasry <almasrymina@google.com>
Link: https://patch.msgid.link/20250827144017.1529208-3-dtatulea@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When a packet flood hits one or more RAW sockets, many cpus
have to update sk->sk_drops.
This slows down other cpus, because currently
sk_drops is in sock_write_rx group.
Add a socket_drop_counters structure to raw sockets.
Using dedicated cache lines to hold drop counters
makes sure that consumers no longer suffer from
false sharing if/when producers only change sk->sk_drops.
This adds 128 bytes per RAW socket.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250826125031.1578842-6-edumazet@google.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
When a packet flood hits one or more UDP sockets, many cpus
have to update sk->sk_drops.
This slows down other cpus, because currently
sk_drops is in sock_write_rx group.
Add a socket_drop_counters structure to udp sockets.
Using dedicated cache lines to hold drop counters
makes sure that consumers no longer suffer from
false sharing if/when producers only change sk->sk_drops.
This adds 128 bytes per UDP socket.
Tested with the following stress test, sending about 11 Mpps
to a dual socket AMD EPYC 7B13 64-Core.
super_netperf 20 -t UDP_STREAM -H DUT -l10 -- -n -P,1000 -m 120
Note: due to socket lookup, only one UDP socket is receiving
packets on DUT.
Then measure receiver (DUT) behavior. We can see both
consumer and BH handlers can process more packets per second.
Before:
nstat -n ; sleep 1 ; nstat | grep Udp
Udp6InDatagrams 615091 0.0
Udp6InErrors 3904277 0.0
Udp6RcvbufErrors 3904277 0.0
After:
nstat -n ; sleep 1 ; nstat | grep Udp
Udp6InDatagrams 816281 0.0
Udp6InErrors 7497093 0.0
Udp6RcvbufErrors 7497093 0.0
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250826125031.1578842-5-edumazet@google.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Some sockets suffer from heavy false sharing on sk->sk_drops,
and fields in the same cache line.
Add sk->sk_drop_counters to:
- move the drop counter(s) to dedicated cache lines.
- Add basic NUMA awareness to these drop counter(s).
Following patches will use this infrastructure for UDP and RAW sockets.
sk_clone_lock() is not yet ready, it would need to properly
set newsk->sk_drop_counters if we plan to use this for TCP sockets.
v2: used Paolo suggestion from https://lore.kernel.org/netdev/8f09830a-d83d-43c9-b36b-88ba0a23e9b2@redhat.com/
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250826125031.1578842-4-edumazet@google.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Existing sk_drops_add() helper is renamed to sk_drops_skbadd().
Add sk_drops_add() and convert sk_drops_inc() to use it.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250826125031.1578842-3-edumazet@google.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
We want to split sk->sk_drops in the future to reduce
potential contention on this field.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250826125031.1578842-2-edumazet@google.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
This patch fixes an issue where two different flows on the same RXq
produce the same hash resulting in continuous flow overwrites.
Flow #1: A packet for Flow #1 comes in, kernel calls the steering
function. The driver gives back a filter id. The kernel saves
this filter id in the selected slot. Later, the driver's
service task checks if any filters have expired and then
installs the rule for Flow #1.
Flow #2: A packet for Flow #2 comes in. It goes through the same steps.
But this time, the chosen slot is being used by Flow #1. The
driver gives a new filter id and the kernel saves it in the
same slot. When the driver's service task runs, it runs through
all the flows, checks if Flow #1 should be expired, the kernel
returns True as the slot has a different filter id, and then
the driver installs the rule for Flow #2.
Flow #1: Another packet for Flow #1 comes in. The same thing repeats.
The slot is overwritten with a new filter id for Flow #1.
This causes a repeated cycle of flow programming for missed packets,
wasting CPU cycles while not improving performance. This problem happens
at higher rates when the RPS table is small, but tests show it still
happens even with 12,000 connections and an RPS size of 16K per queue
(global table size = 144x16K = 64K).
This patch prevents overwriting an rps_dev_flow entry if it is active.
The intention is that it is better to do aRFS for the first flow instead
of hurting all flows on the same hash. Without this, two (or more) flows
on one RX queue with the same hash can keep overwriting each other. This
causes the driver to reprogram the flow repeatedly.
Changes:
1. Add a new 'hash' field to struct rps_dev_flow.
2. Add rps_flow_is_active(): a helper function to check if a flow is
active or not, extracted from rps_may_expire_flow(). It is further
simplified as per reviewer feedback.
3. In set_rps_cpu():
- Avoid overwriting by programming a new filter if:
- The slot is not in use, or
- The slot is in use but the flow is not active, or
- The slot has an active flow with the same hash, but target CPU
differs.
- Save the hash in the rps_dev_flow entry.
4. rps_may_expire_flow(): Use earlier extracted rps_flow_is_active().
Testing & results:
- Driver: ice (E810 NIC), Kernel: net-next
- #CPUs = #RXq = 144 (1:1)
- Number of flows: 12K
- Eight RPS settings from 256 to 32768. Though RPS=256 is not ideal,
it is still sufficient to cover 12K flows (256*144 rx-queues = 64K
global table slots)
- Global Table Size = 144 * RPS (effectively equal to 256 * RPS)
- Each RPS test duration = 8 mins (org code) + 8 mins (new code).
- Metrics captured on client
Legend for following tables:
Steer-C: #times ndo_rx_flow_steer() was Called by set_rps_cpu()
Steer-L: #times ice_arfs_flow_steer() Looped over aRFS entries
Add: #times driver actually programmed aRFS (ice_arfs_build_entry())
Del: #times driver deleted the flow (ice_arfs_del_flow_rules())
Units: K = 1,000 times, M = 1 million times
|-------|---------|------| Org Code |---------|---------|
| RPS | Latency | CPU | Add | Del | Steer-C | Steer-L |
|-------|---------|------|--------|--------|---------|---------|
| 256 | 227.0 | 93.2 | 1.6M | 1.6M | 121.7M | 267.6M |
| 512 | 225.9 | 94.1 | 11.5M | 11.2M | 65.7M | 199.6M |
| 1024 | 223.5 | 95.6 | 16.5M | 16.5M | 27.1M | 187.3M |
| 2048 | 222.2 | 96.3 | 10.5M | 10.5M | 12.5M | 115.2M |
| 4096 | 223.9 | 94.1 | 5.5M | 5.5M | 7.2M | 65.9M |
| 8192 | 224.7 | 92.5 | 2.7M | 2.7M | 3.0M | 29.9M |
| 16384 | 223.5 | 92.5 | 1.3M | 1.3M | 1.4M | 13.9M |
| 32768 | 219.6 | 93.2 | 838.1K | 838.1K | 965.1K | 8.9M |
|-------|---------|------| New Code |---------|---------|
| 256 | 201.5 | 99.1 | 13.4K | 5.0K | 13.7K | 75.2K |
| 512 | 202.5 | 98.2 | 11.2K | 5.9K | 11.2K | 55.5K |
| 1024 | 207.3 | 93.9 | 11.5K | 9.7K | 11.5K | 59.6K |
| 2048 | 207.5 | 96.7 | 11.8K | 11.1K | 15.5K | 79.3K |
| 4096 | 206.9 | 96.6 | 11.8K | 11.7K | 11.8K | 63.2K |
| 8192 | 205.8 | 96.7 | 11.9K | 11.8K | 11.9K | 63.9K |
| 16384 | 200.9 | 98.2 | 11.9K | 11.9K | 11.9K | 64.2K |
| 32768 | 202.5 | 98.0 | 11.9K | 11.9K | 11.9K | 64.2K |
|-------|---------|------|--------|--------|---------|---------|
Some observations:
1. Overall Latency improved: (1790.19-1634.94)/1790.19*100 = 8.67%
2. Overall CPU increased: (777.32-751.49)/751.45*100 = 3.44%
3. Flow Management (add/delete) remained almost constant at ~11K
compared to values in millions.
Signed-off-by: Krishna Kumar <krikku@gmail.com>
Link: https://patch.msgid.link/20250825031005.3674864-2-krikku@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The 'use' field in struct rose_neigh is used as a reference counter but
lacks atomicity. This can lead to race conditions where a rose_neigh
structure is freed while still being referenced by other code paths.
For example, when rose_neigh->use becomes zero during an ioctl operation
via rose_rt_ioctl(), the structure may be removed while its timer is
still active, potentially causing use-after-free issues.
This patch changes the type of 'use' from unsigned short to refcount_t and
updates all code paths to use rose_neigh_hold() and rose_neigh_put() which
operate reference counts atomically.
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Takamitsu Iwai <takamitz@amazon.co.jp>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250823085857.47674-3-takamitz@amazon.co.jp
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The current rose_remove_neigh() performs two distinct operations:
1. Removes rose_neigh from rose_neigh_list
2. Frees the rose_neigh structure
Split these operations into separate functions to improve maintainability
and prepare for upcoming refcount_t conversion. The timer cleanup remains
in rose_remove_neigh() because free operations can be called from timer
itself.
This patch introduce rose_neigh_put() to handle the freeing of rose_neigh
structures and modify rose_remove_neigh() to handle removal only.
Signed-off-by: Takamitsu Iwai <takamitz@amazon.co.jp>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250823085857.47674-2-takamitz@amazon.co.jp
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Prepare the HMAC key when it is added to the kernel, instead of
preparing it implicitly for every packet. This significantly improves
the performance of seg6_hmac_compute(). A microbenchmark on x86_64
shows seg6_hmac_compute() (with HMAC-SHA256) dropping from ~1978 cycles
to ~1419 cycles, a 28% improvement.
The size of 'struct seg6_hmac_info' increases by 128 bytes, but that
should be fine, since there should not be a massive number of keys.
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
Link: https://patch.msgid.link/20250824013644.71928-3-ebiggers@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Use the HMAC-SHA1 and HMAC-SHA256 library functions instead of
crypto_shash. This is simpler and faster. Pre-allocating per-CPU hash
transformation objects and descriptors is no longer needed, and a
microbenchmark on x86_64 shows seg6_hmac_compute() (with HMAC-SHA256)
dropping from ~2494 cycles to ~1978 cycles, a 20% improvement.
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
Link: https://patch.msgid.link/20250824013644.71928-2-ebiggers@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Convert the ->flowic_tos field of struct flowi_common from __u8 to
dscp_t, rename it ->flowic_dscp and propagate these changes to struct
flowi and struct flowi4.
We've had several bugs in the past where ECN bits could interfere with
IPv4 routing, because these bits were not properly cleared when setting
->flowi4_tos. These bugs should be fixed now and the dscp_t type has
been introduced to ensure that variables carrying DSCP values don't
accidentally have any ECN bits set. Several variables and structure
fields have been converted to dscp_t already, but the main IPv4 routing
structure, struct flowi4, is still using a __u8. To avoid any future
regression, this patch converts it to dscp_t.
There are many users to convert at once. Fortunately, around half of
->flowi4_tos users already have a dscp_t value at hand, which they
currently convert to __u8 using inet_dscp_to_dsfield(). For all of
these users, we just need to drop that conversion.
But, although we try to do the __u8 <-> dscp_t conversions at the
boundaries of the network or of user space, some places still store
TOS/DSCP variables as __u8 in core networking code. Those can hardly be
converted either because the data structure is part of UAPI or because
the same variable or field is also used for handling ECN in other parts
of the code. In all of these cases where we don't have a dscp_t
variable at hand, we need to use inet_dsfield_to_dscp() when
interacting with ->flowi4_dscp.
Changes since v1:
* Fix space alignment in __bpf_redirect_neigh_v4() (Ido).
Signed-off-by: Guillaume Nault <gnault@redhat.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Link: https://patch.msgid.link/29acecb45e911d17446b9a3dbdb1ab7b821ea371.1756128932.git.gnault@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Currently, the devlink health reporter starts the grace period
immediately after handling an error, blocking any further recoveries
until it finished.
However, when a single root cause triggers multiple errors in a short
time frame, it is desirable to treat them as a bulk of errors and to
allow their recoveries, avoiding premature blocking of subsequent
related errors, and reducing the risk of inconsistent or incomplete
error handling.
To address this, introduce a configurable burst period for devlink
health reporter. Start this period when the first error is handled,
and allow recovery attempts for reported errors during this window.
Once burst period expires, begin the grace period to block further
recoveries until it concludes.
Timeline summary:
----|--------|------------------------------/----------------------/--
error is error is burst period grace period
reported recovered (recoveries allowed) (recoveries blocked)
For calculating the burst period duration, use the same
last_recovery_ts as the grace period. Update it on recovery only
when the burst period is inactive (either disabled or at the
first error).
This patch implements the framework for the burst period and
effectively sets its value to 0 at reporter creation, so the current
behavior remains unchanged, which ensures backward compatibility.
A downstream patch will make the burst period configurable.
Signed-off-by: Shahar Shitrit <shshitrit@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Mark Bloch <mbloch@nvidia.com>
Link: https://patch.msgid.link/20250824084354.533182-4-mbloch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Move the default graceful period from a parameter to
devlink_health_reporter_create() to a field in the
devlink_health_reporter_ops structure.
This change improves consistency, as the graceful period is inherently
tied to the reporter's behavior and recovery policy. It simplifies the
signature of devlink_health_reporter_create() and its internal helper
functions. It also centralizes the reporter configuration at the ops
structure, preparing the groundwork for a downstream patch that will
introduce a devlink health reporter burst period attribute whose
default value will similarly be provided by the driver via the ops
structure.
Signed-off-by: Shahar Shitrit <shshitrit@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Mark Bloch <mbloch@nvidia.com>
Link: https://patch.msgid.link/20250824084354.533182-2-mbloch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
These socket lookup functions required struct inet_hashinfo because
they are shared by TCP and DCCP.
* __inet_lookup_established()
* __inet_lookup_listener()
* __inet6_lookup_established()
* inet6_lookup_listener()
DCCP has gone, and we don't need to pass hashinfo down to them.
Let's fetch net->ipv4.tcp_death_row.hashinfo directly in the above
4 functions.
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250822190803.540788-5-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Since DCCP has been removed, sk->sk_prot->twsk_prot->twsk_destructor
is always tcp_twsk_destructor().
Let's call tcp_twsk_destructor() directly in inet_twsk_free() and
remove ->twsk_destructor().
While at it, tcp_twsk_destructor() is un-exported.
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250822190803.540788-3-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Function set_name_sync() uses hdev->dev_name field to send
HCI_OP_WRITE_LOCAL_NAME command, but copying from data to hdev->dev_name
is called after mgmt cmd was queued, so it is possible that function
set_name_sync() will read old name value.
This change adds name as a parameter for function hci_update_name_sync()
to avoid race condition.
Fixes: 6f6ff38a1e ("Bluetooth: hci_sync: Convert MGMT_OP_SET_LOCAL_NAME")
Signed-off-by: Pavel Shpakovskiy <pashpakovskii@salutedevices.com>
Reviewed-by: Paul Menzel <pmenzel@molgen.mpg.de>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
There is a page_pool_put() function but no get equivalent.
Having multiple references to a page pool is quite useful.
It avoids branching in create / destroy paths in drivers
which support memory providers.
Use the new helper in bnxt.
Acked-by: Jesper Dangaard Brouer <hawk@kernel.org>
Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com>
Reviewed-by: Mina Almasry <almasrymina@google.com>
Link: https://patch.msgid.link/20250820025704.166248-2-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The port's actor_oper_port_state activity flag should be updated immediately
after changing the lacp_active option to reflect the current mode correctly.
Fixes: 3a755cd8b7 ("bonding: add new option lacp_active")
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Link: https://patch.msgid.link/20250815062000.22220-2-liuhangbin@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Convert SCTP cookies to use HMAC-SHA256, instead of the previous choice
of the legacy algorithms HMAC-MD5 and HMAC-SHA1. Simplify and optimize
the code by using the HMAC-SHA256 library instead of crypto_shash, and
by preparing the HMAC key when it is generated instead of per-operation.
This doesn't break compatibility, since the cookie format is an
implementation detail, not part of the SCTP protocol itself.
Note that the cookie size doesn't change either. The HMAC field was
already 32 bytes, even though previously at most 20 bytes were actually
compared. 32 bytes exactly fits an untruncated HMAC-SHA256 value. So,
although we could safely truncate the MAC to something slightly shorter,
for now just keep the cookie size the same.
I also considered SipHash, but that would generate only 8-byte MACs. An
8-byte MAC *might* suffice here. However, there's quite a lot of
information in the SCTP cookies: more than in TCP SYN cookies. So
absent an analysis that occasional forgeries of all that information is
okay in SCTP, I errored on the side of caution.
Remove HMAC-MD5 and HMAC-SHA1 as options, since the new HMAC-SHA256
option is just better. It's faster as well as more secure. For
example, benchmarking on x86_64, cookie authentication is now nearly 3x
as fast as the previous default choice and implementation of HMAC-MD5.
Also just make the kernel always support cookie authentication if SCTP
is supported at all, rather than making it optional in the build. (It
was sort of optional before, but it didn't really work properly. E.g.,
a kernel with CONFIG_SCTP_COOKIE_HMAC_MD5=n still supported HMAC-MD5
cookie authentication if CONFIG_CRYPTO_HMAC and CONFIG_CRYPTO_MD5
happened to be enabled in the kconfig for other reasons.)
Acked-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
Link: https://patch.msgid.link/20250818205426.30222-5-ebiggers@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
For SCTP chunk authentication, use the HMAC-SHA1 and HMAC-SHA256 library
functions instead of crypto_shash. This is simpler and faster. There's
no longer any need to pre-allocate 'crypto_shash' objects; the SCTP code
now simply calls into the HMAC code directly.
As part of this, make SCTP always support both HMAC-SHA1 and
HMAC-SHA256. Previously, it only guaranteed support for HMAC-SHA1.
However, HMAC-SHA256 tended to be supported too anyway, as it was
supported if CONFIG_CRYPTO_SHA256 was enabled elsewhere in the kconfig.
Acked-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
Link: https://patch.msgid.link/20250818205426.30222-4-ebiggers@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Except for sk_clone_lock(), all accesses to sk->sk_memcg
is done under CONFIG_MEMCG.
As a bonus, let's define sk->sk_memcg under CONFIG_MEMCG.
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
Link: https://patch.msgid.link/20250815201712.1745332-11-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
We will store a flag in the lowest bit of sk->sk_memcg.
Then, we cannot pass the raw pointer to mem_cgroup_under_socket_pressure().
Let's pass struct sock to it and rename the function to match other
functions starting with mem_cgroup_sk_.
Note that the helper is moved to sock.h to use mem_cgroup_from_sk().
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
Link: https://patch.msgid.link/20250815201712.1745332-10-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The socket memcg feature is enabled by a static key and
only works for non-root cgroup.
We check both conditions in many places.
Let's factorise it as a helper function.
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
Link: https://patch.msgid.link/20250815201712.1745332-8-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
We will store a flag in the lowest bit of sk->sk_memcg.
Then, directly dereferencing sk->sk_memcg will be illegal, and we
do not want to allow touching the raw sk->sk_memcg in many places.
Let's introduce mem_cgroup_from_sk().
Other places accessing the raw sk->sk_memcg will be converted later.
Note that we cannot define the helper as an inline function in
memcontrol.h as we cannot access any fields of struct sock there
due to circular dependency, so it is placed in sock.h.
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
Link: https://patch.msgid.link/20250815201712.1745332-7-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This patch enhances RX buffer handling in the mana driver by allocating
pages from a page pool and slicing them into MTU-sized fragments, rather
than dedicating a full page per packet. This approach is especially
beneficial on systems with large base page sizes like 64KB.
Key improvements:
- Proper integration of page pool for RX buffer allocations.
- MTU-sized buffer slicing to improve memory utilization.
- Reduce overall per Rx queue memory footprint.
- Automatic fallback to full-page buffers when:
* Jumbo frames are enabled (MTU > PAGE_SIZE / 2).
* The XDP path is active, to avoid complexities with fragment reuse.
Testing on VMs with 64KB pages shows around 200% throughput improvement.
Memory efficiency is significantly improved due to reduced wastage in page
allocations. Example: We are now able to fit 35 rx buffers in a single 64kb
page for MTU size of 1500, instead of 1 rx buffer per page previously.
Tested:
- iperf3, iperf2, and nttcp benchmarks.
- Jumbo frames with MTU 9000.
- Native XDP programs (XDP_PASS, XDP_DROP, XDP_TX, XDP_REDIRECT) for
testing the XDP path in driver.
- Memory leak detection (kmemleak).
- Driver load/unload, reboot, and stress scenarios.
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Saurabh Sengar <ssengar@linux.microsoft.com>
Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
Signed-off-by: Dipayaan Roy <dipayanroy@linux.microsoft.com>
Link: https://patch.msgid.link/20250814140410.GA22089@linuxonhyperv3.guj3yctzbm1etfxqx2vob5hsef.xx.internal.cloudapp.net
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
This fixes the likes of hci_conn_num(CIS_LINK) returning the total of
ISO connection which includes BIS_LINK as well, so this splits the
iso_num into each link type and introduces hci_iso_num that can be used
in places where the total number of ISO connection still needs to be
used.
Fixes: 23205562ff ("Bluetooth: separate CIS_LINK and BIS_LINK link types")
Fixes: a7bcffc673 ("Bluetooth: Add PA_LINK to distinguish BIG sync and PA sync connections")
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
ll_privacy_capable only indicates that the controller supports the
feature but it doesnt' check that LE is enabled so it end up being
marked as active in the current settings when it shouldn't.
Fixes: ad383c2c65 ("Bluetooth: hci_sync: Enable advertising when LL privacy is enabled")
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
{cis,bis}_capable only indicates the controller supports the feature
since it doesn't check that LE is enabled so it shall not be used for
current setting, instead this introduces {cis,bis}_enabled macros that
can be used to indicate that these features are currently enabled.
Fixes: 26afbd826e ("Bluetooth: Add initial implementation of CIS connections")
Fixes: eca0ae4aea ("Bluetooth: Add initial implementation of BIS connections")
Fixes: ae75336131 ("Bluetooth: Check for ISO support in controller")
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
This issue applies for the following qdiscs: hhf, fq, fq_codel, and
fq_pie, and occurs in their change handlers when adjusting to the new
limit. The problem is the following in the values passed to the
subsequent qdisc_tree_reduce_backlog call given a tbf parent:
When the tbf parent runs out of tokens, skbs of these qdiscs will
be placed in gso_skb. Their peek handlers are qdisc_peek_dequeued,
which accounts for both qlen and backlog. However, in the case of
qdisc_dequeue_internal, ONLY qlen is accounted for when pulling
from gso_skb. This means that these qdiscs are missing a
qdisc_qstats_backlog_dec when dropping packets to satisfy the
new limit in their change handlers.
One can observe this issue with the following (with tc patched to
support a limit of 0):
export TARGET=fq
tc qdisc del dev lo root
tc qdisc add dev lo root handle 1: tbf rate 8bit burst 100b latency 1ms
tc qdisc replace dev lo handle 3: parent 1:1 $TARGET limit 1000
echo ''; echo 'add child'; tc -s -d qdisc show dev lo
ping -I lo -f -c2 -s32 -W0.001 127.0.0.1 2>&1 >/dev/null
echo ''; echo 'after ping'; tc -s -d qdisc show dev lo
tc qdisc change dev lo handle 3: parent 1:1 $TARGET limit 0
echo ''; echo 'after limit drop'; tc -s -d qdisc show dev lo
tc qdisc replace dev lo handle 2: parent 1:1 sfq
echo ''; echo 'post graft'; tc -s -d qdisc show dev lo
The second to last show command shows 0 packets but a positive
number (74) of backlog bytes. The problem becomes clearer in the
last show command, where qdisc_purge_queue triggers
qdisc_tree_reduce_backlog with the positive backlog and causes an
underflow in the tbf parent's backlog (4096 Mb instead of 0).
To fix this issue, the codepath for all clients of qdisc_dequeue_internal
has been simplified: codel, pie, hhf, fq, fq_pie, and fq_codel.
qdisc_dequeue_internal handles the backlog adjustments for all cases that
do not directly use the dequeue handler.
The old fq_codel_change limit adjustment loop accumulated the arguments to
the subsequent qdisc_tree_reduce_backlog call through the cstats field.
However, this is confusing and error prone as fq_codel_dequeue could also
potentially mutate this field (which qdisc_dequeue_internal calls in the
non gso_skb case), so we have unified the code here with other qdiscs.
Fixes: 2d3cbfd6d5 ("net_sched: Flush gso_skb list too during ->change()")
Fixes: 4b549a2ef4 ("fq_codel: Fair Queue Codel AQM")
Fixes: 10239edf86 ("net-qdisc-hhf: Heavy-Hitter Filter (HHF) qdisc")
Signed-off-by: William Liu <will@willsroot.io>
Reviewed-by: Savino Dicanosa <savy@syst3mfailure.io>
Link: https://patch.msgid.link/20250812235725.45243-1-will@willsroot.io
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Constify the devlink port attributes to indicate they are read only
and does not depend on anything else. Therefore, validate it early
before setting in the devlink port.
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Parav Pandit <parav@nvidia.com>
Link: https://patch.msgid.link/20250813094417.7269-3-parav@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Tony Nguyen says:
====================
ixgbe: bypass devlink phys_port_name generation
Jedrzej adds option to skip phys_port_name generation and opts
ixgbe into it as some configurations rely on pre-devlink naming
which could end up broken as a result.
* '10GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue:
ixgbe: prevent from unwanted interface name changes
devlink: let driver opt out of automatic phys_port_name generation
====================
Link: https://patch.msgid.link/20250812205226.1984369-1-anthony.l.nguyen@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The estimator kthreads' affinity are defined by sysctl overwritten
preferences and applied through a plain call to the scheduler's affinity
API.
However since the introduction of managed kthreads preferred affinity,
such a practice shortcuts the kthreads core code which eventually
overwrites the target to the default unbound affinity.
Fix this with using the appropriate kthread's API.
Fixes: d1a8919758 ("kthread: Default affine kthread to its preferred NUMA node")
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Florian Westphal <fw@strlen.de>
Currently when adding devlink port, phys_port_name is automatically
generated within devlink port initialization flow. As a result adding
devlink port support to driver may result in forced changes of interface
names, which breaks already existing network configs.
This is an expected behavior but in some scenarios it would not be
preferable to provide such limitation for legacy driver not being able to
keep 'pre-devlink' interface name.
Add flag no_phys_port_name to devlink_port_attrs struct which indicates
if devlink should not alter name of interface.
Suggested-by: Jiri Pirko <jiri@resnulli.us>
Link: https://lore.kernel.org/all/nbwrfnjhvrcduqzjl4a2jafnvvud6qsbxlvxaxilnryglf4j7r@btuqrimnfuly/
Signed-off-by: Jedrzej Jagielski <jedrzej.jagielski@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Page pool can have pages "directly" (locklessly) recycled to it,
if the NAPI that owns the page pool is scheduled to run on the same CPU.
To make this safe we check that the NAPI is disabled while we destroy
the page pool. In most cases NAPI and page pool lifetimes are tied
together so this happens naturally.
The queue API expects the following order of calls:
-> mem_alloc
alloc new pp
-> stop
napi_disable
-> start
napi_enable
-> mem_free
free old pp
Here we allocate the page pool in ->mem_alloc and free in ->mem_free.
But the NAPIs are only stopped between ->stop and ->start. We created
page_pool_disable_direct_recycling() to safely shut down the recycling
in ->stop. This way the page_pool_destroy() call in ->mem_free doesn't
have to worry about recycling any more.
Unfortunately, the page_pool_disable_direct_recycling() is not enough
to deal with failures which necessitate freeing the _new_ page pool.
If we hit a failure in ->mem_alloc or ->stop the new page pool has
to be freed while the NAPI is active (assuming driver attaches the
page pool to an existing NAPI instance and doesn't reallocate NAPIs).
Freeing the new page pool is technically safe because it hasn't been
used for any packets, yet, so there can be no recycling. But the check
in napi_assert_will_not_race() has no way of knowing that. We could
check if page pool is empty but that'd make the check much less likely
to trigger during development.
Add page_pool_enable_direct_recycling(), pairing with
page_pool_disable_direct_recycling(). It will allow us to create the new
page pools in "disabled" state and only enable recycling when we know
the reconfig operation will not fail.
Coincidentally it will also let us re-enable the recycling for the old
pool, if the reconfig failed:
-> mem_alloc (new)
-> stop (old)
# disables direct recycling for old
-> start (new)
# fail!!
-> start (old)
# go back to old pp but direct recycling is lost :(
-> mem_free (new)
The new helper is idempotent to make the life easier for drivers,
which can operate in HDS mode and support zero-copy Rx.
The driver can call the helper twice whether there are two pools
or it has multiple references to a single pool.
Fixes: 40eca00ae6 ("bnxt_en: unlink page pool when stopping Rx queue")
Tested-by: David Wei <dw@davidwei.uk>
Link: https://patch.msgid.link/20250805003654.2944974-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
- netlink: avoid infinite retry looping in netlink_unicast()
Previous releases - always broken:
- packet: fix a race in packet_set_ring() and packet_notifier()
- ipv6: reject malicious packets in ipv6_gso_segment()
- sched: mqprio: fix stack out-of-bounds write in tc entry parsing
- net: drop UFO packets (injected via virtio) in udp_rcv_segment()
- eth: mlx5: correctly set gso_segs when LRO is used, avoid false
positive checksum validation errors
- netpoll: prevent hanging NAPI when netcons gets enabled
- phy: mscc: fix parsing of unicast frames for PTP timestamping
- number of device tree / OF reference leak fixes
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmiUvMcACgkQMUZtbf5S
IrucuA//bQGZdQkpRo/2zWDFS4wN7hV8PkR+kI0F4rUyNQjFDyUOrccYnhHl8VRn
DfnmVC9oiWxwuW2QgrgH1KKSw9dxhDPVmhLezAHaZv9XAPPgPO/Yb2Dr3oKHF+yK
+/QW57FN8rNg8mHUzMS/26Y+rH6OubTaf3MKcsT2uZhuIXKHScOUedmr6EeEkp/L
c32MpWfcVC7M2jjvCH+HO4k4RawWaB8W93iMUrLMSVI1oE3Tsjyhl5Cv9TBixh7g
3KZW98qVqMXKBQr0QTF4LBR0IWgKOS4KMVJqrgQ3CZszbDWbmBsFfi8olr/AS0Nt
vgk6hTd8vVHXa+sOvdFpDbocdjXBq6vf5bDd9p57bt3JyxFMfFUWbG1eDj8bUpMI
YuKAhQg9mTr6R8DaLwqANY8zrSET2FiYbNkUsP9TD8++q2j34U5/a468cxymfjJ/
90vSrBGUpwxYPhx2B+Slu1SZJ1g0NbUL34jrKBuSNloVoZDRuzq2zd310BigG/35
T5CTr5OH+USlFjL6KpgGHnygiMQB/h2WgEjWbFoZHnIb+exVKCgS6IiNBVV3b2OK
Nv6iVyFKPdKJ8qxeaiYMWA16wG8buOytlBn/1hYKLBvyVKAAx9Ib4ao2TK9w/1Vs
50sX21h+Awj2a/17sXlSXA+RVdeBcAzHdsGWHr1ql31htn2Xl10=
=nwy7
-----END PGP SIGNATURE-----
Merge tag 'net-6.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Jakub Kicinski:
Previous releases - regressions:
- netlink: avoid infinite retry looping in netlink_unicast()
Previous releases - always broken:
- packet: fix a race in packet_set_ring() and packet_notifier()
- ipv6: reject malicious packets in ipv6_gso_segment()
- sched: mqprio: fix stack out-of-bounds write in tc entry parsing
- net: drop UFO packets (injected via virtio) in udp_rcv_segment()
- eth: mlx5: correctly set gso_segs when LRO is used, avoid false
positive checksum validation errors
- netpoll: prevent hanging NAPI when netcons gets enabled
- phy: mscc: fix parsing of unicast frames for PTP timestamping
- a number of device tree / OF reference leak fixes"
* tag 'net-6.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (44 commits)
pptp: fix pptp_xmit() error path
net: ti: icssg-prueth: Fix skb handling for XDP_PASS
net: Update threaded state in napi config in netif_set_threaded
selftests: netdevsim: Xfail nexthop test on slow machines
eth: fbnic: Lock the tx_dropped update
eth: fbnic: Fix tx_dropped reporting
eth: fbnic: remove the debugging trick of super high page bias
net: ftgmac100: fix potential NULL pointer access in ftgmac100_phy_disconnect
dt-bindings: net: Replace bouncing Alexandru Tachici emails
dpll: zl3073x: ZL3073X_I2C and ZL3073X_SPI should depend on NET
net/sched: mqprio: fix stack out-of-bounds write in tc entry parsing
Revert "net: mdio_bus: Use devm for getting reset GPIO"
selftests: net: packetdrill: xfail all problems on slow machines
net/packet: fix a race in packet_set_ring() and packet_notifier()
benet: fix BUG when creating VFs
net: airoha: npu: Add missing MODULE_FIRMWARE macros
net: devmem: fix DMA direction on unmapping
ipa: fix compile-testing with qcom-mdt=m
eth: fbnic: unlink NAPIs from queues on error to open
net: Add locking to protect skb->dev access in ip_output
...
- Fix a shortcut key issue in menuconfig
- Fix missing rebuild of kheaders
- Sort the symbol dump generated by gendwarfsyms
- Support zboot extraction in scripts/extract-vmlinux
- Migrate gconfig to GTK 3
- Add TAR variable to allow overriding the default tar command
- Hand over Kbuild maintainership
-----BEGIN PGP SIGNATURE-----
iQJJBAABCgAzFiEEbmPs18K1szRHjPqEPYsBB53g2wYFAmiSr38VHG1hc2FoaXJv
eUBrZXJuZWwub3JnAAoJED2LAQed4NsGcTIP/RCVr/OEJgVXg8dOBNhNuhGoidnM
2uqRcaza68tOSegpFGcfd9bhO1TCR/O5SL117TS8UEx4f9ge7gk/+XVZC8i1268m
9+V6eJd8QI34nB/EezMnrhvFmn2kC0UMuSldZYQJ2cReLjtapBP2xBtWnxi+Zyyw
+FjdHwQln7E8UaB/gMqh9KVVOytX4NIUUZEA/78nd4eJaJbLxJ/5ztAxGLB//bXI
Rr6bjAeOmIfRWS9QWnGzNzHmzp4SSmU+/gdLXyaWlmoVjeut8O+BJXvQRNfswk8K
JXmk6uZUx6CNheCca2RaM0i6XAArkpOQc/7v7Ul/rSriTxdxAVUghjk0fNrXJGvQ
kBjewOTUXg8f4xhuPAL3nkWmCh0s0tV3Q0EneAaJuUck5yRkW0bxxKa74h6ji2q8
8RsNS5Mq0yYiR1gmCqhEmTN6/wDamSpGkHT0k6ukipixjCr5DP2QFP/xT3d7qSwc
W6msliXgBmY/ZrGoJXy4zjmu5vxKfAes+7JLBDxUKjdItC818qwSPf+nbvVIdJqb
K4/hcNDuYPKd/8N8YapPjbGfTBk9Xqp74ez4xg5XNIBPS+cE5k/mePletbCzkzC5
vzjoNgVUmpPGPdDaBk+S4jSEzWUi575aQx590OfdoXsBt4CQVHHk4PEM1Qh5cWnZ
m6tx1oqfruovU6Gx
=MtyJ
-----END PGP SIGNATURE-----
Merge tag 'kbuild-v6.17-2' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild
Pull Kbuild updates from Masahiro Yamada:
"This is the last pull request from me.
I'm grateful to have been able to continue as a maintainer for eight
years. From the next cycle, Nathan and Nicolas will maintain Kbuild.
- Fix a shortcut key issue in menuconfig
- Fix missing rebuild of kheaders
- Sort the symbol dump generated by gendwarfsyms
- Support zboot extraction in scripts/extract-vmlinux
- Migrate gconfig to GTK 3
- Add TAR variable to allow overriding the default tar command
- Hand over Kbuild maintainership"
* tag 'kbuild-v6.17-2' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild: (92 commits)
MAINTAINERS: hand over Kbuild maintenance
kheaders: make it possible to override TAR
kbuild: userprogs: use correct linker when mixing clang and GNU ld
kconfig: lxdialog: replace strcpy() with strncpy() in inputbox.c
kconfig: lxdialog: replace strcpy with snprintf in print_autowrap
kconfig: gconf: refactor text_insert_help()
kconfig: gconf: remove unneeded variable in text_insert_msg
kconfig: gconf: use hyphens in signals
kconfig: gconf: replace GtkImageMenuItem with GtkMenuItem
kconfig: gconf: Fix Back button behavior
kconfig: gconf: fix single view to display dependent symbols correctly
scripts: add zboot support to extract-vmlinux
gendwarfksyms: order -T symtypes output by name
gendwarfksyms: use preferred form of sizeof for allocation
kconfig: qconf: confine {begin,end}Group to constructor and destructor
kconfig: qconf: fix ConfigList::updateListAllforAll()
kconfig: add a function to dump all menu entries in a tree-like format
kconfig: gconf: show GTK version in About dialog
kconfig: gconf: replace GtkHPaned and GtkVPaned with GtkPaned
kconfig: gconf: replace GdkColor with GdkRGBA
...
In ip_output() skb->dev is updated from the skb_dst(skb)->dev
this can become invalid when the interface is unregistered and freed,
Introduced new skb_dst_dev_rcu() function to be used instead of
skb_dst_dev() within rcu_locks in ip_output.This will ensure that
all the skb's associated with the dev being deregistered will
be transnmitted out first, before freeing the dev.
Given that ip_output() is called within an rcu_read_lock()
critical section or from a bottom-half context, it is safe to introduce
an RCU read-side critical section within it.
Multiple panic call stacks were observed when UL traffic was run
in concurrency with device deregistration from different functions,
pasting one sample for reference.
[496733.627565][T13385] Call trace:
[496733.627570][T13385] bpf_prog_ce7c9180c3b128ea_cgroupskb_egres+0x24c/0x7f0
[496733.627581][T13385] __cgroup_bpf_run_filter_skb+0x128/0x498
[496733.627595][T13385] ip_finish_output+0xa4/0xf4
[496733.627605][T13385] ip_output+0x100/0x1a0
[496733.627613][T13385] ip_send_skb+0x68/0x100
[496733.627618][T13385] udp_send_skb+0x1c4/0x384
[496733.627625][T13385] udp_sendmsg+0x7b0/0x898
[496733.627631][T13385] inet_sendmsg+0x5c/0x7c
[496733.627639][T13385] __sys_sendto+0x174/0x1e4
[496733.627647][T13385] __arm64_sys_sendto+0x28/0x3c
[496733.627653][T13385] invoke_syscall+0x58/0x11c
[496733.627662][T13385] el0_svc_common+0x88/0xf4
[496733.627669][T13385] do_el0_svc+0x2c/0xb0
[496733.627676][T13385] el0_svc+0x2c/0xa4
[496733.627683][T13385] el0t_64_sync_handler+0x68/0xb4
[496733.627689][T13385] el0t_64_sync+0x1a4/0x1a8
Changes in v3:
- Replaced WARN_ON() with WARN_ON_ONCE(), as suggested by Willem de Bruijn.
- Dropped legacy lines mistakenly pulled in from an outdated branch.
Changes in v2:
- Addressed review comments from Eric Dumazet
- Used READ_ONCE() to prevent potential load/store tearing
- Added skb_dst_dev_rcu() and used along with rcu_read_lock() in ip_output
Signed-off-by: Sharath Chandra Vurukala <quic_sharathv@quicinc.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250730105118.GA26100@hu-sharathv-hyd.qualcomm.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When sending a packet with virtio_net_hdr to tun device, if the gso_type
in virtio_net_hdr is SKB_GSO_UDP and the gso_size is less than udphdr
size, below crash may happen.
------------[ cut here ]------------
kernel BUG at net/core/skbuff.c:4572!
Oops: invalid opcode: 0000 [#1] SMP NOPTI
CPU: 0 UID: 0 PID: 62 Comm: mytest Not tainted 6.16.0-rc7 #203 PREEMPT(voluntary)
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
RIP: 0010:skb_pull_rcsum+0x8e/0xa0
Code: 00 00 5b c3 cc cc cc cc 8b 93 88 00 00 00 f7 da e8 37 44 38 00 f7 d8 89 83 88 00 00 00 48 8b 83 c8 00 00 00 5b c3 cc cc cc cc <0f> 0b 0f 0b 66 66 2e 0f 1f 84 00 000
RSP: 0018:ffffc900001fba38 EFLAGS: 00000297
RAX: 0000000000000004 RBX: ffff8880040c1000 RCX: ffffc900001fb948
RDX: ffff888003e6d700 RSI: 0000000000000008 RDI: ffff88800411a062
RBP: ffff8880040c1000 R08: 0000000000000000 R09: 0000000000000001
R10: ffff888003606c00 R11: 0000000000000001 R12: 0000000000000000
R13: ffff888004060900 R14: ffff888004050000 R15: ffff888004060900
FS: 000000002406d3c0(0000) GS:ffff888084a19000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000020000040 CR3: 0000000004007000 CR4: 00000000000006f0
Call Trace:
<TASK>
udp_queue_rcv_one_skb+0x176/0x4b0 net/ipv4/udp.c:2445
udp_queue_rcv_skb+0x155/0x1f0 net/ipv4/udp.c:2475
udp_unicast_rcv_skb+0x71/0x90 net/ipv4/udp.c:2626
__udp4_lib_rcv+0x433/0xb00 net/ipv4/udp.c:2690
ip_protocol_deliver_rcu+0xa6/0x160 net/ipv4/ip_input.c:205
ip_local_deliver_finish+0x72/0x90 net/ipv4/ip_input.c:233
ip_sublist_rcv_finish+0x5f/0x70 net/ipv4/ip_input.c:579
ip_sublist_rcv+0x122/0x1b0 net/ipv4/ip_input.c:636
ip_list_rcv+0xf7/0x130 net/ipv4/ip_input.c:670
__netif_receive_skb_list_core+0x21d/0x240 net/core/dev.c:6067
netif_receive_skb_list_internal+0x186/0x2b0 net/core/dev.c:6210
napi_complete_done+0x78/0x180 net/core/dev.c:6580
tun_get_user+0xa63/0x1120 drivers/net/tun.c:1909
tun_chr_write_iter+0x65/0xb0 drivers/net/tun.c:1984
vfs_write+0x300/0x420 fs/read_write.c:593
ksys_write+0x60/0xd0 fs/read_write.c:686
do_syscall_64+0x50/0x1c0 arch/x86/entry/syscall_64.c:63
</TASK>
To trigger gso segment in udp_queue_rcv_skb(), we should also set option
UDP_ENCAP_ESPINUDP to enable udp_sk(sk)->encap_rcv. When the encap_rcv
hook return 1 in udp_queue_rcv_one_skb(), udp_csum_pull_header() will try
to pull udphdr, but the skb size has been segmented to gso size, which
leads to this crash.
Previous commit cf329aa42b ("udp: cope with UDP GRO packet misdirection")
introduces segmentation in UDP receive path only for GRO, which was never
intended to be used for UFO, so drop UFO packets in udp_rcv_segment().
Link: https://lore.kernel.org/netdev/20250724083005.3918375-1-wangliang74@huawei.com/
Link: https://lore.kernel.org/netdev/20250729123907.3318425-1-wangliang74@huawei.com/
Fixes: cf329aa42b ("udp: cope with UDP GRO packet misdirection")
Suggested-by: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
Signed-off-by: Wang Liang <wangliang74@huawei.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20250730101458.3470788-1-wangliang74@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE+soXsSLHKoYyzcli6rmadz2vbToFAmiINnEACgkQ6rmadz2v
bToBnA/9F+A3R6rTwGk4HK3xpfc/nm2Tanl3oRN7S2ub/mskDOtWSIyG6cVFZ0UG
1fK6IkByyRIpAF/5qhdlw8drRXHkQtGLA0lP2L9llm4X1mHLofB18y9OeLrDE1WN
KwNP06+IGX9W802lCGSIXOY+VmRscVfXSMokyQt2ilHplKjOnDqJcYkWupi3T2rC
mz79FY9aEl2YrIcpj9RXz+8nwP49pZBuW2P0IM5PAIj4BJBXShrUp8T1nz94okNe
NFsnAyRxjWpUT0McEgtA9WvpD9lZqujYD8Qp0KlGZWmI3vNpV5d9S1+dBcEb1n7q
dyNMkTF3oRrJhhg4VqoHc6fVpzSEoZ9ZxV5Hx4cs+ganH75D4YbdGqx/7mR3DUgH
MZh6rHF1pGnK7TAm7h5gl3ZRAOkZOaahbe1i01NKo9CEe5fSh3AqMyzJYoyGHRKi
xDN39eQdWBNA+hm1VkbK2Bv93Rbjrka2Kj+D3sSSO9Bo/u3ntcknr7LW39idKz62
Q8dkKHcCEtun7gjk0YXPF013y81nEohj1C+52BmJ2l5JitM57xfr6YOaQpu7DPDE
AJbHx6ASxKdyEETecd0b+cXUPQ349zmRXy0+CDMAGKpBicC0H0mHhL14cwOY1Hfu
EIpIjmIJGI3JNF6T5kybcQGSBOYebdV0FFgwSllzPvuYt7YsHCs=
=/O3j
-----END PGP SIGNATURE-----
Merge tag 'bpf-next-6.17' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Pull bpf updates from Alexei Starovoitov:
- Remove usermode driver (UMD) framework (Thomas Weißschuh)
- Introduce Strongly Connected Component (SCC) in the verifier to
detect loops and refine register liveness (Eduard Zingerman)
- Allow 'void *' cast using bpf_rdonly_cast() and corresponding
'__arg_untrusted' for global function parameters (Eduard Zingerman)
- Improve precision for BPF_ADD and BPF_SUB operations in the verifier
(Harishankar Vishwanathan)
- Teach the verifier that constant pointer to a map cannot be NULL
(Ihor Solodrai)
- Introduce BPF streams for error reporting of various conditions
detected by BPF runtime (Kumar Kartikeya Dwivedi)
- Teach the verifier to insert runtime speculation barrier (lfence on
x86) to mitigate speculative execution instead of rejecting the
programs (Luis Gerhorst)
- Various improvements for 'veristat' (Mykyta Yatsenko)
- For CONFIG_DEBUG_KERNEL config warn on internal verifier errors to
improve bug detection by syzbot (Paul Chaignon)
- Support BPF private stack on arm64 (Puranjay Mohan)
- Introduce bpf_cgroup_read_xattr() kfunc to read xattr of cgroup's
node (Song Liu)
- Introduce kfuncs for read-only string opreations (Viktor Malik)
- Implement show_fdinfo() for bpf_links (Tao Chen)
- Reduce verifier's stack consumption (Yonghong Song)
- Implement mprog API for cgroup-bpf programs (Yonghong Song)
* tag 'bpf-next-6.17' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (192 commits)
selftests/bpf: Migrate fexit_noreturns case into tracing_failure test suite
selftests/bpf: Add selftest for attaching tracing programs to functions in deny list
bpf: Add log for attaching tracing programs to functions in deny list
bpf: Show precise rejected function when attaching fexit/fmod_ret to __noreturn functions
bpf: Fix various typos in verifier.c comments
bpf: Add third round of bounds deduction
selftests/bpf: Test invariants on JSLT crossing sign
selftests/bpf: Test cross-sign 64bits range refinement
selftests/bpf: Update reg_bound range refinement logic
bpf: Improve bounds when s64 crosses sign boundary
bpf: Simplify bounds refinement from s32
selftests/bpf: Enable private stack tests for arm64
bpf, arm64: JIT support for private stack
bpf: Move bpf_jit_get_prog_name() to core.c
bpf, arm64: Fix fp initialization for exception boundary
umd: Remove usermode driver framework
bpf/preload: Don't select USERMODE_DRIVER
selftests/bpf: Fix test dynptr/test_dynptr_memset_xdp_chunks failure
selftests/bpf: Fix test dynptr/test_dynptr_copy_xdp failure
selftests/bpf: Increase xdp data size for arm64 64K page size
...
Core & protocols
----------------
- Wrap datapath globals into net_aligned_data, to avoid false sharing.
- Preserve MSG_ZEROCOPY in forwarding (e.g. out of a container).
- Add SO_INQ and SCM_INQ support to AF_UNIX.
- Add SIOCINQ support to AF_VSOCK.
- Add TCP_MAXSEG sockopt to MPTCP.
- Add IPv6 force_forwarding sysctl to enable forwarding per interface.
- Make TCP validation of whether packet fully fits in the receive
window and the rcv_buf more strict. With increased use of HW
aggregation a single "packet" can be multiple 100s of kB.
- Add MSG_MORE flag to optimize large TCP transmissions via sockmap,
improves latency up to 33% for sockmap users.
- Convert TCP send queue handling from tasklet to BH workque.
- Improve BPF iteration over TCP sockets to see each socket exactly once.
- Remove obsolete and unused TCP RFC3517/RFC6675 loss recovery code.
- Support enabling kernel threads for NAPI processing on per-NAPI
instance basis rather than a whole device. Fully stop the kernel NAPI
thread when threaded NAPI gets disabled. Previously thread would stick
around until ifdown due to tricky synchronization.
- Allow multicast routing to take effect on locally-generated packets.
- Add output interface argument for End.X in segment routing.
- MCTP: add support for gateway routing, improve bind() handling.
- Don't require rtnl_lock when fetching an IPv6 neighbor over Netlink.
- Add a new neighbor flag ("extern_valid"), which cedes refresh
responsibilities to userspace. This is needed for EVPN multi-homing
where a neighbor entry for a multi-homed host needs to be synced
across all the VTEPs among which the host is multi-homed.
- Support NUD_PERMANENT for proxy neighbor entries.
- Add a new queuing discipline for IETF RFC9332 DualQ Coupled AQM.
- Add sequence numbers to netconsole messages. Unregister netconsole's
console when all net targets are removed. Code refactoring.
Add a number of selftests.
- Align IPSec inbound SA lookup to RFC 4301. Only SPI and protocol
should be used for an inbound SA lookup.
- Support inspecting ref_tracker state via DebugFS.
- Don't force bonding advertisement frames tx to ~333 ms boundaries.
Add broadcast_neighbor option to send ARP/ND on all bonded links.
- Allow providing upcall pid for the 'execute' command in openvswitch.
- Remove DCCP support from Netfilter's conntrack.
- Disallow multiple packet duplications in the queuing layer.
- Prevent use of deprecated iptables code on PREEMPT_RT.
Driver API
----------
- Support RSS and hashing configuration over ethtool Netlink.
- Add dedicated ethtool callbacks for getting and setting hashing fields.
- Add support for power budget evaluation strategy in PSE /
Power-over-Ethernet. Generate Netlink events for overcurrent etc.
- Support DPLL phase offset monitoring across all device inputs.
Support providing clock reference and SYNC over separate DPLL
inputs.
- Support traffic classes in devlink rate API for bandwidth management.
- Remove rtnl_lock dependency from UDP tunnel port configuration.
Device drivers
--------------
- Add a new Broadcom driver for 800G Ethernet (bnge).
- Add a standalone driver for Microchip ZL3073x DPLL.
- Remove IBM's NETIUCV device driver.
- Ethernet high-speed NICs:
- Broadcom (bnxt):
- support zero-copy Tx of DMABUF memory
- take page size into account for page pool recycling rings
- Intel (100G, ice, idpf):
- idpf: XDP and AF_XDP support preparations
- idpf: add flow steering
- add link_down_events statistic
- clean up the TSPLL code
- preparations for live VM migration
- nVidia/Mellanox:
- support zero-copy Rx/Tx interfaces (DMABUF and io_uring)
- optimize context memory usage for matchers
- expose serial numbers in devlink info
- support PCIe congestion metrics
- Meta (fbnic):
- add 25G, 50G, and 100G link modes to phylink
- support dumping FW logs
- Marvell/Cavium:
- support for CN20K generation of the Octeon chips
- Amazon:
- add HW clock (without timestamping, just hypervisor time access)
- Ethernet virtual:
- VirtIO net:
- support segmentation of UDP-tunnel-encapsulated packets
- Google (gve):
- support packet timestamping and clock synchronization
- Microsoft vNIC:
- add handler for device-originated servicing events
- allow dynamic MSI-X vector allocation
- support Tx bandwidth clamping
- Ethernet NICs consumer, and embedded:
- AMD:
- amd-xgbe: hardware timestamping and PTP clock support
- Broadcom integrated MACs (bcmgenet, bcmasp):
- use napi_complete_done() return value to support NAPI polling
- add support for re-starting auto-negotiation
- Broadcom switches (b53):
- support BCM5325 switches
- add bcm63xx EPHY power control
- Synopsys (stmmac):
- lots of code refactoring and cleanups
- TI:
- icssg-prueth: read firmware-names from device tree
- icssg: PRP offload support
- Microchip:
- lan78xx: convert to PHYLINK for improved PHY and MAC management
- ksz: add KSZ8463 switch support
- Intel:
- support similar queue priority scheme in multi-queue and
time-sensitive networking (taprio)
- support packet pre-emption in both
- RealTek (r8169):
- enable EEE at 5Gbps on RTL8126
- Airoha:
- add PPPoE offload support
- MDIO bus controller for Airoha AN7583
- Ethernet PHYs:
- support for the IPQ5018 internal GE PHY
- micrel KSZ9477 switch-integrated PHYs:
- add MDI/MDI-X control support
- add RX error counters
- add cable test support
- add Signal Quality Indicator (SQI) reporting
- dp83tg720: improve reset handling and reduce link recovery time
- support bcm54811 (and its MII-Lite interface type)
- air_en8811h: support resume/suspend
- support PHY counters for QCA807x and QCA808x
- support WoL for QCA807x
- CAN drivers:
- rcar_canfd: support for Transceiver Delay Compensation
- kvaser: report FW versions via devlink dev info
- WiFi:
- extended regulatory info support (6 GHz)
- add statistics and beacon monitor for Multi-Link Operation (MLO)
- support S1G aggregation, improve S1G support
- add Radio Measurement action fields
- support per-radio RTS threshold
- some work around how FIPS affects wifi, which was wrong (RC4 is used
by TKIP, not only WEP)
- improvements for unsolicited probe response handling
- WiFi drivers:
- RealTek (rtw88):
- IBSS mode for SDIO devices
- RealTek (rtw89):
- BT coexistence for MLO/WiFi7
- concurrent station + P2P support
- support for USB devices RTL8851BU/RTL8852BU
- Intel (iwlwifi):
- use embedded PNVM in (to be released) FW images to fix
compatibility issues
- many cleanups (unused FW APIs, PCIe code, WoWLAN)
- some FIPS interoperability
- MediaTek (mt76):
- firmware recovery improvements
- more MLO work
- Qualcomm/Atheros (ath12k):
- fix scan on multi-radio devices
- more EHT/Wi-Fi 7 features
- encapsulation/decapsulation offload
- Broadcom (brcm80211):
- support SDIO 43751 device
- Bluetooth:
- hci_event: add support for handling LE BIG Sync Lost event
- ISO: add socket option to report packet seqnum via CMSG
- ISO: support SCM_TIMESTAMPING for ISO TS
- Bluetooth drivers:
- intel_pcie: support Function Level Reset
- nxpuart: add support for 4M baudrate
- nxpuart: implement powerup sequence, reset, FW dump, and FW loading
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmiFgLgACgkQMUZtbf5S
IrvafxAAnQRwYBoIG+piCILx6z5pRvBGHkmEQ4AQgSCFuq2eO3ubwMFIqEybfma1
5+QFjUZAV3OgGgKRBS2KGWxtSzdiF+/JGV1VOIN67sX3Mm0a2QgjA4n5CgKL0FPr
o6BEzjX5XwG1zvGcBNQ5BZ19xUUKjoZQgTtnea8sZ57Fsp5RtRgmYRqoewNvNk/n
uImh0NFsDVb0UeOpSzC34VD9l1dJvLGdui4zJAjno/vpvmT1DkXjoK419J/r52SS
X+5WgsfJ6DkjHqVN1tIhhK34yWqBOcwGFZJgEnWHMkFIl2FqRfFKMHyqtfLlVnLA
mnIpSyz8Sq2AHtx0TlgZ3At/Ri8p5+yYJgHOXcDKyABa8y8Zf4wrycmr6cV9JLuL
z54nLEVnJuvfDVDVJjsLYdJXyhMpZFq6+uAItdxKaw8Ugp/QqG4QtoRj+XIHz4ZW
z6OohkCiCzTwEISFK+pSTxPS30eOxq43kCspcvuLiwCCStJBRkRb5GdZA4dm7LA+
1Od4ADAkHjyrFtBqTyyC2scX8UJ33DlAIpAYyIeS6w9Cj9EXxtp1z33IAAAZ03MW
jJwIaJuc8bK2fWKMmiG7ucIXjPo4t//KiWlpkwwqLhPbjZgfDAcxq1AC2TLoqHBL
y4EOgKpHDCMAghSyiFIAn2JprGcEt8dp+11B0JRXIn4Pm/eYDH8=
=lqbe
-----END PGP SIGNATURE-----
Merge tag 'net-next-6.17' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next
Pull networking updates from Jakub Kicinski:
"Core & protocols:
- Wrap datapath globals into net_aligned_data, to avoid false sharing
- Preserve MSG_ZEROCOPY in forwarding (e.g. out of a container)
- Add SO_INQ and SCM_INQ support to AF_UNIX
- Add SIOCINQ support to AF_VSOCK
- Add TCP_MAXSEG sockopt to MPTCP
- Add IPv6 force_forwarding sysctl to enable forwarding per interface
- Make TCP validation of whether packet fully fits in the receive
window and the rcv_buf more strict. With increased use of HW
aggregation a single "packet" can be multiple 100s of kB
- Add MSG_MORE flag to optimize large TCP transmissions via sockmap,
improves latency up to 33% for sockmap users
- Convert TCP send queue handling from tasklet to BH workque
- Improve BPF iteration over TCP sockets to see each socket exactly
once
- Remove obsolete and unused TCP RFC3517/RFC6675 loss recovery code
- Support enabling kernel threads for NAPI processing on per-NAPI
instance basis rather than a whole device. Fully stop the kernel
NAPI thread when threaded NAPI gets disabled. Previously thread
would stick around until ifdown due to tricky synchronization
- Allow multicast routing to take effect on locally-generated packets
- Add output interface argument for End.X in segment routing
- MCTP: add support for gateway routing, improve bind() handling
- Don't require rtnl_lock when fetching an IPv6 neighbor over Netlink
- Add a new neighbor flag ("extern_valid"), which cedes refresh
responsibilities to userspace. This is needed for EVPN multi-homing
where a neighbor entry for a multi-homed host needs to be synced
across all the VTEPs among which the host is multi-homed
- Support NUD_PERMANENT for proxy neighbor entries
- Add a new queuing discipline for IETF RFC9332 DualQ Coupled AQM
- Add sequence numbers to netconsole messages. Unregister
netconsole's console when all net targets are removed. Code
refactoring. Add a number of selftests
- Align IPSec inbound SA lookup to RFC 4301. Only SPI and protocol
should be used for an inbound SA lookup
- Support inspecting ref_tracker state via DebugFS
- Don't force bonding advertisement frames tx to ~333 ms boundaries.
Add broadcast_neighbor option to send ARP/ND on all bonded links
- Allow providing upcall pid for the 'execute' command in openvswitch
- Remove DCCP support from Netfilter's conntrack
- Disallow multiple packet duplications in the queuing layer
- Prevent use of deprecated iptables code on PREEMPT_RT
Driver API:
- Support RSS and hashing configuration over ethtool Netlink
- Add dedicated ethtool callbacks for getting and setting hashing
fields
- Add support for power budget evaluation strategy in PSE /
Power-over-Ethernet. Generate Netlink events for overcurrent etc
- Support DPLL phase offset monitoring across all device inputs.
Support providing clock reference and SYNC over separate DPLL
inputs
- Support traffic classes in devlink rate API for bandwidth
management
- Remove rtnl_lock dependency from UDP tunnel port configuration
Device drivers:
- Add a new Broadcom driver for 800G Ethernet (bnge)
- Add a standalone driver for Microchip ZL3073x DPLL
- Remove IBM's NETIUCV device driver
- Ethernet high-speed NICs:
- Broadcom (bnxt):
- support zero-copy Tx of DMABUF memory
- take page size into account for page pool recycling rings
- Intel (100G, ice, idpf):
- idpf: XDP and AF_XDP support preparations
- idpf: add flow steering
- add link_down_events statistic
- clean up the TSPLL code
- preparations for live VM migration
- nVidia/Mellanox:
- support zero-copy Rx/Tx interfaces (DMABUF and io_uring)
- optimize context memory usage for matchers
- expose serial numbers in devlink info
- support PCIe congestion metrics
- Meta (fbnic):
- add 25G, 50G, and 100G link modes to phylink
- support dumping FW logs
- Marvell/Cavium:
- support for CN20K generation of the Octeon chips
- Amazon:
- add HW clock (without timestamping, just hypervisor time access)
- Ethernet virtual:
- VirtIO net:
- support segmentation of UDP-tunnel-encapsulated packets
- Google (gve):
- support packet timestamping and clock synchronization
- Microsoft vNIC:
- add handler for device-originated servicing events
- allow dynamic MSI-X vector allocation
- support Tx bandwidth clamping
- Ethernet NICs consumer, and embedded:
- AMD:
- amd-xgbe: hardware timestamping and PTP clock support
- Broadcom integrated MACs (bcmgenet, bcmasp):
- use napi_complete_done() return value to support NAPI polling
- add support for re-starting auto-negotiation
- Broadcom switches (b53):
- support BCM5325 switches
- add bcm63xx EPHY power control
- Synopsys (stmmac):
- lots of code refactoring and cleanups
- TI:
- icssg-prueth: read firmware-names from device tree
- icssg: PRP offload support
- Microchip:
- lan78xx: convert to PHYLINK for improved PHY and MAC management
- ksz: add KSZ8463 switch support
- Intel:
- support similar queue priority scheme in multi-queue and
time-sensitive networking (taprio)
- support packet pre-emption in both
- RealTek (r8169):
- enable EEE at 5Gbps on RTL8126
- Airoha:
- add PPPoE offload support
- MDIO bus controller for Airoha AN7583
- Ethernet PHYs:
- support for the IPQ5018 internal GE PHY
- micrel KSZ9477 switch-integrated PHYs:
- add MDI/MDI-X control support
- add RX error counters
- add cable test support
- add Signal Quality Indicator (SQI) reporting
- dp83tg720: improve reset handling and reduce link recovery time
- support bcm54811 (and its MII-Lite interface type)
- air_en8811h: support resume/suspend
- support PHY counters for QCA807x and QCA808x
- support WoL for QCA807x
- CAN drivers:
- rcar_canfd: support for Transceiver Delay Compensation
- kvaser: report FW versions via devlink dev info
- WiFi:
- extended regulatory info support (6 GHz)
- add statistics and beacon monitor for Multi-Link Operation (MLO)
- support S1G aggregation, improve S1G support
- add Radio Measurement action fields
- support per-radio RTS threshold
- some work around how FIPS affects wifi, which was wrong (RC4 is
used by TKIP, not only WEP)
- improvements for unsolicited probe response handling
- WiFi drivers:
- RealTek (rtw88):
- IBSS mode for SDIO devices
- RealTek (rtw89):
- BT coexistence for MLO/WiFi7
- concurrent station + P2P support
- support for USB devices RTL8851BU/RTL8852BU
- Intel (iwlwifi):
- use embedded PNVM in (to be released) FW images to fix
compatibility issues
- many cleanups (unused FW APIs, PCIe code, WoWLAN)
- some FIPS interoperability
- MediaTek (mt76):
- firmware recovery improvements
- more MLO work
- Qualcomm/Atheros (ath12k):
- fix scan on multi-radio devices
- more EHT/Wi-Fi 7 features
- encapsulation/decapsulation offload
- Broadcom (brcm80211):
- support SDIO 43751 device
- Bluetooth:
- hci_event: add support for handling LE BIG Sync Lost event
- ISO: add socket option to report packet seqnum via CMSG
- ISO: support SCM_TIMESTAMPING for ISO TS
- Bluetooth drivers:
- intel_pcie: support Function Level Reset
- nxpuart: add support for 4M baudrate
- nxpuart: implement powerup sequence, reset, FW dump, and FW loading"
* tag 'net-next-6.17' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1742 commits)
dpll: zl3073x: Fix build failure
selftests: bpf: fix legacy netfilter options
ipv6: annotate data-races around rt->fib6_nsiblings
ipv6: fix possible infinite loop in fib6_info_uses_dev()
ipv6: prevent infinite loop in rt6_nlmsg_size()
ipv6: add a retry logic in net6_rt_notify()
vrf: Drop existing dst reference in vrf_ip6_input_dst
net/sched: taprio: align entry index attr validation with mqprio
net: fsl_pq_mdio: use dev_err_probe
selftests: rtnetlink.sh: remove esp4_offload after test
vsock: remove unnecessary null check in vsock_getname()
igb: xsk: solve negative overflow of nb_pkts in zerocopy mode
stmmac: xsk: fix negative overflow of budget in zerocopy mode
dt-bindings: ieee802154: Convert at86rf230.txt yaml format
net: dsa: microchip: Disable PTP function of KSZ8463
net: dsa: microchip: Setup fiber ports for KSZ8463
net: dsa: microchip: Write switch MAC address differently for KSZ8463
net: dsa: microchip: Use different registers for KSZ8463
net: dsa: microchip: Add KSZ8463 switch support to KSZ DSA driver
dt-bindings: net: dsa: microchip: Add KSZ8463 switch support
...
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmiHcP8QHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgphu+EACp+MNKBIx7iIM1arbmafEmroaR9Whi5CJu
8BczAKRVPNhICPXAjr6UOmRuD4nBzPysPNi73xNRdXdnd/dnOL27zMqF/0JCJsd6
V4Xt7pXcanJ4BoeRDv9P332h6/IVo1ulxqEzhWJt2KKoq4cWbSwlvk18NPOdGIkp
PcMPARf15d4KbLprdST5Dzrg2uqZ3jZ4Y4zvIMKery56AZCkp5ZrtPFttw4jhTff
FQzIjY/hXXDFURubv75EHFg7KC9rcZa4CuU40duTHmsXD/EsSsa0eAJMmiVAGjhM
PXK/70ZOoxLK1zU7zN+iAMUnVL7onoNvo8g9e7/FJNuHV25am8aUx4DzzP3ci+He
+2C4PpOKx9Egz0oKmC6hl9eMg/muPd/3j8uxQlGOvz38icgeirQB/ebr1KLtBkM9
pKQmCsGaIzncXWddcTMCayEj3wbvn2lgWSqSgZuwTt1AXqSTxLOmTywCYcXDBdli
2ejh/Hk45TALnBhiiDWdOT2euh2EEP1ADOZzVsZzeR29OzLqJDRjRnitb40NUU60
+ny2HOcplIN6aah+QdFwK31FieVKDY/ufjt1yGvwCP+kIxUEDSYi+NRRldmwznxy
8UZwFyvd8wOxUc+iG/wCK7ccY+MtYyaZAE4ok2QEvQvMUTBJE/LvI4bkd77fx7Sy
2cMeDukZ/w==
=EGH5
-----END PGP SIGNATURE-----
Merge tag 'for-6.17/io_uring-20250728' of git://git.kernel.dk/linux
Pull io_uring updates from Jens Axboe:
- Optimization to avoid reference counts on non-cloned registered
buffers. This is how these buffers were handled prior to having
cloning support, and we can still use that approach as long as the
buffers haven't been cloned to another ring.
- Cleanup and improvement for uring_cmd, where btrfs was the only user
of storing allocated data for the lifetime of the uring_cmd. Clean
that up so we can get rid of the need to do that.
- Avoid unnecessary memory copies in uring_cmd usage. This is
particularly important as a lot of uring_cmd usage necessitates the
use of 128b SQEs.
- A few updates for recv multishot, where it's now possible to add
fairness limits for limiting how much is transferred for each retry
loop. Additionally, recv multishot now supports an overall cap as
well, where once reached the multishot recv will terminate. The
latter is useful for buffer management and juggling many recv streams
at the same time.
- Add support for returning the TX timestamps via a new socket command.
This feature can work in either singleshot or multishot mode, where
the latter triggers a completion whenever new timestamps are
available. This is an alternative to using the existing error queue.
- Add support for an io_uring "mock" file, which is the start of being
able to do 100% targeted testing in terms of exercising io_uring
request handling. The idea is to have a file type that can be
anything the tester would like, and behave exactly how you want it to
behave in terms of hitting the code paths you want.
- Improve zcrx by using sgtables to de-duplicate and improve dma
address handling.
- Prep work for supporting larger pages for zcrx.
- Various little improvements and fixes.
* tag 'for-6.17/io_uring-20250728' of git://git.kernel.dk/linux: (42 commits)
io_uring/zcrx: fix leaking pages on sg init fail
io_uring/zcrx: don't leak pages on account failure
io_uring/zcrx: fix null ifq on area destruction
io_uring: fix breakage in EXPERT menu
io_uring/cmd: remove struct io_uring_cmd_data
btrfs/ioctl: store btrfs_uring_encoded_data in io_btrfs_cmd
io_uring/cmd: introduce IORING_URING_CMD_REISSUE flag
io_uring/zcrx: account area memory
io_uring: export io_[un]account_mem
io_uring/net: Support multishot receive len cap
io_uring: deduplicate wakeup handling
io_uring/net: cast min_not_zero() type
io_uring/poll: cleanup apoll freeing
io_uring/net: allow multishot receive per-invocation cap
io_uring/net: move io_sr_msg->retry_flags to io_sr_msg->flags
io_uring/net: use passed in 'len' in io_recv_buf_select()
io_uring/zcrx: prepare fallback for larger pages
io_uring/zcrx: assert area type in io_zcrx_iov_page
io_uring/zcrx: allocate sgtable for umem areas
io_uring/zcrx: introduce io_populate_area_dma
...
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCaINCiQAKCRCRxhvAZXjc
orltAQDq3y1anYETz5/FD6P2gXY1W5hXdSm3EHHeacQ1JjTXvgEA2g1lWO7J4anf
oOVE8aSvMow/FOjivLZBYmI65pkYJAE=
=oDKB
-----END PGP SIGNATURE-----
Merge tag 'vfs-6.17-rc1.pidfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull pidfs updates from Christian Brauner:
- persistent info
Persist exit and coredump information independent of whether anyone
currently holds a pidfd for the struct pid.
The current scheme allocated pidfs dentries on-demand repeatedly.
This scheme is reaching it's limits as it makes it impossible to pin
information that needs to be available after the task has exited or
coredumped and that should not be lost simply because the pidfd got
closed temporarily. The next opener should still see the stashed
information.
This is also a prerequisite for supporting extended attributes on
pidfds to allow attaching meta information to them.
If someone opens a pidfd for a struct pid a pidfs dentry is allocated
and stashed in pid->stashed. Once the last pidfd for the struct pid
is closed the pidfs dentry is released and removed from pid->stashed.
So if 10 callers create a pidfs dentry for the same struct pid
sequentially, i.e., each closing the pidfd before the other creates a
new one then a new pidfs dentry is allocated every time.
Because multiple tasks acquiring and releasing a pidfd for the same
struct pid can race with each another a task may still find a valid
pidfs entry from the previous task in pid->stashed and reuse it. Or
it might find a dead dentry in there and fail to reuse it and so
stashes a new pidfs dentry. Multiple tasks may race to stash a new
pidfs dentry but only one will succeed, the other ones will put their
dentry.
The current scheme aims to ensure that a pidfs dentry for a struct
pid can only be created if the task is still alive or if a pidfs
dentry already existed before the task was reaped and so exit
information has been was stashed in the pidfs inode.
That's great except that it's buggy. If a pidfs dentry is stashed in
pid->stashed after pidfs_exit() but before __unhash_process() is
called we will return a pidfd for a reaped task without exit
information being available.
The pidfds_pid_valid() check does not guard against this race as it
doens't sync at all with pidfs_exit(). The pid_has_task() check might
be successful simply because we're before __unhash_process() but
after pidfs_exit().
Introduce a new scheme where the lifetime of information associated
with a pidfs entry (coredump and exit information) isn't bound to the
lifetime of the pidfs inode but the struct pid itself.
The first time a pidfs dentry is allocated for a struct pid a struct
pidfs_attr will be allocated which will be used to store exit and
coredump information.
If all pidfs for the pidfs dentry are closed the dentry and inode can
be cleaned up but the struct pidfs_attr will stick until the struct
pid itself is freed. This will ensure minimal memory usage while
persisting relevant information.
The new scheme has various advantages. First, it allows to close the
race where we end up handing out a pidfd for a reaped task for which
no exit information is available. Second, it minimizes memory usage.
Third, it allows to remove complex lifetime tracking via dentries
when registering a struct pid with pidfs. There's no need to get or
put a reference. Instead, the lifetime of exit and coredump
information associated with a struct pid is bound to the lifetime of
struct pid itself.
- extended attributes
Now that we have a way to persist information for pidfs dentries we
can start supporting extended attributes on pidfds. This will allow
userspace to attach meta information to tasks.
One natural extension would be to introduce a custom pidfs.* extended
attribute space and allow for the inheritance of extended attributes
across fork() and exec().
The first simple scheme will allow privileged userspace to set
trusted extended attributes on pidfs inodes.
- Allow autonomous pidfs file handles
Various filesystems such as pidfs and drm support opening file
handles without having to require a file descriptor to identify the
filesystem. The filesystem are global single instances and can be
trivially identified solely on the information encoded in the file
handle.
This makes it possible to not have to keep or acquire a sentinal file
descriptor just to pass it to open_by_handle_at() to identify the
filesystem. That's especially useful when such sentinel file
descriptor cannot or should not be acquired.
For pidfs this means a file handle can function as full replacement
for storing a pid in a file. Instead a file handle can be stored and
reopened purely based on the file handle.
Such autonomous file handles can be opened with or without specifying
a a file descriptor. If no proper file descriptor is used the
FD_PIDFS_ROOT sentinel must be passed. This allows us to define
further special negative fd sentinels in the future.
Userspace can trivially test for support by trying to open the file
handle with an invalid file descriptor.
- Allow pidfds for reaped tasks with SCM_PIDFD messages
This is a logical continuation of the earlier work to create pidfds
for reaped tasks through the SO_PEERPIDFD socket option merged in
923ea4d448 ("Merge patch series "net, pidfs: enable handing out
pidfds for reaped sk->sk_peer_pid"").
- Two minor fixes:
* Fold fs_struct->{lock,seq} into a seqlock
* Don't bother with path_{get,put}() in unix_open_file()
* tag 'vfs-6.17-rc1.pidfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (37 commits)
don't bother with path_get()/path_put() in unix_open_file()
fold fs_struct->{lock,seq} into a seqlock
selftests: net: extend SCM_PIDFD test to cover stale pidfds
af_unix: enable handing out pidfds for reaped tasks in SCM_PIDFD
af_unix: stash pidfs dentry when needed
af_unix/scm: fix whitespace errors
af_unix: introduce and use scm_replace_pid() helper
af_unix: introduce unix_skb_to_scm helper
af_unix: rework unix_maybe_add_creds() to allow sleep
selftests/pidfd: decode pidfd file handles withou having to specify an fd
fhandle, pidfs: support open_by_handle_at() purely based on file handle
uapi/fcntl: add FD_PIDFS_ROOT
uapi/fcntl: add FD_INVALID
fcntl/pidfd: redefine PIDFD_SELF_THREAD_GROUP
uapi/fcntl: mark range as reserved
fhandle: reflow get_path_anchor()
pidfs: add pidfs_root_path() helper
fhandle: rename to get_path_anchor()
fhandle: hoist copy_from_user() above get_path_from_fd()
fhandle: raise FILEID_IS_DIR in handle_type
...
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEjF9xRqF1emXiQiqU1w0aZmrPKyEFAmiDtZkACgkQ1w0aZmrP
KyGxohAAns0Vyq4zx4VKcaDQQbciakZnHPK28eHOGxNKzh6qX/ybB8feOeAVOR1Q
RCsVkSej/HiGIOsrWwiOKecx+d/Sf2wSGeJYftPMjk5pL/V3SyMdAUUgQgPtne1o
LHE3rsu9BLGXt6M1mpn4+DjDQqDbLBVXvi3x/FNXCFJETuEfaL7gcisXKxzeqCtS
fWqFw0JzQNmHCYsBbUb7CpoowD3QKvSBdIUP+8ciC9qszRfgGfOlCKwkyRwp/uV3
vh1yLkHvlh9r4oe+PP4fNvTMbsaPS1jecj6xNwRmEEyf0bvBBCl9iNbWe5F9Fqvr
dbbOkKOHfo0nCYR3AzQFJpM1o81KGQ90JkcrR/hhEII8KD/3VKdvUqFl1WkU6rEP
rAxkl4lXM7aq11nJp2dClvshSZO/6Fo2byISGfLuVLPnq1/Lo2zZjnmk7StE1bm2
bWCA+C64CjAt1gqUCC1TZ7XpMtAZDQEdDSSN49/BqpJJncFEhigzph+I3H+LFuE9
hcUdF4i1UTNlqMTv4uY8ycdmrjRj4SfeXaBXxWjnaRn5Wxfs0GdtmdeiVuBTLylQ
Y18MM5OMGso1bzgS392jjwzMsommdLcgb7/v8Z6TU9dAjoeS2cYac5JiqilkgiSN
0nzikhtADCHZ+tgMU2LOc+nK93KQWyaJUtrSR5uEU6tHqe/m6No=
=mKnX
-----END PGP SIGNATURE-----
Merge tag 'nf-next-25-07-25' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next
Pablo Neira Ayuso says:
====================
Netfilter/IPVS updates for net-next
The following series contains Netfilter/IPVS updates for net-next:
1) Display netns inode in conntrack table full log, from lvxiafei.
2) Autoload nf_log_syslog in case no logging backend is available,
from Lance Yang.
3) Three patches to remove unused functions in x_tables, nf_tables and
conntrack. From Yue Haibing.
4) Exclude LEGACY TABLES on PREEMPT_RT: Add NETFILTER_XTABLES_LEGACY
to exclude xtables legacy infrastructure.
5) Restore selftests by toggling NETFILTER_XTABLES_LEGACY where needed.
From Florian Westphal.
6) Use CONFIG_INET_SCTP_DIAG in tools/testing/selftests/net/netfilter/config,
from Sebastian Andrzej Siewior.
7) Use timer_delete in comment in IPVS codebase, from WangYuli.
8) Dump flowtable information in nfnetlink_hook, this includes an initial
patch to consolidate common code in helper function, from Phil Sutter.
9) Remove unused arguments in nft_pipapo set backend, from Florian Westphal.
10) Return nft_set_ext instead of boolean in set lookup function,
from Florian Westphal.
11) Remove indirection in dynamic set infrastructure, also from Florian.
12) Consolidate pipapo_get/lookup, from Florian.
13) Use kvmalloc in nft_pipapop, from Florian Westphal.
14) syzbot reports slab-out-of-bounds in xt_nfacct log message,
fix from Florian Westphal.
15) Ignored tainted kernels in selftest nft_interface_stress.sh,
from Phil Sutter.
16) Fix IPVS selftest by disabling rp_filter with ipip tunnel device,
from Yi Chen.
* tag 'nf-next-25-07-25' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next:
selftests: netfilter: ipvs.sh: Explicity disable rp_filter on interface tunl0
selftests: netfilter: Ignore tainted kernels in interface stress test
netfilter: xt_nfacct: don't assume acct name is null-terminated
netfilter: nft_set_pipapo: prefer kvmalloc for scratch maps
netfilter: nft_set_pipapo: merge pipapo_get/lookup
netfilter: nft_set: remove indirection from update API call
netfilter: nft_set: remove one argument from lookup and update functions
netfilter: nft_set_pipapo: remove unused arguments
netfilter: nfnetlink_hook: Dump flowtable info
netfilter: nfnetlink: New NFNLA_HOOK_INFO_DESC helper
ipvs: Rename del_timer in comment in ip_vs_conn_expire_now()
selftests: netfilter: Enable CONFIG_INET_SCTP_DIAG
selftests: net: Enable legacy netfilter legacy options.
netfilter: Exclude LEGACY TABLES on PREEMPT_RT.
netfilter: conntrack: Remove unused net in nf_conntrack_double_lock()
netfilter: nf_tables: Remove unused nft_reduce_is_readonly()
netfilter: x_tables: Remove unused functions xt_{in|out}name()
netfilter: load nf_log_syslog on enabling nf_conntrack_log_invalid
netfilter: conntrack: table full detailed log
====================
Link: https://patch.msgid.link/20250725170340.21327-1-pablo@netfilter.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
As part of the removal of the variably-sized sockaddr for kernel
internals, replace struct sockaddr with sockaddr_inet in the sctp_addr
union.
No binary changes; the union size remains unchanged due to sockaddr_inet
matching the size of sockaddr_in6.
Signed-off-by: Kees Cook <kees@kernel.org>
Link: https://patch.msgid.link/20250722171836.1078436-3-kees@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This stems from a time when sets and nft_dynset resided in different kernel
modules. We can replace this with a direct call.
We could even remove both ->update and ->delete, given its only
supported by rhashtable, but on the off-chance we'll see runtime
add/delete for other types or a new set type keep that as-is for now.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Return the extension pointer instead of passing it as a function
argument to be filled in by the callee.
As-is, whenever false is returned, the extension pointer is not used.
For all set types, when true is returned, the extension pointer was set
to the matching element.
Only exception: nft_set_bitmap doesn't support extensions.
Return a pointer to a static const empty element extension container.
return false -> return NULL
return true -> return the elements' extension pointer.
This saves one function argument.
Signed-off-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Since commit 9e539c5b6d ("netfilter: nf_tables: disable expression
reduction infra") this is unused.
Signed-off-by: Yue Haibing <yuehaibing@huawei.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
When no logger is registered, nf_conntrack_log_invalid fails to log invalid
packets, leaving users unaware of actual invalid traffic. Improve this by
loading nf_log_syslog, similar to how 'iptables -I FORWARD 1 -m conntrack
--ctstate INVALID -j LOG' triggers it.
Suggested-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Zi Li <zi.li@linux.dev>
Signed-off-by: Lance Yang <lance.yang@linux.dev>
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Move multiple copies of same code snippet doing `gro_flush` and
`gro_normal_list` into separate helper function.
Signed-off-by: Samiullah Khawaja <skhawaja@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20250723013031.2911384-2-skhawaja@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
core:
- hci_sync: fix double free in 'hci_discovery_filter_clear()'
- hci_event: Mask data status from LE ext adv reports
- hci_devcd_dump: fix out-of-bounds via dev_coredumpv
- ISO: add socket option to report packet seqnum via CMSG
- hci_event: Add support for handling LE BIG Sync Lost event
- ISO: Support SCM_TIMESTAMPING for ISO TS
- hci_core: Add PA_LINK to distinguish BIG sync and PA sync connections
- hci_sock: Reset cookie to zero in hci_sock_free_cookie()
drivers:
- btusb: Add new VID/PID 0489/e14e for MT7925
- btusb: Add a new VID/PID 2c7c/7009 for MT7925
- btusb: Add RTL8852BE device 0x13d3:0x3618
- btusb: Add support for variant of RTL8851BE (USB ID 13d3:3601)
- btusb: Add USB ID 3625:010b for TP-LINK Archer TX10UB Nano
- btusb: QCA: Support downloading custom-made firmwares
- btusb: Add one more ID 0x28de:0x1401 for Qualcomm WCN6855
- nxp: add support for supply and reset
- btnxpuart: Add support for 4M baudrate
- btnxpuart: Correct the Independent Reset handling after FW dump
- btnxpuart: Add uevents for FW dump and FW download complete
- btintel: Define a macro for Intel Reset vendor command
- btintel_pcie: Support Function level reset
- btintel_pcie: Add support for device 0x4d76
- btintel_pcie: Make driver wait for alive interrupt
- btintel_pcie: Fix Alive Context State Handling
- hci_qca: Enable ISO data packet RX
-----BEGIN PGP SIGNATURE-----
iQJNBAABCgA3FiEE7E6oRXp8w05ovYr/9JCA4xAyCykFAmiBMSEZHGx1aXoudm9u
LmRlbnR6QGludGVsLmNvbQAKCRD0kIDjEDILKSrXD/9hlsXgpNgSGJQGdI07xdKR
QrBbK/7GU7yur5w4lzW5hepn5inmy8cOw9PKo45HhM50cHu95IAnlUP4wDjwc4Ia
Yv69+VNFonYFZ0P8Frcuy5RUZPdK/K88bdqoyccRm3eXk+Y2Of/6iv6mjx3Wq/fP
dN8vvOLlEdsiYDK8g82IQKH28Gt8w4TWGgJ94DT/NgPpLgy2IkVcPt9S/s1O4tXu
InRxvzHwqX98oEgkxgOclbsVZ8rClZMl94ffvT4C0f9rEZ2RWPqU7GCOJUJhSJHs
98Rxp4K8H+rlRuMhlwe2bzPs7Y6Ao/mMAN6zxKgEN0LfHXgveQ/GQmCznclMlhjS
5AX7zdCP9Jpfh/DOCfZdKde2Y5j8rBeIn7/7+IqxFjeEnXY7URxdKN+YMxR5w+fe
vepX7CsJg5x28k1wTgG4CJ9iUP/AV6AmZrwnJcCF35M964IsgrSFa061AeKYb1OX
gz4a+rQiLC6hTB2cvyBXgLmZehDu/eGHypCxfFFlxOkHwq0dRzV+W+hQfeel+DkO
AF3Kcw6xzA0nPovjzLGErxeSG18h7Y58pjHHU11M9lUthVChpG3QBxtlS70xbugl
pPSc3kQI/oB6UYFOFyrjwzSDPkVkXRGgfG/FbZQDag40edMB6hATnBCEqVkf+L7x
qtgHU9wLX3uAE1f0aUx5gQ==
=L038
-----END PGP SIGNATURE-----
Merge tag 'for-net-next-2025-07-23' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next
Luiz Augusto von Dentz says:
====================
bluetooth-next pull request for net-next:
core:
- hci_sync: fix double free in 'hci_discovery_filter_clear()'
- hci_event: Mask data status from LE ext adv reports
- hci_devcd_dump: fix out-of-bounds via dev_coredumpv
- ISO: add socket option to report packet seqnum via CMSG
- hci_event: Add support for handling LE BIG Sync Lost event
- ISO: Support SCM_TIMESTAMPING for ISO TS
- hci_core: Add PA_LINK to distinguish BIG sync and PA sync connections
- hci_sock: Reset cookie to zero in hci_sock_free_cookie()
drivers:
- btusb: Add new VID/PID 0489/e14e for MT7925
- btusb: Add a new VID/PID 2c7c/7009 for MT7925
- btusb: Add RTL8852BE device 0x13d3:0x3618
- btusb: Add support for variant of RTL8851BE (USB ID 13d3:3601)
- btusb: Add USB ID 3625:010b for TP-LINK Archer TX10UB Nano
- btusb: QCA: Support downloading custom-made firmwares
- btusb: Add one more ID 0x28de:0x1401 for Qualcomm WCN6855
- nxp: add support for supply and reset
- btnxpuart: Add support for 4M baudrate
- btnxpuart: Correct the Independent Reset handling after FW dump
- btnxpuart: Add uevents for FW dump and FW download complete
- btintel: Define a macro for Intel Reset vendor command
- btintel_pcie: Support Function level reset
- btintel_pcie: Add support for device 0x4d76
- btintel_pcie: Make driver wait for alive interrupt
- btintel_pcie: Fix Alive Context State Handling
- hci_qca: Enable ISO data packet RX
* tag 'for-net-next-2025-07-23' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next: (42 commits)
Bluetooth: Add PA_LINK to distinguish BIG sync and PA sync connections
Bluetooth: hci_event: Mask data status from LE ext adv reports
Bluetooth: btintel_pcie: Fix Alive Context State Handling
Bluetooth: btintel_pcie: Make driver wait for alive interrupt
Bluetooth: hci_devcd_dump: fix out-of-bounds via dev_coredumpv
Bluetooth: hci_sync: fix double free in 'hci_discovery_filter_clear()'
Bluetooth: btusb: Add one more ID 0x28de:0x1401 for Qualcomm WCN6855
Bluetooth: btusb: Sort WCN6855 device IDs by VID and PID
Bluetooth: btusb: QCA: Support downloading custom-made firmwares
Bluetooth: btnxpuart: Add uevents for FW dump and FW download complete
Bluetooth: btnxpuart: Correct the Independent Reset handling after FW dump
Bluetooth: ISO: Support SCM_TIMESTAMPING for ISO TS
Bluetooth: ISO: add socket option to report packet seqnum via CMSG
Bluetooth: btintel: Define a macro for Intel Reset vendor command
Bluetooth: Fix typos in comments
Bluetooth: RFCOMM: Fix typos in comments
Bluetooth: aosp: Fix typo in comment
Bluetooth: hci_bcm4377: Fix typo in comment
Bluetooth: btrtl: Fix typo in comment
Bluetooth: btmtk: Fix typo in log string
...
====================
Link: https://patch.msgid.link/20250723190233.166823-1-luiz.dentz@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
- rtw89:
- STA+P2P concurrency
- support for USB devices RTL8851BU/RTL8852BU
- ath9k: OF support
- ath12k:
- more EHT/Wi-Fi 7 features
- encapsulation/decapsulation offload
- iwlwifi: some FIPS interoperability
- brcm80211: support SDIO 43751 device
- rt2x00: better DT/OF support
- cfg80211/mac80211:
- improved S1G support
- beacon monitor for MLO
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEpeA8sTs3M8SN2hR410qiO8sPaAAFAmiCBJcACgkQ10qiO8sP
aABs3w/9GiI7NLSicM2ulpaWs92za2RrsXsYH01z+m+dx3wUCupKEXKHh+hwy+EW
WxJpfMqKURAh4QofTCLK0mwOKLr1bT/gNmXd9tKM3oaDIH/fk6HCteTff8GgmU/1
zfbd5UMSU1W98WJiS3Ukm9FjZN5i1X/cpPdYMIq0sX3sM9JSmDZL6ToLsknr3sC2
6CZj3UcjPx89e/ei4ALVjlU8DRhKBgqpzHMBgdnPt8bAbp9tUmyTKF0RqrBUn1Md
9D1inkA/bJxmIKHslc8PgUEeuBlndCjrLCzlYw2XW8UOcLaHxujqQv7drgEjjlr2
UTM+Hv6itFsRugS465EwbYLjM3lotmbpSWKR7ZQiSBF16jg0mBscq2mqqOU6Dv+X
SqxTp9WSYptCtylinz/6h8SsaPr+rGxa00sLopcPUrhWgWagndhmKMcVfQBvnlUE
JAg9gXkQ0d5GuYDOIdW+7i1NpLADthQpyynihhkAyISgfk+43HGUssTXCfRMr7wc
iL07j6PFGXYTdSDiuaxnS70qn8jHSlFQ6FKvb7jxTGBR3RD1KtCyoWVStx6vE6Q7
MTnsBGZZZ0yli6ur5vtVJr6ziwjjkBG+XxxuLvbeFb6+LYbPdpfa9fXj/UFct9el
U9iBlQ0SBwgw8olh1Y6Xbj0XM9B57UnKLgRZ9GxMPVgQPLTli9k=
=Dkfh
-----END PGP SIGNATURE-----
Merge tag 'wireless-next-2025-07-24' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next
Johannes Berg says:
====================
Another wireless update:
- rtw89:
- STA+P2P concurrency
- support for USB devices RTL8851BU/RTL8852BU
- ath9k: OF support
- ath12k:
- more EHT/Wi-Fi 7 features
- encapsulation/decapsulation offload
- iwlwifi: some FIPS interoperability
- brcm80211: support SDIO 43751 device
- rt2x00: better DT/OF support
- cfg80211/mac80211:
- improved S1G support
- beacon monitor for MLO
* tag 'wireless-next-2025-07-24' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next: (199 commits)
ssb: use new GPIO line value setter callbacks for the second GPIO chip
wifi: Fix typos
wifi: brcmsmac: Use str_true_false() helper
wifi: brcmfmac: fix EXTSAE WPA3 connection failure due to AUTH TX failure
wifi: brcm80211: Remove yet more unused functions
wifi: brcm80211: Remove more unused functions
wifi: brcm80211: Remove unused functions
wifi: iwlwifi: Revert "wifi: iwlwifi: remove support of several iwl_ppag_table_cmd versions"
wifi: iwlwifi: check validity of the FW API range
wifi: iwlwifi: don't export symbols that we shouldn't
wifi: iwlwifi: mld: use spec link id and not FW link id
wifi: iwlwifi: mld: decode EOF bit for AMPDUs
wifi: iwlwifi: Remove support for rx OMI bandwidth reduction
wifi: iwlwifi: stop supporting iwl_omi_send_status_notif ver 1
wifi: iwlwifi: remove SC2F firmware support
wifi: iwlwifi: mvm: Remove NAN support
wifi: iwlwifi: mld: avoid outdated reorder buffer head_sn
wifi: iwlwifi: mvm: avoid outdated reorder buffer head_sn
wifi: iwlwifi: disable certain features for fips_enabled
wifi: iwlwifi: mld: support channel survey collection for ACS scans
...
====================
Link: https://patch.msgid.link/20250724100349.21564-3-johannes@sipsolutions.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEH7ZpcWbFyOOp6OJbrB3Eaf9PW7cFAmiAj9QACgkQrB3Eaf9P
W7dQEhAAj0ccCgKych/IBx4zlK/TSsPN9PnZmylTgpfWQdClCKjNEa80Umf2wWgc
0j5rA77B2AJSHkRHMW6cWpmZRbicnWDavAtUOB/2lEcw8EDyMcIEEByzBtRkylvb
S1slLlpAYyezHsSr9uGu6+Meb3llkL2+pnF2CiA02uWI0MaEXaqtGfEqZ4DZX9BK
xQcb3St6Mr6pBV0sQ/rlY3Gfe/t/gzUnJNQQtRuKcZmBG4Ls1MLpJ6fYU++k+lfX
my4JehdI2Sg6QujX4HdlBDF1bCzcKpxzJGQVhkjT9OPx25owtoknMezeWdrK3Gdk
0pwIFs0+dFCc6hi2x6seabSBPNZJzT7978XZ0U2PneGoKlSw36UuAvk2ckAaBS+N
I2flFQRGAf7XjxNn53lRailG/nxGYUMEvhlM+uGjy1rujDn8NmdZXAwtQUGP25bJ
DCcXiQp6KUnclIWJ0MjScuFUG++Afn52DgsM+N5Fmjg9U/F1dVqIWjixcOOM4Lx2
DOpJy9gi3r7JZ0b1K7KMq1FibxFGszaW4YeABNPqlcEjnWiVhPNQjxKm0S84UQdL
Qh0s1PiAWhBqt4bvdf6h+r8309JWgo704MWxC2tBodDVJoYgugNXbnPihrr6+wO3
D9SbUJHSdTLOWTWzhlrFEb2oHVHAXc3KZpIQnk+32XaxqcNHpeY=
=MBQn
-----END PGP SIGNATURE-----
Merge tag 'ipsec-2025-07-23' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec
Steffen Klassert says:
====================
pull request (net): ipsec 2025-07-23
1) Premption fixes for xfrm_state_find.
From Sabrina Dubroca.
2) Initialize offload path also for SW IPsec GRO. This fixes a
performance regression on SW IPsec offload.
From Leon Romanovsky.
3) Fix IPsec UDP GRO for IKE packets.
From Tobias Brunner,
4) Fix transport header setting for IPcomp after decompressing.
From Fernando Fernandez Mancera.
5) Fix use-after-free when xfrmi_changelink tries to change
collect_md for a xfrm interface.
From Eyal Birger .
6) Delete the special IPcomp x->tunnel state along with the state x
to avoid refcount problems.
From Sabrina Dubroca.
Please pull or let me know if there are problems.
* tag 'ipsec-2025-07-23' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec:
Revert "xfrm: destroy xfrm_state synchronously on net exit path"
xfrm: delete x->tunnel as we delete x
xfrm: interface: fix use-after-free after changing collect_md xfrm interface
xfrm: ipcomp: adjust transport header after decompressing
xfrm: Set transport header to fix UDP GRO handling
xfrm: always initialize offload path
xfrm: state: use a consistent pcpu_id in xfrm_state_find
xfrm: state: initialize state_ptrs earlier in xfrm_state_find
====================
Link: https://patch.msgid.link/20250723075417.3432644-1-steffen.klassert@secunet.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
DualPI2 provides L4S-type low latency & loss to traffic that uses a
scalable congestion controller (e.g. TCP-Prague, DCTCP) without
degrading the performance of 'classic' traffic (e.g. Reno,
Cubic etc.). It is to be the reference implementation of IETF RFC9332
DualQ Coupled AQM (https://datatracker.ietf.org/doc/html/rfc9332).
Note that creating two independent queues cannot meet the goal of
DualPI2 mentioned in RFC9332: "...to preserve fairness between
ECN-capable and non-ECN-capable traffic." Further, it could even
lead to starvation of Classic traffic, which is also inconsistent
with the requirements in RFC9332: "...although priority MUST be
bounded in order not to starve Classic traffic." DualPI2 is
designed to maintain approximate per-flow fairness on L-queue and
C-queue by forming a single qdisc using the coupling factor and
scheduler between two queues.
The qdisc provides two queues called low latency and classic. It
classifies packets based on the ECN field in the IP headers. By
default it directs non-ECN and ECT(0) into the classic queue and
ECT(1) and CE into the low latency queue, as per the IETF spec.
Each queue runs its own AQM:
* The classic AQM is called PI2, which is similar to the PIE AQM but
more responsive and simpler. Classic traffic requires a decent
target queue (default 15ms for Internet deployment) to fully
utilize the link and to avoid high drop rates.
* The low latency AQM is, by default, a very shallow ECN marking
threshold (1ms) similar to that used for DCTCP.
The DualQ isolates the low queuing delay of the Low Latency queue
from the larger delay of the 'Classic' queue. However, from a
bandwidth perspective, flows in either queue will share out the link
capacity as if there was just a single queue. This bandwidth pooling
effect is achieved by coupling together the drop and ECN-marking
probabilities of the two AQMs.
The PI2 AQM has two main parameters in addition to its target delay.
The integral gain factor alpha is used to slowly correct any persistent
standing queue error from the target delay, while the proportional gain
factor beta is used to quickly compensate for queue changes (growth or
shrinkage). Either alpha and beta are given as a parameter, or they can
be calculated by tc from alternative typical and maximum RTT parameters.
Internally, the output of a linear Proportional Integral (PI)
controller is used for both queues. This output is squared to
calculate the drop or ECN-marking probability of the classic queue.
This counterbalances the square-root rate equation of Reno/Cubic,
which is the trick that balances flow rates across the queues. For
the ECN-marking probability of the low latency queue, the output of
the base AQM is multiplied by a coupling factor. This determines the
balance between the flow rates in each queue. The default setting
makes the flow rates roughly equal, which should be generally
applicable.
If DUALPI2 AQM has detected overload (due to excessive non-responsive
traffic in either queue), it will switch to signaling congestion
solely using drop, irrespective of the ECN field. Alternatively, it
can be configured to limit the drop probability and let the queue
grow and eventually overflow (like tail-drop).
GSO splitting in DUALPI2 is configurable from userspace while the
default behavior is to split gso. When running DUALPI2 at unshaped
10gigE with 4 download streams test, splitting gso apart results in
halving the latency with no loss in throughput:
Summary of tcp_4down run 'no_split_gso':
avg median # data pts
Ping (ms) ICMP : 0.53 0.30 ms 350
TCP download avg : 2326.86 N/A Mbits/s 350
TCP download sum : 9307.42 N/A Mbits/s 350
TCP download::1 : 2672.99 2568.73 Mbits/s 350
TCP download::2 : 2586.96 2570.51 Mbits/s 350
TCP download::3 : 1786.26 1798.82 Mbits/s 350
TCP download::4 : 2261.21 2309.49 Mbits/s 350
Summart of tcp_4down run 'split_gso':
avg median # data pts
Ping (ms) ICMP : 0.22 0.23 ms 350
TCP download avg : 2335.02 N/A Mbits/s 350
TCP download sum : 9340.09 N/A Mbits/s 350
TCP download::1 : 2335.30 2334.22 Mbits/s 350
TCP download::2 : 2334.72 2334.20 Mbits/s 350
TCP download::3 : 2335.28 2334.58 Mbits/s 350
TCP download::4 : 2334.79 2334.39 Mbits/s 350
A similar result is observed when running DUALPI2 at unshaped 1gigE
with 1 download stream test:
Summary of tcp_1down run 'no_split_gso':
avg median # data pts
Ping (ms) ICMP : 1.13 1.25 ms 350
TCP download : 941.41 941.46 Mbits/s 350
Summart of tcp_1down run 'split_gso':
avg median # data pts
Ping (ms) ICMP : 0.51 0.55 ms 350
TCP download : 941.41 941.45 Mbits/s 350
Additional details can be found in the draft:
https://datatracker.ietf.org/doc/html/rfc9332
Signed-off-by: Koen De Schepper <koen.de_schepper@nokia-bell-labs.com>
Co-developed-by: Olga Albisser <olga@albisser.org>
Signed-off-by: Olga Albisser <olga@albisser.org>
Co-developed-by: Olivier Tilmans <olivier.tilmans@nokia.com>
Signed-off-by: Olivier Tilmans <olivier.tilmans@nokia.com>
Co-developed-by: Henrik Steen <henrist@henrist.net>
Signed-off-by: Henrik Steen <henrist@henrist.net>
Co-developed-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
Signed-off-by: Bob Briscoe <research@bobbriscoe.net>
Signed-off-by: Ilpo Järvinen <ij@kernel.org>
Acked-by: Dave Taht <dave.taht@gmail.com>
Link: https://patch.msgid.link/20250722095915.24485-4-chia-yu.chang@nokia-bell-labs.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
To eliminate the use of struct page in page pool, the page pool users
should use netmem descriptor and APIs instead.
Make xdp access ->pp through netmem_desc instead of page.
Signed-off-by: Byungchul Park <byungchul@sk.com>
Link: https://patch.msgid.link/20250721021835.63939-13-byungchul@sk.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
To eliminate the use of struct page in page pool, the page pool users
should use netmem descriptor and APIs instead.
Make mlx4 access ->pp_ref_count through netmem_desc instead of page.
While at it, add a helper, pp_page_to_nmdesc() and __pp_page_to_nmdesc(),
that can be used to get netmem_desc from page only if it's a pp page.
For now that netmem_desc overlays on page, it can be achieved by just
casting, and use macro and _Generic to cover const casting as well.
Plus, change page_pool_page_is_pp() to check for 'const struct page *'
instead of 'struct page *' since it doesn't modify data and additionally
covers const type.
Signed-off-by: Byungchul Park <byungchul@sk.com>
Link: https://patch.msgid.link/20250721021835.63939-4-byungchul@sk.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
To eliminate the use of the page pool fields in struct page, the page
pool code should use netmem descriptor and APIs instead.
However, __netmem_get_pp() still accesses ->pp via struct page. So
change it to use struct netmem_desc instead, since ->pp no longer will
be available in struct page.
While at it, add a helper, __netmem_to_nmdesc(), that can be used to
unsafely get pointer to netmem_desc backing the netmem_ref, only when
the netmem_ref is always backed by system memory.
Signed-off-by: Byungchul Park <byungchul@sk.com>
Link: https://patch.msgid.link/20250721021835.63939-3-byungchul@sk.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
To simplify struct page, the page pool members of struct page should be
moved to other, allowing these members to be removed from struct page.
Introduce a network memory descriptor to store the members, struct
netmem_desc, and make it union'ed with the existing fields in struct
net_iov, allowing to organize the fields of struct net_iov.
Signed-off-by: Byungchul Park <byungchul@sk.com>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Mina Almasry <almasrymina@google.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Harry Yoo <harry.yoo@oracle.com>
Link: https://patch.msgid.link/20250721021835.63939-2-byungchul@sk.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Currently, BIS_LINK is used for both BIG sync and PA sync connections,
which makes it impossible to distinguish them when searching for a PA
sync connection.
Adding PA_LINK will make the distinction clearer and simplify future
extensions for PA-related features.
Signed-off-by: Yang Li <yang.li@amlogic.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
The Event_Type field in an LE Extended Advertising Report uses bits 5
and 6 for data status (e.g. truncation or fragmentation), not the PDU
type itself.
The ext_evt_type_to_legacy() function fails to mask these status bits
before evaluation. This causes valid advertisements with status bits set
(e.g. a truncated non-connectable advertisement, which ends up showing
as PDU type 0x40) to be misclassified as unknown and subsequently
dropped. This is okay for most checks which use bitwise AND on the
relevant event type bits, but it doesn't work for non-connectable types,
which are checked with '== LE_EXT_ADV_NON_CONN_IND' (that is, zero).
In terms of behaviour, first the device sends a truncated report:
> HCI Event: LE Meta Event (0x3e) plen 26
LE Extended Advertising Report (0x0d)
Entry 0
Event type: 0x0040
Data status: Incomplete, data truncated, no more to come
Address type: Random (0x01)
Address: 1D:12:46:FA:F8:6E (Non-Resolvable)
SID: 0x03
RSSI: -98 dBm (0x9e)
Data length: 0x00
Then, a few seconds later, it sends the subsequent complete report:
> HCI Event: LE Meta Event (0x3e) plen 122
LE Extended Advertising Report (0x0d)
Entry 0
Event type: 0x0000
Data status: Complete
Address type: Random (0x01)
Address: 1D:12:46:FA:F8:6E (Non-Resolvable)
SID: 0x03
RSSI: -97 dBm (0x9f)
Data length: 0x60
Service Data: Google (0xfef3)
Data[92]: ...
These devices often send multiple truncated reports per second.
This patch introduces a PDU type mask to ensure only the relevant bits
are evaluated, allowing for the correct translation of all valid
extended advertising packets.
Fixes: b2cc9761f1 ("Bluetooth: Handle extended ADV PDU types")
Signed-off-by: Chris Down <chris@chrisdown.name>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
User applications need a way to track which ISO interval a given SDU
belongs to, to properly detect packet loss. All controllers do not set
timestamps, and it's not guaranteed user application receives all packet
reports (small socket buffer, or controller doesn't send all reports
like Intel AX210 is doing).
Add socket option BT_PKT_SEQNUM that enables reporting of received
packet ISO sequence number in BT_SCM_PKT_SEQNUM CMSG.
Use BT_PKT_SEQNUM == 22 for the socket option, as 21 was used earlier
for a removed experimental feature that never got into mainline.
Signed-off-by: Pauli Virtanen <pav@iki.fi>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Correct the misspelling of “estabilished” in the code.
Signed-off-by: Yang Li <yang.li@amlogic.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
When the BIS source stops, the controller sends an LE BIG Sync Lost
event (subevent 0x1E). Currently, this event is not handled, causing
the BIS stream to remain active in BlueZ and preventing recovery.
Signed-off-by: Yang Li <yang.li@amlogic.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Since commit 4aa42119d9 ("Bluetooth: Remove pending ACL connection
attempts") this function is unused.
Signed-off-by: Yue Haibing <yuehaibing@huawei.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
ieee80211_add_gtk_rekey receives a keyconf as an argument, and the
cipher and keylen are taken from there to the new allocated key.
But in rekey, both the cipher and the keylen should be the same as of
the old key, so let ieee80211_add_gtk_rekey find those, so drivers won't
have to fill it in.
Reviewed-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
Link: https://patch.msgid.link/20250721214922.3c5c023bfae9.Ie6594ae2b4b6d5b3d536e642b349046ebfce7a5d@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
The upper layer may require the link ID to properly handle
unexpected frames. For instance, if hostapd, operating as an
AP MLD, receives a data frame from a non-associated STA,
it must send deauthentication to the link on which the STA is
operating.
Signed-off-by: Michael-CY Lee <michael-cy.lee@mediatek.com>
Reviewed-by: Money Wang <money.wang@mediatek.com>
Link: https://patch.msgid.link/20250721065159.1740992-1-michael-cy.lee@mediatek.com
[edit commit message]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Add a new SKB drop reason (SKB_DROP_REASON_PFMEMALLOC) to track packets
dropped due to memory pressure. In production environments, we've observed
memory exhaustion reported by memory layer stack traces, but these drops
were not properly tracked in the SKB drop reason infrastructure.
While most network code paths now properly report pfmemalloc drops, some
protocol-specific socket implementations still use sk_filter() without
drop reason tracking:
- Bluetooth L2CAP sockets
- CAIF sockets
- IUCV sockets
- Netlink sockets
- SCTP sockets
- Unix domain sockets
These remaining cases represent less common paths and could be converted
in a follow-up patch if needed. The current implementation provides
significantly improved observability into memory pressure events in the
network stack, especially for key protocols like TCP and UDP, helping to
diagnose problems in production environments.
Reported-by: Matt Fleming <mfleming@cloudflare.com>
Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>
Link: https://patch.msgid.link/175268316579.2407873.11634752355644843509.stgit@firesoul
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Introduce the ability to parse the short beacon data and long
beacon period. The long beacon period represents the number of beacon
intervals between each long beacon transmission. Additionally,
as a BSS cannot change its configuration such that short beaconing
is dynamically disabled/enabled without tearing down the interface
- we ensure we have an existing short beacon before performing
the update.
Signed-off-by: Lachlan Hodges <lachlan.hodges@morsemicro.com>
Link: https://patch.msgid.link/20250717074205.312577-3-lachlan.hodges@morsemicro.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
S1G short beacons are an optional frame type used in an S1G BSS
that contain a limited set of elements. While they are optional,
they are a fundamental part of S1G that enables significant
power saving.
Expose 2 additional netlink attributes,
NL80211_ATTR_S1G_LONG_BEACON_PERIOD which denotes the number of beacon
intervals between each long beacon and NL80211_ATTR_S1G_SHORT_BEACON
which is a nested attribute containing the short beacon tail and
head. We split them as the long beacon period cannot be updated,
and is only used when initialisng the interface, whereas the short
beacon data can be used to both initialise and update the templates.
This follows how things such as the beacon interval and DTIM period
currently operate.
During the initialisation path, we ensure we have the long beacon
period if the short beacon data is being passed down, whereas
the update path will simply update the template if its sent down.
The short beacon data is validated using the same routines for regular
beacons as they support correctly parsing the short beacon format
while ensuring the frame is well-formed.
Signed-off-by: Lachlan Hodges <lachlan.hodges@morsemicro.com>
Link: https://patch.msgid.link/20250717074205.312577-2-lachlan.hodges@morsemicro.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
neigh_add() updates pneigh_entry() found or created by pneigh_create().
This update is serialised by RTNL, but we will remove it.
Let's move the update part to pneigh_create() and make it return errno
instead of a pointer of pneigh_entry.
Now, the pneigh code is RTNL free.
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250716221221.442239-16-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
tbl->phash_buckets[] is only modified in the slow path by pneigh_create()
and pneigh_delete() under the table lock.
Both of them are called under RTNL, so no extra lock is needed, but we
will remove RTNL from the paths.
pneigh_create() looks up a pneigh_entry, and this part can be lockless,
but it would complicate the logic like
1. lookup
2. allocate pengih_entry for GFP_KERNEL
3. lookup again but under lock
4. if found, return it after freeing the allocated memory
5. else, return the new one
Instead, let's add a per-table mutex and run lookup and allocation
under it.
Note that updating pneigh_entry part in neigh_add() is still protected
by RTNL and will be moved to pneigh_create() in the next patch.
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250716221221.442239-15-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
__pneigh_lookup() is the lockless version of pneigh_lookup(),
but its only caller pndisc_is_router() holds the table lock and
reads pneigh_netry.flags.
This is because accessing pneigh_entry after pneigh_lookup() was
illegal unless the caller holds RTNL or the table lock.
Now, pneigh_entry is guaranteed to be alive during the RCU critical
section.
Let's call pneigh_lookup() and use READ_ONCE() for n->flags in
pndisc_is_router() and remove __pneigh_lookup().
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250716221221.442239-13-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
We will convert RTM_GETNEIGH to RCU.
neigh_get() looks up pneigh_entry by pneigh_lookup() and passes
it to pneigh_fill_info().
Then, we must ensure that the entry is alive till pneigh_fill_info()
completes, but read_lock_bh(&tbl->lock) in pneigh_lookup() does not
guarantee that.
Also, we will convert all readers of tbl->phash_buckets[] to RCU.
Let's use call_rcu() to free pneigh_entry and update phash_buckets[]
and ->next by rcu_assign_pointer().
pneigh_ifdown_and_unlock() uses list_head to avoid overwriting
->next and moving RCU iterators to another list.
pndisc_destructor() (only IPv6 ndisc uses this) uses a mutex, so it
is not delayed to call_rcu(), where we cannot sleep. This is fine
because the mcast code works with RCU and ipv6_dev_mc_dec() frees
mcast objects after RCU grace period.
While at it, we change the return type of pneigh_ifdown_and_unlock()
to void.
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250716221221.442239-8-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The next patch will free pneigh_entry with call_rcu().
Then, we need to annotate neigh_table.phash_buckets[] and
pneigh_entry.next with __rcu.
To make the next patch cleaner, let's annotate the fields in advance.
Currently, all accesses to the fields are under the neigh table lock,
so rcu_dereference_protected() is used with 1 for now, but most of them
(except in pneigh_delete() and pneigh_ifdown_and_unlock()) will be
replaced with rcu_dereference() and rcu_dereference_check().
Note that pneigh_ifdown_and_unlock() changes pneigh_entry.next to a
local list, which is illegal because the RCU iterator could be moved
to another list. This part will be fixed in the next patch.
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250716221221.442239-7-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
pneigh_lookup() has ASSERT_RTNL() in the middle of the function, which
is confusing.
When called with the last argument, creat, 0, pneigh_lookup() literally
looks up a proxy neighbour entry. This is the case of the reader path
as the fast path and RTM_GETNEIGH.
pneigh_lookup(), however, creates a pneigh_entry when called with creat 1
from RTM_NEWNEIGH and SIOCSARP, which require RTNL.
Let's split pneigh_lookup() into two functions.
We will convert all the reader paths to RCU, and read_lock_bh(&tbl->lock)
in the new pneigh_lookup() will be dropped.
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250716221221.442239-6-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
- hci_sync: fix connectable extended advertising when using static random address
- hci_core: fix typos in macros
- hci_core: add missing braces when using macro parameters
- hci_core: replace 'quirks' integer by 'quirk_flags' bitmap
- SMP: If an unallowed command is received consider it a failure
- SMP: Fix using HCI_ERROR_REMOTE_USER_TERM on timeout
- L2CAP: Fix null-ptr-deref in l2cap_sock_resume_cb()
- L2CAP: Fix attempting to adjust outgoing MTU
- btintel: Check if controller is ISO capable on btintel_classify_pkt_type
- btusb: QCA: Fix downloading wrong NVM for WCN6855 GF variant without board ID
-----BEGIN PGP SIGNATURE-----
iQJNBAABCgA3FiEE7E6oRXp8w05ovYr/9JCA4xAyCykFAmh5CF4ZHGx1aXoudm9u
LmRlbnR6QGludGVsLmNvbQAKCRD0kIDjEDILKeeYD/9oqrpfAnF+ZYakvt+W+bJx
KPBWlXlgVSnPbK9qJPWm8AUaEOz6yyGd728S0QYZ+y5map5TVMWE0n1BYfauUgch
GmUS/Li44qRVi9ygxS3CqiXHoVFtRiMJd3kx5v3SH8LHUQZakcNsFg4DCfQAufZe
uJI+/+vccBx8rF+WR3mlhziE0bUosHOLAkqujnuKg/EpVO4xc4zeG6AKK5ihVHgQ
1SlPp/s6BYz1VcMj9HMsEk6z4iY8WF5bdN1YdzkvRziTYYuFMDpJwI83FtkmmrsG
v59GwlPMsGNlz25KbapzqGgflydeXSKbigTJQr7LHAaKv4jmqnnAeCOlhkvFuFBm
snb2Zkkw16w+s/DBQvriBy6D+yiaSwKkZUjNWwGTvyDqAna6Kx44jzT1QpgOSm2p
d+rxjrNXRjT59wiIo1JsOXpK5Mbbyz5QGXge/RbUO36glh/J2Vs44F1HueZHwSSw
GGt0jmRTjB8/icbcvnkMVgwnoQEul7bsV95fPOq6CGSuRxYIX7uFXWMM/Wb/1SN7
QWQyN/P7z5XpZMWFH3SDVx/FhN6G5Pi17OkvaLVSwfKs7jK45gb10Oi6cdypL2rc
Ed6EkBOIL0ETAqDd4NLGTOeHEpJ3zfxxWqlu5cYrUf4qj7vtXk39ylNmXNNA/4Ci
lU4vQdGAHYX3BIhaRdqryQ==
=JuS2
-----END PGP SIGNATURE-----
Merge tag 'for-net-2025-07-17' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth
Luiz Augusto von Dentz says:
====================
bluetooth pull request for net:
- hci_sync: fix connectable extended advertising when using static random address
- hci_core: fix typos in macros
- hci_core: add missing braces when using macro parameters
- hci_core: replace 'quirks' integer by 'quirk_flags' bitmap
- SMP: If an unallowed command is received consider it a failure
- SMP: Fix using HCI_ERROR_REMOTE_USER_TERM on timeout
- L2CAP: Fix null-ptr-deref in l2cap_sock_resume_cb()
- L2CAP: Fix attempting to adjust outgoing MTU
- btintel: Check if controller is ISO capable on btintel_classify_pkt_type
- btusb: QCA: Fix downloading wrong NVM for WCN6855 GF variant without board ID
* tag 'for-net-2025-07-17' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth:
Bluetooth: L2CAP: Fix attempting to adjust outgoing MTU
Bluetooth: btusb: QCA: Fix downloading wrong NVM for WCN6855 GF variant without board ID
Bluetooth: hci_dev: replace 'quirks' integer by 'quirk_flags' bitmap
Bluetooth: hci_core: add missing braces when using macro parameters
Bluetooth: hci_core: fix typos in macros
Bluetooth: SMP: Fix using HCI_ERROR_REMOTE_USER_TERM on timeout
Bluetooth: SMP: If an unallowed command is received consider it a failure
Bluetooth: btintel: Check if controller is ISO capable on btintel_classify_pkt_type
Bluetooth: hci_sync: fix connectable extended advertising when using static random address
Bluetooth: Fix null-ptr-deref in l2cap_sock_resume_cb()
====================
Link: https://patch.msgid.link/20250717142849.537425-1-luiz.dentz@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
- cfg80211: fix double-free introduced earlier
- mac80211: fix RCU iteration in CSA
- iwlwifi: many cleanups (unused FW APIs, PCIe code, WoWLAN)
- mac80211: some work around how FIPS affects wifi, which was
wrong (RC4 is used by TKIP, not only WEP)
- cfg/mac80211: improvements for unsolicated probe response
handling
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEpeA8sTs3M8SN2hR410qiO8sPaAAFAmh4xE8ACgkQ10qiO8sP
aAANTA/9HWfjfXTvZ3lEZqmOeJdHxd6xw3uHA732NF+KxoKi8vnO8V63WNVxBzCF
6srSaSi0IkdtnyR/uPqG5AkWej/ouAeDy+FtzoDpP8O6G4e9yjKZBRnvP+O/gkod
ERM4ouLWsumAIZqSqrSAJHgNTOP0/n5MNiLoKHlcjZmXphL2/lM8Qay7kVmWG+vT
tqv/LrwqRiTe/a+O9PKaW2sq6z23CBhEqWf6hhFCaEPVQXQJLBc6FTI4008zP/Rt
aSq672rMwA8ckuVOH0EAX544DMJc736to9XJEuu9EzuOaDZ9FwfqF9nOTMnFDev2
k9qn8Q5vsx+786A6pXw+sarnxhUxnQCYB00MhMe1e1yoB0lv57hZCD1Z0JTBGXPh
8ghmbelYBvkLollie6zUUcdf3G98ADUYvHsR0elBgGADWPHvnbn3ORVcv852X6jS
El1CR3lyvwwHn9USKaR91ESQWFRtLtPG+OmmLRLIG7wdkp0sDjHOpvV7A/7QpFLw
A18QrN+ORNL3Pja4A4dyIpDRQiTKHIjEEm3dAlVDzSWcPpcws5WlfreC1L71XblA
kRl+1I7SZ87rp8d6TXXxc8QmIMA+M4OL6M4FMAJgjn8VdyjlaC6lvl1i+tgdaDik
ISrY/78gnnt93BQmHU4P/syvh+NdpgNN+LgeaikKb1kmjG837Yk=
=g8VQ
-----END PGP SIGNATURE-----
Merge tag 'wireless-next-2025-07-17' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next
Johannes Berg says:
====================
Another set of changes, notably:
- cfg80211: fix double-free introduced earlier
- mac80211: fix RCU iteration in CSA
- iwlwifi: many cleanups (unused FW APIs, PCIe code, WoWLAN)
- mac80211: some work around how FIPS affects wifi, which was
wrong (RC4 is used by TKIP, not only WEP)
- cfg/mac80211: improvements for unsolicated probe response
handling
* tag 'wireless-next-2025-07-17' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next: (64 commits)
wifi: cfg80211: fix double free for link_sinfo in nl80211_station_dump()
wifi: cfg80211: fix off channel operation allowed check for MLO
wifi: mac80211: use RCU-safe iteration in ieee80211_csa_finish
wifi: mac80211_hwsim: Update comments in header
wifi: mac80211: parse unsolicited broadcast probe response data
wifi: cfg80211: parse attribute to update unsolicited probe response template
wifi: mac80211: don't use TPE data from assoc response
wifi: mac80211: handle WLAN_HT_ACTION_NOTIFY_CHANWIDTH async
wifi: mac80211: simplify __ieee80211_rx_h_amsdu() loop
wifi: mac80211: don't mark keys for inactive links as uploaded
wifi: mac80211: only assign chanctx in reconfig
wifi: mac80211_hwsim: Declare support for AP scanning
wifi: mac80211: clean up cipher suite handling
wifi: mac80211: don't send keys to driver when fips_enabled
wifi: cfg80211: Fix interface type validation
wifi: mac80211: remove ieee80211_link_unreserve_chanctx() return value
wifi: mac80211: don't unreserve never reserved chanctx
mwl8k: Add missing check after DMA map
wifi: mac80211: make VHT opmode NSS ignore a debug message
wifi: iwlwifi: remove support of several iwl_ppag_table_cmd versions
...
====================
Link: https://patch.msgid.link/20250717094610.20106-47-johannes@sipsolutions.net
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
- ath12k performance regression from -rc1
- cfg80211 counted_by() removal for scan request
as it doesn't match usage and keeps complaining
- iwlwifi crash with certain older devices
- iwlwifi missing an error path unlock
- iwlwifi compatibility with certain BIOS updates
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEpeA8sTs3M8SN2hR410qiO8sPaAAFAmh4vwsACgkQ10qiO8sP
aAAv+w/9Fs51/x4KVnUhAK7+WtOjvFAuzQkCsVlMZplO8WnXOzUPWfZ+sOiBrMbR
ws55b4SDnO82LoeLq4MBxAMJG7P+aWK9hNbMzA6TENaWCgxvTYNEQBijGjplx8pt
0r2lpn3OP3ALluV09UnCVTVIWFL3pq/jCsGEGHVhjpRUJz9RSwmBrHFM6V7yqrzl
VIbN8k5U90L1gwQ0wf2RI/KMZwJ85zCVViYSx1wNEpfwDH0pnoy5V1bMCNaCvReh
Gca8XCmCKYP2/NSQcsoyqnz0RgjKDAsN1a8s/S/HHFea62iOpUctSg8rgc8bLfJn
bbuwuSx40Ol70kPnSii35ImTO6PRSWXpibOlA3xtjPoLprX99e7xrdm6nSbf03It
VnQtfhB5fuoBHQuPxLUeLDVl8dT4r4zbS1JQw7feUCP1Rs3IGGDLtBQHZQNE0VIr
eFkWSa4SHv5mOFrPSuyAsSviRn6O8/Ni84/hXnAf9Lh1hzfPFG4pQ+gzBNf1ZoZI
lrgNvj1xoDothTb3DzgEQwO5oxd3zFHdtOD/4DN8wH0pSOTm4obQm23b43F6t+CC
q2E8eD4BurfnBuL1zlLseo6JbSFC7LOsustgarB+q2XSPoTCmJqaOfR24Kq9ZjjN
OBE+uvTlg+/l2UjZ5lp/07XE3LSy6bWAvsl4Yy02biOiyZ0GuXU=
=yzt3
-----END PGP SIGNATURE-----
Merge tag 'wireless-2025-07-17' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless
Johannes Berg says:
====================
Couple of fixes:
- ath12k performance regression from -rc1
- cfg80211 counted_by() removal for scan request
as it doesn't match usage and keeps complaining
- iwlwifi crash with certain older devices
- iwlwifi missing an error path unlock
- iwlwifi compatibility with certain BIOS updates
* tag 'wireless-2025-07-17' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless:
wifi: iwlwifi: Fix botched indexing conversion
wifi: cfg80211: remove scan request n_channels counted_by
wifi: ath12k: Fix packets received in WBM error ring with REO LUT enabled
wifi: iwlwifi: mask reserved bits in chan_state_active_bitmap
wifi: iwlwifi: pcie: fix locking on invalid TOP reset
====================
Link: https://patch.msgid.link/20250717091831.18787-5-johannes@sipsolutions.net
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
A crash in conntrack was reported while trying to unlink the conntrack
entry from the hash bucket list:
[exception RIP: __nf_ct_delete_from_lists+172]
[..]
#7 [ff539b5a2b043aa0] nf_ct_delete at ffffffffc124d421 [nf_conntrack]
#8 [ff539b5a2b043ad0] nf_ct_gc_expired at ffffffffc124d999 [nf_conntrack]
#9 [ff539b5a2b043ae0] __nf_conntrack_find_get at ffffffffc124efbc [nf_conntrack]
[..]
The nf_conn struct is marked as allocated from slab but appears to be in
a partially initialised state:
ct hlist pointer is garbage; looks like the ct hash value
(hence crash).
ct->status is equal to IPS_CONFIRMED|IPS_DYING, which is expected
ct->timeout is 30000 (=30s), which is unexpected.
Everything else looks like normal udp conntrack entry. If we ignore
ct->status and pretend its 0, the entry matches those that are newly
allocated but not yet inserted into the hash:
- ct hlist pointers are overloaded and store/cache the raw tuple hash
- ct->timeout matches the relative time expected for a new udp flow
rather than the absolute 'jiffies' value.
If it were not for the presence of IPS_CONFIRMED,
__nf_conntrack_find_get() would have skipped the entry.
Theory is that we did hit following race:
cpu x cpu y cpu z
found entry E found entry E
E is expired <preemption>
nf_ct_delete()
return E to rcu slab
init_conntrack
E is re-inited,
ct->status set to 0
reply tuplehash hnnode.pprev
stores hash value.
cpu y found E right before it was deleted on cpu x.
E is now re-inited on cpu z. cpu y was preempted before
checking for expiry and/or confirm bit.
->refcnt set to 1
E now owned by skb
->timeout set to 30000
If cpu y were to resume now, it would observe E as
expired but would skip E due to missing CONFIRMED bit.
nf_conntrack_confirm gets called
sets: ct->status |= CONFIRMED
This is wrong: E is not yet added
to hashtable.
cpu y resumes, it observes E as expired but CONFIRMED:
<resumes>
nf_ct_expired()
-> yes (ct->timeout is 30s)
confirmed bit set.
cpu y will try to delete E from the hashtable:
nf_ct_delete() -> set DYING bit
__nf_ct_delete_from_lists
Even this scenario doesn't guarantee a crash:
cpu z still holds the table bucket lock(s) so y blocks:
wait for spinlock held by z
CONFIRMED is set but there is no
guarantee ct will be added to hash:
"chaintoolong" or "clash resolution"
logic both skip the insert step.
reply hnnode.pprev still stores the
hash value.
unlocks spinlock
return NF_DROP
<unblocks, then
crashes on hlist_nulls_del_rcu pprev>
In case CPU z does insert the entry into the hashtable, cpu y will unlink
E again right away but no crash occurs.
Without 'cpu y' race, 'garbage' hlist is of no consequence:
ct refcnt remains at 1, eventually skb will be free'd and E gets
destroyed via: nf_conntrack_put -> nf_conntrack_destroy -> nf_ct_destroy.
To resolve this, move the IPS_CONFIRMED assignment after the table
insertion but before the unlock.
Pablo points out that the confirm-bit-store could be reordered to happen
before hlist add resp. the timeout fixup, so switch to set_bit and
before_atomic memory barrier to prevent this.
It doesn't matter if other CPUs can observe a newly inserted entry right
before the CONFIRMED bit was set:
Such event cannot be distinguished from above "E is the old incarnation"
case: the entry will be skipped.
Also change nf_ct_should_gc() to first check the confirmed bit.
The gc sequence is:
1. Check if entry has expired, if not skip to next entry
2. Obtain a reference to the expired entry.
3. Call nf_ct_should_gc() to double-check step 1.
nf_ct_should_gc() is thus called only for entries that already failed an
expiry check. After this patch, once the confirmed bit check passes
ct->timeout has been altered to reflect the absolute 'best before' date
instead of a relative time. Step 3 will therefore not remove the entry.
Without this change to nf_ct_should_gc() we could still get this sequence:
1. Check if entry has expired.
2. Obtain a reference.
3. Call nf_ct_should_gc() to double-check step 1:
4 - entry is still observed as expired
5 - meanwhile, ct->timeout is corrected to absolute value on other CPU
and confirm bit gets set
6 - confirm bit is seen
7 - valid entry is removed again
First do check 6), then 4) so the gc expiry check always picks up either
confirmed bit unset (entry gets skipped) or expiry re-check failure for
re-inited conntrack objects.
This change cannot be backported to releases before 5.19. Without
commit 8a75a2c174 ("netfilter: conntrack: remove unconfirmed list")
|= IPS_CONFIRMED line cannot be moved without further changes.
Cc: Razvan Cojocaru <rzvncj@gmail.com>
Link: https://lore.kernel.org/netfilter-devel/20250627142758.25664-1-fw@strlen.de/
Link: https://lore.kernel.org/netfilter-devel/4239da15-83ff-4ca4-939d-faef283471bb@gmail.com/
Fixes: 1397af5bfd ("netfilter: conntrack: remove the percpu dying list")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
The 'quirks' member already ran out of bits on some platforms some time
ago. Replace the integer member by a bitmap in order to have enough bits
in future. Replace raw bit operations by accessor macros.
Fixes: ff26b2dd65 ("Bluetooth: Add quirk for broken READ_VOICE_SETTING")
Fixes: 127881334e ("Bluetooth: Add quirk for broken READ_PAGE_SCAN_TYPE")
Suggested-by: Pauli Virtanen <pav@iki.fi>
Tested-by: Ivan Pravdin <ipravdin.official@gmail.com>
Signed-off-by: Kiran K <kiran.k@intel.com>
Signed-off-by: Christian Eggers <ceggers@arri.de>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Macro parameters should always be put into braces when accessing it.
Fixes: 4fc9857ab8 ("Bluetooth: hci_sync: Add check simultaneous roles support")
Signed-off-by: Christian Eggers <ceggers@arri.de>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
The provided macro parameter is named 'dev' (rather than 'hdev', which
may be a variable on the stack where the macro is used).
Fixes: a9a830a676 ("Bluetooth: hci_event: Fix sending HCI_OP_READ_ENC_KEY_SIZE")
Fixes: 6126ffabba ("Bluetooth: Introduce HCI_CONN_FLAG_DEVICE_PRIVACY device flag")
Signed-off-by: Christian Eggers <ceggers@arri.de>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Prior to calling bind() a program may call connect() on a socket to
restrict to a remote peer address.
Using connect() is the normal mechanism to specify a remote network
peer, so we use that here. In MCTP connect() is only used for bound
sockets - send() is not available for MCTP since a tag must be provided
for each message.
The smctp_type must match between connect() and bind() calls.
Signed-off-by: Matt Johnston <matt@codeconstruct.com.au>
Link: https://patch.msgid.link/20250710-mctp-bind-v4-6-8ec2f6460c56@codeconstruct.com.au
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Ensure that a specific EID (remote or local) bind will match in
preference to a MCTP_ADDR_ANY bind.
This adds infrastructure for binding a socket to receive messages from a
specific remote peer address, a future commit will expose an API for
this.
Signed-off-by: Matt Johnston <matt@codeconstruct.com.au>
Link: https://patch.msgid.link/20250710-mctp-bind-v4-5-8ec2f6460c56@codeconstruct.com.au
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
At present, the updated unsolicited broadcast probe response template is
not processed during userspace commands such as channel switch or color
change. This leads to an issue where older incorrect unsolicited probe
response is still used during these events.
Add support to parse the netlink attribute and store it so that
mac80211/drivers can use it to set the BSS_CHANGED_UNSOL_BCAST_PROBE_RESP
flag in order to send the updated unsolicited broadcast probe response
templates during these events.
Signed-off-by: Yuvarani V <quic_yuvarani@quicinc.com>
Signed-off-by: Aditya Kumar Singh <aditya.kumar.singh@oss.qualcomm.com>
Link: https://patch.msgid.link/20250710-update_unsol_bcast_probe_resp-v2-1-31aca39d3b30@oss.qualcomm.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
This reverts commit e3eac9f32e ("wifi: cfg80211: Annotate struct
cfg80211_scan_request with __counted_by").
This really has been a completely failed experiment. There were
no actual bugs found, and yet at this point we already have four
"fixes" to it, with nothing to show for but code churn, and it
never even made the code any safer.
In all of the cases that ended up getting "fixed", the structure
is also internally inconsistent after the n_channels setting as
the channel list isn't actually filled yet. You cannot scan with
such a structure, that's just wrong. In mac80211, the struct is
also reused multiple times, so initializing it once is no good.
Some previous "fixes" (e.g. one in brcm80211) are also just setting
n_channels before accessing the array, under the assumption that the
code is correct and the array can be accessed, further showing that
the whole thing is just pointless when the allocation count and use
count are not separate.
If we really wanted to fix it, we'd need to separately track the
number of channels allocated and the number of channels currently
used, but given that no bugs were found despite the numerous syzbot
reports, that'd just be a waste of time.
Remove the __counted_by() annotation. We really should also remove
a number of the n_channels settings that are setting up a structure
that's inconsistent, but that can wait.
Reported-by: syzbot+e834e757bd9b3d3e1251@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=e834e757bd9b3d3e1251
Fixes: e3eac9f32e ("wifi: cfg80211: Annotate struct cfg80211_scan_request with __counted_by")
Link: https://patch.msgid.link/20250714142130.9b0bbb7e1f07.I09112ccde72d445e11348fc2bef68942cb2ffc94@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
These functions to not modify the skb, add a const qualifier.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250711114006.480026-7-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Add a new SNMP MIB : LINUX_MIB_BEYOND_WINDOW
Incremented when an incoming packet is received beyond the
receiver window.
nstat -az | grep TcpExtBeyondWindow
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250711114006.480026-3-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Currently, TCP accepts incoming packets which might go beyond the
offered RWIN.
Add to tcp_sequence() the validation of packet end sequence.
Add the corresponding check in the fast path.
We relax this new constraint if the receive queue is empty,
to not freeze flows from buggy peers.
Add a new drop reason : SKB_DROP_REASON_TCP_INVALID_END_SEQUENCE.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250711114006.480026-2-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
x25_terminate_link() has been unused since the last use was removed
in 2020 by:
commit 7eed751b3b ("net/x25: handle additional netdev events")
Remove it.
Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org>
Acked-by: Martin Schiller <ms@dev.tdt.de>
Link: https://patch.msgid.link/20250712205759.278777-1-linux@treblig.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This reverts commit 465b9ee0ee.
Such notifications fit better into core or nfnetlink_hook code,
following the NFNL_MSG_HOOK_GET message format.
Signed-off-by: Phil Sutter <phil@nwl.cc>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Also storing tcf_action into struct tcf_skbedit_params
makes sure there is no discrepancy in tcf_skbedit_act().
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250709090204.797558-12-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Also storing tcf_action into struct tcf_police_params
makes sure there is no discrepancy in tcf_police_act().
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250709090204.797558-11-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Also storing tcf_action into struct tcf_pedit_params
makes sure there is no discrepancy in tcf_pedit_act().
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250709090204.797558-10-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Also storing tcf_action into struct tcf_nat_params
makes sure there is no discrepancy in tcf_nat_act().
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250709090204.797558-9-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Also storing tcf_action into struct tcf_mpls_params
makes sure there is no discrepancy in tcf_mpls_act().
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250709090204.797558-8-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Also storing tcf_action into struct tcf_ctinfo_params
makes sure there is no discrepancy in tcf_ctinfo_act().
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250709090204.797558-7-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Commit 21c167aa0b ("net/sched: act_ctinfo: use percpu stats")
missed that stats_dscp_set, stats_dscp_error and stats_cpmark_set
might be written (and read) locklessly.
Use atomic64_t for these three fields, I doubt act_ctinfo is used
heavily on big SMP hosts anyway.
Fixes: 24ec483cec ("net: sched: Introduce act_ctinfo action")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Pedro Tammela <pctammela@mojatatu.com>
Link: https://patch.msgid.link/20250709090204.797558-6-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Also storing tcf_action into struct tcf_ct_params
makes sure there is no discrepancy in tcf_ct_act().
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250709090204.797558-5-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Also storing tcf_action into struct tcf_csum_params
makes sure there is no discrepancy in tcf_csum_act().
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250709090204.797558-4-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Also storing tcf_action into struct tcf_connmark_parms
makes sure there is no discrepancy in tcf_connmark_act().
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250709090204.797558-3-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
tcf_tm_dump() reads fields that can be changed concurrently,
and tcf_lastuse_update() might race against itself.
Add READ_ONCE() and WRITE_ONCE() annotations.
Fetch jiffies once in tcf_tm_dump().
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250709090204.797558-2-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Cross-merge networking fixes after downstream PR (net-6.16-rc6-2).
No conflicts.
Adjacent changes:
drivers/net/wireless/mediatek/mt76/mt7925/mcu.c
c701574c54 ("wifi: mt76: mt7925: fix invalid array index in ssid assignment during hw scan")
b3a431fe2e ("wifi: mt76: mt7925: fix off by one in mt7925_mcu_hw_scan()")
drivers/net/wireless/mediatek/mt76/mt7996/mac.c
62da647a2b ("wifi: mt76: mt7996: Add MLO support to mt7996_tx_check_aggr()")
dc66a129ad ("wifi: mt76: add a wrapper for wcid access with validation")
drivers/net/wireless/mediatek/mt76/mt7996/main.c
3dd6f67c66 ("wifi: mt76: Move RCU section in mt7996_mcu_add_rate_ctrl()")
8989d8e90f ("wifi: mt76: mt7996: Do not set wcid.sta to 1 in mt7996_mac_sta_event()")
net/mac80211/cfg.c
58fcb1b428 ("wifi: mac80211: reject VHT opmode for unsupported channel widths")
037dc18ac3 ("wifi: mac80211: add support for storing station S1G capabilities")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Use attach_type in bpf_link to replace the location filed, and
remove location field in tcx_link.
Signed-off-by: Tao Chen <chen.dylane@linux.dev>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20250710032038.888700-5-chen.dylane@linux.dev
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEjF9xRqF1emXiQiqU1w0aZmrPKyEFAmhvEc8ACgkQ1w0aZmrP
KyGYmBAAxw/EnlVp9sHWAPBNka6KOQECAb1LrnBVjv5nFZK3jHy8ehb/XjW0Q/rR
ScjTgGJLBVuXMSGfFYBsDezesLYZLs/QK3amsTP9+HtYYGdOxdgvmTPMrUe2mF4W
He/xrvWpVKTV00vnpVvZe5zbWnpw9iDTXbDecxmArqNVu78LOsRFU20S2MPc86ne
wbgJohCS/QvhHPsKBUd83niGtKXR8uaFitLDaCgaaqqF4noXfNxIpdxm2O3NIqu1
2gbEAp3qi1YWHUYHQUxsK7W7JjY8xfd9EFLViRdQCI0aevE01GmwJc4J2oDlZ7Ki
DqRFEYzpwpxqe0YwihAsB72faXMOsWFODwIiojlAue26W5iVLzu+w7hBUiOBiq6E
CEBH2OSL47n3s/Va6e1Gb9W2A24x+Qp+EZEZb0Ci0I0MlstOw0aUutoc/P6F9MxI
v9ApavHkt5/FImeXiTB3j61ZodgduC5WLZ7HCM5EjrCjlZznsJFuxtWSYARyYTUM
aj3U00KRW85WgeuvWj1aTGTiT7dmTsQVo7XllEpzfXWvVxE0ATwfiLRfUQwtUwGx
6SecA49s34AwgdUKMynfPtlkxIzraGjhzJmGdTCA1Hi1fw2TUYniASmHRDKVQvUA
1jYCQucYf+Q1zybosscXepUCtsVSPTZTZE/gQSDCUy6WNvHnv84=
=zECx
-----END PGP SIGNATURE-----
Merge tag 'nf-next-25-07-10' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next
Pablo Neira Ayuso says:
====================
Netfilter updates for net-next (v2)
The following series contains an initial small batch of Netfilter
updates for net-next:
1) Remove DCCP conntrack support, keep DCCP matches around in order to
avoid breakage when loading ruleset, add Kconfig to wrap the code
so it can be disabled by distributors.
2) Remove buggy code aiming at shrinking netlink deletion event, then
re-add it correctly in another patch. This is to prevent -stable to
pick up on a fix that breaks old userspace. From Phil Sutter.
3) Missing WARN_ON_ONCE() to check for lockdep_commit_lock_is_held()
to uncover bugs. From Fedor Pchelkin.
* tag 'nf-next-25-07-10' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next:
netfilter: nf_tables: adjust lockdep assertions handling
netfilter: nf_tables: Reintroduce shortened deletion notifications
netfilter: nf_tables: Drop dead code from fill_*_info routines
netfilter: conntrack: remove DCCP protocol support
====================
Link: https://patch.msgid.link/20250710010706.2861281-1-pablo@netfilter.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
- mt76: firmware recovery improvements, MLO work
- iwlwifi: use embedded PNVM in (to be released) FW images
to fix compatibility issues
- cfg80211/mac80211: extended regulatory info support (6 GHz)
- cfg80211: use "faux device" for regulatory
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEpeA8sTs3M8SN2hR410qiO8sPaAAFAmhvskoACgkQ10qiO8sP
aACdthAAiN1wzi9Hrpm2ULqDecEL553ecvfOPAzlYibF4wiulVEdKJd7nFn03udA
n2ylEgiANbTb+yEgDjj6IwQ9w5XmYvlLSGO/wIag3UUSWUYuCC5if/JEq56lom+B
IZFMOFXpvO35jSe1h1v0HuuJvbeZv7KDWMJlA4aSXbLx4Y1juAuQid4/YllUkuXt
QgAziAhF7laNk+8nfQLQ3N1DQytftiDK32vCJ6kJ7ciEhh8qxwT5aVkmE0Q/iOLg
na4TdrcRRMQ7kkgODqksGz1nVrr/0AHyMVi/rQ/YaL1uY8fBxvbs0nmjU8G78LkO
gM/401kySEzgLl6x/xrnaVUupbAtsyYrHYH+4q6Z07UPg1OgYY4G71rJIMp/Hq52
/iQVMOFmbhoIqFLqlvmh92QqZRH+rMf/KnO9Pnh/wVdxGH74ZP0bFYB02bqdaRie
uxtO9lHfnuMedET85D747rmgAqrBJ2t3BqAD++LNdqF530eOaWiBkAPeAbTRSgis
d1BJoDtWL3l0tjAA1ivdUw/fPjqWxffpdDTtzSwRjuFUNE5YPpz3VuKNSurSgvMZ
DCSvxTHf5DLVrg6b4W/YXickD2kArJLpaCwkxzkpyBzh6wyRvxs+b/oZSQRVeKvL
C36I3WNDqARfC29ilYUiz/G8kJry8IQaSuLaMP6X92Fa6Hdl7Kg=
=qkYl
-----END PGP SIGNATURE-----
Merge tag 'wireless-next-2025-07-10' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next
Johannes Berg says:
====================
Quite a bit more work, notably:
- mt76: firmware recovery improvements, MLO work
- iwlwifi: use embedded PNVM in (to be released) FW images
to fix compatibility issues
- cfg80211/mac80211: extended regulatory info support (6 GHz)
- cfg80211: use "faux device" for regulatory
* tag 'wireless-next-2025-07-10' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next: (48 commits)
wifi: mac80211: don't complete management TX on SAE commit
wifi: cfg80211/mac80211: implement dot11ExtendedRegInfoSupport
wifi: mac80211: send extended MLD capa/ops if AP has it
wifi: mac80211: copy first_part into HW scan
wifi: cfg80211: add a flag for the first part of a scan
wifi: mac80211: remove DISALLOW_PUNCTURING_5GHZ code
wifi: cfg80211: only verify part of Extended MLD Capabilities
wifi: nl80211: make nl80211_check_scan_flags() type safe
wifi: cfg80211: hide scan internals
wifi: mac80211: fix deactivated link CSA
wifi: mac80211: add mandatory bitrate support for 6 GHz
wifi: mac80211: remove spurious blank line
wifi: mac80211: verify state before connection
wifi: mac80211: avoid weird state in error path
wifi: iwlwifi: mvm: remove support for iwl_wowlan_info_notif_v4
wifi: iwlwifi: bump minimum API version in BZ
wifi: iwlwifi: mvm: remove unneeded argument
wifi: iwlwifi: mvm: remove MLO GTK rekey code
wifi: iwlwifi: pcie: rename iwl_pci_gen1_2_probe() argument
wifi: iwlwifi: match discrete/integrated to fix some names
...
====================
Link: https://patch.msgid.link/20250710123113.24878-3-johannes@sipsolutions.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
ND_PRINTK with val > 1 only works when the ND_DEBUG was set in compilation
phase. Replace it with dynamic debug. Convert ND_PRINTK with val <= 1 to
net_{err,warn}_ratelimited, and convert the rest to net_dbg_ratelimited.
Suggested-by: Ido Schimmel <idosch@idosch.org>
Signed-off-by: Wang Liang <wangliang74@huawei.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Link: https://patch.msgid.link/20250708033342.1627636-1-wangliang74@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This patch provides a setsockopt method to let applications leverage to
adjust how many descs to be handled at most in one send syscall. It
mitigates the situation where the default value (32) that is too small
leads to higher frequency of triggering send syscall.
Considering the prosperity/complexity the applications have, there is no
absolutely ideal suggestion fitting all cases. So keep 32 as its default
value like before.
The patch does the following things:
- Add XDP_MAX_TX_SKB_BUDGET socket option.
- Set max_tx_budget to 32 by default in the initialization phase as a
per-socket granular control.
- Set the range of max_tx_budget as [32, xs->tx->nentries].
The idea behind this comes out of real workloads in production. We use a
user-level stack with xsk support to accelerate sending packets and
minimize triggering syscalls. When the packets are aggregated, it's not
hard to hit the upper bound (namely, 32). The moment user-space stack
fetches the -EAGAIN error number passed from sendto(), it will loop to try
again until all the expected descs from tx ring are sent out to the driver.
Enlarging the XDP_MAX_TX_SKB_BUDGET value contributes to less frequency of
sendto() and higher throughput/PPS.
Here is what I did in production, along with some numbers as follows:
For one application I saw lately, I suggested using 128 as max_tx_budget
because I saw two limitations without changing any default configuration:
1) XDP_MAX_TX_SKB_BUDGET, 2) socket sndbuf which is 212992 decided by
net.core.wmem_default. As to XDP_MAX_TX_SKB_BUDGET, the scenario behind
this was I counted how many descs are transmitted to the driver at one
time of sendto() based on [1] patch and then I calculated the
possibility of hitting the upper bound. Finally I chose 128 as a
suitable value because 1) it covers most of the cases, 2) a higher
number would not bring evident results. After twisting the parameters,
a stable improvement of around 4% for both PPS and throughput and less
resources consumption were found to be observed by strace -c -p xxx:
1) %time was decreased by 7.8%
2) error counter was decreased from 18367 to 572
[1]: https://lore.kernel.org/all/20250619093641.70700-1-kerneljasonxing@gmail.com/
Signed-off-by: Jason Xing <kernelxing@tencent.com>
Acked-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Link: https://patch.msgid.link/20250704160138.48677-1-kerneljasonxing@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
To prevent a potential crash in agg_dequeue (net/sched/sch_qfq.c)
when cl->qdisc->ops->peek(cl->qdisc) returns NULL, we check the return
value before using it, similar to the existing approach in sch_hfsc.c.
To avoid code duplication, the following changes are made:
1. Changed qdisc_warn_nonwc(include/net/pkt_sched.h) into a static
inline function.
2. Moved qdisc_peek_len from net/sched/sch_hfsc.c to
include/net/pkt_sched.h so that sch_qfq can reuse it.
3. Applied qdisc_peek_len in agg_dequeue to avoid crashing.
Signed-off-by: Xiang Mei <xmei5@asu.edu>
Reviewed-by: Cong Wang <xiyou.wangcong@gmail.com>
Link: https://patch.msgid.link/20250705212143.3982664-1-xmei5@asu.edu
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Add a new device generic parameter to specify clock ID that should
be used by the device for registering DPLL devices and pins.
Signed-off-by: Ivan Vecera <ivecera@redhat.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Link: https://patch.msgid.link/20250704182202.1641943-5-ivecera@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Only 8, 16 and 32-bit integers are supported for numeric devlink
parameters. The subsequent patch adds support for DPLL clock ID
that is defined as 64-bit number. Add support for u64 parameter
type.
Signed-off-by: Ivan Vecera <ivecera@redhat.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Link: https://patch.msgid.link/20250704182202.1641943-4-ivecera@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When SAE commit is sent and received in response, there's no
ordering for the SAE confirm messages. As such, don't call
drivers to stop listening on the channel when the confirm
message is still expected.
This fixes an issue if the local confirm is transmitted later
than the AP's confirm, for iwlwifi (and possibly mt76) the
AP's confirm would then get lost since the device isn't on
the channel at the time the AP transmit the confirm.
For iwlwifi at least, this also improves the overall timing
of the authentication handshake (by about 15ms according to
the report), likely since the session protection won't be
aborted and rescheduled.
Note that even before this, mgd_complete_tx() wasn't always
called for each call to mgd_prepare_tx() (e.g. in the case
of WEP key shared authentication), and the current drivers
that have the complete callback don't seem to mind. Document
this as well though.
Reported-by: Jan Hendrik Farr <kernel@jfarr.cc>
Closes: https://lore.kernel.org/all/aB30Ea2kRG24LINR@archlinux/
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
Link: https://patch.msgid.link/20250609213232.12691580e140.I3f1d3127acabcd58348a110ab11044213cf147d3@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
When there are no non-6 GHz channels, then the 6 GHz scan is the first
part of a split scan. Add a boolean denoting whether the scan is the
first part of a scan as it might be useful to drivers for internal
bookkeeping. This flag is also set if the scan is not split.
Signed-off-by: Benjamin Berg <benjamin.berg@intel.com>
Reviewed-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
Link: https://patch.msgid.link/20250609213231.07e5a8a452ec.Ibf18f513e507422078fb31b28947e582a20df87a@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Hide the internal scan fields from mac80211 and drivers, the
'notified' variable is for internal tracking, and the 'info'
is output that's passed to cfg80211_scan_done() and stored
only for delayed userspace notification.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Reviewed-by: Benjamin Berg <benjamin.berg@intel.com>
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
Link: https://patch.msgid.link/20250609213231.6a62e41858e2.I004f66e9c087cc6e6ae4a24951cf470961ee9466@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
If we get to the error path of ieee80211_prep_connection, for example
because of a FW issue, then ieee80211_vif_set_links is called
with 0.
But the call to drv_change_vif_links from ieee80211_vif_update_links
will probably fail as well, for the same reason.
In this case, the valid_links and active_links bitmaps will be reverted
to the value of the failing connection.
Then, in the next connection, due to the logic of
ieee80211_set_vif_links_bitmaps, valid_links will be set to the ID of
the new connection assoc link, but the active_links will remain with the
ID of the old connection's assoc link.
If those IDs are different, we get into a weird state of valid_links and
active_links being different. One of the consequences of this state is
to call drv_change_vif_links with new_links as 0, since the & operation
between the bitmaps will be 0.
Since a removal of a link should always succeed, ignore the return value
of drv_change_vif_links if it was called to only remove links, which is
the case for the ieee80211_prep_connection's error path.
That way, the bitmaps will not be reverted to have the value from the
failing connection and will have 0, so the next connection will have a
good state.
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
Reviewed-by: Johannes Berg <johannes.berg@intel.com>
Link: https://patch.msgid.link/20250609213231.ba2011fb435f.Id87ff6dab5e1cf757b54094ac2d714c656165059@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
We have an application that uses almost the same code for TCP and
AF_UNIX (SOCK_STREAM).
TCP can use TCP_INQ, but AF_UNIX doesn't have it and requires an
extra syscall, ioctl(SIOCINQ) or getsockopt(SO_MEMINFO) as an
alternative.
Let's introduce the generic version of TCP_INQ.
If SO_INQ is enabled, recvmsg() will put a cmsg of SCM_INQ that
contains the exact value of ioctl(SIOCINQ). The cmsg is also
included when msg->msg_get_inq is non-zero to make sockets
io_uring-friendly.
Note that SOCK_CUSTOM_SOCKOPT is flagged only for SOCK_STREAM to
override setsockopt() for SOL_SOCKET.
By having the flag in struct unix_sock, instead of struct sock, we
can later add SO_INQ support for TCP and reuse tcp_sk(sk)->recvmsg_inq.
Note also that supporting custom getsockopt() for SOL_SOCKET will need
preparation for other SOCK_CUSTOM_SOCKOPT users (UDP, vsock, MPTCP).
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20250702223606.1054680-7-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Compared to TCP, ioctl(SIOCINQ) for AF_UNIX SOCK_STREAM socket is more
expensive, as unix_inq_len() requires iterating through the receive queue
and accumulating skb->len.
Let's cache the value for SOCK_STREAM to a new field during sendmsg()
and recvmsg().
The field is protected by the receive queue lock.
Note that ioctl(SIOCINQ) for SOCK_DGRAM returns the length of the first
skb in the queue.
SOCK_SEQPACKET still requires iterating through the queue because we do
not touch functions shared with unix_dgram_ops. But, if really needed,
we can support it by switching __skb_try_recv_datagram() to a custom
version.
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20250702223606.1054680-5-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This reverts commit f75a2804da.
With all states (whether user or kern) removed from the hashtables
during deletion, there's no need for synchronous destruction of
states. xfrm6_tunnel states still need to have been destroyed (which
will be the case when its last user is deleted (not destroyed)) so
that xfrm6_tunnel_free_spi removes it from the per-netns hashtable
before the netns is destroyed.
This has the benefit of skipping one synchronize_rcu per state (in
__xfrm_state_destroy(sync=true)) when we exit a netns.
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
The ipcomp fallback tunnels currently get deleted (from the various
lists and hashtables) as the last user state that needed that fallback
is destroyed (not deleted). If a reference to that user state still
exists, the fallback state will remain on the hashtables/lists,
triggering the WARN in xfrm_state_fini. Because of those remaining
references, the fix in commit f75a2804da ("xfrm: destroy xfrm_state
synchronously on net exit path") is not complete.
We recently fixed one such situation in TCP due to defered freeing of
skbs (commit 9b6412e697 ("tcp: drop secpath at the same time as we
currently drop dst")). This can also happen due to IP reassembly: skbs
with a secpath remain on the reassembly queue until netns
destruction. If we can't guarantee that the queues are flushed by the
time xfrm_state_fini runs, there may still be references to a (user)
xfrm_state, preventing the timely deletion of the corresponding
fallback state.
Instead of chasing each instance of skbs holding a secpath one by one,
this patch fixes the issue directly within xfrm, by deleting the
fallback state as soon as the last user state depending on it has been
deleted. Destruction will still happen when the final reference is
dropped.
A separate lockdep class for the fallback state is required since
we're going to lock x->tunnel while x is locked.
Fixes: 9d4139c769 ("netns xfrm: per-netns xfrm_state_all list")
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
tc_action_net_exit() got an rtnl exclusion in commit
a159d3c4b8 ("net_sched: acquire RTNL in tc_action_net_exit()")
Since then, commit 16af606739 ("net: sched: implement reference
counted action release") made this RTNL exclusion obsolete for
most cases.
Only tcf_action_offload_del() might still require it.
Move the rtnl locking into tcf_idrinfo_destroy() when
an offload action is found.
Most netns do not have actions, yet deleting them is adding a lot
of pressure on RTNL, which is for many the most contended mutex
in the kernel.
We are moving to a per-netns 'rtnl', so tc_action_net_exit()
will not be able to grab 'rtnl' a single time for a batch of netns.
Before the patch:
perf probe -a rtnl_lock
perf record -e probe:rtnl_lock -a /bin/bash -c 'unshare -n "/bin/true"; sleep 1'
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.305 MB perf.data (25 samples) ]
After the patch:
perf record -e probe:rtnl_lock -a /bin/bash -c 'unshare -n "/bin/true"; sleep 1'
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.304 MB perf.data (9 samples) ]
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Vlad Buslov <vladbu@nvidia.com>
Cc: Jiri Pirko <jiri@resnulli.us>
Cc: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Link: https://patch.msgid.link/20250702071230.1892674-1-edumazet@google.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
This change allows for gateway routing, where a route table entry
may reference a routable endpoint (by network and EID), instead of
routing directly to a netdevice.
We add support for a RTM_GATEWAY attribute for netlink route updates,
with an attribute format of:
struct mctp_fq_addr {
unsigned int net;
mctp_eid_t eid;
}
- we need the net here to uniquely identify the target EID, as we no
longer have the device reference directly (which would provide the net
id in the case of direct routes).
This makes route lookups recursive, as a route lookup that returns a
gateway route must be resolved into a direct route (ie, to a device)
eventually. We provide a limit to the route lookups, to prevent infinite
loop routing.
The route lookup populates a new 'nexthop' field in the dst structure,
which now specifies the key for the neighbour table lookup on device
output, rather than using the packet destination address directly.
Signed-off-by: Jeremy Kerr <jk@codeconstruct.com.au>
Link: https://patch.msgid.link/20250702-dev-forwarding-v5-13-1468191da8a4@codeconstruct.com.au
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Now that we have the dst->haddr populated by sendmsg (when extended
addressing is in use), we no longer need to stash the link-layer address
in the skb->cb.
Instead, only use skb->cb for incoming lladdr data.
While we're at it: remove cb->src, as was never used.
Signed-off-by: Jeremy Kerr <jk@codeconstruct.com.au>
Link: https://patch.msgid.link/20250702-dev-forwarding-v5-4-1468191da8a4@codeconstruct.com.au
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
This change adds a struct mctp_dst, representing the result of a routing
lookup. This decouples the struct mctp_route from the actual
implementation of a routing operation.
This will allow for future routing changes which may require more
involved lookup logic, such as gateway routing - which may require
multiple traversals of the routing table.
Since we only use the struct mctp_route at lookup time, we no longer
hold routes over a routing operation, as we only need it to populate the
dst. However, we do hold the dev while the dst is active.
This requires some changes to the route test infrastructure, as we no
longer have a mock route to handle the route output operation, and
transient dsts are created by the routing code, so we can't override
them as easily.
Instead, we use kunit->priv to stash a packet queue, and a custom
dst_output function queues into that packet queue, which we can use for
later expectations.
Signed-off-by: Jeremy Kerr <jk@codeconstruct.com.au>
Link: https://patch.msgid.link/20250702-dev-forwarding-v5-3-1468191da8a4@codeconstruct.com.au
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Stacking technology is a type of technology used to expand ports on
Ethernet switches. It is widely used as a common access method in
large-scale Internet data center architectures. Years of practice
have proved that stacking technology has advantages and disadvantages
in high-reliability network architecture scenarios. For instance,
in stacking networking arch, conventional switch system upgrades
require multiple stacked devices to restart at the same time.
Therefore, it is inevitable that the business will be interrupted
for a while. It is for this reason that "no-stacking" in data centers
has become a trend. Additionally, when the stacking link connecting
the switches fails or is abnormal, the stack will split. Although it is
not common, it still happens in actual operation. The problem is that
after the split, it is equivalent to two switches with the same
configuration appearing in the network, causing network configuration
conflicts and ultimately interrupting the services carried by the
stacking system.
To improve network stability, "non-stacking" solutions have been
increasingly adopted, particularly by public cloud providers and
tech companies like Alibaba, Tencent, and Didi. "non-stacking" is
a method of mimicing switch stacking that convinces a LACP peer,
bonding in this case, connected to a set of "non-stacked" switches
that all of its ports are connected to a single switch
(i.e., LACP aggregator), as if those switches were stacked. This
enables the LACP peer's ports to aggregate together, and requires
(a) special switch configuration, described in the linked article,
and (b) modifications to the bonding 802.3ad (LACP) mode to send
all ARP/ND packets across all ports of the active aggregator.
Note that, with multiple aggregators, the current broadcast mode
logic will send only packets to the selected aggregator(s).
+-----------+ +-----------+
| switch1 | | switch2 |
+-----------+ +-----------+
^ ^
| |
+-----------------+
| bond4 lacp |
+-----------------+
| |
| NIC1 | NIC2
+-----------------+
| server |
+-----------------+
- https://www.ruijie.com/fr-fr/support/tech-gallery/de-stack-data-center-network-architecture/
Cc: Jay Vosburgh <jv@jvosburgh.net>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: Simon Horman <horms@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Andrew Lunn <andrew+netdev@lunn.ch>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: Tonghao Zhang <tonghao@bamaicloud.com>
Signed-off-by: Zengbing Tu <tuzengbing@didiglobal.com>
Link: https://patch.msgid.link/84d0a044514157bb856a10b6d03a1028c4883561.1751031306.git.tonghao@bamaicloud.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
- hci_sync: Fix not disabling advertising instance
- hci_core: Remove check of BDADDR_ANY in hci_conn_hash_lookup_big_state
- hci_sync: Fix attempting to send HCI_Disconnect to BIS handle
- hci_event: Fix not marking Broadcast Sink BIS as connected
-----BEGIN PGP SIGNATURE-----
iQJNBAABCgA3FiEE7E6oRXp8w05ovYr/9JCA4xAyCykFAmhmpiMZHGx1aXoudm9u
LmRlbnR6QGludGVsLmNvbQAKCRD0kIDjEDILKTZOD/9ggH1wJ52hD6NeRow2dZHt
BAppV0qNg7p9PkmYPDTOKNkdASmp5VfdLyCXH0vBUjC7+iutrkA/oq5UmIzAM7Uz
1HSELyvsLBL3Gj2AxWyElBtuwo3F7cA70veesZKDJJ92VHlCMMXPJugMCrcy0iJn
sVYENIJKtMnLJMTnnpmBCpRQszQfHGYnzQerbeUlCPoFgn7vsRy+onZ9XiIj6R2K
HGZVk6MhUnNWrXthwcgwAZ0SRq+5nhxhB4jOSSZ6nNnT9Wt8oloDM7KtYSy976QA
Ub2LzvR2E8/XtnTeWus5oLL2kAvz1vClS6gpGrvziElnIryq8aYhwKOO7yhRFmDK
kllpOIPaiDrl1H1nVR4z7t/IK5z/0A0wkTcXXm4WNZkLOj8YmY+BdjPKO39PRxK4
PpcHzsmI6PQjnLeGRi3a4PYfMgVYRvdO/OKSO9/xphqgE+dQdonH7/Dsvw+QkEw8
TgFvfKj8bMNjd11Mynd46P45tnQ3HIrTSYGTbdvoU/n8ZFEjATuHtIxRscw0bRmj
iy4u55JV+U3BAdm2CsRouiiTHnvXt9S2HMyMnemwwxsGpim/JG+IsccYUZd8mvLf
d+N2LWsq71CpVhRSJ50WJDrCEJdKT71KwHbSYLJl8vGhObV3VpTBZ57eVF98SNjm
Zqg8JqLJFvCLOhl4RWj/xA==
=o7T9
-----END PGP SIGNATURE-----
Merge tag 'for-net-2025-07-03' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth
Luiz Augusto von Dentz says:
====================
bluetooth pull request for net:
- hci_sync: Fix not disabling advertising instance
- hci_core: Remove check of BDADDR_ANY in hci_conn_hash_lookup_big_state
- hci_sync: Fix attempting to send HCI_Disconnect to BIS handle
- hci_event: Fix not marking Broadcast Sink BIS as connected
* tag 'for-net-2025-07-03' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth:
Bluetooth: hci_event: Fix not marking Broadcast Sink BIS as connected
Bluetooth: hci_sync: Fix attempting to send HCI_Disconnect to BIS handle
Bluetooth: hci_core: Remove check of BDADDR_ANY in hci_conn_hash_lookup_big_state
Bluetooth: hci_sync: Fix not disabling advertising instance
====================
Link: https://patch.msgid.link/20250703160409.1791514-1-luiz.dentz@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The page pool members in struct page cannot be removed unless it's not
allowed to access any of them via struct page.
Do not access 'page->dma_addr' directly in page_pool_get_dma_addr() but
just wrap page_pool_get_dma_addr_netmem() safely.
Signed-off-by: Byungchul Park <byungchul@sk.com>
Reviewed-by: Mina Almasry <almasrymina@google.com>
Reviewed-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Acked-by: Jesper Dangaard Brouer <hawk@kernel.org>
Link: https://patch.msgid.link/20250702053256.4594-6-byungchul@sk.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The current page_to_netmem() doesn't cover const casting resulting in
trying to cast const struct page * to const netmem_ref fails.
To cover the case, change page_to_netmem() to use macro and _Generic.
Signed-off-by: Byungchul Park <byungchul@sk.com>
Reviewed-by: Mina Almasry <almasrymina@google.com>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://patch.msgid.link/20250702053256.4594-5-byungchul@sk.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
From commit 634f1a7110 ("vsock: support sockmap"), `struct proto
vsock_proto`, defined in af_vsock.c, is not static anymore, since it's
used by vsock_bpf.c.
If CONFIG_BPF_SYSCALL is not defined, `make C=2` will print a warning:
$ make O=build C=2 W=1 net/vmw_vsock/
...
CC [M] net/vmw_vsock/af_vsock.o
CHECK ../net/vmw_vsock/af_vsock.c
../net/vmw_vsock/af_vsock.c:123:14: warning: symbol 'vsock_proto' was not declared. Should it be static?
Declare `vsock_proto` regardless of CONFIG_BPF_SYSCALL, since it's defined
in af_vsock.c, which is built regardless of CONFIG_BPF_SYSCALL.
Fixes: 634f1a7110 ("vsock: support sockmap")
Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Link: https://patch.msgid.link/20250703112329.28365-1-sgarzare@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The check for destination to be BDADDR_ANY is no longer necessary with
the introduction of BIS_LINK.
Fixes: 23205562ff ("Bluetooth: separate CIS_LINK and BIS_LINK link types")
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
The DCCP socket family has now been removed from this tree, see:
8bb3212be4 ("Merge branch 'net-retire-dccp-socket'")
Remove connection tracking and NAT support for this protocol, this
should not pose a problem because no DCCP traffic is expected to be seen
on the wire.
As for the code for matching on dccp header for iptables and nftables,
mark it as deprecated and keep it in place. Ruleset restoration is an
atomic operation. Without dccp matching support, an astray match on dccp
could break this operation leaving your computer with no policy in
place, so let's follow a more conservative approach for matches.
Add CONFIG_NFT_EXTHDR_DCCP which is set to 'n' by default to deprecate
dccp extension support. Similarly, label CONFIG_NETFILTER_XT_MATCH_DCCP
as deprecated too and also set it to 'n' by default.
Code to match on DCCP protocol from ebtables also remains in place, this
is just a few checks on IPPROTO_DCCP from _check() path which is
exercised when ruleset is loaded. There is another use of IPPROTO_DCCP
from the _check() path in the iptables multiport match. Another check
for IPPROTO_DCCP from the packet in the reject target is also removed.
So let's schedule removal of the dccp matching for a second stage, this
should not interfer with the dccp retirement since this is only matching
on the dccp header.
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Introduce support for specifying relative bandwidth shares between
traffic classes (TC) in the devlink-rate API. This new option allows
users to allocate bandwidth across multiple traffic classes in a
single command.
This feature provides a more granular control over traffic management,
especially for scenarios requiring Enhanced Transmission Selection.
Users can now define a relative bandwidth share for each traffic class.
For example, assigning share values of 20 to TC0 (TCP/UDP) and 80 to TC5
(RoCE) will result in TC0 receiving 20% and TC5 receiving 80% of the
total bandwidth. The actual percentage each class receives depends on
the ratio of its share value to the sum of all shares.
Example:
DEV=pci/0000:08:00.0
$ devlink port function rate add $DEV/vfs_group tx_share 10Gbit \
tx_max 50Gbit tc-bw 0:20 1:0 2:0 3:0 4:0 5:80 6:0 7:0
$ devlink port function rate set $DEV/vfs_group \
tc-bw 0:20 1:0 2:0 3:0 4:0 5:20 6:60 7:0
Example usage with ynl:
./tools/net/ynl/cli.py --spec Documentation/netlink/specs/devlink.yaml \
--do rate-set --json '{
"bus-name": "pci",
"dev-name": "0000:08:00.0",
"port-index": 1,
"rate-tc-bws": [
{"rate-tc-index": 0, "rate-tc-bw": 50},
{"rate-tc-index": 1, "rate-tc-bw": 50},
{"rate-tc-index": 2, "rate-tc-bw": 0},
{"rate-tc-index": 3, "rate-tc-bw": 0},
{"rate-tc-index": 4, "rate-tc-bw": 0},
{"rate-tc-index": 5, "rate-tc-bw": 0},
{"rate-tc-index": 6, "rate-tc-bw": 0},
{"rate-tc-index": 7, "rate-tc-bw": 0}
]
}'
./tools/net/ynl/cli.py --spec Documentation/netlink/specs/devlink.yaml \
--do rate-get --json '{
"bus-name": "pci",
"dev-name": "0000:08:00.0",
"port-index": 1
}'
output for rate-get:
{'bus-name': 'pci',
'dev-name': '0000:08:00.0',
'port-index': 1,
'rate-tc-bws': [{'rate-tc-bw': 50, 'rate-tc-index': 0},
{'rate-tc-bw': 50, 'rate-tc-index': 1},
{'rate-tc-bw': 0, 'rate-tc-index': 2},
{'rate-tc-bw': 0, 'rate-tc-index': 3},
{'rate-tc-bw': 0, 'rate-tc-index': 4},
{'rate-tc-bw': 0, 'rate-tc-index': 5},
{'rate-tc-bw': 0, 'rate-tc-index': 6},
{'rate-tc-bw': 0, 'rate-tc-index': 7}],
'rate-tx-max': 0,
'rate-tx-priority': 0,
'rate-tx-share': 0,
'rate-tx-weight': 0,
'rate-type': 'leaf'}
Signed-off-by: Carolina Jubran <cjubran@nvidia.com>
Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Mark Bloch <mbloch@nvidia.com>
Link: https://patch.msgid.link/20250629142138.361537-3-mbloch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Add the nlmsg_for_each_attr_type() macro to simplify iteration over
attributes of a specific type in a Netlink message.
Convert existing users in vxlan and nfsd to use the new macro.
Suggested-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Carolina Jubran <cjubran@nvidia.com>
Signed-off-by: Mark Bloch <mbloch@nvidia.com>
Link: https://patch.msgid.link/20250629142138.361537-2-mbloch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Use the new helpers as a step to deal with potential dst->dev races.
v2: fix typo in ipv6_rthdr_rcv() (kernel test robot <lkp@intel.com>)
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250630121934.3399505-10-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Use the new helper as a step to deal with potential dst->dev races.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250630121934.3399505-9-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Use the new helpers as a first step to deal with
potential dst->dev races.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250630121934.3399505-8-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
dst->dev is read locklessly in many contexts,
and written in dst_dev_put().
Fixing all the races is going to need many changes.
We probably will have to add full RCU protection.
Add three helpers to ease this painful process.
static inline struct net_device *dst_dev(const struct dst_entry *dst)
{
return READ_ONCE(dst->dev);
}
static inline struct net_device *skb_dst_dev(const struct sk_buff *skb)
{
return dst_dev(skb_dst(skb));
}
static inline struct net *skb_dst_dev_net(const struct sk_buff *skb)
{
return dev_net(skb_dst_dev(skb));
}
static inline struct net *skb_dst_dev_net_rcu(const struct sk_buff *skb)
{
return dev_net_rcu(skb_dst_dev(skb));
}
Fixes: 4a6ce2b6f2 ("net: introduce a new function dst_dev_put()")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250630121934.3399505-7-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
dst_dev_put() can overwrite dst->output while other
cpus might read this field (for instance from dst_output())
Add READ_ONCE()/WRITE_ONCE() annotations to suppress
potential issues.
We will likely need RCU protection in the future.
Fixes: 4a6ce2b6f2 ("net: introduce a new function dst_dev_put()")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250630121934.3399505-6-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
dst_dev_put() can overwrite dst->input while other
cpus might read this field (for instance from dst_input())
Add READ_ONCE()/WRITE_ONCE() annotations to suppress
potential issues.
We will likely need full RCU protection later.
Fixes: 4a6ce2b6f2 ("net: introduce a new function dst_dev_put()")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250630121934.3399505-5-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
(dst_entry)->lastuse is read and written locklessly,
add corresponding annotations.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250630121934.3399505-4-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
(dst_entry)->expires is read and written locklessly,
add corresponding annotations.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250630121934.3399505-3-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
____cacheline_aligned_in_smp attribute only makes sure to align
a field to a cache line. It does not prevent the linker to use
the remaining of the cache line for other variables, causing
potential false sharing.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20250630093540.3052835-5-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
____cacheline_aligned_in_smp attribute only makes sure to align
a field to a cache line. It does not prevent the linker to use
the remaining of the cache line for other variables, causing
potential false sharing.
Move tcp_memory_allocated into a dedicated cache line.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20250630093540.3052835-4-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Using per-cpu data for net->net_cookie generation is overkill,
because even busy hosts do not create hundreds of netns per second.
Make sure to put net_cookie in a private cache line to avoid
potential false sharing.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20250630093540.3052835-3-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This structure will hold networking data that must
consume a full cache line to avoid accidental false sharing.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20250630093540.3052835-2-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Upon receiving the Reset Request, pause the connection and clean up
queues, wait for the specified period, then resume the NIC.
In the cleanup phase, the HWC is no longer responding, so set hwc_timeout
to zero to skip waiting on the response.
Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
Link: https://patch.msgid.link/1751055983-29760-1-git-send-email-haiyangz@linux.microsoft.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
tl;dr
=====
Add a new neighbor flag ("extern_valid") that can be used to indicate to
the kernel that a neighbor entry was learned and determined to be valid
externally. The kernel will not try to remove or invalidate such an
entry, leaving these decisions to the user space control plane. This is
needed for EVPN multi-homing where a neighbor entry for a multi-homed
host needs to be synced across all the VTEPs among which the host is
multi-homed.
Background
==========
In a typical EVPN multi-homing setup each host is multi-homed using a
set of links called ES (Ethernet Segment, i.e., LAG) to multiple leaf
switches (VTEPs). VTEPs that are connected to the same ES are called ES
peers.
When a neighbor entry is learned on a VTEP, it is distributed to both ES
peers and remote VTEPs using EVPN MAC/IP advertisement routes. ES peers
use the neighbor entry when routing traffic towards the multi-homed host
and remote VTEPs use it for ARP/NS suppression.
Motivation
==========
If the ES link between a host and the VTEP on which the neighbor entry
was locally learned goes down, the EVPN MAC/IP advertisement route will
be withdrawn and the neighbor entries will be removed from both ES peers
and remote VTEPs. Routing towards the multi-homed host and ARP/NS
suppression can fail until another ES peer locally learns the neighbor
entry and distributes it via an EVPN MAC/IP advertisement route.
"draft-rbickhart-evpn-ip-mac-proxy-adv-03" [1] suggests avoiding these
intermittent failures by having the ES peers install the neighbor
entries as before, but also injecting EVPN MAC/IP advertisement routes
with a proxy indication. When the previously mentioned ES link goes down
and the original EVPN MAC/IP advertisement route is withdrawn, the ES
peers will not withdraw their neighbor entries, but instead start aging
timers for the proxy indication.
If an ES peer locally learns the neighbor entry (i.e., it becomes
"reachable"), it will restart its aging timer for the entry and emit an
EVPN MAC/IP advertisement route without a proxy indication. An ES peer
will stop its aging timer for the proxy indication if it observes the
removal of the proxy indication from at least one of the ES peers
advertising the entry.
In the event that the aging timer for the proxy indication expired, an
ES peer will withdraw its EVPN MAC/IP advertisement route. If the timer
expired on all ES peers and they all withdrew their proxy
advertisements, the neighbor entry will be completely removed from the
EVPN fabric.
Implementation
==============
In the above scheme, when the control plane (e.g., FRR) advertises a
neighbor entry with a proxy indication, it expects the corresponding
entry in the data plane (i.e., the kernel) to remain valid and not be
removed due to garbage collection or loss of carrier. The control plane
also expects the kernel to notify it if the entry was learned locally
(i.e., became "reachable") so that it will remove the proxy indication
from the EVPN MAC/IP advertisement route. That is why these entries
cannot be programmed with dummy states such as "permanent" or "noarp".
Instead, add a new neighbor flag ("extern_valid") which indicates that
the entry was learned and determined to be valid externally and should
not be removed or invalidated by the kernel. The kernel can probe the
entry and notify user space when it becomes "reachable" (it is initially
installed as "stale"). However, if the kernel does not receive a
confirmation, have it return the entry to the "stale" state instead of
the "failed" state.
In other words, an entry marked with the "extern_valid" flag behaves
like any other dynamically learned entry other than the fact that the
kernel cannot remove or invalidate it.
One can argue that the "extern_valid" flag should not prevent garbage
collection and that instead a neighbor entry should be programmed with
both the "extern_valid" and "extern_learn" flags. There are two reasons
for not doing that:
1. Unclear why a control plane would like to program an entry that the
kernel cannot invalidate but can completely remove.
2. The "extern_learn" flag is used by FRR for neighbor entries learned
on remote VTEPs (for ARP/NS suppression) whereas here we are
concerned with local entries. This distinction is currently irrelevant
for the kernel, but might be relevant in the future.
Given that the flag only makes sense when the neighbor has a valid
state, reject attempts to add a neighbor with an invalid state and with
this flag set. For example:
# ip neigh add 192.0.2.1 nud none dev br0.10 extern_valid
Error: Cannot create externally validated neighbor with an invalid state.
# ip neigh add 192.0.2.1 lladdr 00:11:22:33:44:55 nud stale dev br0.10 extern_valid
# ip neigh replace 192.0.2.1 nud failed dev br0.10 extern_valid
Error: Cannot mark neighbor as externally validated with an invalid state.
The above means that a neighbor cannot be created with the
"extern_valid" flag and flags such as "use" or "managed" as they result
in a neighbor being created with an invalid state ("none") and
immediately getting probed:
# ip neigh add 192.0.2.1 lladdr 00:11:22:33:44:55 nud stale dev br0.10 extern_valid use
Error: Cannot create externally validated neighbor with an invalid state.
However, these flags can be used together with "extern_valid" after the
neighbor was created with a valid state:
# ip neigh add 192.0.2.1 lladdr 00:11:22:33:44:55 nud stale dev br0.10 extern_valid
# ip neigh replace 192.0.2.1 lladdr 00:11:22:33:44:55 nud stale dev br0.10 extern_valid use
One consequence of preventing the kernel from invalidating a neighbor
entry is that by default it will only try to determine reachability
using unicast probes. This can be changed using the "mcast_resolicit"
sysctl:
# sysctl net.ipv4.neigh.br0/10.mcast_resolicit
0
# tcpdump -nn -e -i br0.10 -Q out arp &
# ip neigh replace 192.0.2.1 lladdr 00:11:22:33:44:55 nud stale dev br0.10 extern_valid use
62:50:1d:11:93:6f > 00:11:22:33:44:55, ethertype ARP (0x0806), length 42: Request who-has 192.0.2.1 tell 192.0.2.2, length 28
62:50:1d:11:93:6f > 00:11:22:33:44:55, ethertype ARP (0x0806), length 42: Request who-has 192.0.2.1 tell 192.0.2.2, length 28
62:50:1d:11:93:6f > 00:11:22:33:44:55, ethertype ARP (0x0806), length 42: Request who-has 192.0.2.1 tell 192.0.2.2, length 28
# sysctl -wq net.ipv4.neigh.br0/10.mcast_resolicit=3
# ip neigh replace 192.0.2.1 lladdr 00:11:22:33:44:55 nud stale dev br0.10 extern_valid use
62:50:1d:11:93:6f > 00:11:22:33:44:55, ethertype ARP (0x0806), length 42: Request who-has 192.0.2.1 tell 192.0.2.2, length 28
62:50:1d:11:93:6f > 00:11:22:33:44:55, ethertype ARP (0x0806), length 42: Request who-has 192.0.2.1 tell 192.0.2.2, length 28
62:50:1d:11:93:6f > 00:11:22:33:44:55, ethertype ARP (0x0806), length 42: Request who-has 192.0.2.1 tell 192.0.2.2, length 28
62:50:1d:11:93:6f > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 42: Request who-has 192.0.2.1 tell 192.0.2.2, length 28
62:50:1d:11:93:6f > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 42: Request who-has 192.0.2.1 tell 192.0.2.2, length 28
62:50:1d:11:93:6f > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 42: Request who-has 192.0.2.1 tell 192.0.2.2, length 28
iproute2 patches can be found here [2].
[1] https://datatracker.ietf.org/doc/html/draft-rbickhart-evpn-ip-mac-proxy-adv-03
[2] https://github.com/idosch/iproute2/tree/submit/extern_valid_v1
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://patch.msgid.link/20250626073111.244534-2-idosch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
inet_rtx_syn_ack() is a simple wrapper around tcp_rtx_synack(),
if we move req->num_retrans update.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250626153017.2156274-3-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Now inet_rtx_syn_ack() is only used by TCP, it can directly
call tcp_rtx_synack() instead of using an indirect call
to req->rsk_ops->rtx_syn_ack().
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250626153017.2156274-2-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Since commit c54e1d920f ("flow_offload: add ops to tc_action_ops for
flow action setup") these are unused.
Signed-off-by: Yue Haibing <yuehaibing@huawei.com>
Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
Link: https://patch.msgid.link/20250624014327.3686873-1-yuehaibing@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
While configuring a vxlan tunnel in a system with a i40e NIC driver, I
observe the following deadlock:
WARNING: possible recursive locking detected
6.16.0-rc2.net-next-6.16_92d87230d899+ #13 Tainted: G E
--------------------------------------------
kworker/u256:4/1125 is trying to acquire lock:
ffff88921ab9c8c8 (&utn->lock){+.+.}-{4:4}, at: i40e_udp_tunnel_set_port (/home/pabeni/net-next/include/net/udp_tunnel.h:343 /home/pabeni/net-next/drivers/net/ethernet/intel/i40e/i40e_main.c:13013) i40e
but task is already holding lock:
ffff88921ab9c8c8 (&utn->lock){+.+.}-{4:4}, at: udp_tunnel_nic_device_sync_work (/home/pabeni/net-next/net/ipv4/udp_tunnel_nic.c:739) udp_tunnel
other info that might help us debug this:
Possible unsafe locking scenario:
CPU0
----
lock(&utn->lock);
lock(&utn->lock);
*** DEADLOCK ***
May be due to missing lock nesting notation
4 locks held by kworker/u256:4/1125:
#0: ffff8892910ca158 ((wq_completion)udp_tunnel_nic){+.+.}-{0:0}, at: process_one_work (/home/pabeni/net-next/kernel/workqueue.c:3213)
#1: ffffc900244efd30 ((work_completion)(&utn->work)){+.+.}-{0:0}, at: process_one_work (/home/pabeni/net-next/kernel/workqueue.c:3214)
#2: ffffffff9a14e290 (rtnl_mutex){+.+.}-{4:4}, at: udp_tunnel_nic_device_sync_work (/home/pabeni/net-next/net/ipv4/udp_tunnel_nic.c:737) udp_tunnel
#3: ffff88921ab9c8c8 (&utn->lock){+.+.}-{4:4}, at: udp_tunnel_nic_device_sync_work (/home/pabeni/net-next/net/ipv4/udp_tunnel_nic.c:739) udp_tunnel
stack backtrace:
Hardware name: Dell Inc. PowerEdge R7525/0YHMCJ, BIOS 2.2.5 04/08/2021
i
Call Trace:
<TASK>
dump_stack_lvl (/home/pabeni/net-next/lib/dump_stack.c:123)
print_deadlock_bug (/home/pabeni/net-next/kernel/locking/lockdep.c:3047)
validate_chain (/home/pabeni/net-next/kernel/locking/lockdep.c:3901)
__lock_acquire (/home/pabeni/net-next/kernel/locking/lockdep.c:5240)
lock_acquire.part.0 (/home/pabeni/net-next/kernel/locking/lockdep.c:473 /home/pabeni/net-next/kernel/locking/lockdep.c:5873)
__mutex_lock (/home/pabeni/net-next/kernel/locking/mutex.c:604 /home/pabeni/net-next/kernel/locking/mutex.c:747)
i40e_udp_tunnel_set_port (/home/pabeni/net-next/include/net/udp_tunnel.h:343 /home/pabeni/net-next/drivers/net/ethernet/intel/i40e/i40e_main.c:13013) i40e
udp_tunnel_nic_device_sync_by_port (/home/pabeni/net-next/net/ipv4/udp_tunnel_nic.c:230 /home/pabeni/net-next/net/ipv4/udp_tunnel_nic.c:249) udp_tunnel
__udp_tunnel_nic_device_sync.part.0 (/home/pabeni/net-next/net/ipv4/udp_tunnel_nic.c:292) udp_tunnel
udp_tunnel_nic_device_sync_work (/home/pabeni/net-next/net/ipv4/udp_tunnel_nic.c:742) udp_tunnel
process_one_work (/home/pabeni/net-next/kernel/workqueue.c:3243)
worker_thread (/home/pabeni/net-next/kernel/workqueue.c:3315 /home/pabeni/net-next/kernel/workqueue.c:3402)
kthread (/home/pabeni/net-next/kernel/kthread.c:464)
AFAICS all the existing callsites of udp_tunnel_nic_set_port_priv() are
already under the utn lock scope, avoid (re-)acquiring it in such a
function.
Fixes: 1ead750109 ("udp_tunnel: remove rtnl_lock dependency")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/95a827621ec78c12d1564ec3209e549774f9657d.1750675978.git.pabeni@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Add support for updating the stations S1G capabilities when
an S1G association occurs.
Signed-off-by: Lachlan Hodges <lachlan.hodges@morsemicro.com>
Link: https://patch.msgid.link/20250617080610.756048-3-lachlan.hodges@morsemicro.com
[remove unused S1G_CAP3_MAX_MPDU_LEN_3895/_7791]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Currently there is no support for initialising a peers S1G capabilities,
this patch adds support for configuring an S1G station.
Signed-off-by: Lachlan Hodges <lachlan.hodges@morsemicro.com>
Link: https://patch.msgid.link/20250617080610.756048-2-lachlan.hodges@morsemicro.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Currently, setting RTS threshold is based on per-phy basis, i.e., all the
radios present in a wiphy will take RTS threshold value to be the one sent
from userspace. But each radio in a multi-radio wiphy can have different
RTS threshold requirements.
To extend support to set RTS threshold for each radio, get the radio for
which RTS threshold needs to be changed from the user. Use the attribute
in NL - NL80211_ATTR_WIPHY_RADIO_INDEX, to identify the radio of interest.
Create a new structure - wiphy_radio_cfg and add rts_threshold in it as a
u32 value to store RTS threshold of each radio in a wiphy and allocate
memory for it during wiphy register based on the wiphy.n_radio updated by
drivers. Pass radio id received from the user to mac80211 drivers along
with its corresponding RTS threshold.
Signed-off-by: Roopni Devanathan <quic_rdevanat@quicinc.com>
Link: https://patch.msgid.link/20250615082312.619639-3-quic_rdevanat@quicinc.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Currently, per-radio attributes are set on per-phy basis, i.e., all the
radios present in a wiphy will take attributes values sent from user. But
each radio in a wiphy can get different values from userspace based on
its requirement.
To extend support to set per-radio attributes, add support to get radio
index from userspace. Add an NL attribute - NL80211_ATTR_WIPHY_RADIO_INDEX,
to get user specified radio index for which attributes should be changed.
Pass this to individual drivers, so that the drivers can use this radio
index to change per-radio attributes when necessary. Currently, per-radio
attributes identified are:
NL80211_ATTR_WIPHY_TX_POWER_LEVEL
NL80211_ATTR_WIPHY_ANTENNA_TX
NL80211_ATTR_WIPHY_ANTENNA_RX
NL80211_ATTR_WIPHY_RETRY_SHORT
NL80211_ATTR_WIPHY_RETRY_LONG
NL80211_ATTR_WIPHY_FRAG_THRESHOLD
NL80211_ATTR_WIPHY_RTS_THRESHOLD
NL80211_ATTR_WIPHY_COVERAGE_CLASS
NL80211_ATTR_TXQ_LIMIT
NL80211_ATTR_TXQ_MEMORY_LIMIT
NL80211_ATTR_TXQ_QUANTUM
By default, the radio index is set to -1. This means the attribute should
be treated as a global configuration. If the user has not specified any
index, then the radio index passed to individual drivers would be -1. This
would indicate that the attribute applies to all radios in that wiphy.
Signed-off-by: Roopni Devanathan <quic_rdevanat@quicinc.com>
Link: https://patch.msgid.link/20250615082312.619639-2-quic_rdevanat@quicinc.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Currently, link station statistics for MLO are filled by mac80211.
But there are some statistics that kept by mac80211 might not be
accurate, so let the driver pre-fill the link statistics. The driver
can fill the values (indicating which field is filled, by setting the
filled bitmapin in link_station structure).
Statistics that driver don't fill are filled by mac80211.
Hence, add link_sta_statistics callback to fill link station statistics
for MLO in sta_set_link_sinfo() by drivers.
Signed-off-by: Sarika Sharma <quic_sarishar@quicinc.com>
Link: https://patch.msgid.link/20250528054420.3050133-11-quic_sarishar@quicinc.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Currently, sinfo structure is supported to fill information at
deflink( or one of the links) level for station. This has problems
when applied to fetch multi-link(ML) station information.
Hence, if valid_links are present, support filling link_station
structure for each link.
This will be helpful to check the link related statistics during MLO.
Additionally, TXQ stats for pertid are applicable at station level
not at link level. Therefore check link_id is less then 0, before
filling TXQ stats in pertid stats.
Signed-off-by: Sarika Sharma <quic_sarishar@quicinc.com>
Link: https://patch.msgid.link/20250528054420.3050133-9-quic_sarishar@quicinc.com
[fix some indentation]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Currently, station_info structure is passed to fill station statistics
from mac80211/drivers. After NL message send to user space for requested
station statistics, memory for station statistics is freed in cfg80211.
Therefore, memory allocation/free for link station statistics should
also happen in cfg80211 only.
Hence, allocate the memory for link_station structure for all
possible links and free in cfg80211_sinfo_release_content().
Signed-off-by: Sarika Sharma <quic_sarishar@quicinc.com>
Link: https://patch.msgid.link/20250528054420.3050133-6-quic_sarishar@quicinc.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Current implementation of NL80211_GET_STATION does not work for
multi-link operation(MLO) since in case of MLO only deflink (or one
of the links) is considered and not all links.
Therefore to support for MLO, add link_station_info structure
to account link level statistics for station.
Additionally, add valid_links in station_info structure to indicate
bitmap of valid links for MLO. This will be helpful to check the link
related statistics during MLO.
Signed-off-by: Sarika Sharma <quic_sarishar@quicinc.com>
Link: https://patch.msgid.link/20250528054420.3050133-3-quic_sarishar@quicinc.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Currently, in supporting API's to fill sinfo structure from sta
structure, is mapped to fill the fields from sta->deflink. However,
for multi-link (ML) station, sinfo structure should be filled from
corresponding link_id.
Therefore, add link_id as an additional argument in supporting API's
for filling sinfo structure correctly. Link_id is set to -1 for non-ML
station and corresponding link_id for ML stations. In supporting API's
for filling sinfo structure, check for link_id, if link_id < 0, fill
the sinfo structure from sta->deflink, otherwise fill from
sta->link[link_id].
Current, changes are done at the deflink level i.e, pass -1 as link_id.
Actual link_id will be added in subsequent patches to support
station statistics for MLO.
Signed-off-by: Sarika Sharma <quic_sarishar@quicinc.com>
Link: https://patch.msgid.link/20250528054420.3050133-2-quic_sarishar@quicinc.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
The symbol names in the .modinfo section are never used and already
randomized by the __UNIQUE_ID() macro.
Therefore, the second parameter of __MODULE_INFO() is meaningless
and can be removed to simplify the code.
With this change, the symbol names in the .modinfo section will be
prefixed with __UNIQUE_ID_modinfo, making it clearer that they
originate from MODULE_INFO().
[Before]
$ objcopy -j .modinfo vmlinux.o modinfo.o
$ nm -n modinfo.o | head -n10
0000000000000000 r __UNIQUE_ID_license560
0000000000000011 r __UNIQUE_ID_file559
0000000000000030 r __UNIQUE_ID_description558
0000000000000074 r __UNIQUE_ID_license580
000000000000008e r __UNIQUE_ID_file579
00000000000000bd r __UNIQUE_ID_description578
00000000000000e6 r __UNIQUE_ID_license581
00000000000000ff r __UNIQUE_ID_file580
0000000000000134 r __UNIQUE_ID_description579
0000000000000179 r __UNIQUE_ID_uncore_no_discover578
[After]
$ objcopy -j .modinfo vmlinux.o modinfo.o
$ nm -n modinfo.o | head -n10
0000000000000000 r __UNIQUE_ID_modinfo560
0000000000000011 r __UNIQUE_ID_modinfo559
0000000000000030 r __UNIQUE_ID_modinfo558
0000000000000074 r __UNIQUE_ID_modinfo580
000000000000008e r __UNIQUE_ID_modinfo579
00000000000000bd r __UNIQUE_ID_modinfo578
00000000000000e6 r __UNIQUE_ID_modinfo581
00000000000000ff r __UNIQUE_ID_modinfo580
0000000000000134 r __UNIQUE_ID_modinfo579
0000000000000179 r __UNIQUE_ID_modinfo578
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
Reviewed-by: Petr Pavlu <petr.pavlu@suse.com>
Followup of commit 285975dd67 ("net: annotate data-races around
sk->sk_{rcv|snd}timeo").
Remove lock_sock()/release_sock() from ksmbd_tcp_rcv_timeout()
and add READ_ONCE()/WRITE_ONCE() where it is needed.
Also SO_RCVTIMEO_OLD and SO_RCVTIMEO_NEW can call sock_set_timeout()
without holding the socket lock.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250620155536.335520-3-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Followup of commit 285975dd67 ("net: annotate data-races around
sk->sk_{rcv|snd}timeo").
Remove lock_sock()/release_sock() from sock_set_sndtimeo(),
and add READ_ONCE()/WRITE_ONCE() where it is needed.
Also SO_SNDTIMEO_OLD and SO_SNDTIMEO_NEW can call sock_set_timeout()
without holding the socket lock.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250620155536.335520-2-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Difference between sock_i_uid() and sk_uid() is that
after sock_orphan(), sock_i_uid() returns GLOBAL_ROOT_UID
while sk_uid() returns the last cached sk->sk_uid value.
None of sock_i_uid() callers care about this.
Use sk_uid() which is much faster and inlined.
Note that diag/dump users are calling sock_i_ino() and
can not see the full benefit yet.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Lorenzo Colitti <lorenzo@google.com>
Reviewed-by: Maciej Żenczykowski <maze@google.com>
Link: https://patch.msgid.link/20250620133001.4090592-3-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
sk->sk_uid can be read while another thread changes its
value in sockfs_setattr().
Add sk_uid(const struct sock *sk) helper to factorize the needed
READ_ONCE() annotations, and add corresponding WRITE_ONCE()
where needed.
Fixes: 86741ec254 ("net: core: Add a UID field to struct sock.")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Lorenzo Colitti <lorenzo@google.com>
Reviewed-by: Maciej Żenczykowski <maze@google.com>
Link: https://patch.msgid.link/20250620133001.4090592-2-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
syzbot reported use-after-free in vhci_flush() without repro. [0]
From the splat, a thread close()d a vhci file descriptor while
its device was being used by iotcl() on another thread.
Once the last fd refcnt is released, vhci_release() calls
hci_unregister_dev(), hci_free_dev(), and kfree() for struct
vhci_data, which is set to hci_dev->dev->driver_data.
The problem is that there is no synchronisation after unlinking
hdev from hci_dev_list in hci_unregister_dev(). There might be
another thread still accessing the hdev which was fetched before
the unlink operation.
We can use SRCU for such synchronisation.
Let's run hci_dev_reset() under SRCU and wait for its completion
in hci_unregister_dev().
Another option would be to restore hci_dev->destruct(), which was
removed in commit 587ae086f6 ("Bluetooth: Remove unused
hci-destruct cb"). However, this would not be a good solution, as
we should not run hci_unregister_dev() while there are in-flight
ioctl() requests, which could lead to another data-race KCSAN splat.
Note that other drivers seem to have the same problem, for exmaple,
virtbt_remove().
[0]:
BUG: KASAN: slab-use-after-free in skb_queue_empty_lockless include/linux/skbuff.h:1891 [inline]
BUG: KASAN: slab-use-after-free in skb_queue_purge_reason+0x99/0x360 net/core/skbuff.c:3937
Read of size 8 at addr ffff88807cb8d858 by task syz.1.219/6718
CPU: 1 UID: 0 PID: 6718 Comm: syz.1.219 Not tainted 6.16.0-rc1-syzkaller-00196-g08207f42d3ff #0 PREEMPT(full)
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 05/07/2025
Call Trace:
<TASK>
dump_stack_lvl+0x189/0x250 lib/dump_stack.c:120
print_address_description mm/kasan/report.c:408 [inline]
print_report+0xd2/0x2b0 mm/kasan/report.c:521
kasan_report+0x118/0x150 mm/kasan/report.c:634
skb_queue_empty_lockless include/linux/skbuff.h:1891 [inline]
skb_queue_purge_reason+0x99/0x360 net/core/skbuff.c:3937
skb_queue_purge include/linux/skbuff.h:3368 [inline]
vhci_flush+0x44/0x50 drivers/bluetooth/hci_vhci.c:69
hci_dev_do_reset net/bluetooth/hci_core.c:552 [inline]
hci_dev_reset+0x420/0x5c0 net/bluetooth/hci_core.c:592
sock_do_ioctl+0xd9/0x300 net/socket.c:1190
sock_ioctl+0x576/0x790 net/socket.c:1311
vfs_ioctl fs/ioctl.c:51 [inline]
__do_sys_ioctl fs/ioctl.c:907 [inline]
__se_sys_ioctl+0xf9/0x170 fs/ioctl.c:893
do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
do_syscall_64+0xfa/0x3b0 arch/x86/entry/syscall_64.c:94
entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7fcf5b98e929
Code: ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 a8 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007fcf5c7b9038 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
RAX: ffffffffffffffda RBX: 00007fcf5bbb6160 RCX: 00007fcf5b98e929
RDX: 0000000000000000 RSI: 00000000400448cb RDI: 0000000000000009
RBP: 00007fcf5ba10b39 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
R13: 0000000000000000 R14: 00007fcf5bbb6160 R15: 00007ffd6353d528
</TASK>
Allocated by task 6535:
kasan_save_stack mm/kasan/common.c:47 [inline]
kasan_save_track+0x3e/0x80 mm/kasan/common.c:68
poison_kmalloc_redzone mm/kasan/common.c:377 [inline]
__kasan_kmalloc+0x93/0xb0 mm/kasan/common.c:394
kasan_kmalloc include/linux/kasan.h:260 [inline]
__kmalloc_cache_noprof+0x230/0x3d0 mm/slub.c:4359
kmalloc_noprof include/linux/slab.h:905 [inline]
kzalloc_noprof include/linux/slab.h:1039 [inline]
vhci_open+0x57/0x360 drivers/bluetooth/hci_vhci.c:635
misc_open+0x2bc/0x330 drivers/char/misc.c:161
chrdev_open+0x4c9/0x5e0 fs/char_dev.c:414
do_dentry_open+0xdf0/0x1970 fs/open.c:964
vfs_open+0x3b/0x340 fs/open.c:1094
do_open fs/namei.c:3887 [inline]
path_openat+0x2ee5/0x3830 fs/namei.c:4046
do_filp_open+0x1fa/0x410 fs/namei.c:4073
do_sys_openat2+0x121/0x1c0 fs/open.c:1437
do_sys_open fs/open.c:1452 [inline]
__do_sys_openat fs/open.c:1468 [inline]
__se_sys_openat fs/open.c:1463 [inline]
__x64_sys_openat+0x138/0x170 fs/open.c:1463
do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
do_syscall_64+0xfa/0x3b0 arch/x86/entry/syscall_64.c:94
entry_SYSCALL_64_after_hwframe+0x77/0x7f
Freed by task 6535:
kasan_save_stack mm/kasan/common.c:47 [inline]
kasan_save_track+0x3e/0x80 mm/kasan/common.c:68
kasan_save_free_info+0x46/0x50 mm/kasan/generic.c:576
poison_slab_object mm/kasan/common.c:247 [inline]
__kasan_slab_free+0x62/0x70 mm/kasan/common.c:264
kasan_slab_free include/linux/kasan.h:233 [inline]
slab_free_hook mm/slub.c:2381 [inline]
slab_free mm/slub.c:4643 [inline]
kfree+0x18e/0x440 mm/slub.c:4842
vhci_release+0xbc/0xd0 drivers/bluetooth/hci_vhci.c:671
__fput+0x44c/0xa70 fs/file_table.c:465
task_work_run+0x1d1/0x260 kernel/task_work.c:227
exit_task_work include/linux/task_work.h:40 [inline]
do_exit+0x6ad/0x22e0 kernel/exit.c:955
do_group_exit+0x21c/0x2d0 kernel/exit.c:1104
__do_sys_exit_group kernel/exit.c:1115 [inline]
__se_sys_exit_group kernel/exit.c:1113 [inline]
__x64_sys_exit_group+0x3f/0x40 kernel/exit.c:1113
x64_sys_call+0x21ba/0x21c0 arch/x86/include/generated/asm/syscalls_64.h:232
do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
do_syscall_64+0xfa/0x3b0 arch/x86/entry/syscall_64.c:94
entry_SYSCALL_64_after_hwframe+0x77/0x7f
The buggy address belongs to the object at ffff88807cb8d800
which belongs to the cache kmalloc-1k of size 1024
The buggy address is located 88 bytes inside of
freed 1024-byte region [ffff88807cb8d800, ffff88807cb8dc00)
Fixes: bf18c7118c ("Bluetooth: vhci: Free driver_data on file release")
Reported-by: syzbot+2faa4825e556199361f9@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=f62d64848fc4c7c30cd6
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Acked-by: Paul Menzel <pmenzel@molgen.mpg.de>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
In the case of SME-in-driver, the driver can internally choose to
update the links based on the AP MLD recommendation and do link
reconfiguration negotiation with AP MLD.
(e.g., After the driver processing the BSS Transition Management request
frame received from the AP MLD with Neighbor Report containing
Multi-Link element with recommended links information chooses to do link
reconfiguration negotiation with AP MLD).
To support this, extend cfg80211_mlo_reconf_add_done() and
NL80211_CMD_ASSOC_MLO_RECONF to indicate added links information for
driver-initiated link reconfiguration requests. For removed links,
the driver indicates links information using the
NL80211_CMD_LINKS_REMOVED event for driver-initiated cases, the same as
supplicant initiated cases.
For the driver-initiated case, cfg80211 will receive link
reconfiguration result asynchronously from driver so holding BSSes of
the accepted add links is needed in the event path. Also, no need of
unhold call for the rejected add link BSSes since there was no hold call
happened previously.
Once the supplicant receives the NL80211_CMD_ASSOC_MLO_RECONF event,
it needs to process the information about newly added links and install
per-link group keys (e.g., GTK/IGTK/BIGTK etc.).
In case of the SME-in-driver, using a vendor interface etc. to notify
the supplicant to initiate a link reconfiguration request and then
supplicant sending command to the cfg80211 can lead to race conditions.
The correct design to avoid this is that the driver indicates the
cfg80211 directly with the results of the link reconfiguration
negotiation.
Signed-off-by: Kavita Kavita <quic_kkavita@quicinc.com>
Link: https://patch.msgid.link/20250604105757.2542-3-quic_kkavita@quicinc.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Add utility API cfg80211_get_radio_idx_by_chan() to retrieve the radio
index corresponding to a given channel in a multi-radio wiphy.
This utility function can be used when we want to check the radio-specific
data for a channel in a multi-radio wiphy. For example, it can help
determine the radio index required to handle a scan request. This index
can then be used to decide whether the scan can proceed without
interfering with ongoing DFS operations on another radio.
Signed-off-by: Vasanthakumar Thiagarajan <vasanthakumar.thiagarajan@oss.qualcomm.com>
Co-developed-by: Raj Kumar Bhagat <quic_rajkbhag@quicinc.com>
Signed-off-by: Raj Kumar Bhagat <quic_rajkbhag@quicinc.com>
Link: https://patch.msgid.link/20250527-mlo-dfs-acs-v2-1-92c2f37c81d9@quicinc.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
As discussesd before in [0] proxy entries (which are more configuration
than runtime data) should stay when the link (carrier) goes does down.
This is what happens for regular neighbour entries.
So lets fix this by:
- storing in proxy entries the fact that it was added as NUD_PERMANENT
- not removing NUD_PERMANENT proxy entries when the carrier goes down
(same as how it's done in neigh_flush_dev() for regular neigh entries)
[0]: https://lore.kernel.org/netdev/c584ef7e-6897-01f3-5b80-12b53f7b4bf4@kernel.org/
Signed-off-by: Nicolas Escande <nico.escande@gmail.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250617141334.3724863-1-nico.escande@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
If any of the HWC commands are not recognized by the
underlying hardware, the hardware returns the response
header status of -1. Log the information using
netdev_info_once to avoid multiple error logs in dmesg.
Signed-off-by: Erni Sri Satya Vennela <ernis@linux.microsoft.com>
Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
Reviewed-by: Shradha Gupta <shradhagupta@linux.microsoft.com>
Reviewed-by: Saurabh Singh Sengar <ssengar@linux.microsoft.com>
Reviewed-by: Dipayaan Roy <dipayanroy@linux.microsoft.com>
Link: https://patch.msgid.link/1750144656-2021-5-git-send-email-ernis@linux.microsoft.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Allow mana ethtool get_link_ksettings operation to report
the maximum speed supported by the SKU in mbps.
The driver retrieves this information by issuing a
HWC command to the hardware via mana_query_link_cfg(),
which retrieves the SKU's maximum supported speed.
These APIs when invoked on hardware that are older/do
not support these APIs, the speed would be reported as UNKNOWN.
Before:
$ethtool enP30832s1
> Settings for enP30832s1:
Supported ports: [ ]
Supported link modes: Not reported
Supported pause frame use: No
Supports auto-negotiation: No
Supported FEC modes: Not reported
Advertised link modes: Not reported
Advertised pause frame use: No
Advertised auto-negotiation: No
Advertised FEC modes: Not reported
Speed: Unknown!
Duplex: Full
Auto-negotiation: off
Port: Other
PHYAD: 0
Transceiver: internal
Link detected: yes
After:
$ethtool enP30832s1
> Settings for enP30832s1:
Supported ports: [ ]
Supported link modes: Not reported
Supported pause frame use: No
Supports auto-negotiation: No
Supported FEC modes: Not reported
Advertised link modes: Not reported
Advertised pause frame use: No
Advertised auto-negotiation: No
Advertised FEC modes: Not reported
Speed: 16000Mb/s
Duplex: Full
Auto-negotiation: off
Port: Other
PHYAD: 0
Transceiver: internal
Link detected: yes
Signed-off-by: Erni Sri Satya Vennela <ernis@linux.microsoft.com>
Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
Reviewed-by: Shradha Gupta <shradhagupta@linux.microsoft.com>
Reviewed-by: Saurabh Singh Sengar <ssengar@linux.microsoft.com>
Reviewed-by: Long Li <longli@microsoft.com>
Link: https://patch.msgid.link/1750144656-2021-4-git-send-email-ernis@linux.microsoft.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Introduce support for net_shaper_ops in the MANA driver,
enabling configuration of rate limiting on the MANA NIC.
To apply rate limiting, the driver issues a HWC command via
mana_set_bw_clamp() and updates the corresponding shaper object
in the net_shaper cache. If an error occurs during this process,
the driver restores the previous speed by querying the current link
configuration using mana_query_link_cfg().
The minimum supported bandwidth is 100 Mbps, and only values that are
exact multiples of 100 Mbps are allowed. Any other values are rejected.
To remove a shaper, the driver resets the bandwidth to the maximum
supported by the SKU using mana_set_bw_clamp() and clears the
associated cache entry. If an error occurs during this process,
the shaper details are retained.
On the hardware that does not support these APIs, the net-shaper
calls to set speed would fail.
Set the speed:
./tools/net/ynl/pyynl/cli.py \
--spec Documentation/netlink/specs/net_shaper.yaml \
--do set --json '{"ifindex":'$IFINDEX',
"handle":{"scope": "netdev", "id":'$ID' },
"bw-max": 200000000 }'
Get the shaper details:
./tools/net/ynl/pyynl/cli.py \
--spec Documentation/netlink/specs/net_shaper.yaml \
--do get --json '{"ifindex":'$IFINDEX',
"handle":{"scope": "netdev", "id":'$ID' }}'
> {'bw-max': 200000000,
> 'handle': {'scope': 'netdev'},
> 'ifindex': $IFINDEX,
> 'metric': 'bps'}
Delete the shaper object:
./tools/net/ynl/pyynl/cli.py \
--spec Documentation/netlink/specs/net_shaper.yaml \
--do delete --json '{"ifindex":'$IFINDEX',
"handle":{"scope": "netdev","id":'$ID' }}'
Signed-off-by: Erni Sri Satya Vennela <ernis@linux.microsoft.com>
Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
Reviewed-by: Shradha Gupta <shradhagupta@linux.microsoft.com>
Reviewed-by: Saurabh Singh Sengar <ssengar@linux.microsoft.com>
Reviewed-by: Long Li <longli@microsoft.com>
Link: https://patch.msgid.link/1750144656-2021-3-git-send-email-ernis@linux.microsoft.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Add a new device generic parameter to enable/disable the
PHC (PTP Hardware Clock) functionality in the device associated
with the devlink instance.
Signed-off-by: David Arinzon <darinzon@amazon.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Link: https://patch.msgid.link/20250617110545.5659-6-darinzon@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Drivers that are using ops lock and don't depend on RTNL lock
still need to manage it because udp_tunnel's RTNL dependency.
Introduce new udp_tunnel_nic_lock and use it instead of
rtnl_lock. Drop non-UDP_TUNNEL_NIC_INFO_MAY_SLEEP mode from
udp_tunnel infra (udp_tunnel_nic_device_sync_work needs to
grab udp_tunnel_nic_lock mutex and might sleep).
Cover more places in v4:
- netlink
- udp_tunnel_notify_add_rx_port (ndo_open)
- triggers udp_tunnel_nic_device_sync_work
- udp_tunnel_notify_del_rx_port (ndo_stop)
- triggers udp_tunnel_nic_device_sync_work
- udp_tunnel_get_rx_info (__netdev_update_features)
- triggers NETDEV_UDP_TUNNEL_PUSH_INFO
- udp_tunnel_drop_rx_info (__netdev_update_features)
- triggers NETDEV_UDP_TUNNEL_DROP_INFO
- udp_tunnel_nic_reset_ntf (ndo_open)
- notifiers
- udp_tunnel_nic_netdevice_event, depending on the event:
- triggers NETDEV_UDP_TUNNEL_PUSH_INFO
- triggers NETDEV_UDP_TUNNEL_DROP_INFO
- ethnl_tunnel_info_reply_size
- udp_tunnel_nic_set_port_priv (two intel drivers)
Cc: Michael Chan <michael.chan@broadcom.com>
Suggested-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Stanislav Fomichev <stfomichev@gmail.com>
Link: https://patch.msgid.link/20250616162117.287806-4-stfomichev@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Tony Nguyen says:
====================
libeth: add libeth_xdp helper lib
Alexander Lobakin says:
Time to add XDP helpers infra to libeth to greatly simplify adding
XDP to idpf and iavf, as well as improve and extend XDP in ice and
i40e. Any vendor is free to reuse helpers. If this happens, I'm fine
with moving the folder of out intel/.
The helpers greatly simplify building xdp_buff, running a prog,
handling the verdict, implement XDP_TX, .ndo_xdp_xmit, XDP buffer
completion. Same applies to XSk (with XSk xmit instead of
.ndo_xdp_xmit, plus stuff like XSk wakeup).
They are entirely generic with no HW definitions or assumptions.
HW-specific stuff like parsing Rx desc / filling Tx desc is passed
from the driver as inline callbacks.
For now, key assumptions that optimize performance / avoid code
bloat, but might not fit every driver in driver/net/:
* netmem holding the buffers are always order-0;
* driver has separate XDP Tx queues, doesn't use stack queues for
that. For best efficiency, you may want to have nr_cpu_ids XDP
queues, but less (queue sharing) is also supported;
* XDP Tx queues are interrupt-less and use "lazy" cleaning only
when there are less than 1/4 free Tx descriptors of the queue
size;
* main target platforms are 64-bit, although 32-bit is also fully
supported, but the code might be not as optimized for them.
Library code already supports multi-buffer for all kinds of Tx and
both header split and no split for Rx and Tx. Frags can come from
devmem/io_uring etc., direct `struct page *` is used only for header
buffers for which it's always true.
Drivers are free to pass their own Rx hints and XSK xmit hints ops.
XDP_TX and ndo_xdp_xmit use onstack bulk for the frames to be sent
and send them by batches of 16 buffers. This eats ~280 bytes on the
stack, but gives good boosts and allow to greatly optimize the main
sending function leaving it without any error/exception paths.
XSk xmit fills Tx descriptors in the loop unrolled by 8. This was
proven to improve perf on ice and i40e. XDP_TX and ndo_xdp_xmit
doesn't use unrolling as I wasn't able to get any improvements in
those scenenarios from this, while +1 Kb for their sending functions
for nothing doesn't sound reasonable.
XSk wakeup, instead of traditionally used "SW interrupts" provided
by NICs, uses IPI to schedule NAPI on the CPU corresponding to the
given queue pair. It gives better control over CPU distribution and
in general performs way better than "SW interrupts", plus allows us
to not pass any HW-specific callbacks there.
The code is built the way that all callbacks passed from drivers
get inlined; in general, most of hotpath gets inlined. Everything
slow/exception lands to .c files in the libeth folder, doesn't
create copies in the drivers themselves and doesn't overloat
hotpath.
Sure, inlining means that hotpath will be compiled into every driver
that uses the lib, but the core code is written in one place, so no
copying of bugs happens. Fixed once -- works everywhere.
The last commit might look like sorta hack, but it gives really good
boosts and decreases object code size, plus there are checks that
all those wider accesses are fully safe, so I don't feel anything
bad about it.
An example of using libeth_xdp can be found either on my GitHub or
on the mailing lists here ("XDP for idpf"). Macros for building
driver XDP functions lead to that some implementations (XDP_TX,
ndo_xdp_xmit etc.) consist of really only a few lines.
* '200GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue:
libeth: xdp, xsk: access adjacent u32s as u64 where applicable
libeth: xsk: add XSkFQ refill and XSk wakeup helpers
libeth: xsk: add XSk Rx processing support
libeth: xsk: add XSk xmit functions
libeth: xsk: add XSk XDP_TX sending helpers
libeth: xdp: add RSS hash hint and XDP features setup helpers
libeth: xdp: add templates for building driver-side callbacks
libeth: xdp: add XDP prog run and verdict result handling
libeth: xdp: add helpers for preparing/processing &libeth_xdp_buff
libeth: xdp: add XDPSQ cleanup timers
libeth: xdp: add XDPSQ locking helpers
libeth: xdp: add XDPSQE completion helpers
libeth: xdp: add .ndo_xdp_xmit() helpers
libeth: xdp: add XDP_TX buffers sending
libeth: support native XDP and register memory model
libeth: convert to netmem
libeth, libie: clean symbol exports up a little
====================
Link: https://patch.msgid.link/20250616201639.710420-1-anthony.l.nguyen@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This is the netmem counterpart of page_pool_dev_alloc_pages() which
uses the default GFP flags for RX.
Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Reviewed-by: Mina Almasry <almasrymina@google.com>
Signed-off-by: Mark Bloch <mbloch@nvidia.com>
Link: https://patch.msgid.link/20250616141441.1243044-4-mbloch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The only generic interface to execute asynchronously in the BH context is
tasklet; however, it's marked deprecated and has some design flaws. To
replace tasklets, BH workqueue support was recently added. A BH workqueue
behaves similarly to regular workqueues except that the queued work items
are executed in the BH context.
This patch converts TCP Small Queues implementation from tasklet to BH
workqueue.
Semantically, this is an equivalent conversion and there shouldn't be any
user-visible behavior changes. While workqueue's queueing and execution
paths are a bit heavier than tasklet's, unless the work item is being queued
every packet, the difference hopefully shouldn't matter.
My experience with the networking stack is very limited and this patch
definitely needs attention from someone who actually understands networking.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Cc: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/aFBeJ38AS1ZF3Dq5@slm.duckdns.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Locally-generated MC packets have so far not been subject to MC routing.
Instead an MC-enabled installation would maintain the MC routing tables,
and separately from that the list of interfaces to send packets to as part
of the VXLAN FDB and MDB.
In a previous patch, a ip_mr_output() and ip6_mr_output() routines were
added for IPv4 and IPv6. All locally generated MC traffic is now passed
through these functions. For reasons of backward compatibility, an SKB
(IPCB / IP6CB) flag guards the actual MC routing.
This patch adds logic to set the flag, and the UAPI to enable the behavior.
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://patch.msgid.link/d899655bb7e9b2521ee8c793e67056b9fd02ba12.1750113335.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
ip6tunnel_xmit() erases the contents of the SKB control block. In order to
be able to set particular IP6CB flags on the SKB, add a corresponding
parameter, and propagate it to udp_tunnel6_xmit_skb() as well.
In one of the following patches, VXLAN driver will use this facility to
mark packets as subject to IPv6 multicast routing.
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://patch.msgid.link/acb4f9f3e40c3a931236c3af08a720b017fbfbfb.1750113335.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The function always returns zero, thus the return value does not carry any
signal. Just make it void.
Most callers already ignore the return value. However:
- Refold arguments of the call from sctp_v6_xmit() so that they fit into
the 80-column limit.
- tipc_udp_xmit() initializes err from the return value, but that should
already be always zero at that point. So there's no practical change, but
elision of the assignment prompts a couple more tweaks to clean up the
function.
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://patch.msgid.link/7facacf9d8ca3ca9391a4aee88160913671b868d.1750113335.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Multicast routing is today handled in the input path. Locally generated MC
packets don't hit the IPMR code today. Thus if a VXLAN remote address is
multicast, the driver needs to set an OIF during route lookup. Thus MC
routing configuration needs to be kept in sync with the VXLAN FDB and MDB.
Ideally, the VXLAN packets would be routed by the MC routing code instead.
To that end, this patch adds support to route locally generated multicast
packets. The newly-added routines do largely what ip_mr_input() and
ip_mr_forward() do: make an MR cache lookup to find where to send the
packets, and use ip_mc_output() to send each of them. When no cache entry
is found, the packet is punted to the daemon for resolution.
However, an installation that uses a VXLAN underlay netdevice for which it
also has matching MC routes, would get a different routing with this patch.
Previously, the MC packets would be delivered directly to the underlay
port, whereas now they would be MC-routed. In order to avoid this change in
behavior, introduce an IPCB flag. Only if the flag is set will
ip_mr_output() actually engage, otherwise it reverts to ip_mc_output().
This code is based on work by Roopa Prabhu and Nikolay Aleksandrov.
Signed-off-by: Roopa Prabhu <roopa@nvidia.com>
Signed-off-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: Benjamin Poirier <bpoirier@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Link: https://patch.msgid.link/0aadbd49330471c0f758d54afb05eb3b6e3a6b65.1750113335.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
iptunnel_xmit() erases the contents of the SKB control block. In order to
be able to set particular IPCB flags on the SKB, add a corresponding
parameter, and propagate it to udp_tunnel_xmit_skb() as well.
In one of the following patches, VXLAN driver will use this facility to
mark packets as subject to IP multicast routing.
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Acked-by: Antonio Quartulli <antonio@openvpn.net>
Link: https://patch.msgid.link/89c9daf9f2dc088b6b92ccebcc929f51742de91f.1750113335.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Trivial fix to a couple of outdated netmem comments. No code changes,
just more accurately describing current code.
Signed-off-by: Mina Almasry <almasrymina@google.com>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20250615203511.591438-1-almasrymina@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Add support for legacy Broadcom FCS tags, which are similar to
DSA_TAG_PROTO_BRCM_LEGACY.
BCM5325 and BCM5365 switches require including the original FCS value and
length, as opposed to BCM63xx switches.
Adding the original FCS value and length to DSA_TAG_PROTO_BRCM_LEGACY would
impact performance of BCM63xx switches, so it's better to create a new tag.
Signed-off-by: Álvaro Fernández Rojas <noltari@gmail.com>
Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com>
Link: https://patch.msgid.link/20250614080000.1884236-3-noltari@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Now that we have removed the RFC3517/RFC6675 hints,
tcp_clear_retrans_hints_partial() is empty, and can be removed.
Suggested-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Yuchung Cheng <ycheng@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250615001435.2390793-4-ncardwell.sw@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Now that obsolete RFC3517/RFC6675 TCP loss detection has been removed,
we can remove the somewhat complex and intrusive code to maintain its
hint state: lost_skb_hint and lost_cnt_hint.
This commit makes tcp_clear_retrans_hints_partial() empty. We will
remove tcp_clear_retrans_hints_partial() and its call sites in the
next commit.
Suggested-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Yuchung Cheng <ycheng@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250615001435.2390793-3-ncardwell.sw@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Pavel Begunkov says:
====================
io_uring cmd for tx timestamps (part)
Apply the networking helpers for the io_uring timestamp API.
====================
Link: https://patch.msgid.link/cover.1750065793.git.asml.silence@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Shradha Gupta says:
====================
Allow dyn MSI-X vector allocation of MANA
In this patchset we want to enable the MANA driver to be able to
allocate MSI-X vectors in PCI dynamically.
The first patch exports pci_msix_prepare_desc() in PCI to be able to
correctly prepare descriptors for dynamically added MSI-X vectors.
The second patch adds the support of dynamic vector allocation in
pci-hyperv PCI controller by enabling the MSI_FLAG_PCI_MSIX_ALLOC_DYN
flag and using the pci_msix_prepare_desc() exported in first patch.
The third patch adds a detailed description of the irq_setup(), to
help understand the function design better.
The fourth patch is a preparation patch for mana changes to support
dynamic IRQ allocation. It contains changes in irq_setup() to allow
skipping first sibling CPU sets, in case certain IRQs are already
affinitized to them.
The fifth patch has the changes in MANA driver to be able to allocate
MSI-X vectors dynamically. If the support does not exist it defaults to
older behavior.
* 'shradha_v6.16-rc1' of https://github.com/shradhagupta6/linux:
net: mana: Allocate MSI-X vectors dynamically
net: mana: Allow irq_setup() to skip cpus for affinity
net: mana: explain irq_setup() algorithm
PCI: hv: Allow dynamic MSI-X vector allocation
PCI/MSI: Export pci_msix_prepare_desc() for dynamic MSI-X allocations
====================
Link: https://patch.msgid.link/1749650984-9193-1-git-send-email-shradhagupta@linux.microsoft.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Currently, the MANA driver allocates MSI-X vectors statically based on
MANA_MAX_NUM_QUEUES and num_online_cpus() values and in some cases ends
up allocating more vectors than it needs. This is because, by this time
we do not have a HW channel and do not know how many IRQs should be
allocated.
To avoid this, we allocate 1 MSI-X vector during the creation of HWC and
after getting the value supported by hardware, dynamically add the
remaining MSI-X vectors.
Signed-off-by: Shradha Gupta <shradhagupta@linux.microsoft.com>
Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
To collaborate with hardware servicing events, upon receiving the special
EQE notification from the HW channel, remove the devices on this bus.
Then, after a waiting period based on the device specs, rescan the parent
bus to recover the devices.
Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
Reviewed-by: Shradha Gupta <shradhagupta@linux.microsoft.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/1749834034-18498-1-git-send-email-haiyangz@linux.microsoft.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
On 64-bit systems, writing/reading one u64 is faster than two u32s even
when they're are adjacent in a struct. The compilers won't guarantee
they will combine those; I observed both successful and unsuccessful
attempts with both GCC and Clang, and it's not easy to say what it
depends on.
There's a few places in libeth_xdp winning up to several percent from
combined access (both performance and object code size, especially
when unrolling). Add __LIBETH_WORD_ACCESS and use it there on LE.
Drivers are free to optimize HW-specific callbacks under the same
definition.
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
XSkFQ refill is pretty generic across the drivers minus FQ descriptor
filling and can easily be unified with one inline callback.
XSk wakeup is usually not, but here, instead of commonly used
"SW interrupts", I picked firing an IPI. In most tests, it showed better
performance; it also provides better control for userspace on which CPU
will handle the xmit, as SW interrupts honor IRQ affinity no matter
which core produces XSk xmit descs (while XDPSQs are associated 1:1
with cores having the same ID).
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Add XSk counterparts for preparing XSk &libeth_xdp_buff (adding head and
frags), running the program, and handling the verdict, inc. XDP_PASS.
Shortcuts in comparison with regular Rx: frags and all verdicts except
XDP_REDIRECT are under unlikely() and out of line; no checks for XDP
program presence as it's always true for XSk.
Suggested-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> # optimizations
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Reuse core sending functions to send XSk xmit frames.
Both metadata and no metadata pools/driver are supported. libeth_xdp
also provides generic XSk metadata ops, currently with the checksum
offload only and for cases when HW doesn't require supplying L3/L4
checksum offsets. Drivers are free to pass their own ops.
&libeth_xdp_tx_bulk is not used here as it would be redundant;
pool->tx_descs are accessed directly.
Fake "libeth_xsktmo" is needed to hide implementation details from the
drivers when they want to use the generic ops: the original struct is
defined in the same file where dev->xsk_tx_metadata_ops gets set to
avoid duplication of slowpath; at the same time; XSk xmit functions
use local "fast" copy to inline XMO callbacks.
Tx descriptor filling loop is unrolled by 8.
Suggested-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> # optimizations
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Add Xsk counterparts for XDP_TX buffer sending and completion.
The same base structures and functions used from the libeth_xdp core,
with adjustments to that XSk Rx always operates on &xdp_buff_xsk for
both head and frags. And unlike regular Rx, here unlikely() are used
for frags, as the header split gives no benefits for XSk Rx, at
least for now.
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
End the XDP section by adding helpers to setup XDP features, flipping
.ndo_xdp_xmit() support at runtime (in case when it's not always on),
and calculating the queue clean/refill threshold.
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Defining driver-specific functions to pass to libeth_xdp functions can
induce boilerplates and/or look a bit cryptic with all those layers of
indirection. On the other hand, this indirection is needed to allow
compilers to uninline big functions even when passed to __always_inline
helpers (too much inlining also hurts performance in some cases), plus
to reuse some XDP helpers in XSk code.
Add macros to quickly build them, with the detailed kdoc. They take
names of the actual callbacks for filling a Tx descriptor and other
purely HW-specific things and wrap them appropriately.
LIBETH_XDP_DEFINE_{BEGIN,END}() is needed for GCC 8+ unfortunately to
let the drivers control which functions will be static and which global
without hitting `-Wold-style-declaration`.
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Running a prog and handling the verdicts, up to napi_gro_receive()
is also pretty generic code not really differing between vendors
(except for Tx descriptor filling and Rx descriptor parsing).
Define a couple inlines to do that. The inline callbacks a driver
needs to pass is mentioned above: Tx descriptor filling for XDP_TX,
populating skb with the descriptor data for XDP_PASS, finalizing
XDPSQs after the polling loop for XDP_TX (kicking the HW to start
sending).
The populate callback passes only &libeth_xdp_buff assuming buff::desc
pointer is enough, plus you can always get the corresponding Rx queue
structure via container_of(buff::rxq). If not, a driver can extend
the buff with more fields directly on the stack without touching
libeth_xdp definitions.
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Add convenience helpers to build an &xdp_buff. This means: general
initialization before the NAPI loop, adding head, adding frags etc.
libeth_xdp_process_buff() is the same what everybody have in their
drivers:
dma_sync_for_cpu();
if (!frag) {
add_head();
prefetch();
} else {
add_frag();
}
Note that I don't use net_prefetch(), sticking to the original
prefetch(). In none of my tests prefetching 128 bytes yielded better
perf than 64 bytes. That might differ if the headers are huge enough,
but then additional tunneling etc. overhead takes place, you either
way won't win a lot.
&libeth_xdp_stash is for cases when you exit the polling loop without
finishing building the buff. If that happens, you need to store the
buffer in the queue structure until the next loop and then restore it.
It makes no sense to place a whole full &xdp_buff there. Define a
minimal structure, which would store only the fields essential to
restore it.
I was able to pack it into 16 bytes, which is only 8 bytes bigger
than `struct sk_buff *skb` on x64.
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
When XDP Tx queues are not interrupt-driven but use lazy cleaning,
i.e. only when there are less than `threshold` free descriptors left,
we also need cleanup timers to avoid &xdp_buff and &xdp_frame stall
for too long, especially with Page Pool (it warns every about inflight
pages every 60 second).
Let's say we sent 256 frames and don't need to send more, but we clean
only when the number of pending items >= 384. In that case, those 256
will stall until 128 more are sent. For this, add simple helpers to
run a timer which will clean the queue regardless, after 1 second of
the last send.
The timer is triggered when finalizing the queue. As long as there is
regular active traffic, the timer doesn't fire.
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Reviewed-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Unfortunately, it's not always possible to allocate
max(num_rxqs, nr_cpu_ids) even on hi-end NICs.
To mitigate this, add simple locking helpers to libeth_xdp.
As long as XDPSQs are not shared, the whole functionality is gated
behind a static lock. Otherwise, each bulk flush locks the queue for
the time of cleaning and filling the descriptors.
As long as this particular queue is not used by more than 1 CPU,
the impact is minimal (runtime check for boolean twice per 16+
descriptors).
Suggested-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> # static key
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Reviewed-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Similarly to libeth_tx_complete(), add libeth_xdp_complete_tx() to
handle XDP_TX and xmit buffers. Both use bulk return under the hood.
Also add out of line libeth_tx_complete_any() which handles both
regular and XDP frames (if libeth_xdp is loaded), for example,
to call on queue destroy, where we don't need inlining but
convenience.
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Add helpers for implementing .ndo_xdp_xmit().
Same as for XDP_TX, accumulate up to 16 DMA-mapped frames on the stack,
then flush. If DMA mapping is failed for some reason, don't try mapping
further frames, but still flush what was already prepared.
DMA address of a head frame is stored in its headroom, assuming it
has enough of it for an 8 (or 4) byte value.
In addition to @prep and @xmit driver callbacks in XDP_TX, xmit also
needs @finalize to kick the XDPSQ after filling.
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Reviewed-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Start adding XDP-specific code to libeth, namely handling XDP_TX buffers
(only sending).
The idea is that we accumulate up to 16 buffers on the stack, then,
if either the limit is reached or the polling is finished, flush them
at once with only one XDPSQ cleaning (if needed). The main sending
function will be aware of the sending budget and already have all the
info to send the buffers, so it can't fail.
Drivers need to provide 2 inline callbacks to the main sending function:
for cleaning an XDPSQ and for filling descriptors; the library code
takes care of the rest.
Note that unlike the generic code, multi-buffer support is not wrapped
here with unlikely() to not hurt header split setups.
&libeth_xdp_buff is a simple extension over &xdp_buff which has a direct
pointer to the corresponding Rx descriptor (and, luckily, precisely 1 CL
size and 16-byte alignment on x86_64).
Suggested-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> # xmit logic
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Expand libeth's Page Pool functionality by adding native XDP support.
This means picking the appropriate headroom and DMA direction.
Also, register all the created &page_pools as XDP memory models.
A driver then can call xdp_rxq_info_attach_page_pool() when registering
its RxQ info.
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Back when the libeth Rx core was initially written, devmem was a draft
and netmem_ref didn't exist in the mainline. Now that it's here, make
libeth MP-agnostic before introducing any new code or any new library
users.
When it's known that the created PP/FQ is for header buffers, use faster
"unsafe" underscored netmem <--> virt accessors as netmem_is_net_iov()
is always false in that case, but consumes some cycles (bit test +
true branch).
Reviewed-by: Mina Almasry <almasrymina@google.com>
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
The field is spelled "message_priprity" in the big-endian bit-field
definition. Nothing in-tree currently references the member, so the
typo does not break kernel builds, but it is clearly incorrect.
Signed-off-by: RubenKelevra <rubenkelevra@gmail.com>
Link: https://patch.msgid.link/20250612145012.185321-1-rubenkelevra@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
- eir: Fix NULL pointer deference on eir_get_service_data
- eir: Fix possible crashes on eir_create_adv_data
- hci_sync: Fix broadcast/PA when using an existing instance
- ISO: Fix using BT_SK_PA_SYNC to detect BIS sockets
- ISO: Fix not using bc_sid as advertisement SID
- MGMT: Fix sparse errors
-----BEGIN PGP SIGNATURE-----
iQJNBAABCgA3FiEE7E6oRXp8w05ovYr/9JCA4xAyCykFAmhJ66MZHGx1aXoudm9u
LmRlbnR6QGludGVsLmNvbQAKCRD0kIDjEDILKfp/D/0VTEMF4PiA2eLHIPSwyIHr
pvpz3nY1WE84lAVL0VKNJalA15dk6TVs3Vxgns62BHLdajBOmYPpuJGXaSERBfLB
t5eb4nU9rx9F7+SW8zVLNwtnn5bTENNYKQIjfLmslDQQGfOjeaUP5sO/rIcLEiO3
0rEi55pE4nM6S2wUcmQlhWPC6tr3vIptg4lAz3MWlATDuUnkLjJ3rzEZdkg2kt39
2VJGNxXEG7sBrwv+coO3ROe54YSOrb+gvd9HOL0vq3MVBcvncCRqc7TuBlYi7/5C
p+WdEyG26FgS/TzdgMJKuVISQp6kNKulbuRhsnD2XZA3Gik+t+79Ex9haYW+HLDS
AWQNBm1FgYdCc4LsAxKfwGdvp8wAx1ci1vLNniYVTelyUAc5LosEZ/15DCCyTKdK
9zXEAfxwn72dLVtryVIRKqDR39QVqsxDSuV9ydgXzPJWwjisHX3AB01EqN5PGjYH
aspNgMGfYL9zSw6N1LQ+99M+/JLbvLs7b4jui4CbD3EI7nxN0YqOcKlHw7vEje5s
auU/UEL7DgWOzHTxCcidwATuV79pfx0CRSwsXaPLV1yA9lhS5AYdpBlsRB+wRFbN
vhpw8dwj/WCM0GVYnG87BU3mriyfNgaERTVA2nLKZXvn+cRkVBUkLwBV3Jpi7vQZ
cJ22gcrRj7uYotfvyCHv9g==
=dulg
-----END PGP SIGNATURE-----
Merge tag 'for-net-2025-06-11' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth
Luiz Augusto von Dentz says:
====================
bluetooth pull request for net:
- eir: Fix NULL pointer deference on eir_get_service_data
- eir: Fix possible crashes on eir_create_adv_data
- hci_sync: Fix broadcast/PA when using an existing instance
- ISO: Fix using BT_SK_PA_SYNC to detect BIS sockets
- ISO: Fix not using bc_sid as advertisement SID
- MGMT: Fix sparse errors
* tag 'for-net-2025-06-11' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth:
Bluetooth: MGMT: Fix sparse errors
Bluetooth: ISO: Fix not using bc_sid as advertisement SID
Bluetooth: ISO: Fix using BT_SK_PA_SYNC to detect BIS sockets
Bluetooth: eir: Fix possible crashes on eir_create_adv_data
Bluetooth: hci_sync: Fix broadcast/PA when using an existing instance
Bluetooth: Fix NULL pointer deference on eir_get_service_data
====================
Link: https://patch.msgid.link/20250611204944.1559356-1-luiz.dentz@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This function is no longer used after the four prior fixes.
Given all prior uses were wrong, it seems better to remove it.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250611111515.1983366-6-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Offload path is used for GRO with SW IPsec, and not just for HW
offload. So initialize it anyway.
Fixes: 585b64f5a6 ("xfrm: delay initialization of offload path till its actually requested")
Reported-by: Sabrina Dubroca <sd@queasysnail.net>
Closes: https://lore.kernel.org/all/aEGW_5HfPqU1rFjl@krikkit
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Currently bc_sid is being ignore when acting as Broadcast Source role,
so this fix it by passing the bc_sid and then use it when programming
the PA:
< HCI Command: LE Set Exte.. (0x08|0x0036) plen 25
Handle: 0x01
Properties: 0x0000
Min advertising interval: 140.000 msec (0x00e0)
Max advertising interval: 140.000 msec (0x00e0)
Channel map: 37, 38, 39 (0x07)
Own address type: Random (0x01)
Peer address type: Public (0x00)
Peer address: 00:00:00:00:00:00 (OUI 00-00-00)
Filter policy: Allow Scan Request from Any, Allow Connect Request from Any (0x00)
TX power: Host has no preference (0x7f)
Primary PHY: LE 1M (0x01)
Secondary max skip: 0x00
Secondary PHY: LE 2M (0x02)
SID: 0x01
Scan request notifications: Disabled (0x00)
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
-----BEGIN PGP SIGNATURE-----
iQFHBAABCgAxFiEEn/sM2K9nqF/8FWzzDHRl3/mQkZwFAmhH/swTHG1rbEBwZW5n
dXRyb25peC5kZQAKCRAMdGXf+ZCRnONCCACa16bTW53gBzmiTxdEgUJ/h+gQuR8G
Fj+yOYIWNZY/YOExa40ldApu3iB9UAB0D+FOly4Wv5zYDct6yNBxqtZjbkTFMaoi
3i+SSrRLNtIxgGs1KgJKVPis8mhCqiBL0aGoJDGyRiye6hotECDyQWvlGM3lMGUr
wdMDQW2xyKOWvm++jXijkUMyKThmI7czlSH8al+JU9KcAO9hiUlGzejdI56KUIMW
TRlg2QSK9CfIzgUP4RQughbF59/8Xbq3LOidu50xMad2wiOJj0IUHB0h6LoAshnS
jFy4Ox4Gw5hcmdaEKazjYEtq3nQeZ6wct7jThw02e4D9h0ac2MCVhphk
=Pt9d
-----END PGP SIGNATURE-----
Merge tag 'linux-can-next-for-6.17-20250610' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can-next
Marc Kleine-Budde says:
====================
pull-request: can-next 2025-06-10
The first 4 patches are by Vincent Mailhol and prepare the CAN netlink
interface for the introduction of CAN XL configuration.
Geert Uytterhoeven's patch updates the CAN networking documentation.
The last 2 patched are by Davide Caratti and introduce skb drop
reasons in the receive path of several CAN protocols.
* tag 'linux-can-next-for-6.17-20250610' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can-next:
can: add drop reasons in CAN protocols receive path
can: add drop reasons in the receive path of AF_CAN
documentation: networking: can: Document alloc_candev_mqs()
can: netlink: can_changelink(): rename tdc_mask into fd_tdc_flag_provided
can: bittiming: rename can_tdc_is_enabled() into can_fd_tdc_is_enabled()
can: bittiming: rename CAN_CTRLMODE_TDC_MASK into CAN_CTRLMODE_FD_TDC_MASK
can: netlink: replace tabulation by space in assignment
====================
Link: https://patch.msgid.link/20250610094933.1593081-1-mkl@pengutronix.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
sk->sk_prot->sock_is_readable is a valid function pointer when sk resides
in a sockmap. After the last sk_psock_put() (which usually happens when
socket is removed from sockmap), sk->sk_prot gets restored and
sk->sk_prot->sock_is_readable becomes NULL.
This makes sk_is_readable() racy, if the value of sk->sk_prot is reloaded
after the initial check. Which in turn may lead to a null pointer
dereference.
Ensure the function pointer does not turn NULL after the check.
Fixes: 8934ce2fd0 ("bpf: sockmap redirect ingress support")
Suggested-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Michal Luczaj <mhal@rbox.co>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20250609-skisreadable-toctou-v1-1-d0dfb2d62c37@rbox.co
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Add a new function, netif_subqueue_sent, which is a wrapper for
netdev_tx_sent_queue.
Drivers that use the subqueue variant macros, netif_subqueue_xxx,
identify queue by index and are not required to obtain
struct netdev_queue explicitly.
Such drivers still need to call netdev_tx_sent_queue which is a
counterpart of netif_subqueue_completed_wake. Allowing drivers to use a
subqueue variant for this purpose improves their code consistency by
always referring to queue by its index.
Signed-off-by: Gur Stavi <gur.stavi@huawei.com>
Link: https://patch.msgid.link/909a5c92db49cad39f0954d6cb86775e6480ef4c.1749038081.git.gur.stavi@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This function was introduced in commit 783da70e83 ("net: add
sock_enable_timestamps"), with one caller in rxrpc.
That only caller was removed in commit 7903d4438b ("rxrpc: Don't use
received skbuff timestamps").
Signed-off-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
Link: https://patch.msgid.link/20250609153254.3504909-1-willemdebruijn.kernel@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Add support for reporting additional hardware counters for drop and
TC using the ethtool -S interface.
These counters include:
- Aggregate Rx/Tx drop counters
- Per-TC Rx/Tx packet counters
- Per-TC Rx/Tx byte counters
- Per-TC Rx/Tx pause frame counters
The counters are exposed using ethtool_ops->get_ethtool_stats and
ethtool_ops->get_strings. This feature/counters are not available
to all versions of hardware.
Signed-off-by: Dipayaan Roy <dipayanroy@linux.microsoft.com>
Reviewed-by: Subbaraya Sundeep <sbhatta@marvell.com>
Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
Link: https://patch.msgid.link/20250609100103.GA7102@linuxonhyperv3.guj3yctzbm1etfxqx2vob5hsef.xx.internal.cloudapp.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Besides the existing pr_warn_once(), use skb drop reasons in case AF_CAN
layer drops non-conformant CAN{,FD,XL} frames, or conformant frames
received by "wrong" devices, so that it's possible to debug (and count)
such events using existing tracepoints:
| # perf record -e skb:kfree_skb -aR -- ./drv/canfdtest -v -g -l 1 vcan0
| # perf script
| [...]
| canfdtest 1123 [000] 3893.271264: skb:kfree_skb: skbaddr=0xffff975703c9f700 rx_sk=(nil) protocol=12 location=can_rcv+0x4b reason: CAN_RX_INVALID_FRAME
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Link: https://patch.msgid.link/20250604160605.1005704-2-dcaratti@redhat.com
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
- MGMT: Fix UAF on mgmt_remove_adv_monitor_complete
- MGMT: Protect mgmt_pending list with its own lock
- hci_core: fix list_for_each_entry_rcu usage
- btintel_pcie: Increase the tx and rx descriptor count
- btintel_pcie: Reduce driver buffer posting to prevent race condition
- btintel_pcie: Fix driver not posting maximum rx buffers
-----BEGIN PGP SIGNATURE-----
iQJNBAABCgA3FiEE7E6oRXp8w05ovYr/9JCA4xAyCykFAmhB65gZHGx1aXoudm9u
LmRlbnR6QGludGVsLmNvbQAKCRD0kIDjEDILKXuaEACPXWNUOViPFPE85M1Y/VGA
Hw4uDO9x25XySBk740NRT3qkYS8pWZa8SujQZa0ijqklrggosnz3q7QdwiRow5Cv
CLqCZiuQDtekXV8K9xa66K8rt2iUxMDnQRzNW32Pe0OW6Xy2RFiYqC7ZVpFomXBj
2vMj+aNRwbdzvKStEQTxWCISdCkP7XSuOdWS/wnAFyiSThgr4R8PByLQZ9P2J5xj
KfLBs+QzwHCc1hGbO7odTVqyv+UN3v82aN2fmyusdgBYBJ9ymLMV1gpBm/B4oGI7
/zXbU9bZWL+uis+pB3k9MQnaytc32v1ODFyqY8Ua1slE4Qzwz7OKB/8TP9MeOO1s
MzzIYuAK2KJ6C5mxyIBRVMcbdX2GgiwVIXJBWesuqoZc0H1En+eSpoKNzfoX16Ul
hMc8pCfvpKXaqo9KOJMldr5Yg4iKV83Am7zNUB1ka6TymM8NUx56gbF50tYDlOXY
TGYpli8OQF4x5/tWRh9AE+DxgYa4sVrDiQncvnSMlmlyBGf/wCczCjaFwRlGM9Wu
MZPi2zm0lwa1F6T358uOyJRbcFawaV39AGHo37SrCFOvPIKC+c6iTYqLHWLeq6V6
mXlUn4BrTrt7TUqFpBIUcN0LOOLKgxr7Oa8UAhhCfn8LLsFvryuTEbNtxOqvFLQP
4ZUyJFMjUnVAr5PMPjyJ3w==
=VZN1
-----END PGP SIGNATURE-----
Merge tag 'for-net-2025-06-05' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth
Luiz Augusto von Dentz says:
====================
bluetooth pull request for net:
- MGMT: Fix UAF on mgmt_remove_adv_monitor_complete
- MGMT: Protect mgmt_pending list with its own lock
- hci_core: fix list_for_each_entry_rcu usage
- btintel_pcie: Increase the tx and rx descriptor count
- btintel_pcie: Reduce driver buffer posting to prevent race condition
- btintel_pcie: Fix driver not posting maximum rx buffers
* tag 'for-net-2025-06-05' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth:
Bluetooth: MGMT: Protect mgmt_pending list with its own lock
Bluetooth: MGMT: Fix UAF on mgmt_remove_adv_monitor_complete
Bluetooth: btintel_pcie: Reduce driver buffer posting to prevent race condition
Bluetooth: btintel_pcie: Increase the tx and rx descriptor count
Bluetooth: btintel_pcie: Fix driver not posting maximum rx buffers
Bluetooth: hci_core: fix list_for_each_entry_rcu usage
====================
Link: https://patch.msgid.link/20250605191136.904411-1-luiz.dentz@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Current release - regressions:
- Revert "kunit: configs: Enable CONFIG_INIT_STACK_ALL_PATTERN
in all_tests", makes kunit error out if compiler is old
- wifi: iwlwifi: mvm: fix assert on suspend
- rxrpc: fix return from none_validate_challenge()
Current release - new code bugs:
- ovpn: couple of fixes for socket cleanup and UDP-tunnel teardown
- can: kvaser_pciefd: refine error prone echo_skb_max handling logic
- fix net_devmem_bind_dmabuf() stub when DEVMEM not compiled
- eth: airoha: fixes for config / accel in bridge mode
Previous releases - regressions:
- Bluetooth: hci_qca: move the SoC type check to the right place,
fix GPIO integration
- prevent a NULL deref in rtnl_create_link() after locking changes
- fix udp gso skb_segment after pull from frag_list
- hv_netvsc: fix potential deadlock in netvsc_vf_setxdp()
Previous releases - always broken:
- netfilter:
- nf_nat: also check reverse tuple to obtain clashing entry
- nf_set_pipapo_avx2: fix initial map fill (zeroing)
- fix the helper for incremental update of packet checksums after
modifying the IP address, used by ILA and BPF
- eth: stmmac: prevent div by 0 when clock rate is misconfigured
- eth: ice: fix Tx scheduler handling of XDP and changing queue count
- eth: b53: fix support for the RGMII interface when delays configured
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmhBv5kACgkQMUZtbf5S
Irs/DA/+PIh7a33iVcsGIcmWtpnGp+18id1tSLnYGUGx1cW6zxutPD8rb6BsAN84
KR+XVsbMDUehIa10xPoF2L5mX5YujEiPSkjP8eE2KJKDLGpDtYNOyOWKT21yudnd
4EVF5JQoEbWHrkHMKF97tla84QLd5fFtgsvejVeZtQYSIDOteNGfra4Jly8iiR+J
i9k+HdB0CNEKVvvibQZjZ5CrkpmdNPmB9UoJ59bG15q2+vXdzOPm/CCNo//9ZQJB
I8O40nu16msRRVA9nc2V/Tp98fTk9dnDpTSyWiBlNCut9g9ftx456Ew+tjobMRIT
yeh+q9+1z3YHjGJB8P1FGmMZWK3tbrwyqjFGqpSjr7juucFok9kxAaRPqrQxga7H
Yxq3RegeNqukEAV39ZE14TL765Jy+XXF1uTHhNBkUADlNJVKnZygSk78/Ut2nDvQ
vkfoto+CfKny5qkSbTk8KKv1rZu3xwewoOjlcdkHlOBoouCjPOxTC7yxTZgUZB5c
yap0jQsedJct4OAA+O7IGLCmf3KrJ0H32HbWEY68mpTEd+4Df5vAWiIi7vmVJmk3
DX9JWmu5A5yjNMhOEsBQU98gkNw366aA/E8dr+lEfp3AoqDrmdbG3l8+qqhqYnb+
nnL1sNiQH1griZwQBUROAhrtXnYlYsAsZi+cv23Q0hQiGIvIC2Q=
=sRQt
-----END PGP SIGNATURE-----
Merge tag 'net-6.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Jakub Kicinski:
"Including fixes from CAN, wireless, Bluetooth, and Netfilter.
Current release - regressions:
- Revert "kunit: configs: Enable CONFIG_INIT_STACK_ALL_PATTERN in
all_tests", makes kunit error out if compiler is old
- wifi: iwlwifi: mvm: fix assert on suspend
- rxrpc: fix return from none_validate_challenge()
Current release - new code bugs:
- ovpn: couple of fixes for socket cleanup and UDP-tunnel teardown
- can: kvaser_pciefd: refine error prone echo_skb_max handling logic
- fix net_devmem_bind_dmabuf() stub when DEVMEM not compiled
- eth: airoha: fixes for config / accel in bridge mode
Previous releases - regressions:
- Bluetooth: hci_qca: move the SoC type check to the right place, fix
GPIO integration
- prevent a NULL deref in rtnl_create_link() after locking changes
- fix udp gso skb_segment after pull from frag_list
- hv_netvsc: fix potential deadlock in netvsc_vf_setxdp()
Previous releases - always broken:
- netfilter:
- nf_nat: also check reverse tuple to obtain clashing entry
- nf_set_pipapo_avx2: fix initial map fill (zeroing)
- fix the helper for incremental update of packet checksums after
modifying the IP address, used by ILA and BPF
- eth:
- stmmac: prevent div by 0 when clock rate is misconfigured
- ice: fix Tx scheduler handling of XDP and changing queue count
- eth: fix support for the RGMII interface when delays configured"
* tag 'net-6.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (76 commits)
calipso: unlock rcu before returning -EAFNOSUPPORT
seg6: Fix validation of nexthop addresses
net: prevent a NULL deref in rtnl_create_link()
net: annotate data-races around cleanup_net_task
selftests: drv-net: tso: make bkg() wait for socat to quit
selftests: drv-net: tso: fix the GRE device name
selftests: drv-net: add configs for the TSO test
wireguard: device: enable threaded NAPI
netlink: specs: rt-link: decode ip6gre
netlink: specs: rt-link: add missing byte-order properties
net: wwan: mhi_wwan_mbim: use correct mux_id for multiplexing
wifi: cfg80211/mac80211: correctly parse S1G beacon optional elements
net: dsa: b53: do not touch DLL_IQQD on bcm53115
net: dsa: b53: allow RGMII for bcm63xx RGMII ports
net: dsa: b53: do not configure bcm63xx's IMP port interface
net: dsa: b53: do not enable RGMII delay on bcm63xx
net: dsa: b53: do not enable EEE on bcm63xx
net: ti: icssg-prueth: Fix swapped TX stats for MII interfaces.
selftests: netfilter: nft_nat.sh: add test for reverse clash with nat
netfilter: nf_nat: also check reverse tuple to obtain clashing entry
...
During ILA address translations, the L4 checksums can be handled in
different ways. One of them, adj-transport, consist in parsing the
transport layer and updating any found checksum. This logic relies on
inet_proto_csum_replace_by_diff and produces an incorrect skb->csum when
in state CHECKSUM_COMPLETE.
This bug can be reproduced with a simple ILA to SIR mapping, assuming
packets are received with CHECKSUM_COMPLETE:
$ ip a show dev eth0
14: eth0@if15: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
link/ether 62:ae:35:9e:0f:8d brd ff:ff:ff:ff:ff:ff link-netnsid 0
inet6 3333:0:0:1::c078/64 scope global
valid_lft forever preferred_lft forever
inet6 fd00:10:244:1::c078/128 scope global nodad
valid_lft forever preferred_lft forever
inet6 fe80::60ae:35ff:fe9e:f8d/64 scope link proto kernel_ll
valid_lft forever preferred_lft forever
$ ip ila add loc_match fd00:10:244:1 loc 3333:0:0:1 \
csum-mode adj-transport ident-type luid dev eth0
Then I hit [fd00:10:244:1::c078]:8000 with a server listening only on
[3333:0:0:1::c078]:8000. With the bug, the SYN packet is dropped with
SKB_DROP_REASON_TCP_CSUM after inet_proto_csum_replace_by_diff changed
skb->csum. The translation and drop are visible on pwru [1] traces:
IFACE TUPLE FUNC
eth0:9 [fd00:10:244:3::3d8]:51420->[fd00:10:244:1::c078]:8000(tcp) ipv6_rcv
eth0:9 [fd00:10:244:3::3d8]:51420->[fd00:10:244:1::c078]:8000(tcp) ip6_rcv_core
eth0:9 [fd00:10:244:3::3d8]:51420->[fd00:10:244:1::c078]:8000(tcp) nf_hook_slow
eth0:9 [fd00:10:244:3::3d8]:51420->[fd00:10:244:1::c078]:8000(tcp) inet_proto_csum_replace_by_diff
eth0:9 [fd00:10:244:3::3d8]:51420->[3333:0:0:1::c078]:8000(tcp) tcp_v6_early_demux
eth0:9 [fd00:10:244:3::3d8]:51420->[3333:0:0:1::c078]:8000(tcp) ip6_route_input
eth0:9 [fd00:10:244:3::3d8]:51420->[3333:0:0:1::c078]:8000(tcp) ip6_input
eth0:9 [fd00:10:244:3::3d8]:51420->[3333:0:0:1::c078]:8000(tcp) ip6_input_finish
eth0:9 [fd00:10:244:3::3d8]:51420->[3333:0:0:1::c078]:8000(tcp) ip6_protocol_deliver_rcu
eth0:9 [fd00:10:244:3::3d8]:51420->[3333:0:0:1::c078]:8000(tcp) raw6_local_deliver
eth0:9 [fd00:10:244:3::3d8]:51420->[3333:0:0:1::c078]:8000(tcp) ipv6_raw_deliver
eth0:9 [fd00:10:244:3::3d8]:51420->[3333:0:0:1::c078]:8000(tcp) tcp_v6_rcv
eth0:9 [fd00:10:244:3::3d8]:51420->[3333:0:0:1::c078]:8000(tcp) __skb_checksum_complete
eth0:9 [fd00:10:244:3::3d8]:51420->[3333:0:0:1::c078]:8000(tcp) kfree_skb_reason(SKB_DROP_REASON_TCP_CSUM)
eth0:9 [fd00:10:244:3::3d8]:51420->[3333:0:0:1::c078]:8000(tcp) skb_release_head_state
eth0:9 [fd00:10:244:3::3d8]:51420->[3333:0:0:1::c078]:8000(tcp) skb_release_data
eth0:9 [fd00:10:244:3::3d8]:51420->[3333:0:0:1::c078]:8000(tcp) skb_free_head
eth0:9 [fd00:10:244:3::3d8]:51420->[3333:0:0:1::c078]:8000(tcp) kfree_skbmem
This is happening because inet_proto_csum_replace_by_diff is updating
skb->csum when it shouldn't. The L4 checksum is updated such that it
"cancels" the IPv6 address change in terms of checksum computation, so
the impact on skb->csum is null.
Note this would be different for an IPv4 packet since three fields
would be updated: the IPv4 address, the IP checksum, and the L4
checksum. Two would cancel each other and skb->csum would still need
to be updated to take the L4 checksum change into account.
This patch fixes it by passing an ipv6 flag to
inet_proto_csum_replace_by_diff, to skip the skb->csum update if we're
in the IPv6 case. Note the behavior of the only other user of
inet_proto_csum_replace_by_diff, the BPF subsystem, is left as is in
this patch and fixed in the subsequent patch.
With the fix, using the reproduction from above, I can confirm
skb->csum is not touched by inet_proto_csum_replace_by_diff and the TCP
SYN proceeds to the application after the ILA translation.
Link: https://github.com/cilium/pwru [1]
Fixes: 65d7ab8de5 ("net: Identifier Locator Addressing module")
Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://patch.msgid.link/b5539869e3550d46068504feb02d37653d939c0b.1748509484.git.paul.chaignon@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Usual collection of driver fixes:
- Small bug fixes and cleansup in hfi, hns, rxe, mlx5, mana siw
- Further ODP functionality in rxe
- Remote access MRs in mana, along with more page sizes
- Improve CM scalability with a rwlock around the agent
- More trace points for hns
- ODP hmm conversion to the new two step dma API
- Support the ethernet HW device in mana as well as the RNIC
- Cleanups:
* Use secs_to_jiffies() when appropriate
* Use ERR_CAST() instead of naked casts
* Don't use %pK in printk
* Unusued functions removed
* Allocation type matching
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRRRCHOFoQz/8F5bUaFwuHvBreFYQUCaDm95gAKCRCFwuHvBreF
YXJxAQCZ+p+mxt0rTeVI2j6YQ26thuvb/tH0Upu8epgdQ3T/ZgD/YOHBC6OrXWJa
Uz6BTiyz/xiyMtJLTD4kEiG2o74J1gE=
=DNQC
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma
Pull rdma updates from Jason Gunthorpe:
"Usual collection of driver fixes:
- Small bug fixes and cleansup in hfi, hns, rxe, mlx5, mana siw
- Further ODP functionality in rxe
- Remote access MRs in mana, along with more page sizes
- Improve CM scalability with a rwlock around the agent
- More trace points for hns
- ODP hmm conversion to the new two step dma API
- Support the ethernet HW device in mana as well as the RNIC
- Cleanups:
- Use secs_to_jiffies() when appropriate
- Use ERR_CAST() instead of naked casts
- Don't use %pK in printk
- Unusued functions removed
- Allocation type matching"
* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (57 commits)
RDMA/cma: Fix hang when cma_netevent_callback fails to queue_work
RDMA/bnxt_re: Support extended stats for Thor2 VF
RDMA/hns: Fix endian issue in trace events
RDMA/mlx5: Avoid flexible array warning
IB/cm: Remove dead code and adjust naming
RDMA/core: Avoid hmm_dma_map_alloc() for virtual DMA devices
RDMA/rxe: Break endless pagefault loop for RO pages
RDMA/bnxt_re: Fix return code of bnxt_re_configure_cc
RDMA/bnxt_re: Fix missing error handling for tx_queue
RDMA/bnxt_re: Fix incorrect display of inactivity_cp in debugfs output
RDMA/mlx5: Add support for 200Gbps per lane speeds
RDMA/mlx5: Remove the redundant MLX5_IB_STAGE_UAR stage
RDMA/iwcm: Fix use-after-free of work objects after cm_id destruction
net: mana: Add support for auxiliary device servicing events
RDMA/mana_ib: unify mana_ib functions to support any gdma device
RDMA/mana_ib: Add support of mana_ib for RNIC and ETH nic
net: mana: Probe rdma device in mana driver
RDMA/siw: replace redundant ternary operator with just rv
RDMA/umem: Separate implicit ODP initialization from explicit ODP
RDMA/core: Convert UMEM ODP DMA mapping to caching IOVA and page linkage
...
To support Multi Vports on Bare metal, increase the device config response
version. And, skip the register HW vport, and register filter steps, when
the Bare metal hostmode is set.
Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
Link: https://patch.msgid.link/1747671636-5810-1-git-send-email-haiyangz@microsoft.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
sctp_do_peeloff is only used inside of net/sctp/socket.c,
so mark it static.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Xin Long <lucien.xin@gmail.com>
Link: https://patch.msgid.link/20250526054745.2329201-1-hch@lst.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Lingering should be transport-independent in the long run. In preparation
for supporting other transports, as well as the linger on shutdown(), move
code to core.
Generalize by querying vsock_transport::unsent_bytes(), guard against the
callback being unimplemented. Do not pass sk_lingertime explicitly. Pull
SOCK_LINGER check into vsock_linger().
Flatten the function. Remove the nested block by inverting the condition:
return early on !timeout.
Suggested-by: Stefano Garzarella <sgarzare@redhat.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Michal Luczaj <mhal@rbox.co>
Link: https://patch.msgid.link/20250522-vsock-linger-v6-2-2ad00b0e447e@rbox.co
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
-----BEGIN PGP SIGNATURE-----
iQFSBAABCgA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAmgzoyMeHHRvcnZhbGRz
QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiG0cEIAJrO2lKaFN4fbv6G
FQTHQF1soicGpak3yY9u1o5LCqEIzjW2ScxcKG+dl7FcXsaZYcyg4HNzxbV9l/rr
Ck2qZh3CCkVem0/nEsOJwYbNYKnq+pM5h1jIwn/LUkRuV55s5K5oRHzRj673BEj5
BLaRFivZ1t4eM64EqbU1ut11/VEAkr2GcB01forHDeuWwoa3p6DfmALo7X/U43Vg
FN2hp/3PPfiU6PwoCxQlmMpHNFkoZOHpi8P8Qm+mu0MQI12QrUC1Riib4EkrwEEv
a28F4Au+TIjLceRdi6Ss/rhTC71usQIQ2OnnmHBUeYgdwHRXHgfewhtQDUKTU0MR
OwKECbY=
=skuS
-----END PGP SIGNATURE-----
Merge tag 'v6.15' into rdma.git for-next
Following patches need the RDMA rc branch since we are past the RC cycle
now.
Merge conflicts resolved based on Linux-next:
- For RXE odp changes keep for-next version and fixup new places that
need to call is_odp_mr()
https://lore.kernel.org/r/20250422143019.500201bd@canb.auug.org.auhttps://lore.kernel.org/r/20250514122455.3593b083@canb.auug.org.au
- irdma is keeping the while/kfree bugfix from -rc and the pf/cdev_info
change from for-next
https://lore.kernel.org/r/20250513130630.280ee6c5@canb.auug.org.au
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEjF9xRqF1emXiQiqU1w0aZmrPKyEFAmgwd00ACgkQ1w0aZmrP
KyEfwA//RXQ3i8PCa7lKHxDRhVzG3rEvgXRmiXeNd+JjzsCnybBb7+wRf3dtBGWT
+1s44Utx1JqosWxCVBulqYC5bqSC66789l5X2jhYJmUZxRrbcsqPngwnIrjb/XeK
ZJM62wiRhkBQED7yZLGy+y4VHQiG8CEMt16AOQHk863aruWv1tT7up90CTtzA545
4GF/grU3FC0PsoTLwzWyvqsWK+9uk3Y4Tifp5hU3w6uRD9EjX5tHCZlXXSqOF5gu
KT26OYsePYXhJVZIwDf2oVLGi0EVTPB9IFxZSNgLqyXqu2ILAb9OwRNVTNfTP7Pg
1RWJWmgqvRNs9OM2ecifYgQf/AfvCL0Cja1BJOjmvtICuGegrYH7G5YYQsMl9CoE
7jBoTzpToSASat5+dwoz81Bvzh447dYxRE2VmbxmRTTWToQYS1KGBPc9e3u/n5Rr
ruh8tRZ3/R0Fy+YLDkrJst3grh5RLITbuyu4ElJMArPU50mLTVYxKd6nA3BqwB5G
1GmLfCzvQH3e6PKz6CNke1AytVDy/wLTXtcbLnze2Muaj4AqhtOe5Q8ypnOO0Vyk
PsJ6U3rm2asd3GE9+AIx8gZBv8yCu1w9CiwLK8ybT2NETb2dEnqPgWeDyT7rpcaD
sQOPsBE1q/TEp9gofbYCHBm5E2mX9UP7Q6EHCTekrI97xLq8Q2M=
=fBhd
-----END PGP SIGNATURE-----
Merge tag 'nf-next-25-05-23' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next
Pablo Neira Ayuso says:
====================
Netfilter updates for net-next
The following batch contains Netfilter updates for net-next,
specifically 26 patches: 5 patches adding/updating selftests,
4 fixes, 3 PREEMPT_RT fixes, and 14 patches to enhance nf_tables):
1) Improve selftest coverage for pipapo 4 bit group format, from
Florian Westphal.
2) Fix incorrect dependencies when compiling a kernel without
legacy ip{6}tables support, also from Florian.
3) Two patches to fix nft_fib vrf issues, including selftest updates
to improve coverage, also from Florian Westphal.
4) Fix incorrect nesting in nft_tunnel's GENEVE support, from
Fernando F. Mancera.
5) Three patches to fix PREEMPT_RT issues with nf_dup infrastructure
and nft_inner to match in inner headers, from Sebastian Andrzej Siewior.
6) Integrate conntrack information into nft trace infrastructure,
from Florian Westphal.
7) A series of 13 patches to allow to specify wildcard netdevice in
netdev basechain and flowtables, eg.
table netdev filter {
chain ingress {
type filter hook ingress devices = { eth0, eth1, vlan* } priority 0; policy accept;
}
}
This also allows for runtime hook registration on NETDEV_{UN}REGISTER
event, from Phil Sutter.
netfilter pull request 25-05-23
* tag 'nf-next-25-05-23' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next: (26 commits)
selftests: netfilter: Torture nftables netdev hooks
netfilter: nf_tables: Add notifications for hook changes
netfilter: nf_tables: Support wildcard netdev hook specs
netfilter: nf_tables: Sort labels in nft_netdev_hook_alloc()
netfilter: nf_tables: Handle NETDEV_CHANGENAME events
netfilter: nf_tables: Wrap netdev notifiers
netfilter: nf_tables: Respect NETDEV_REGISTER events
netfilter: nf_tables: Prepare for handling NETDEV_REGISTER events
netfilter: nf_tables: Have a list of nf_hook_ops in nft_hook
netfilter: nf_tables: Pass nf_hook_ops to nft_unregister_flowtable_hook()
netfilter: nf_tables: Introduce nft_register_flowtable_ops()
netfilter: nf_tables: Introduce nft_hook_find_ops{,_rcu}()
netfilter: nf_tables: Introduce functions freeing nft_hook objects
netfilter: nf_tables: add packets conntrack state to debug trace info
netfilter: conntrack: make nf_conntrack_id callable without a module dependency
netfilter: nf_dup_netdev: Move the recursion counter struct netdev_xmit
netfilter: nft_inner: Use nested-BH locking for nft_pcpu_tun_ctx
netfilter: nf_dup{4, 6}: Move duplication check to task_struct
netfilter: nft_tunnel: fix geneve_opt dump
selftests: netfilter: nft_fib.sh: add type and oif tests with and without VRFs
...
====================
Link: https://patch.msgid.link/20250523132712.458507-1-pablo@netfilter.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEH7ZpcWbFyOOp6OJbrB3Eaf9PW7cFAmgwJa4ACgkQrB3Eaf9P
W7d34A//V3NukN6UNAUKd+MbH80eXCEbNSNIuVUstfr0S71qTCxovLX58u+oQztb
43mx/NsnF38TzNFWVyVzF4vcr/n0DS/Da3P5pJEjoewIYSDrz/WfOum6VpVIUsZ/
kLCDZlIoX/fBPFZDPHMmsDXDemAdrtr8CuK72NUH10vKDuGKSUG0NElqDieDBEsA
y/fqgBsyxQXi9cMdRxf+DLDK/hzqyaJmVj8B8WEcFtYXJ4RE6+jfLgAaTE6J7V5W
fYACTu/IcdtgEEm2U7wlow66oIjqqGReuWUzV9zHGJNCB9+da6L4dbGtzlRmOPdn
kI1PIALFWT2HbKnJOJJbaThO6zES1rMOm3PsWt7iVewCT8HuhAa9kDV0xzdcLQE1
+REfo8dXW9f5hRUrSuqpJFUArkCHWHLhQEcmTHaF0b2RveC/hd9rOyKIfae+fgIP
5uLU2DpwafDgw5UCjsQTLyQ5M6icO8wFgM7vKAUJWyI1Pck1ktf7Ic6+KQRNjWiv
Q7ImwpSdLH2bZpIbIKDnIcyZg3CMBIQ88cdsYi0+ckgDQ0hMf6ZrXRseXKRe0P/M
gKgBOoXIJBF7niJQTDqHjsmnYGvvhZysIJNQLf4BZFYOeF5L9OduP6ywqMe5pFKt
QAsJSZw/+SibheLEYQAzvyLD6VdMXaxeOAHlPylRRpl9vEX0l04=
=GRVJ
-----END PGP SIGNATURE-----
Merge tag 'ipsec-next-2025-05-23' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next
Steffen Klassert says:
====================
1) Remove some unnecessary strscpy_pad() size arguments.
From Thorsten Blum.
2) Correct use of xso.real_dev on bonding offloads.
Patchset from Cosmin Ratiu.
3) Add hardware offload configuration to XFRM_MSG_MIGRATE.
From Chiachang Wang.
4) Refactor migration setup during cloning. This was
done after the clone was created. Now it is done
in the cloning function itself.
From Chiachang Wang.
5) Validate assignment of maximal possible SEQ number.
Prevent from setting to the maximum sequrnce number
as this would cause for traffic drop.
From Leon Romanovsky.
6) Prevent configuration of interface index when offload
is used. Hardware can't handle this case.i
From Leon Romanovsky.
7) Always use kfree_sensitive() for SA secret zeroization.
From Zilin Guan.
ipsec-next-2025-05-23
* tag 'ipsec-next-2025-05-23' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next:
xfrm: use kfree_sensitive() for SA secret zeroization
xfrm: prevent configuration of interface index when offload is used
xfrm: validate assignment of maximal possible SEQ number
xfrm: Refactor migration setup during the cloning process
xfrm: Migrate offload configuration
bonding: Fix multiple long standing offload races
bonding: Mark active offloaded xfrm_states
xfrm: Add explicit dev to .xdo_dev_state_{add,delete,free}
xfrm: Remove unneeded device check from validate_xmit_xfrm
xfrm: Use xdo.dev instead of xdo.real_dev
net/mlx5: Avoid using xso.real_dev unnecessarily
xfrm: Remove unnecessary strscpy_pad() size arguments
====================
Link: https://patch.msgid.link/20250523075611.3723340-1-steffen.klassert@secunet.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Replace kfree_skb() used in neigh_resolve_output() and
neigh_connected_output() with kfree_skb_reason().
Following new skb drop reason is added:
/* failed to fill the device hard header */
SKB_DROP_REASON_NEIGH_HH_FILLFAIL
Signed-off-by: Qiu Yutan <qiu.yutan@zte.com.cn>
Signed-off-by: Jiang Kun <jiang.kun2@zte.com.cn>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Xu Xin <xu.xin16@zte.com.cn>
Signed-off-by: David S. Miller <davem@davemloft.net>
Notify user space if netdev hooks are updated due to netdev add/remove
events. Send minimal notification messages by introducing
NFT_MSG_NEWDEV/DELDEV message types describing a single device only.
Upon NETDEV_CHANGENAME, the callback has no information about the
interface's old name. To provide a clear message to user space, include
the hook's stored interface name in the notification.
Signed-off-by: Phil Sutter <phil@nwl.cc>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Supporting a 1:n relationship between nft_hook and nf_hook_ops is
convenient since a chain's or flowtable's nft_hooks may remain in place
despite matching interfaces disappearing. This stabilizes ruleset dumps
in that regard and opens the possibility to claim newly added interfaces
which match the spec. Also it prepares for wildcard interface specs
since these will potentially match multiple interfaces.
All spots dealing with hook registration are updated to handle a list of
multiple nf_hook_ops, but nft_netdev_hook_alloc() only adds a single
item for now to retain the old behaviour. The only expected functional
change here is how vanishing interfaces are handled: Instead of dropping
the respective nft_hook, only the matching nf_hook_ops are dropped.
To safely remove individual ops from the list in netdev handlers, an
rcu_head is added to struct nf_hook_ops so kfree_rcu() may be used.
There is at least nft_flowtable_find_dev() which may be iterating
through the list at the same time.
Signed-off-by: Phil Sutter <phil@nwl.cc>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Also a pretty dull wrapper around the hook->ops.dev comparison for now.
Will search the embedded nf_hook_ops list in future. The ugly cast to
eliminate the const qualifier will vanish then, too.
Since this future list will be RCU-protected, also introduce an _rcu()
variant here.
Signed-off-by: Phil Sutter <phil@nwl.cc>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
fib has two modes:
1. Obtain output device according to source or destination address
2. Obtain the type of the address, e.g. local, unicast, multicast.
'fib daddr type' should return 'local' if the address is configured
in this netns or unicast otherwise.
'fib daddr . iif type' should return 'local' if the address is configured
on the input interface or unicast otherwise, i.e. more restrictive.
However, if the interface is part of a VRF, then 'fib daddr type'
returns unicast even if the address is configured on the incoming
interface.
This is broken for both ipv4 and ipv6.
In the ipv4 case, inet_dev_addr_type must only be used if the
'iif' or 'oif' (strict mode) was requested.
Else inet_addr_type_dev_table() needs to be used and the correct
dev argument must be passed as well so the correct fib (vrf) table
is used.
In the ipv6 case, the bug is similar, without strict mode, dev is NULL
so .flowi6_l3mdev will be set to 0.
Add a new 'nft_fib_l3mdev_master_ifindex_rcu()' helper and use that
to init the .l3mdev structure member.
For ipv6, use it from nft_fib6_flowi_init() which gets called from
both the 'type' and the 'route' mode eval functions.
This provides consistent behaviour for all modes for both ipv4 and ipv6:
If strict matching is requested, the input respectively output device
of the netfilter hooks is used.
Otherwise, use skb->dev to obtain the l3mdev ifindex.
Without this, most type checks in updated nft_fib.sh selftest fail:
FAIL: did not find veth0 . 10.9.9.1 . local in fibtype4
FAIL: did not find veth0 . dead:1::1 . local in fibtype6
FAIL: did not find veth0 . dead:9::1 . local in fibtype6
FAIL: did not find tvrf . 10.0.1.1 . local in fibtype4
FAIL: did not find tvrf . 10.9.9.1 . local in fibtype4
FAIL: did not find tvrf . dead:1::1 . local in fibtype6
FAIL: did not find tvrf . dead:9::1 . local in fibtype6
FAIL: fib expression address types match (iif in vrf)
(fib errounously returns 'unicast' for all of them, even
though all of these addresses are local to the vrf).
Fixes: f6d0cbcf09 ("netfilter: nf_tables: add fib expression")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
As long as recvmsg() or recvmmsg() is used with cmsg, it is not
possible to avoid receiving file descriptors via SCM_RIGHTS.
This behaviour has occasionally been flagged as problematic, as
it can be (ab)used to trigger DoS during close(), for example, by
passing a FUSE-controlled fd or a hung NFS fd.
For instance, as noted on the uAPI Group page [0], an untrusted peer
could send a file descriptor pointing to a hung NFS mount and then
close it. Once the receiver calls recvmsg() with msg_control, the
descriptor is automatically installed, and then the responsibility
for the final close() now falls on the receiver, which may result
in blocking the process for a long time.
Regarding this, systemd calls cmsg_close_all() [1] after each
recvmsg() to close() unwanted file descriptors sent via SCM_RIGHTS.
However, this cannot work around the issue at all, because the final
fput() may still occur on the receiver's side once sendmsg() with
SCM_RIGHTS succeeds. Also, even filtering by LSM at recvmsg() does
not work for the same reason.
Thus, we need a better way to refuse SCM_RIGHTS at sendmsg().
Let's introduce SO_PASSRIGHTS to disable SCM_RIGHTS.
Note that this option is enabled by default for backward
compatibility.
Link: https://uapi-group.org/kernel-features/#disabling-reception-of-scm_rights-for-af_unix-sockets #[0]
Link: https://github.com/systemd/systemd/blob/v257.5/src/basic/fd-util.c#L612-L628 #[1]
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
As explained in the next patch, SO_PASSRIGHTS would have a problem
if we assigned a corresponding bit to socket->flags, so it must be
managed in struct sock.
Mixing socket->flags and sk->sk_flags for similar options will look
confusing, and sk->sk_flags does not have enough space on 32bit system.
Also, as mentioned in commit 16e5726269 ("af_unix: dont send
SCM_CREDENTIALS by default"), SOCK_PASSCRED and SOCK_PASSPID handling
is known to be slow, and managing the flags in struct socket cannot
avoid that for embryo sockets.
Let's move SOCK_PASS{CRED,PIDFD,SEC} to struct sock.
While at it, other SOCK_XXX flags in net.h are grouped as enum.
Note that assign_bit() was atomic, so the writer side is moved down
after lock_sock() in setsockopt(), but the bit is only read once
in sendmsg() and recvmsg(), so lock_sock() is not needed there.
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
SCM_CREDENTIALS and SCM_SECURITY can be recv()ed by calling
scm_recv() or scm_recv_unix(), and SCM_PIDFD is only used by
scm_recv_unix().
scm_recv() is called from AF_NETLINK and AF_BLUETOOTH.
scm_recv_unix() is literally called from AF_UNIX.
Let's restrict SO_PASSCRED and SO_PASSSEC to such sockets and
SO_PASSPIDFD to AF_UNIX only.
Later, SOCK_PASS{CRED,PIDFD,SEC} will be moved to struct sock
and united with another field.
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
scm_recv() has been placed in scm.h since the pre-git era for no
particular reason (I think), which makes the file really fragile.
For example, when you move SOCK_PASSCRED from include/linux/net.h to
enum sock_flags in include/net/sock.h, you will see weird build failure
due to terrible dependency.
To avoid the build failure in the future, let's move scm_recv(_unix())?
and its callees to scm.c.
Note that only scm_recv() needs to be exported for Bluetooth.
scm_send() should be moved to scm.c too, but I'll revisit later.
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Cross-merge networking fixes after downstream PR (net-6.15-rc8).
Conflicts:
80f2ab46c2 ("irdma: free iwdev->rf after removing MSI-X")
4bcc063939 ("ice, irdma: fix an off by one in error handling code")
c24a65b6a2 ("iidc/ice/irdma: Update IDC to support multiple consumers")
https://lore.kernel.org/20250513130630.280ee6c5@canb.auug.org.au
No extra adjacent changes.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEH7ZpcWbFyOOp6OJbrB3Eaf9PW7cFAmgtYgEACgkQrB3Eaf9P
W7e1ag//da84UIRyJwMfDO4Y3MXDNPslNSDuq0HuvwRtdLIBLFtwitSzU1uhKsxY
yn5v7RSsxvp6lXW2RT+Ycor2qZ/mGHJsHcVfG7m0YjxH6unw7yzjqn5LNNzRbYN4
NcD8P0skuX6d80EFPUB3Hsnmdj1VKR62OsWyk3rAPb4CLBVKJt9OsseVfN4bn1R0
TaZSIkdh5EDGYXTBKb49jc8LFfQo7+uVg/AjtZ/2ZsWt+Qgw3XevTIcwLokH00rt
GzXcLjC1g+b6TeVncOuD1oiNJUtQVGYV23t2yQlk9k2HFzCdNnq0YM9pzawwiI+l
icBV2X/QFjhdCRkvJRF4dkXq/4tnnEmYoY/1vSOoWR9VmY2u8Lr3VRiDD/h0gYJT
KXd8YPMtZLDnLgmH+DwWbv4vdLtHvQTmB8XFzb/4VN6Ikucenry3loJsUsLnS+Je
t1/7unLrg9yyJC6UPzweqjAx+6VgZvem/M5kejIVxHpk+Wg2dXGZ2jz4fsVuZYPB
dMLj1h1MLn4gOt2b/bdI2do0C+p2R1axrTNw+RiqwCrb1h5Ey+7RAhWyXyaHUEs3
1brMAgOcvdbaaeSIpoHJ8eJx/PgRxDrxRnUC3HjCGPNApYQXC3FM3POk7wwJ9C0i
odlHrq+yOdzLCZyU+YKdR1q3kPq9AWpUSmc4Olg359OQ9IxDGQw=
=bgyq
-----END PGP SIGNATURE-----
Merge tag 'ipsec-2025-05-21' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec
Steffen Klassert says:
====================
pull request (net): ipsec 2025-05-21
1) Fix some missing kfree_skb in the error paths of espintcp.
From Sabrina Dubroca.
2) Fix a reference leak in espintcp.
From Sabrina Dubroca.
3) Fix UDP GRO handling for ESPINUDP.
From Tobias Brunner.
4) Fix ipcomp truesize computation on the receive path.
From Sabrina Dubroca.
5) Sanitize marks before policy/state insertation.
From Paul Chaignon.
* tag 'ipsec-2025-05-21' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec:
xfrm: Sanitize marks before insert
xfrm: ipcomp: fix truesize computation on receive
xfrm: Fix UDP GRO handling for some corner cases
espintcp: remove encap socket caching to avoid reference leak
espintcp: fix skb leaks
====================
Link: https://patch.msgid.link/20250521054348.4057269-1-steffen.klassert@secunet.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Now that the only remaining caller of __skb_checksum() is
skb_checksum(), fold __skb_checksum() into skb_checksum(). This makes
struct skb_checksum_ops unnecessary, so remove that too and simply do
the "regular" net checksum. It also makes the wrapper functions
csum_partial_ext() and csum_block_add_ext() unnecessary, so remove those
too and just use the underlying functions.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://patch.msgid.link/20250519175012.36581-7-ebiggers@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Make sctp_compute_cksum() just use the new function skb_crc32c(),
instead of calling __skb_checksum() with a skb_checksum_ops struct that
does CRC32C. This is faster and simpler.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://patch.msgid.link/20250519175012.36581-6-ebiggers@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Use separate link type id for unicast and broadcast ISO connections.
These connection types are handled with separate HCI commands, socket
API is different, and hci_conn has union fields that are different in
the two cases, so they shall not be mixed up.
Currently in most places it is attempted to distinguish ucast by
bacmp(&c->dst, BDADDR_ANY) but it is wrong as dst is set for bcast sink
hci_conn in iso_conn_ready(). Additionally checking sync_handle might be
OK, but depends on details of bcast conn configuration flow.
To avoid complicating it, use separate link types.
Fixes: f764a6c2c1 ("Bluetooth: ISO: Add broadcast support")
Signed-off-by: Pauli Virtanen <pav@iki.fi>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Bluetooth needs some way for user to get supported so_timestamping flags
for the different socket types.
Use SIOCETHTOOL API for this purpose. As hci_dev is not associated with
struct net_device, the existing implementation can't be reused, so we
add a small one here.
Add support (only) for ETHTOOL_GET_TS_INFO command. The API differs
slightly from netdev in that the result depends also on socket type.
Signed-off-by: Pauli Virtanen <pav@iki.fi>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Although commit 75ddcd5ad4 ("Bluetooth: btusb: Configure altsetting
for HCI_USER_CHANNEL") has enabled the HCI_USER_CHANNEL user to send out
SCO data through USB Bluetooth chips, it's observed that with the patch
HFP is flaky on most of the existing USB Bluetooth controllers: Intel
chips sometimes send out no packet for Transparent codec; MTK chips may
generate SCO data with a wrong handle for CVSD codec; RTK could split
the data with a wrong packet size for Transparent codec; ... etc.
To address the issue above one needs to reset the altsetting back to
zero when there is no active SCO connection, which is the same as the
BlueZ behavior, and another benefit is the bus doesn't need to reserve
bandwidth when no SCO connection.
This patch adds the infrastructure that allow the user space program to
talk to Bluetooth drivers directly:
- Define the new packet type HCI_DRV_PKT which is specifically used for
communication between the user space program and the Bluetooth drviers
- hci_send_frame intercepts the packets and invokes drivers' HCI Drv
callbacks (so far only defined for btusb)
- 2 kinds of events to user space: Command Status and Command Complete,
the former simply returns the status while the later may contain
additional response data.
Cc: chromeos-bluetooth-upstreaming@chromium.org
Fixes: b16b327edb ("Bluetooth: btusb: add sysfs attribute to control USB alt setting")
Signed-off-by: Hsin-chen Chuang <chharry@chromium.org>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Checking the SOCK_WIFI_STATUS flag bit in sk_flags may give wrong results
since sk_flags are part of a union and the union is used otherwise. Add
sk_requests_wifi_status() which checks if sk is non-NULL, sk is a full
socket (so flags are valid) and checks the flag bit.
Fixes: 76a853f86c ("wifi: free SKBTX_WIFI_STATUS skb tx_flags flag")
Suggested-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Bert Karwatzki <spasswolf@web.de>
Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
Link: https://patch.msgid.link/20250520223430.6875-1-spasswolf@web.de
[edit commit message, fix indentation]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Commit f130a0cc1b ("inet: fix lwtunnel_valid_encap_type() lock
imbalance") added the rtnl_is_held argument as a temporary fix while
I'm converting nexthop and IPv6 routing table to per-netns RTNL or RCU.
Now all callers of lwtunnel_valid_encap_type() do not hold RTNL.
Let's remove the argument.
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250516022759.44392-3-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
netdev_lock is already held when calling bnxt_ulp_irq_stop() and
bnxt_ulp_irq_restart(). When converting rtnl_lock to netdev_lock,
the original code was rtnl_dereference() to indicate that rtnl_lock
was already held. rcu_dereference_protected() is the correct
conversion after replacing rtnl_lock with netdev_lock.
Add a new helper netdev_lock_dereference() similar to
rtnl_dereference().
Fixes: 004b500801 ("eth: bnxt: remove most dependencies on RTNL")
Reviewed-by: Andy Gospodarek <andrew.gospodarek@broadcom.com>
Reviewed-by: Pavan Chebbi <pavan.chebbi@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250519204130.3097027-2-michael.chan@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
AFAIU always returning -1 from lockdep's compare function
basically disables checking of dependencies between given
locks. Try to be a little more precise about what guarantees
that instance locks won't deadlock.
Right now we only nest them under protection of rtnl_lock.
Mostly in unregister_netdevice_many() and dev_close_many().
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250517200810.466531-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Rename local variable in macros from txq to _txq.
When macro parameter get_desc is expended it is likely to have a txq
token that refers to a different txq variable at the caller's site.
Signed-off-by: Gur Stavi <gur.stavi@huawei.com>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Link: https://patch.msgid.link/95b60d218f004308486d92ed17c8cc6f28bac09d.1747559621.git.gur.stavi@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
RFS can exhibit lower performance for workloads using short-lived
flows and a small set of 4-tuple.
This is often the case for load-testers, using a pair of hosts,
if the server has a single listener port.
Typical use case :
Server : tcp_crr -T128 -F1000 -6 -U -l30 -R 14250
Client : tcp_crr -T128 -F1000 -6 -U -l30 -c -H server | grep local_throughput
This is because RFS global hash table contains stale information,
when the same RSS key is recycled for another socket and another cpu.
Make sure to undo the changes and go back to initial state when
a flow is disconnected.
Performance of the above test is increased by 22 %,
going from 372604 transactions per second to 457773.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Octavian Purdila <tavip@google.com>
Reviewed-by: Neal Cardwell <ncardwell@google.com>
Link: https://patch.msgid.link/20250515100354.3339920-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Drivers need to make sure not to pass netmem dma-addrs to the
dma-mapping API in order to support netmem TX.
Add helpers and netmem_dma_*() helpers that enables special handling of
netmem dma-addrs that drivers can use.
Document in netmem.rst what drivers need to do to support netmem TX.
Signed-off-by: Mina Almasry <almasrymina@google.com>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20250508004830.4100853-7-almasrymina@google.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Augment dmabuf binding to be able to handle TX. Additional to all the RX
binding, we also create tx_vec needed for the TX path.
Provide API for sendmsg to be able to send dmabufs bound to this device:
- Provide a new dmabuf_tx_cmsg which includes the dmabuf to send from.
- MSG_ZEROCOPY with SCM_DEVMEM_DMABUF cmsg indicates send from dma-buf.
Devmem is uncopyable, so piggyback off the existing MSG_ZEROCOPY
implementation, while disabling instances where MSG_ZEROCOPY falls back
to copying.
We additionally pipe the binding down to the new
zerocopy_fill_skb_from_devmem which fills a TX skb with net_iov netmems
instead of the traditional page netmems.
We also special case skb_frag_dma_map to return the dma-address of these
dmabuf net_iovs instead of attempting to map pages.
The TX path may release the dmabuf in a context where we cannot wait.
This happens when the user unbinds a TX dmabuf while there are still
references to its netmems in the TX path. In that case, the netmems will
be put_netmem'd from a context where we can't unmap the dmabuf, Resolve
this by making __net_devmem_dmabuf_binding_free schedule_work'd.
Based on work by Stanislav Fomichev <sdf@fomichev.me>. A lot of the meat
of the implementation came from devmem TCP RFC v1[1], which included the
TX path, but Stan did all the rebasing on top of netmem/net_iov.
Cc: Stanislav Fomichev <sdf@fomichev.me>
Signed-off-by: Kaiyuan Zhang <kaiyuanz@google.com>
Signed-off-by: Mina Almasry <almasrymina@google.com>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20250508004830.4100853-5-almasrymina@google.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Currently net_iovs support only pp ref counts, and do not support a
page ref equivalent.
This is fine for the RX path as net_iovs are used exclusively with the
pp and only pp refcounting is needed there. The TX path however does not
use pp ref counts, thus, support for get_page/put_page equivalent is
needed for netmem.
Support get_netmem/put_netmem. Check the type of the netmem before
passing it to page or net_iov specific code to obtain a page ref
equivalent.
For dmabuf net_iovs, we obtain a ref on the underlying binding. This
ensures the entire binding doesn't disappear until all the net_iovs have
been put_netmem'ed. We do not need to track the refcount of individual
dmabuf net_iovs as we don't allocate/free them from a pool similar to
what the buddy allocator does for pages.
This code is written to be extensible by other net_iov implementers.
get_netmem/put_netmem will check the type of the netmem and route it to
the correct helper:
pages -> [get|put]_page()
dmabuf net_iovs -> net_devmem_[get|put]_net_iov()
new net_iovs -> new helpers
Signed-off-by: Mina Almasry <almasrymina@google.com>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20250508004830.4100853-3-almasrymina@google.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Later patches in the series adds TX net_iovs where there is no pp
associated, so we can't rely on niov->pp->mp_ops to tell what is the
type of the net_iov.
Add a type enum to the net_iov which tells us the net_iov type.
Signed-off-by: Mina Almasry <almasrymina@google.com>
Link: https://patch.msgid.link/20250508004830.4100853-2-almasrymina@google.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Handle soc servicing events which require the rdma auxiliary device resources to
be cleaned up during a suspend, and re-initialized during a resume.
Signed-off-by: Shiraz Saleem <shirazsaleem@microsoft.com>
Signed-off-by: Konstantin Taranov <kotaranov@microsoft.com>
Link: https://patch.msgid.link/1746633545-17653-5-git-send-email-kotaranov@linux.microsoft.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Initialize gdma device for rdma inside mana module.
For each gdma device, initialize an auxiliary ib device.
Signed-off-by: Konstantin Taranov <kotaranov@microsoft.com>
Link: https://patch.msgid.link/1746633545-17653-2-git-send-email-kotaranov@linux.microsoft.com
Reviewed-by: Long Li <longli@microsoft.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
New timestamping API was introduced in commit 66f7223039 ("net: add
NDOs for configuring hardware timestamping") from kernel v6.6. It is
time to convert DSA to the new API, so that the ndo_eth_ioctl() path can
be removed completely.
Move the ds->ops->port_hwtstamp_get() and ds->ops->port_hwtstamp_set()
calls from dsa_user_ioctl() to dsa_user_hwtstamp_get() and
dsa_user_hwtstamp_set().
Due to the fact that the underlying ifreq type changes to
kernel_hwtstamp_config, the drivers and the Ocelot switchdev front-end,
all hooked up directly or indirectly, must also be converted all at once.
The conversion also updates the comment from dsa_port_supports_hwtstamp(),
which is no longer true because kernel_hwtstamp_config is kernel memory
and does not need copy_to_user(). I've deliberated whether it is
necessary to also update "err != -EOPNOTSUPP" to a more general "!err",
but all drivers now either return 0 or -EOPNOTSUPP.
The existing logic from the ocelot_ioctl() function, to avoid
configuring timestamping if the PHY supports the operation, is obsoleted
by more advanced core logic in dev_set_hwtstamp_phylib().
This is only a partial preparation for proper PHY timestamping support.
None of these switch driver currently sets up PTP traps for PHY
timestamping, so setting dev->see_all_hwtstamp_requests is not yet
necessary and the conversion is relatively trivial.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Tested-by: Vladimir Oltean <vladimir.oltean@nxp.com> # felix, sja1105, mv88e6xxx
Reviewed-by: Vadim Fedorenko <vadim.fedorenko@linux.dev>
Link: https://patch.msgid.link/20250508095236.887789-1-vladimir.oltean@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Previously, when reducing a qdisc's limit via the ->change() operation, only
the main skb queue was trimmed, potentially leaving packets in the gso_skb
list. This could result in NULL pointer dereference when we only check
sch->limit against sch->q.qlen.
This patch introduces a new helper, qdisc_dequeue_internal(), which ensures
both the gso_skb list and the main queue are properly flushed when trimming
excess packets. All relevant qdiscs (codel, fq, fq_codel, fq_pie, hhf, pie)
are updated to use this helper in their ->change() routines.
Fixes: 76e3cc126b ("codel: Controlled Delay AQM")
Fixes: 4b549a2ef4 ("fq_codel: Fair Queue Codel AQM")
Fixes: afe4fd0624 ("pkt_sched: fq: Fair Queue packet scheduler")
Fixes: ec97ecf1eb ("net: sched: add Flow Queue PIE packet scheduler")
Fixes: 10239edf86 ("net-qdisc-hhf: Heavy-Hitter Filter (HHF) qdisc")
Fixes: d4b36210c2 ("net: pkt_sched: PIE AQM scheme")
Reported-by: Will <willsroot@protonmail.com>
Reported-by: Savy <savy@syst3mfailure.io>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
- MGMT: Fix MGMT_OP_ADD_DEVICE invalid device flags
- hci_event: Fix not using key encryption size when its known
-----BEGIN PGP SIGNATURE-----
iQJNBAABCgA3FiEE7E6oRXp8w05ovYr/9JCA4xAyCykFAmgcyKYZHGx1aXoudm9u
LmRlbnR6QGludGVsLmNvbQAKCRD0kIDjEDILKWSqEACGdIL6oDrfUmE2mFUxlT2C
eUSe1RDvXi63pFCxv4a1/JrnAICfIPbuKhvkOhf9g3JoQXjqdtX5REjEPcCndyBg
sSyGGfxcaoYLsSWQY1nSb2p/bttk1R3LfPT78QhKPDsh63FkvRw77P++8anzG3T1
wqgjKU9XZb7zYmMElYCqYbnd0K2PymJtD+Oml1srBRdOpz1ex3s6Qj4Cy6cg07Vb
SlxqWMnWZG0wfnusAFZ48zwYu/7/LQQnSJ6rbHSfrKLQnizNVvFTtsW+xrIfQ2m6
G7/WOX2LVwtP3XMe8VIDV0kdDtkDFtHq0KuNToAtt4DS582RHw0AVe+Xc2x5FFKA
rmcukZLvg6tv/DM/PM5zJdvZW4M+r1IOnBSZvFMAdYb4Af04DHjI1k1ow9On6O3f
oJCeRZ4LoOREljR+xdO/Ewn207za0wGR7IlDXFVpeGEOnLcSAxvqUl6biL3cEQpA
bb97I7fjqwSPqpAGsMV0uhULTJxT7QPhF05rL3bpSzAzZoDpRFhc070TLBBuQvW1
zQmf+SCeg72vDtv1vsekyoXWfM+SpwlsNYuStPBztrKZgBGeRKjtZc2/L6dLPKbo
bATO6Dhidm+outsF59YaqORFteON4yLAXSryLw6uVNIT3ApyUspNICyn4zi0tbYY
8N4rjTclIk5zV1z8zCJ7Lg==
=uyVh
-----END PGP SIGNATURE-----
Merge tag 'for-net-2025-05-08' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth
Luiz Augusto von Dentz says:
====================
bluetooth pull request for net:
- MGMT: Fix MGMT_OP_ADD_DEVICE invalid device flags
- hci_event: Fix not using key encryption size when its known
* tag 'for-net-2025-05-08' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth:
Bluetooth: hci_event: Fix not using key encryption size when its known
Bluetooth: MGMT: Fix MGMT_OP_ADD_DEVICE invalid device flags
====================
Link: https://patch.msgid.link/20250508150927.385675-1-luiz.dentz@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
As per IEEE 802.11be-2024 - 9.4.2.321, EHT operation element
contains MCS15 Disable subfield as the sixth bit, which is set when
MCS15 support is not enabled.
Get MCS15 support from EHT operation params and add it in link_conf
so that driver can use this value to know if EHT-MCS 15 reception
is enabled.
Co-developed-by: Dhanavandhana Kannan <quic_dhanavan1@quicinc.com>
Signed-off-by: Dhanavandhana Kannan <quic_dhanavan1@quicinc.com>
Signed-off-by: Mohan Kumar G <quic_mkumarg@quicinc.com>
Link: https://patch.msgid.link/20250505152836.3266829-1-quic_mkumarg@quicinc.com
[remove pointless !! for bool assignment]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
This fixes the regression introduced by 50c1241e6a8a ("Bluetooth: l2cap:
Check encryption key size on incoming connection") introduced a check for
l2cap_check_enc_key_size which checks for hcon->enc_key_size which may
not be initialized if HCI_OP_READ_ENC_KEY_SIZE is still pending.
If the key encryption size is known, due previously reading it using
HCI_OP_READ_ENC_KEY_SIZE, then store it as part of link_key/smp_ltk
structures so the next time the encryption is changed their values are
used as conn->enc_key_size thus avoiding the racing against
HCI_OP_READ_ENC_KEY_SIZE.
Now that the enc_size is stored as part of key the information the code
then attempts to check that there is no downgrade of security if
HCI_OP_READ_ENC_KEY_SIZE returns a value smaller than what has been
previously stored.
Link: https://bugzilla.kernel.org/show_bug.cgi?id=220061
Link: https://bugzilla.kernel.org/show_bug.cgi?id=220063
Fixes: 522e9ed157 ("Bluetooth: l2cap: Check encryption key size on incoming connection")
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Older drivers and drivers with lower queue counts often have a static
array of queues, rather than allocating structs for each queue on demand.
Add a helper for adding up qstats from a queue range. Expectation is
that driver will pass a queue range [netdev->real_num_*x_queues, MAX).
It was tempting to always use num_*x_queues as the end, but virtio
seems to clamp its queue count after allocating the netdev. And this
way we can trivaly reuse the helper for [0, real_..).
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Link: https://patch.msgid.link/20250507003221.823267-2-kuba@kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
* stack
- free SKBTX_WIFI_STATUS flag
- fixes for VLAN multicast in multi-link
- improve codel parameters (revert some old twiddling)
* ath12k
- Enable AHB support for IPQ5332.
- Add monitor interface support to QCN9274.
- Add MLO support to WCN7850.
- Add 802.11d scan offload support to WCN7850.
* ath11k
- Restore hibernation support
* iwlwifi
- EMLSR on two 5 GHz links
* mwifiex
- cleanups/refactoring
along with many other small features/cleanups
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEpeA8sTs3M8SN2hR410qiO8sPaAAFAmgaSmcACgkQ10qiO8sP
aAA4hA//f/nfeLTAnhON53mDlqxa55/2bw9XSH7pOIOasVBWxmuYxhWfn5uiZluI
zOGlBO7vtJYvPrVHEHSuPWMNCQ+ieL2ShRSP5BBfy3KBYdD4gKKAd95WoiXmRVp4
d13OYtF9msFbVXOZYMyxMHmzIrWlBQIokjOSjqjSBeYRnD8U0GRemiSecugWo/qI
bE7xcZuTgKBy+gr7242017DcUjWBdWcsp1C6C+COZm/KrSihQ0SQ4PIcOZgPsZjl
COGextZltbWW56qnlp6QC394V+Vhah+Owcz3Qqz9zZ7hzJnuPo+DnpPMShhRGruL
/IgqKhzcuye5UUJJl8nD768x6ebClchkcBC+A/hfjk5UYVl/oZxA1Bw5fC2O+5VU
ycDMHr1Qu/yEE2rbwIJPGNOZ7NisqJFF07CPFuygKjBGNnp7I5S7H6UJsKRi5MuZ
0CXHiFMhuHgWLmDFauIa66XI1JpqIzQgbZSjqVYFKRqwEz3yRKwihwDzZy1m6pDQ
NhMFedQznFkMSJ+m020IOxxXYPy98PHcus+TZWL4os0SJTfEycEUWJZnlJ47Bb25
Z9IF1OER2I4giMKxUVKRoDq0SGStbZODwqxrVCRD261Um9ybbfDdRbDeDOmvY+Gu
jEmE6tscWbRfDvCS2M+xIg+MAcVHJq5gO4S8RopMSJlFHY/T/uE=
=s9mS
-----END PGP SIGNATURE-----
Merge tag 'wireless-next-2025-05-06' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next
Johannes Berg says:
====================
wireless features, notably
* stack
- free SKBTX_WIFI_STATUS flag
- fixes for VLAN multicast in multi-link
- improve codel parameters (revert some old twiddling)
* ath12k
- Enable AHB support for IPQ5332.
- Add monitor interface support to QCN9274.
- Add MLO support to WCN7850.
- Add 802.11d scan offload support to WCN7850.
* ath11k
- Restore hibernation support
* iwlwifi
- EMLSR on two 5 GHz links
* mwifiex
- cleanups/refactoring
along with many other small features/cleanups
* tag 'wireless-next-2025-05-06' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next: (177 commits)
Revert "wifi: iwlwifi: clean up config macro"
wifi: iwlwifi: move phy_filters to fw_runtime
wifi: iwlwifi: pcie: make sure to lock rxq->read
wifi: iwlwifi: add definitions for iwl_mac_power_cmd version 2
wifi: iwlwifi: clean up config macro
wifi: iwlwifi: mld: simplify iwl_mld_rx_fill_status()
wifi: iwlwifi: mld: rx: simplify channel handling
wifi: iwlwifi: clean up band in RX metadata
wifi: iwlwifi: mld: skip unknown FW channel load values
wifi: iwlwifi: define API for external FSEQ images
wifi: iwlwifi: mld: allow EMLSR on separated 5 GHz subbands
wifi: iwlwifi: mld: use cfg80211_chandef_get_width()
wifi: iwlwifi: mld: fix iwl_mld_emlsr_disallowed_with_link() return
wifi: iwlwifi: mld: clarify variable type
wifi: iwlwifi: pcie: add support for the reset handshake in MSI
wifi: mac80211_hwsim: Prevent tsf from setting if beacon is disabled
wifi: mac80211: restructure tx profile retrieval for MLO MBSSID
wifi: nl80211: add link id of transmitted profile for MLO MBSSID
wifi: ieee80211: Add helpers to fetch EMLSR delay and timeout values
wifi: mac80211: update ML STA with EML capabilities
...
====================
Link: https://patch.msgid.link/20250506174656.119970-3-johannes@sipsolutions.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Assign DEVLINK_PARAM_TYPE_* enum values to DEVLINK_VAR_ATTR_TYPE_* to
ensure the same values are used internally and in UAPI. Benefit from
that by removing the value translations.
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Link: https://patch.msgid.link/20250505114513.53370-4-jiri@resnulli.us
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
sctp_assoc_del_peer() last use was removed in 2015 by
commit 73e6742027 ("sctp: Do not try to search for the transport twice")
which now uses rm_peer instead of del_peer.
sctp_chunk_iif() last use was removed in 2016 by
commit 1f45f78f8e ("sctp: allow GSO frags to access the chunk too")
Remove them.
Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org>
Acked-by: Xin Long <lucien.xin@gmail.com>
Link: https://patch.msgid.link/20250501233815.99832-1-linux@treblig.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The last use of __strp_unpause() was removed in 2022 by
commit 84c61fe1a7 ("tls: rx: do not use the standard strparser")
Remove it.
Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250501002402.308843-1-linux@treblig.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Load balance new TCP connections across nexthops also when they
connect to the same service at a single remote address and port.
This affects only port-based multipath hashing:
fib_multipath_hash_policy 1 or 3.
Local connections must choose both a source address and port when
connecting to a remote service, in ip_route_connect. This
"chicken-and-egg problem" (commit 2d7192d6cb ("ipv4: Sanitize and
simplify ip_route_{connect,newports}()")) is resolved by first
selecting a source address, by looking up a route using the zero
wildcard source port and address.
As a result multiple connections to the same destination address and
port have no entropy in fib_multipath_hash.
This is not a problem when forwarding, as skb-based hashing has a
4-tuple. Nor when establishing UDP connections, as autobind there
selects a port before reaching ip_route_connect.
Load balance also TCP, by using a random port in fib_multipath_hash.
Port assignment in inet_hash_connect is not atomic with
ip_route_connect. Thus ports are unpredictable, effectively random.
Implementation details:
Do not actually pass a random fl4_sport, as that affects not only
hashing, but routing more broadly, and can match a source port based
policy route, which existing wildcard port 0 will not. Instead,
define a new wildcard flowi flag that is used only for hashing.
Selecting a random source is equivalent to just selecting a random
hash entirely. But for code clarity, follow the normal 4-tuple hash
process and only update this field.
fib_multipath_hash can be reached with zero sport from other code
paths, so explicitly pass this flowi flag, rather than trying to infer
this case in the function itself.
Signed-off-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Link: https://patch.msgid.link/20250424143549.669426-3-willemdebruijn.kernel@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
With multipath routes, try to ensure that packets leave on the device
that is associated with the source address.
Avoid the following tcpdump example:
veth0 Out IP 10.1.0.2.38640 > 10.2.0.3.8000: Flags [S]
veth1 Out IP 10.1.0.2.38648 > 10.2.0.3.8000: Flags [S]
Which can happen easily with the most straightforward setup:
ip addr add 10.0.0.1/24 dev veth0
ip addr add 10.1.0.1/24 dev veth1
ip route add 10.2.0.3 nexthop via 10.0.0.2 dev veth0 \
nexthop via 10.1.0.2 dev veth1
This is apparently considered WAI, based on the comment in
ip_route_output_key_hash_rcu:
* 2. Moreover, we are allowed to send packets with saddr
* of another iface. --ANK
It may be ok for some uses of multipath, but not all. For instance,
when using two ISPs, a router may drop packets with unknown source.
The behavior occurs because tcp_v4_connect makes three route
lookups when establishing a connection:
1. ip_route_connect calls to select a source address, with saddr zero.
2. ip_route_connect calls again now that saddr and daddr are known.
3. ip_route_newports calls again after a source port is also chosen.
With a route with multiple nexthops, each lookup may make a different
choice depending on available entropy to fib_select_multipath. So it
is possible for 1 to select the saddr from the first entry, but 3 to
select the second entry. Leading to the above situation.
Address this by preferring a match that matches the flowi4 saddr. This
will make 2 and 3 make the same choice as 1. Continue to update the
backup choice until a choice that matches saddr is found.
Do this in fib_select_multipath itself, rather than passing an fl4_oif
constraint, to avoid changing non-multipath route selection. Commit
e6b45241c5 ("ipv4: reset flowi parameters on route connect") shows
how that may cause regressions.
Also read ipv4.sysctl_fib_multipath_use_neigh only once. No need to
refresh in the loop.
This does not happen in IPv6, which performs only one lookup.
Signed-off-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Link: https://patch.msgid.link/20250424143549.669426-2-willemdebruijn.kernel@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
The "noqueue" qdisc can either be directly attached, or get default
attached if net_device priv_flags has IFF_NO_QUEUE. In both cases, the
allocated Qdisc structure gets it's enqueue function pointer reset to
NULL by noqueue_init() via noqueue_qdisc_ops.
This is a common case for software virtual net_devices. For these devices
with no-queue, the transmission path in __dev_queue_xmit() will bypass
the qdisc layer. Directly invoking device drivers ndo_start_xmit (via
dev_hard_start_xmit). In this mode the device driver is not allowed to
ask for packets to be queued (either via returning NETDEV_TX_BUSY or
stopping the TXQ).
The simplest and most reliable way to identify this no-queue case is by
checking if enqueue == NULL.
The vrf driver currently open-codes this check (!qdisc->enqueue). While
functionally correct, this low-level detail is better encapsulated in a
dedicated helper for clarity and long-term maintainability.
To make this behavior more explicit and reusable, this patch introduce a
new helper: qdisc_txq_has_no_queue(). Helper will also be used by the
veth driver in the next patch, which introduces optional qdisc-based
backpressure.
This is a non-functional change.
Reviewed-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>
Link: https://patch.msgid.link/174559293172.827981.7583862632045264175.stgit@firesoul
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
BIG Create Sync requires the command to just generates a status so this
makes use of __hci_cmd_sync_status_sk to wait for
HCI_EVT_LE_BIG_SYNC_ESTABLISHED, also because of this chance it is not
longer necessary to use a custom method to serialize the process of
creating the BIG sync since the cmd_work_sync itself ensures only one
command would be pending which now awaits for
HCI_EVT_LE_BIG_SYNC_ESTABLISHED before proceeding to next connection.
Fixes: 42ecf19471 ("Bluetooth: ISO: Do not emit LE BIG Create Sync if previous is pending")
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Broadcast Receiver requires creating PA sync but the command just
generates a status so this makes use of __hci_cmd_sync_status_sk to wait
for HCI_EV_LE_PA_SYNC_ESTABLISHED, also because of this chance it is not
longer necessary to use a custom method to serialize the process of
creating the PA sync since the cmd_work_sync itself ensures only one
command would be pending which now awaits for
HCI_EV_LE_PA_SYNC_ESTABLISHED before proceeding to next connection.
Fixes: 4a5e0ba686 ("Bluetooth: ISO: Do not emit LE PA Create Sync if previous is pending")
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Bring back previous offset calculation behaviour
in AF_XDP unaligned umem mode.
In unaligned mode, upper 16 bits should contain
data offset, lower 48 bits should contain
only specific chunk location without offset.
Remove pool->headroom duplication into 48bit address.
Signed-off-by: Eryk Kubanski <e.kubanski@partner.samsung.com>
Fixes: bea14124ba ("xsk: Get rid of xdp_buff_xsk::orig_addr")
Acked-by: Magnus Karlsson <magnus.karlsson@intel.com>
Link: https://patch.msgid.link/20250416112925.7501-1-e.kubanski@partner.samsung.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Move rx_lock from xsk_socket to xsk_buff_pool.
Fix synchronization for shared umem mode in
generic RX path where multiple sockets share
single xsk_buff_pool.
RX queue is exclusive to xsk_socket, while FILL
queue can be shared between multiple sockets.
This could result in race condition where two
CPU cores access RX path of two different sockets
sharing the same umem.
Protect both queues by acquiring spinlock in shared
xsk_buff_pool.
Lock contention may be minimized in the future by some
per-thread FQ buffering.
It's safe and necessary to move spin_lock_bh(rx_lock)
after xsk_rcv_check():
* xs->pool and spinlock_init is synchronized by
xsk_bind() -> xsk_is_bound() memory barriers.
* xsk_rcv_check() may return true at the moment
of xsk_release() or xsk_unbind_dev(),
however this will not cause any data races or
race conditions. xsk_unbind_dev() removes xdp
socket from all maps and waits for completion
of all outstanding rx operations. Packets in
RX path will either complete safely or drop.
Signed-off-by: Eryk Kubanski <e.kubanski@partner.samsung.com>
Fixes: bf0bdd1343 ("xdp: fix race on generic receive path")
Acked-by: Magnus Karlsson <magnus.karlsson@intel.com>
Link: https://patch.msgid.link/20250416101908.10919-1-e.kubanski@partner.samsung.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Remove three functions that are no longer used.
rxrpc_get_txbuf() last use was removed by 2020's
commit 5e6ef4f101 ("rxrpc: Make the I/O thread take over the call and
local processor work")
rxrpc_kernel_get_epoch() last use was removed by 2020's
commit 44746355cc ("afs: Don't get epoch from a server because it may be
ambiguous")
rxrpc_kernel_set_max_life() last use was removed by 2023's
commit db099c625b ("rxrpc: Fix timeout of a call that hasn't yet been
granted a channel")
Both of the rxrpc_kernel_* functions were documented. Remove that
documentation as well as the code.
Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org>
Acked-by: David Howells <dhowells@redhat.com>
Link: https://patch.msgid.link/20250422235147.146460-1-linux@treblig.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
We will get rid of RTNL from RTM_NEWROUTE and SIOCADDRT.
Then, we may be going to add a route tied to a dying nexthop.
The nexthop itself is not freed during the RCU grace period, but
if we link a route after __remove_nexthop_fib() is called for the
nexthop, the route will be leaked.
To avoid the race between IPv6 route addition under RCU vs nexthop
deletion under RTNL, let's add a dead flag and protect it and
nh->f6i_list with a spinlock.
__remove_nexthop_fib() acquires the nexthop's spinlock and sets false
to nh->dead, then calls ip6_del_rt() for the linked route one by one
without the spinlock because fib6_purge_rt() acquires it later.
While adding an IPv6 route, fib6_add() acquires the nexthop lock and
checks the dead flag just before inserting the route.
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250418000443.43734-15-kuniyu@amazon.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
The next patch adds per-nexthop spinlock which protects nh->f6i_list.
When rt->nh is not NULL, fib6_add_rt2node() will be called under the lock.
fib6_add_rt2node() could call fib6_purge_rt() for another route, which
could holds another nexthop lock.
Then, deadlock could happen between two nexthops.
Let's defer fib6_purge_rt() after fib6_add_rt2node().
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Link: https://patch.msgid.link/20250418000443.43734-14-kuniyu@amazon.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>