Currently, the condition 'page->pp_magic == PP_SIGNATURE' is used to
determine if a page belongs to a page pool. However, with the planned
removal of @pp_magic, we should instead leverage the page_type in struct
page, such as PGTY_netpp, for this purpose.
Introduce and use the page type APIs e.g. PageNetpp(), __SetPageNetpp(),
and __ClearPageNetpp() instead, and remove the existing APIs accessing
@pp_magic e.g. page_pool_page_is_pp(), netmem_or_pp_magic(), and
netmem_clear_pp_magic().
Plus, add @page_type to struct net_iov at the same offset as struct page
so as to use the page_type APIs for struct net_iov as well. While at it,
reorder @type and @owner in struct net_iov to avoid a hole and increasing
the struct size.
This work was inspired by the following link:
https://lore.kernel.org/all/582f41c0-2742-4400-9c81-0d46bf4e8314@gmail.com/
While at it, move the sanity check for page pool to on the free path.
[byungchul@sk.com: gate the sanity check, per Johannes]
Link: https://lkml.kernel.org/r/20260316223113.20097-1-byungchul@sk.com
Link: https://lkml.kernel.org/r/20260224051347.19621-1-byungchul@sk.com
Co-developed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Byungchul Park <byungchul@sk.com>
Suggested-by: David Hildenbrand <david@redhat.com>
Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: Zi Yan <ziy@nvidia.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Acked-by: Jesper Dangaard Brouer <hawk@kernel.org>
Acked-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Andrew Lunn <andrew+netdev@lunn.ch>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Brendan Jackman <jackmanb@google.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: David S. Miller <davem@davemloft.net>
Cc: David Wei <dw@davidwei.uk>
Cc: Dragos Tatulea <dtatulea@nvidia.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: Leon Romanovsky <leon@kernel.org>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
Cc: Mark Bloch <mbloch@nvidia.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mina Almasry <almasrymina@google.com>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: Saeed Mahameed <saeedm@nvidia.com>
Cc: Simon Horman <horms@kernel.org>
Cc: Stanislav Fomichev <sdf@fomichev.me>
Cc: Stehen Rothwell <sfr@canb.auug.org.au>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Taehee Yoo <ap420073@gmail.com>
Cc: Tariq Toukan <tariqt@nvidia.com>
Cc: Usama Arif <usamaarif642@gmail.com>
Cc: Yu Zhao <yuzhao@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
In configurations with multiple tunnel layers and MPLS lwtunnel routing, a
single tunnel hop can increment the counter beyond this limit. This causes
packets to be dropped with the "Dead loop on virtual device" message even
when a routing loop doesn't exist.
Increase IP_TUNNEL_RECURSION_LIMIT from 4 to 5 to handle this use-case.
Fixes: 6f1a9140ec ("net: add xmit recursion limit to tunnel xmit functions")
Link: https://lore.kernel.org/netdev/88deb91b-ef1b-403c-8eeb-0f971f27e34f@redhat.com/
Signed-off-by: Chris J Arges <carges@cloudflare.com>
Link: https://patch.msgid.link/20260402222401.3408368-1-carges@cloudflare.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Rather than querying the output device for its address in
mctp_local_output, set up the source address when we're populating the
dst structure. If no address is assigned, use MCTP_ADDR_NULL.
This will allow us more flexibility when routing for NULL-source-eid
cases. For now though, we still reject a NULL source address in the
output path.
We need to update the tests a little, so that addresses are assigned
before we do the dst lookups.
Signed-off-by: Jeremy Kerr <jk@codeconstruct.com.au>
Link: https://patch.msgid.link/20260331-dev-mctp-null-eids-v1-1-b4d047372eaf@codeconstruct.com.au
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
When an RSS QP is destroyed (e.g. DPDK exit), mana_ib_destroy_qp_rss()
destroys the RX WQ objects but does not disable vPort RX steering in
firmware. This leaves stale steering configuration that still points to
the destroyed RX objects.
If traffic continues to arrive (e.g. peer VM is still transmitting) and
the VF interface is subsequently brought up (mana_open), the firmware
may deliver completions using stale CQ IDs from the old RX objects.
These CQ IDs can be reused by the ethernet driver for new TX CQs,
causing RX completions to land on TX CQs:
WARNING: mana_poll_tx_cq+0x1b8/0x220 [mana] (is_sq == false)
WARNING: mana_gd_process_eq_events+0x209/0x290 (cq_table lookup fails)
Fix this by disabling vPort RX steering before destroying RX WQ objects.
Note that mana_fence_rqs() cannot be used here because the fence
completion is delivered on the CQ, which is polled by user-mode (e.g.
DPDK) and not visible to the kernel driver.
Refactor the disable logic into a shared mana_disable_vport_rx() in
mana_en, exported for use by mana_ib, replacing the duplicate code.
The ethernet driver's mana_dealloc_queues() is also updated to call
this common function.
Fixes: 0266a17763 ("RDMA/mana_ib: Add a driver for Microsoft Azure Network Adapter")
Cc: stable@vger.kernel.org
Signed-off-by: Long Li <longli@microsoft.com>
Link: https://patch.msgid.link/20260325194100.1929056-1-longli@microsoft.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
As IPv6 is built-in only and there are no more users of ipv6_stub, the
ipv6_stub is now entirely obsolete.
Remove all the code related to the definition, initialization and usage.
Signed-off-by: Fernando Fernandez Mancera <fmancera@suse.de>
Tested-by: Ricardo B. Marlière <rbm@suse.com>
Link: https://patch.msgid.link/20260325120928.15848-11-fmancera@suse.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
As IPv6 is built-in only, the ipv6_bpf_stub can be removed completely.
Convert all ipv6_bpf_stub usage to direct function calls instead. The
fallback functions introduced previously will prevent linkage errors
when CONFIG_IPV6 is disabled.
Signed-off-by: Fernando Fernandez Mancera <fmancera@suse.de>
Tested-by: Ricardo B. Marlière <rbm@suse.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://patch.msgid.link/20260325120928.15848-10-fmancera@suse.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
As IPv6 is built-in only, the ipv6_stub infrastructure is no longer
necessary.
Convert remaining ipv6_stub users to make direct function calls. The
fallback functions introduced previously will prevent linkage errors
when CONFIG_IPV6 is disabled.
Signed-off-by: Fernando Fernandez Mancera <fmancera@suse.de>
Tested-by: Ricardo B. Marlière <rbm@suse.com>
Link: https://patch.msgid.link/20260325120928.15848-9-fmancera@suse.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
In preparation for dropping ipv6_stub and converting its users to direct
function calls, introduce static inline dummy functions and fallback
macros in the IPv6 networking headers. In addition, introduce checks on
fib6_nh_init(), ip6_dst_lookup_flow() and ip6_fragment() to avoid a
crash due to ipv6.disable=1 set during booting. The other functions are
safe as they cannot be called with ipv6.disable=1 set.
These fallbacks ensure that when CONFIG_IPV6 is completely disabled,
there are no compiling or linking errors due to code paths not guarded
by preprocessor macro IS_ENABLED(CONFIG_IPV6).
In addition, export ndisc_send_na(), ip6_route_input() and
ip6_fragment().
Signed-off-by: Fernando Fernandez Mancera <fmancera@suse.de>
Tested-by: Ricardo B. Marlière <rbm@suse.com>
Link: https://patch.msgid.link/20260325120928.15848-6-fmancera@suse.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
As IPv6 is built-in only, it does not make sense to continue using
IS_BUILTIN(CONFIG_IPV6). Therefore, replace it with IS_ENABLED() when
necessary and drop it if it isn't valid anymore.
Notice that there is still one instance related to ICMPv6, as it
requires more changes it will be handle separately.
Signed-off-by: Fernando Fernandez Mancera <fmancera@suse.de>
Tested-by: Ricardo B. Marlière <rbm@suse.com>
Acked-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://patch.msgid.link/20260325120928.15848-4-fmancera@suse.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
As IPv6 is built-in only, the macro is always evaluating to an empty
one. Remove it completely from the code.
Signed-off-by: Fernando Fernandez Mancera <fmancera@suse.de>
Link: https://patch.msgid.link/20260325120928.15848-3-fmancera@suse.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The RCU-protected codepaths (mpls_forward, mpls_dump_routes) can have
an inconsistent view of platform_labels vs platform_label in case of a
concurrent resize (resize_platform_label_table, under
platform_mutex). This can lead to OOB accesses.
This patch adds a seqcount, so that we get a consistent snapshot.
Note that mpls_label_ok is also susceptible to this, so the check
against RTA_DST in rtm_to_route_config, done outside platform_mutex,
is not sufficient. This value gets passed to mpls_label_ok once more
in both mpls_route_add and mpls_route_del, so there is no issue, but
that additional check must not be removed.
Reported-by: Yuan Tan <tanyuan98@outlook.com>
Reported-by: Yifan Wu <yifanwucs@gmail.com>
Reported-by: Juefei Pu <tomapufckgml@gmail.com>
Reported-by: Xin Liu <bird@lzu.edu.cn>
Fixes: 7720c01f3f ("mpls: Add a sysctl to control the size of the mpls label table")
Fixes: dde1b38e87 ("mpls: Convert mpls_dump_routes() to RCU.")
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Link: https://patch.msgid.link/cd8fca15e3eb7e212b094064cd83652e20fd9d31.1774284088.git.sd@queasysnail.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
- cfg80211: new APIs for NAN (Neighbor Aware Networking,
aka Wi-Fi Aware) so less work must be in firmware
- mt76:
- mt7996/mt7925 MLO fixes/improvements
- mt7996 NPU support (HW eth/wifi traffic offload)
- iwlwifi: UNII-9 and continuing UHR work
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEpeA8sTs3M8SN2hR410qiO8sPaAAFAmnFTegACgkQ10qiO8sP
aABpghAAmcubFELG/ivDfwujEXjeKRU4CGcFPWDnOwBo28w8bQ36SoKRh251BUSL
4XCEwZwPR2gFI77bJ7fLn1gsRNd8Cv+t8wsi2K3TV3bOy6wCxH85A7l4GmN5vGzP
9MLcAAT7R684YAC4gFAi3DqFmSucd/ZodAt93Cw7+ikXq2tvrbR5wgUv9AQ5mUIw
f5cqocOOv+4IbSL+r2cQnCAKLGWxVMJpoiWuAPpIQn7odcrncrhvBIG3l9ZC4KOL
BKiO+YpK8Yg3+uc9zrz+RwOcQx6TjzgAydFY/AnqOmGfQ2dGaWC/zy/5stCOVrfd
mAqw4jr14eAumUoHQoNrOBsWikuDBKmYMjHVObR3cKB9jJ/54CHtSYJVueg9gdhP
4+s5lNkX0zEt76wimYQRpCkYhalBUZMwUv3HFnab99PDDmWvNFS8uHi8i2g7U81i
yVdxI3MbQp2SRgJMDbKQPziSad1qJyIzg/LoN9fb6GV1DoNZ3IZabgVMOA2IoB0L
zYi3Yuyo63yhDh2Np9uzDsIRQAbTCdbou2fzPqy6CvOyG6JXxCI8PZpZAN7dqYxc
u8rljjaxQ4IYfBWrryFdHzIrYHJLo/B4g8kSFE+vzLiFblFmTxBoziHDWpJ4u5im
YTyOyBYAtzQf0l8cZPKzRq+AuVgIuJVNV/3zyxnoFxfqg/lUWNk=
=zap4
-----END PGP SIGNATURE-----
Merge tag 'wireless-next-2026-03-26' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next
Johannes Berg says:
====================
A fairly big set of changes all over, notably with:
- cfg80211: new APIs for NAN (Neighbor Aware Networking,
aka Wi-Fi Aware) so less work must be in firmware
- mt76:
- mt7996/mt7925 MLO fixes/improvements
- mt7996 NPU support (HW eth/wifi traffic offload)
- iwlwifi: UNII-9 and continuing UHR work
* tag 'wireless-next-2026-03-26' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next: (230 commits)
wifi: mac80211: ignore reserved bits in reconfiguration status
wifi: cfg80211: allow protected action frame TX for NAN
wifi: ieee80211: Add some missing NAN definitions
wifi: nl80211: Add a notification to notify NAN channel evacuation
wifi: nl80211: add NL80211_CMD_NAN_ULW_UPDATE notification
wifi: nl80211: allow reporting spurious NAN Data frames
wifi: cfg80211: allow ToDS=0/FromDS=0 data frames on NAN data interfaces
wifi: nl80211: define an API for configuring the NAN peer's schedule
wifi: nl80211: add support for NAN stations
wifi: cfg80211: separately store HT, VHT and HE capabilities for NAN
wifi: cfg80211: add support for NAN data interface
wifi: cfg80211: make sure NAN chandefs are valid
wifi: cfg80211: Add an API to configure local NAN schedule
wifi: mac80211: cleanup error path of ieee80211_do_open
wifi: mac80211: extract channel logic from link logic
wifi: iwlwifi: mld: set RX_FLAG_RADIOTAP_TLV_AT_END generically
wifi: iwlwifi: reduce the number of prints upon firmware crash
wifi: iwlwifi: fix the description of SESSION_PROTECTION_CMD
wifi: iwlwifi: mld: introduce iwl_mld_vif_fw_id_valid
wifi: iwlwifi: mld: block EMLSR during TDLS connections
...
====================
Link: https://patch.msgid.link/20260326152021.305959-3-johannes@sipsolutions.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Set the default number of queues per vPort to MANA_DEF_NUM_QUEUES (16),
as 16 queues can achieve optimal throughput for typical workloads. The
actual number of queues may be lower if it exceeds the hardware reported
limit. Users can increase the number of queues up to max_queues via
ethtool if needed.
Signed-off-by: Long Li <longli@microsoft.com>
Link: https://patch.msgid.link/20260323194925.1766385-1-longli@microsoft.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
__nf_ct_expect_find() and nf_ct_expect_find_get() are called under
rcu_read_lock() but they dereference the master conntrack via
exp->master.
Since the expectation does not hold a reference on the master conntrack,
this could be dying conntrack or different recycled conntrack than the
real master due to SLAB_TYPESAFE_RCU.
Store the netns, the master_tuple and the zone in struct
nf_conntrack_expect as a safety measure.
This patch is required by the follow up fix not to dump expectations
that do not belong to this netns.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Holding reference on the expectation is not sufficient, the master
conntrack object can just go away, making exp->master invalid.
To access exp->master safely:
- Grab the nf_conntrack_expect_lock, this gets serialized with
clean_from_lists() which also holds this lock when the master
conntrack goes away.
- Hold reference on master conntrack via nf_conntrack_find_get().
Not so easy since the master tuple to look up for the master conntrack
is not available in the existing problematic paths.
This patch goes for extending the nf_conntrack_expect_lock section
to address this issue for simplicity, in the cases that are described
below this is just slightly extending the lock section.
The add expectation command already holds a reference to the master
conntrack from ctnetlink_create_expect().
However, the delete expectation command needs to grab the spinlock
before looking up for the expectation. Expand the existing spinlock
section to address this to cover the expectation lookup. Note that,
the nf_ct_expect_iterate_net() calls already grabs the spinlock while
iterating over the expectation table, which is correct.
The get expectation command needs to grab the spinlock to ensure master
conntrack does not go away. This also expands the existing spinlock
section to cover the expectation lookup too. I needed to move the
netlink skb allocation out of the spinlock to keep it GFP_KERNEL.
For the expectation events, the IPEXP_DESTROY event is already delivered
under the spinlock, just move the delivery of IPEXP_NEW under the
spinlock too because the master conntrack event cache is reached through
exp->master.
While at it, add lockdep notations to help identify what codepaths need
to grab the spinlock.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
The expectation helper field is mostly unused. As a result, the
netfilter codebase relies on accessing the helper through exp->master.
Always set on the expectation helper field so it can be used to reach
the helper.
nf_ct_expect_init() is called from packet path where the skb owns
the ct object, therefore accessing exp->master for the newly created
expectation is safe. This saves a lot of updates in all callsites
to pass the ct object as parameter to nf_ct_expect_init().
This is a preparation patches for follow up fixes.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
If all available channel resources are used for NAN channels, and one of
them is shared with another interface, and that interface needs to move
to a different channel (for example STA interface that needs to do a
channel or a link switch), then the driver can evacuate one of the NAN
channels (i.e. detach it from its channel resource and announce to the
peers that this channel is ULWed). In that case, the driver needs to
notify user space about the channel evacuation, so the user space can
adjust the local schedule accordingly.
Add a notification to let userspace know about it.
Reviewed-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
Link: https://patch.msgid.link/20260219114327.d5bebfd5ff73.Iaaf5ef17e1ab7a38c19d60558e68fcf517e2b400@changeid
Link: https://patch.msgid.link/20260318123926.206536-11-miriam.rachel.korenblit@intel.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Add an NL80211 command to configure the NAN schedule of a NAN peer.
Such a schedule contains a list of NAN channels, and a mapping from each
time slots to the corresponding channel (or unscheduled).
Also contains more information about the schedule, such as sequence ID
and map ID.
Not all of the restrictions are validated in this patch. In particular,
comparison of two maps of the same peer requires storing/retrieving each
map of each peer, only for validation.
Therefore, it is the responsibilty of the driver to check that.
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
Link: https://patch.msgid.link/20260219114327.5b13fa5af4f6.If0e214ff5b52c9666e985fefa3f7be0ad14d93fb@changeid
Link: https://patch.msgid.link/20260318123926.206536-7-miriam.rachel.korenblit@intel.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
There are 2 types of logical links with a NAN peer:
- management (NMI), which is used for Tx/Rx of NAN management frames.
- data (NDI), which is used for Tx/Rx of data frames, or non-NAN
management frames.
The NMI station has two roles:
- representation of the NAN peer - for example, the peer's schedule
and the HT, VHT, HE capabilities - belong to the NMI station, and not to
the NDI ones.
- Tx/Rx of NAN management frames to/from the peer.
The NDI station is used for Tx/Rx data frames of a specific NDP that was
established with the NAN peer.
Note that a peer can choose to reuse its NMI address as the NDI address.
In that case, it is expected that two stations will be added even though
they will have the same address.
- An NDI station can only be added after the corresponding NMI station
was configured with capabilities.
- All the NDI stations will be removed before the NDI interface is brought
down.
- All NMI stations will be removed before NAN is stopped.
- Before NMI sta removal, all corresponding NDI stations will be removed
Add support for adding, removing, and changing NMI and NDI stations.
Reviewed-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
Link: https://patch.msgid.link/20260219114327.d280936ee832.I6d859eee759bb5824a9ffd2984410faf879ba00e@changeid
Link: https://patch.msgid.link/20260318123926.206536-6-miriam.rachel.korenblit@intel.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
In NAN, unlike in other modes, there is only one set of (HT, VHT, HE)
capabilities that is used for all channels (and bands) used in the NAN
data path.
This set of capabilities will have to be a special one, for example - have
the minimum of (HT-for-5 GHz, HT-for-2.4 GHz), careful handling of the
bits that have a different meaning for each band, etc.
While we could use the exiting sband/iftype capabilities, and require
identical capabilities for all bands (makes no sense since this means
that we will have VHT capabilities in the 2.4 GHz slot),
or require that only one of the sbands will be set,
or have logic to extract the minimum and handle the conflicting bits -
it seems simpler to add a dedicated set of capabilities which is special
for NAN, and is band agnostic, to be populated by the driver.
That way we also let the driver decide how it wants to handle the
conflicting bits.
Add this special set of these capabilities to wiphy:nan_capabilities, to be
populated by the driver.
Send it to user space.
Reviewed-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
Link: https://patch.msgid.link/20260219114327.4b6f3e4a81b4.I45422adc0df3ad4101d857a92e83f0de5cf241e1@changeid
Link: https://patch.msgid.link/20260318123926.206536-5-miriam.rachel.korenblit@intel.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
This new interface type represents a NAN data interface (NDI).
It is used for data communication with NAN peers.
Note that the existing NL80211_IFTYPE_NAN interface, which is the NAN
Management Interface (NMI), is used for management communication.
An NDI interface is started when a new NAN data path is about to
be established, and is stopped after the NAN data path is terminated.
- An NDI interface can only be started if the NMI is running, and NAN is
started.
- Before the NMI is stopped, the NDI interfaces will be stopped.
Add the new interface type, handle add/remove operations for it,
and makes sure of the conditions above.
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
Link: https://patch.msgid.link/20260219114327.0d681335c2e2.I92973483e927820ae2297853c141842fdb262747@changeid
Link: https://patch.msgid.link/20260318123926.206536-4-miriam.rachel.korenblit@intel.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Add an nl80211 API to allow user space to configure the local NAN
schedule.
The local schedule consists of a list of channel definitions and a schedule
map, in which each element covers a time slot and indicates on what
channel the device should be in that time slot.
Channels can be added to schedule even without being scheduled, for
reservation purposes.
A schedule can be configured either immedietally or be deferred, in case
there are already connected peers.
When the deferred flag is set, the command is a request from the device
to perform an announced schedule update: send the updated NAN
Availability - as set in this command - to the peers, and do the
actual switch to the new schedule on the right time (i.e. at the end of
the slot after the slot in which the update was sent to the peers).
In addition, a notification will be sent to indicate a deferred update
completion.
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
Link: https://patch.msgid.link/20260219114327.ecca178a2de0.Ic977ab08b4ed5cf9b849e55d3a59b01ad3fbd08e@changeid
Link: https://patch.msgid.link/20260318123926.206536-2-miriam.rachel.korenblit@intel.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
(tcp_congestion_ops)->cwnd_event() is called very often, with
@event oscillating between CA_EVENT_TX_START and other values.
This is not branch prediction friendly.
Provide a new cwnd_event_tx_start pointer dedicated for CA_EVENT_TX_START.
Both BBR and CUBIC benefit from this change, since they only care
about CA_EVENT_TX_START.
No change in kernel size:
$ scripts/bloat-o-meter -t vmlinux.0 vmlinux
add/remove: 4/4 grow/shrink: 3/1 up/down: 564/-568 (-4)
Function old new delta
bbr_cwnd_event_tx_start - 450 +450
cubictcp_cwnd_event_tx_start - 70 +70
__pfx_cubictcp_cwnd_event_tx_start - 16 +16
__pfx_bbr_cwnd_event_tx_start - 16 +16
tcp_unregister_congestion_control 93 99 +6
tcp_update_congestion_control 518 521 +3
tcp_register_congestion_control 422 425 +3
__tcp_transmit_skb 3308 3306 -2
__pfx_cubictcp_cwnd_event 16 - -16
__pfx_bbr_cwnd_event 16 - -16
cubictcp_cwnd_event 80 - -80
bbr_cwnd_event 454 - -454
Total: Before=25240512, After=25240508, chg -0.00%
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260323234920.1097858-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When codel_dequeue() finds an empty queue, it resets vars->dropping
but does not reset vars->first_above_time. The reference CoDel
algorithm (Nichols & Jacobson, ACM Queue 2012) resets both:
dodeque_result codel_queue_t::dodeque(time_t now) {
...
if (r.p == NULL) {
first_above_time = 0; // <-- Linux omits this
}
...
}
Note that codel_should_drop() does reset first_above_time when called
with a NULL skb, but codel_dequeue() returns early before ever calling
codel_should_drop() in the empty-queue case. The post-drop code paths
do reach codel_should_drop(NULL) and correctly reset the timer, so a
dropped packet breaks the cycle -- but the next delivered packet
re-arms first_above_time and the cycle repeats.
For sparse flows such as ICMP ping (one packet every 200ms-1s), the
first packet arms first_above_time, the flow goes empty, and the
second packet arrives after the interval has elapsed and gets dropped.
The pattern repeats, producing sustained loss on flows that are not
actually congested.
Test: veth pair, fq_codel, BQL disabled, 30000 iptables rules in the
consumer namespace (NAPI-64 cycle ~14ms, well above fq_codel's 5ms
target), ping at 5 pps under UDP flood:
Before fix: 26% ping packet loss
After fix: 0% ping packet loss
Fix by resetting first_above_time to zero in the empty-queue path
of codel_dequeue(), matching the reference algorithm.
Fixes: 76e3cc126b ("codel: Controlled Delay AQM")
Fixes: d068ca2ae2 ("codel: split into multiple files")
Co-developed-by: Jesper Dangaard Brouer <hawk@kernel.org>
Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>
Signed-off-by: Jonas Köppeler <j.koeppeler@tu-berlin.de>
Reported-by: Chris Arges <carges@cloudflare.com>
Tested-by: Jonas Köppeler <j.koeppeler@tu-berlin.de>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/all/20260318134826.1281205-7-hawk@kernel.org/
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260323174920.253526-1-hawk@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEH7ZpcWbFyOOp6OJbrB3Eaf9PW7cFAmnA+cgACgkQrB3Eaf9P
W7fJgBAAlZKkRki11NUIeI8IjOzEoMRShSsbOMjeCVBUDKc05krfWyln1FLuQbD/
BNSgRNFQ0uT653Cn88CbVRtxuebkmhde7bH29yEpfnsd/duVDlJaHkwjCEH15hvb
zIeWrzdn+ct77Kg6i1EsJ5BfC7kADYWfgCFrSAAz2MEerCGNcLn2pKlopAEIGAD9
Ahd7XohBK9uxP8ZhF4GLQAjTImTDEQmJJek0QDdGp6sr+V0PuIh1MQ75SjW+9rZK
4p+rHhsOGCcjobljbksYTJd9/5hC2ThqsYBBbRsxS+g9ibvMvDoal2PCtBA7SnHZ
F66PL8Lui555V4jL80Fi80Mu/uquizOX0iMiVjhAtepiqxn9IZleXutddPN/9yCg
tHlk7IytBSovGBBT/AdL6F8hOVvwAFa/pnr/6pzjcjmiIkwSLMCU0ge/yjF01vGK
tnltSGfuZ9+aF6XEjAmIZ2jMbA7mtKIoc9VOJB5/96yFS3G48/E7Aq6SNYIF8vyB
N6xgdbhqp4PfIYuQ+zWcibj2XAGlXW9RF34i2CSbf7BlztetoctS8iuHlUWIlkS3
dcYAp7/ZQWRM779pg9pTKw7kGUwPlS0LbUBr4Z8nvcxdBUULuKc+9PAgRO3nX1v0
7EbIukGdhc+hvM8zC/aok8g6h8cPNvvaaL8CLL+wSYt28/xHrLs=
=E39n
-----END PGP SIGNATURE-----
Merge tag 'ipsec-2026-03-23' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec
Steffen Klassert says:
====================
pull request (net): ipsec 2026-03-23
1) Add missing extack for XFRMA_SA_PCPU in add_acquire and allocspi.
From Sabrina Dubroca.
2) Fix the condition on x->pcpu_num in xfrm_sa_len by using the
proper check. From Sabrina Dubroca.
3) Call xdo_dev_state_delete during state update to properly cleanup
the xdo device state. From Sabrina Dubroca.
4) Fix a potential skb leak in espintcp when async crypto is used.
From Sabrina Dubroca.
5) Validate inner IPv4 header length in IPTFS payload to avoid
parsing malformed packets. From Roshan Kumar.
6) Fix skb_put() panic on non-linear skb during IPTFS reassembly.
From Fernando Fernandez Mancera.
7) Silence various sparse warnings related to RCU, state, and policy
handling. From Sabrina Dubroca.
8) Fix work re-schedule race after cancel in xfrm_nat_keepalive_net_fini().
From Hyunwoo Kim.
9) Prevent policy_hthresh.work from racing with netns teardown by using
a proper cleanup mechanism. From Minwoo Ra.
10) Validate that the family of the source and destination addresses match
in pfkey_send_migrate(). From Eric Dumazet.
11) Only publish mode_data after the clone is setup in the IPTFS receive path.
This prevents leaving x->mode_data pointing at freed memory on error.
From Paul Moses.
Please pull or let me know if there are problems.
ipsec-2026-03-23
* tag 'ipsec-2026-03-23' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec:
xfrm: iptfs: only publish mode_data after clone setup
af_key: validate families in pfkey_send_migrate()
xfrm: prevent policy_hthresh.work from racing with netns teardown
xfrm: Fix work re-schedule after cancel in xfrm_nat_keepalive_net_fini()
xfrm: avoid RCU warnings around the per-netns netlink socket
xfrm: add rcu_access_pointer to silence sparse warning for xfrm_input_afinfo
xfrm: policy: silence sparse warning in xfrm_policy_unregister_afinfo
xfrm: policy: fix sparse warnings in xfrm_policy_{init,fini}
xfrm: state: silence sparse warnings during netns exit
xfrm: remove rcu/state_hold from xfrm_state_lookup_spi_proto
xfrm: state: add xfrm_state_deref_prot to state_by* walk under lock
xfrm: state: fix sparse warnings around XFRM_STATE_INSERT
xfrm: state: fix sparse warnings in xfrm_state_init
xfrm: state: fix sparse warnings on xfrm_state_hold_rcu
xfrm: iptfs: fix skb_put() panic on non-linear skb during reassembly
xfrm: iptfs: validate inner IPv4 header length in IPTFS payload
esp: fix skb leak with espintcp and async crypto
xfrm: call xdo_dev_state_delete during state update
xfrm: fix the condition on x->pcpu_num in xfrm_sa_len
xfrm: add missing extack for XFRMA_SA_PCPU in add_acquire and allocspi
====================
Link: https://patch.msgid.link/20260323083440.2741292-1-steffen.klassert@secunet.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
When binding a udp_sock to a local address and port, UDP uses
two hashes (udptable->hash and udptable->hash2) for collision
detection. The current code switches to "hash2" when
hslot->count > 10.
"hash2" is keyed by local address and local port.
"hash" is keyed by local port only.
The issue can be shown in the following bind sequence (pseudo code):
bind(fd1, "[fd00::1]:8888")
bind(fd2, "[fd00::2]:8888")
bind(fd3, "[fd00::3]:8888")
bind(fd4, "[fd00::4]:8888")
bind(fd5, "[fd00::5]:8888")
bind(fd6, "[fd00::6]:8888")
bind(fd7, "[fd00::7]:8888")
bind(fd8, "[fd00::8]:8888")
bind(fd9, "[fd00::9]:8888")
bind(fd10, "[fd00::10]:8888")
/* Correctly return -EADDRINUSE because "hash" is used
* instead of "hash2". udp_lib_lport_inuse() detects the
* conflict.
*/
bind(fail_fd, "[::]:8888")
/* After one more socket is bound to "[fd00::11]:8888",
* hslot->count exceeds 10 and "hash2" is used instead.
*/
bind(fd11, "[fd00::11]:8888")
bind(fail_fd, "[::]:8888") /* succeeds unexpectedly */
The same issue applies to the IPv4 wildcard address "0.0.0.0"
and the IPv4-mapped wildcard address "::ffff:0.0.0.0". For
example, if there are existing sockets bound to
"192.168.1.[1-11]:8888", then binding "0.0.0.0:8888" or
"[::ffff:0.0.0.0]:8888" can also miss the conflict when
hslot->count > 10.
TCP inet_csk_get_port() already has the correct check in
inet_use_bhash2_on_bind(). Rename it to
inet_use_hash2_on_bind() and move it to inet_hashtables.h
so udp.c can reuse it in this fix.
Fixes: 30fff9231f ("udp: bind() optimisation")
Reported-by: Andrew Onyshchuk <oandrew@meta.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20260319181817.1901357-1-martin.lau@linux.dev
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The cited commit mechanically put fib6_remove_gc_list()
just after every fib6_clean_expires() call.
When a temporary route is promoted to a permanent route,
there may already be exception routes tied to it.
If fib6_remove_gc_list() removes the route from tb6_gc_hlist,
such exception routes will no longer be aged.
Let's replace fib6_remove_gc_list() with a new helper
fib6_may_remove_gc_list() and use fib6_age_exceptions() there.
Note that net->ipv6 is only compiled when CONFIG_IPV6 is
enabled, so fib6_{add,remove,may_remove}_gc_list() are guarded.
Fixes: 5eb902b8e7 ("net/ipv6: Remove expired routes with a separated list of routes.")
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/20260320072317.2561779-3-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This attempt to fix regressions caused by reusing ident which apparently
is not handled well on certain stacks causing the stack to not respond to
requests, so instead of simple returning the first unallocated id this
stores the last used tx_ident and then attempt to use the next until all
available ids are exausted and then cycle starting over to 1.
Link: https://bugzilla.kernel.org/show_bug.cgi?id=221120
Link: https://bugzilla.kernel.org/show_bug.cgi?id=221177
Fixes: 6c3ea155e5 ("Bluetooth: L2CAP: Fix not tracking outstanding TX ident")
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Tested-by: Christian Eggers <ceggers@arri.de>
- cfg80211/mac80211: S1G and UHR improvements
- hwsim: incumbent signal report test support
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEpeA8sTs3M8SN2hR410qiO8sPaAAFAmm7sb4ACgkQ10qiO8sP
aAD+iQ/+IYWmM1Z/Iu6eZZx/VPrc4Xnj/8UgbalyjetyLRYFNEvawFSdutqZ23uI
FO7vYbzGXMtAlt7fjmxVKMiN4aoX+rISRjG5cnH1qPpeVO8w9fnOZyqmNUFJFboN
ibpr4dqPIS2qZDKegvOa9JO+8KkkPerPWl608eOzXPxoZaZAMnXOhWuV4cWdvuTT
vEnL+Ma4ckkOV6QdBFazYaxAyTt3Mpqj5ULodixtKPMdgB3P+6mAVipp/icE5R1P
R/Vd7Fn+0r7wb/4+1S6DcCBvT6V6Ui94bIRF9DB5LGG/9iLPrGYRD52qQpetzXzA
Si238bs7qi/6t6Q5tfzK1LZVnzZXTUqcWGS6ba4JiMxrLTAK1AEmcLved6A48ywt
YH9zImLRBRMSANbH27BoWvijT5YZGMetH06cTdFmZ8MMGoYV7CWBxVOaIroH7WMx
exMnWEcX6PUVMtlIR4FTGwX/nalGbvnBtoMv9ei3NRb2Dkart8OFT6vIDfy6TBnD
BzAUE3pDAW3I7ukbLQGJ3mmanZpHtF/Xgfr5Y9EbZHPjtC08l7cwdd2zn0n3Q2qu
JGlzZiut6sJTfnRESbUvJ6fnCMdGARpQxq6p2At3njJW0sncvyV9WFKh4A+ReaDr
PQ24fgapG5PNEISevO+/FV1z2qZ0+IbHSmcH+BIoktBnPUBLZFo=
=cLVw
-----END PGP SIGNATURE-----
Merge tag 'wireless-next-2026-03-19' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next
Johannes Berg says:
====================
Aside from various small improvements/cleanups, not much:
- cfg80211/mac80211: S1G and UHR improvements
- hwsim: incumbent signal report test support
* tag 'wireless-next-2026-03-19' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next: (31 commits)
qtnfmac: use alloc_netdev macro for single queue devices
wifi: libertas: don't kill URBs in interrupt context
wifi: libertas: use USB anchors for tracking in-flight URBs
wifi: nl80211: use int for band coming from netlink
wifi: rsi_91x_usb: do not pause rfkill polling when stopping mac80211
wifi: mac80211: fix STA link removal during link removal
wifi: nl80211: reject S1G/60G with HT chantype
wifi: ieee80211: fix definition of EHT-MCS 15 in MRU
wifi: cfg80211: check non-S1G width with S1G chandef
wifi: cfg80211: restrict cfg80211_chandef_create() to only HT-based bands
wifi: mac80211: don't use cfg80211_chandef_create() for default chandef
wifi: mac80211: Remove deleted sta links in ieee80211_ml_reconf_work()
wifi: b43: use register definitions in nphy_op_software_rfkill
wifi: cfg80211: split control freq check from chandef check
wifi: mac80211: always use full chanctx compatible check
wifi: mac80211: refactor chandef tracing macros
wifi: mac80211: validate HE 6 GHz operation when EHT is used
wifi: nl80211: split out UHR operation information
wifi: mwifiex: drop redundant device reference
wifi: rt2x00: drop redundant device reference
...
====================
Link: https://patch.msgid.link/20260319082439.79875-3-johannes@sipsolutions.net
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
In network setup as below:
fastpath bypass
.----------------------------------------.
/ \
| IP - forwarding |
| / \ v
| / wan ...
| /
| |
| |
| brlan.1
| |
| +-------------------------------+
| | vlan 1 |
| | |
| | brlan (vlan-filtering) |
| | +---------------+
| | | DSA-SWITCH |
| | vlan 1 | |
| | to | |
| | untagged 1 vlan 1 |
| +---------------+---------------+
. / \
----->wlan1 lan0
. .
. ^
^ vlan 1 tagged packets
untagged packets
br_vlan_fill_forward_path_mode() sets DEV_PATH_BR_VLAN_UNTAG_HW when
filling in from brlan.1 towards wlan1. But it should be set to
DEV_PATH_BR_VLAN_UNTAG in this case. Using BR_VLFLAG_ADDED_BY_SWITCHDEV
is not correct. The dsa switchdev adds it as a foreign port.
The same problem for all foreignly added dsa vlans on the bridge.
First add the vlan, trying only native devices.
If this fails, we know this may be a vlan from a foreign device.
Use BR_VLFLAG_TAGGING_BY_SWITCHDEV to make sure DEV_PATH_BR_VLAN_UNTAG_HW
is set only when there if no foreign device involved.
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: Eric Woudstra <ericwouds@gmail.com>
Link: https://patch.msgid.link/20260317110347.363875-1-ericwouds@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
For RX CQEs with type CQE_RX_COALESCED_4, to measure the coalescing
efficiency, add counters to count how many contains 2, 3, 4 packets
respectively.
Also, add a counter for the error case of first packet with length == 0.
Reviewed-by: Long Li <longli@microsoft.com>
Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
Link: https://patch.msgid.link/20260317191826.1346111-4-haiyangz@linux.microsoft.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Our NIC can have up to 4 RX packets on 1 CQE. To support this feature,
check and process the type CQE_RX_COALESCED_4. The default setting is
disabled, to avoid possible regression on latency.
And, add ethtool handler to switch this feature. To turn it on, run:
ethtool -C <nic> rx-cqe-frames 4
To turn it off:
ethtool -C <nic> rx-cqe-frames 1
The rx-cqe-nsec is the time out value in nanoseconds after the first
packet arrival in a coalesced CQE to be sent. It's read-only for this
NIC.
Reviewed-by: Long Li <longli@microsoft.com>
Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
Link: https://patch.msgid.link/20260317191826.1346111-3-haiyangz@linux.microsoft.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
- cfg80211:
- guarantee pmsr work is cancelled
- mac80211:
- reject TDLS operations on non-TDLS stations
- fix crash in AP_VLAN bandwidth change
- fix leak or double-free on some TX preparation
failures
- remove keys needed for beacons _after_ stopping
those
- fix debugfs static branch race
- avoid underflow in inactive time
- fix another NULL dereference in mesh on invalid
frames
- ti/wlcore: avoid infinite realloc loop
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEpeA8sTs3M8SN2hR410qiO8sPaAAFAmm63pMACgkQ10qiO8sP
aAD+0w//QZRJt2tOsp/QOqmlEGs5zk8BnutcrU0ov/8OatgCX5sYWr2GD9Eub15P
t+NWWSJoOaXrEvlpyhFTDB4RPnKKUbajFVGmQJTgddFvYzXARDupFrQIpBZ+UqYr
kwNH/vnHxOuQ5MLaiuvaldbMdzdsH1R9Lr0nBqilg1tL1emQVTFFAfMh6URlbzB/
EaMG7sWKyzjVCvaGNBKsjyrfdWAz4LkyAw47St/MDV9GofSdSA2Oyt7PGM+TYuQ1
ozKsbOBiXuKIQkNVXNFQrrsGePY1hXgj4F0mO1KvjRov+2Wq+Xk+KFFpCCGeZrGt
ZTehROtzS3I96UZmpFimJGdLOiiFC/CqP9bDBOn4y87Ink24m0/z2WFyLcp4IpDy
KQFaPpvFnigZmuB+crtv+OI1bNuzb04EjfC1+M3AhDgkcMaSUUD/zxczge4DP1tX
llYMZh0LL8CdUezTBcB/l3uBMTWh6R7T2bUUIIGLtyMqpMBl4GwncJ7dQFl2wyXr
ytXZFE4rJNDXzvxkYOoOrT+JCD1COPiIuddy7xXWdxuC6yzY4H7QXGtljgOZUaqf
0ED6HiTvLG25lep1SLmgbwN2x9+izGxjWrUFqT7DIjxQo9bBulwBUARosoGAAxXW
7pio7oKDtYVD8FYSsFhbmNS/z+9Gs5wqgrfSyjrmvxHZm+rJJFw=
=C5Rn
-----END PGP SIGNATURE-----
Merge tag 'wireless-2026-03-18' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless
Johannes Berg says:
====================
Just a few updates:
- cfg80211:
- guarantee pmsr work is cancelled
- mac80211:
- reject TDLS operations on non-TDLS stations
- fix crash in AP_VLAN bandwidth change
- fix leak or double-free on some TX preparation
failures
- remove keys needed for beacons _after_ stopping
those
- fix debugfs static branch race
- avoid underflow in inactive time
- fix another NULL dereference in mesh on invalid
frames
- ti/wlcore: avoid infinite realloc loop
* tag 'wireless-2026-03-18' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless:
wifi: mac80211: always free skb on ieee80211_tx_prepare_skb() failure
wifi: wlcore: Return -ENOMEM instead of -EAGAIN if there is not enough headroom
wifi: mac80211: fix NULL deref in mesh_matches_local()
wifi: mac80211: check tdls flag in ieee80211_tdls_oper
wifi: cfg80211: cancel pmsr_free_wk in cfg80211_pmsr_wdev_down
wifi: mac80211: Fix static_branch_dec() underflow for aql_disable.
mac80211: fix crash in ieee80211_chan_bw_change for AP_VLAN stations
wifi: mac80211: use jiffies_delta_to_msecs() for sta_info inactive times
wifi: mac80211: remove keys after disabling beaconing
wifi: mac80211_hwsim: fully initialise PMSR capabilities
====================
Link: https://patch.msgid.link/20260318172515.381148-3-johannes@sipsolutions.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When CONFIG_IPV6 is disabled, the udp_sock_create6() function returns 0
(success) without actually creating a socket. Callers such as
fou_create() then proceed to dereference the uninitialized socket
pointer, resulting in a NULL pointer dereference.
The captured NULL deref crash:
BUG: kernel NULL pointer dereference, address: 0000000000000018
RIP: 0010:fou_nl_add_doit (net/ipv4/fou_core.c:590 net/ipv4/fou_core.c:764)
[...]
Call Trace:
<TASK>
genl_family_rcv_msg_doit.constprop.0 (net/netlink/genetlink.c:1114)
genl_rcv_msg (net/netlink/genetlink.c:1194 net/netlink/genetlink.c:1209)
[...]
netlink_rcv_skb (net/netlink/af_netlink.c:2550)
genl_rcv (net/netlink/genetlink.c:1219)
netlink_unicast (net/netlink/af_netlink.c:1319 net/netlink/af_netlink.c:1344)
netlink_sendmsg (net/netlink/af_netlink.c:1894)
__sock_sendmsg (net/socket.c:727 (discriminator 1) net/socket.c:742 (discriminator 1))
__sys_sendto (./include/linux/file.h:62 (discriminator 1) ./include/linux/file.h:83 (discriminator 1) net/socket.c:2183 (discriminator 1))
__x64_sys_sendto (net/socket.c:2213 (discriminator 1) net/socket.c:2209 (discriminator 1) net/socket.c:2209 (discriminator 1))
do_syscall_64 (arch/x86/entry/syscall_64.c:63 (discriminator 1) arch/x86/entry/syscall_64.c:94 (discriminator 1))
entry_SYSCALL_64_after_hwframe (net/arch/x86/entry/entry_64.S:130)
This patch makes udp_sock_create6 return -EPFNOSUPPORT instead, so
callers correctly take their error paths. There is only one caller of
the vulnerable function and only privileged users can trigger it.
Fixes: fd384412e1 ("udp_tunnel: Seperate ipv6 functions into its own file.")
Reported-by: Weiming Shi <bestswngs@gmail.com>
Signed-off-by: Xiang Mei <xmei5@asu.edu>
Link: https://patch.msgid.link/20260317010241.1893893-1-xmei5@asu.edu
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
ieee80211_tx_prepare_skb() has three error paths, but only two of them
free the skb. The first error path (ieee80211_tx_prepare() returning
TX_DROP) does not free it, while invoke_tx_handlers() failure and the
fragmentation check both do.
Add kfree_skb() to the first error path so all three are consistent,
and remove the now-redundant frees in callers (ath9k, mt76,
mac80211_hwsim) to avoid double-free.
Document the skb ownership guarantee in the function's kdoc.
Signed-off-by: Felix Fietkau <nbd@nbd.name>
Link: https://patch.msgid.link/20260314065455.2462900-1-nbd@nbd.name
Fixes: 06be6b149f ("mac80211: add ieee80211_tx_prepare_skb() helper function")
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Fix a use-after-free in the clsact qdisc upon init/destroy rollback asymmetry.
The latter is achieved by first fully initializing a clsact instance, and
then in a second step having a replacement failure for the new clsact qdisc
instance. clsact_init() initializes ingress first and then takes care of the
egress part. This can fail midway, for example, via tcf_block_get_ext(). Upon
failure, the kernel will trigger the clsact_destroy() callback.
Commit 1cb6f0bae5 ("bpf: Fix too early release of tcx_entry") details the
way how the transition is happening. If tcf_block_get_ext on the q->ingress_block
ends up failing, we took the tcx_miniq_inc reference count on the ingress
side, but not yet on the egress side. clsact_destroy() tests whether the
{ingress,egress}_entry was non-NULL. However, even in midway failure on the
replacement, both are in fact non-NULL with a valid egress_entry from the
previous clsact instance.
What we really need to test for is whether the qdisc instance-specific ingress
or egress side previously got initialized. This adds a small helper for checking
the miniq initialization called mini_qdisc_pair_inited, and utilizes that upon
clsact_destroy() in order to fix the use-after-free scenario. Convert the
ingress_destroy() side as well so both are consistent to each other.
Fixes: 1cb6f0bae5 ("bpf: Fix too early release of tcx_entry")
Reported-by: Keenan Dong <keenanat2000@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Martin KaFai Lau <martin.lau@kernel.org>
Acked-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://patch.msgid.link/20260313065531.98639-1-daniel@iogearbox.net
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Seems we have been carrying around repeated defines for unaligned mode
logic. Remove redundant ones.
Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20260313111931.438911-1-maciej.fijalkowski@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Multiple PFs may reside on the same physical chip, running a single
firmware. Some of the resources and configurations may be shared among
these PFs. Currently, there is no good object to pin the configuration
knobs on.
Introduce a shared devlink instance, instantiated upon probe of
the first PF and removed during remove of the last PF. The shared
devlink instance is not backed by any device device, as there is
no PCI device related to it.
The implementation uses reference counting to manage the lifecycle:
each PF that probes calls devlink_shd_get() to get or create
the shared instance, and calls devlink_shd_put() when it removes.
The shared instance is automatically destroyed when the last PF removes.
Example:
pci/0000:08:00.0: index 0
nested_devlink:
auxiliary/mlx5_core.eth.0
devlink_index/1: index 1
nested_devlink:
pci/0000:08:00.0
pci/0000:08:00.1
auxiliary/mlx5_core.eth.0: index 2
pci/0000:08:00.1: index 3
nested_devlink:
auxiliary/mlx5_core.eth.1
auxiliary/mlx5_core.eth.1: index 4
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Link: https://patch.msgid.link/20260312100407.551173-12-jiri@resnulli.us
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
In preparation to dev-less devlinks, add devlink_dev_driver_name()
that returns the driver name stored in devlink struct, and use it in
all trace events.
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Link: https://patch.msgid.link/20260312100407.551173-9-jiri@resnulli.us
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Introduce devlink_bus_name() and devlink_dev_name() helpers and
convert all direct accesses to devlink->dev->bus->name and
dev_name(devlink->dev) to use them.
This prepares for dev-less devlink instances where these helpers
will be extended to handle the missing device.
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Link: https://patch.msgid.link/20260312100407.551173-3-jiri@resnulli.us
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
ip[6]tunnel_xmit() can drop packets if a too deep recursion level
is detected.
Add SKB_DROP_REASON_RECURSION_LIMIT drop reason.
We will use this reason later in __dev_queue_xmit().
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Joe Damato <joe@dama.to>
Link: https://patch.msgid.link/20260312201824.203093-2-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
By default, the Linux TCP implementation does not shrink the
advertised window (RFC 7323 calls this "window retraction") with the
following exceptions:
- When an incoming segment cannot be added due to the receive buffer
running out of memory. Since commit 8c670bdfa5 ("tcp: correct
handling of extreme memory squeeze") a zero window will be
advertised in this case. It turns out that reaching the required
memory pressure is easy when window scaling is in use. In the
simplest case, sending a sufficient number of segments smaller than
the scale factor to a receiver that does not read data is enough.
- Commit b650d953cd ("tcp: enforce receive buffer memory limits by
allowing the tcp window to shrink") addressed the "eating memory"
problem by introducing a sysctl knob that allows shrinking the
window before running out of memory.
However, RFC 7323 does not only state that shrinking the window is
necessary in some cases, it also formulates requirements for TCP
implementations when doing so (Section 2.4).
This commit addresses the receiver-side requirements: After retracting
the window, the peer may have a snd_nxt that lies within a previously
advertised window but is now beyond the retracted window. This means
that all incoming segments (including pure ACKs) will be rejected
until the application happens to read enough data to let the peer's
snd_nxt be in window again (which may be never).
To comply with RFC 7323, the receiver MUST honor any segment that
would have been in window for any ACK sent by the receiver and, when
window scaling is in effect, SHOULD track the maximum window sequence
number it has advertised. This patch tracks that maximum window
sequence number rcv_mwnd_seq throughout the connection and uses it in
tcp_sequence() when deciding whether a segment is acceptable.
rcv_mwnd_seq is updated together with rcv_wup and rcv_wnd in
tcp_select_window(). If we count tcp_sequence() as fast path, it is
read in the fast path. Therefore, rcv_mwnd_seq is put into rcv_wnd's
cacheline group.
The logic for handling received data in tcp_data_queue() is already
sufficient and does not need to be updated.
Signed-off-by: Simon Baatz <gmbnomis@gmail.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260309-tcp_rfc7323_retract_wnd_rfc-v3-1-4c7f96b1ec69@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Since UDP and UDP-Lite had dedicated socket hash tables for
each, we have had to pass the pointer down to many socket
lookup functions.
UDP-Lite gone, and we do not need to do that.
Let's fetch net->ipv4.udp_table only where needed in IPv4
stack: __udp4_lib_lookup(), __udp4_lib_mcast_deliver(),
and udp_diag_dump().
Some functions are renamed as the wrapper functions are no
longer needed.
__udp4_lib_err() -> udp_err()
__udp_diag_destroy() -> udp_diag_destroy()
udp_dump_one() -> udp_diag_dump_one()
udp_dump() -> udp_diag_dump()
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20260311052020.1213705-15-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Since UDP and UDP-Lite had dedicated socket hash tables for
each, we have had to pass the pointer down to many socket
lookup functions.
UDP-Lite gone, and we do not need to do that.
Let's fetch net->ipv4.udp_table only where needed in IPv6
stack: __udp6_lib_lookup() and __udp6_lib_mcast_deliver().
__udp6_lib_err() is renamed to udpv6_err() as its wrapper
is no longer needed.
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20260311052020.1213705-14-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Since UDP and UDP-Lite had dedicated socket hash tables for
each, we have had to fetch them from different pointers for
procfs or bpf iterator.
UDP always has its global or per-netns table in
net->ipv4.udp_table and struct udp_seq_afinfo.udp_table
is NULL.
OTOH, UDP-Lite had only one global table in the pointer.
We no longer use the field.
Let's remove it and udp_get_table_seq().
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20260311052020.1213705-12-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Since UDP and UDP-Lite had dedicated socket hash tables for
each, we have had to fetch them from different pointers.
UDP always has its global or per-netns table in
net->ipv4.udp_table and struct proto.h.udp_table is NULL.
OTOH, UDP-Lite had only one global table in the pointer.
We no longer use the field.
Let's remove it and udp_get_table_prot().
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20260311052020.1213705-11-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
UDP-Lite supports variable-length checksum and has two socket
options, UDPLITE_SEND_CSCOV and UDPLITE_RECV_CSCOV, to control
the checksum coverage.
Let's remove the support.
setsockopt(UDPLITE_SEND_CSCOV / UDPLITE_RECV_CSCOV) was only
available for UDP-Lite and returned -ENOPROTOOPT for UDP.
Now, the options are handled in ip_setsockopt() and
ipv6_setsockopt(), which still return the same error.
getsockopt(UDPLITE_SEND_CSCOV / UDPLITE_RECV_CSCOV) was available
for UDP and always returned 0, meaning full checksum, but now
-ENOPROTOOPT is returned.
Given that getsockopt() is meaningless for UDP and even the options
are not defined under include/uapi/, this should not be a problem.
$ man 7 udplite
...
BUGS
Where glibc support is missing, the following definitions
are needed:
#define IPPROTO_UDPLITE 136
#define UDPLITE_SEND_CSCOV 10
#define UDPLITE_RECV_CSCOV 11
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20260311052020.1213705-10-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
UDP TX paths also have some code for UDP-Lite partial
checksum:
* udplite_csum() in udp_send_skb() and udp_v6_send_skb()
* udplite_getfrag() in udp_sendmsg() and udpv6_sendmsg()
Let's remove such code.
Now, we can use IPPROTO_UDP directly instead of sk->sk_protocol
or fl6->flowi6_proto for csum_tcpudp_magic() and csum_ipv6_magic().
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20260311052020.1213705-9-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
UDP-Lite supports the partial checksum and the coverage is
stored in the position of the length field of struct udphdr.
In RX paths, udp4_csum_init() / udp6_csum_init() save the value
in UDP_SKB_CB(skb)->cscov and set UDP_SKB_CB(skb)->partial_cov
to 1 if the coverage is not full.
The subsequent processing diverges depending on the value,
but such paths are now dead.
Also, these functions have some code guarded for UDP:
* udp_unicast_rcv_skb / udp6_unicast_rcv_skb
* __udp4_lib_rcv() and __udp6_lib_rcv().
Let's remove the partial csum code and the unnecessary
guard for UDP-Lite in RX.
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20260311052020.1213705-8-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Since UDP and UDP-Lite shared most of the code, we have had
to check the protocol every time we increment SNMP stats.
Now that the UDP-Lite paths are dead, let's remove UDP-Lite
SNMP stats.
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20260311052020.1213705-6-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
We have deprecated IPv6 UDP-Lite sockets.
Let's drop support for IPv4 UDP-Lite sockets as well.
Most of the changes are similar to the IPv6 patch: removing
udplite.c and udp_impl.h, marking most functions in udp_impl.h
as static, moving the prototype for udp_recvmsg() to udp.h, and
adding INDIRECT_CALLABLE_SCOPE for it.
In addition, the INET_DIAG support for UDP-Lite is dropped.
We will remove the remaining dead code in the following patches.
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20260311052020.1213705-5-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
As announced in commit be28c14ac8 ("udplite: Print deprecation
notice."), it's time to deprecate UDP-Lite.
As a first step, let's drop support for IPv6 UDP-Lite sockets.
We will remove the remaining dead code gradually.
Along with the removal of udplite.c, most of the functions exposed
via udp_impl.h are made static.
The prototypes of udpv6_sendmsg() and udpv6_recvmsg() are moved
to udp.h, but only udpv6_recvmsg() has INDIRECT_CALLABLE_DECLARE()
because udpv6_sendmsg() is exported for rxrpc since commit ed472b0c87
("rxrpc: Call udp_sendmsg() directly").
Also, udpv6_recvmsg() needs INDIRECT_CALLABLE_SCOPE for
CONFIG_MITIGATION_RETPOLINE=n.
Note that udplite.h is included temporarily for udplite_csum().
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20260311052020.1213705-3-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Since commit a3d2599b24 ("ipv{4,6}/udp{,lite}: simplify proc
registration"), udp4_seq_show() and udp6_seq_show() are not
used in net/ipv4/udplite.c and net/ipv6/udplite.c.
Instead, udp_seq_ops and udp6_seq_ops are exposed to UDP-Lite.
Let's make udp4_seq_show() and udp6_seq_show() static.
udp_seq_ops and udp6_seq_ops are moved to udp_impl.h so that
we can make them static when the header is removed.
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20260311052020.1213705-2-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
If cloning the second stateful expression in the element via GFP_ATOMIC
fails, then the first stateful expression remains in place without being
released.
unreferenced object (percpu) 0x607b97e9cab8 (size 16):
comm "softirq", pid 0, jiffies 4294931867
hex dump (first 16 bytes on cpu 3):
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
backtrace (crc 0):
pcpu_alloc_noprof+0x453/0xd80
nft_counter_clone+0x9c/0x190 [nf_tables]
nft_expr_clone+0x8f/0x1b0 [nf_tables]
nft_dynset_new+0x2cb/0x5f0 [nf_tables]
nft_rhash_update+0x236/0x11c0 [nf_tables]
nft_dynset_eval+0x11f/0x670 [nf_tables]
nft_do_chain+0x253/0x1700 [nf_tables]
nft_do_chain_ipv4+0x18d/0x270 [nf_tables]
nf_hook_slow+0xaa/0x1e0
ip_local_deliver+0x209/0x330
Fixes: 563125a73a ("netfilter: nftables: generalize set extension to support for several expressions")
Reported-by: Gurpreet Shergill <giki.shergill@proton.me>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Florian Westphal <fw@strlen.de>
This reverts commit 648946966a ("netfilter: nft_set_rbtree: validate
open interval overlap").
There have been reports of nft failing to laod valid rulesets after this
patch was merged into -stable.
I can reproduce several such problem with recent nft versions, including
nft 1.1.6 which is widely shipped by distributions.
We currently have little choice here.
This commit can be resurrected at some point once the nftables fix that
triggers the false overlap positive has appeared in common distros
(see e83e32c8d1cd ("mnl: restore create element command with large batches" in
nftables.git).
Fixes: 648946966a ("netfilter: nft_set_rbtree: validate open interval overlap")
Acked-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Florian Westphal <fw@strlen.de>
Blamed commits forgot that vxlan/geneve use udp_tunnel[6]_xmit_skb() which
call iptunnel_xmit_stats().
iptunnel_xmit_stats() was assuming tunnels were only using
NETDEV_PCPU_STAT_TSTATS.
@syncp offset in pcpu_sw_netstats and pcpu_dstats is different.
32bit kernels would either have corruptions or freezes if the syncp
sequence was overwritten.
This patch also moves pcpu_stat_type closer to dev->{t,d}stats to avoid
a potential cache line miss since iptunnel_xmit_stats() needs to read it.
Fixes: 6fa6de3022 ("geneve: Handle stats using NETDEV_PCPU_STAT_DSTATS.")
Fixes: be226352e8 ("vxlan: Handle stats using NETDEV_PCPU_STAT_DSTATS.")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Guillaume Nault <gnault@redhat.com>
Link: https://patch.msgid.link/20260311123110.1471930-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The current page_pool alloc-cache size and refill values were chosen to
match the NAPI budget and to leave headroom for XDP_DROP recycling.
These fixed values do not scale well with large pages,
as they significantly increase a given page_pool's memory footprint.
Scale these values to better balance memory footprint across page sizes,
while keeping behavior on 4KB-page systems unchanged.
Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Nimrod Oren <noren@nvidia.com>
Link: https://patch.msgid.link/20260309081301.103152-1-noren@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Blamed commit missed that both functions can be called with dev == NULL.
Also add unlikely() hints for these conditions that only fuzzers can hit.
Fixes: 6f1a9140ec ("net: add xmit recursion limit to tunnel xmit functions")
Signed-off-by: Eric Dumazet <edumazet@google.com>
CC: Weiming Shi <bestswngs@gmail.com>
Link: https://patch.msgid.link/20260312043908.2790803-1-edumazet@google.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Majority of tcp_release_cb() calls do nothing at all.
Provide tcp_release_cb_cond() helper so that release_sock()
can avoid these calls.
Also hint the compiler that __release_sock() and wake_up()
are rarely called.
$ scripts/bloat-o-meter -t vmlinux.old vmlinux.new
add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-77 (-77)
Function old new delta
release_sock 258 181 -77
Total: Before=25235790, After=25235713, chg -0.00%
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20260310124451.2280968-1-edumazet@google.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
When no H2G transport is loaded, vsock currently routes all CIDs to the
G2H transport (commit 65b422d9b6 ("vsock: forward all packets to the
host when no H2G is registered"). Extend that existing behavior: when
an H2G transport is loaded but does not claim a given CID, the
connection falls back to G2H in the same way.
This matters in environments like Nitro Enclaves, where an instance may
run nested VMs via vhost-vsock (H2G) while also needing to reach sibling
enclaves at higher CIDs through virtio-vsock-pci (G2H). With the old
code, any CID > 2 was unconditionally routed to H2G when vhost was
loaded, making those enclaves unreachable without setting
VMADDR_FLAG_TO_HOST explicitly on every connect.
Requiring every application to set VMADDR_FLAG_TO_HOST creates friction:
tools like socat, iperf, and others would all need to learn about it.
The flag was introduced 6 years ago and I am still not aware of any tool
that supports it. Even if there was support, it would be cumbersome to
use. The most natural experience is a single CID address space where H2G
only wins for CIDs it actually owns, and everything else falls through to
G2H, extending the behavior that already exists when H2G is absent.
To give user space at least a hint that the kernel applied this logic,
automatically set the VMADDR_FLAG_TO_HOST on the remote address so it
can determine the path taken via getpeername().
Add a per-network namespace sysctl net.vsock.g2h_fallback (default 1).
At 0 it forces strict routing: H2G always wins for CID > VMADDR_CID_HOST,
or ENODEV if H2G is not loaded.
Signed-off-by: Alexander Graf <graf@amazon.com>
Tested-by: syzbot@syzkaller.appspotmail.com
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Link: https://patch.msgid.link/20260304230027.59857-1-graf@amazon.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
net->xfrm.nlsk is used in 2 types of contexts:
- fully under RCU, with rcu_read_lock + rcu_dereference and a NULL check
- in the netlink handlers, with requests coming from a userspace socket
In the 2nd case, net->xfrm.nlsk is guaranteed to stay non-NULL and the
object is alive, since we can't enter the netns destruction path while
the user socket holds a reference on the netns.
After adding the __rcu annotation to netns_xfrm.nlsk (which silences
sparse warnings in the RCU users and __net_init code), we need to tell
sparse that the 2nd case is safe. Add a helper for that.
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
While testing other changes in vng I noticed that
nl_netdev.page_pool_check flakes. This never happens in real CI.
Turns out vng may boot and get to that test in less than a second.
page_pool_detached() records the detach time in seconds, so if
vng is fast enough detach time is set to 0. Other code treats
0 as "not detached". detach_time is only used to report the state
to the user, so it's not a huge deal in practice but let's fix it.
Store the raw ktime_t (nanoseconds) instead. A nanosecond value
of 0 is practically impossible.
Acked-by: Jesper Dangaard Brouer <hawk@kernel.org>
Fixes: 69cb4952b6 ("net: page_pool: report when page pool was destroyed")
Link: https://patch.msgid.link/20260310003907.3540019-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
With the current port selection algorithm, ports after a reserved port
range or long time used port are used more often than others [1]. This
causes an uneven port usage distribution. This combines with cloud
environments blocking connections between the application server and the
database server if there was a previous connection with the same source
port, leading to connectivity problems between applications on cloud
environments.
The real issue here is that these firewalls cannot cope with
standards-compliant port reuse. This is a workaround for such situations
and an improvement on the distribution of ports selected.
The proposed solution is to implement a variant of RFC 6056 Algorithm 5.
The step size is selected randomly on every connect() call ensuring it
is a coprime with respect to the size of the range of ports we want to
scan. This way, we can ensure that all ports within the range are
scanned before returning an error. To enable this algorithm, the user
must configure the new sysctl option "net.ipv4.ip_local_port_step_width".
In addition, on graphs generated we can observe that the distribution of
source ports is more even with the proposed approach. [2]
[1] https://0xffsoftware.com/port_graph_current_alg.html
[2] https://0xffsoftware.com/port_graph_random_step_alg.html
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Fernando Fernandez Mancera <fmancera@suse.de>
Link: https://patch.msgid.link/20260309023946.5473-2-fmancera@suse.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
As a part of MANA hardening for CVM, add validation for the doorbell
ID (db_id) received from hardware in the GDMA_REGISTER_DEVICE response
to prevent out-of-bounds memory access when calculating the doorbell
page address.
In mana_gd_ring_doorbell(), the doorbell page address is calculated as:
addr = db_page_base + db_page_size * db_index
= (bar0_va + db_page_off) + db_page_size * db_index
A hardware could return values that cause this address to fall outside
the BAR0 MMIO region. In Confidential VM environments, hardware responses
cannot be fully trusted.
Add the following validations:
- Store the BAR0 size (bar0_size) in gdma_context during probe.
- Validate the doorbell page offset (db_page_off) read from device
registers does not exceed bar0_size during initialization, converting
mana_gd_init_registers() to return an error code.
- Validate db_id from GDMA_REGISTER_DEVICE response against the
maximum number of doorbell pages that fit within BAR0.
Signed-off-by: Erni Sri Satya Vennela <ernis@linux.microsoft.com>
Link: https://patch.msgid.link/20260306211212.543376-1-ernis@linux.microsoft.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Tunnel xmit functions (iptunnel_xmit, ip6tunnel_xmit) lack their own
recursion limit. When a bond device in broadcast mode has GRE tap
interfaces as slaves, and those GRE tunnels route back through the
bond, multicast/broadcast traffic triggers infinite recursion between
bond_xmit_broadcast() and ip_tunnel_xmit()/ip6_tnl_xmit(), causing
kernel stack overflow.
The existing XMIT_RECURSION_LIMIT (8) in the no-qdisc path is not
sufficient because tunnel recursion involves route lookups and full IP
output, consuming much more stack per level. Use a lower limit of 4
(IP_TUNNEL_RECURSION_LIMIT) to prevent overflow.
Add recursion detection using dev_xmit_recursion helpers directly in
iptunnel_xmit() and ip6tunnel_xmit() to cover all IPv4/IPv6 tunnel
paths including UDP encapsulated tunnels (VXLAN, Geneve, etc.).
Move dev_xmit_recursion helpers from net/core/dev.h to public header
include/linux/netdevice.h so they can be used by tunnel code.
BUG: KASAN: stack-out-of-bounds in blake2s.constprop.0+0xe7/0x160
Write of size 32 at addr ffff88810033fed0 by task kworker/0:1/11
Workqueue: mld mld_ifc_work
Call Trace:
<TASK>
__build_flow_key.constprop.0 (net/ipv4/route.c:515)
ip_rt_update_pmtu (net/ipv4/route.c:1073)
iptunnel_xmit (net/ipv4/ip_tunnel_core.c:84)
ip_tunnel_xmit (net/ipv4/ip_tunnel.c:847)
gre_tap_xmit (net/ipv4/ip_gre.c:779)
dev_hard_start_xmit (net/core/dev.c:3887)
sch_direct_xmit (net/sched/sch_generic.c:347)
__dev_queue_xmit (net/core/dev.c:4802)
bond_dev_queue_xmit (drivers/net/bonding/bond_main.c:312)
bond_xmit_broadcast (drivers/net/bonding/bond_main.c:5279)
bond_start_xmit (drivers/net/bonding/bond_main.c:5530)
dev_hard_start_xmit (net/core/dev.c:3887)
__dev_queue_xmit (net/core/dev.c:4841)
ip_finish_output2 (net/ipv4/ip_output.c:237)
ip_output (net/ipv4/ip_output.c:438)
iptunnel_xmit (net/ipv4/ip_tunnel_core.c:86)
gre_tap_xmit (net/ipv4/ip_gre.c:779)
dev_hard_start_xmit (net/core/dev.c:3887)
sch_direct_xmit (net/sched/sch_generic.c:347)
__dev_queue_xmit (net/core/dev.c:4802)
bond_dev_queue_xmit (drivers/net/bonding/bond_main.c:312)
bond_xmit_broadcast (drivers/net/bonding/bond_main.c:5279)
bond_start_xmit (drivers/net/bonding/bond_main.c:5530)
dev_hard_start_xmit (net/core/dev.c:3887)
__dev_queue_xmit (net/core/dev.c:4841)
ip_finish_output2 (net/ipv4/ip_output.c:237)
ip_output (net/ipv4/ip_output.c:438)
iptunnel_xmit (net/ipv4/ip_tunnel_core.c:86)
ip_tunnel_xmit (net/ipv4/ip_tunnel.c:847)
gre_tap_xmit (net/ipv4/ip_gre.c:779)
dev_hard_start_xmit (net/core/dev.c:3887)
sch_direct_xmit (net/sched/sch_generic.c:347)
__dev_queue_xmit (net/core/dev.c:4802)
bond_dev_queue_xmit (drivers/net/bonding/bond_main.c:312)
bond_xmit_broadcast (drivers/net/bonding/bond_main.c:5279)
bond_start_xmit (drivers/net/bonding/bond_main.c:5530)
dev_hard_start_xmit (net/core/dev.c:3887)
__dev_queue_xmit (net/core/dev.c:4841)
mld_sendpack
mld_ifc_work
process_one_work
worker_thread
</TASK>
Fixes: 745e20f1b6 ("net: add a recursion limit in xmit path")
Reported-by: Xiang Mei <xmei5@asu.edu>
Signed-off-by: Weiming Shi <bestswngs@gmail.com>
Link: https://patch.msgid.link/20260306160133.3852900-2-bestswngs@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
tcp_chrono_start() is small enough, and used in TCP sendmsg()
fast path (from tcp_skb_entail()).
Note clang is already inlining it from functions in tcp_output.c.
Inlining it improves performance and reduces bloat :
$ scripts/bloat-o-meter -t vmlinux.old vmlinux.new
add/remove: 0/2 grow/shrink: 1/0 up/down: 1/-84 (-83)
Function old new delta
tcp_skb_entail 280 281 +1
__pfx_tcp_chrono_start 16 - -16
tcp_chrono_start 68 - -68
Total: Before=25192434, After=25192351, chg -0.00%
Note that tcp_chrono_stop() is too big.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Neal Cardwell <ncardwell@google.com>
Link: https://patch.msgid.link/20260308123549.2924460-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Some modern cpus disable X86_FEATURE_RETPOLINE feature,
even if a direct call can still be beneficial.
Even when IBRS is present, an indirect call is more expensive
than a direct one:
Direct Calls:
Compilers can perform powerful optimizations like inlining,
where the function body is directly inserted at the call site,
eliminating call overhead entirely.
Indirect Calls:
Inlining is much harder, if not impossible, because the compiler
doesn't know the target function at compile time.
Techniques like Indirect Call Promotion can help by using
profile-guided optimization to turn frequently taken indirect calls
into conditional direct calls, but they still add complexity
and potential overhead compared to a truly direct call.
In this patch, I split tc_skip_wrapper in two different
static keys, one for tc_act() (tc_skip_wrapper_act)
and one for tc_classify() (tc_skip_wrapper_cls).
Then I enable the tc_skip_wrapper_cls only if the count
of builtin classifiers is above one.
I enable tc_skip_wrapper_act only it the count of builtin
actions is above one.
In our production kernels, we only have CONFIG_NET_CLS_BPF=y
and CONFIG_NET_ACT_BPF=y. Other are modules or are not compiled.
Tested on AMD Turin cpus, cls_bpf_classify() cost went
from 1% down to 0.18 %, and FDO will be able to inline
it in tcf_classify() for further gains.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Reviewed-by: Pedro Tammela <pctammela@mojatatu.com>
Reviewed-by: Victor Nogueira <victor@mojatatu.com>
Link: https://patch.msgid.link/20260307133601.3863071-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Commit 18fd64d254 ("netns-ipv4: reorganize netns_ipv4 fast path
variables") missed that __tcp_select_window() is reading
net->ipv4.sysctl_tcp_shrink_window.
Move this field to netns_ipv4_read_txrx group, as __tcp_select_window()
is used both in tx and rx paths.
Saves a potential cache line miss.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20260307092214.2433548-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Following typical script is extremely disruptive,
because each graft operation calls dev_deactivate()
which resets all the queues of the device.
QPARAM="limit 100000 flow_limit 1000 buckets 4096"
TXQS=64
for ETH in eth1
do
tc qd del dev $ETH root 2>/dev/null
tc qd add dev $ETH root handle 1: mq
for i in `seq 1 $TXQS`
do
slot=$( printf %x $(( i )) )
tc qd add dev $ETH parent 1:$slot fq $QPARAM
done
done
One can add "ip link set dev $ETH down/up" to reduce the disruption time:
QPARAM="limit 100000 flow_limit 1000 buckets 4096"
TXQS=64
for ETH in eth1
do
ip link set dev $ETH down
tc qd del dev $ETH root 2>/dev/null
tc qd add dev $ETH root handle 1: mq
for i in `seq 1 $TXQS`
do
slot=$( printf %x $(( i )) )
tc qd add dev $ETH parent 1:$slot fq $QPARAM
done
ip link set dev $ETH up
done
Or we can add a @reset_needed flag to dev_deactivate() and
dev_deactivate_many().
This flag is set to true at device dismantle or linkwatch_do_dev(),
and to false for graft operations.
In the future, we might only stop one queue instead of the whole
device, ie call dev_deactivate_queue() instead of dev_deactivate().
I think the problem (quadratic behavior) was added in commit
2fb541c862 ("net: sch_generic: aviod concurrent reset and enqueue op
for lockless qdisc") but this does not look serious enough to deserve
risky backports.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Yunsheng Lin <linyunsheng@huawei.com>
Reviewed-by: Jamal Hadi Salim <jhs@mojatatu.com>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Reviewed-by: Victor Nogueira <victor@mojatatu.com>
Link: https://patch.msgid.link/20260307163430.470644-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
tcp_v4_early_demux() has a single caller : ip_rcv_finish_core().
Move it to net/ipv4/ip_input.c and mark it static, for possible
compiler/linker optimizations.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20260306131130.654991-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
inode->i_ino is being converted to a u64. sock.sk_ino (which caches the
inode number) must also be widened to avoid truncation on 32-bit
architectures where unsigned long is only 32 bits.
Change sk_ino from unsigned long to u64, and update the return type
of sock_i_ino() to match. Fix all format strings that print the
result of sock_i_ino() (%lu -> %llu), and widen the intermediate
variables and function parameters in the diag modules that were
using int to hold the inode number.
Note that the UAPI socket diag structures (inet_diag_msg.idiag_inode,
unix_diag_msg.udiag_ino, etc.) are all __u32 and cannot be changed
without breaking the ABI. The assignments to those fields will
silently truncate, which is the existing behavior.
Acked-by: Marc Kleine-Budde <mkl@pengutronix.de> # for net/can
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Link: https://patch.msgid.link/20260304-iino-u64-v3-3-2257ad83d372@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
inet_ehashfn() and inet6_ehashfn() initialise random secrets
on the first call by net_get_random_once().
While the init part is patched out using static keys, with
CONFIG_STACKPROTECTOR_STRONG=y, this causes a compiler to
generate a stack canary due to an automatic variable,
unsigned long ___flags, in the DO_ONCE() macro being passed
to __do_once_start().
With FDO, this is visible in __inet_lookup_established() and
__inet6_lookup_established() too.
Let's initialise the secrets by get_random_sleepable_once()
in the slow paths: inet_hash() for listen(), and
inet_hash_connect() and inet6_hash_connect() for connect().
Note that IPv6 listener will initialise both IPv4 & IPv6 secrets
in inet_hash() for IPv4-mapped IPv6 address.
With the patch, the stack size is reduced by 16 bytes (___flags
+ a stack canary) and NOPs for the static key go away.
Before: __inet6_lookup_established()
...
push %rbx
sub $0x38,%rsp # stack is 56 bytes
mov %edx,%ebx # sport
mov %gs:0x299419f(%rip),%rax # load stack canary
mov %rax,0x30(%rsp) and store it onto stack
mov 0x440(%rdi),%r15 # net->ipv4.tcp_death_row.hashinfo
nop
32: mov %r8d,%ebp # hnum
shl $0x10,%ebp # hnum << 16
nop
3d: mov 0x70(%rsp),%r14d # sdif
or %ebx,%ebp # INET_COMBINED_PORTS(sport, hnum)
mov 0x11a8382(%rip),%eax # inet6_ehashfn() ...
After: __inet6_lookup_established()
...
push %rbx
sub $0x28,%rsp # stack is 40 bytes
mov 0x60(%rsp),%ebp # sdif
mov %r8d,%r14d # hnum
shl $0x10,%r14d # hnum << 16
or %edx,%r14d # INET_COMBINED_PORTS(sport, hnum)
mov 0x440(%rdi),%rax # net->ipv4.tcp_death_row.hashinfo
mov 0x1194f09(%rip),%r10d # inet6_ehashfn() ...
Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260303235424.3877267-1-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
tcp_v6_early_demux() has a single caller : ip6_rcv_finish_core().
Move it to net/ipv6/ip6_input.c and mark it static, for possible
compiler/linker optimizations.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260304022706.1062459-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Cross-merge networking fixes after downstream PR (net-7.0-rc3).
No conflicts.
Adjacent changes:
net/netfilter/nft_set_rbtree.c
fb7fb40163 ("netfilter: nf_tables: clone set on flush only")
3aea466a43 ("netfilter: nft_set_rbtree: don't disable bh when acquiring tree lock")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The only user of frag_size field in XDP RxQ info is
bpf_xdp_frags_increase_tail(). It clearly expects whole buffer size instead
of DMA write size. Different assumptions in idpf driver configuration lead
to negative tailroom.
To make it worse, buffer sizes are not actually uniform in idpf when
splitq is enabled, as there are several buffer queues, so rxq->rx_buf_size
is meaningless in this case.
Use truesize of the first bufq in AF_XDP ZC, as there is only one. Disable
growing tail for regular splitq.
Fixes: ac8a861f63 ("idpf: prepare structures to support XDP")
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com>
Link: https://patch.msgid.link/20260305111253.2317394-8-larysa.zaremba@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
rxq->frag_size is basically a step between consecutive strictly aligned
frames. In ZC mode, chunk size fits exactly, but if chunks are unaligned,
there is no safe way to determine accessible space to grow tailroom.
Report frag_size to be zero, if chunks are unaligned, chunk_size otherwise.
Fixes: 24ea50127e ("xsk: support mbuf on ZC RX")
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com>
Link: https://patch.msgid.link/20260305111253.2317394-3-larysa.zaremba@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Yiming Qian reports Use-after-free in the pipapo set type:
Under a large number of expired elements, commit-time GC can run for a very
long time in a non-preemptible context, triggering soft lockup warnings and
RCU stall reports (local denial of service).
We must split GC in an unlink and a reclaim phase.
We cannot queue elements for freeing until pointers have been swapped.
Expired elements are still exposed to both the packet path and userspace
dumpers via the live copy of the data structure.
call_rcu() does not protect us: dump operations or element lookups starting
after call_rcu has fired can still observe the free'd element, unless the
commit phase has made enough progress to swap the clone and live pointers
before any new reader has picked up the old version.
This a similar approach as done recently for the rbtree backend in commit
35f83a7552 ("netfilter: nft_set_rbtree: don't gc elements on insert").
Fixes: 3c4287f620 ("nf_tables: Add set type for arbitrary concatenation of ranges")
Reported-by: Yiming Qian <yimingqian591@gmail.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Syzbot with fault injection triggered a failing memory allocation with
GFP_KERNEL which results in a WARN splat:
iter.err
WARNING: net/netfilter/nf_tables_api.c:845 at nft_map_deactivate+0x34e/0x3c0 net/netfilter/nf_tables_api.c:845, CPU#0: syz.0.17/5992
Modules linked in:
CPU: 0 UID: 0 PID: 5992 Comm: syz.0.17 Not tainted syzkaller #0 PREEMPT(full)
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 02/12/2026
RIP: 0010:nft_map_deactivate+0x34e/0x3c0 net/netfilter/nf_tables_api.c:845
Code: 8b 05 86 5a 4e 09 48 3b 84 24 a0 00 00 00 75 62 48 8d 65 d8 5b 41 5c 41 5d 41 5e 41 5f 5d c3 cc cc cc cc cc e8 63 6d fa f7 90 <0f> 0b 90 43
+80 7c 35 00 00 0f 85 23 fe ff ff e9 26 fe ff ff 89 d9
RSP: 0018:ffffc900045af780 EFLAGS: 00010293
RAX: ffffffff89ca45bd RBX: 00000000fffffff4 RCX: ffff888028111e40
RDX: 0000000000000000 RSI: 00000000fffffff4 RDI: 0000000000000000
RBP: ffffc900045af870 R08: 0000000000400dc0 R09: 00000000ffffffff
R10: dffffc0000000000 R11: fffffbfff1d141db R12: ffffc900045af7e0
R13: 1ffff920008b5f24 R14: dffffc0000000000 R15: ffffc900045af920
FS: 000055557a6a5500(0000) GS:ffff888125496000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fb5ea271fc0 CR3: 000000003269e000 CR4: 00000000003526f0
Call Trace:
<TASK>
__nft_release_table+0xceb/0x11f0 net/netfilter/nf_tables_api.c:12115
nft_rcv_nl_event+0xc25/0xdb0 net/netfilter/nf_tables_api.c:12187
notifier_call_chain+0x19d/0x3a0 kernel/notifier.c:85
blocking_notifier_call_chain+0x6a/0x90 kernel/notifier.c:380
netlink_release+0x123b/0x1ad0 net/netlink/af_netlink.c:761
__sock_release net/socket.c:662 [inline]
sock_close+0xc3/0x240 net/socket.c:1455
Restrict set clone to the flush set command in the preparation phase.
Add NFT_ITER_UPDATE_CLONE and use it for this purpose, update the rbtree
and pipapo backends to only clone the set when this iteration type is
used.
As for the existing NFT_ITER_UPDATE type, update the pipapo backend to
use the existing set clone if available, otherwise use the existing set
representation. After this update, there is no need to clone a set that
is being deleted, this includes bound anonymous set.
An alternative approach to NFT_ITER_UPDATE_CLONE is to add a .clone
interface and call it from the flush set path.
Reported-by: syzbot+4924a0edc148e8b4b342@syzkaller.appspotmail.com
Fixes: 3f1d886cc7 ("netfilter: nft_set_pipapo: move cloning of match info to insert/removal path")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Florian Westphal <fw@strlen.de>
-----BEGIN PGP SIGNATURE-----
iQJdBAABCABHFiEEgKkgxbID4Gn1hq6fcJGo2a1f9gAFAmmoGSwbFIAAAAAABAAO
bWFudTIsMi41KzEuMTEsMiwyDRxmd0BzdHJsZW4uZGUACgkQcJGo2a1f9gAPTw/+
O9OR5n1v7C2qlOTg9dDKEvSlCceg2bqNndplrVyPb7+NlbGbhQJyzuIHh/7jvVpo
VNLtEYl6wYAuRRux/I3eFc7KV1hEtqXjV0Asi0C0HMVUcig+/9Wh4CMt6LnBJ7Xp
GksxXtwqGBewfT1jiu/hxnsgjNRzGDWMf+23QgLTHnch6H456kySUetlaWq96SLR
AhZKSeb3dinh9YHKC50RoPzKaPtf9HQWDM7vlX8Q1hu6bAHfP14xW4CRqFq8JGYi
hEWd/E5oIDJbPO7gAIuwq5GBnmfw/oiblfQBdYBN2MkmzN7CvYBnleL/N7ZXhnkH
4sBFJQCLBNGu/v5aD+lAjAjq7YJUs5jrSmGghsrORkMe2hEf4IwbFmEoisSz9ycO
snJPX8LHoud1Ah5sDQdj0zYRD/iDkd2kLqiFMGgddJeZ+7RlNZm4rgJWIjXE2lLi
0RXjUgJtJobrhmrCethsB/AFts5XrEVCWpRPlfEAx/yFiuG3x2IsxgFJGpBSfPBQ
o1Opl9YRkMM2FmfKC/NeLA+lkRUl94PV330khCqHOupVGc5JCzKWC7o8ndp3hB/Y
8+4wUziUMf60YVW2fo6wNu1gOkNV1RH5/yZkdVzTq7mxrPkwK+NCy+KQh7OOdyVT
YV5WdqRUh6Kp6AvU7TJaa2FXjlVXB58i9GrgnoQz5YM=
=6PUL
-----END PGP SIGNATURE-----
Merge tag 'nf-next-26-03-04' of https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next
Florian Westphal says:
====================
netfilter: updates for net-next
The following patchset contains Netfilter updates for *net-next*,
including changes to IPv6 stack and updates to IPVS from Julian Anastasov.
1) ipv6: export fib6_lookup for nft_fib_ipv6 module
2) factor out ipv6_anycast_destination logic so its usable without
dst_entry. These are dependencies for patch 3.
3) switch nft_fib_ipv6 module to no longer need temporary dst_entry
object allocations by using fib6_lookup() + RCU.
This gets us ~13% higher packet rate in my tests.
Patches 4 to 8, from Eric Dumazet, zap sk_callback_lock usage in
netfilter. Patch 9 removes another sk_callback_lock instance.
Remaining patches, from Julian Anastasov, improve IPVS, Quoting Julian:
* Add infrastructure for resizable hash tables based on hlist_bl.
* Change the 256-bucket service hash table to be resizable.
* Change the global connection table to be per-net and resizable.
* Make connection hashing more secure for setups with multiple services.
netfilter pull request nf-next-26-03-04
* tag 'nf-next-26-03-04' of https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next:
ipvs: use more keys for connection hashing
ipvs: switch to per-net connection table
ipvs: use resizable hash table for services
ipvs: add resizable hash tables
rculist_bl: add hlist_bl_for_each_entry_continue_rcu
netfilter: nfnetlink_queue: remove locking in nfqnl_get_sk_secctx
netfilter: nfnetlink_queue: no longer acquire sk_callback_lock
netfilter: nfnetlink_log: no longer acquire sk_callback_lock
netfilter: nft_meta: no longer acquire sk_callback_lock in nft_meta_get_eval_skugid()
netfilter: xt_owner: no longer acquire sk_callback_lock in mt_owner()
netfilter: nf_log_syslog: no longer acquire sk_callback_lock in nf_log_dump_sk_uid_gid()
netfilter: nft_fib_ipv6: switch to fib6_lookup
ipv6: make ipv6_anycast_destination logic usable without dst_entry
ipv6: export fib6_lookup for nft_fib_ipv6
====================
Link: https://patch.msgid.link/20260304114921.31042-1-fw@strlen.de
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Instead of using struct timespec64 in scm_timestamping_internal,
use ktime_t, saving 24 bytes in kernel stack.
This makes tcp_update_recv_tstamps() small enough to be inlined.
The ktime_t -> timespec64 conversions happen after socket lock
has been released in tcp_recvmsg(), and only if the application
requested them.
$ scripts/bloat-o-meter -t vmlinux.0 vmlinux
add/remove: 0/2 grow/shrink: 5/4 up/down: 146/-277 (-131)
Function old new delta
tcp_zerocopy_receive 2383 2425 +42
mptcp_recvmsg 1565 1607 +42
tcp_recvmsg_locked 3797 3823 +26
put_cmsg_scm_timestamping64 131 149 +18
put_cmsg_scm_timestamping 131 149 +18
__pfx_tcp_update_recv_tstamps 16 - -16
do_tcp_getsockopt 4024 4006 -18
tcp_recv_timestamp 474 430 -44
tcp_zc_handle_leftover 417 371 -46
__sock_recv_timestamp 1087 1031 -56
tcp_update_recv_tstamps 97 - -97
Total: Before=25223788, After=25223657, chg -0.00%
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
Link: https://patch.msgid.link/20260304012747.881644-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This reverts 28ee1b746f ("secure_seq: downgrade to per-host timestamp offsets")
tcp_tw_recycle went away in 2017.
Zhouyan Deng reported off-path TCP source port leakage via
SYN cookie side-channel that can be fixed in multiple ways.
One of them is to bring back TCP ports in TS offset randomization.
As a bonus, we perform a single siphash() computation
to provide both an ISN and a TS offset.
Fixes: 28ee1b746f ("secure_seq: downgrade to per-host timestamp offsets")
Reported-by: Zhouyan Deng <dengzhouyan_nwpu@163.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Acked-by: Florian Westphal <fw@strlen.de>
Link: https://patch.msgid.link/20260302205527.1982836-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When shrinking the number of real tx queues,
netif_set_real_num_tx_queues() calls qdisc_reset_all_tx_gt() to flush
qdiscs for queues which will no longer be used.
qdisc_reset_all_tx_gt() currently serializes qdisc_reset() with
qdisc_lock(). However, for lockless qdiscs, the dequeue path is
serialized by qdisc_run_begin/end() using qdisc->seqlock instead, so
qdisc_reset() can run concurrently with __qdisc_run() and free skbs
while they are still being dequeued, leading to UAF.
This can easily be reproduced on e.g. virtio-net by imposing heavy
traffic while frequently changing the number of queue pairs:
iperf3 -ub0 -c $peer -t 0 &
while :; do
ethtool -L eth0 combined 1
ethtool -L eth0 combined 2
done
With KASAN enabled, this leads to reports like:
BUG: KASAN: slab-use-after-free in __qdisc_run+0x133f/0x1760
...
Call Trace:
<TASK>
...
__qdisc_run+0x133f/0x1760
__dev_queue_xmit+0x248f/0x3550
ip_finish_output2+0xa42/0x2110
ip_output+0x1a7/0x410
ip_send_skb+0x2e6/0x480
udp_send_skb+0xb0a/0x1590
udp_sendmsg+0x13c9/0x1fc0
...
</TASK>
Allocated by task 1270 on cpu 5 at 44.558414s:
...
alloc_skb_with_frags+0x84/0x7c0
sock_alloc_send_pskb+0x69a/0x830
__ip_append_data+0x1b86/0x48c0
ip_make_skb+0x1e8/0x2b0
udp_sendmsg+0x13a6/0x1fc0
...
Freed by task 1306 on cpu 3 at 44.558445s:
...
kmem_cache_free+0x117/0x5e0
pfifo_fast_reset+0x14d/0x580
qdisc_reset+0x9e/0x5f0
netif_set_real_num_tx_queues+0x303/0x840
virtnet_set_channels+0x1bf/0x260 [virtio_net]
ethnl_set_channels+0x684/0xae0
ethnl_default_set_doit+0x31a/0x890
...
Serialize qdisc_reset_all_tx_gt() against the lockless dequeue path by
taking qdisc->seqlock for TCQ_F_NOLOCK qdiscs, matching the
serialization model already used by dev_reset_queue().
Additionally clear QDISC_STATE_NON_EMPTY after reset so the qdisc state
reflects an empty queue, avoiding needless re-scheduling.
Fixes: 6b3ba9146f ("net: sched: allow qdiscs to handle locking")
Signed-off-by: Koichiro Den <den@valinux.co.jp>
Link: https://patch.msgid.link/20260228145307.3955532-1-den@valinux.co.jp
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Instead of storing the @log at the beginning of rps_dev_flow_table
use 5 low order bits of the rps_tag_ptr to store the log of the size.
This removes a potential cache line miss (for light traffic).
This allows us to switch to one high-order allocation instead of vmalloc()
when CONFIG_RFS_ACCEL is not set.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20260302181432.1836150-8-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Remove rps_dev_flow_table_release() in favor of kvfree_rcu_mightsleep().
In the following pach, we will remove "u8 @log" field
and 'struct rps_dev_flow_table' size will be a power-of-two.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20260302181432.1836150-7-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Instead of storing the @mask at the beginning of rps_sock_flow_table,
use 5 low order bits of the rps_tag_ptr to store the log of the size.
This removes a potential cache line miss to fetch @mask.
More importantly, we can switch to vmalloc_huge() without wasting memory.
Tested with:
numactl --interleave=all bash -c "echo 4194304 >/proc/sys/net/core/rps_sock_flow_entries"
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20260302181432.1836150-5-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
In preparation of the following patch, abstract access
to the @mask field in 'struct rps_sock_flow_table'.
Also cleanup rps_sock_flow_sysctl() a bit :
- Rename orig_sock_table to o_sock_table.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20260302181432.1836150-4-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Removing rcu_head (and @mask in a following patch)
will allow a power-of-two allocation and thus high-order
allocation for better performance.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20260302181432.1836150-3-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Add a new rps_tag_ptr type to encode a pointer and a size
to a power-of-two table.
Three helpers are added converting an rps_tag_ptr to:
1) A log of the size.
2) A mask : (size - 1).
3) A pointer to the array.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20260302181432.1836150-2-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
udp_flow_src_port() and psp_write_headers() use ip_local_port_range.
ip_local_port_range is inclusive : all ports between min and max
can be used.
Before this patch, if ip_local_port_range was set to 40000-40001
40001 would not be used as a source port.
Use reciprocal_scale() to help code readability.
Not tagged for stable trees, as this change could break user
expectations.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20260302163933.1754393-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
- cfg80211/mac80211
- finished assoc frame encryption/EPPKE/802.1X-over-auth
(also hwsim)
- radar detection improvements
- 6 GHz incumbent signal detection APIs
- multi-link support for FILS, probe response
templates and client probling
- ath12k:
- monitor mode support on IPQ5332
- basic hwmon temperature reporting
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEpeA8sTs3M8SN2hR410qiO8sPaAAFAmmoGDYACgkQ10qiO8sP
aACl9BAAi4ezTR8jjvBQNjJ9EXJmamjVAitMlHulUaw0DVHnMAMALTgYGq0ZpIva
8EMiH/ksfxmYvu8qFYypYH2WcQAsg9DFuuo2Mcd4MwmJkOyQgme1mqaTpTDuHAWp
S+wZBgQQCrnhQkmmNUJmp8m4Edw4cYi94jcct0BRYvAMBdQo4hMctA/7Ja8+ttU5
Q2uhHVZjmNPR2OXBp31INp4vo7RK5AXUFI5l/7XX36o7zIudtqbJJ0GL+1UNeG3f
v4an+a0tiunacgZiuWeeL/U1t4cZ5WQiDV31FQPIBiiYQO5M76l7+cuikr3HLkG1
kdqGXs77blW32s7NF3MebswIV+dzmBF69HjwCxdsU0iWzp54y8I3Lgu/cN8O721a
2Pt6IGmcsOm9F9Lbrxn6UNHMjn6VQUYGg40NtbhHGwniheLX4Gi4MBjbgOdD3GJh
9h12h/2CRZcHjA6kg3tcdzluD09510IiWMbPaAtXr456CPJ+hBUJIutuXOszbA+7
d9eecObxoMtMqtesRLkhbyBMt7aNkWLYBvpSQVHaJktqt7c5NmKe0xXXdRHeIqKo
XpXsl2q/1NrmSj9lPyyte8LHWWXQ+TVWWujqaHFUJdMDT/IBscKk4ahxGoEBtHOR
KHRFCD2oRsyCnsI6tSJ3/IuU5AVmBIzd6wZlPYZUZI/PsWuMwIg=
=oNzs
-----END PGP SIGNATURE-----
Merge tag 'wireless-next-2026-03-04' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next
Johannes Berg says:
====================
Notable features this time:
- cfg80211/mac80211
- finished assoc frame encryption/EPPKE/802.1X-over-auth
(also hwsim)
- radar detection improvements
- 6 GHz incumbent signal detection APIs
- multi-link support for FILS, probe response
templates and client probling
- ath12k:
- monitor mode support on IPQ5332
- basic hwmon temperature reporting
* tag 'wireless-next-2026-03-04' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next: (38 commits)
wifi: UHR: define DPS/DBE/P-EDCA elements and fix size parsing
wifi: mac80211_hwsim: change hwsim_class to a const struct
wifi: mac80211: give the AP more time for EPPKE as well
wifi: ath12k: Remove the unused argument from the Rx data path
wifi: ath12k: Enable monitor mode support on IPQ5332
wifi: ath12k: Set up MLO after SSR
wifi: ath11k: Silence remoteproc probe deferral prints
wifi: cfg80211: support key installation on non-netdev wdevs
wifi: cfg80211: make cluster id an array
wifi: mac80211: update outdated comment
wifi: mac80211: Advertise IEEE 802.1X authentication support
wifi: mac80211: Add support for IEEE 802.1X authentication protocol in non-AP STA mode
wifi: cfg80211: add support for IEEE 802.1X Authentication Protocol
wifi: mac80211: Advertise EPPKE support based on driver capabilities
wifi: mac80211_hwsim: Advertise support for (Re)Association frame encryption
wifi: mac80211: Fix AAD/Nonce computation for management frames with MLO
wifi: rt2x00: use generic nvmem_cell_get
wifi: mac80211: fetch unsolicited probe response template by link ID
wifi: mac80211: fetch FILS discovery template by link ID
wifi: nl80211: don't allow DFS channels for NAN
...
====================
Link: https://patch.msgid.link/20260304113707.175181-3-johannes@sipsolutions.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Simon Kirby reported long time ago that IPVS connection hashing
based only on the client address/port (caddr, cport) as hash keys
is not suitable for setups that accept traffic on multiple virtual
IPs and ports. It can happen for multiple VIP:VPORT services, for
single or many fwmark service(s) that match multiple virtual IPs
and ports or even for passive FTP with peristence in DR/TUN mode
where we expect traffic on multiple ports for the virtual IP.
Fix it by adding virtual addresses and ports to the hash function.
This causes the traffic from NAT real servers to clients to use
second hashing for the in->out direction.
As result:
- the IN direction from client will use hash node hn0 where
the source/dest addresses and ports used by client will be used
as hash keys
- the OUT direction from NAT real servers will use hash node hn1
for the traffic from real server to client
- the persistence templates are hashed only with parameters based on
the IN direction, so they now will also use the virtual address,
port and fwmark from the service.
OLD:
- all methods: c_list node: proto, caddr:cport
- persistence templates: c_list node: proto, caddr_net:0
- persistence engine templates: c_list node: per-PE, PE-SIP uses jhash
NEW:
- all methods: hn0 node (dir 0): proto, caddr:cport -> vaddr:vport
- MASQ method: hn1 node (dir 1): proto, daddr:dport -> caddr:cport
- persistence templates: hn0 node (dir 0):
proto, caddr_net:0 -> vaddr:vport_or_0
proto, caddr_net:0 -> fwmark:0
- persistence engine templates: hn0 node (dir 0): as before
Also reorder the ip_vs_conn fields, so that hash nodes are on same
read-mostly cache line while write-mostly fields are on separate
cache line.
Reported-by: Simon Kirby <sim@hostway.ca>
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Florian Westphal <fw@strlen.de>
Use per-net resizable hash table for connections. The global table
is slow to walk when using many namespaces.
The table can be resized in the range of [256 - ip_vs_conn_tab_size].
Table is attached only while services are present. Resizing is done
by delayed work based on load (the number of connections).
Add a hash_key field into the connection to store the table ID in
the highest bit and the entry's hash value in the lowest bits. The
lowest part of the hash value is used as bucket ID, the remaining
part is used to filter the entries in the bucket before matching
the keys and as result, helps the lookup operation to access only
one cache line. By knowing the table ID and bucket ID for entry,
we can unlink it without calculating the hash value and doing
lookup by keys. We need only to validate the saved hash_key under
lock.
For better security switch from jhash to siphash for the default
connection hashing but the persistence engines may use their own
function. Keeping the hash table loaded with entries below the
size (12%) allows to avoid collision for 96+% of the conns.
ip_vs_conn_fill_cport() now will rehash the connection with proper
locking because unhash+hash is not safe for RCU readers.
To invalidate the templates setting just dport to 0xffff is enough,
no need to rehash them. As result, ip_vs_conn_unhash() is now
unused and removed.
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Florian Westphal <fw@strlen.de>
Make the hash table for services resizable in the bit range of 4-20.
Table is attached only while services are present. Resizing is done
by delayed work based on load (the number of hashed services).
Table grows when load increases 2+ times (above 12.5% with lfactor=-3)
and shrinks 8+ times when load decreases 16+ times (below 0.78%).
Switch to jhash hashing to reduce the collisions for multiple
services.
Add a hash_key field into the service to store the table ID in
the highest bit and the entry's hash value in the lowest bits. The
lowest part of the hash value is used as bucket ID, the remaining
part is used to filter the entries in the bucket before matching
the keys and as result, helps the lookup operation to access only
one cache line. By knowing the table ID and bucket ID for entry,
we can unlink it without calculating the hash value and doing
lookup by keys. We need only to validate the saved hash_key under
lock.
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Florian Westphal <fw@strlen.de>
Add infrastructure for resizable hash tables based on hlist_bl
which we will use in followup patches.
The tables allow RCU lookups during resizing, bucket modifications
are protected with per-bucket bit lock and additional custom locking,
the tables are resized when load reaches thresholds determined based
on load factor parameter.
Compared to other implementations we rely on:
* fast entry removal by using node unlinking without pre-lookup
* entry rehashing when hash key changes
* entries can contain multiple hash nodes
* custom locking depending on different contexts
* adjustable load factor to customize the grow/shrink process
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Florian Westphal <fw@strlen.de>
nft_fib_ipv6 uses ipv6_anycast_destination(), but upcoming patch removes
the dst_entry usage in favor of fib6_result.
Move the 'plen > 127' logic to a new helper and call it from the
existing one.
Signed-off-by: Florian Westphal <fw@strlen.de>
`struct sysctl_fib_multipath_hash_seed` contains two u32 fields
(user_seed and mp_seed), making it an 8-byte structure with a 4-byte
alignment requirement.
In `fib_multipath_hash_from_keys()`, the code evaluates the entire
struct atomically via `READ_ONCE()`:
mp_seed = READ_ONCE(net->ipv4.sysctl_fib_multipath_hash_seed).mp_seed;
While this silently works on GCC by falling back to unaligned regular
loads which the ARM64 kernel tolerates, it causes a fatal kernel panic
when compiled with Clang and LTO enabled.
Commit e35123d83e ("arm64: lto: Strengthen READ_ONCE() to acquire
when CONFIG_LTO=y") strengthens `READ_ONCE()` to use Load-Acquire
instructions (`ldar` / `ldapr`) to prevent compiler reordering bugs
under Clang LTO. Since the macro evaluates the full 8-byte struct,
Clang emits a 64-bit `ldar` instruction. ARM64 architecture strictly
requires `ldar` to be naturally aligned, thus executing it on a 4-byte
aligned address triggers a strict Alignment Fault (FSC = 0x21).
Fix the read side by moving the `READ_ONCE()` directly to the `u32`
member, which emits a safe 32-bit `ldar Wn`.
Furthermore, Eric Dumazet pointed out that `WRITE_ONCE()` on the entire
struct in `proc_fib_multipath_hash_set_seed()` is also flawed. Analysis
shows that Clang splits this 8-byte write into two separate 32-bit
`str` instructions. While this avoids an alignment fault, it destroys
atomicity and exposes a tear-write vulnerability. Fix this by
explicitly splitting the write into two 32-bit `WRITE_ONCE()`
operations.
Finally, add the missing `READ_ONCE()` when reading `user_seed` in
`proc_fib_multipath_hash_seed()` to ensure proper pairing and
concurrency safety.
Fixes: 4ee2a8cace ("net: ipv4: Add a sysctl to set multipath hash seed")
Signed-off-by: Yung Chih Su <yuuchihsu@gmail.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260302060247.7066-1-yuuchihsu@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The GF stats periodic query is used as mechanism to monitor HWC health
check. If this HWC command times out, it is a strong indication that
the device/SoC is in a faulty state and requires recovery.
Today, when a timeout is detected, the driver marks
hwc_timeout_occurred, clears cached stats, and stops rescheduling the
periodic work. However, the device itself is left in the same failing
state.
Extend the timeout handling path to trigger the existing MANA VF
recovery service by queueing a GDMA_EQE_HWC_RESET_REQUEST work item.
This is expected to initiate the appropriate recovery flow by suspende
resume first and if it fails then trigger a bus rescan.
This change is intentionally limited to HWC command timeouts and does
not trigger recovery for errors reported by the SoC as a normal command
response.
Signed-off-by: Dipayaan Roy <dipayanroy@linux.microsoft.com>
Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/aaFShvKnwR5FY8dH@linuxonhyperv3.guj3yctzbm1etfxqx2vob5hsef.xx.internal.cloudapp.net
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
bond_option_mode_set() already rejects mode changes that would make a
loaded XDP program incompatible via bond_xdp_check(). However,
bond_option_xmit_hash_policy_set() has no such guard.
For 802.3ad and balance-xor modes, bond_xdp_check() returns false when
xmit_hash_policy is vlan+srcmac, because the 802.1q payload is usually
absent due to hardware offload. This means a user can:
1. Attach a native XDP program to a bond in 802.3ad/balance-xor mode
with a compatible xmit_hash_policy (e.g. layer2+3).
2. Change xmit_hash_policy to vlan+srcmac while XDP remains loaded.
This leaves bond->xdp_prog set but bond_xdp_check() now returning false
for the same device. When the bond is later destroyed, dev_xdp_uninstall()
calls bond_xdp_set(dev, NULL, NULL) to remove the program, which hits
the bond_xdp_check() guard and returns -EOPNOTSUPP, triggering:
WARN_ON(dev_xdp_install(dev, mode, bpf_op, NULL, 0, NULL))
Fix this by rejecting xmit_hash_policy changes to vlan+srcmac when an
XDP program is loaded on a bond in 802.3ad or balance-xor mode.
commit 39a0876d59 ("net, bonding: Disallow vlan+srcmac with XDP")
introduced bond_xdp_check() which returns false for 802.3ad/balance-xor
modes when xmit_hash_policy is vlan+srcmac. The check was wired into
bond_xdp_set() to reject XDP attachment with an incompatible policy, but
the symmetric path -- preventing xmit_hash_policy from being changed to an
incompatible value after XDP is already loaded -- was left unguarded in
bond_option_xmit_hash_policy_set().
Note:
commit 094ee6017e ("bonding: check xdp prog when set bond mode")
later added a similar guard to bond_option_mode_set(), but
bond_option_xmit_hash_policy_set() remained unprotected.
Reported-by: syzbot+5a287bcdc08104bc3132@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/6995aff6.050a0220.2eeac1.014e.GAE@google.com/T/
Fixes: 39a0876d59 ("net, bonding: Disallow vlan+srcmac with XDP")
Signed-off-by: Jiayuan Chen <jiayuan.chen@shopee.com>
Link: https://patch.msgid.link/20260226080306.98766-2-jiayuan.chen@linux.dev
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Commit c92c81df93 ("net: dccp: fix kernel crash on module load")
added inet_hashinfo2_init_mod() for DCCP.
Commit 22d6c9eebf ("net: Unexport shared functions for DCCP.")
removed EXPORT_SYMBOL_GPL() it but forgot to remove the function
itself.
Let's remove inet_hashinfo2_init_mod().
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20260301063756.1581685-1-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
We will no longer hold RTNL for ipmr_rtm_route() to modify the
MFC hash table.
Only __dev_get_by_index() in rtm_to_ipmr_mfcc() is the RTNL
dependant, otherwise, we just need protection for mrt->mfc_hash
and mrt->mfc_cache_list.
Let's add a new mutex for ipmr_mfc_add(), ipmr_mfc_delete(),
and mroute_clean_tables() (setsockopt(MRT_FLUSH or MRT_DONE)).
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260228221800.1082070-15-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
We will no longer hold RTNL for ipmr_mfc_add() and ipmr_mfc_delete().
MFC entry can be loosely connected with VIF by its index for
mrt->vif_table[] (stored in mfc_parent), but the two tables are
not synchronised. i.e. Even if VIF 1 is removed, MFC for VIF 1
is not automatically removed.
The only field that the MFC/VIF interfaces share is
net->ipv[46].ipmr_seq, which is protected by RTNL.
Adding a new mutex for both just to protect a single field is overkill.
Let's convert the field to atomic_t.
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260228221800.1082070-14-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
net->ipv4.ipmr_notifier_ops and net->ipv4.ipmr_seq are used
only in net/ipv4/ipmr.c.
Let's move these definitions under CONFIG_IP_MROUTE.
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260228221800.1082070-13-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Currently key installation is only supported for netdev. For NAN,
support most key operations (except setting default data key) on
wdevs instead of netdevs, and adjust all the APIs and tracing to
match.
Since nothing currently sets NL80211_EXT_FEATURE_SECURE_NAN, this
doesn't change anything (P2P Device already isn't allowed.)
Signed-off-by: Avraham Stern <avraham.stern@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
Link: https://patch.msgid.link/20260107150057.69a0cfad95fa.I00efdf3b2c11efab82ef6ece9f393382bcf33ba8@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
cfg80211_nan_conf::cluster_id is currently a pointer, but there is no real
reason to not have it an array. It makes things easier as there is no
need to check the pointer validity each time.
If a cluster ID wasn't provided by user space it will be randomized.
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
Link: https://patch.msgid.link/20260302091108.2b12e4ccf5bb.Ib16bf5cca55463d4c89e18099cf1dfe4de95d405@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Per IEEE Std 802.11be-2024, 12.5.2.3.3, if the MPDU is an
individually addressed Data frame between an AP MLD and a
non-AP MLD associated with the AP MLD, then A1/A2/A3
will be MLD MAC addresses. Otherwise, Al/A2/A3 will be
over-the-air link MAC addresses.
Currently, during AAD and Nonce computation for software based
encryption/decryption cases, mac80211 directly uses the addresses it
receives in the skb frame header. However, after the first
authentication, management frame addresses for non-AP MLD stations
are translated to MLD addresses from over the air link addresses in
software. This means that the skb header could contain translated MLD
addresses, which when used as is, can lead to incorrect AAD/Nonce
computation.
In the following manner, ensure that the right set of addresses are used:
In the receive path, stash the pre-translated link addresses in
ieee80211_rx_data and use them for the AAD/Nonce computations
when required.
In the transmit path, offload the encryption for a CCMP/GCMP key
to the hwsim driver that can then ensure that encryption and hence
the AAD/Nonce computations are performed on the frame containing the
right set of addresses, i.e, MLD addresses if unicast data frame and
link addresses otherwise.
To do so, register the set key handler in hwsim driver so mac80211 is
aware that it is the driver that would take care of encrypting the
frame. Offload encryption for a CCMP/GCMP key, while keeping the
encryption for WEP/TKIP and MMIE generation for a AES_CMAC or a
AES_GMAC key still at the SW crypto in mac layer
Co-developed-by: Rohan Dutta <quic_drohan@quicinc.com>
Signed-off-by: Rohan Dutta <quic_drohan@quicinc.com>
Signed-off-by: Sai Pratyusha Magam <sai.magam@oss.qualcomm.com>
Link: https://patch.msgid.link/20260226042959.3766157-1-sai.magam@oss.qualcomm.com
[only store and apply link_addrs for unicast non-data
rather storing always and applying for !unicast_data]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Currently, the unsolicited probe response template is always fetched from
the default link of a virtual interface in both Multi-Link Operation (MLO)
and non-MLO cases. However, in the MLO case there is a need to fetch the
unsolicited probe response template from a specific link instead of the
default link.
Hence, add support for fetching the unsolicited probe response template
based on the link ID from the corresponding link data.
Signed-off-by: Sriram R <quic_srirrama@quicinc.com>
Co-developed-by: Raj Kumar Bhagat <raj.bhagat@oss.qualcomm.com>
Signed-off-by: Raj Kumar Bhagat <raj.bhagat@oss.qualcomm.com>
Link: https://patch.msgid.link/20260220-fils-prob-by-link-v1-2-a2746a853f75@oss.qualcomm.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Currently, the FILS discovery template is always fetched from the default
link of a virtual interface in both Multi-Link Operation (MLO) and
non-MLO cases. However, in the MLO case there is a need to fetch the FILS
discovery template from a specific link instead of the default link.
Hence, add support for fetching the FILS discovery template based on the
link ID from the corresponding link data.
Signed-off-by: Sriram R <quic_srirrama@quicinc.com>
Co-developed-by: Raj Kumar Bhagat <raj.bhagat@oss.qualcomm.com>
Signed-off-by: Raj Kumar Bhagat <raj.bhagat@oss.qualcomm.com>
Link: https://patch.msgid.link/20260220-fils-prob-by-link-v1-1-a2746a853f75@oss.qualcomm.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Currently, a station can only be added to a netdev interface,
mainly because there was no need for a station of a non-netdev
interface.
But for NAN, we will have stations that belong to the NL80211_IFTYPE_NAN
interface.
Prepare for adding/changing/deleting a station that belongs to a non-netdev
interface. This doesn't actually allow such stations - this will be done
in a different patch.
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
Link: https://patch.msgid.link/20260219114327.65c9cc96f814.Ic02066b88bb8ad6b21e15cbea8d720280008c83b@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
When any incumbent signal is detected by an AP/mesh interface operating
in 6 GHz band, FCC mandates the AP/mesh to vacate the channels affected
by it [1].
Add a new API cfg80211_incumbent_signal_notify() that can be used
by mac80211 or drivers to notify the higher layers about the signal
interference event with the interference bitmap in which each bit
denotes the affected 20 MHz in the operating channel.
Add support for the new nl80211 event and nl80211 attribute as well to
notify userspace on the details about the interference event. Userspace is
expected to process it and take further action - vacate the channel, or
reduce the bandwidth.
[1] - https://apps.fcc.gov/kdb/GetAttachment.html?id=nXQiRC%2B4mfiA54Zha%2BrW4Q%3D%3D&desc=987594%20D02%20U-NII%206%20GHz%20EMC%20Measurement%20v03&tracking_number=277034
Signed-off-by: Hari Chandrakanthan <quic_haric@quicinc.com>
Signed-off-by: Amith A <amith.a@oss.qualcomm.com>
Link: https://patch.msgid.link/20260216032027.2310956-2-amith.a@oss.qualcomm.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Allow to track and check CAC state from user mode by
simple check phy channels eg. using iw phy1 channels
command.
This is done for regular CAC and background CAC.
It is important for background CAC while we can start
it from any app (eg. iw or hostapd).
Signed-off-by: Janusz Dziedzic <janusz.dziedzic@gmail.com>
Link: https://patch.msgid.link/20260206171830.553879-3-janusz.dziedzic@gmail.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
DualPI2 drops packets during dequeue but was using kfree_skb_reason()
directly, bypassing trace_qdisc_drop. Convert to qdisc_dequeue_drop()
and add QDISC_DROP_L4S_STEP_NON_ECN to the qdisc drop reason enum.
- Set TCQ_F_DEQUEUE_DROPS flag in dualpi2_init()
- Use enum qdisc_drop_reason in drop_and_retry()
- Replace kfree_skb_reason() with qdisc_dequeue_drop()
Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://patch.msgid.link/177211351978.3011628.11267023360997620069.stgit@firesoul
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Rename QDISC_DROP_CAKE_FLOOD to QDISC_DROP_FLOOD_PROTECTION to use a
generic name without embedding the qdisc name. This follows the
principle that drop reasons should describe the drop mechanism rather
than being tied to a specific qdisc implementation.
The flood protection drop reason is used by qdiscs implementing
probabilistic drop algorithms (like BLUE) that detect unresponsive
flows indicating potential DoS or flood attacks. CAKE uses this via
its Cobalt AQM component.
Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://patch.msgid.link/177211347537.3011628.13759059534638729639.stgit@firesoul
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Rename FQ-specific drop reasons to generic names:
- QDISC_DROP_FQ_BAND_LIMIT -> QDISC_DROP_BAND_LIMIT
- QDISC_DROP_FQ_HORIZON_LIMIT -> QDISC_DROP_HORIZON_LIMIT
This follows the principle that drop reasons should describe the drop
mechanism rather than being tied to a specific qdisc implementation.
These concepts (priority band limits, timestamp horizon) could apply
to other qdiscs as well.
Remove the local macro define FQDR() and instead use the
full QDISC_DROP_* name to make it easier to navigate code.
Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://patch.msgid.link/177211346902.3011628.12523261489552097455.stgit@firesoul
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Convert SFQ to use the new qdisc-specific drop reason infrastructure.
This patch demonstrates how to convert a flow-based qdisc to use the
new enum qdisc_drop_reason. As part of this conversion:
- Add QDISC_DROP_MAXFLOWS for flow table exhaustion
- Rename FQ_FLOW_LIMIT to generic FLOW_LIMIT, now shared by FQ and SFQ
- Use QDISC_DROP_OVERLIMIT for sfq_drop() when overall limit exceeded
- Use QDISC_DROP_FLOW_LIMIT for per-flow depth limit exceeded
The FLOW_LIMIT reason is now a common drop reason for per-flow limits,
applicable to both FQ and SFQ qdiscs.
Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://patch.msgid.link/177211345946.3011628.12770616071857185664.stgit@firesoul
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Create new enum qdisc_drop_reason and trace_qdisc_drop tracepoint
for qdisc layer drop diagnostics with direct qdisc context visibility.
The new tracepoint includes qdisc handle, parent, kind (name), and
device information. Existing SKB_DROP_REASON_QDISC_DROP is retained
for backwards compatibility via kfree_skb_reason().
Convert qdiscs with drop reasons to use the new infrastructure.
Change CAKE's cobalt_should_drop() return type from enum skb_drop_reason
to enum qdisc_drop_reason to fix implicit enum conversion warnings.
Use QDISC_DROP_UNSPEC as the 'not dropped' sentinel instead of
SKB_NOT_DROPPED_YET. Both have the same compiled value (0), so the
comparison logic remains semantically equivalent.
Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://patch.msgid.link/177211345275.3011628.1974310302645218067.stgit@firesoul
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
After commit b692bf9a75 ("xsk: Get rid of xdp_buff_xsk::xskb_list_node"),
the list_node field is reused for both the xskb pool list and the buffer
free list, this causes a buffer leak as described below.
xp_free() checks if a buffer is already on the free list using
list_empty(&xskb->list_node). When list_del() is used to remove a node
from the xskb pool list, it doesn't reinitialize the node pointers.
This means list_empty() will return false even after the node has been
removed, causing xp_free() to incorrectly skip adding the buffer to the
free list.
Fix this by using list_del_init() instead of list_del() in all fragment
handling paths, this ensures the list node is reinitialized after removal,
allowing the list_empty() to work correctly.
Fixes: b692bf9a75 ("xsk: Get rid of xdp_buff_xsk::xskb_list_node")
Acked-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Signed-off-by: Nikhil P. Rao <nikhil.rao@amd.com>
Link: https://patch.msgid.link/20260225000456.107806-2-nikhil.rao@amd.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
As Paolo said earlier [1]:
"Since the blamed commit below, classify can return TC_ACT_CONSUMED while
the current skb being held by the defragmentation engine. As reported by
GangMin Kim, if such packet is that may cause a UaF when the defrag engine
later on tries to tuch again such packet."
act_ct was never meant to be used in the egress path, however some users
are attaching it to egress today [2]. Attempting to reach a middle
ground, we noticed that, while most qdiscs are not handling
TC_ACT_CONSUMED, clsact/ingress qdiscs are. With that in mind, we
address the issue by only allowing act_ct to bind to clsact/ingress
qdiscs and shared blocks. That way it's still possible to attach act_ct to
egress (albeit only with clsact).
[1] https://lore.kernel.org/netdev/674b8cbfc385c6f37fb29a1de08d8fe5c2b0fbee.1771321118.git.pabeni@redhat.com/
[2] https://lore.kernel.org/netdev/cc6bfb4a-4a2b-42d8-b9ce-7ef6644fb22b@ovn.org/
Reported-by: GangMin Kim <km.kim1503@gmail.com>
Fixes: 3f14b377d0 ("net/sched: act_ct: fix skb leak and crash on ooo frags")
CC: stable@vger.kernel.org
Signed-off-by: Victor Nogueira <victor@mojatatu.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Link: https://patch.msgid.link/20260225134349.1287037-1-victor@mojatatu.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
UDP/TCP lookups are using RCU, thus isk->inet_num accesses
should use READ_ONCE() and WRITE_ONCE() where needed.
Fixes: 3ab5aee7fe ("net: Convert TCP & DCCP hash tables to use RCU / hlist_nulls")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20260225203545.1512417-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The gate action can be replaced while the hrtimer callback or dump path is
walking the schedule list.
Convert the parameters to an RCU-protected snapshot and swap updates under
tcf_lock, freeing the previous snapshot via call_rcu(). When REPLACE omits
the entry list, preserve the existing schedule so the effective state is
unchanged.
Fixes: a51c328df3 ("net: qos: introduce a gate control flow action")
Cc: stable@vger.kernel.org
Signed-off-by: Paul Moses <p@1g4.org>
Tested-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Reviewed-by: Victor Nogueira <victor@mojatatu.com>
Link: https://patch.msgid.link/20260223150512.2251594-2-p@1g4.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Now that the pp fields in net_iov have no users, remove them from
net_iov and clean up.
Signed-off-by: Byungchul Park <byungchul@sk.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://patch.msgid.link/20260224061424.11219-1-byungchul@sk.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Cross-merge networking fixes after downstream PR (net-7.0-rc2).
Conflicts:
tools/testing/selftests/drivers/net/hw/rss_ctx.py
19c3a2a81d ("selftests: drv-net: rss: Generate unique ports for RSS context tests")
ce5a0f4612 ("selftests: drv-net: rss_ctx: test RSS contexts persist after ifdown/up")
include/net/inet_connection_sock.h
858d2a4f67 ("tcp: fix potential race in tcp_v6_syn_recv_sock()")
fcd3d039fa ("tcp: make tcp_v{4,6}_send_check() static")
https://lore.kernel.org/aZ8PSFLzBrEU3I89@sirena.org.uk
drivers/net/ethernet/mellanox/mlx5/core/en/xsk/setup.c
drivers/net/ethernet/mellanox/mlx5/core/en/xsk/pool.c
69050f8d6d ("treewide: Replace kmalloc with kmalloc_obj for non-scalar types")
bf4afc53b7 ("Convert 'alloc_obj' family to use the new default GFP_KERNEL argument")
8a96b9144f ("net/mlx5e: Alloc xsk channel param out of mlx5e_open_xsk()")
Adjacent changes:
net/netfilter/ipvs/ip_vs_ctl.c
c59bd9e62e ("ipvs: use more counters to avoid service lookups")
bf4afc53b7 ("Convert 'alloc_obj' family to use the new default GFP_KERNEL argument")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Two administrator processes may race when setting child_ns_mode as one
process sets child_ns_mode to "local" and then creates a namespace, but
another process changes child_ns_mode to "global" between the write and
the namespace creation. The first process ends up with a namespace in
"global" mode instead of "local". While this can be detected after the
fact by reading ns_mode and retrying, it is fragile and error-prone.
Make child_ns_mode write-once so that a namespace manager can set it
once and be sure it won't change. Writing a different value after the
first write returns -EBUSY. This applies to all namespaces, including
init_net, where an init process can write "local" to lock all future
namespaces into local mode.
Fixes: eafb64f40c ("vsock: add netns to vsock core")
Suggested-by: Daan De Meyer <daan.j.demeyer@gmail.com>
Suggested-by: Stefano Garzarella <sgarzare@redhat.com>
Co-developed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Bobby Eshleman <bobbyeshleman@meta.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Link: https://patch.msgid.link/20260223-vsock-ns-write-once-v3-2-c0cde6959923@meta.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
This facility was disabled in commit
9e539c5b6d ("netfilter: nf_tables: disable expression reduction infra"),
because not all nft_exprs guarantee they will update the destination
register: some may set NFT_BREAK instead to cancel evaluation of the
rule.
This has been dead code ever since.
There are no plans to salvage this at this time, so remove this.
Signed-off-by: Florian Westphal <fw@strlen.de>
Link: https://patch.msgid.link/20260224205048.4718-10-fw@strlen.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Change the no_cport counters to be per-net and address family.
This should reduce the extra conn lookups done during present
NO_CPORT connections.
By changing from global to per-net dropentry counters, one net
will not affect the drop rate of another net.
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Florian Westphal <fw@strlen.de>
Link: https://patch.msgid.link/20260224205048.4718-7-fw@strlen.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When new connection is created we can lookup for services multiple
times to support fallback options. We already have some counters
to skip specific lookups because it costs CPU cycles for hash
calculation, etc.
Add more counters for fwmark/non-fwmark services (fwm_services and
nonfwm_services) and make all counters per address family.
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Florian Westphal <fw@strlen.de>
Link: https://patch.msgid.link/20260224205048.4718-6-fw@strlen.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
fwmark based services and non-fwmark based services can be hashed
in same service table. This reduces the burden of working with two
tables.
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Florian Westphal <fw@strlen.de>
Link: https://patch.msgid.link/20260224205048.4718-4-fw@strlen.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Current ipvs uses one global mutex "__ip_vs_mutex" to keep the global
"ip_vs_svc_table" and "ip_vs_svc_fwm_table" safe. But when there are
tens of thousands of services from different netns in the table, it
takes a long time to look up the table, for example, using "ipvsadm
-ln" from different netns simultaneously.
We make "ip_vs_svc_table" and "ip_vs_svc_fwm_table" per netns, and we
add "service_mutex" per netns to keep these two tables safe instead of
the global "__ip_vs_mutex" in current version. To this end, looking up
services from different netns simultaneously will not get stuck,
shortening the time consumption in large-scale deployment. It can be
reproduced using the simple scripts below.
init.sh: #!/bin/bash
for((i=1;i<=4;i++));do
ip netns add ns$i
ip netns exec ns$i ip link set dev lo up
ip netns exec ns$i sh add-services.sh
done
add-services.sh: #!/bin/bash
for((i=0;i<30000;i++)); do
ipvsadm -A -t 10.10.10.10:$((80+$i)) -s rr
done
runtest.sh: #!/bin/bash
for((i=1;i<4;i++));do
ip netns exec ns$i ipvsadm -ln > /dev/null &
done
ip netns exec ns4 ipvsadm -ln > /dev/null
Run "sh init.sh" to initiate the network environment. Then run "time
./runtest.sh" to evaluate the time consumption. Our testbed is a 4-core
Intel Xeon ECS. The result of the original version is around 8 seconds,
while the result of the modified version is only 0.8 seconds.
Signed-off-by: Jiejian Wu <jiejian@linux.alibaba.com>
Co-developed-by: Dust Li <dust.li@linux.alibaba.com>
Signed-off-by: Dust Li <dust.li@linux.alibaba.com>
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Florian Westphal <fw@strlen.de>
Link: https://patch.msgid.link/20260224205048.4718-2-fw@strlen.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This function is only called from tcp_v6_mtu_reduced() and can be
(auto)inlined by the compiler.
Note that inet6_csk_route_socket() is no longer (auto)inlined,
which is a good thing as it is slow path.
$ scripts/bloat-o-meter -t vmlinux.0 vmlinux.1
add/remove: 0/2 grow/shrink: 2/0 up/down: 93/-129 (-36)
Function old new delta
tcp_v6_mtu_reduced 139 228 +89
inet6_csk_route_socket 486 490 +4
__pfx_inet6_csk_update_pmtu 16 - -16
inet6_csk_update_pmtu 113 - -113
Total: Before=25076512, After=25076476, chg -0.00%
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20260223153047.886683-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
tcp_v{4,6}_send_check() are only called from tcp_output.c
and should be made static so that the compiler does not need
to put an out of line copy of them.
Remove (struct inet_connection_sock_af_ops) send_check field
and use instead @net_header_len.
Move @net_header_len close to @queue_xmit for data locality
as both are used in TCP tx fast path.
$ scripts/bloat-o-meter -t vmlinux.2 vmlinux.3
add/remove: 0/2 grow/shrink: 0/3 up/down: 0/-172 (-172)
Function old new delta
__tcp_transmit_skb 3426 3423 -3
tcp_v4_send_check 136 132 -4
mptcp_subflow_init 777 763 -14
__pfx_tcp_v6_send_check 16 - -16
tcp_v6_send_check 135 - -135
Total: Before=25143196, After=25143024, chg -0.00%
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20260223100729.3761597-4-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Inline __tcp_v4_send_check(), like __tcp_v6_send_check().
Move tcp_v4_send_check() to tcp_output.c close to
its fast path caller (__tcp_transmit_skb()).
Note __tcp_v4_send_check() is still out-of-line for tcp4_gso_segment()
because it is called in an unlikely() section.
$ scripts/bloat-o-meter -t vmlinux.0 vmlinux.1
add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-9 (-9)
Function old new delta
__tcp_v4_send_check 130 121 -9
Total: Before=25143100, After=25143091, chg -0.00%
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20260223100729.3761597-2-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This function has a single caller in net/ipv6/udp.c.
Move it there so that the compiler can decide to (auto)inline
it if he prefers to. IBT glue is removed anyway.
With clang, we can see it was able to inline it and also
inlined one other helper at the same time.
UDPLITE removal will also help.
$ scripts/bloat-o-meter -t vmlinux.old vmlinux.new
add/remove: 0/2 grow/shrink: 1/0 up/down: 840/-785 (55)
Function old new delta
__udp6_lib_rcv 1247 2087 +840
__pfx_udp6_csum_init 16 - -16
udp6_csum_init 769 - -769
Total: Before=25074399, After=25074454, chg +0.00%
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20260223093445.3696368-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
After commit 6511882cdd ("mptcp: allocate fwd memory separately
on the rx and tx path") __lock_sock() can be static again.
Make sure __lock_sock() is not inlined, so that lock_sock_nested()
no longer needs a stack canary.
Add a noinline attribute on lock_sock_nested() so that calls
to lock_sock() from net/core/sock.c are not inlined,
none of them are fast path to deserve that:
- sockopt_lock_sock()
- sock_set_reuseport()
- sock_set_reuseaddr()
- sock_set_mark()
- sock_set_keepalive()
- sock_no_linger()
- sock_bindtoindex()
- sk_wait_data()
- sock_set_rcvbuf()
$ scripts/bloat-o-meter -t vmlinux.old vmlinux
add/remove: 0/0 grow/shrink: 0/3 up/down: 0/-312 (-312)
Function old new delta
__lock_sock 192 188 -4
__lock_sock_fast 239 86 -153
lock_sock_nested 227 72 -155
Total: Before=24888707, After=24888395, chg -0.00%
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20260223092716.3673939-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
- purge error queues in socket destructors
- hci_sync: Fix CIS host feature condition
- L2CAP: Fix invalid response to L2CAP_ECRED_RECONF_REQ
- L2CAP: Fix result of L2CAP_ECRED_CONN_RSP when MTU is too short
- L2CAP: Fix response to L2CAP_ECRED_CONN_REQ
- L2CAP: Fix not checking output MTU is acceptable on L2CAP_ECRED_CONN_REQ
- L2CAP: Fix missing key size check for L2CAP_LE_CONN_REQ
- hci_qca: Cleanup on all setup failures
-----BEGIN PGP SIGNATURE-----
iQJNBAABCgA3FiEE7E6oRXp8w05ovYr/9JCA4xAyCykFAmmcw1EZHGx1aXoudm9u
LmRlbnR6QGludGVsLmNvbQAKCRD0kIDjEDILKUTyD/4jtQwDrveC19zamF5n7lFY
Oils6eftANcLFzLwTrMqGO7IxESga4qdNOf2vc/UgVSUfNqsPIUJ5El+LzpXZXAa
sYBP/KudEX53CfU3fEVyPTUaWkZ4CdMRZeiCmgXqW7GxYbGw92SFuaSIHAP6Ep4s
Z7Ryd1H0xhX9QPMc4g4IgoMiBiKzNs4GtlLSbDJcivAtbC/34nkMOxK9g+1DbU0F
qzW+oPfYCpPzXTf20I1QIAMt5smnSM3Tuvo9u2pZRuEGpKjENxeY4hdAejfjeKA6
RLWXm6JvMP2lUBT68plMQQdYyQ8DxG75sVjgSoQYIu2YTVnsX76t/kD2hhiHXH/Y
nQoy4dtA1/5V7Ka0cfMhcvino4Rb9Gh3dsFKJOuWRT+aTY+gNhpyr56SuJh24Y3C
7tUeEDI4fBkJGaRAbreVbaI5vw4kbSfi7IDOM/ccWDSLaG8HGaLOtn0IU8q4AgMa
IkYzB5zwtiyM/zaSTO1k0HkpjR0wwftnTd+Fj2mUWdTwSeek64R9enmKYmg5UJrv
14yhfLHFsbAQo+o1B3ZslnCdYQJpgFmyAInV6Jpunc78IE9+g/YA55K22JbDDSzI
t9Zy25OWLyYZyuD1PzDkMlYU5OARNYeyRXbJ3w037LrpqRoEuFsK0qTmgi+kR9C7
VR9IpCqgf4SJbL7ge83H8g==
=JBaa
-----END PGP SIGNATURE-----
Merge tag 'for-net-2026-02-23' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth
Luiz Augusto von Dentz says:
====================
bluetooth pull request for net:
- purge error queues in socket destructors
- hci_sync: Fix CIS host feature condition
- L2CAP: Fix invalid response to L2CAP_ECRED_RECONF_REQ
- L2CAP: Fix result of L2CAP_ECRED_CONN_RSP when MTU is too short
- L2CAP: Fix response to L2CAP_ECRED_CONN_REQ
- L2CAP: Fix not checking output MTU is acceptable on L2CAP_ECRED_CONN_REQ
- L2CAP: Fix missing key size check for L2CAP_LE_CONN_REQ
- hci_qca: Cleanup on all setup failures
* tag 'for-net-2026-02-23' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth:
Bluetooth: L2CAP: Fix missing key size check for L2CAP_LE_CONN_REQ
Bluetooth: L2CAP: Fix not checking output MTU is acceptable on L2CAP_ECRED_CONN_REQ
Bluetooth: Fix CIS host feature condition
Bluetooth: L2CAP: Fix response to L2CAP_ECRED_CONN_REQ
Bluetooth: hci_qca: Cleanup on all setup failures
Bluetooth: purge error queues in socket destructors
Bluetooth: L2CAP: Fix result of L2CAP_ECRED_CONN_RSP when MTU is too short
Bluetooth: L2CAP: Fix invalid response to L2CAP_ECRED_RECONF_REQ
====================
Link: https://patch.msgid.link/20260223211634.3800315-1-luiz.dentz@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
skb_may_tx_timestamp() may acquire sock::sk_callback_lock. The lock must
not be taken in IRQ context, only softirq is okay. A few drivers receive
the timestamp via a dedicated interrupt and complete the TX timestamp
from that handler. This will lead to a deadlock if the lock is already
write-locked on the same CPU.
Taking the lock can be avoided. The socket (pointed by the skb) will
remain valid until the skb is released. The ->sk_socket and ->file
member will be set to NULL once the user closes the socket which may
happen before the timestamp arrives.
If we happen to observe the pointer while the socket is closing but
before the pointer is set to NULL then we may use it because both
pointer (and the file's cred member) are RCU freed.
Drop the lock. Use READ_ONCE() to obtain the individual pointer. Add a
matching WRITE_ONCE() where the pointer are cleared.
Link: https://lore.kernel.org/all/20260205145104.iWinkXHv@linutronix.de
Fixes: b245be1f4d ("net-timestamp: no-payload only sysctl")
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260220183858.N4ERjFW6@linutronix.de
Signed-off-by: Paolo Abeni <pabeni@redhat.com>