Core & protocols
----------------
- Support HW queue leasing, allowing containers to be granted access
to HW queues for zero-copy operations and AF_XDP.
- Number of code moves to help the compiler with inlining.
Avoid output arguments for returning drop reason where possible.
- Rework drop handling within qdiscs to include more metadata
about the reason and dropping qdisc in the tracepoints.
- Remove the rtnl_lock use from IP Multicast Routing.
- Pack size information into the Rx Flow Steering table pointer
itself. This allows making the table itself a flat array of u32s,
thus making the table allocation size a power of two.
- Report TCP delayed ack timer information via socket diag.
- Add ip_local_port_step_width sysctl to allow distributing the randomly
selected ports more evenly throughout the allowed space.
- Add support for per-route tunsrc in IPv6 segment routing.
- Start work of switching sockopt handling to iov_iter.
- Improve dynamic recvbuf sizing in MPTCP, limit burstiness and avoid
buffer size drifting up.
- Support MSG_EOR in MPTCP.
- Add stp_mode attribute to the bridge driver for STP mode selection.
This addresses concerns about call_usermodehelper() usage.
- Remove UDP-Lite support (as announced in 2023).
- Remove support for building IPv6 as a module.
Remove the now unnecessary function calling indirection.
Cross-tree stuff
----------------
- Move Michael MIC code from generic crypto into wireless,
it's considered insecure but some WiFi networks still need it.
Netfilter
---------
- Switch nft_fib_ipv6 module to no longer need temporary dst_entry
object allocations by using fib6_lookup() + RCU.
Florian W reports this gets us ~13% higher packet rate.
- Convert IPVS's global __ip_vs_mutex to per-net service_mutex and
switch the service tables to be per-net. Convert some code that
walks the service lists to use RCU instead of the service_mutex.
- Add more opinionated input validation to lower security exposure.
- Make IPVS hash tables to be per-netns and resizable.
Wireless
--------
- Finished assoc frame encryption/EPPKE/802.1X-over-auth.
- Radar detection improvements.
- Add 6 GHz incumbent signal detection APIs.
- Multi-link support for FILS, probe response templates and
client probing.
- New APIs and mac80211 support for NAN (Neighbor Aware Networking,
aka Wi-Fi Aware) so less work must be in firmware.
Driver API
----------
- Add numerical ID for devlink instances (to avoid having to create
fake bus/device pairs just to have an ID). Support shared devlink
instances which span multiple PFs.
- Add standard counters for reporting pause storm events
(implement in mlx5 and fbnic).
- Add configuration API for completion writeback buffering
(implement in mana).
- Support driver-initiated change of RSS context sizes.
- Support DPLL monitoring input frequency (implement in zl3073x).
- Support per-port resources in devlink (implement in mlx5).
Misc
----
- Expand the YAML spec for Netfilter.
Drivers
-------
- Software:
- macvlan: support multicast rx for bridge ports with shared source
MAC address
- team: decouple receive and transmit enablement for IEEE 802.3ad
LACP "independent control"
- Ethernet high-speed NICs:
- nVidia/Mellanox:
- support high order pages in zero-copy mode (for payload
coalescing)
- support multiple packets in a page (for systems with 64kB pages)
- Broadcom 25-400GE (bnxt):
- implement XDP RSS hash metadata extraction
- add software fallback for UDP GSO, lowering the IOMMU cost
- Broadcom 800GE (bnge):
- add link status and configuration handling
- add various HW and SW statistics
- Marvell/Cavium:
- NPC HW block support for cn20k
- Huawei (hinic3):
- add mailbox / control queue
- add rx VLAN offload
- add driver info and link management
- Ethernet NICs:
- Marvell/Aquantia:
- support reading SFP module info on some AQC100 cards
- Realtek PCI (r8169):
- add support for RTL8125cp
- Realtek USB (r8152):
- support for the RTL8157 5Gbit chip
- add 2500baseT EEE status/configuration support
- Ethernet NICs embedded and off-the-shelf IP:
- Synopsys (stmmac):
- cleanup and reorganize SerDes handling and PCS support
- cleanup descriptor handling and per-platform data
- cleanup and consolidate MDIO defines and handling
- shrink driver memory use for internal structures
- improve Tx IRQ coalescing
- improve TCP segmentation handling
- add support for Spacemit K3
- Cadence (macb):
- support PHYs that have inband autoneg disabled with GEM
- support IEEE 802.3az EEE
- rework usrio capabilities and handling
- AMD (xgbe):
- improve power management for S0i3
- improve TX resilience for link-down handling
- Virtual:
- Google cloud vNIC:
- support larger ring sizes in DQO-QPL mode
- improve HW-GRO handling
- support UDP GSO for DQO format
- PCIe NTB:
- support queue count configuration
- Ethernet PHYs:
- automatically disable PHY autonomous EEE if MAC is in charge
- Broadcom:
- add BCM84891/BCM84892 support
- Micrel:
- support for LAN9645X internal PHY
- Realtek:
- add RTL8224 pair order support
- support PHY LEDs on RTL8211F-VD
- support spread spectrum clocking (SSC)
- Maxlinear:
- add PHY-level statistics via ethtool
- Ethernet switches:
- Maxlinear (mxl862xx):
- support for bridge offloading
- support for VLANs
- support driver statistics
- Bluetooth:
- large number of fixes and new device IDs
- Mediatek:
- support MT6639 (MT7927)
- support MT7902 SDIO
- WiFi:
- Intel (iwlwifi):
- UNII-9 and continuing UHR work
- MediaTek (mt76):
- mt7996/mt7925 MLO fixes/improvements
- mt7996 NPU support (HW eth/wifi traffic offload)
- Qualcomm (ath12k):
- monitor mode support on IPQ5332
- basic hwmon temperature reporting
- support IPQ5424
- Realtek:
- add USB RX aggregation to improve performance
- add USB TX flow control by tracking in-flight URBs
- Cellular:
- IPA v5.2 support
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmnelNoACgkQMUZtbf5S
IrtWFw//WyiXuEiGawVQONnbu1dtR+3nw/cvNpSYi0IM66vbRUB9n+9fxm2MIyG4
4jI/c/X/fxIvUxEqGez3yPn5P7KqkQR8WRYwkxrMYKRpXeukN0IDk5Euew5DskCe
wtBKNJOQWKdKXff0bLQoJ9dHWYuJ2IMRVil5M3fhUbeUOXeyJD7Yn1w2ICvJAbj+
T/Hw7sEtchNaHp6h6SbaQfahkUFHQG5peNoETkZF4UDF6ALGY29WH91GXeO2lrgN
IxX203KtaavV0oU8T0oixZgOc57Ns081YfFL/F1JP2HV6lgkwhuq+zxCrRTi1c9M
HPTXgwD7Z80Y74nM3YTLrPfoMOP8GLBZgdV3rUpwmteM26+gMTm+O1zHUur5ZoGy
D6TaMFguPTIqiRyrARa9xY/J6r9TQkc2Wfu4bIuPndKFg8xPoepuEObODnh0+5Hg
4j4pdFhIo2huENhSg7kVb/yl+1q68SFwM3RqTmx+OhCa0AyjcKIKgt/UBhismdnG
r8obxzb+nXeJc2rRDuwNMwlBlcMSbep27uGt64zeHMMXVhTVqOoytNaL/X/ZpH2m
A0DscUrpHvb36IoDPtanc6irP+JOh5Xe7Nw5qhkgwsMc7hlf8SyyHB4OUBBaz1qA
ETSnHlfwklRmXSpWqH2LyGXjdOQpDKP46+h0W3dttMD2/cRBqYo=
=EhQZ
-----END PGP SIGNATURE-----
Merge tag 'net-next-7.1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next
Pull networking updates from Jakub Kicinski:
"Core & protocols:
- Support HW queue leasing, allowing containers to be granted access
to HW queues for zero-copy operations and AF_XDP
- Number of code moves to help the compiler with inlining. Avoid
output arguments for returning drop reason where possible
- Rework drop handling within qdiscs to include more metadata about
the reason and dropping qdisc in the tracepoints
- Remove the rtnl_lock use from IP Multicast Routing
- Pack size information into the Rx Flow Steering table pointer
itself. This allows making the table itself a flat array of u32s,
thus making the table allocation size a power of two
- Report TCP delayed ack timer information via socket diag
- Add ip_local_port_step_width sysctl to allow distributing the
randomly selected ports more evenly throughout the allowed space
- Add support for per-route tunsrc in IPv6 segment routing
- Start work of switching sockopt handling to iov_iter
- Improve dynamic recvbuf sizing in MPTCP, limit burstiness and avoid
buffer size drifting up
- Support MSG_EOR in MPTCP
- Add stp_mode attribute to the bridge driver for STP mode selection.
This addresses concerns about call_usermodehelper() usage
- Remove UDP-Lite support (as announced in 2023)
- Remove support for building IPv6 as a module. Remove the now
unnecessary function calling indirection
Cross-tree stuff:
- Move Michael MIC code from generic crypto into wireless, it's
considered insecure but some WiFi networks still need it
Netfilter:
- Switch nft_fib_ipv6 module to no longer need temporary dst_entry
object allocations by using fib6_lookup() + RCU.
Florian W reports this gets us ~13% higher packet rate
- Convert IPVS's global __ip_vs_mutex to per-net service_mutex and
switch the service tables to be per-net. Convert some code that
walks the service lists to use RCU instead of the service_mutex
- Add more opinionated input validation to lower security exposure
- Make IPVS hash tables to be per-netns and resizable
Wireless:
- Finished assoc frame encryption/EPPKE/802.1X-over-auth
- Radar detection improvements
- Add 6 GHz incumbent signal detection APIs
- Multi-link support for FILS, probe response templates and client
probing
- New APIs and mac80211 support for NAN (Neighbor Aware Networking,
aka Wi-Fi Aware) so less work must be in firmware
Driver API:
- Add numerical ID for devlink instances (to avoid having to create
fake bus/device pairs just to have an ID). Support shared devlink
instances which span multiple PFs
- Add standard counters for reporting pause storm events (implement
in mlx5 and fbnic)
- Add configuration API for completion writeback buffering (implement
in mana)
- Support driver-initiated change of RSS context sizes
- Support DPLL monitoring input frequency (implement in zl3073x)
- Support per-port resources in devlink (implement in mlx5)
Misc:
- Expand the YAML spec for Netfilter
Drivers
- Software:
- macvlan: support multicast rx for bridge ports with shared
source MAC address
- team: decouple receive and transmit enablement for IEEE 802.3ad
LACP "independent control"
- Ethernet high-speed NICs:
- nVidia/Mellanox:
- support high order pages in zero-copy mode (for payload
coalescing)
- support multiple packets in a page (for systems with 64kB
pages)
- Broadcom 25-400GE (bnxt):
- implement XDP RSS hash metadata extraction
- add software fallback for UDP GSO, lowering the IOMMU cost
- Broadcom 800GE (bnge):
- add link status and configuration handling
- add various HW and SW statistics
- Marvell/Cavium:
- NPC HW block support for cn20k
- Huawei (hinic3):
- add mailbox / control queue
- add rx VLAN offload
- add driver info and link management
- Ethernet NICs:
- Marvell/Aquantia:
- support reading SFP module info on some AQC100 cards
- Realtek PCI (r8169):
- add support for RTL8125cp
- Realtek USB (r8152):
- support for the RTL8157 5Gbit chip
- add 2500baseT EEE status/configuration support
- Ethernet NICs embedded and off-the-shelf IP:
- Synopsys (stmmac):
- cleanup and reorganize SerDes handling and PCS support
- cleanup descriptor handling and per-platform data
- cleanup and consolidate MDIO defines and handling
- shrink driver memory use for internal structures
- improve Tx IRQ coalescing
- improve TCP segmentation handling
- add support for Spacemit K3
- Cadence (macb):
- support PHYs that have inband autoneg disabled with GEM
- support IEEE 802.3az EEE
- rework usrio capabilities and handling
- AMD (xgbe):
- improve power management for S0i3
- improve TX resilience for link-down handling
- Virtual:
- Google cloud vNIC:
- support larger ring sizes in DQO-QPL mode
- improve HW-GRO handling
- support UDP GSO for DQO format
- PCIe NTB:
- support queue count configuration
- Ethernet PHYs:
- automatically disable PHY autonomous EEE if MAC is in charge
- Broadcom:
- add BCM84891/BCM84892 support
- Micrel:
- support for LAN9645X internal PHY
- Realtek:
- add RTL8224 pair order support
- support PHY LEDs on RTL8211F-VD
- support spread spectrum clocking (SSC)
- Maxlinear:
- add PHY-level statistics via ethtool
- Ethernet switches:
- Maxlinear (mxl862xx):
- support for bridge offloading
- support for VLANs
- support driver statistics
- Bluetooth:
- large number of fixes and new device IDs
- Mediatek:
- support MT6639 (MT7927)
- support MT7902 SDIO
- WiFi:
- Intel (iwlwifi):
- UNII-9 and continuing UHR work
- MediaTek (mt76):
- mt7996/mt7925 MLO fixes/improvements
- mt7996 NPU support (HW eth/wifi traffic offload)
- Qualcomm (ath12k):
- monitor mode support on IPQ5332
- basic hwmon temperature reporting
- support IPQ5424
- Realtek:
- add USB RX aggregation to improve performance
- add USB TX flow control by tracking in-flight URBs
- Cellular:
- IPA v5.2 support"
* tag 'net-next-7.1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1561 commits)
net: pse-pd: fix kernel-doc function name for pse_control_find_by_id()
wireguard: device: use exit_rtnl callback instead of manual rtnl_lock in pre_exit
wireguard: allowedips: remove redundant space
tools: ynl: add sample for wireguard
wireguard: allowedips: Use kfree_rcu() instead of call_rcu()
MAINTAINERS: Add netkit selftest files
selftests/net: Add additional test coverage in nk_qlease
selftests/net: Split netdevsim tests from HW tests in nk_qlease
tools/ynl: Make YnlFamily closeable as a context manager
net: airoha: Add missing PPE configurations in airoha_ppe_hw_init()
net: airoha: Fix VIP configuration for AN7583 SoC
net: caif: clear client service pointer on teardown
net: strparser: fix skb_head leak in strp_abort_strp()
net: usb: cdc-phonet: fix skb frags[] overflow in rx_complete()
selftests/bpf: add test for xdp_master_redirect with bond not up
net, bpf: fix null-ptr-deref in xdp_master_redirect() for down master
net: airoha: Remove PCE_MC_EN_MASK bit in REG_FE_PCE_CFG configuration
sctp: disable BH before calling udp_tunnel_xmit_skb()
sctp: fix missing encap_port propagation for GSO fragments
net: airoha: Rely on net_device pointer in ETS callbacks
...
As IPv6 is built-in only, the macro is always evaluating to an empty
one. Remove it completely from the code.
Signed-off-by: Fernando Fernandez Mancera <fmancera@suse.de>
Link: https://patch.msgid.link/20260325120928.15848-3-fmancera@suse.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
inode->i_ino is being converted to a u64. sock.sk_ino (which caches the
inode number) must also be widened to avoid truncation on 32-bit
architectures where unsigned long is only 32 bits.
Change sk_ino from unsigned long to u64, and update the return type
of sock_i_ino() to match. Fix all format strings that print the
result of sock_i_ino() (%lu -> %llu), and widen the intermediate
variables and function parameters in the diag modules that were
using int to hold the inode number.
Note that the UAPI socket diag structures (inet_diag_msg.idiag_inode,
unix_diag_msg.udiag_ino, etc.) are all __u32 and cannot be changed
without breaking the ABI. The assignments to those fields will
silently truncate, which is the existing behavior.
Acked-by: Marc Kleine-Budde <mkl@pengutronix.de> # for net/can
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Link: https://patch.msgid.link/20260304-iino-u64-v3-3-2257ad83d372@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
isk->inet_num, isk->inet_rcv_saddr and sk->sk_bound_dev_if
are read locklessly in ping_lookup().
Add READ_ONCE()/WRITE_ONCE() annotations.
The race on isk->inet_rcv_saddr is probably coming from IPv6 support,
but does not deserve a specific backport.
Fixes: dbca1596bb ("ping: convert to RCU lookups, get rid of rwlock")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20260216100149.3319315-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Use DEFINE_RAW_FLEX() to avoid thousands of -Wflex-array-member-not-at-end
warnings.
Remove struct ip_options_data, and adjust the rest of the code so that
flexible-array member struct ip_options_rcu::opt.__data[] ends last
in struct icmp_bxm.
Compensate for this by using the DEFINE_RAW_FLEX() helper to define each
on-stack struct instance that contained struct ip_options_data as a member,
and to define struct ip_options_rcu with a fixed on-stack size for its
nested flexible-array member opt.__data[].
Also, add a couple of code comments to prevent people from adding members
to a struct after another member that contains a flexible array.
With these changes, fix 2600 warnings of the following type:
include/net/inet_sock.h:65:33: warning: structure containing a flexible array member is not at the end of another structure [-Wflex-array-member-not-at-end]
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Link: https://patch.msgid.link/aVteBadWA6AbTp7X@kspp
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When the ping program uses an IPPROTO_ICMP socket to send ICMP_ECHO
messages, ICMP_MIB_OUTMSGS is counted twice.
ping_v4_sendmsg
ping_v4_push_pending_frames
ip_push_pending_frames
ip_finish_skb
__ip_make_skb
icmp_out_count(net, icmp_type); // first count
icmp_out_count(sock_net(sk), user_icmph.type); // second count
However, when the ping program uses an IPPROTO_RAW socket,
ICMP_MIB_OUTMSGS is counted correctly only once.
Therefore, the first count should be removed.
Fixes: c319b4d76b ("net: ipv4: add IPPROTO_ICMP socket kind")
Signed-off-by: yuan.gao <yuan.gao@ucloud.cn>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Tested-by: Ido Schimmel <idosch@nvidia.com>
Link: https://patch.msgid.link/20251224063145.3615282-1-yuan.gao@ucloud.cn
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Convert struct proto pre_connect(), connect(), bind(), and bind_add()
callback function prototypes from struct sockaddr to struct sockaddr_unsized.
This does not change per-implementation use of sockaddr for passing around
an arbitrarily sized sockaddr struct. Those will be addressed in future
patches.
Additionally removes the no longer referenced struct sockaddr from
include/net/inet_common.h.
No binary changes expected.
Signed-off-by: Kees Cook <kees@kernel.org>
Link: https://patch.msgid.link/20251104002617.2752303-5-kees@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Provide isolation between netns for ping idents.
Randomize initial ping_port_rover value at netns creation.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/20250829153054.474201-4-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
There is no point in keeping ping_hash().
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Yue Haibing <yuehaibing@huawei.com>
Link: https://patch.msgid.link/20250829153054.474201-3-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
We need to check socket netns before considering them in ping_get_port().
Otherwise, one malicious netns could 'consume' all ports.
Add corresponding check in ping_lookup().
Fixes: c319b4d76b ("net: ipv4: add IPPROTO_ICMP socket kind")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Yue Haibing <yuehaibing@huawei.com>
Link: https://patch.msgid.link/20250829153054.474201-2-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
We want to split sk->sk_drops in the future to reduce
potential contention on this field.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250826125031.1578842-2-edumazet@google.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Difference between sock_i_uid() and sk_uid() is that
after sock_orphan(), sock_i_uid() returns GLOBAL_ROOT_UID
while sk_uid() returns the last cached sk->sk_uid value.
None of sock_i_uid() callers care about this.
Use sk_uid() which is much faster and inlined.
Note that diag/dump users are calling sock_i_ino() and
can not see the full benefit yet.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Lorenzo Colitti <lorenzo@google.com>
Reviewed-by: Maciej Żenczykowski <maze@google.com>
Link: https://patch.msgid.link/20250620133001.4090592-3-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
sk->sk_uid can be read while another thread changes its
value in sockfs_setattr().
Add sk_uid(const struct sock *sk) helper to factorize the needed
READ_ONCE() annotations, and add corresponding WRITE_ONCE()
where needed.
Fixes: 86741ec254 ("net: core: Add a UID field to struct sock.")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Lorenzo Colitti <lorenzo@google.com>
Reviewed-by: Maciej Żenczykowski <maze@google.com>
Link: https://patch.msgid.link/20250620133001.4090592-2-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
ping_rcv() callers currently call skb_free() or consume_skb(),
forcing ping_rcv() to clone the skb.
After this patch ping_rcv() is now 'consuming' the original skb,
either moving to a socket receive queue, or dropping it.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/20250226183437.1457318-3-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Initialize the ip cookie tos field when initializing the cookie, in
ipcm_init_sk.
The existing code inverts the standard pattern for initializing cookie
fields. Default is to initialize the field from the sk, then possibly
overwrite that when parsing cmsgs (the unlikely case).
This field inverts that, setting the field to an illegal value and
after cmsg parsing checking whether the value is still illegal and
thus should be overridden.
Be careful to always apply mask INET_DSCP_MASK, as before.
Signed-off-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/20250214222720.3205500-5-willemdebruijn.kernel@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Replace kfree_skb_reason with sk_skb_reason_drop and pass the receiving
socket to the tracepoint.
Signed-off-by: Yan Zhai <yan@cloudflare.com>
Acked-by: Jesper Dangaard Brouer <hawk@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQTFp0I1jqZrAX+hPRXbK58LschIgwUCZS1d4wAKCRDbK58LschI
g4DSAP441CdKh8fd+wNKUSKHFbpCQ6EvocR6Nf+Sj2DFUx/w/QEA7mfju7Abqjc3
xwDEx0BuhrjMrjV5MmEpxc7lYl9XcQU=
=vuWk
-----END PGP SIGNATURE-----
Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Daniel Borkmann says:
====================
pull-request: bpf-next 2023-10-16
We've added 90 non-merge commits during the last 25 day(s) which contain
a total of 120 files changed, 3519 insertions(+), 895 deletions(-).
The main changes are:
1) Add missed stats for kprobes to retrieve the number of missed kprobe
executions and subsequent executions of BPF programs, from Jiri Olsa.
2) Add cgroup BPF sockaddr hooks for unix sockets. The use case is
for systemd to reimplement the LogNamespace feature which allows
running multiple instances of systemd-journald to process the logs
of different services, from Daan De Meyer.
3) Implement BPF CPUv4 support for s390x BPF JIT, from Ilya Leoshkevich.
4) Improve BPF verifier log output for scalar registers to better
disambiguate their internal state wrt defaults vs min/max values
matching, from Andrii Nakryiko.
5) Extend the BPF fib lookup helpers for IPv4/IPv6 to support retrieving
the source IP address with a new BPF_FIB_LOOKUP_SRC flag,
from Martynas Pumputis.
6) Add support for open-coded task_vma iterator to help with symbolization
for BPF-collected user stacks, from Dave Marchevsky.
7) Add libbpf getters for accessing individual BPF ring buffers which
is useful for polling them individually, for example, from Martin Kelly.
8) Extend AF_XDP selftests to validate the SHARED_UMEM feature,
from Tushar Vyavahare.
9) Improve BPF selftests cross-building support for riscv arch,
from Björn Töpel.
10) Add the ability to pin a BPF timer to the same calling CPU,
from David Vernet.
11) Fix libbpf's bpf_tracing.h macros for riscv to use the generic
implementation of PT_REGS_SYSCALL_REGS() to access syscall arguments,
from Alexandre Ghiti.
12) Extend libbpf to support symbol versioning for uprobes, from Hengqi Chen.
13) Fix bpftool's skeleton code generation to guarantee that ELF data
is 8 byte aligned, from Ian Rogers.
14) Inherit system-wide cpu_mitigations_off() setting for Spectre v1/v4
security mitigations in BPF verifier, from Yafang Shao.
15) Annotate struct bpf_stack_map with __counted_by attribute to prepare
BPF side for upcoming __counted_by compiler support, from Kees Cook.
* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (90 commits)
bpf: Ensure proper register state printing for cond jumps
bpf: Disambiguate SCALAR register state output in verifier logs
selftests/bpf: Make align selftests more robust
selftests/bpf: Improve missed_kprobe_recursion test robustness
selftests/bpf: Improve percpu_alloc test robustness
selftests/bpf: Add tests for open-coded task_vma iter
bpf: Introduce task_vma open-coded iterator kfuncs
selftests/bpf: Rename bpf_iter_task_vma.c to bpf_iter_task_vmas.c
bpf: Don't explicitly emit BTF for struct btf_iter_num
bpf: Change syscall_nr type to int in struct syscall_tp_t
net/bpf: Avoid unused "sin_addr_len" warning when CONFIG_CGROUP_BPF is not set
bpf: Avoid unnecessary audit log for CPU security mitigations
selftests/bpf: Add tests for cgroup unix socket address hooks
selftests/bpf: Make sure mount directory exists
documentation/bpf: Document cgroup unix socket address hooks
bpftool: Add support for cgroup unix socket address hooks
libbpf: Add support for cgroup unix socket address hooks
bpf: Implement cgroup sockaddr hooks for unix sockets
bpf: Add bpf_sock_addr_set_sun_path() to allow writing unix sockaddr from bpf
bpf: Propagate modified uaddrlen from cgroup sockaddr programs
...
====================
Link: https://lore.kernel.org/r/20231016204803.30153-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
As prep for adding unix socket support to the cgroup sockaddr hooks,
let's propagate the sockaddr length back to the caller after running
a bpf cgroup sockaddr hook program. While not important for AF_INET or
AF_INET6, the sockaddr length is important when working with AF_UNIX
sockaddrs as the size of the sockaddr cannot be determined just from the
address family or the sockaddr's contents.
__cgroup_bpf_run_filter_sock_addr() is modified to take the uaddrlen as
an input/output argument. After running the program, the modified sockaddr
length is stored in the uaddrlen pointer.
Signed-off-by: Daan De Meyer <daan.j.demeyer@gmail.com>
Link: https://lore.kernel.org/r/20231011185113.140426-3-daan.j.demeyer@gmail.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Add missing annotations to inet->mc_index and inet->mc_addr
to fix data-races.
getsockopt(IP_MULTICAST_IF) can be lockless.
setsockopt() side is left for later.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add missing READ_ONCE() annotations when reading inet->uc_index
Implementing getsockopt(IP_UNICAST_IF) locklessly seems possible,
the setsockopt() part might not be possible at the moment.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
inet->pmtudisc can be read locklessly.
Implement proper lockless reads and writes to inet->pmtudisc
ip_sock_set_mtu_discover() can now be called from arbitrary
contexts.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
np->sndflow reads are racy.
Use one bit ftom atomic inet->inet_flags instead,
IPV6_FLOWINFO_SEND setsockopt() can be lockless.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
np->recverr is moved to inet->inet_flags to fix data-races.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
IP_RECVERR socket option can now be set/get without locking the socket.
This patch potentially avoid data-races around inet->recverr.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Various inet fields are currently racy.
do_ip_setsockopt() and do_ip_getsockopt() are mostly holding
the socket lock, but some (fast) paths do not.
Use a new inet->inet_flags to hold atomic bits in the series.
Remove inet->cmsg_flags, and use instead 9 bits from inet_flags.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Define a new helper to figure out the correct route scope to use on TX,
depending on socket configuration, ancillary data and send flags.
Use this new helper to properly initialise the scope in
flowi4_init_output(), instead of overriding tos with the RTO_ONLINK
flag.
The objective is to eventually remove RTO_ONLINK, which will allow
converting .flowi4_tos to dscp_t.
Signed-off-by: Guillaume Nault <gnault@redhat.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since introduced in commit c319b4d76b ("net: ipv4: add IPPROTO_ICMP
socket kind"), ping socket does not use SLAB_TYPESAFE_BY_RCU nor check
nulls marker in loops.
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
After commit dbca1596bb ("ping: convert to RCU lookups, get rid
of rwlock"), we use RCU for ping sockets, but we should use spinlock
for /proc/net/icmp to avoid a potential NULL deref mentioned in
the previous patch.
Let's go back to using spinlock there.
Note we can convert ping sockets to use hlist instead of hlist_nulls
because we do not use SLAB_TYPESAFE_BY_RCU for ping sockets.
Fixes: dbca1596bb ("ping: convert to RCU lookups, get rid of rwlock")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
ping_lookup() does not acquire the table spinlock, so iteration should
use hlist_nulls_for_each_entry_rcu().
Spotted during code review.
Fixes: dbca1596bb ("ping: convert to RCU lookups, get rid of rwlock")
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Link: https://lore.kernel.org/r/20221129140644.28525-1-fw@strlen.de
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
We assume the correct errno is -EADDRINUSE when sk->sk_prot->get_port()
fails, so some ->get_port() functions return just 1 on failure and the
callers return -EADDRINUSE instead.
However, mptcp_get_port() can return -EINVAL. Let's not ignore the error.
Note the only exception is inet_autobind(), all of whose callers return
-EAGAIN instead.
Fixes: cec37a6e41 ("mptcp: Handle MP_CAPABLE options for outgoing connections")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
For a given ping datagram, ping_getfrag() is called once
per skb fragment.
A large datagram requiring more than one page fragment
is currently getting the checksum of the last fragment,
instead of the cumulative one.
After this patch, "ping -s 35000 ::1" is working correctly.
Fixes: 6d0bfe2261 ("net: ipv6: Add IPv6 support to the ping socket.")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Lorenzo Colitti <lorenzo@google.com>
Cc: Maciej Żenczykowski <maze@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Usually when a TCP/UDP connection is initiated, we can bind the socket
to a specific IP attached to an interface in a cgroup/connect hook.
But for pings, this is impossible, as the hook is not being called.
This adds the hook invocation to unprivileged ICMP ping (i.e. ping
sockets created with SOCK_DGRAM IPPROTO_ICMP(V6) as opposed to
SOCK_RAW. Logic is mirrored from UDP sockets where the hook is invoked
during pre_connect, after a check for suficiently sized addr_len.
Signed-off-by: YiFei Zhu <zhuyifei@google.com>
Link: https://lore.kernel.org/r/5764914c252fad4cd134fb6664c6ede95f409412.1662682323.git.zhuyifei@google.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Using rwlock in networking code is extremely risky.
writers can starve if enough readers are constantly
grabing the rwlock.
I thought rwlock were at fault and sent this patch:
https://lkml.org/lkml/2022/6/17/272
But Peter and Linus essentially told me rwlock had to be unfair.
We need to get rid of rwlock in networking code.
Fixes: c319b4d76b ("net: ipv4: add IPPROTO_ICMP socket kind")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit 8ff978b8b2 ("ipv4/raw: support binding to nonlocal addresses")
introduced a helper function to fold duplicated validity checks of bind
addresses into inet_addr_valid_or_nonlocal(). However, this caused an
unintended regression in ping_check_bind_addr(), which previously would
reject binding to multicast and broadcast addresses, but now these are
both incorrectly allowed as reported in [1].
This patch restores the original check. A simple reordering is done to
improve readability and make it evident that multicast and broadcast
addresses should not be allowed. Also, add an early exit for INADDR_ANY
which replaces lost behavior added by commit 0ce779a9f5 ("net: Avoid
unnecessary inet_addr_type() call when addr is INADDR_ANY").
Furthermore, this patch introduces regression selftests to catch these
specific cases.
[1] https://lore.kernel.org/netdev/CANP3RGdkAcDyAZoT1h8Gtuu0saq+eOrrTiWbxnOs+5zn+cpyKg@mail.gmail.com/
Fixes: 8ff978b8b2 ("ipv4/raw: support binding to nonlocal addresses")
Cc: Miaohe Lin <linmiaohe@huawei.com>
Reported-by: Maciej Żenczykowski <maze@google.com>
Signed-off-by: Carlos Llamas <cmllamas@google.com>
Signed-off-by: Riccardo Paolo Bestetti <pbl@bestov.io>
Signed-off-by: David S. Miller <davem@davemloft.net>
When ping_group_range is updated, 'ping' uses the DGRAM ICMP socket,
instead of an IP raw socket. In this case, 'ping' is unable to bind its
socket to a local address owned by a vrflite.
Before the patch:
$ sysctl -w net.ipv4.ping_group_range='0 2147483647'
$ ip link add blue type vrf table 10
$ ip link add foo type dummy
$ ip link set foo master blue
$ ip link set foo up
$ ip addr add 192.168.1.1/24 dev foo
$ ip addr add 2001::1/64 dev foo
$ ip vrf exec blue ping -c1 -I 192.168.1.1 192.168.1.2
ping: bind: Cannot assign requested address
$ ip vrf exec blue ping6 -c1 -I 2001::1 2001::2
ping6: bind icmp socket: Cannot assign requested address
CC: stable@vger.kernel.org
Fixes: 1b69c6d0ae ("net: Introduce L3 Master device abstraction")
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The internal recvmsg() functions have two parameters 'flags' and 'noblock'
that were merged inside skb_recv_datagram(). As a follow up patch to commit
f4b41f062c ("net: remove noblock parameter from skb_recv_datagram()")
this patch removes the separate 'noblock' parameter for recvmsg().
Analogue to the referenced patch for skb_recv_datagram() the 'flags' and
'noblock' parameters are unnecessarily split up with e.g.
err = sk->sk_prot->recvmsg(sk, msg, size, flags & MSG_DONTWAIT,
flags & ~MSG_DONTWAIT, &addr_len);
or in
err = INDIRECT_CALL_2(sk->sk_prot->recvmsg, tcp_recvmsg, udp_recvmsg,
sk, msg, size, flags & MSG_DONTWAIT,
flags & ~MSG_DONTWAIT, &addr_len);
instead of simply using only flags all the time and check for MSG_DONTWAIT
where needed (to preserve for the formerly separated no(n)block condition).
Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net>
Link: https://lore.kernel.org/r/20220411124955.154876-1-socketcan@hartkopp.net
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Replace kfree_skb() used in icmp_rcv() and icmpv6_rcv() with
kfree_skb_reason().
In order to get the reasons of the skb drops after icmp message handle,
we change the return type of 'handler()' in 'struct icmp_control' from
'bool' to 'enum skb_drop_reason'. This may change its original
intention, as 'false' means failure, but 'SKB_NOT_DROPPED_YET' means
success now. Therefore, all 'handler' and the call of them need to be
handled. Following 'handler' functions are involved:
icmp_unreach()
icmp_redirect()
icmp_echo()
icmp_timestamp()
icmp_discard()
And following new drop reasons are added:
SKB_DROP_REASON_ICMP_CSUM
SKB_DROP_REASON_INVALID_PROTO
The reason 'INVALID_PROTO' is introduced for the case that the packet
doesn't follow rfc 1122 and is dropped. This is not a common case, and
I believe we can locate the problem from the data in the packet. For now,
this 'INVALID_PROTO' is used for the icmp broadcasts with wrong types.
Maybe there should be a document file for these reasons. For example,
list all the case that causes the 'UNHANDLED_PROTO' and 'INVALID_PROTO'
drop reason. Therefore, users can locate their problems according to the
document.
Reviewed-by: Hao Peng <flyingpeng@tencent.com>
Reviewed-by: Jiang Biao <benbjiang@tencent.com>
Signed-off-by: Menglong Dong <imagedong@tencent.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
In order to avoid to change the return value of ping_queue_rcv_skb(),
introduce the function __ping_queue_rcv_skb(), which is able to report
the reasons of skb drop as its return value, as Paolo suggested.
Meanwhile, make ping_queue_rcv_skb() a simple call to
__ping_queue_rcv_skb().
The kfree_skb() and sock_queue_rcv_skb() used in ping_queue_rcv_skb()
are replaced with kfree_skb_reason() and sock_queue_rcv_skb_reason()
now.
Reviewed-by: Hao Peng <flyingpeng@tencent.com>
Reviewed-by: Jiang Biao <benbjiang@tencent.com>
Signed-off-by: Menglong Dong <imagedong@tencent.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
skb_recv_datagram() has two parameters 'flags' and 'noblock' that are
merged inside skb_recv_datagram() by 'flags | (noblock ? MSG_DONTWAIT : 0)'
As 'flags' may contain MSG_DONTWAIT as value most callers split the 'flags'
into 'flags' and 'noblock' with finally obsolete bit operations like this:
skb_recv_datagram(sk, flags & ~MSG_DONTWAIT, flags & MSG_DONTWAIT, &rc);
And this is not even done consistently with the 'flags' parameter.
This patch removes the obsolete and costly splitting into two parameters
and only performs bit operations when really needed on the caller side.
One missing conversion thankfully reported by kernel test robot. I missed
to enable kunit tests to build the mctp code.
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
As Jakub noticed, prints should be avoided on the datapath.
Also, as packets would never come to the else branch in
ping_lookup(), remove pr_err() from ping_lookup().
Fixes: 35a79e64de ("ping: fix the dif and sdif check in ping_lookup")
Reported-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Link: https://lore.kernel.org/r/1ef3f2fcd31bd681a193b1fcf235eee1603819bd.1645674068.git.lucien.xin@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When 'ping' changes to use PING socket instead of RAW socket by:
# sysctl -w net.ipv4.ping_group_range="0 100"
There is another regression caused when matching sk_bound_dev_if
and dif, RAW socket is using inet_iif() while PING socket lookup
is using skb->dev->ifindex, the cmd below fails due to this:
# ip link add dummy0 type dummy
# ip link set dummy0 up
# ip addr add 192.168.111.1/24 dev dummy0
# ping -I dummy0 192.168.111.1 -c1
The issue was also reported on:
https://github.com/iputils/iputils/issues/104
But fixed in iputils in a wrong way by not binding to device when
destination IP is on device, and it will cause some of kselftests
to fail, as Jianlin noticed.
This patch is to use inet(6)_iif and inet(6)_sdif to get dif and
sdif for PING socket, and keep consistent with RAW socket.
Fixes: c319b4d76b ("net: ipv4: add IPPROTO_ICMP socket kind")
Reported-by: Jianlin Shi <jishi@redhat.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>