Commit Graph

13519 Commits

Author SHA1 Message Date
Linus Torvalds
515186b7be bpf-fixes
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE+soXsSLHKoYyzcli6rmadz2vbToFAmn/wmgACgkQ6rmadz2v
 bTosmhAAgYkQLg7zVQdruoSYb7Vzjz1Di4tM2rBXNIX4S7dvfZUGGBNzFV1lWobk
 /r6269llSnPKXofs+69LDVCpdvUXmGRmS7+bq+bxV7WVmg7JruVOTWg839jValJK
 cY3IQi0lZ9GVKaePI5C2XxBS3rCrdQmby91fcfp5C6A/gR6m7PzAlnoIuJ2SQx6A
 7tsxxJb4wRtFWPBp7ClbBo7MAMIzPse/6CzsA2eP+icyJC+De9WGYs6bTDNi7vpY
 +eul0HMyHLTszJe/AGrsu5Ky3S6l+CTydi1fAUSOnk1pYHHhRvvD2WV8ix05/0rO
 2looZl6ogpcisCm1i8HN8g1ST0tS74x3bL9kjvB/hhKGh6K1QpU6/drEvmJqKMAu
 fspYHD3qO+OXN7EV7tFZ1ErJvJZ7zT7UP0JxirAK1DFQZWrki/tJKehSD6gbir8R
 GwwZctXDOPTGADBsdqbxEPEAp1gVTvDXf04k6GOCLkzqqYBMVKdW/8GXN+6Itr+O
 nxxoC0SOOkW7rRlJaxuJd5+kpaCKOuK9FaXWONOn7HPzBgK0E0CL9g3+cZcS1QvI
 2/5utfFj0gMeo40ZDjCyDWXm7w+AnTSKMMapB5pyi0FY3AVtroSV88HNbpm7DJrs
 xp9jO5ZD6EQ9Wn1cufOYAkrgZYwTZL5Z2EqyKcoJUIk1ZjpQbXg=
 =x/fg
 -----END PGP SIGNATURE-----

Merge tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf

Pull bpf fixes from Alexei Starovoitov:

 - Fix sk_local_storage diag dump via netlink (Amery Hung)

 - Fix off-by-one in arena direct-value access (Junyoung Jang)

 - Reject TCP_NODELAY in bpf-tcp congestion control (KaFai Wan)

 - Fix type confusion in bpf_*_sock() (Kuniyuki Iwashima)

 - Reject TX-only AF_XDP sockets (Linpu Yu)

 - Don't run arg-tracking analysis twice on main subprog (Paul Chaignon)

 - Fix NULL pointer dereference in bpf_sk_storage_clone and fib lookup
   (Weiming Shi)

* tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
  bpf: Fix off-by-one boundary validation in arena direct-value access
  xskmap: reject TX-only AF_XDP sockets
  bpf: Don't run arg-tracking analysis twice on main subprog
  bpf: Free reuseport cBPF prog after RCU grace period.
  bpf: tcp: Fix type confusion in sol_tcp_sockopt().
  bpf: tcp: Fix type confusion in bpf_skc_to_tcp6_sock().
  bpf: tcp: Fix type confusion in bpf_skc_to_tcp_sock().
  mptcp: bpf: Fix type confusion in bpf_mptcp_sock_from_subflow()
  selftest: bpf: Add test for bpf_tcp_sock() and RAW socket.
  bpf: tcp: Fix type confusion in bpf_tcp_sock().
  tools/headers: Regenerate stddef.h to fix BPF selftests
  bpf: Fix sk_local_storage diag dumping uninitialized special fields
  bpf: Fix NULL pointer dereference in bpf_skb_fib_lookup()
  sockmap: Fix sk_psock_drop() race vs sock_map_{unhash,close,destroy}().
  bpf: Fix NULL pointer dereference in bpf_sk_storage_clone and diag paths
  selftests/bpf: Verify bpf-tcp-cc rejects TCP_NODELAY
  selftests/bpf: Test TCP_NODELAY in TCP hdr opt callbacks
  bpf: Reject TCP_NODELAY in bpf-tcp-cc
  bpf: Reject TCP_NODELAY in TCP header option callbacks
2026-05-09 18:42:54 -07:00
Kuniyuki Iwashima
019c892e46 ipmr: Call ipmr_fib_lookup() under RCU.
Yi Lai reported RCU splat in reg_vif_xmit() below. [0]

When CONFIG_IP_MROUTE_MULTIPLE_TABLES=n, ipmr_fib_lookup()
uses rcu_dereference() without explicit rcu_read_lock().

Although rcu_read_lock_bh() is already held by the caller
__dev_queue_xmit(), lockdep requires explicit rcu_read_lock()
for rcu_dereference().

Let's move up rcu_read_lock() in reg_vif_xmit() to
cover ipmr_fib_lookup().

[0]:
WARNING: suspicious RCU usage
7.1.0-rc2-next-20260504-9d0d467c3572 #1 Not tainted
 -----------------------------
net/ipv4/ipmr.c:329 suspicious rcu_dereference_check() usage!

other info that might help us debug this:

rcu_scheduler_active = 2, debug_locks = 1
2 locks held by syz.2.17/1779:
 #0: ffffffff87896440 (rcu_read_lock_bh){....}-{1:3}, at: local_bh_disable include/linux/bottom_half.h:20 [inline]
 #0: ffffffff87896440 (rcu_read_lock_bh){....}-{1:3}, at: rcu_read_lock_bh include/linux/rcupdate.h:891 [inline]
 #0: ffffffff87896440 (rcu_read_lock_bh){....}-{1:3}, at: __dev_queue_xmit+0x239/0x4140 net/core/dev.c:4792
 #1: ffff88801a199d18 (_xmit_PIMREG#2){+...}-{3:3}, at: spin_lock include/linux/spinlock.h:342 [inline]
 #1: ffff88801a199d18 (_xmit_PIMREG#2){+...}-{3:3}, at: __netif_tx_lock include/linux/netdevice.h:4795 [inline]
 #1: ffff88801a199d18 (_xmit_PIMREG#2){+...}-{3:3}, at: __dev_queue_xmit+0x1d5d/0x4140 net/core/dev.c:4865

stack backtrace:
CPU: 1 UID: 0 PID: 1779 Comm: syz.2.17 Not tainted 7.1.0-rc2-next-20260504-9d0d467c3572 #1 PREEMPT(lazy)
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
Call Trace:
 <TASK>
 __dump_stack lib/dump_stack.c:94 [inline]
 dump_stack_lvl+0x121/0x150 lib/dump_stack.c:120
 dump_stack+0x19/0x20 lib/dump_stack.c:129
 lockdep_rcu_suspicious+0x15b/0x1f0 kernel/locking/lockdep.c:6878
 ipmr_fib_lookup net/ipv4/ipmr.c:329 [inline]
 reg_vif_xmit+0x2ee/0x3c0 net/ipv4/ipmr.c:540
 __netdev_start_xmit include/linux/netdevice.h:5382 [inline]
 netdev_start_xmit include/linux/netdevice.h:5391 [inline]
 xmit_one net/core/dev.c:3889 [inline]
 dev_hard_start_xmit+0x170/0x700 net/core/dev.c:3905
 __dev_queue_xmit+0x1df1/0x4140 net/core/dev.c:4871
 dev_queue_xmit include/linux/netdevice.h:3423 [inline]
 packet_xmit+0x252/0x370 net/packet/af_packet.c:276
 packet_snd net/packet/af_packet.c:3082 [inline]
 packet_sendmsg+0x39ad/0x5650 net/packet/af_packet.c:3114
 sock_sendmsg_nosec net/socket.c:797 [inline]
 __sock_sendmsg net/socket.c:812 [inline]
 ____sys_sendmsg+0xa21/0xba0 net/socket.c:2716
 ___sys_sendmsg+0x121/0x1c0 net/socket.c:2770
 __sys_sendmsg+0x177/0x220 net/socket.c:2802
 __do_sys_sendmsg net/socket.c:2807 [inline]
 __se_sys_sendmsg net/socket.c:2805 [inline]
 __x64_sys_sendmsg+0x80/0xc0 net/socket.c:2805
 x64_sys_call+0x1d9c/0x21c0 arch/x86/include/generated/asm/syscalls_64.h:47
 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
 do_syscall_64+0xc1/0x1020 arch/x86/entry/syscall_64.c:94
 entry_SYSCALL_64_after_hwframe+0x76/0x7e
RIP: 0033:0x7f37e563ee5d
Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 93 af 1b 00 f7 d8 64 89 01 48
RSP: 002b:00007ffe5caa7fa8 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
RAX: ffffffffffffffda RBX: 00000000005c5fa0 RCX: 00007f37e563ee5d
RDX: 0000000000000000 RSI: 00002000000012c0 RDI: 0000000000000004
RBP: 00000000005c5fa0 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
R13: 0000000000000000 R14: 00000000005c5fac R15: 00000000005c5fa0
 </TASK>

Fixes: b3b6babf47 ("ipmr: Free mr_table after RCU grace period.")
Reported-by: syzkaller <syzkaller@googlegroups.com>
Reported-by: Yi Lai <yi1.lai@intel.com>
Closes: https://lore.kernel.org/netdev/afrY34dLXNUboevf@ly-workstation/
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Link: https://patch.msgid.link/20260506065955.1695753-1-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-05-07 08:38:37 -07:00
Eric Dumazet
c8f7244c8c tcp: tcp_child_process() related UAF
tcp_child_process( .. child ...) currently calls sock_put(child).

Unfortunately @child (named @nsk in callers) can be used after
this point to send a RST packet.

To fix this UAF, I remove the sock_put() from tcp_child_process()
and let the callers handle this after it is safe.

Remove @rsk variable in tcp_v4_do_rcv() and change tcp_v6_do_rcv()
so that both functions look the same.

Fixes: cfb6eeb4c8 ("[TCP]: MD5 Signature Option (RFC2385) support.")
Reported-by: Damiano Melotti <melotti@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20260505153927.3435532-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-05-06 18:11:33 -07:00
Eric Dumazet
67ef49047d inetpeer: add a missing read_seqretry() in inet_getpeer()
When performing a lockless lookup over the inet_peer rbtree,
if a matching node is found, inet_getpeer() returns it immediately
without validating the seqlock sequence.

This missing check introduces a race condition:

Trigger Path: When a host receives an incoming fragmented IPv4 packet,
ip4_frag_init() (in net/ipv4/ip_fragment.c) calls inet_getpeer_v4()
to track the peer.

The Race: If the packet is from a new source IP, CPU A acquires the
write_seqlock, allocates a new inet_peer node (p), sets its IP address
(daddr), and links it to the rbtree (rb_link_node).

Uninitialized Access: Due to the lack of memory barriers between
rb_link_node and the initialization of the rest of the struct
(like refcount_set(&p->refcnt, 1)), CPU A can make the node visible
to readers before its refcnt is initialized.
This is especially true on weakly-ordered architectures like ARM64
where the CPU can reorder the memory stores.

Lockless Reader: Concurrently, CPU B processes a second fragmented packet
from the same source IP. CPU B does a lockless lookup, finds the newly
inserted node, and returns it immediately.

Use-After-Free (UAF): CPU B reads p->refcnt as uninitialized garbage
(left over from previous kmalloc-128/192 allocations).
If the garbage is > 0, refcount_inc_not_zero(&p->refcnt) succeeds.
CPU A then executes refcount_set(&p->refcnt, 1), overwriting CPU B's increment.
When CPU B finishes with the fragment queue, it calls inet_putpeer(),
which drops the refcount to 0 and frees the node via RCU.
The node is now freed but remains linked in the rbtree,
resulting in a Use-After-Free in the rbtree.

Fixes: b145425f26 ("inetpeer: remove AVL implementation in favor of RB tree")
Reported-by: Damiano Melotti <melotti@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260505133233.3039575-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-05-06 17:44:13 -07:00
Jakub Kicinski
dc61989e37 ipsec-2026-05-05
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEH7ZpcWbFyOOp6OJbrB3Eaf9PW7cFAmn57igACgkQrB3Eaf9P
 W7cDig//aXeIEN6VUYPU6lTDYXNCWz2A7sM636rXMMizF1nVjkRtrZlzQFwE9pIm
 LOla+Mu1VLGVsuxaoYfW2NagKt6bUg3xEDrlOt+lL/Bn6hengdjVF9PibvP4XCjt
 5bwtg0xN0AysoktYS2v+2b+fSh5CSnQkcEcn9F2d+3zXmFlLpxuyPJqhHn54nHmI
 JPACVyk9bZdKutdfr86uThgWnTDInPvJ2vMRpRlwpGWx5f2JspJv1g4zzWzc38Ad
 yTcRZQXhZ7zfOaYFGjqMD0eHtFDPC+HqMTi0Ak9ngCBAFpZS8/iBJ3/TlukJjNcy
 q805gPyRqnpiVgm6NH55C8HUguzpD7m8tcjBbVADvIrMA0OzMw3mBxwFsbG2aaCs
 cPXxvtT7crDbKPtxvY5RhVJIvCe4BCMP/uqlmo7wuwPE01arVau5i4miZKGPTzXB
 LRNchWJMDIrwE/+MnAbJBXT5RfiN5RPvPdV5OdTlrofkwDzBjpTev5FeQq7QktSx
 ctPy7I28IRw+eCKlu2FNrUJ4x8C/7Fv1ZPADOSvd3D5PdaOAArUb3RhTGwC9giuo
 qKKv8Q30x5xyOv90MB3M8vQwM7mGUloIfZPN6AhRoaDGikdMyy6gZ8Y5M3noGUUJ
 D4z+kZgHy1ZrdYDM58CdfE1Kz/s96rA5aIHUVZQYonaz35YGRts=
 =WKO1
 -----END PGP SIGNATURE-----

Merge tag 'ipsec-2026-05-05' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec

Steffen Klassert says:

====================
pull request (net): ipsec 2026-05-05

1. Fix an IPv6 encapsulation error path that leaked route references
   when UDPv6 ESP decapsulation resolved to an error route.
   From Yilin Zhu.

2. Fix AH with ESN on async crypto paths by accounting for the extra
   high-order sequence number when reconstructing the temporary
   authentication layout in the completion callbacks.
   From Michael Bomarito.

3. Fix XFRM output so it does not overwrite already-correct inner header
   pointers when a tunnel layer such as VXLAN has already saved them.
   The fix comes with new selftests. From Cosmin Ratiu.

4. Add the missing native payload size entry for XFRM_MSG_MAPPING in the
   compat translation path. From Ruijie Li.

5. Harden __xfrm_state_delete() against repeated or inconsistent unhashing
   of state list nodes by keying the removal on actual list membership and
   using delete-and-init helpers. From Michal Kosiorek.

6. Prevent ESP from decrypting shared splice-backed skb fragments in place
   by marking UDP splice frags as shared and forcing copy-on-write in ESP
   input when needed. From Kuan-Ting Chen.

* tag 'ipsec-2026-05-05' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec:
  xfrm: esp: avoid in-place decrypt on shared skb frags
  xfrm: defensively unhash xfrm_state lists in __xfrm_state_delete
  xfrm: provide message size for XFRM_MSG_MAPPING
  xfrm: Don't clobber inner headers when already set
  tools/selftests: Add a VXLAN+IPsec traffic test
  tools/selftests: Use a sensible timeout value for iperf3 client
  xfrm: ah: account for ESN high bits in async callbacks
  ipv6: xfrm6: release dst on error in xfrm6_rcv_encap()
====================

Link: https://patch.msgid.link/20260505132326.1362733-1-steffen.klassert@secunet.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-05-06 16:49:42 -07:00
Kuan-Ting Chen
f4c50a4034 xfrm: esp: avoid in-place decrypt on shared skb frags
MSG_SPLICE_PAGES can attach pages from a pipe directly to an skb. TCP
marks such skbs with SKBFL_SHARED_FRAG after skb_splice_from_iter(),
so later paths that may modify packet data can first make a private
copy. The IPv4/IPv6 datagram append paths did not set this flag when
splicing pages into UDP skbs.

That leaves an ESP-in-UDP packet made from shared pipe pages looking
like an ordinary uncloned nonlinear skb. ESP input then takes the no-COW
fast path for uncloned skbs without a frag_list and decrypts in place
over data that is not owned privately by the skb.

Mark IPv4/IPv6 datagram splice frags with SKBFL_SHARED_FRAG, matching
TCP. Also make ESP input fall back to skb_cow_data() when the flag is
present, so ESP does not decrypt externally backed frags in place.
Private nonlinear skb frags still use the existing fast path.

This intentionally does not change ESP output. In esp_output_head(),
the path that appends the ESP trailer to existing skb tailroom without
calling skb_cow_data() is not reachable for nonlinear skbs:
skb_tailroom() returns zero when skb->data_len is nonzero, while ESP
tailen is positive. Thus ESP output will either use the separate
destination-frag path or fall back to skb_cow_data().

Fixes: cac2661c53 ("esp4: Avoid skb_cow_data whenever possible")
Fixes: 03e2a30f6a ("esp6: Avoid skb_cow_data whenever possible")
Fixes: 7da0dde684 ("ip, udp: Support MSG_SPLICE_PAGES")
Fixes: 6d8192bd69 ("ip6, udp6: Support MSG_SPLICE_PAGES")
Reported-by: Hyunwoo Kim <imv4bel@gmail.com>
Reported-by: Kuan-Ting Chen <h3xrabbit@gmail.com>
Tested-by: Hyunwoo Kim <imv4bel@gmail.com>
Cc: stable@vger.kernel.org
Signed-off-by: Kuan-Ting Chen <h3xrabbit@gmail.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2026-05-05 06:38:30 +02:00
Eric Dumazet
c6bebaa744 ipv4: igmp: annotate data-races in igmp_heard_query()
Multiple cpus can run igmp_heard_query() concurrently.

Add missing READ_ONCE()/WRITE_ONCE() over following in_dev fields.

- mr_qrv
- mr_qi
- mr_qri
- mr_v1_seen
- mr_v2_seen

Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Reported-by: syzbot+ae9a171f239b14485310@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/netdev/69f38675.050a0220.3cbe47.0002.GAE@google.com
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260430164836.872079-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-05-01 17:11:42 -07:00
Eric Dumazet
4f34002e2e ipmr: prevent info-leak in pmr_cache_report()
Yiming Qian reported:

<quote>
 ipmr_cache_report()` allocates a report skb with `alloc_skb(128,
 GFP_ATOMIC)` and appends a `struct igmphdr` using `skb_put()`. In the
 non-`IGMPMSG_WHOLEPKT` path it initializes only:

 - `igmp->type`
 - `igmp->code`

 but does not initialize:

 - `igmp->csum`
 - `igmp->group`

 Later, `igmpmsg_netlink_event()` copies the bytes after `sizeof(struct
 igmpmsg)` into the `IPMRA_CREPORT_PKT` netlink attribute and emits
 `RTM_NEWCACHEREPORT` on `RTNLGRP_IPV4_MROUTE_R`.

 As a result, 6 bytes of stale heap data from the skb head are
 disclosed to userspace.
</quote>

Let's use skb_put_zero() instead of skb_put() to fix this bug.

Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Reported-by: Yiming Qian <yimingqian591@gmail.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Link: https://patch.msgid.link/20260430070611.4004529-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-05-01 17:01:40 -07:00
Fernando Fernandez Mancera
0bf00859d7 netfilter: nf_socket: skip socket lookup for non-first fragments
Both nft_socket and xt_socket relies on L4 headers to perform socket
lookup in the slow path. For fragmented packets, while the IP protocol
remains constant across all fragments, only the first fragment contains
the actual L4 header.

As the expression/match could be attached to a chain with a priority
lower than -400, it could bypass defragmentation.

Add a check for fragmentation in the lookup functions directly so the
problem is handled for both nft_socket and xt_socket at the same time.
In addition, future users of the functions would not need to care about
this.

Fixes: 902d6a4c2a ("netfilter: nf_defrag: Skip defrag if NOTRACK is set")
Fixes: 554ced0a6e ("netfilter: nf_tables: add support for native socket matching")
Signed-off-by: Fernando Fernandez Mancera <fmancera@suse.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2026-04-30 17:59:01 +02:00
Jakub Kicinski
67d7ae3340 netfilter pull request 26-04-28
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEjF9xRqF1emXiQiqU1w0aZmrPKyEFAmnwhAAACgkQ1w0aZmrP
 KyGSsxAAh1gE5UmUum0Q9x0K0a3C+Vh07c2YRw4zuI6sy0xh0W0ZQongj5p5QQUA
 dL8b9pAZkV0Kr0WKhOTDvz5HhUFNWH0I/5hppwJ94Swx0PcEq4P+PZ+8eEYH7jfp
 7bxSJu4vsjzGxn4qP6lzI221ICDsiifQisDE1+J0HyNyfV0Qr9oUIkW3usxiJsnP
 IsIMp/zk/9PNC+IOSlQCEwl7tO/86p5g1XyCOP/WUCDa2DfpfBTPWAueMTTacN8r
 Wgk+Butf6xJe7OfteGMJ07kg2oyqUr4pFiwoKog+MxV0EDQCQgm15t10AtYJl4D9
 IIHVBIw4e7MgwlS0P/F5Vhb860U+gguaGuwLx/UPW4QyUV8fkT+ileIvAZdxd15i
 RDwPup0Q+8fKeY9WnIOdvBpdPHh1T7UgrppoVwwwj6PxQZHCf6R6EgvtlftBNVyI
 Zlys4rSwtDG8pbPngVPoIZlPYGMnlx0IljXiQCijHVtnU61afp7D7Rv/gH+Se+N8
 2p9ne5rQ7MRevYdH07etWbMPmlZ/nbgbha9+hCC5jvZceyhekC7TCxfi2PtswGon
 uW1RQhuemZnHHvmtPzsQrHOddwCv7FmozKankdLoEfnYIfjkNywbJCAAnuD7jCg+
 s0utZwXb7uarQszEb7PMy3bCuHoKzqRy8ICw6SDBw2Vc7x2HFQk=
 =E/L6
 -----END PGP SIGNATURE-----

Merge tag 'nf-26-04-28' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf

Pablo Neira Ayuso says:

====================
Netfilter fixes for net

The following patchset contains Netfilter fixes for net:

1) IEEE1394 ARP payload contains no target hardware address in the
   ARP packet. Apparently, arp_tables was never updated to deal with
   IEEE1394 ARP properly. To deal with this, return no match in case
   the target hardware address selector is used, either for inverse or
   normal match. Moreover, arpt_mangle disallows mangling of the target
   hardware and IP address because, it is not worth to adjust the
   offset calculation to fix this, we suspect no users of arp_tables
   for this family.

2) Use list_del_rcu() to delete device hooks in nf_tables, this hook
   list is RCU protected, concurrent netlink dump readers can be
   walking on this list, fix it by adding a helper function and use it
   for consistency. From Florian Westphal.

3) Add list_splice_rcu(), this is useful for joining the local list of
   new device hooks to the RCU protected hook list in chain and
   flowtable. Reviewed by Paul E. McKenney.

4) Use list_splice_rcu() to publish the new device hooks in chain and
   flowtable to fix concurrent netlink dump traversal.

5) Add a new hook transaction object to track device hook deletions.
   The current approach moves device hooks to be deleted around during
   the preparation phase, this breaks concurrent RCU reader via netlink
   dump. This new hook transaction is combined with NFT_HOOK_REMOVE
   flag to annotate hooks for removal in the preparation phase.

6) xt_policy inbound policy check in strict mode can lead to
   out-of-bound access of the secpath array due to incorrect.
   The iteration over the secpath needs to be reversed in the inbound
   to check for the human readable policy, expecting inner in first
   position and outer in second position, the secpath from inbound
   actually stores outer in first position then in second position.
   From Jiexun Wang.

7) Fix possible zero shift in nft_bitwise triggering UBSAN splat,
   reject zero shift from control plane, from Kai Ma.

8) Replace simple_strtoul() in the conntrack SIP helper since it relies
   on nul-terminated strings. From Florian Westphal.

* tag 'nf-26-04-28' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
  netfilter: nf_conntrack_sip: don't use simple_strtoul
  netfilter: reject zero shift in nft_bitwise
  netfilter: xt_policy: fix strict mode inbound policy matching
  netfilter: nf_tables: add hook transactions for device deletions
  netfilter: nf_tables: join hook list via splice_list_rcu() in commit phase
  rculist: add list_splice_rcu() for private lists
  netfilter: nf_tables: use list_del_rcu for netlink hooks
  netfilter: arp_tables: fix IEEE1394 ARP payload parsing
====================

Link: https://patch.msgid.link/20260428095840.51961-1-pablo@netfilter.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-04-28 17:41:06 -07:00
Altan Hacigumus
2b9f6f7065 tcp: make probe0 timer handle expired user timeout
tcp_clamp_probe0_to_user_timeout() computes remaining time in jiffies
using subtraction with an unsigned lvalue.  If elapsed probing time
exceeds the configured TCP_USER_TIMEOUT, the underflow yields a large
value.

This ends up re-arming the probe timer for a full backoff interval
instead of expiring immediately, delaying connection teardown beyond
the configured timeout.

Fix this by preventing underflow so user-set timeout expiration is
handled correctly without extending the probe timer.

Fixes: 344db93ae3 ("tcp: make TCP_USER_TIMEOUT accurate for zero window probes")
Link: https://lore.kernel.org/r/20260414013634.43997-1-ahacigu.linux@gmail.com
Signed-off-by: Altan Hacigumus <ahacigu.linux@gmail.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260424014639.54110-1-ahacigu.linux@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-04-27 19:16:07 -07:00
Kuniyuki Iwashima
b3b6babf47 ipmr: Free mr_table after RCU grace period.
With CONFIG_IP_MROUTE_MULTIPLE_TABLES=n, ipmr_fib_lookup()
does not check if net->ipv4.mrt is NULL.

Since default_device_exit_batch() is called after ->exit_rtnl(),
a device could receive IGMP packets and access net->ipv4.mrt
during/after ipmr_rules_exit_rtnl().

If ipmr_rules_exit_rtnl() had already cleared it and freed the
memory, the access would trigger null-ptr-deref or use-after-free.

Let's fix it by using RCU helper and free mrt after RCU grace
period.

In addition, check_net(net) is added to mroute_clean_tables()
and ipmr_cache_unresolved() to synchronise via mfc_unres_lock.
This prevents ipmr_cache_unresolved() from putting skb into
c->_c.mfc_un.unres.unresolved after mroute_clean_tables()
purges it.

For the same reason, timer_shutdown_sync() is moved after
mroute_clean_tables().

Since rhltable_destroy() holds mutex internally, rcu_work is
used, and it is placed as the first member because rcu_head
must be placed within <4K offset.  mr_table is alraedy 3864
bytes without rcu_work.

Note that IP6MR is not yet converted to ->exit_rtnl(), so this
change is not needed for now but will be.

Fixes: b22b018674 ("ipmr: Convert ipmr_net_exit_batch() to ->exit_rtnl().")
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20260423053456.4097409-1-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-04-27 18:46:17 -07:00
Linus Torvalds
64edfa6506 Delete some obsolete networking code
Old code like amateur radio and NFC have long been a burden
 to core networking developers. syzbot loves to find bugs
 in BKL-era code, and noobs try to fix them.
 
 If we want to have a fighting chance of surviving the LLM-pocalypse
 this code needs to find a dedicated owner or get deleted.
 We've talked about these deletions multiple times in the past
 and every time someone wanted the code to stay. It is never
 very clear to me how many of those people actually use the code
 vs are just nostalgic to see it go. Amateur radio did have
 occasional users (or so I think) but most users switched
 to user space implementations since its all super slow stuff.
 Nobody stepped up to maintain the kernel code.
 
 We were lucky enough to find someone who wants to help with NFC
 so we're giving that a chance. Let's try to put the rest of
 this code behind us.
 
 Signed-off-by: Jakub Kicinski <kuba@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmnqqWYACgkQMUZtbf5S
 IrtEpQ/9F5+8POE6dg6gJVLDKx1+i6GiaOIweAl8h5DatzhBAAGuGr9JyTw0P/iy
 QX7/SU8WQIhi+LVTYBX9M5bJ3Rf+Iws4dll0CyoTTdOFvGwCAck8Ee/w+1gZdsQY
 aG0mQPmftfMEdZGX3KXt8UPDWG7QX4w1gSqxqYcSs1ohN6Txi1F94tmgqXgzYHzv
 vxWP3cF3XTv4eM6BpQj4tiLT3hvrTUfoCZEn9oF4Hn+miYU/yNlWxh0/pmfNjcxd
 vpNN0VfJVK48uPrj57Ep2x9OjkHPviojrUZT0Y55ENBhn1Lykry4MaxsJVsVYhuC
 OqJHQYTFyxwT/USTJxs1gplFyO0i37oCEEt43BKm2KS7rYHgc4pQgMJz7R2IS3wL
 z1xFl45QFt5kX3pw8BvWPXwBomkbDeFORB40Y1qc8RHMfAUKqOhbhzV8rDq9uKup
 0nJxdijdh3/2qdO+LB1pU5rq/MbfAxOQSnRJmKLoKLVljaZHMAVbm829sdap8OM+
 VMnyPF5hOAuTHV0NZJJ2BbcznI4MFDxM1lNEWFuRC39RQeeGRIHsNMjvs4HMHLaW
 V827UBXpUOK6HR3nGCKX3VpLJByUYAIkdIKvRugbWdynvXAw+FJUHx4wRzvFi6oi
 E7ucUY+FI5YOS1rmQJ+rqBjhThcIAdj2U9SNAykDKRVa7zPEUMU=
 =3vMU
 -----END PGP SIGNATURE-----

Merge tag 'net-deletions' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next

Pull networking deletions from Jakub Kicinski:
 "Delete some obsolete networking code

  Old code like amateur radio and NFC have long been a burden to core
  networking developers. syzbot loves to find bugs in BKL-era code, and
  noobs try to fix them.

  If we want to have a fighting chance of surviving the LLM-pocalypse
  this code needs to find a dedicated owner or get deleted. We've talked
  about these deletions multiple times in the past and every time
  someone wanted the code to stay. It is never very clear to me how many
  of those people actually use the code vs are just nostalgic to see it
  go. Amateur radio did have occasional users (or so I think) but most
  users switched to user space implementations since its all super slow
  stuff. Nobody stepped up to maintain the kernel code.

  We were lucky enough to find someone who wants to help with NFC so
  we're giving that a chance. Let's try to put the rest of this code
  behind us"

* tag 'net-deletions' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next:
  drivers: net: 8390: wd80x3: Remove this driver
  drivers: net: 8390: ultra: Remove this driver
  drivers: net: 8390: AX88190: Remove this driver
  drivers: net: fujitsu: fmvj18x: Remove this driver
  drivers: net: smsc: smc91c92: Remove this driver
  drivers: net: smsc: smc9194: Remove this driver
  drivers: net: amd: nmclan: Remove this driver
  drivers: net: amd: lance: Remove this driver
  drivers: net: 3com: 3c589: Remove this driver
  drivers: net: 3com: 3c574: Remove this driver
  drivers: net: 3com: 3c515: Remove this driver
  drivers: net: 3com: 3c509: Remove this driver
  net: packetengines: remove obsolete yellowfin driver and vendor dir
  net: packetengines: remove obsolete hamachi driver
  net: remove unused ATM protocols and legacy ATM device drivers
  net: remove ax25 and amateur radio (hamradio) subsystem
  net: remove ISDN subsystem and Bluetooth CMTP
  caif: remove CAIF NETWORK LAYER
2026-04-24 09:41:58 -07:00
Linus Torvalds
e728258deb Including fixes from Netfilter.
Steady stream of fixes. Last two weeks feel comparable to the two
 weeks before the merge window. Lots of AI-aided bug discovery.
 A newer big source is Sashiko/Gemini (Roman Gushchin's system),
 which points out issues in existing code during patch review
 (maybe 25% of fixes here likely originating from Sashiko).
 Nice thing is these are often fixed by the respective maintainers,
 not drive-bys.
 
 Current release - new code bugs:
 
  - kconfig: MDIO_PIC64HPSC should depend on ARCH_MICROCHIP
 
 Previous releases - regressions:
 
  - add async ndo_set_rx_mode and switch drivers which we promised
    to be called under the per-netdev mutex to it
 
  - dsa: remove duplicate netdev_lock_ops() for conduit ethtool ops
 
  - hv_sock: report EOF instead of -EIO for FIN
 
  - vsock/virtio: fix MSG_PEEK calculation on bytes to copy
 
 Previous releases - always broken:
 
  - ipv6: fix possible UAF in icmpv6_rcv()
 
  - icmp: validate reply type before using icmp_pointers
 
  - af_unix: drop all SCM attributes for SOCKMAP
 
  - netfilter: fix a number of bugs in the osf (OS fingerprinting)
 
  - eth: intel: fix timestamp interrupt configuration for E825C
 
 Misc:
 
  - bunch of data-race annotations
 
 Signed-off-by: Jakub Kicinski <kuba@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmnqkmMACgkQMUZtbf5S
 Iruqig/+NSg/YwEkZLbSaW+0LqNMIdOVZPdves97YAvNRdcKvgAPB5I13/G+koCz
 bRpmtdDLYTkfMFLaM582DO6XeO3Hsz/BrRRuRbyEz7lTi7PtxTEs1J+6W6NxGOQ2
 30f3J7OGudGlinsFV9VkJe81rvFbKZFZ9fGPmOcVzzzfLvT3rrt20iVvMOyM+PpD
 H0ixFW+myescEx6AQoGcVs/sDveJ4bpLpNG3p4gADh3Laj9HKSl00kudCIOQ1Kdy
 SEHsSZs3A87ueOnGwIBl/x24zVWGTGHyKcmc5ENPUSIaNGOWzmBxvfhb5dZ989RQ
 HQix+FMue21k4JypYwrdhU3MAnMDPLk+FDp4XJuwJ5I/caNLZXS2geIlnXOI5IFJ
 ojuq4pF5njoWtvkWGvxxRM+shIMiDUYUK+k9xTMqmge88O9ahGIAYb2qyKL+P6Sl
 mMuSRcArk6pw3lPbUA4u1wEaU52IdxRJDPQA/Ai3O5UVTfemJO/VqawQfuBE274g
 KZXG4x0lwE+LSyoguTnSqhMCJk1ZXAeHjtpz1Yo3CEHOwCH9MxEEL/dldAXWZiWN
 K0nLcUQ8fg3GnmOEzYw1gzDVJrgkR1eIrh6OCpw+UGCg0Af0HE6C6QBL9q59YhQw
 DjLJAUNM8puBNIh9paCsHf1aIcFpPXBcR5dKoufCQx41x1OOqew=
 =knNy
 -----END PGP SIGNATURE-----

Merge tag 'net-7.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Pull  networking fixes from Jakub Kicinski:
 "Including fixes from Netfilter.

  Steady stream of fixes. Last two weeks feel comparable to the two
  weeks before the merge window. Lots of AI-aided bug discovery. A newer
  big source is Sashiko/Gemini (Roman Gushchin's system), which points
  out issues in existing code during patch review (maybe 25% of fixes
  here likely originating from Sashiko). Nice thing is these are often
  fixed by the respective maintainers, not drive-bys.

  Current release - new code bugs:

   - kconfig: MDIO_PIC64HPSC should depend on ARCH_MICROCHIP

  Previous releases - regressions:

   - add async ndo_set_rx_mode and switch drivers which we promised to
     be called under the per-netdev mutex to it

   - dsa: remove duplicate netdev_lock_ops() for conduit ethtool ops

   - hv_sock: report EOF instead of -EIO for FIN

   - vsock/virtio: fix MSG_PEEK calculation on bytes to copy

  Previous releases - always broken:

   - ipv6: fix possible UAF in icmpv6_rcv()

   - icmp: validate reply type before using icmp_pointers

   - af_unix: drop all SCM attributes for SOCKMAP

   - netfilter: fix a number of bugs in the osf (OS fingerprinting)

   - eth: intel: fix timestamp interrupt configuration for E825C

  Misc:

   - bunch of data-race annotations"

* tag 'net-7.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (148 commits)
  rxrpc: Fix error handling in rxgk_extract_token()
  rxrpc: Fix re-decryption of RESPONSE packets
  rxrpc: Fix rxrpc_input_call_event() to only unshare DATA packets
  rxrpc: Fix missing validation of ticket length in non-XDR key preparsing
  rxgk: Fix potential integer overflow in length check
  rxrpc: Fix conn-level packet handling to unshare RESPONSE packets
  rxrpc: Fix potential UAF after skb_unshare() failure
  rxrpc: Fix rxkad crypto unalignment handling
  rxrpc: Fix memory leaks in rxkad_verify_response()
  net: rds: fix MR cleanup on copy error
  m68k: mvme147: Make me the maintainer
  net: txgbe: fix firmware version check
  selftests/bpf: check epoll readiness during reuseport migration
  tcp: call sk_data_ready() after listener migration
  vhost_net: fix sleeping with preempt-disabled in vhost_net_busy_poll()
  ipv6: Cap TLV scan in ip6_tnl_parse_tlv_enc_lim
  tipc: fix double-free in tipc_buf_append()
  llc: Return -EINPROGRESS from llc_ui_connect()
  ipv4: icmp: validate reply type before using icmp_pointers
  selftests/net: packetdrill: cover RFC 5961 5.2 challenge ACK on both edges
  ...
2026-04-23 16:50:42 -07:00
Zhenzhong Wu
3864c6ba1e tcp: call sk_data_ready() after listener migration
When inet_csk_listen_stop() migrates an established child socket from
a closing listener to another socket in the same SO_REUSEPORT group,
the target listener gets a new accept-queue entry via
inet_csk_reqsk_queue_add(), but that path never notifies the target
listener's waiters. A nonblocking accept() still works because it
checks the queue directly, but poll()/epoll_wait() waiters and
blocking accept() callers can also remain asleep indefinitely.

Call READ_ONCE(nsk->sk_data_ready)(nsk) after a successful migration
in inet_csk_listen_stop().

However, after inet_csk_reqsk_queue_add() succeeds, the ref acquired
in reuseport_migrate_sock() is effectively transferred to
nreq->rsk_listener. Another CPU can then dequeue nreq via accept()
or listener shutdown, hit reqsk_put(), and drop that listener ref.
Since listeners are SOCK_RCU_FREE, wrap the post-queue_add()
dereferences of nsk in rcu_read_lock()/rcu_read_unlock(), which also
covers the existing sock_net(nsk) access in that path.

The reqsk_timer_handler() path does not need the same changes for two
reasons: half-open requests become readable only after the final ACK,
where tcp_child_process() already wakes the listener; and once nreq is
visible via inet_ehash_insert(), the success path no longer touches
nsk directly.

Fixes: 54b92e8419 ("tcp: Migrate TCP_ESTABLISHED/TCP_SYN_RECV sockets in accept queues.")
Cc: stable@vger.kernel.org
Suggested-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Signed-off-by: Zhenzhong Wu <jt26wzz@gmail.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260422024554.130346-2-jt26wzz@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-04-23 11:54:43 -07:00
Ruide Cao
67bf002a2d ipv4: icmp: validate reply type before using icmp_pointers
Extended echo replies use ICMP_EXT_ECHOREPLY as the outbound reply type.
That value is outside the range covered by icmp_pointers[], which only
describes the traditional ICMP types up to NR_ICMP_TYPES.

Avoid consulting icmp_pointers[] for reply types outside that range, and
use array_index_nospec() for the remaining in-range lookup. Normal ICMP
replies keep their existing behavior unchanged.

Fixes: d329ea5bd8 ("icmp: add response to RFC 8335 PROBE messages")
Cc: stable@kernel.org
Reported-by: Yuan Tan <yuantan098@gmail.com>
Reported-by: Yifan Wu <yifanwucs@gmail.com>
Reported-by: Juefei Pu <tomapufckgml@gmail.com>
Reported-by: Xin Liu <bird@lzu.edu.cn>
Signed-off-by: Ruide Cao <caoruide123@gmail.com>
Signed-off-by: Ren Wei <n05ec@lzu.edu.cn>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/0dace90c01a5978e829ca741ef684dbd7304ce62.1776628519.git.caoruide123@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-04-23 11:40:08 -07:00
Jiayuan Chen
42726ec644 tcp: send a challenge ACK on SEG.ACK > SND.NXT
RFC 5961 Section 5.2 validates an incoming segment's ACK value
against the range [SND.UNA - MAX.SND.WND, SND.NXT] and states:

  "All incoming segments whose ACK value doesn't satisfy the above
   condition MUST be discarded and an ACK sent back."

Commit 354e4aa391 ("tcp: RFC 5961 5.2 Blind Data Injection Attack
Mitigation") opted Linux into this mitigation and implements the
challenge ACK on the lower side (SEG.ACK < SND.UNA - MAX.SND.WND),
but the symmetric upper side (SEG.ACK > SND.NXT) still takes the
pre-RFC-5961 path and silently returns
SKB_DROP_REASON_TCP_ACK_UNSENT_DATA, even though RFC 793 Section 3.9
(now RFC 9293 Section 3.10.7.4) has always required:

  "If the ACK acknowledges something not yet sent (SEG.ACK > SND.NXT)
   then send an ACK, drop the segment, and return."

Complete the mitigation by sending a challenge ACK on that branch,
reusing the existing tcp_send_challenge_ack() path which already
enforces the per-socket RFC 5961 Section 7 rate limit via
__tcp_oow_rate_limited().  FLAG_NO_CHALLENGE_ACK is honoured for
symmetry with the lower-edge case.

Update the existing tcp_ts_recent_invalid_ack.pkt selftest, which
drives this exact path, to consume the new challenge ACK.

Fixes: 354e4aa391 ("tcp: RFC 5961 5.2 Blind Data Injection Attack Mitigation")
Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260422123605.320000-2-jiayuan.chen@linux.dev
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-04-23 11:04:00 -07:00
Jakub Kicinski
dd8d4bc28a net: remove ax25 and amateur radio (hamradio) subsystem
Remove the amateur radio (AX.25, NET/ROM, ROSE) protocol implementation
and all associated hamradio device drivers from the kernel tree.
This set of protocols has long been a huge bug/syzbot magnet,
and since nobody stepped up to help us deal with the influx
of the AI-generated bug reports we need to move it out of tree
to protect our sanity.

The code is moved to an out-of-tree repo:
https://github.com/linux-netdev/mod-orphan
if it's cleaned up and reworked there we can accept it back.

Minimal stub headers are kept for include/net/ax25.h (AX25_P_IP,
AX25_ADDR_LEN, ax25_address) and include/net/rose.h (ROSE_ADDR_LEN)
so that the conditional integration code in arp.c and tun.c continues
to compile and work when the out-of-tree modules are loaded.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Acked-by: Stephen Hemminger <stephen@networkplumber.org>
Acked-by: Carlos Bilbao <carlos.bilbao@kernel.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Acked-by: Toke Høiland-Jørgensen <toke@toke.dk>
Link: https://patch.msgid.link/20260421021824.1293976-1-kuba@kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2026-04-23 10:24:02 -07:00
KaFai Wan
54377fcab5 bpf: Reject TCP_NODELAY in bpf-tcp-cc
A BPF TCP congestion control program can call bpf_setsockopt() from
its callbacks. In current kernels, if it calls
bpf_setsockopt(TCP_NODELAY) from cwnd_event_tx_start(), the call can
re-enter the TCP transmit path before the outer tcp_transmit_skb()
has completed and advanced the send head.

This can re-trigger CA_EVENT_TX_START and lead to unbounded recursion:

  tcp_transmit_skb()
    -> tcp_event_data_sent()
      -> tcp_ca_event(sk, CA_EVENT_TX_START)
        -> cwnd_event_tx_start()
          -> bpf_setsockopt(TCP_NODELAY)
            -> tcp_push_pending_frames()
              -> tcp_write_xmit()
                -> tcp_transmit_skb()

This leads to unbounded recursion and can overflow the kernel stack.

Reject TCP_NODELAY with -EOPNOTSUPP for bpf-tcp-cc by introducing
a dedicated setsockopt proto for BPF_PROG_TYPE_STRUCT_OPS TCP
congestion control programs. To keep it simple, all tcp-cc ops is
rejected for TCP_NODELAY.

Fixes: 7e41df5dbb ("bpf: Add a few optnames to bpf_setsockopt")
Suggested-by: Martin KaFai Lau <martin.lau@linux.dev>
Signed-off-by: KaFai Wan <kafai.wan@linux.dev>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Reviewed-by: Jiayuan Chen <jiayuan.chen@linux.dev>
Link: https://patch.msgid.link/20260421155804.135786-3-kafai.wan@linux.dev
2026-04-22 12:58:57 -07:00
Pablo Neira Ayuso
1e8e3f449b netfilter: arp_tables: fix IEEE1394 ARP payload parsing
Weiming Shi says:

"arp_packet_match() unconditionally parses the ARP payload assuming two
hardware addresses are present (source and target). However,
IPv4-over-IEEE1394 ARP (RFC 2734) omits the target hardware address
field, and arp_hdr_len() already accounts for this by returning a
shorter length for ARPHRD_IEEE1394 devices.

As a result, on IEEE1394 interfaces arp_packet_match() advances past a
nonexistent target hardware address and reads the wrong bytes for both
the target device address comparison and the target IP address. This
causes arptables rules to match against garbage data, leading to
incorrect filtering decisions: packets that should be accepted may be
dropped and vice versa.

The ARP stack in net/ipv4/arp.c (arp_create and arp_process) already
handles this correctly by skipping the target hardware address for
ARPHRD_IEEE1394. Apply the same pattern to arp_packet_match()."

Mangle the original patch to always return 0 (no match) in case user
matches on the target hardware address which is never present in
IEEE1394.

Note that this returns 0 (no match) for either normal and inverse match
because matching in the target hardware address in ARPHRD_IEEE1394 has
never been supported by arptables. This is intentional, matching on the
target hardware address should never evaluate true for ARPHRD_IEEE1394.

Moreover, adjust arpt_mangle to drop the packet too as AI suggests:

In arpt_mangle, the logic assumes a standard ARP layout. Because
IEEE1394 (FireWire) omits the target hardware address, the linear
pointer arithmetic miscalculates the offset for the target IP address.
This causes mangling operations to write to the wrong location, leading
to packet corruption. To ensure safety, this patch drops packets
(NF_DROP) when mangling is requested for these fields on IEEE1394
devices, as the current implementation cannot correctly map the FireWire
ARP payload.

This omits both mangling target hardware and IP address. Even if IP
address mangling should be possible in IEEE1394, this would require
to adjust arpt_mangle offset calculation, which has never been
supported.

Based on patch from Weiming Shi <bestswngs@gmail.com>.

Fixes: 6752c8db8e ("firewire net, ipv4 arp: Extend hardware address and remove driver-level packet inspection.")
Reported-by: Xiang Mei <xmei5@asu.edu>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2026-04-21 12:44:39 +02:00
Pablo Neira Ayuso
6eda0d771f netfilter: nat: use kfree_rcu to release ops
Florian Westphal says:

"Historically this is not an issue, even for normal base hooks: the data
path doesn't use the original nf_hook_ops that are used to register the
callbacks.

However, in v5.14 I added the ability to dump the active netfilter
hooks from userspace.

This code will peek back into the nf_hook_ops that are available
at the tail of the pointer-array blob used by the datapath.

The nat hooks are special, because they are called indirectly from
the central nat dispatcher hook. They are currently invisible to
the nfnl hook dump subsystem though.

But once that changes the nat ops structures have to be deferred too."

Update nf_nat_register_fn() to deal with partial exposition of the hooks
from error path which can be also an issue for nfnetlink_hook.

Fixes: e2cf17d377 ("netfilter: add new hook nfnl subsystem")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2026-04-20 23:45:41 +02:00
Michael Bommarito
ec54093e6a xfrm: ah: account for ESN high bits in async callbacks
AH allocates its temporary auth/ICV layout differently when ESN is enabled:
the async ahash setup appends a 4-byte seqhi slot before the ICV or
auth_data area, but the async completion callbacks still reconstruct the
temporary layout as if seqhi were absent.

With an async AH implementation selected, that makes AH copy or compare
the wrong bytes on both the IPv4 and IPv6 paths. In UML repro on IPv4 AH
with ESN and forced async hmac(sha1), ping fails with 100% packet loss,
and the callback logs show the pre-fix drift:

  ah4 output_done: esn=1 err=0 icv_off=20 expected_off=24
  ah4 input_done: esn=1 auth_off=20 expected_auth_off=24 icv_off=32 expected_icv_off=36

Reconstruct the callback-side layout the same way the setup path built it
by skipping the ESN seqhi slot before locating the saved auth_data or ICV.
Per RFC 4302, the ESN high-order 32 bits participate in the AH ICV
computation, so the async callbacks must account for the seqhi slot.

Post-fix, the same IPv4 AH+ESN+forced-async-hmac(sha1) UML repro shows
the corrected offset (ah4 output_done: esn=1 err=0 icv_off=24
expected_off=24) and ping succeeds; net/ipv4/ah4.o and net/ipv6/ah6.o
build clean at W=1. IPv6 AH+ESN was not exercised at runtime, and the
change has not been tested against a real async hardware AH engine.

Fixes: d4d573d033 ("{IPv4,xfrm} Add ESN support for AH egress part")
Fixes: d8b2a8600b ("{IPv4,xfrm} Add ESN support for AH ingress part")
Fixes: 26dd70c3fa ("{IPv6,xfrm} Add ESN support for AH egress part")
Fixes: 8d6da6f325 ("{IPv6,xfrm} Add ESN support for AH ingress part")
Cc: stable@vger.kernel.org
Assisted-by: Codex:gpt-5-4
Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Michael Bommarito <michael.bommarito@gmail.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2026-04-20 09:28:34 +02:00
Eric Dumazet
9e89b9d03a tcp: annotate data-races around tp->plb_rehash
tcp_get_timestamping_opt_stats() intentionally runs lockless, we must
add READ_ONCE() and WRITE_ONCE() annotations to keep KCSAN happy.

Fixes: 29c1c44646 ("tcp: add u32 counter in tcp_sock and an SNMP counter for PLB")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260416200319.3608680-15-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-04-18 11:10:14 -07:00
Eric Dumazet
3a63b3d160 tcp: annotate data-races around (tp->write_seq - tp->snd_nxt)
tcp_get_timestamping_opt_stats() intentionally runs lockless, we must
add READ_ONCE() annotations to keep KCSAN happy.

WRITE_ONCE() annotations are already present.

Fixes: e08ab0b377 ("tcp: add bytes not sent to SCM_TIMESTAMPING_OPT_STATS")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260416200319.3608680-14-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-04-18 11:10:13 -07:00
Eric Dumazet
71c675358b tcp: annotate data-races around tp->timeout_rehash
tcp_get_timestamping_opt_stats() intentionally runs lockless, we must
add READ_ONCE() and WRITE_ONCE() annotations to keep KCSAN happy.

Fixes: 32efcc06d2 ("tcp: export count for rehash attempts")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260416200319.3608680-13-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-04-18 11:10:13 -07:00
Eric Dumazet
290b693ce7 tcp: annotate data-races around tp->srtt_us
tcp_get_timestamping_opt_stats() intentionally runs lockless, we must
add READ_ONCE() and WRITE_ONCE() annotations to keep KCSAN happy.

Fixes: e8bd8fca67 ("tcp: add SRTT to SCM_TIMESTAMPING_OPT_STATS")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260416200319.3608680-12-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-04-18 11:10:13 -07:00
Eric Dumazet
62585690e6 tcp: annotate data-races around tp->reord_seen
tcp_get_timestamping_opt_stats() intentionally runs lockless, we must
add READ_ONCE() and WRITE_ONCE() annotations to keep KCSAN happy.

Fixes: 7ec65372ca ("tcp: add stat of data packet reordering events")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260416200319.3608680-11-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-04-18 11:10:13 -07:00
Eric Dumazet
a984705ca8 tcp: annotate data-races around tp->dsack_dups
tcp_get_timestamping_opt_stats() intentionally runs lockless, we must
add READ_ONCE() and WRITE_ONCE() annotations to keep KCSAN happy.

Fixes: 7e10b6554f ("tcp: add dsack blocks received stats")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260416200319.3608680-10-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-04-18 11:10:13 -07:00
Eric Dumazet
5efc7b9f7c tcp: annotate data-races around tp->bytes_retrans
tcp_get_timestamping_opt_stats() intentionally runs lockless, we must
add READ_ONCE() and WRITE_ONCE() annotations to keep KCSAN happy.

Fixes: fb31c9b9f6 ("tcp: add data bytes retransmitted stats")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260416200319.3608680-9-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-04-18 11:10:13 -07:00
Eric Dumazet
ee43e957ce tcp: annotate data-races around tp->bytes_sent
tcp_get_timestamping_opt_stats() intentionally runs lockless, we must
add READ_ONCE() and WRITE_ONCE() annotations to keep KCSAN happy.

Fixes: ba113c3aa7 ("tcp: add data bytes sent stats")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260416200319.3608680-8-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-04-18 11:10:12 -07:00
Eric Dumazet
124199444d tcp: add data-race annotations for TCP_NLA_SNDQ_SIZE
tcp_get_timestamping_opt_stats() intentionally runs lockless, we must
add READ_ONCE() and WRITE_ONCE() annotations to keep KCSAN happy.

Fixes: 87ecc95d81 ("tcp: add send queue size stat in SCM_TIMESTAMPING_OPT_STATS")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260416200319.3608680-7-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-04-18 11:10:12 -07:00
Eric Dumazet
faa886ad3c tcp: annotate data-races around tp->delivered and tp->delivered_ce
tcp_get_timestamping_opt_stats() intentionally runs lockless, we must
add READ_ONCE() and WRITE_ONCE() annotations to keep KCSAN happy.

Fixes: feb5f2ec64 ("tcp: export packets delivery info")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260416200319.3608680-6-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-04-18 11:10:12 -07:00
Eric Dumazet
fd571afb05 tcp: annotate data-races around tp->snd_ssthresh
tcp_get_timestamping_opt_stats() intentionally runs lockless, we must
add READ_ONCE() and WRITE_ONCE() annotations to keep KCSAN happy.

Fixes: 7156d194a0 ("tcp: add snd_ssthresh stat in SCM_TIMESTAMPING_OPT_STATS")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260416200319.3608680-5-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-04-18 11:10:12 -07:00
Eric Dumazet
829ba1f329 tcp: add data-races annotations around tp->reordering, tp->snd_cwnd
tcp_get_timestamping_opt_stats() intentionally runs lockless, we must
add READ_ONCE(), WRITE_ONCE() data_race() annotations to keep KCSAN happy.

Fixes: bb7c19f960 ("tcp: add related fields into SCM_TIMESTAMPING_OPT_STATS")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260416200319.3608680-4-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-04-18 11:10:12 -07:00
Eric Dumazet
21e92a38cf tcp: add data-race annotations around tp->data_segs_out and tp->total_retrans
tcp_get_timestamping_opt_stats() intentionally runs lockless, we must
add READ_ONCE() and WRITE_ONCE() annotations to keep KCSAN happy.

Fixes: 7e98102f48 ("tcp: record pkts sent and retransmistted")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260416200319.3608680-3-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-04-18 11:10:12 -07:00
Eric Dumazet
267bf3cf9a tcp: annotate data-races in tcp_get_info_chrono_stats()
tcp_get_timestamping_opt_stats() does not own the socket lock,
this is intentional.

It calls tcp_get_info_chrono_stats() while other threads could
change chrono fields in tcp_chrono_set().

I do not think we need coherent TCP socket state snapshot
in tcp_get_timestamping_opt_stats(), I chose to only
add annotations to keep KCSAN happy.

Fixes: 1c885808e4 ("tcp: SOF_TIMESTAMPING_OPT_STATS option for SO_TIMESTAMPING")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260416200319.3608680-2-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-04-18 11:10:11 -07:00
Jiayuan Chen
29c95185ba nexthop: fix IPv6 route referencing IPv4 nexthop
syzbot reported a panic [1] [2].

When an IPv6 nexthop is replaced with an IPv4 nexthop, the has_v4 flag
of all groups containing this nexthop is not updated. This is because
nh_group_v4_update is only called when replacing AF_INET to AF_INET6,
but the reverse direction (AF_INET6 to AF_INET) is missed.

This allows a stale has_v4=false to bypass fib6_check_nexthop, causing
IPv6 routes to be attached to groups that effectively contain only AF_INET
members. Subsequent route lookups then call nexthop_fib6_nh() which
returns NULL for the AF_INET member, leading to a NULL pointer
dereference.

Fix by calling nh_group_v4_update whenever the family changes, not just
AF_INET to AF_INET6.

Reproducer:
	# AF_INET6 blackhole
	ip -6 nexthop add id 1 blackhole
	# group with has_v4=false
	ip nexthop add id 100 group 1
	# replace with AF_INET (no -6), has_v4 stays false
	ip nexthop replace id 1 blackhole
	# pass stale has_v4 check
	ip -6 route add 2001:db8::/64 nhid 100
	# panic
	ping -6 2001:db8::1

[1] https://syzkaller.appspot.com/bug?id=e17283eb2f8dcf3dd9b47fe6f67a95f71faadad0
[2] https://syzkaller.appspot.com/bug?id=8699b6ae54c9f35837d925686208402949e12ef3
Fixes: 7bf4796dd0 ("nexthops: add support for replace")
Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/20260413114522.147784-1-jiayuan.chen@linux.dev
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2026-04-16 13:48:30 +02:00
Linus Torvalds
334fbe734e mm.git review status for linus..mm-stable
Everything:
 
 Total patches:       368
 Reviews/patch:       1.56
 Reviewed rate:       74%
 
 Excluding DAMON:
 
 Total patches:       316
 Reviews/patch:       1.77
 Reviewed rate:       81%
 
 Excluding DAMON and zram:
 
 Total patches:       306
 Reviews/patch:       1.81
 Reviewed rate:       82%
 
 Excluding DAMON, zram and maple_tree:
 
 Total patches:       276
 Reviews/patch:       2.01
 Reviewed rate:       91%
 
 Significant patch series in this merge:
 
 - The 30 patch series "maple_tree: Replace big node with maple copy"
   from Liam Howlett is mainly prepararatory work for ongoing development
   but it does reduce stack usage and is an improvement.
 
 - The 12 patch series "mm, swap: swap table phase III: remove swap_map"
   from Kairui Song offers memory savings by removing the static swap_map.
   It also yields some CPU savings and implements several cleanups.
 
 - The 2 patch series "mm: memfd_luo: preserve file seals" from Pratyush
   Yadav adds file seal preservation to LUO's memfd code.
 
 - The 2 patch series "mm: zswap: add per-memcg stat for incompressible
   pages" from Jiayuan Chen adds additional userspace stats reportng to
   zswap.
 
 - The 4 patch series "arch, mm: consolidate empty_zero_page" from Mike
   Rapoport implements some cleanups for our handling of ZERO_PAGE() and
   zero_pfn.
 
 - The 2 patch series "mm/kmemleak: Improve scan_should_stop()
   implementation" from Zhongqiu Han provides an robustness improvement and
   some cleanups in the kmemleak code.
 
 - The 4 patch series "Improve khugepaged scan logic" from Vernon Yang
   "improves the khugepaged scan logic and reduces CPU consumption by
   prioritizing scanning tasks that access memory frequently".
 
 - The 2 patch series "Make KHO Stateless" from Jason Miu simplifies
   Kexec Handover by "transitioning KHO from an xarray-based metadata
   tracking system with serialization to a radix tree data structure that
   can be passed directly to the next kernel"
 
 - The 3 patch series "mm: vmscan: add PID and cgroup ID to vmscan
   tracepoints" from Thomas Ballasi and Steven Rostedt enhances vmscan's
   tracepointing.
 
 - The 5 patch series "mm: arch/shstk: Common shadow stack mapping helper
   and VM_NOHUGEPAGE" from Catalin Marinas is a cleanup for the shadow
   stack code: remove per-arch code in favour of a generic implementation.
 
 - The 2 patch series "Fix KASAN support for KHO restored vmalloc
   regions" from Pasha Tatashin fixes a WARN() which can be emitted the KHO
   restores a vmalloc area.
 
 - The 4 patch series "mm: Remove stray references to pagevec" from Tal
   Zussman provides several cleanups, mainly udpating references to "struct
   pagevec", which became folio_batch three years ago.
 
 - The 17 patch series "mm: Eliminate fake head pages from vmemmap
   optimization" from Kiryl Shutsemau simplifies the HugeTLB vmemmap
   optimization (HVO) by changing how tail pages encode their relationship
   to the head page.
 
 - The 2 patch series "mm/damon/core: improve DAMOS quota efficiency for
   core layer filters" from SeongJae Park improves two problematic
   behaviors of DAMOS that makes it less efficient when core layer filters
   are used.
 
 - The 3 patch series "mm/damon: strictly respect min_nr_regions" from
   SeongJae Park improves DAMON usability by extending the treatment of the
   min_nr_regions user-settable parameter.
 
 - The 3 patch series "mm/page_alloc: pcp locking cleanup" from Vlastimil
   Babka is a proper fix for a previously hotfixed SMP=n issue.  Code
   simplifications and cleanups ennsed.
 
 - The 16 patch series "mm: cleanups around unmapping / zapping" from
   David Hildenbrand implements "a bunch of cleanups around unmapping and
   zapping.  Mostly simplifications, code movements, documentation and
   renaming of zapping functions".
 
 - The 6 patch series "support batched checking of the young flag for
   MGLRU" from Baolin Wang supports batched checking of the young flag for
   MGLRU.  It's part cleanups; one benchmark shows large performance
   benefits for arm64.
 
 - The 5 patch series "memcg: obj stock and slab stat caching cleanups"
   from Johannes Weiner provides memcg cleanup and robustness improvements.
 
 - The 5 patch series "Allow order zero pages in page reporting" from
   Yuvraj Sakshith enhances page_reporting's free page reporting - it is
   presently and undesirably order-0 pages when reporting free memory.
 
 - The 6 patch series "mm: vma flag tweaks" from Lorenzo Stoakes is
   cleanup work following from the recent conversion of the VMA flags to a
   bitmap.
 
 - The 10 patch series "mm/damon: add optional debugging-purpose sanity
   checks" from SeongJae Park adds some more developer-facing debug checks
   into DAMON core.
 
 - The 2 patch series "mm/damon: test and document power-of-2
   min_region_sz requirement" from SeongJae Park adds an additional DAMON
   kunit test and makes some adjustments to the addr_unit parameter
   handling.
 
 - The 3 patch series "mm/damon/core: make passed_sample_intervals
   comparisons overflow-safe" from SeongJae Park fixes a hard-to-hit time
   overflow issue in DAMON core.
 
 - The 7 patch series "mm/damon: improve/fixup/update ratio calculation,
   test and documentation" from SeongJae Park is a "batch of misc/minor
   improvements and fixups" for DAMON.
 
 - The 4 patch series "mm: move vma_(kernel|mmu)_pagesize() out of
   hugetlb.c" from David Hildenbrand fixes a possible issue with dax-device
   when CONFIG_HUGETLB=n.  Some code movement was required.
 
 - The 6 patch series "zram: recompression cleanups and tweaks" from
   Sergey Senozhatsky provides "a somewhat random mix of fixups,
   recompression cleanups and improvements" in the zram code.
 
 - The 11 patch series "mm/damon: support multiple goal-based quota
   tuning algorithms" from SeongJae Park extend DAMOS quotas goal
   auto-tuning to support multiple tuning algorithms that users can select.
 
 - The 4 patch series "mm: thp: reduce unnecessary
   start_stop_khugepaged()" from Breno Leitao fixes the khugpaged sysfs
   handling so we no longer spam the logs with reams of junk when
   starting/stopping khugepaged.
 
 - The 3 patch series "mm: improve map count checks" from Lorenzo Stoakes
   provides some cleanups and slight fixes in the mremap, mmap and vma
   code.
 
 - The 5 patch series "mm/damon: support addr_unit on default monitoring
   targets for modules" from SeongJae Park extends the use of DAMON core's
   addr_unit tunable.
 
 - The 5 patch series "mm: khugepaged cleanups and mTHP prerequisites"
   from Nico Pache provides cleanups in the khugepaged and is a base for
   Nico's planned khugepaged mTHP support.
 
 - The 15 patch series "mm: memory hot(un)plug and SPARSEMEM cleanups"
   from David Hildenbrand implements code movement and cleanups in the
   memhotplug and sparsemem code.
 
 - The 2 patch series "mm: remove CONFIG_ARCH_ENABLE_MEMORY_HOTREMOVE and
   cleanup CONFIG_MIGRATION" from David Hildenbrand rationalizes some
   memhotplug Kconfig support.
 
 - The 6 patch series "change young flag check functions to return bool"
   from Baolin Wang is "a cleanup patchset to change all young flag check
   functions to return bool".
 
 - The 3 patch series "mm/damon/sysfs: fix memory leak and NULL
   dereference issues" from Josh Law and SeongJae Park fixes a few
   potential DAMON bugs.
 
 - The 25 patch series "mm/vma: convert vm_flags_t to vma_flags_t in vma
   code" from "converts a lot of the existing use of the legacy vm_flags_t
   data type to the new vma_flags_t type which replaces it".  Mainly in the
   vma code.
 
 - The 21 patch series "mm: expand mmap_prepare functionality and usage"
   from Lorenzo Stoakes "expands the mmap_prepare functionality, which is
   intended to replace the deprecated f_op->mmap hook which has been the
   source of bugs and security issues for some time".  Cleanups,
   documentation, extension of mmap_prepare into filesystem drivers.
 
 - The 13 patch series "mm/huge_memory: refactor zap_huge_pmd()" from
   Lorenzo Stoakes simplifies and cleans up zap_huge_pmd().  Additional
   cleanups around vm_normal_folio_pmd() and the softleaf functionality are
   performed.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCad3HDQAKCRDdBJ7gKXxA
 jrUQAPwNhPk5nPSxnyxjAeQtOBHqgCdnICeEismLajPKd9aYRgEA0s2XAu3tSUYi
 GrBnWImHG3s4ePQxVcPCegWTsOUrXgQ=
 =1Q7o
 -----END PGP SIGNATURE-----

Merge tag 'mm-stable-2026-04-13-21-45' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull MM updates from Andrew Morton:

 - "maple_tree: Replace big node with maple copy" (Liam Howlett)

   Mainly prepararatory work for ongoing development but it does reduce
   stack usage and is an improvement.

 - "mm, swap: swap table phase III: remove swap_map" (Kairui Song)

   Offers memory savings by removing the static swap_map. It also yields
   some CPU savings and implements several cleanups.

 - "mm: memfd_luo: preserve file seals" (Pratyush Yadav)

   File seal preservation to LUO's memfd code

 - "mm: zswap: add per-memcg stat for incompressible pages" (Jiayuan
   Chen)

   Additional userspace stats reportng to zswap

 - "arch, mm: consolidate empty_zero_page" (Mike Rapoport)

   Some cleanups for our handling of ZERO_PAGE() and zero_pfn

 - "mm/kmemleak: Improve scan_should_stop() implementation" (Zhongqiu
   Han)

   A robustness improvement and some cleanups in the kmemleak code

 - "Improve khugepaged scan logic" (Vernon Yang)

   Improve khugepaged scan logic and reduce CPU consumption by
   prioritizing scanning tasks that access memory frequently

 - "Make KHO Stateless" (Jason Miu)

   Simplify Kexec Handover by transitioning KHO from an xarray-based
   metadata tracking system with serialization to a radix tree data
   structure that can be passed directly to the next kernel

 - "mm: vmscan: add PID and cgroup ID to vmscan tracepoints" (Thomas
   Ballasi and Steven Rostedt)

   Enhance vmscan's tracepointing

 - "mm: arch/shstk: Common shadow stack mapping helper and
   VM_NOHUGEPAGE" (Catalin Marinas)

   Cleanup for the shadow stack code: remove per-arch code in favour of
   a generic implementation

 - "Fix KASAN support for KHO restored vmalloc regions" (Pasha Tatashin)

   Fix a WARN() which can be emitted the KHO restores a vmalloc area

 - "mm: Remove stray references to pagevec" (Tal Zussman)

   Several cleanups, mainly udpating references to "struct pagevec",
   which became folio_batch three years ago

 - "mm: Eliminate fake head pages from vmemmap optimization" (Kiryl
   Shutsemau)

   Simplify the HugeTLB vmemmap optimization (HVO) by changing how tail
   pages encode their relationship to the head page

 - "mm/damon/core: improve DAMOS quota efficiency for core layer
   filters" (SeongJae Park)

   Improve two problematic behaviors of DAMOS that makes it less
   efficient when core layer filters are used

 - "mm/damon: strictly respect min_nr_regions" (SeongJae Park)

   Improve DAMON usability by extending the treatment of the
   min_nr_regions user-settable parameter

 - "mm/page_alloc: pcp locking cleanup" (Vlastimil Babka)

   The proper fix for a previously hotfixed SMP=n issue. Code
   simplifications and cleanups ensued

 - "mm: cleanups around unmapping / zapping" (David Hildenbrand)

   A bunch of cleanups around unmapping and zapping. Mostly
   simplifications, code movements, documentation and renaming of
   zapping functions

 - "support batched checking of the young flag for MGLRU" (Baolin Wang)

   Batched checking of the young flag for MGLRU. It's part cleanups; one
   benchmark shows large performance benefits for arm64

 - "memcg: obj stock and slab stat caching cleanups" (Johannes Weiner)

   memcg cleanup and robustness improvements

 - "Allow order zero pages in page reporting" (Yuvraj Sakshith)

   Enhance free page reporting - it is presently and undesirably order-0
   pages when reporting free memory.

 - "mm: vma flag tweaks" (Lorenzo Stoakes)

   Cleanup work following from the recent conversion of the VMA flags to
   a bitmap

 - "mm/damon: add optional debugging-purpose sanity checks" (SeongJae
   Park)

   Add some more developer-facing debug checks into DAMON core

 - "mm/damon: test and document power-of-2 min_region_sz requirement"
   (SeongJae Park)

   An additional DAMON kunit test and makes some adjustments to the
   addr_unit parameter handling

 - "mm/damon/core: make passed_sample_intervals comparisons
   overflow-safe" (SeongJae Park)

   Fix a hard-to-hit time overflow issue in DAMON core

 - "mm/damon: improve/fixup/update ratio calculation, test and
   documentation" (SeongJae Park)

   A batch of misc/minor improvements and fixups for DAMON

 - "mm: move vma_(kernel|mmu)_pagesize() out of hugetlb.c" (David
   Hildenbrand)

   Fix a possible issue with dax-device when CONFIG_HUGETLB=n. Some code
   movement was required.

 - "zram: recompression cleanups and tweaks" (Sergey Senozhatsky)

   A somewhat random mix of fixups, recompression cleanups and
   improvements in the zram code

 - "mm/damon: support multiple goal-based quota tuning algorithms"
   (SeongJae Park)

   Extend DAMOS quotas goal auto-tuning to support multiple tuning
   algorithms that users can select

 - "mm: thp: reduce unnecessary start_stop_khugepaged()" (Breno Leitao)

   Fix the khugpaged sysfs handling so we no longer spam the logs with
   reams of junk when starting/stopping khugepaged

 - "mm: improve map count checks" (Lorenzo Stoakes)

   Provide some cleanups and slight fixes in the mremap, mmap and vma
   code

 - "mm/damon: support addr_unit on default monitoring targets for
   modules" (SeongJae Park)

   Extend the use of DAMON core's addr_unit tunable

 - "mm: khugepaged cleanups and mTHP prerequisites" (Nico Pache)

   Cleanups to khugepaged and is a base for Nico's planned khugepaged
   mTHP support

 - "mm: memory hot(un)plug and SPARSEMEM cleanups" (David Hildenbrand)

   Code movement and cleanups in the memhotplug and sparsemem code

 - "mm: remove CONFIG_ARCH_ENABLE_MEMORY_HOTREMOVE and cleanup
   CONFIG_MIGRATION" (David Hildenbrand)

   Rationalize some memhotplug Kconfig support

 - "change young flag check functions to return bool" (Baolin Wang)

   Cleanups to change all young flag check functions to return bool

 - "mm/damon/sysfs: fix memory leak and NULL dereference issues" (Josh
   Law and SeongJae Park)

   Fix a few potential DAMON bugs

 - "mm/vma: convert vm_flags_t to vma_flags_t in vma code" (Lorenzo
   Stoakes)

   Convert a lot of the existing use of the legacy vm_flags_t data type
   to the new vma_flags_t type which replaces it. Mainly in the vma
   code.

 - "mm: expand mmap_prepare functionality and usage" (Lorenzo Stoakes)

   Expand the mmap_prepare functionality, which is intended to replace
   the deprecated f_op->mmap hook which has been the source of bugs and
   security issues for some time. Cleanups, documentation, extension of
   mmap_prepare into filesystem drivers

 - "mm/huge_memory: refactor zap_huge_pmd()" (Lorenzo Stoakes)

   Simplify and clean up zap_huge_pmd(). Additional cleanups around
   vm_normal_folio_pmd() and the softleaf functionality are performed.

* tag 'mm-stable-2026-04-13-21-45' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (369 commits)
  mm: fix deferred split queue races during migration
  mm/khugepaged: fix issue with tracking lock
  mm/huge_memory: add and use has_deposited_pgtable()
  mm/huge_memory: add and use normal_or_softleaf_folio_pmd()
  mm: add softleaf_is_valid_pmd_entry(), pmd_to_softleaf_folio()
  mm/huge_memory: separate out the folio part of zap_huge_pmd()
  mm/huge_memory: use mm instead of tlb->mm
  mm/huge_memory: remove unnecessary sanity checks
  mm/huge_memory: deduplicate zap deposited table call
  mm/huge_memory: remove unnecessary VM_BUG_ON_PAGE()
  mm/huge_memory: add a common exit path to zap_huge_pmd()
  mm/huge_memory: handle buggy PMD entry in zap_huge_pmd()
  mm/huge_memory: have zap_huge_pmd return a boolean, add kdoc
  mm/huge: avoid big else branch in zap_huge_pmd()
  mm/huge_memory: simplify vma_is_specal_huge()
  mm: on remap assert that input range within the proposed VMA
  mm: add mmap_action_map_kernel_pages[_full]()
  uio: replace deprecated mmap hook with mmap_prepare in uio_info
  drivers: hv: vmbus: replace deprecated mmap hook with mmap_prepare
  mm: allow handling of stacked mmap_prepare hooks in more drivers
  ...
2026-04-15 12:59:16 -07:00
Linus Torvalds
91a4855d6c Networking changes for 7.1.
Core & protocols
 ----------------
 
  - Support HW queue leasing, allowing containers to be granted access
    to HW queues for zero-copy operations and AF_XDP.
 
  - Number of code moves to help the compiler with inlining.
    Avoid output arguments for returning drop reason where possible.
 
  - Rework drop handling within qdiscs to include more metadata
    about the reason and dropping qdisc in the tracepoints.
 
  - Remove the rtnl_lock use from IP Multicast Routing.
 
  - Pack size information into the Rx Flow Steering table pointer
    itself. This allows making the table itself a flat array of u32s,
    thus making the table allocation size a power of two.
 
  - Report TCP delayed ack timer information via socket diag.
 
  - Add ip_local_port_step_width sysctl to allow distributing the randomly
    selected ports more evenly throughout the allowed space.
 
  - Add support for per-route tunsrc in IPv6 segment routing.
 
  - Start work of switching sockopt handling to iov_iter.
 
  - Improve dynamic recvbuf sizing in MPTCP, limit burstiness and avoid
    buffer size drifting up.
 
  - Support MSG_EOR in MPTCP.
 
  - Add stp_mode attribute to the bridge driver for STP mode selection.
    This addresses concerns about call_usermodehelper() usage.
 
  - Remove UDP-Lite support (as announced in 2023).
 
  - Remove support for building IPv6 as a module.
    Remove the now unnecessary function calling indirection.
 
 Cross-tree stuff
 ----------------
 
  - Move Michael MIC code from generic crypto into wireless,
    it's considered insecure but some WiFi networks still need it.
 
 Netfilter
 ---------
 
  - Switch nft_fib_ipv6 module to no longer need temporary dst_entry
    object allocations by using fib6_lookup() + RCU.
    Florian W reports this gets us ~13% higher packet rate.
 
  - Convert IPVS's global __ip_vs_mutex to per-net service_mutex and
    switch the service tables to be per-net. Convert some code that
    walks the service lists to use RCU instead of the service_mutex.
 
  - Add more opinionated input validation to lower security exposure.
 
  - Make IPVS hash tables to be per-netns and resizable.
 
 Wireless
 --------
 
  - Finished assoc frame encryption/EPPKE/802.1X-over-auth.
 
  - Radar detection improvements.
 
  - Add 6 GHz incumbent signal detection APIs.
 
  - Multi-link support for FILS, probe response templates and
    client probing.
 
  - New APIs and mac80211 support for NAN (Neighbor Aware Networking,
    aka Wi-Fi Aware) so less work must be in firmware.
 
 Driver API
 ----------
 
  - Add numerical ID for devlink instances (to avoid having to create
    fake bus/device pairs just to have an ID). Support shared devlink
    instances which span multiple PFs.
 
  - Add standard counters for reporting pause storm events
    (implement in mlx5 and fbnic).
 
  - Add configuration API for completion writeback buffering
    (implement in mana).
 
  - Support driver-initiated change of RSS context sizes.
 
  - Support DPLL monitoring input frequency (implement in zl3073x).
 
  - Support per-port resources in devlink (implement in mlx5).
 
 Misc
 ----
 
  - Expand the YAML spec for Netfilter.
 
 Drivers
 -------
 
  - Software:
    - macvlan: support multicast rx for bridge ports with shared source
      MAC address
    - team: decouple receive and transmit enablement for IEEE 802.3ad
      LACP "independent control"
 
  - Ethernet high-speed NICs:
    - nVidia/Mellanox:
      - support high order pages in zero-copy mode (for payload
        coalescing)
      - support multiple packets in a page (for systems with 64kB pages)
    - Broadcom 25-400GE (bnxt):
      - implement XDP RSS hash metadata extraction
      - add software fallback for UDP GSO, lowering the IOMMU cost
    - Broadcom 800GE (bnge):
      - add link status and configuration handling
      - add various HW and SW statistics
    - Marvell/Cavium:
      - NPC HW block support for cn20k
    - Huawei (hinic3):
      - add mailbox / control queue
      - add rx VLAN offload
      - add driver info and link management
 
  - Ethernet NICs:
    - Marvell/Aquantia:
      - support reading SFP module info on some AQC100 cards
    - Realtek PCI (r8169):
      - add support for RTL8125cp
    - Realtek USB (r8152):
      - support for the RTL8157 5Gbit chip
      - add 2500baseT EEE status/configuration support
 
  - Ethernet NICs embedded and off-the-shelf IP:
    - Synopsys (stmmac):
      - cleanup and reorganize SerDes handling and PCS support
      - cleanup descriptor handling and per-platform data
      - cleanup and consolidate MDIO defines and handling
      - shrink driver memory use for internal structures
      - improve Tx IRQ coalescing
      - improve TCP segmentation handling
      - add support for Spacemit K3
    - Cadence (macb):
      - support PHYs that have inband autoneg disabled with GEM
      - support IEEE 802.3az EEE
      - rework usrio capabilities and handling
    - AMD (xgbe):
      - improve power management for S0i3
      - improve TX resilience for link-down handling
 
  - Virtual:
    - Google cloud vNIC:
      - support larger ring sizes in DQO-QPL mode
      - improve HW-GRO handling
      - support UDP GSO for DQO format
    - PCIe NTB:
      - support queue count configuration
 
  - Ethernet PHYs:
    - automatically disable PHY autonomous EEE if MAC is in charge
    - Broadcom:
      - add BCM84891/BCM84892 support
    - Micrel:
      - support for LAN9645X internal PHY
    - Realtek:
      - add RTL8224 pair order support
      - support PHY LEDs on RTL8211F-VD
      - support spread spectrum clocking (SSC)
    - Maxlinear:
      - add PHY-level statistics via ethtool
 
  - Ethernet switches:
    - Maxlinear (mxl862xx):
      - support for bridge offloading
      - support for VLANs
      - support driver statistics
 
  - Bluetooth:
    - large number of fixes and new device IDs
    - Mediatek:
      - support MT6639 (MT7927)
      - support MT7902 SDIO
 
  - WiFi:
    - Intel (iwlwifi):
      - UNII-9 and continuing UHR work
    - MediaTek (mt76):
      - mt7996/mt7925 MLO fixes/improvements
      - mt7996 NPU support (HW eth/wifi traffic offload)
    - Qualcomm (ath12k):
      - monitor mode support on IPQ5332
      - basic hwmon temperature reporting
      - support IPQ5424
    - Realtek:
      - add USB RX aggregation to improve performance
      - add USB TX flow control by tracking in-flight URBs
 
  - Cellular:
    - IPA v5.2 support
 
 Signed-off-by: Jakub Kicinski <kuba@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmnelNoACgkQMUZtbf5S
 IrtWFw//WyiXuEiGawVQONnbu1dtR+3nw/cvNpSYi0IM66vbRUB9n+9fxm2MIyG4
 4jI/c/X/fxIvUxEqGez3yPn5P7KqkQR8WRYwkxrMYKRpXeukN0IDk5Euew5DskCe
 wtBKNJOQWKdKXff0bLQoJ9dHWYuJ2IMRVil5M3fhUbeUOXeyJD7Yn1w2ICvJAbj+
 T/Hw7sEtchNaHp6h6SbaQfahkUFHQG5peNoETkZF4UDF6ALGY29WH91GXeO2lrgN
 IxX203KtaavV0oU8T0oixZgOc57Ns081YfFL/F1JP2HV6lgkwhuq+zxCrRTi1c9M
 HPTXgwD7Z80Y74nM3YTLrPfoMOP8GLBZgdV3rUpwmteM26+gMTm+O1zHUur5ZoGy
 D6TaMFguPTIqiRyrARa9xY/J6r9TQkc2Wfu4bIuPndKFg8xPoepuEObODnh0+5Hg
 4j4pdFhIo2huENhSg7kVb/yl+1q68SFwM3RqTmx+OhCa0AyjcKIKgt/UBhismdnG
 r8obxzb+nXeJc2rRDuwNMwlBlcMSbep27uGt64zeHMMXVhTVqOoytNaL/X/ZpH2m
 A0DscUrpHvb36IoDPtanc6irP+JOh5Xe7Nw5qhkgwsMc7hlf8SyyHB4OUBBaz1qA
 ETSnHlfwklRmXSpWqH2LyGXjdOQpDKP46+h0W3dttMD2/cRBqYo=
 =EhQZ
 -----END PGP SIGNATURE-----

Merge tag 'net-next-7.1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next

Pull networking updates from Jakub Kicinski:
 "Core & protocols:

   - Support HW queue leasing, allowing containers to be granted access
     to HW queues for zero-copy operations and AF_XDP

   - Number of code moves to help the compiler with inlining. Avoid
     output arguments for returning drop reason where possible

   - Rework drop handling within qdiscs to include more metadata about
     the reason and dropping qdisc in the tracepoints

   - Remove the rtnl_lock use from IP Multicast Routing

   - Pack size information into the Rx Flow Steering table pointer
     itself. This allows making the table itself a flat array of u32s,
     thus making the table allocation size a power of two

   - Report TCP delayed ack timer information via socket diag

   - Add ip_local_port_step_width sysctl to allow distributing the
     randomly selected ports more evenly throughout the allowed space

   - Add support for per-route tunsrc in IPv6 segment routing

   - Start work of switching sockopt handling to iov_iter

   - Improve dynamic recvbuf sizing in MPTCP, limit burstiness and avoid
     buffer size drifting up

   - Support MSG_EOR in MPTCP

   - Add stp_mode attribute to the bridge driver for STP mode selection.
     This addresses concerns about call_usermodehelper() usage

   - Remove UDP-Lite support (as announced in 2023)

   - Remove support for building IPv6 as a module. Remove the now
     unnecessary function calling indirection

  Cross-tree stuff:

   - Move Michael MIC code from generic crypto into wireless, it's
     considered insecure but some WiFi networks still need it

  Netfilter:

   - Switch nft_fib_ipv6 module to no longer need temporary dst_entry
     object allocations by using fib6_lookup() + RCU.

     Florian W reports this gets us ~13% higher packet rate

   - Convert IPVS's global __ip_vs_mutex to per-net service_mutex and
     switch the service tables to be per-net. Convert some code that
     walks the service lists to use RCU instead of the service_mutex

   - Add more opinionated input validation to lower security exposure

   - Make IPVS hash tables to be per-netns and resizable

  Wireless:

   - Finished assoc frame encryption/EPPKE/802.1X-over-auth

   - Radar detection improvements

   - Add 6 GHz incumbent signal detection APIs

   - Multi-link support for FILS, probe response templates and client
     probing

   - New APIs and mac80211 support for NAN (Neighbor Aware Networking,
     aka Wi-Fi Aware) so less work must be in firmware

  Driver API:

   - Add numerical ID for devlink instances (to avoid having to create
     fake bus/device pairs just to have an ID). Support shared devlink
     instances which span multiple PFs

   - Add standard counters for reporting pause storm events (implement
     in mlx5 and fbnic)

   - Add configuration API for completion writeback buffering (implement
     in mana)

   - Support driver-initiated change of RSS context sizes

   - Support DPLL monitoring input frequency (implement in zl3073x)

   - Support per-port resources in devlink (implement in mlx5)

  Misc:

   - Expand the YAML spec for Netfilter

  Drivers

   - Software:
      - macvlan: support multicast rx for bridge ports with shared
        source MAC address
      - team: decouple receive and transmit enablement for IEEE 802.3ad
        LACP "independent control"

   - Ethernet high-speed NICs:
      - nVidia/Mellanox:
         - support high order pages in zero-copy mode (for payload
           coalescing)
         - support multiple packets in a page (for systems with 64kB
           pages)
      - Broadcom 25-400GE (bnxt):
         - implement XDP RSS hash metadata extraction
         - add software fallback for UDP GSO, lowering the IOMMU cost
      - Broadcom 800GE (bnge):
         - add link status and configuration handling
         - add various HW and SW statistics
      - Marvell/Cavium:
         - NPC HW block support for cn20k
      - Huawei (hinic3):
         - add mailbox / control queue
         - add rx VLAN offload
         - add driver info and link management

   - Ethernet NICs:
      - Marvell/Aquantia:
         - support reading SFP module info on some AQC100 cards
      - Realtek PCI (r8169):
         - add support for RTL8125cp
      - Realtek USB (r8152):
         - support for the RTL8157 5Gbit chip
         - add 2500baseT EEE status/configuration support

   - Ethernet NICs embedded and off-the-shelf IP:
      - Synopsys (stmmac):
         - cleanup and reorganize SerDes handling and PCS support
         - cleanup descriptor handling and per-platform data
         - cleanup and consolidate MDIO defines and handling
         - shrink driver memory use for internal structures
         - improve Tx IRQ coalescing
         - improve TCP segmentation handling
         - add support for Spacemit K3
      - Cadence (macb):
         - support PHYs that have inband autoneg disabled with GEM
         - support IEEE 802.3az EEE
         - rework usrio capabilities and handling
      - AMD (xgbe):
         - improve power management for S0i3
         - improve TX resilience for link-down handling

   - Virtual:
      - Google cloud vNIC:
         - support larger ring sizes in DQO-QPL mode
         - improve HW-GRO handling
         - support UDP GSO for DQO format
      - PCIe NTB:
         - support queue count configuration

   - Ethernet PHYs:
      - automatically disable PHY autonomous EEE if MAC is in charge
      - Broadcom:
         - add BCM84891/BCM84892 support
      - Micrel:
         - support for LAN9645X internal PHY
      - Realtek:
         - add RTL8224 pair order support
         - support PHY LEDs on RTL8211F-VD
         - support spread spectrum clocking (SSC)
      - Maxlinear:
         - add PHY-level statistics via ethtool

   - Ethernet switches:
      - Maxlinear (mxl862xx):
         - support for bridge offloading
         - support for VLANs
         - support driver statistics

   - Bluetooth:
      - large number of fixes and new device IDs
      - Mediatek:
         - support MT6639 (MT7927)
         - support MT7902 SDIO

   - WiFi:
      - Intel (iwlwifi):
         - UNII-9 and continuing UHR work
      - MediaTek (mt76):
         - mt7996/mt7925 MLO fixes/improvements
         - mt7996 NPU support (HW eth/wifi traffic offload)
      - Qualcomm (ath12k):
         - monitor mode support on IPQ5332
         - basic hwmon temperature reporting
         - support IPQ5424
      - Realtek:
         - add USB RX aggregation to improve performance
         - add USB TX flow control by tracking in-flight URBs

   - Cellular:
      - IPA v5.2 support"

* tag 'net-next-7.1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1561 commits)
  net: pse-pd: fix kernel-doc function name for pse_control_find_by_id()
  wireguard: device: use exit_rtnl callback instead of manual rtnl_lock in pre_exit
  wireguard: allowedips: remove redundant space
  tools: ynl: add sample for wireguard
  wireguard: allowedips: Use kfree_rcu() instead of call_rcu()
  MAINTAINERS: Add netkit selftest files
  selftests/net: Add additional test coverage in nk_qlease
  selftests/net: Split netdevsim tests from HW tests in nk_qlease
  tools/ynl: Make YnlFamily closeable as a context manager
  net: airoha: Add missing PPE configurations in airoha_ppe_hw_init()
  net: airoha: Fix VIP configuration for AN7583 SoC
  net: caif: clear client service pointer on teardown
  net: strparser: fix skb_head leak in strp_abort_strp()
  net: usb: cdc-phonet: fix skb frags[] overflow in rx_complete()
  selftests/bpf: add test for xdp_master_redirect with bond not up
  net, bpf: fix null-ptr-deref in xdp_master_redirect() for down master
  net: airoha: Remove PCE_MC_EN_MASK bit in REG_FE_PCE_CFG configuration
  sctp: disable BH before calling udp_tunnel_xmit_skb()
  sctp: fix missing encap_port propagation for GSO fragments
  net: airoha: Rely on net_device pointer in ETS callbacks
  ...
2026-04-14 18:36:10 -07:00
Jakub Kicinski
35c2c39832 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Merge in late fixes in preparation for the net-next PR.

Conflicts:

include/net/sch_generic.h
  a6bd339dbb ("net_sched: fix skb memory leak in deferred qdisc drops")
  ff2998f29f ("net: sched: introduce qdisc-specific drop reason tracing")
https://lore.kernel.org/adz0iX85FHMz0HdO@sirena.org.uk

drivers/net/ethernet/airoha/airoha_eth.c
  1acdfbdb51 ("net: airoha: Fix VIP configuration for AN7583 SoC")
  bf3471e6e6 ("net: airoha: Make flow control source port mapping dependent on nbq parameter")

Adjacent changes:

drivers/net/ethernet/airoha/airoha_ppe.c
  f44218cd5e ("net: airoha: Reset PPE cpu port configuration in airoha_ppe_hw_init()")
  7da62262ec ("inet: add ip_local_port_step_width sysctl to improve port usage distribution")

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-04-14 12:04:00 -07:00
Kuniyuki Iwashima
c058bbf05b tcp: Don't set treq->req_usec_ts in cookie_tcp_reqsk_init().
Commit de5626b95e ("tcp: Factorise cookie-independent fields
initialisation in cookie_v[46]_check().") miscategorised
tcp_rsk(req)->req_usec_ts init to cookie_tcp_reqsk_init(),
which is used by both BPF/non-BPF SYN cookie reqsk.

Rather, it should have been moved to cookie_tcp_reqsk_alloc() by
commit 8e7bab6b96 ("tcp: Factorise cookie-dependent fields
initialisation in cookie_v[46]_check()") so that only non-BPF SYN
cookie sets tcp_rsk(req)->req_usec_ts to false.

Let's move the initialisation to cookie_tcp_reqsk_alloc() to
respect bpf_tcp_req_attrs.usec_ts_ok.

Fixes: e472f88891 ("bpf: tcp: Support arbitrary SYN Cookie.")
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20260410235328.1773449-1-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-04-13 15:58:08 -07:00
Gabriel Krisman Bertazi
b80a95ccf1 udp: Force compute_score to always inline
Back in 2024 I reported a 7-12% regression on an iperf3 UDP loopback
thoughput test that we traced to the extra overhead of calling
compute_score on two places, introduced by commit f0ea27e7bf ("udp:
re-score reuseport groups when connected sockets are present").  At the
time, I pointed out the overhead was caused by the multiple calls,
associated with cpu-specific mitigations, and merged commit
50aee97d15 ("udp: Avoid call to compute_score on multiple sites") to
jump back explicitly, to force the rescore call in a single place.

Recently though, we got another regression report against a newer distro
version, which a team colleague traced back to the same root-cause.
Turns out that once we updated to gcc-13, the compiler got smart enough
to unroll the loop, undoing my previous mitigation.  Let's bite the
bullet and __always_inline compute_score on both ipv4 and ipv6 to
prevent gcc from de-optimizing it again in the future.  These functions
are only called in two places each, udpX_lib_lookup1 and
udpX_lib_lookup2, so the extra size shouldn't be a problem and it is hot
enough to be very visible in profilings.  In fact, with gcc13, forcing
the inline will prevent gcc from unrolling the fix from commit
50aee97d15, so we don't end up increasing udpX_lib_lookup2 at all.

I haven't recollected the results myself, as I don't have access to the
machine at the moment.  But the same colleague reported 4.67%
inprovement with this patch in the loopback benchmark, solving the
regression report within noise margins.

Eric Dumazet reported no size change to vmlinux when built with clang.
I report the same also with gcc-13:

scripts/bloat-o-meter vmlinux vmlinux-inline
add/remove: 0/2 grow/shrink: 4/0 up/down: 616/-416 (200)
Function                                     old     new   delta
udp6_lib_lookup2                             762     949    +187
__udp6_lib_lookup                            810     975    +165
udp4_lib_lookup2                             757     906    +149
__udp4_lib_lookup                            871     986    +115
__pfx_compute_score                           32       -     -32
compute_score                                384       -    -384
Total: Before=35011784, After=35011984, chg +0.00%

Fixes: 50aee97d15 ("udp: Avoid call to compute_score on multiple sites")
Reviewed-by: Eric Dumazet <edumazet@google.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de>
Link: https://patch.msgid.link/20260410155936.654915-1-krisman@suse.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-04-13 15:44:42 -07:00
Linus Torvalds
b7d74ea0fd vfs-7.1-rc1.kino
Please consider pulling these changes from the signed vfs-7.1-rc1.kino tag.
 
 Thanks!
 Christian
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCadjZCgAKCRCRxhvAZXjc
 otmnAP4sbsxZQdz2TG2hJuOwnEZOkkxZQOUMc3ERVyZaWXIeTAEA7e5M+8FpoG9n
 8ipO76UoaXdGLESrqVdp9EOhLqOW7QY=
 =uMeJ
 -----END PGP SIGNATURE-----

Merge tag 'vfs-7.1-rc1.kino' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull vfs i_ino updates from Christian Brauner:
 "For historical reasons, the inode->i_ino field is an unsigned long,
  which means that it's 32 bits on 32 bit architectures. This has caused
  a number of filesystems to implement hacks to hash a 64-bit identifier
  into a 32-bit field, and deprives us of a universal identifier field
  for an inode.

  This changes the inode->i_ino field from an unsigned long to a u64.
  This shouldn't make any material difference on 64-bit hosts, but
  32-bit hosts will see struct inode grow by at least 4 bytes. This
  could have effects on slabcache sizes and field alignment.

  The bulk of the changes are to format strings and tracepoints, since
  the kernel itself doesn't care that much about the i_ino field. The
  first patch changes some vfs function arguments, so check that one out
  carefully.

  With this change, we may be able to shrink some inode structures. For
  instance, struct nfs_inode has a fileid field that holds the 64-bit
  inode number. With this set of changes, that field could be
  eliminated. I'd rather leave that sort of cleanups for later just to
  keep this simple"

* tag 'vfs-7.1-rc1.kino' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  nilfs2: fix 64-bit division operations in nilfs_bmap_find_target_in_group()
  EVM: add comment describing why ino field is still unsigned long
  vfs: remove externs from fs.h on functions modified by i_ino widening
  treewide: fix missed i_ino format specifier conversions
  ext4: fix signed format specifier in ext4_load_inode trace event
  treewide: change inode->i_ino from unsigned long to u64
  nilfs2: widen trace event i_ino fields to u64
  f2fs: widen trace event i_ino fields to u64
  ext4: widen trace event i_ino fields to u64
  zonefs: widen trace event i_ino fields to u64
  hugetlbfs: widen trace event i_ino fields to u64
  ext2: widen trace event i_ino fields to u64
  cachefiles: widen trace event i_ino fields to u64
  vfs: widen trace event i_ino fields to u64
  net: change sock.sk_ino and sock_i_ino() to u64
  audit: widen ino fields to u64
  vfs: widen inode hash/lookup functions to u64
2026-04-13 12:19:01 -07:00
Jakub Kicinski
b025461303 tcp: update window_clamp when SO_RCVBUF is set
Commit under Fixes moved recomputing the window clamp to
tcp_measure_rcv_mss() (when scaling_ratio changes).
I suspect it missed the fact that we don't recompute the clamp
when rcvbuf is set. Until scaling_ratio changes we are
stuck with the old window clamp which may be based on
the small initial buffer. scaling_ratio may never change.

Inspired by Eric's recent commit d1361840f8 ("tcp: fix
SO_RCVLOWAT and RCVBUF autotuning") plumb the user action
thru to TCP and have it update the clamp.

A smaller fix would be to just have tcp_rcvbuf_grow()
adjust the clamp even if SOCK_RCVBUF_LOCK is set.
But IIUC this is what we were trying to get away from
in the first place.

Fixes: a2cbb16039 ("tcp: Update window clamping condition")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Eric Dumazet <edumaze@google.com>
Link: https://patch.msgid.link/20260408001438.129165-1-kuba@kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2026-04-13 15:32:35 +02:00
Jakub Kicinski
9336854a59 Merge branch 'net-reduce-sk_filter-and-friends-bloat'
Eric Dumazet says:

====================
net: reduce sk_filter() (and friends) bloat

Some functions return an error by value, and a drop_reason
by an output parameter. This extra parameter can force stack canaries.

A drop_reason is enough and more efficient.

This series reduces bloat by 678 bytes on x86_64:

$ scripts/bloat-o-meter -t vmlinux.old vmlinux.final
add/remove: 0/0 grow/shrink: 3/18 up/down: 79/-757 (-678)
Function                                     old     new   delta
vsock_queue_rcv_skb                           50      79     +29
ipmr_cache_report                           1290    1315     +25
ip6mr_cache_report                          1322    1347     +25
tcp_v6_rcv                                  3169    3167      -2
packet_rcv_spkt                              329     327      -2
unix_dgram_sendmsg                          1731    1726      -5
netlink_unicast                              957     945     -12
netlink_dump                                1372    1359     -13
sk_filter_trim_cap                           889     858     -31
netlink_broadcast_filtered                  1633    1595     -38
tcp_v4_rcv                                  3152    3111     -41
raw_rcv_skb                                  122      80     -42
ping_queue_rcv_skb                           109      61     -48
ping_rcv                                     215     162     -53
rawv6_rcv_skb                                278     224     -54
__sk_receive_skb                             690     632     -58
raw_rcv                                      591     527     -64
udpv6_queue_rcv_one_skb                      935     869     -66
udp_queue_rcv_one_skb                        919     853     -66
tun_net_xmit                                1146    1074     -72
sock_queue_rcv_skb_reason                    166      76     -90
Total: Before=29722890, After=29722212, chg -0.00%

Future conversions from sock_queue_rcv_skb() to sock_queue_rcv_skb_reason()
can be done later.
====================

Link: https://patch.msgid.link/20260409145625.2306224-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-04-12 14:30:28 -07:00
Eric Dumazet
fb37aea2a0 net: change sk_filter_trim_cap() to return a drop_reason by value
Current return value can be replaced with the drop_reason,
reducing kernel bloat:

$ scripts/bloat-o-meter -t vmlinux.old vmlinux.new
add/remove: 0/2 grow/shrink: 1/11 up/down: 32/-603 (-571)
Function                                     old     new   delta
tcp_v6_rcv                                  3135    3167     +32
unix_dgram_sendmsg                          1731    1726      -5
netlink_unicast                              957     945     -12
netlink_dump                                1372    1359     -13
sk_filter_trim_cap                           882     858     -24
tcp_v4_rcv                                  3143    3111     -32
__pfx_tcp_filter                              32       -     -32
netlink_broadcast_filtered                  1633    1595     -38
sock_queue_rcv_skb_reason                    126      76     -50
tun_net_xmit                                1127    1074     -53
__sk_receive_skb                             690     632     -58
udpv6_queue_rcv_one_skb                      935     869     -66
udp_queue_rcv_one_skb                        919     853     -66
tcp_filter                                   154       -    -154
Total: Before=29722783, After=29722212, chg -0.00%

Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260409145625.2306224-6-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-04-12 14:30:25 -07:00
Eric Dumazet
97449a5f1a tcp: change tcp_filter() to return the reason by value
sk_filter_trim_cap() will soon return the reason by value,
do the same for tcp_filter().

Note:

tcp_filter() is no longer inlined. Following patch will inline it again.

$ scripts/bloat-o-meter -t vmlinux.4 vmlinux.5
add/remove: 2/0 grow/shrink: 0/2 up/down: 186/-43 (143)
Function                                     old     new   delta
tcp_filter                                     -     154    +154
__pfx_tcp_filter                               -      32     +32
tcp_v4_rcv                                  3152    3143      -9
tcp_v6_rcv                                  3169    3135     -34
Total: Before=29722640, After=29722783, chg +0.00%

Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260409145625.2306224-5-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-04-12 14:30:25 -07:00
Eric Dumazet
900f27fb79 net: change sock_queue_rcv_skb_reason() to return a drop_reason
Change sock_queue_rcv_skb_reason() to return the drop_reason directly
instead of using a reference.

This is part of an effort to remove stack canaries and reduce bloat.

$ scripts/bloat-o-meter -t vmlinux.old vmlinux.new
add/remove: 0/0 grow/shrink: 3/7 up/down: 79/-301 (-222)
Function                                     old     new   delta
vsock_queue_rcv_skb                           50      79     +29
ipmr_cache_report                           1290    1315     +25
ip6mr_cache_report                          1322    1347     +25
packet_rcv_spkt                              329     327      -2
sock_queue_rcv_skb_reason                    166     128     -38
raw_rcv_skb                                  122      80     -42
ping_queue_rcv_skb                           109      61     -48
ping_rcv                                     215     162     -53
rawv6_rcv_skb                                278     224     -54
raw_rcv                                      591     527     -64
Total: Before=29722890, After=29722668, chg -0.00%

Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260409145625.2306224-2-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-04-12 14:30:25 -07:00
Gal Pressman
8632175ccb gre: Count GRE packet drops
GRE is silently dropping packets without updating statistics.

In case of drop, increment rx_dropped counter to provide visibility into
packet loss. For the case where no GRE protocol handler is registered,
use rx_nohandler.

Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com>
Reviewed-by: Nimrod Oren <noren@nvidia.com>
Signed-off-by: Gal Pressman <gal@nvidia.com>
Link: https://patch.msgid.link/20260409090945.1542440-1-gal@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-04-12 12:33:33 -07:00
Eric Dumazet
29703d7813 tcp: add indirect call wrapper in tcp_conn_request()
Small improvement in SYN processing, to directly call
tcp_v6_init_seq_and_ts_off() or tcp_v4_init_seq_and_ts_off().

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20260410174950.745670-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-04-12 09:17:03 -07:00