Commit Graph

2129 Commits

Author SHA1 Message Date
Linus Torvalds
8be01e1280 io_uring-7.1-20260508
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmn+FU4QHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpnIsEADVG1zqBrfj6JE0pscyrFbFHwAjoihs7jW2
 dHVRzrWKlCG0iflVAxIcoA6WQLAc1W5cmWi9GHuOHtPvWY1rdiZCvp8GMGuq40ye
 3OQrMAcDdpowAUBfO9tiZ9L9Bn96HFZCa92V0PEp/fSPxuRv1HGE/yTpWsardbxn
 eUGBoOMAclqawMU5thfOXFMT+DetwrY//nd799iEElzyNfk92mDCZZ5n3WPyl1J4
 hn/iUu04YVozto9P17SJfEOg1c4kz84wL6ATR+2IuxrWm8/LxXspmbIovJYaCLRr
 EkdevTrxABBTJ77dllnnaFg233F75ZdYr0z0xgHoOFT2totSz2lZFxqz8R0/b/NE
 mHdshkn4LTU4yDHuILDt0LxImi62i1Bmn7QQIbICcMAEhTGh2hebaM/lmmHp7IlN
 R98q4ALm+dTu5vp+MDlve53P4UITxsgclAICqxyY26FrNneHd/TodeDYPGYLwj4F
 2EPZyzHe3WpTXMmF6pxLlEr2r8DRqZhBqj+mohN/pZK+ecs6GmCLh1F1zcuaLQyg
 VOPf3FY48f6EWku0gCUpOev2iFTQHICf3RC39uR5pVVdC3+Yzc2+yqa9xg+I5ckt
 gNTbmD1vkssaCmxIXR0Cj/pHskZxJqtdmDJmcBQfbLm9ytFuCGFK0IqpvpyRya/H
 jGjRGJxJ5w==
 =y3+7
 -----END PGP SIGNATURE-----

Merge tag 'io_uring-7.1-20260508' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux

Pull io_uring fixes from Jens Axboe:

 - Ensure that the absolute timeouts for both the command side and the
   waiting side honor the callers time namespace

 - Ensure tracked NAPI entries are cleared at unregistration time, as
   the NAPI polling loop checks the list state rather than the general
   NAPI state. This can lead to NAPI polling even after unregistration
   has been done. If unregistered, all NAPI polling should be disabled

 - Fix for eventfd recursive invocation handling

* tag 'io_uring-7.1-20260508' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux:
  io_uring/wait: honour caller's time namespace for IORING_ENTER_ABS_TIMER
  io_uring/timeout: honour caller's time namespace for IORING_TIMEOUT_ABS
  io_uring/eventfd: reset deferred signal state
  io_uring/napi: clear tracked NAPI entries on unregister
2026-05-08 13:12:48 -07:00
Maoyi Xie
45d2b37a37 io_uring/wait: honour caller's time namespace for IORING_ENTER_ABS_TIMER
io_uring_enter() with IORING_ENTER_ABS_TIMER takes an absolute
timespec from the caller via ext_arg->ts. It arms an ABS mode
hrtimer in __io_cqring_wait_schedule(). The conversion path in
io_uring/wait.c parses ext_arg->ts inline rather than going
through io_parse_user_time(). It therefore does not pick up the
time namespace conversion added by the previous patch.

Apply timens_ktime_to_host() to the parsed time on the
IORING_ENTER_ABS_TIMER branch. This mirrors the IORING_TIMEOUT_ABS
fix in io_parse_user_time(). Use ctx->clockid as the clock id.
ctx->clockid is set either at ring creation or via
IORING_REGISTER_CLOCK.

timens_ktime_to_host() is a no-op for clocks not affected by time
namespaces. It is also a no-op for callers in the initial time
namespace. The fast path is unchanged.

Reproducer: in unshare --user --time, with a -10s monotonic
offset, call io_uring_enter with min_complete=1,
IORING_ENTER_ABS_TIMER, and ts = now + 1s. The call returns
-ETIME after <1ms instead of after the expected ~1s.

Suggested-by: Pavel Begunkov <asml.silence@gmail.com>
Suggested-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Maoyi Xie <maoyi.xie@ntu.edu.sg>
Link: https://patch.msgid.link/20260504153755.1293932-3-maoyi.xie@ntu.edu.sg
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-05-06 04:58:56 -06:00
Maoyi Xie
9cc6bac1be io_uring/timeout: honour caller's time namespace for IORING_TIMEOUT_ABS
io_uring's IORING_OP_TIMEOUT and IORING_OP_LINK_TIMEOUT accept a
timespec from the caller via io_parse_user_time(). With
IORING_TIMEOUT_ABS, the timestamp is an absolute deadline on the
selected clock. The clock is CLOCK_MONOTONIC by default.
CLOCK_BOOTTIME and CLOCK_REALTIME are also selectable.

A submitter inside a CLONE_NEWTIME time namespace observes
CLOCK_MONOTONIC and CLOCK_BOOTTIME shifted by the namespace's
offsets relative to the host. Every other ABS timer interface in
the kernel converts the caller's absolute time to host view via
timens_ktime_to_host() before arming an hrtimer:

  kernel/time/posix-timers.c    -- timer_settime(TIMER_ABSTIME)
  kernel/time/posix-stubs.c     -- clock_nanosleep(TIMER_ABSTIME)
  kernel/time/alarmtimer.c      -- alarm_timer_nsleep(TIMER_ABSTIME)
  fs/timerfd.c                  -- timerfd_settime(TFD_TIMER_ABSTIME)

io_parse_user_time() does not. As a result, an absolute timeout
submitted from within a time namespace is interpreted in host
view. That is generally a different point in time. It may already
be in the past, causing the timer to fire immediately, or far in
the future, causing the timer not to fire when expected.

Reproducer: in unshare --user --time, with a -10s monotonic
offset, submit IORING_OP_TIMEOUT with IORING_TIMEOUT_ABS and
deadline = now + 1s. The CQE is delivered after <1ms instead of
the expected ~1s.

Apply timens_ktime_to_host() to the parsed time when
IORING_TIMEOUT_ABS is set. Split the existing clock id resolver
in io_timeout_get_clock() into a flags only helper
io_flags_to_clock(), so io_parse_user_time() can resolve the
clock without a struct io_timeout_data.

timens_ktime_to_host() is a no-op for clocks not affected by time
namespaces, e.g. CLOCK_REALTIME. It is also a no-op for callers
in the initial time namespace. The fast path is unchanged.

SQPOLL is also covered. The SQPOLL kernel thread is created via
create_io_thread() with CLONE_THREAD and no CLONE_NEW* flag.
copy_namespaces() therefore shares the submitter's nsproxy by
reference. Inside the SQPOLL kthread, current->nsproxy->time_ns
is the submitter's time_ns. timens_ktime_to_host() resolves
correctly.

Suggested-by: Pavel Begunkov <asml.silence@gmail.com>
Suggested-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Maoyi Xie <maoyi.xie@ntu.edu.sg>
Link: https://patch.msgid.link/20260504153755.1293932-2-maoyi.xie@ntu.edu.sg
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-05-06 04:58:56 -06:00
Yufan Chen
04fe9aeb4f io_uring/eventfd: reset deferred signal state
Recursive eventfd wakeups must defer io_uring eventfd signaling because
eventfd_signal_mask() rejects reentry from eventfd wakeup handlers. The
io_ev_fd ops bit tracks an outstanding deferred signal so that the same
rcu_head is not queued twice.

That bit is only set today. Once the first deferred callback runs, later
recursive notifications still see the bit set and skip queueing another
deferred signal. This can leave new completions without a matching
eventfd wake after the first recursive deferral.

Clear the pending bit before issuing the deferred signal. If the wakeup
path recurses while the callback runs, a new signal can be queued for
the next RCU grace period while the current callback keeps its reference
until it returns.

Signed-off-by: Yufan Chen <ericterminal@gmail.com>
Fixes: 60b6c075e8 ("io_uring/eventfd: move to more idiomatic RCU free usage")
Link: https://patch.msgid.link/20260503175710.37209-1-yufan.chen@linux.dev
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-05-03 23:21:40 -06:00
Yufan Chen
b8c2e9e276 io_uring/napi: clear tracked NAPI entries on unregister
IORING_UNREGISTER_NAPI disables NAPI busy polling, but it currently
leaves any previously tracked NAPI IDs on the ring context. The normal
wait path only checks whether the list is empty before entering the busy
poll helper, so an unregistered ring can still observe stale entries and
run an unexpected busy poll pass.

Make unregister switch the context to inactive and free the tracked
entries. Do the same inactive transition while changing the tracking
strategy, and recheck the expected tracking mode under napi_lock before
inserting a newly learned NAPI ID. This prevents a racing poll path from
repopulating the list after unregister or reconfiguration.

Also make the busy poll dispatcher ignore inactive mode explicitly.

Signed-off-by: Yufan Chen <ericterminal@gmail.com>
Fixes: 6bf90bd8c5 ("io_uring/napi: add static napi tracking strategy")
Link: https://patch.msgid.link/20260503175610.35521-1-yufan.chen@linux.dev
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-05-03 23:21:23 -06:00
Linus Torvalds
9d88bb929a io_uring-7.1-20260430
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmnz4ikQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpoecEACan9sGbcqrTAqBaJnNqMfjo5OoX0X2/3LP
 JoFH+GbLzJ7Ojj1+AWSHNsjjwhSZ71HmwRk78uWCPcl3oiKQLGyXTnbho0qKrwhk
 4cnoFfhdcBEDGuh8GW/PnBset8ukq9occ11TbojC681tmaTma1WpXFk1vRabcwvw
 T8/Jr18kttHi8aj+MPowkTcqXV7iOjzX9RD/vS97jCWBxUbAmYjRGfm3nbDbDydI
 oMEstxqp+8jiFF1SHBdq3aGreoZDIegh1nXsjobAmoEMvAJQQ3K7zRsiqFEnoXFU
 CDVoS6LhlSBmG2jT657azYzhF3o7HwSiYk2B15YiYHO+EqIxMhIQYRlP5s/3UJD8
 KLJPSYqivQ14m9yff5zjn//mad3QBxvOhrVrxHj/diIKclZDLs9VDPZjjB6A6DUO
 X01uJy7zuzp57GFh0FwyFGU3yBUl7WJGscLarIMHnOdmEWOIU3WRLWGYZRgZRUny
 1yHXxGEucR7LMiYPzh7PnGnaAzDxtJJUzIXbIF+l+A5A0f1Ayb8cfFy6QoGc7v8j
 t+vG2gbRtwPR6DxFRNhGDMeZtstEoKj0IX4zw7ZQF7MFgpvdfjlkDGCveJ0jQO6x
 pw8UJW1KOQpT9MiheOAvop5hhvGlqSYWXByluW05y7O+CDQcHiVeXcjQsG7zREsO
 +zGvO5WHEg==
 =5d8D
 -----END PGP SIGNATURE-----

Merge tag 'io_uring-7.1-20260430' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux

Pull io_uring fixes from Jens Axboe:

 - Remove dead struct io_buffer_list member

 - Fix for incrementally consumed buffers with recvmsg multishot, which
   requires a minimum value left in a buffer for any receive for the
   headers. If there's still a bit of buffer left but it's smaller than
   that value, then userspace will see a spurious -EFAULT returned in
   the CQE

 - Locking fix for the DEFER_TASKRUN retry list, which otherwise could
   race with fallback cancelations. If the task is exiting with
   task_work left in both the normal and retry list AND the exit cleanup
   races with the task running task work, then entries could either be
   doubly completed or lost

 - Cap NAPI busy poll timeout to something sane, to avoid syzbot running
   into excessive polling and triggering warnings around that

* tag 'io_uring-7.1-20260430' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux:
  io_uring/tw: serialize ctx->retry_llist with ->uring_lock
  io_uring/napi: cap busy_poll_to 10 msec
  io_uring/kbuf: support min length left for incremental buffers
  io_uring/kbuf: kill dead struct io_buffer_list 'nr_entries' member
2026-05-01 11:01:31 -07:00
Jens Axboe
17666e2d75 io_uring/tw: serialize ctx->retry_llist with ->uring_lock
The DEFER_TASKRUN local task work paths all run under ctx->uring_lock,
which serializes them with each other and with the rest of the ring's
hot paths. io_move_task_work_from_local() is the exception - it's called
from io_ring_exit_work() on a kworker without holding the lock and from
the iopoll cancelation side right after dropping it.

->work_llist is fine with this, as it's only ever updated via the
expected paths. But the ->retry_llist is updated while runing, and hence
it could potentially race between normal task_work running and the
task-has-exited shutdown path.

Simply grab ->uring_lock while moving the local work to the fallback
list for exit purposes, which nicely serializes it across both the
normal additions and the exit prune path.

Cc: stable@vger.kernel.org
Fixes: f46b9cdb22 ("io_uring: limit local tw done")
Reported-by: Robert Femmer <robert.femmer@x41-dsec.de>
Reported-by: Christian Reitter <invd@inhq.net>
Reported-by: Michael Rodler <michael.rodler@x41-dsec.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-04-30 06:57:20 -06:00
Jakub Kicinski
735a309b4b net: add net_iov_init() and use it to initialize ->page_type
Commit db359fccf2 ("mm: introduce a new page type for page pool in
page type") added a page_type field to struct net_iov at the same
offset as struct page::page_type, so that page_pool_set_pp_info() can
call __SetPageNetpp() uniformly on both pages and net_iovs.

The page-type API requires the field to hold the UINT_MAX "no type"
sentinel before a type can be set; for real struct page that invariant
is established by the page allocator on free. struct net_iov is not
allocated through the page allocator, so the field is left as zero
(io_uring zcrx, which uses __GFP_ZERO) or as slab garbage (devmem,
which uses kvmalloc_objs() without zeroing). When the page pool then
calls page_pool_set_pp_info() on a freshly-bound niov,
__SetPageNetpp()'s VM_BUG_ON_PAGE(page->page_type != UINT_MAX) fires
and the kernel BUGs. Triggered in selftests by io_uring zcrx setup
through the fbnic queue restart path:

 kernel BUG at ./include/linux/page-flags.h:1062!
 RIP: 0010:page_pool_set_pp_info (./include/linux/page-flags.h:1062
                                  net/core/page_pool.c:716)
 Call Trace:
  <TASK>
  net_mp_niov_set_page_pool (net/core/page_pool.c:1360)
  io_pp_zc_alloc_netmems (io_uring/zcrx.c:1089 io_uring/zcrx.c:1110)
  fbnic_fill_bdq (./include/net/page_pool/helpers.h:160
                  drivers/net/ethernet/meta/fbnic/fbnic_txrx.c:906)
  __fbnic_nv_restart (drivers/net/ethernet/meta/fbnic/fbnic_txrx.c:2470
                      drivers/net/ethernet/meta/fbnic/fbnic_txrx.c:2874)
  fbnic_queue_start (drivers/net/ethernet/meta/fbnic/fbnic_txrx.c:2903)
  netdev_rx_queue_reconfig (net/core/netdev_rx_queue.c:137)
  __netif_mp_open_rxq (net/core/netdev_rx_queue.c:234)
  io_register_zcrx (io_uring/zcrx.c:818 io_uring/zcrx.c:903)
  __io_uring_register (io_uring/register.c:931)
  __do_sys_io_uring_register (io_uring/register.c:1029)
  do_syscall_64 (arch/x86/entry/syscall_64.c:63
                 arch/x86/entry/syscall_64.c:94)
  </TASK>

The same path is reachable through devmem dmabuf binding via
netdev_nl_bind_rx_doit() -> net_devmem_bind_dmabuf_to_queue().

Add a net_iov_init() helper that stamps ->owner, ->type and the
->page_type sentinel, and use it from both the devmem and io_uring
zcrx niov init loops.

Fixes: db359fccf2 ("mm: introduce a new page type for page pool in page type")
Acked-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
Acked-by: Byungchul Park <byungchul@sk.com>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Acked-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://patch.msgid.link/20260428025320.853452-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-04-29 16:40:08 -07:00
Jens Axboe
df8599ee18 io_uring/napi: cap busy_poll_to 10 msec
Currently there's no cap on the maximum amount of time that napi is
allowed to poll if no events are found, which can lead to kernel
complaints on a task being stuck as there's no conditional rescheduling
done within that loop.

Just cap it to 10 msec in total, that's already way above any kind of
sane value that will reap any benefits, yet low enough that it's
nowhere near being able to trigger preemption complaints.

Fixes: 8d0c12a80c ("io-uring: add napi busy poll support")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-04-28 16:09:02 -06:00
Martin Michaelis
7deba791ad io_uring/kbuf: support min length left for incremental buffers
Incrementally consumed buffer rings are generally fully consumed, but
it's quite possible that the application has a minimum size it needs to
meet to avoid truncation. Currently that minimum limit is 1 byte, but
this should be a setting that is the hands of the application. For
recvmsg multishot, a prime use case for incrementally consumed buffers,
the application may get spurious -EFAULT returned at the end of an
incrementally consumed buffer, as less space is available than the
headers need.

Grab a u32 field in struct io_uring_buf_reg, which the application can
use to inform the kernel of the minimum size that should be available
in an incrementally consumed buffer. If less than that is available,
the current buffer is fully processed and the next one will be picked.

Cc: stable@vger.kernel.org
Fixes: ae98dbf43d ("io_uring/kbuf: add support for incremental buffer consumption")
Link: https://github.com/axboe/liburing/issues/1433
Signed-off-by: Martin Michaelis <code@mgjm.de>
[axboe: write commit message, change io_buffer_list member name]
Reviewed-by: Gabriel Krisman Bertazi <krisman@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-04-28 16:08:56 -06:00
Jens Axboe
55ea968389 io_uring/kbuf: kill dead struct io_buffer_list 'nr_entries' member
This is only ever assigned, never used. The only used part is the
calculated mask, which is used for indexing. Kill 'nr_entries'.

Reviewed-by: Gabriel Krisman Bertazi <krisman@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-04-28 16:08:44 -06:00
Greg Kroah-Hartman
d0be8884f5 io_uring: take page references for NOMMU pbuf_ring mmaps
Under !CONFIG_MMU, io_uring_get_unmapped_area() returns the kernel
virtual address of the io_mapped_region's backing pages directly;
the user's VMA aliases the kernel allocation. io_uring_mmap() then
just returns 0 -- it takes no page references.

The CONFIG_MMU path uses vm_insert_pages(), which takes a reference on
each inserted page.  Those references are released when the VMA is torn
down (zap_pte_range -> put_page). io_free_region() -> release_pages()
drops the io_uring-side references, but the pages survive until munmap
drops the VMA-side references.

Under NOMMU there are no VMA-side references. io_unregister_pbuf_ring ->
io_put_bl -> io_free_region -> release_pages drops the only references
and the pages return to the buddy allocator while the user's VMA still
has vm_start pointing into them.  The user can then write into whatever
the allocator hands out next.

Mirror the MMU lifetime: take get_page references in io_uring_mmap() and
release them via vm_ops->close.  NOMMU's delete_vma() calls vma_close()
which runs ->close on munmap.

This also incidentally addresses the duplicate-vm_start case: two mmaps
of SQ_RING and CQ_RING resolve to the same ctx->ring_region pointer.
With page refs taken per mmap, the second mmap takes its own refs and
the pages survive until both mmaps are closed.  The nommu rb-tree BUG_ON
on duplicate vm_start is a separate mm/nommu.c concern (it should share
the existing region rather than BUG), but the page lifetime is now
correct.

Cc: Jens Axboe <axboe@kernel.dk>
Reported-by: Anthropic
Assisted-by: gkh_clanker_t1000
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Link: https://patch.msgid.link/2026042115-body-attention-d15b@gregkh
[axboe: get rid of region lookup, just iterate pages in vma]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-04-21 20:14:39 -06:00
Jens Axboe
1967f0b1ca io_uring/poll: ensure EPOLL_ONESHOT is propagated for EPOLL_URING_WAKE
Commit:

aacf2f9f38 ("io_uring: fix req->apoll_events")

fixed an issue where poll->events and req->apoll_events weren't
synchronized, but then when the commit referenced in Fixes got added,
it didn't ensure the same thing.

If we mask in EPOLLONESHOT in the regular EPOLL_URING_WAKE path, then
ensure it's done for both. Including a link to the original report
below, even though it's mostly nonsense. But it includes a reproducer
that does show that IORING_CQE_F_MORE is set in the previous CQE,
while no more CQEs will be generated for this request. Just ignore
anything that pretends this is security related in any way, it's just
the typical AI nonsense.

Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/io-uring/CAM0zi7yQzF3eKncgHo4iVM5yFLAjsiob_ucqyWKs=hyd_GqiMg@mail.gmail.com/
Reported-by: Azizcan Daştan <azizcan.d@mileniumsec.com>
Fixes: 4464853277 ("io_uring: pass in EPOLL_URING_WAKE for eventfd signaling and wakeups")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-04-21 19:18:34 -06:00
Pavel Begunkov
770594e78c io_uring/zcrx: warn on freelist violations
The freelist is appropriately sized to always be able to take a free
niov, but let's be more defensive and check the invariant with a
warning. That should help to catch any double-free issues.

Suggested-by: Kai Aizen <kai@snailsploit.com>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://patch.msgid.link/2f3cea363b04649755e3b6bb9ab66485a95936d5.1776760901.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-04-21 12:19:11 -06:00
Pavel Begunkov
4f02cc4071 io_uring/zcrx: clear RQ headers on init
It might be unexpected to users if the RQ head/tail after a ring
creation are not zeroed, fix that.

Cc: stable@vger.kernel.org
Fixes: 6f377873cb ("io_uring/zcrx: add interface queue and refill queue")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://patch.msgid.link/331f94663c3e8f021ffa3cb770ca2844a07d4855.1776760911.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-04-21 12:19:11 -06:00
Pavel Begunkov
0fcccfd871 io_uring/zcrx: fix user_struct uaf
io_free_rbuf_ring() usees a struct user_struct, which
io_zcrx_ifq_free() puts it down before destroying the ring.

Cc: stable@vger.kernel.org
Fixes: 5c686456a4 ("io_uring/zcrx: add user_struct and mm_struct to io_zcrx_ifq")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://patch.msgid.link/e560ae00960d27a810522a7efc0e201c82dff351.1776760917.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-04-21 12:19:11 -06:00
Jens Axboe
45cd95763e io_uring/register: fix ring resizing with mixed/large SQEs/CQEs
The ring resizing only properly handles "normal" sized SQEs or CQEs, if
there are pending entries around a resize. This normally should not be
the case, but the code is supposed to handle this regardless.

For the mixed SQE/CQE cases, the current copying works fine as they
are indexed in the same way. Each half is just copied separately. But
for fixed large SQEs and CQEs, the iteration and copy need to take that
into account.

Cc: stable@kernel.org
Fixes: 79cfe9e59c ("io_uring/register: add IORING_REGISTER_RESIZE_RINGS")
Reviewed-by: Gabriel Krisman Bertazi <krisman@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-04-21 12:19:08 -06:00
Jens Axboe
7faaa6812a io_uring/futex: ensure partial wakes are appropriately dequeued
If a FUTEX_WAITV vectored operation is only partially woken, we
should call __futex_wake_mark() on the queue to account for that.
If not, then a later wakeup will wake the same entry, rather than
the next one in line.

Fixes: 8f350194d5 ("io_uring: add support for vectored futex waits")
Reviewed-by: Gabriel Krisman Bertazi <krisman@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-04-21 12:19:06 -06:00
Jens Axboe
7996883455 io_uring/rw: add defensive hardening for negative kbuf lengths
No real bug here, just being a bit defensive in ensuring that whatever
gets passed into io_put_kbuf() is always >= 0 and not some random error
value.

Reviewed-by: Gabriel Krisman Bertazi <krisman@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-04-21 12:19:03 -06:00
Jens Axboe
02b8d41c17 io_uring/rsrc: use kvfree() for the imu cache
Currently anything that requires kvmalloc_flex() for allocations will
not get re-cached, and hence the cache freeing path is correct in that
it always uses kfree() to free the allocated memory. But this seems a
bit fragile as it's something that could get mix should that situation
change, so switch io_free_imu() and io_alloc_cache_free() to use kvfree
as the desctructor.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-04-21 12:19:01 -06:00
Jens Axboe
53262c91f7 io_uring/rsrc: unify nospec indexing for direct descriptors
For file updates, the node reset isn't capping the value via
array_index_nospec() like the other paths do. Ensure it's all sane and
have the update path do the proper capping as well.

Reviewed-by: Gabriel Krisman Bertazi <krisman@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-04-21 12:18:54 -06:00
Jens Axboe
8e1f412b5b io_uring: fix spurious fput in registered ring path
Fix an issue with io_uring_ctx_get_file() not gating fput() on whether
or not the file descriptor is a registered/direct one or not.

Fixes: c5e9f6a96b ("io_uring: unify getting ctx from passed in file descriptor")
Reviewed-by: Gabriel Krisman Bertazi <krisman@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-04-21 12:18:44 -06:00
Jens Axboe
42a702aaed io_uring: fix iowq_limits data race in tctx node addition
__io_uring_add_tctx_node() reads ctx->int_flags and
ctx->iowq_limits[0..1] without holding ctx->uring_lock, while
io_register_iowq_max_workers() writes these same fields under the lock.

Mostly an application problem if you try and make these race, but let's
silence KCSAN by just grabbing the ->uring_lock around the operation.
This is a slow path operation anyway, and ->uring_lock will be grabbed
by submission right after anyway.

Fixes: 2e480058dd ("io-wq: provide a way to limit max number of workers")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-04-20 14:57:21 -06:00
Jens Axboe
41859843f2 io_uring/tctx: mark io_wq as exiting before error path teardown
syzbot reports that it's hitting the below condition for exiting an
io_wq context:

WARN_ON_ONCE(!test_bit(IO_WQ_BIT_EXIT, &wq->state))

in io_wq_put_and_exit(), which can be triggered with memory allocation
fault injection. Ensure that the io_wq is marked as exiting to silence
this warning trigger.

Reported-by: syzbot+79a4cc863a8db58cd92b@syzkaller.appspotmail.com
Fixes: 7880174e1e ("io_uring/tctx: clean up __io_uring_add_tctx_node() error handling")
Reviewed-by: Clément Léger <cleger@meta.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-04-20 14:47:37 -06:00
Jens Axboe
ee5417fd02 io_uring/tctx: check for setup tctx->io_wq before teardown
As with the idling code before it, the error exit path should check for
a NULL tctx->io_wq before calling io_wq_put_and_exit().

Fixes: 7880174e1e ("io_uring/tctx: clean up __io_uring_add_tctx_node() error handling")
Reported-by: Dan Carpenter <error27@gmail.com>
Reviewed-by: Clément Léger <cleger@meta.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-04-20 14:47:29 -06:00
Longxuan Yu
326941b228 io_uring/poll: fix signed comparison in io_poll_get_ownership()
io_poll_get_ownership() uses a signed comparison to check whether
poll_refs has reached the threshold for the slowpath:

    if (unlikely(atomic_read(&req->poll_refs) >= IO_POLL_REF_BIAS))

atomic_read() returns int (signed). When IO_POLL_CANCEL_FLAG
(BIT(31)) is set in poll_refs, the value becomes negative in
signed arithmetic, so the >= 128 comparison always evaluates to
false and the slowpath is never taken.

Fix this by casting the atomic_read() result to unsigned int
before the comparison, so that the cancel flag is treated as a
large positive value and correctly triggers the slowpath.

Fixes: a26a35e901 ("io_uring: make poll refs more robust")
Cc: stable@vger.kernel.org
Reported-by: Yifan Wu <yifanwucs@gmail.com>
Reported-by: Juefei Pu <tomapufckgml@gmail.com>
Co-developed-by: Yuan Tan <yuantan098@gmail.com>
Signed-off-by: Yuan Tan <yuantan098@gmail.com>
Suggested-by: Xin Liu <bird@lzu.edu.cn>
Tested-by: Zhengchuan Liang <zcliangcn@gmail.com>
Signed-off-by: Longxuan Yu <ylong030@ucr.edu>
Signed-off-by: Ren Wei <n05ec@lzu.edu.cn>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://patch.msgid.link/3a3508b08bcd7f1bc3beff848ae6e1d73d355043.1775965597.git.ylong030@ucr.edu
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-04-15 14:07:47 -06:00
Linus Torvalds
91a4855d6c Networking changes for 7.1.
Core & protocols
 ----------------
 
  - Support HW queue leasing, allowing containers to be granted access
    to HW queues for zero-copy operations and AF_XDP.
 
  - Number of code moves to help the compiler with inlining.
    Avoid output arguments for returning drop reason where possible.
 
  - Rework drop handling within qdiscs to include more metadata
    about the reason and dropping qdisc in the tracepoints.
 
  - Remove the rtnl_lock use from IP Multicast Routing.
 
  - Pack size information into the Rx Flow Steering table pointer
    itself. This allows making the table itself a flat array of u32s,
    thus making the table allocation size a power of two.
 
  - Report TCP delayed ack timer information via socket diag.
 
  - Add ip_local_port_step_width sysctl to allow distributing the randomly
    selected ports more evenly throughout the allowed space.
 
  - Add support for per-route tunsrc in IPv6 segment routing.
 
  - Start work of switching sockopt handling to iov_iter.
 
  - Improve dynamic recvbuf sizing in MPTCP, limit burstiness and avoid
    buffer size drifting up.
 
  - Support MSG_EOR in MPTCP.
 
  - Add stp_mode attribute to the bridge driver for STP mode selection.
    This addresses concerns about call_usermodehelper() usage.
 
  - Remove UDP-Lite support (as announced in 2023).
 
  - Remove support for building IPv6 as a module.
    Remove the now unnecessary function calling indirection.
 
 Cross-tree stuff
 ----------------
 
  - Move Michael MIC code from generic crypto into wireless,
    it's considered insecure but some WiFi networks still need it.
 
 Netfilter
 ---------
 
  - Switch nft_fib_ipv6 module to no longer need temporary dst_entry
    object allocations by using fib6_lookup() + RCU.
    Florian W reports this gets us ~13% higher packet rate.
 
  - Convert IPVS's global __ip_vs_mutex to per-net service_mutex and
    switch the service tables to be per-net. Convert some code that
    walks the service lists to use RCU instead of the service_mutex.
 
  - Add more opinionated input validation to lower security exposure.
 
  - Make IPVS hash tables to be per-netns and resizable.
 
 Wireless
 --------
 
  - Finished assoc frame encryption/EPPKE/802.1X-over-auth.
 
  - Radar detection improvements.
 
  - Add 6 GHz incumbent signal detection APIs.
 
  - Multi-link support for FILS, probe response templates and
    client probing.
 
  - New APIs and mac80211 support for NAN (Neighbor Aware Networking,
    aka Wi-Fi Aware) so less work must be in firmware.
 
 Driver API
 ----------
 
  - Add numerical ID for devlink instances (to avoid having to create
    fake bus/device pairs just to have an ID). Support shared devlink
    instances which span multiple PFs.
 
  - Add standard counters for reporting pause storm events
    (implement in mlx5 and fbnic).
 
  - Add configuration API for completion writeback buffering
    (implement in mana).
 
  - Support driver-initiated change of RSS context sizes.
 
  - Support DPLL monitoring input frequency (implement in zl3073x).
 
  - Support per-port resources in devlink (implement in mlx5).
 
 Misc
 ----
 
  - Expand the YAML spec for Netfilter.
 
 Drivers
 -------
 
  - Software:
    - macvlan: support multicast rx for bridge ports with shared source
      MAC address
    - team: decouple receive and transmit enablement for IEEE 802.3ad
      LACP "independent control"
 
  - Ethernet high-speed NICs:
    - nVidia/Mellanox:
      - support high order pages in zero-copy mode (for payload
        coalescing)
      - support multiple packets in a page (for systems with 64kB pages)
    - Broadcom 25-400GE (bnxt):
      - implement XDP RSS hash metadata extraction
      - add software fallback for UDP GSO, lowering the IOMMU cost
    - Broadcom 800GE (bnge):
      - add link status and configuration handling
      - add various HW and SW statistics
    - Marvell/Cavium:
      - NPC HW block support for cn20k
    - Huawei (hinic3):
      - add mailbox / control queue
      - add rx VLAN offload
      - add driver info and link management
 
  - Ethernet NICs:
    - Marvell/Aquantia:
      - support reading SFP module info on some AQC100 cards
    - Realtek PCI (r8169):
      - add support for RTL8125cp
    - Realtek USB (r8152):
      - support for the RTL8157 5Gbit chip
      - add 2500baseT EEE status/configuration support
 
  - Ethernet NICs embedded and off-the-shelf IP:
    - Synopsys (stmmac):
      - cleanup and reorganize SerDes handling and PCS support
      - cleanup descriptor handling and per-platform data
      - cleanup and consolidate MDIO defines and handling
      - shrink driver memory use for internal structures
      - improve Tx IRQ coalescing
      - improve TCP segmentation handling
      - add support for Spacemit K3
    - Cadence (macb):
      - support PHYs that have inband autoneg disabled with GEM
      - support IEEE 802.3az EEE
      - rework usrio capabilities and handling
    - AMD (xgbe):
      - improve power management for S0i3
      - improve TX resilience for link-down handling
 
  - Virtual:
    - Google cloud vNIC:
      - support larger ring sizes in DQO-QPL mode
      - improve HW-GRO handling
      - support UDP GSO for DQO format
    - PCIe NTB:
      - support queue count configuration
 
  - Ethernet PHYs:
    - automatically disable PHY autonomous EEE if MAC is in charge
    - Broadcom:
      - add BCM84891/BCM84892 support
    - Micrel:
      - support for LAN9645X internal PHY
    - Realtek:
      - add RTL8224 pair order support
      - support PHY LEDs on RTL8211F-VD
      - support spread spectrum clocking (SSC)
    - Maxlinear:
      - add PHY-level statistics via ethtool
 
  - Ethernet switches:
    - Maxlinear (mxl862xx):
      - support for bridge offloading
      - support for VLANs
      - support driver statistics
 
  - Bluetooth:
    - large number of fixes and new device IDs
    - Mediatek:
      - support MT6639 (MT7927)
      - support MT7902 SDIO
 
  - WiFi:
    - Intel (iwlwifi):
      - UNII-9 and continuing UHR work
    - MediaTek (mt76):
      - mt7996/mt7925 MLO fixes/improvements
      - mt7996 NPU support (HW eth/wifi traffic offload)
    - Qualcomm (ath12k):
      - monitor mode support on IPQ5332
      - basic hwmon temperature reporting
      - support IPQ5424
    - Realtek:
      - add USB RX aggregation to improve performance
      - add USB TX flow control by tracking in-flight URBs
 
  - Cellular:
    - IPA v5.2 support
 
 Signed-off-by: Jakub Kicinski <kuba@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmnelNoACgkQMUZtbf5S
 IrtWFw//WyiXuEiGawVQONnbu1dtR+3nw/cvNpSYi0IM66vbRUB9n+9fxm2MIyG4
 4jI/c/X/fxIvUxEqGez3yPn5P7KqkQR8WRYwkxrMYKRpXeukN0IDk5Euew5DskCe
 wtBKNJOQWKdKXff0bLQoJ9dHWYuJ2IMRVil5M3fhUbeUOXeyJD7Yn1w2ICvJAbj+
 T/Hw7sEtchNaHp6h6SbaQfahkUFHQG5peNoETkZF4UDF6ALGY29WH91GXeO2lrgN
 IxX203KtaavV0oU8T0oixZgOc57Ns081YfFL/F1JP2HV6lgkwhuq+zxCrRTi1c9M
 HPTXgwD7Z80Y74nM3YTLrPfoMOP8GLBZgdV3rUpwmteM26+gMTm+O1zHUur5ZoGy
 D6TaMFguPTIqiRyrARa9xY/J6r9TQkc2Wfu4bIuPndKFg8xPoepuEObODnh0+5Hg
 4j4pdFhIo2huENhSg7kVb/yl+1q68SFwM3RqTmx+OhCa0AyjcKIKgt/UBhismdnG
 r8obxzb+nXeJc2rRDuwNMwlBlcMSbep27uGt64zeHMMXVhTVqOoytNaL/X/ZpH2m
 A0DscUrpHvb36IoDPtanc6irP+JOh5Xe7Nw5qhkgwsMc7hlf8SyyHB4OUBBaz1qA
 ETSnHlfwklRmXSpWqH2LyGXjdOQpDKP46+h0W3dttMD2/cRBqYo=
 =EhQZ
 -----END PGP SIGNATURE-----

Merge tag 'net-next-7.1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next

Pull networking updates from Jakub Kicinski:
 "Core & protocols:

   - Support HW queue leasing, allowing containers to be granted access
     to HW queues for zero-copy operations and AF_XDP

   - Number of code moves to help the compiler with inlining. Avoid
     output arguments for returning drop reason where possible

   - Rework drop handling within qdiscs to include more metadata about
     the reason and dropping qdisc in the tracepoints

   - Remove the rtnl_lock use from IP Multicast Routing

   - Pack size information into the Rx Flow Steering table pointer
     itself. This allows making the table itself a flat array of u32s,
     thus making the table allocation size a power of two

   - Report TCP delayed ack timer information via socket diag

   - Add ip_local_port_step_width sysctl to allow distributing the
     randomly selected ports more evenly throughout the allowed space

   - Add support for per-route tunsrc in IPv6 segment routing

   - Start work of switching sockopt handling to iov_iter

   - Improve dynamic recvbuf sizing in MPTCP, limit burstiness and avoid
     buffer size drifting up

   - Support MSG_EOR in MPTCP

   - Add stp_mode attribute to the bridge driver for STP mode selection.
     This addresses concerns about call_usermodehelper() usage

   - Remove UDP-Lite support (as announced in 2023)

   - Remove support for building IPv6 as a module. Remove the now
     unnecessary function calling indirection

  Cross-tree stuff:

   - Move Michael MIC code from generic crypto into wireless, it's
     considered insecure but some WiFi networks still need it

  Netfilter:

   - Switch nft_fib_ipv6 module to no longer need temporary dst_entry
     object allocations by using fib6_lookup() + RCU.

     Florian W reports this gets us ~13% higher packet rate

   - Convert IPVS's global __ip_vs_mutex to per-net service_mutex and
     switch the service tables to be per-net. Convert some code that
     walks the service lists to use RCU instead of the service_mutex

   - Add more opinionated input validation to lower security exposure

   - Make IPVS hash tables to be per-netns and resizable

  Wireless:

   - Finished assoc frame encryption/EPPKE/802.1X-over-auth

   - Radar detection improvements

   - Add 6 GHz incumbent signal detection APIs

   - Multi-link support for FILS, probe response templates and client
     probing

   - New APIs and mac80211 support for NAN (Neighbor Aware Networking,
     aka Wi-Fi Aware) so less work must be in firmware

  Driver API:

   - Add numerical ID for devlink instances (to avoid having to create
     fake bus/device pairs just to have an ID). Support shared devlink
     instances which span multiple PFs

   - Add standard counters for reporting pause storm events (implement
     in mlx5 and fbnic)

   - Add configuration API for completion writeback buffering (implement
     in mana)

   - Support driver-initiated change of RSS context sizes

   - Support DPLL monitoring input frequency (implement in zl3073x)

   - Support per-port resources in devlink (implement in mlx5)

  Misc:

   - Expand the YAML spec for Netfilter

  Drivers

   - Software:
      - macvlan: support multicast rx for bridge ports with shared
        source MAC address
      - team: decouple receive and transmit enablement for IEEE 802.3ad
        LACP "independent control"

   - Ethernet high-speed NICs:
      - nVidia/Mellanox:
         - support high order pages in zero-copy mode (for payload
           coalescing)
         - support multiple packets in a page (for systems with 64kB
           pages)
      - Broadcom 25-400GE (bnxt):
         - implement XDP RSS hash metadata extraction
         - add software fallback for UDP GSO, lowering the IOMMU cost
      - Broadcom 800GE (bnge):
         - add link status and configuration handling
         - add various HW and SW statistics
      - Marvell/Cavium:
         - NPC HW block support for cn20k
      - Huawei (hinic3):
         - add mailbox / control queue
         - add rx VLAN offload
         - add driver info and link management

   - Ethernet NICs:
      - Marvell/Aquantia:
         - support reading SFP module info on some AQC100 cards
      - Realtek PCI (r8169):
         - add support for RTL8125cp
      - Realtek USB (r8152):
         - support for the RTL8157 5Gbit chip
         - add 2500baseT EEE status/configuration support

   - Ethernet NICs embedded and off-the-shelf IP:
      - Synopsys (stmmac):
         - cleanup and reorganize SerDes handling and PCS support
         - cleanup descriptor handling and per-platform data
         - cleanup and consolidate MDIO defines and handling
         - shrink driver memory use for internal structures
         - improve Tx IRQ coalescing
         - improve TCP segmentation handling
         - add support for Spacemit K3
      - Cadence (macb):
         - support PHYs that have inband autoneg disabled with GEM
         - support IEEE 802.3az EEE
         - rework usrio capabilities and handling
      - AMD (xgbe):
         - improve power management for S0i3
         - improve TX resilience for link-down handling

   - Virtual:
      - Google cloud vNIC:
         - support larger ring sizes in DQO-QPL mode
         - improve HW-GRO handling
         - support UDP GSO for DQO format
      - PCIe NTB:
         - support queue count configuration

   - Ethernet PHYs:
      - automatically disable PHY autonomous EEE if MAC is in charge
      - Broadcom:
         - add BCM84891/BCM84892 support
      - Micrel:
         - support for LAN9645X internal PHY
      - Realtek:
         - add RTL8224 pair order support
         - support PHY LEDs on RTL8211F-VD
         - support spread spectrum clocking (SSC)
      - Maxlinear:
         - add PHY-level statistics via ethtool

   - Ethernet switches:
      - Maxlinear (mxl862xx):
         - support for bridge offloading
         - support for VLANs
         - support driver statistics

   - Bluetooth:
      - large number of fixes and new device IDs
      - Mediatek:
         - support MT6639 (MT7927)
         - support MT7902 SDIO

   - WiFi:
      - Intel (iwlwifi):
         - UNII-9 and continuing UHR work
      - MediaTek (mt76):
         - mt7996/mt7925 MLO fixes/improvements
         - mt7996 NPU support (HW eth/wifi traffic offload)
      - Qualcomm (ath12k):
         - monitor mode support on IPQ5332
         - basic hwmon temperature reporting
         - support IPQ5424
      - Realtek:
         - add USB RX aggregation to improve performance
         - add USB TX flow control by tracking in-flight URBs

   - Cellular:
      - IPA v5.2 support"

* tag 'net-next-7.1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1561 commits)
  net: pse-pd: fix kernel-doc function name for pse_control_find_by_id()
  wireguard: device: use exit_rtnl callback instead of manual rtnl_lock in pre_exit
  wireguard: allowedips: remove redundant space
  tools: ynl: add sample for wireguard
  wireguard: allowedips: Use kfree_rcu() instead of call_rcu()
  MAINTAINERS: Add netkit selftest files
  selftests/net: Add additional test coverage in nk_qlease
  selftests/net: Split netdevsim tests from HW tests in nk_qlease
  tools/ynl: Make YnlFamily closeable as a context manager
  net: airoha: Add missing PPE configurations in airoha_ppe_hw_init()
  net: airoha: Fix VIP configuration for AN7583 SoC
  net: caif: clear client service pointer on teardown
  net: strparser: fix skb_head leak in strp_abort_strp()
  net: usb: cdc-phonet: fix skb frags[] overflow in rx_complete()
  selftests/bpf: add test for xdp_master_redirect with bond not up
  net, bpf: fix null-ptr-deref in xdp_master_redirect() for down master
  net: airoha: Remove PCE_MC_EN_MASK bit in REG_FE_PCE_CFG configuration
  sctp: disable BH before calling udp_tunnel_xmit_skb()
  sctp: fix missing encap_port propagation for GSO fragments
  net: airoha: Rely on net_device pointer in ETS callbacks
  ...
2026-04-14 18:36:10 -07:00
Linus Torvalds
23acda7c22 for-7.1/io_uring-20260411
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmna0vIQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpu8MEACN6owH/1suaJp5HBhrKseVIPQl1ldmsGF3
 ZDwZndUE6pWXaeuI3g5QjSPcfWIUuLG6vs/btkIh4M32zAcFsSD8zYPItvgFzMVp
 X762WPCrUcfFwKt5GqeNn6IblO8BrsbzoJWNCaSVRhWqCdzQRVktq6684nNy/fj1
 JBFnMsRpwGhoKzpg1oCLOrs0V57CRdJqFdmMzQHwRTWHemvfHf6SD2+h9axfKCaV
 baqvXGOLQXLwr8qHFo1LIu8lqEltHUa7boU8EMFQn/v8sPjUv46EuqZ8VVtzXH08
 fY2zqWI5atA3DZCfORCHnK0qh6tPiSUtVUilXbIffhqd6lCTs891RJf3TegRCGTZ
 k8WfBFVKzVlhbgGk0Km6+tiHTaK1ZmcKU0Q+uucnb3RlOdOoPvXJy3u+I5BK74aV
 36JmNPWRQfzh5icmrrGKySBTX0z7NPtMiEA+qHEndIO5FWrkf5pf9U5C5gu0WEMh
 iK2gotbd0Vym3EpqKQnefxflce6IpYteOACeYPXAprcQOzPK+WYjiVUJ9JcH6DhP
 RPUIXXck8+GkHnM9vWtBXBKaoR7gcATHUzLX8ZnhDkAhsTJ+tOXN8skq28gglUtj
 8kLMzyXklbhAJsykxKn0rqcNUOcVMatFyK4VIFyp2tWRhzMDAY4xyXYSz0lRowkd
 pZAm4eSkmw==
 =IoaB
 -----END PGP SIGNATURE-----

Merge tag 'for-7.1/io_uring-20260411' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux

Pull io_uring updates from Jens Axboe:

 - Add a callback driven main loop for io_uring, and BPF struct_ops
   on top to allow implementing custom event loop logic

 - Decouple IOPOLL from being a ring-wide all-or-nothing setting,
   allowing IOPOLL use cases to also issue certain white listed
   non-polled opcodes

 - Timeout improvements. Migrate internal timeout storage from
   timespec64 to ktime_t for simpler arithmetic and avoid copying of
   timespec data

 - Zero-copy receive (zcrx) updates:

      - Add a device-less mode (ZCRX_REG_NODEV) for testing and
        experimentation where data flows through the copy fallback path

      - Fix two-step unregistration regression, DMA length calculations,
        xarray mark usage, and a potential 32-bit overflow in id
        shifting

      - Refactoring toward multi-area support: dedicated refill queue
        struct, consolidated DMA syncing, netmem array refilling format,
        and guard-based locking

 - Zero-copy transmit (zctx) cleanup:

      - Unify io_send_zc() and io_sendmsg_zc() into a single function

      - Add vectorized registered buffer send for IORING_OP_SEND_ZC

      - Add separate notification user_data via sqe->addr3 so
        notification and completion CQEs can be distinguished without
        extra reference counting

 - Switch struct io_ring_ctx internal bitfields to explicit flag bits
   with atomic-safe accessors, and annotate the known harmless races on
   those flags

 - Various optimizations caching ctx and other request fields in local
   variables to avoid repeated loads, and cleanups for tctx setup, ring
   fd registration, and read path early returns

* tag 'for-7.1/io_uring-20260411' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux: (58 commits)
  io_uring: unify getting ctx from passed in file descriptor
  io_uring/register: don't get a reference to the registered ring fd
  io_uring/tctx: clean up __io_uring_add_tctx_node() error handling
  io_uring/tctx: have io_uring_alloc_task_context() return tctx
  io_uring/timeout: use 'ctx' consistently
  io_uring/rw: clean up __io_read() obsolete comment and early returns
  io_uring/zcrx: use correct mmap off constants
  io_uring/zcrx: use dma_len for chunk size calculation
  io_uring/zcrx: don't clear not allocated niovs
  io_uring/zcrx: don't use mark0 for allocating xarray
  io_uring: cast id to u64 before shifting in io_allocate_rbuf_ring()
  io_uring/zcrx: reject REG_NODEV with large rx_buf_size
  io_uring/cancel: validate opcode for IORING_ASYNC_CANCEL_OP
  io_uring/rsrc: use io_cache_free() to free node
  io_uring/zcrx: rename zcrx [un]register functions
  io_uring/zcrx: check ctrl op payload struct sizes
  io_uring/zcrx: cache fallback availability in zcrx ctx
  io_uring/zcrx: warn on a repeated area append
  io_uring/zcrx: consolidate dma syncing
  io_uring/zcrx: netmem array as refiling format
  ...
2026-04-13 16:22:30 -07:00
Linus Torvalds
ef3da345cc vfs-7.1-rc1.misc
Please consider pulling these changes from the signed vfs-7.1-rc1.misc tag.
 
 Thanks!
 Christian
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCadjZCwAKCRCRxhvAZXjc
 ohhBAQCAmQMlMRAXAgUZFYMTZpeQlcujP5rv+/vT2Tf/xS76YwD/dRDaw1FH294+
 qtk/Z1NjleNixzE2sld1K9J32NxeyAc=
 =+g9q
 -----END PGP SIGNATURE-----

Merge tag 'vfs-7.1-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull misc vfs updates from Christian Brauner:
 "Features:
   - coredump: add tracepoint for coredump events
   - fs: hide file and bfile caches behind runtime const machinery

  Fixes:
   - fix architecture-specific compat_ftruncate64 implementations
   - dcache: Limit the minimal number of bucket to two
   - fs/omfs: reject s_sys_blocksize smaller than OMFS_DIR_START
   - fs/mbcache: cancel shrink work before destroying the cache
   - dcache: permit dynamic_dname()s up to NAME_MAX

  Cleanups:
   - remove or unexport unused fs_context infrastructure
   - trivial ->setattr cleanups
   - selftests/filesystems: Assume that TIOCGPTPEER is defined
   - writeback: fix kernel-doc function name mismatch for wb_put_many()
   - autofs: replace manual symlink buffer allocation in autofs_dir_symlink
   - init/initramfs.c: trivial fix: FSM -> Finite-state machine
   - fs: remove stale and duplicate forward declarations
   - readdir: Introduce dirent_size()
   - fs: Replace user_access_{begin/end} by scoped user access
   - kernel: acct: fix duplicate word in comment
   - fs: write a better comment in step_into() concerning .mnt assignment
   - fs: attr: fix comment formatting and spelling issues"

* tag 'vfs-7.1-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (28 commits)
  dcache: permit dynamic_dname()s up to NAME_MAX
  fs: attr: fix comment formatting and spelling issues
  fs: hide file and bfile caches behind runtime const machinery
  fs: write a better comment in step_into() concerning .mnt assignment
  proc: rename proc_notify_change to proc_setattr
  proc: rename proc_setattr to proc_nochmod_setattr
  affs: rename affs_notify_change to affs_setattr
  adfs: rename adfs_notify_change to adfs_setattr
  hfs: update comments on hfs_inode_setattr
  kernel: acct: fix duplicate word in comment
  fs: Replace user_access_{begin/end} by scoped user access
  readdir: Introduce dirent_size()
  coredump: add tracepoint for coredump events
  fs: remove do_sys_truncate
  fs: pass on FTRUNCATE_* flags to do_truncate
  fs: fix archiecture-specific compat_ftruncate64
  fs: remove stale and duplicate forward declarations
  init/initramfs.c: trivial fix: FSM -> Finite-state machine
  autofs: replace manual symlink buffer allocation in autofs_dir_symlink
  fs/mbcache: cancel shrink work before destroying the cache
  ...
2026-04-13 14:20:11 -07:00
Jakub Kicinski
1508922588 Merge branch 'netkit-support-for-io_uring-zero-copy-and-af_xdp'
Daniel Borkmann says:

====================
netkit: Support for io_uring zero-copy and AF_XDP

Containers use virtual netdevs to route traffic from a physical netdev
in the host namespace. They do not have access to the physical netdev
in the host and thus can't use memory providers or AF_XDP that require
reconfiguring/restarting queues in the physical netdev.

This patchset adds the concept of queue leasing to virtual netdevs that
allow containers to use memory providers and AF_XDP at native speed.
Leased queues are bound to a real queue in a physical netdev and act
as a proxy.

Memory providers and AF_XDP operations take an ifindex and queue id,
so containers would pass in an ifindex for a virtual netdev and a queue
id of a leased queue, which then gets proxied to the underlying real
queue.

We have implemented support for this concept in netkit and tested the
latter against Nvidia ConnectX-6 (mlx5) as well as Broadcom BCM957504
(bnxt_en) 100G NICs. For more details see the individual patches.
====================

Link: https://patch.msgid.link/20260402231031.447597-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-04-09 18:24:35 -07:00
David Wei
222b5566a0 net: Proxy netdev_queue_get_dma_dev for leased queues
Extend netdev_queue_get_dma_dev to return the physical device of the
real rxq for DMA in case the queue was leased. This allows memory
providers like io_uring zero-copy or devmem to bind to the physically
leased rxq via virtual devices such as netkit.

Signed-off-by: David Wei <dw@davidwei.uk>
Co-developed-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://patch.msgid.link/20260402231031.447597-8-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-04-09 18:21:46 -07:00
Daniel Borkmann
1e91c98bc9 net: Slightly simplify net_mp_{open,close}_rxq
net_mp_open_rxq is currently not used in the tree as all callers are
using __net_mp_open_rxq directly, and net_mp_close_rxq is only used
once while all other locations use __net_mp_close_rxq.

Consolidate into a single API, netif_mp_{open,close}_rxq, using the
netif_ prefix to indicate that the caller is responsible for locking.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Co-developed-by: David Wei <dw@davidwei.uk>
Signed-off-by: David Wei <dw@davidwei.uk>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://patch.msgid.link/20260402231031.447597-6-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-04-09 18:21:46 -07:00
Jens Axboe
c5e9f6a96b io_uring: unify getting ctx from passed in file descriptor
io_uring_enter() and io_uring_register() end up having duplicated code
for getting a ctx from a passed in file descriptor, for either a
registered ring descriptor or a normal file descriptor. Move the
io_uring_register_get_file() into io_uring.c and name it a bit more
generically, and use it from both callsites rather than have that logic
and handling duplicated.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-04-08 13:21:35 -06:00
Jens Axboe
b4d893d636 io_uring/register: don't get a reference to the registered ring fd
This isn't necessary and was only done because the register path isn't a
hot path and hence the extra ref/put doesn't matter, and to have the
exit path be able to unconditionally put whatever file was gotten
regardless of the type.

In preparation for sharing this code with the main io_uring_enter(2)
syscall, drop the reference and have the caller conditionally put the
file if it was a normal file descriptor.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-04-08 13:21:35 -06:00
Jens Axboe
7880174e1e io_uring/tctx: clean up __io_uring_add_tctx_node() error handling
Refactor __io_uring_add_tctx_node() so that on error it never leaves
current->io_uring pointing at a half-setup tctx. This moves the
assignment of current->io_uring to the end of the function post any
failure points.

Separate out the node installation into io_tctx_install_node() to
further clean this up.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-04-08 13:21:34 -06:00
Jens Axboe
2c453a4281 io_uring/tctx: have io_uring_alloc_task_context() return tctx
Instead of having io_uring_alloc_task_context() return an int and
assign tsk->io_uring, just have it return the task context directly.
This enables cleaner error handling in callers, which may have
failure points post calling io_uring_alloc_task_context().

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-04-08 13:21:30 -06:00
Linus Torvalds
e41255ce7a io_uring-7.0-20260403
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmnPokUQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpjW6D/91Xg/mGvYUBVvwEhP0ydPncuAsThnkoDHY
 6Pu+VxawKW480yAC06nktAeDgJNnpFpJXatPEtk2n8r7Ol3Cx0sDWdQjzoKSlBC7
 9wj+MVpCcU970Gb1G6PNLKQoW+DxKuD9Iq6Ph434uCx/bgC2EKthj0vYpssoU48S
 OxyFGBTjhbgnmiZaAEMHpLC/LJP27eH24QQbobeVWyY7C6jy6YI0WQaoG4Qt+UMd
 S2XdFe97xrVaCVS3E5X5BAyHCcMX4e1D6/Y7bNDGG3Ke673RuUJHhqvk8P1NJnTI
 CaMlfoGhNw36FpkzTYIvoZlkCFl48axXmscRcekTg4d9ssnY9aSFVY+xMSHmkhKu
 zs1r1tZK970xUbQK0NAoD9T+LsFKU1S0PaEaCL2KMHwz9vG0uY7iUYteKqdM8L/f
 jUpYcxn9R6AhdeL77eEu3w6vCdMqP2+OgDv1uEpyJv6oWSdhfI38+EIwmMoEq+Az
 BkDipYNh4lAiI23qbS9CDe5aam6pv+hwecDn3x7MZVpGZ6cJjs43QWwk+jZz+KQj
 gacQu01q/TN7rpyaFYhkxPGKHVs259/uSLJY647ORgpJNXg4a+6DlB7/4YdW35il
 O4gnKECSflmoePm7B4QFh8Q89XPma74hVqtDB7opz0xL3PQMq07EQKQmoNP7RTOp
 GLW71uD2MA==
 =q/wK
 -----END PGP SIGNATURE-----

Merge tag 'io_uring-7.0-20260403' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux

Pull io_uring fixes from Jens Axboe:

 - A previous fix in this release covered the case of the rings being
   RCU protected during resize, but it missed a few spots. This covers
   the rest

 - Fix the cBPF filters when COW'ed, introduced in this merge window

 - Fix for an attempt to import a zero sized buffer

 - Fix for a missing clamp in importing bundle buffers

* tag 'io_uring-7.0-20260403' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux:
  io_uring/bpf_filters: retain COW'ed settings on parse failures
  io_uring: protect remaining lockless ctx->rings accesses with RCU
  io_uring/rsrc: reject zero-length fixed buffer import
  io_uring/net: fix slab-out-of-bounds read in io_bundle_nbufs()
2026-04-03 11:58:04 -07:00
Yang Xiuwei
f847bf6d29 io_uring/timeout: use 'ctx' consistently
There's already a local ctx variable, yet cq_timeouts accounting uses
req->ctx. Use ctx consistently.

Signed-off-by: Yang Xiuwei <yangxiuwei@kylinos.cn>
Link: https://patch.msgid.link/20260402014952.260414-1-yangxiuwei@kylinos.cn
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-04-02 07:08:40 -06:00
Joanne Koong
c7f3aaf3e8 io_uring/rw: clean up __io_read() obsolete comment and early returns
After commit a9165b83c1 ("io_uring/rw: always setup io_async_rw for
read/write requests") which moved the iovec allocation into the prep
path and stores it in req->async_data where it now gets freed as part of
the request lifecycle, this comment is now outdated.

Remove it and clean up the goto as well.

Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Link: https://patch.msgid.link/20260401173511.4052303-1-joannelkoong@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-04-02 06:55:50 -06:00
Pavel Begunkov
4c6f93951b io_uring/zcrx: use correct mmap off constants
zcrx was using IORING_OFF_PBUF_SHIFT during first iterations, but there
is now a separate constant it should use. Both are 16 so it doesn't
change anything, but improve it for the future.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://patch.msgid.link/fe16ebe9ba4048a7e12f9b3b50880bd175b1ce03.1774780198.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-04-02 06:55:48 -06:00
Pavel Begunkov
7120b87bed io_uring/zcrx: use dma_len for chunk size calculation
Buffers are now dma-mapped earlier and we can sg_dma_len(), otherwise,
since it's walking with for_each_sgtable_dma_sg(), it might wrongfully
reject some configurations. As a bonus, it'd now be able to use larger
chunks if dma addresses are coalesced e.g by iommu.

Fixes: 8c0cab0b7bf7 ("io_uring/zcrx: always dma map in advance")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://patch.msgid.link/03b219af3f6cfdd1cf64679b8bab7461e47cc123.1774780198.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-04-02 06:55:47 -06:00
Pavel Begunkov
52dcd1776b io_uring/zcrx: don't clear not allocated niovs
Now that area->is_mapped is set earlier before niovs array is allocated,
io_zcrx_free_area -> io_zcrx_unmap_area in an error path can try to
clear dma addresses for unallocated niovs, fix it.

Fixes: 8c0cab0b7bf7 ("io_uring/zcrx: always dma map in advance")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://patch.msgid.link/cbcb7749b5a001ecd4d1c303515ce9403215640c.1774780198.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-04-02 06:55:36 -06:00
Pavel Begunkov
8ae2837d5a io_uring/zcrx: don't use mark0 for allocating xarray
XA_MARK_0 is not compatible with xarray allocating entries, use
XA_MARK_1.

Fixes: fda90d43f4fac ("io_uring/zcrx: return back two step unregistration")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://patch.msgid.link/f232cfd3c466047d333b474dd2bddd246b6ebb82.1774780198.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-04-01 10:21:13 -06:00
Anas Iqbal
77d8c8d0f1 io_uring: cast id to u64 before shifting in io_allocate_rbuf_ring()
Smatch warns:
io_uring/zcrx.c:393 io_allocate_rbuf_ring() warn: should 'id << 16' be a 64 bit type?

The expression 'id << IORING_OFF_PBUF_SHIFT' is evaluated using 32-bit
arithmetic because id is a u32. This may overflow before being promoted
to the 64-bit mmap_offset.

Cast id to u64 before shifting to ensure the shift is performed in
64-bit arithmetic.

Signed-off-by: Anas Iqbal <mohd.abd.6602@gmail.com>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://patch.msgid.link/52400e1b343691416bef3ed3ae287fb1a88d407f.1774780198.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-04-01 10:21:13 -06:00
Pavel Begunkov
a9d008489f io_uring/zcrx: reject REG_NODEV with large rx_buf_size
The copy fallback path doesn't care about the actual niov size and only
uses first PAGE_SIZE bytes, and any additional space will be wasted.
Since ZCRX_REG_NODEV solely relies on the copy path, it doesn't make
sense to support non-standard rx_buf_len. Reject it for now, and
re-enable once improved.

Fixes: c11728021d5cd ("io_uring/zcrx: implement device-less mode for zcrx")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://patch.msgid.link/3e7652d9c27f8ac5d2b141e3af47971f2771fb05.1774780198.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-04-01 10:21:13 -06:00
Amir Mohammad Jahangirzad
85a58309c0 io_uring/cancel: validate opcode for IORING_ASYNC_CANCEL_OP
io_async_cancel_prep() reads the opcode selector from sqe->len and
stores it in cancel->opcode, which is an 8-bit field. Since sqe->len
is a 32-bit value, values larger than U8_MAX are implicitly truncated.

This can cause unintended opcode matches when the truncated value
corresponds to a valid io_uring opcode. For example, submitting a value
such as 0x10b will be truncated to 0x0b (IORING_OP_TIMEOUT), allowing a
cancel request to match operations it did not intend to target.
Validate the opcode value before assigning it to the 8-bit field and
reject values outside the valid io_uring opcode range.

Signed-off-by: Amir Mohammad Jahangirzad <a.jahangirzad@gmail.com>
Link: https://patch.msgid.link/20260331232113.615972-1-a.jahangirzad@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-04-01 10:21:13 -06:00
Jackie Liu
19a8cc6cda io_uring/rsrc: use io_cache_free() to free node
Replace kfree(node) with io_cache_free() in io_buffer_register_bvec()
to match all other error paths that free nodes allocated via
io_rsrc_node_alloc(). The node is allocated through io_cache_alloc()
internally, so it should be returned to the cache via io_cache_free()
for proper object reuse.

Signed-off-by: Jackie Liu <liuyun01@kylinos.cn>
Link: https://patch.msgid.link/20260331104509.7055-1-liu.yun@linux.dev
[axboe: remove fixes tag, it's not a fix, it's a cleanup]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-04-01 10:21:13 -06:00
Pavel Begunkov
7c713dd007 io_uring/zcrx: rename zcrx [un]register functions
Drop "ifqs" from function names, as it refers to an interface queue and
there might be none once a device-less mode is introduced.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://patch.msgid.link/657874acd117ec30fa6f45d9d844471c753b5a0f.1774261953.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-04-01 10:21:13 -06:00
Pavel Begunkov
de6ed1b323 io_uring/zcrx: check ctrl op payload struct sizes
Add a build check that ctrl payloads are of the same size and don't grow
struct zcrx_ctrl.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://patch.msgid.link/af66caf9776d18e9ff880ab828eb159a6a03caf5.1774261953.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-04-01 10:21:13 -06:00
Pavel Begunkov
5c727ce042 io_uring/zcrx: cache fallback availability in zcrx ctx
Store a flag in struct io_zcrx_ifq telling if the backing memory is
normal page or dmabuf based. It was looking it up from the area, however
it logically allocates from the zcrx ctx and not a particular area, and
once we add more than one area it'll become a mess.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://patch.msgid.link/65e75408a7758fe7e60fae89b7a8d5ae4857f515.1774261953.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-04-01 10:21:13 -06:00