linux

mirror of https://github.com/torvalds/linux.git synced 2026-05-12 16:18:45 +02:00

Author	SHA1	Message	Date
Linus Torvalds	8be01e1280	io_uring-7.1-20260508 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmn+FU4QHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpnIsEADVG1zqBrfj6JE0pscyrFbFHwAjoihs7jW2 dHVRzrWKlCG0iflVAxIcoA6WQLAc1W5cmWi9GHuOHtPvWY1rdiZCvp8GMGuq40ye 3OQrMAcDdpowAUBfO9tiZ9L9Bn96HFZCa92V0PEp/fSPxuRv1HGE/yTpWsardbxn eUGBoOMAclqawMU5thfOXFMT+DetwrY//nd799iEElzyNfk92mDCZZ5n3WPyl1J4 hn/iUu04YVozto9P17SJfEOg1c4kz84wL6ATR+2IuxrWm8/LxXspmbIovJYaCLRr EkdevTrxABBTJ77dllnnaFg233F75ZdYr0z0xgHoOFT2totSz2lZFxqz8R0/b/NE mHdshkn4LTU4yDHuILDt0LxImi62i1Bmn7QQIbICcMAEhTGh2hebaM/lmmHp7IlN R98q4ALm+dTu5vp+MDlve53P4UITxsgclAICqxyY26FrNneHd/TodeDYPGYLwj4F 2EPZyzHe3WpTXMmF6pxLlEr2r8DRqZhBqj+mohN/pZK+ecs6GmCLh1F1zcuaLQyg VOPf3FY48f6EWku0gCUpOev2iFTQHICf3RC39uR5pVVdC3+Yzc2+yqa9xg+I5ckt gNTbmD1vkssaCmxIXR0Cj/pHskZxJqtdmDJmcBQfbLm9ytFuCGFK0IqpvpyRya/H jGjRGJxJ5w== =y3+7 -----END PGP SIGNATURE----- Merge tag 'io_uring-7.1-20260508' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux Pull io_uring fixes from Jens Axboe: - Ensure that the absolute timeouts for both the command side and the waiting side honor the callers time namespace - Ensure tracked NAPI entries are cleared at unregistration time, as the NAPI polling loop checks the list state rather than the general NAPI state. This can lead to NAPI polling even after unregistration has been done. If unregistered, all NAPI polling should be disabled - Fix for eventfd recursive invocation handling * tag 'io_uring-7.1-20260508' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux: io_uring/wait: honour caller's time namespace for IORING_ENTER_ABS_TIMER io_uring/timeout: honour caller's time namespace for IORING_TIMEOUT_ABS io_uring/eventfd: reset deferred signal state io_uring/napi: clear tracked NAPI entries on unregister	2026-05-08 13:12:48 -07:00
Maoyi Xie	45d2b37a37	io_uring/wait: honour caller's time namespace for IORING_ENTER_ABS_TIMER io_uring_enter() with IORING_ENTER_ABS_TIMER takes an absolute timespec from the caller via ext_arg->ts. It arms an ABS mode hrtimer in __io_cqring_wait_schedule(). The conversion path in io_uring/wait.c parses ext_arg->ts inline rather than going through io_parse_user_time(). It therefore does not pick up the time namespace conversion added by the previous patch. Apply timens_ktime_to_host() to the parsed time on the IORING_ENTER_ABS_TIMER branch. This mirrors the IORING_TIMEOUT_ABS fix in io_parse_user_time(). Use ctx->clockid as the clock id. ctx->clockid is set either at ring creation or via IORING_REGISTER_CLOCK. timens_ktime_to_host() is a no-op for clocks not affected by time namespaces. It is also a no-op for callers in the initial time namespace. The fast path is unchanged. Reproducer: in unshare --user --time, with a -10s monotonic offset, call io_uring_enter with min_complete=1, IORING_ENTER_ABS_TIMER, and ts = now + 1s. The call returns -ETIME after <1ms instead of after the expected ~1s. Suggested-by: Pavel Begunkov <asml.silence@gmail.com> Suggested-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Maoyi Xie <maoyi.xie@ntu.edu.sg> Link: https://patch.msgid.link/20260504153755.1293932-3-maoyi.xie@ntu.edu.sg Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-05-06 04:58:56 -06:00
Maoyi Xie	9cc6bac1be	io_uring/timeout: honour caller's time namespace for IORING_TIMEOUT_ABS io_uring's IORING_OP_TIMEOUT and IORING_OP_LINK_TIMEOUT accept a timespec from the caller via io_parse_user_time(). With IORING_TIMEOUT_ABS, the timestamp is an absolute deadline on the selected clock. The clock is CLOCK_MONOTONIC by default. CLOCK_BOOTTIME and CLOCK_REALTIME are also selectable. A submitter inside a CLONE_NEWTIME time namespace observes CLOCK_MONOTONIC and CLOCK_BOOTTIME shifted by the namespace's offsets relative to the host. Every other ABS timer interface in the kernel converts the caller's absolute time to host view via timens_ktime_to_host() before arming an hrtimer: kernel/time/posix-timers.c -- timer_settime(TIMER_ABSTIME) kernel/time/posix-stubs.c -- clock_nanosleep(TIMER_ABSTIME) kernel/time/alarmtimer.c -- alarm_timer_nsleep(TIMER_ABSTIME) fs/timerfd.c -- timerfd_settime(TFD_TIMER_ABSTIME) io_parse_user_time() does not. As a result, an absolute timeout submitted from within a time namespace is interpreted in host view. That is generally a different point in time. It may already be in the past, causing the timer to fire immediately, or far in the future, causing the timer not to fire when expected. Reproducer: in unshare --user --time, with a -10s monotonic offset, submit IORING_OP_TIMEOUT with IORING_TIMEOUT_ABS and deadline = now + 1s. The CQE is delivered after <1ms instead of the expected ~1s. Apply timens_ktime_to_host() to the parsed time when IORING_TIMEOUT_ABS is set. Split the existing clock id resolver in io_timeout_get_clock() into a flags only helper io_flags_to_clock(), so io_parse_user_time() can resolve the clock without a struct io_timeout_data. timens_ktime_to_host() is a no-op for clocks not affected by time namespaces, e.g. CLOCK_REALTIME. It is also a no-op for callers in the initial time namespace. The fast path is unchanged. SQPOLL is also covered. 
The SQPOLL kernel thread is created via create_io_thread() with CLONE_THREAD and no CLONE_NEW* flag. copy_namespaces() therefore shares the submitter's nsproxy by reference. Inside the SQPOLL kthread, current->nsproxy->time_ns is the submitter's time_ns. timens_ktime_to_host() resolves correctly. Suggested-by: Pavel Begunkov <asml.silence@gmail.com> Suggested-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Maoyi Xie <maoyi.xie@ntu.edu.sg> Link: https://patch.msgid.link/20260504153755.1293932-2-maoyi.xie@ntu.edu.sg Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-05-06 04:58:56 -06:00
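The reproducer arithmetic in the two time namespace commits above can be sketched in plain userspace C. This is only an illustration under stated assumptions: `struct toy_timens`, `toy_timens_to_host()` and `toy_remaining_ns()` are hypothetical names, the offset is modelled as "namespace view = host view + offset", and the real `timens_ktime_to_host()` additionally selects the offset per clock id and clamps the result.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_NSEC_PER_SEC INT64_C(1000000000)

/* Hypothetical stand-in for a time namespace: the offset that is added
 * to the host CLOCK_MONOTONIC to produce the namespace's view. */
struct toy_timens {
	int64_t monotonic_offset_ns;
};

/* Convert an absolute deadline from namespace view to host view by
 * removing the namespace offset (the idea behind timens_ktime_to_host(),
 * not its actual code). */
static int64_t toy_timens_to_host(const struct toy_timens *ns, int64_t abs_ns)
{
	assert(ns != NULL);
	return abs_ns - ns->monotonic_offset_ns;
}

/* Time left until the deadline as a host-side timer would see it.
 * A value <= 0 means the timer fires immediately. */
static int64_t toy_remaining_ns(int64_t host_now_ns, int64_t host_deadline_ns)
{
	return host_deadline_ns - host_now_ns;
}
```

With a -10s offset and a deadline of "now + 1s" computed inside the namespace, the unconverted deadline lies 9s in the host's past, which matches the reported "<1ms" completion. After conversion the remaining time is the expected 1s.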
Yufan Chen	04fe9aeb4f	io_uring/eventfd: reset deferred signal state Recursive eventfd wakeups must defer io_uring eventfd signaling because eventfd_signal_mask() rejects reentry from eventfd wakeup handlers. The io_ev_fd ops bit tracks an outstanding deferred signal so that the same rcu_head is not queued twice. That bit is only set today. Once the first deferred callback runs, later recursive notifications still see the bit set and skip queueing another deferred signal. This can leave new completions without a matching eventfd wake after the first recursive deferral. Clear the pending bit before issuing the deferred signal. If the wakeup path recurses while the callback runs, a new signal can be queued for the next RCU grace period while the current callback keeps its reference until it returns. Signed-off-by: Yufan Chen <ericterminal@gmail.com> Fixes: `60b6c075e8` ("io_uring/eventfd: move to more idiomatic RCU free usage") Link: https://patch.msgid.link/20260503175710.37209-1-yufan.chen@linux.dev Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-05-03 23:21:40 -06:00
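The pending-bit problem described in the eventfd commit above can be modelled with a single atomic flag. The sketch below is a userspace toy, not the kernel code: `struct toy_ev_fd` and the `toy_*` functions are hypothetical, the RCU callback is replaced by a direct function call, and the `clear_first` parameter exists only to contrast the old and new behaviour.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical model of the deferred-signal state. */
struct toy_ev_fd {
	atomic_bool pending; /* models the "deferred signal outstanding" bit */
	int queued;          /* number of deferred callbacks queued */
	int signals;         /* number of signals actually issued */
};

static void toy_ev_fd_init(struct toy_ev_fd *ev)
{
	atomic_init(&ev->pending, false);
	ev->queued = 0;
	ev->signals = 0;
}

/* Try to queue a deferred signal. Returns false if one is already
 * marked pending, so the same callback is never queued twice. */
static bool toy_defer_signal(struct toy_ev_fd *ev)
{
	if (atomic_exchange(&ev->pending, true))
		return false;
	ev->queued++;
	return true;
}

/* The deferred callback. With clear_first == false the bit is never
 * reset (the behaviour the commit describes as broken); with
 * clear_first == true it is cleared before the signal is issued. */
static void toy_deferred_cb(struct toy_ev_fd *ev, bool clear_first)
{
	assert(atomic_load(&ev->pending)); /* only runs after being queued */
	if (clear_first)
		atomic_store(&ev->pending, false);
	ev->signals++;
}
```

In this model, once the callback has run without clearing the bit, every later `toy_defer_signal()` call returns false and the notification is lost. Clearing the bit first lets the next recursive notification queue a fresh callback.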
Yufan Chen	b8c2e9e276	io_uring/napi: clear tracked NAPI entries on unregister IORING_UNREGISTER_NAPI disables NAPI busy polling, but it currently leaves any previously tracked NAPI IDs on the ring context. The normal wait path only checks whether the list is empty before entering the busy poll helper, so an unregistered ring can still observe stale entries and run an unexpected busy poll pass. Make unregister switch the context to inactive and free the tracked entries. Do the same inactive transition while changing the tracking strategy, and recheck the expected tracking mode under napi_lock before inserting a newly learned NAPI ID. This prevents a racing poll path from repopulating the list after unregister or reconfiguration. Also make the busy poll dispatcher ignore inactive mode explicitly. Signed-off-by: Yufan Chen <ericterminal@gmail.com> Fixes: `6bf90bd8c5` ("io_uring/napi: add static napi tracking strategy") Link: https://patch.msgid.link/20260503175610.35521-1-yufan.chen@linux.dev Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-05-03 23:21:23 -06:00
Linus Torvalds	9d88bb929a	io_uring-7.1-20260430 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmnz4ikQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpoecEACan9sGbcqrTAqBaJnNqMfjo5OoX0X2/3LP JoFH+GbLzJ7Ojj1+AWSHNsjjwhSZ71HmwRk78uWCPcl3oiKQLGyXTnbho0qKrwhk 4cnoFfhdcBEDGuh8GW/PnBset8ukq9occ11TbojC681tmaTma1WpXFk1vRabcwvw T8/Jr18kttHi8aj+MPowkTcqXV7iOjzX9RD/vS97jCWBxUbAmYjRGfm3nbDbDydI oMEstxqp+8jiFF1SHBdq3aGreoZDIegh1nXsjobAmoEMvAJQQ3K7zRsiqFEnoXFU CDVoS6LhlSBmG2jT657azYzhF3o7HwSiYk2B15YiYHO+EqIxMhIQYRlP5s/3UJD8 KLJPSYqivQ14m9yff5zjn//mad3QBxvOhrVrxHj/diIKclZDLs9VDPZjjB6A6DUO X01uJy7zuzp57GFh0FwyFGU3yBUl7WJGscLarIMHnOdmEWOIU3WRLWGYZRgZRUny 1yHXxGEucR7LMiYPzh7PnGnaAzDxtJJUzIXbIF+l+A5A0f1Ayb8cfFy6QoGc7v8j t+vG2gbRtwPR6DxFRNhGDMeZtstEoKj0IX4zw7ZQF7MFgpvdfjlkDGCveJ0jQO6x pw8UJW1KOQpT9MiheOAvop5hhvGlqSYWXByluW05y7O+CDQcHiVeXcjQsG7zREsO +zGvO5WHEg== =5d8D -----END PGP SIGNATURE----- Merge tag 'io_uring-7.1-20260430' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux Pull io_uring fixes from Jens Axboe: - Remove dead struct io_buffer_list member - Fix for incrementally consumed buffers with recvmsg multishot, which requires a minimum value left in a buffer for any receive for the headers. If there's still a bit of buffer left but it's smaller than that value, then userspace will see a spurious -EFAULT returned in the CQE - Locking fix for the DEFER_TASKRUN retry list, which otherwise could race with fallback cancelations. 
If the task is exiting with task_work left in both the normal and retry list AND the exit cleanup races with the task running task work, then entries could either be doubly completed or lost - Cap NAPI busy poll timeout to something sane, to avoid syzbot running into excessive polling and triggering warnings around that * tag 'io_uring-7.1-20260430' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux: io_uring/tw: serialize ctx->retry_llist with ->uring_lock io_uring/napi: cap busy_poll_to 10 msec io_uring/kbuf: support min length left for incremental buffers io_uring/kbuf: kill dead struct io_buffer_list 'nr_entries' member	2026-05-01 11:01:31 -07:00
Jens Axboe	17666e2d75	io_uring/tw: serialize ctx->retry_llist with ->uring_lock The DEFER_TASKRUN local task work paths all run under ctx->uring_lock, which serializes them with each other and with the rest of the ring's hot paths. io_move_task_work_from_local() is the exception - it's called from io_ring_exit_work() on a kworker without holding the lock and from the iopoll cancelation side right after dropping it. ->work_llist is fine with this, as it's only ever updated via the expected paths. But the ->retry_llist is updated while running, and hence it could potentially race between normal task_work running and the task-has-exited shutdown path. Simply grab ->uring_lock while moving the local work to the fallback list for exit purposes, which nicely serializes it across both the normal additions and the exit prune path. Cc: stable@vger.kernel.org Fixes: `f46b9cdb22` ("io_uring: limit local tw done") Reported-by: Robert Femmer <robert.femmer@x41-dsec.de> Reported-by: Christian Reitter <invd@inhq.net> Reported-by: Michael Rodler <michael.rodler@x41-dsec.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-04-30 06:57:20 -06:00
Jakub Kicinski	735a309b4b	net: add net_iov_init() and use it to initialize ->page_type Commit `db359fccf2` ("mm: introduce a new page type for page pool in page type") added a page_type field to struct net_iov at the same offset as struct page::page_type, so that page_pool_set_pp_info() can call __SetPageNetpp() uniformly on both pages and net_iovs. The page-type API requires the field to hold the UINT_MAX "no type" sentinel before a type can be set; for real struct page that invariant is established by the page allocator on free. struct net_iov is not allocated through the page allocator, so the field is left as zero (io_uring zcrx, which uses __GFP_ZERO) or as slab garbage (devmem, which uses kvmalloc_objs() without zeroing). When the page pool then calls page_pool_set_pp_info() on a freshly-bound niov, __SetPageNetpp()'s VM_BUG_ON_PAGE(page->page_type != UINT_MAX) fires and the kernel BUGs. Triggered in selftests by io_uring zcrx setup through the fbnic queue restart path: kernel BUG at ./include/linux/page-flags.h:1062! 
RIP: 0010:page_pool_set_pp_info (./include/linux/page-flags.h:1062 net/core/page_pool.c:716) Call Trace: <TASK> net_mp_niov_set_page_pool (net/core/page_pool.c:1360) io_pp_zc_alloc_netmems (io_uring/zcrx.c:1089 io_uring/zcrx.c:1110) fbnic_fill_bdq (./include/net/page_pool/helpers.h:160 drivers/net/ethernet/meta/fbnic/fbnic_txrx.c:906) __fbnic_nv_restart (drivers/net/ethernet/meta/fbnic/fbnic_txrx.c:2470 drivers/net/ethernet/meta/fbnic/fbnic_txrx.c:2874) fbnic_queue_start (drivers/net/ethernet/meta/fbnic/fbnic_txrx.c:2903) netdev_rx_queue_reconfig (net/core/netdev_rx_queue.c:137) __netif_mp_open_rxq (net/core/netdev_rx_queue.c:234) io_register_zcrx (io_uring/zcrx.c:818 io_uring/zcrx.c:903) __io_uring_register (io_uring/register.c:931) __do_sys_io_uring_register (io_uring/register.c:1029) do_syscall_64 (arch/x86/entry/syscall_64.c:63 arch/x86/entry/syscall_64.c:94) </TASK> The same path is reachable through devmem dmabuf binding via netdev_nl_bind_rx_doit() -> net_devmem_bind_dmabuf_to_queue(). Add a net_iov_init() helper that stamps ->owner, ->type and the ->page_type sentinel, and use it from both the devmem and io_uring zcrx niov init loops. Fixes: `db359fccf2` ("mm: introduce a new page type for page pool in page type") Acked-by: Vlastimil Babka (SUSE) <vbabka@kernel.org> Acked-by: Byungchul Park <byungchul@sk.com> Reviewed-by: Jens Axboe <axboe@kernel.dk> Acked-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://patch.msgid.link/20260428025320.853452-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-04-29 16:40:08 -07:00
Jens Axboe	df8599ee18	io_uring/napi: cap busy_poll_to 10 msec Currently there's no cap on the maximum amount of time that napi is allowed to poll if no events are found, which can lead to kernel complaints on a task being stuck as there's no conditional rescheduling done within that loop. Just cap it to 10 msec in total, that's already way above any kind of sane value that will reap any benefits, yet low enough that it's nowhere near being able to trigger preemption complaints. Fixes: `8d0c12a80c` ("io-uring: add napi busy poll support") Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-04-28 16:09:02 -06:00
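The 10 msec cap described above amounts to a simple clamp. The following is a minimal sketch under assumptions: the names are hypothetical, the timeout is assumed to be expressed in microseconds, and whether the kernel clamps silently (as here) or handles an oversized value some other way is not stated in the commit message.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical constant: 10 msec expressed in microseconds. */
#define TOY_BUSY_POLL_MAX_USEC 10000u

static_assert(TOY_BUSY_POLL_MAX_USEC == 10 * 1000, "cap is 10 msec");

/* Clamp a requested busy poll timeout to the cap. */
static uint32_t toy_cap_busy_poll(uint32_t requested_usec)
{
	return requested_usec < TOY_BUSY_POLL_MAX_USEC
		? requested_usec
		: TOY_BUSY_POLL_MAX_USEC;
}
```

A request of 50 usec passes through unchanged, while a request of several seconds is reduced to 10000 usec.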
Martin Michaelis	7deba791ad	io_uring/kbuf: support min length left for incremental buffers Incrementally consumed buffer rings are generally fully consumed, but it's quite possible that the application has a minimum size it needs to meet to avoid truncation. Currently that minimum limit is 1 byte, but this should be a setting that is in the hands of the application. For recvmsg multishot, a prime use case for incrementally consumed buffers, the application may get spurious -EFAULT returned at the end of an incrementally consumed buffer, as less space is available than the headers need. Grab a u32 field in struct io_uring_buf_reg, which the application can use to inform the kernel of the minimum size that should be available in an incrementally consumed buffer. If less than that is available, the current buffer is fully processed and the next one will be picked. Cc: stable@vger.kernel.org Fixes: `ae98dbf43d` ("io_uring/kbuf: add support for incremental buffer consumption") Link: https://github.com/axboe/liburing/issues/1433 Signed-off-by: Martin Michaelis <code@mgjm.de> [axboe: write commit message, change io_buffer_list member name] Reviewed-by: Gabriel Krisman Bertazi <krisman@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-04-28 16:08:56 -06:00
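The "minimum length left" rule from the commit above can be illustrated with a toy buffer. This is a sketch only: `struct toy_buf`, `toy_consume()` and the `min_left` parameter are hypothetical names, and treating a `min_left` of 0 as the historic 1-byte minimum is an assumption made for the example rather than something the commit states.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical incrementally consumed buffer. */
struct toy_buf {
	uint32_t len;  /* total size of the buffer */
	uint32_t used; /* bytes consumed so far */
};

/* Consume n bytes. Returns true if the buffer should now be treated as
 * fully processed so that the next buffer is picked, i.e. if fewer than
 * min_left bytes remain. */
static bool toy_consume(struct toy_buf *b, uint32_t n, uint32_t min_left)
{
	assert(n <= b->len - b->used);
	b->used += n;

	/* Assumption: 0 means "not configured", which falls back to the
	 * 1-byte minimum the commit message describes as the old rule. */
	if (min_left == 0)
		min_left = 1;

	return b->len - b->used < min_left;
}
```

For a 4096-byte buffer with 4000 bytes consumed, 96 bytes remain. Under the 1-byte rule the buffer stays current, so a receive that needs more than 96 bytes of header space would fail. With `min_left` set to 128 the buffer is retired instead.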
Jens Axboe	55ea968389	io_uring/kbuf: kill dead struct io_buffer_list 'nr_entries' member This is only ever assigned, never used. The only used part is the calculated mask, which is used for indexing. Kill 'nr_entries'. Reviewed-by: Gabriel Krisman Bertazi <krisman@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-04-28 16:08:44 -06:00
Greg Kroah-Hartman	d0be8884f5	io_uring: take page references for NOMMU pbuf_ring mmaps Under !CONFIG_MMU, io_uring_get_unmapped_area() returns the kernel virtual address of the io_mapped_region's backing pages directly; the user's VMA aliases the kernel allocation. io_uring_mmap() then just returns 0 -- it takes no page references. The CONFIG_MMU path uses vm_insert_pages(), which takes a reference on each inserted page. Those references are released when the VMA is torn down (zap_pte_range -> put_page). io_free_region() -> release_pages() drops the io_uring-side references, but the pages survive until munmap drops the VMA-side references. Under NOMMU there are no VMA-side references. io_unregister_pbuf_ring -> io_put_bl -> io_free_region -> release_pages drops the only references and the pages return to the buddy allocator while the user's VMA still has vm_start pointing into them. The user can then write into whatever the allocator hands out next. Mirror the MMU lifetime: take get_page references in io_uring_mmap() and release them via vm_ops->close. NOMMU's delete_vma() calls vma_close() which runs ->close on munmap. This also incidentally addresses the duplicate-vm_start case: two mmaps of SQ_RING and CQ_RING resolve to the same ctx->ring_region pointer. With page refs taken per mmap, the second mmap takes its own refs and the pages survive until both mmaps are closed. The nommu rb-tree BUG_ON on duplicate vm_start is a separate mm/nommu.c concern (it should share the existing region rather than BUG), but the page lifetime is now correct. Cc: Jens Axboe <axboe@kernel.dk> Reported-by: Anthropic Assisted-by: gkh_clanker_t1000 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Link: https://patch.msgid.link/2026042115-body-attention-d15b@gregkh [axboe: get rid of region lookup, just iterate pages in vma] Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-04-21 20:14:39 -06:00
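The lifetime argument in the NOMMU commit above is reference counting, and can be sketched as follows. Everything here is a hypothetical userspace model: `struct toy_page` and the `toy_*` helpers stand in for `get_page()`/`put_page()` semantics and do not reproduce the kernel's mmap or `vm_ops->close` plumbing.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical page with a plain reference count. */
struct toy_page {
	int refs;
};

static void toy_get_page(struct toy_page *p)
{
	p->refs++;
}

/* Drop one reference. Returns true if this was the last one, i.e. the
 * page would go back to the allocator. */
static bool toy_put_page(struct toy_page *p)
{
	assert(p->refs > 0);
	return --p->refs == 0;
}

/* Model of mmap: take_ref == true mirrors the MMU behaviour (and the
 * fixed NOMMU behaviour), take_ref == false mirrors the old NOMMU path
 * that mapped the pages without taking a reference. */
static void toy_mmap(struct toy_page *p, bool take_ref)
{
	if (take_ref)
		toy_get_page(p);
}
```

Starting from one reference held by the ring, unregistering after a reference-less mmap frees the page while it is still mapped. With a reference taken per mmap, the page survives until the matching unmap drops it.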
Jens Axboe	1967f0b1ca	io_uring/poll: ensure EPOLL_ONESHOT is propagated for EPOLL_URING_WAKE Commit: `aacf2f9f38` ("io_uring: fix req->apoll_events") fixed an issue where poll->events and req->apoll_events weren't synchronized, but then when the commit referenced in Fixes got added, it didn't ensure the same thing. If we mask in EPOLLONESHOT in the regular EPOLL_URING_WAKE path, then ensure it's done for both. Including a link to the original report below, even though it's mostly nonsense. But it includes a reproducer that does show that IORING_CQE_F_MORE is set in the previous CQE, while no more CQEs will be generated for this request. Just ignore anything that pretends this is security related in any way, it's just the typical AI nonsense. Cc: stable@vger.kernel.org Link: https://lore.kernel.org/io-uring/CAM0zi7yQzF3eKncgHo4iVM5yFLAjsiob_ucqyWKs=hyd_GqiMg@mail.gmail.com/ Reported-by: Azizcan Daştan <azizcan.d@mileniumsec.com> Fixes: `4464853277` ("io_uring: pass in EPOLL_URING_WAKE for eventfd signaling and wakeups") Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-04-21 19:18:34 -06:00
Pavel Begunkov	770594e78c	io_uring/zcrx: warn on freelist violations The freelist is appropriately sized to always be able to take a free niov, but let's be more defensive and check the invariant with a warning. That should help to catch any double-free issues. Suggested-by: Kai Aizen <kai@snailsploit.com> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://patch.msgid.link/2f3cea363b04649755e3b6bb9ab66485a95936d5.1776760901.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-04-21 12:19:11 -06:00
Pavel Begunkov	4f02cc4071	io_uring/zcrx: clear RQ headers on init It might be unexpected to users if the RQ head/tail after a ring creation are not zeroed, fix that. Cc: stable@vger.kernel.org Fixes: `6f377873cb` ("io_uring/zcrx: add interface queue and refill queue") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://patch.msgid.link/331f94663c3e8f021ffa3cb770ca2844a07d4855.1776760911.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-04-21 12:19:11 -06:00
Pavel Begunkov	0fcccfd871	io_uring/zcrx: fix user_struct uaf io_free_rbuf_ring() uses a struct user_struct, but io_zcrx_ifq_free() puts it before destroying the ring. Cc: stable@vger.kernel.org Fixes: `5c686456a4` ("io_uring/zcrx: add user_struct and mm_struct to io_zcrx_ifq") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://patch.msgid.link/e560ae00960d27a810522a7efc0e201c82dff351.1776760917.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-04-21 12:19:11 -06:00
Jens Axboe	45cd95763e	io_uring/register: fix ring resizing with mixed/large SQEs/CQEs The ring resizing only properly handles "normal" sized SQEs or CQEs, if there are pending entries around a resize. This normally should not be the case, but the code is supposed to handle this regardless. For the mixed SQE/CQE cases, the current copying works fine as they are indexed in the same way. Each half is just copied separately. But for fixed large SQEs and CQEs, the iteration and copy need to take that into account. Cc: stable@kernel.org Fixes: `79cfe9e59c` ("io_uring/register: add IORING_REGISTER_RESIZE_RINGS") Reviewed-by: Gabriel Krisman Bertazi <krisman@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-04-21 12:19:08 -06:00
Jens Axboe	7faaa6812a	io_uring/futex: ensure partial wakes are appropriately dequeued If a FUTEX_WAITV vectored operation is only partially woken, we should call __futex_wake_mark() on the queue to account for that. If not, then a later wakeup will wake the same entry, rather than the next one in line. Fixes: `8f350194d5` ("io_uring: add support for vectored futex waits") Reviewed-by: Gabriel Krisman Bertazi <krisman@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-04-21 12:19:06 -06:00
Jens Axboe	7996883455	io_uring/rw: add defensive hardening for negative kbuf lengths No real bug here, just being a bit defensive in ensuring that whatever gets passed into io_put_kbuf() is always >= 0 and not some random error value. Reviewed-by: Gabriel Krisman Bertazi <krisman@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-04-21 12:19:03 -06:00
Jens Axboe	02b8d41c17	io_uring/rsrc: use kvfree() for the imu cache Currently anything that requires kvmalloc_flex() for allocations will not get re-cached, and hence the cache freeing path is correct in that it always uses kfree() to free the allocated memory. But this seems a bit fragile as it's something that could get mixed up should that situation change, so switch io_free_imu() and io_alloc_cache_free() to use kvfree() as the destructor. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-04-21 12:19:01 -06:00
Jens Axboe	53262c91f7	io_uring/rsrc: unify nospec indexing for direct descriptors For file updates, the node reset isn't capping the value via array_index_nospec() like the other paths do. Ensure it's all sane and have the update path do the proper capping as well. Reviewed-by: Gabriel Krisman Bertazi <krisman@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-04-21 12:18:54 -06:00
Jens Axboe	8e1f412b5b	io_uring: fix spurious fput in registered ring path Fix an issue with io_uring_ctx_get_file() not gating fput() on whether or not the file descriptor is a registered/direct one or not. Fixes: `c5e9f6a96b` ("io_uring: unify getting ctx from passed in file descriptor") Reviewed-by: Gabriel Krisman Bertazi <krisman@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-04-21 12:18:44 -06:00
Jens Axboe	42a702aaed	io_uring: fix iowq_limits data race in tctx node addition __io_uring_add_tctx_node() reads ctx->int_flags and ctx->iowq_limits[0..1] without holding ctx->uring_lock, while io_register_iowq_max_workers() writes these same fields under the lock. Mostly an application problem if you try and make these race, but let's silence KCSAN by just grabbing the ->uring_lock around the operation. This is a slow path operation anyway, and ->uring_lock will be grabbed by submission right after anyway. Fixes: `2e480058dd` ("io-wq: provide a way to limit max number of workers") Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-04-20 14:57:21 -06:00
Jens Axboe	41859843f2	io_uring/tctx: mark io_wq as exiting before error path teardown syzbot reports that it's hitting the below condition for exiting an io_wq context: WARN_ON_ONCE(!test_bit(IO_WQ_BIT_EXIT, &wq->state)) in io_wq_put_and_exit(), which can be triggered with memory allocation fault injection. Ensure that the io_wq is marked as exiting to silence this warning trigger. Reported-by: syzbot+79a4cc863a8db58cd92b@syzkaller.appspotmail.com Fixes: `7880174e1e` ("io_uring/tctx: clean up __io_uring_add_tctx_node() error handling") Reviewed-by: Clément Léger <cleger@meta.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-04-20 14:47:37 -06:00
Jens Axboe	ee5417fd02	io_uring/tctx: check for setup tctx->io_wq before teardown As with the idling code before it, the error exit path should check for a NULL tctx->io_wq before calling io_wq_put_and_exit(). Fixes: `7880174e1e` ("io_uring/tctx: clean up __io_uring_add_tctx_node() error handling") Reported-by: Dan Carpenter <error27@gmail.com> Reviewed-by: Clément Léger <cleger@meta.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-04-20 14:47:29 -06:00
Longxuan Yu	326941b228	io_uring/poll: fix signed comparison in io_poll_get_ownership() io_poll_get_ownership() uses a signed comparison to check whether poll_refs has reached the threshold for the slowpath: if (unlikely(atomic_read(&req->poll_refs) >= IO_POLL_REF_BIAS)) atomic_read() returns int (signed). When IO_POLL_CANCEL_FLAG (BIT(31)) is set in poll_refs, the value becomes negative in signed arithmetic, so the >= 128 comparison always evaluates to false and the slowpath is never taken. Fix this by casting the atomic_read() result to unsigned int before the comparison, so that the cancel flag is treated as a large positive value and correctly triggers the slowpath. Fixes: `a26a35e901` ("io_uring: make poll refs more robust") Cc: stable@vger.kernel.org Reported-by: Yifan Wu <yifanwucs@gmail.com> Reported-by: Juefei Pu <tomapufckgml@gmail.com> Co-developed-by: Yuan Tan <yuantan098@gmail.com> Signed-off-by: Yuan Tan <yuantan098@gmail.com> Suggested-by: Xin Liu <bird@lzu.edu.cn> Tested-by: Zhengchuan Liang <zcliangcn@gmail.com> Signed-off-by: Longxuan Yu <ylong030@ucr.edu> Signed-off-by: Ren Wei <n05ec@lzu.edu.cn> Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://patch.msgid.link/3a3508b08bcd7f1bc3beff848ae6e1d73d355043.1775965597.git.ylong030@ucr.edu Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-04-15 14:07:47 -06:00
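The signed versus unsigned comparison described in the commit above is easy to reproduce outside the kernel. In this sketch the `TOY_*` names and helper functions are hypothetical; only the value 128 for the bias and the use of bit 31 as the cancel flag come from the commit message, and a 32-bit `int` is assumed.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

static_assert(sizeof(int) == 4, "sketch assumes a 32-bit int");

#define TOY_POLL_REF_BIAS 128

/* Old check: the value is compared as a signed int, so a set bit 31
 * makes it negative and the test can never succeed. */
static bool toy_slowpath_signed(int refs)
{
	return refs >= TOY_POLL_REF_BIAS;
}

/* Fixed check: cast to unsigned first, so a set bit 31 is seen as a
 * very large value and the slowpath is taken. */
static bool toy_slowpath_unsigned(int refs)
{
	return (unsigned int)refs >= TOY_POLL_REF_BIAS;
}
```

With `INT_MIN + 1` as input (bit 31 set plus one reference), the signed check returns false and the unsigned check returns true. For ordinary small counts both checks agree.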
Linus Torvalds	91a4855d6c	Networking changes for 7.1. Core & protocols ---------------- - Support HW queue leasing, allowing containers to be granted access to HW queues for zero-copy operations and AF_XDP. - Number of code moves to help the compiler with inlining. Avoid output arguments for returning drop reason where possible. - Rework drop handling within qdiscs to include more metadata about the reason and dropping qdisc in the tracepoints. - Remove the rtnl_lock use from IP Multicast Routing. - Pack size information into the Rx Flow Steering table pointer itself. This allows making the table itself a flat array of u32s, thus making the table allocation size a power of two. - Report TCP delayed ack timer information via socket diag. - Add ip_local_port_step_width sysctl to allow distributing the randomly selected ports more evenly throughout the allowed space. - Add support for per-route tunsrc in IPv6 segment routing. - Start work of switching sockopt handling to iov_iter. - Improve dynamic recvbuf sizing in MPTCP, limit burstiness and avoid buffer size drifting up. - Support MSG_EOR in MPTCP. - Add stp_mode attribute to the bridge driver for STP mode selection. This addresses concerns about call_usermodehelper() usage. - Remove UDP-Lite support (as announced in 2023). - Remove support for building IPv6 as a module. Remove the now unnecessary function calling indirection. Cross-tree stuff ---------------- - Move Michael MIC code from generic crypto into wireless, it's considered insecure but some WiFi networks still need it. Netfilter --------- - Switch nft_fib_ipv6 module to no longer need temporary dst_entry object allocations by using fib6_lookup() + RCU. Florian W reports this gets us ~13% higher packet rate. - Convert IPVS's global __ip_vs_mutex to per-net service_mutex and switch the service tables to be per-net. Convert some code that walks the service lists to use RCU instead of the service_mutex. 
- Add more opinionated input validation to lower security exposure. - Make IPVS hash tables to be per-netns and resizable. Wireless -------- - Finished assoc frame encryption/EPPKE/802.1X-over-auth. - Radar detection improvements. - Add 6 GHz incumbent signal detection APIs. - Multi-link support for FILS, probe response templates and client probing. - New APIs and mac80211 support for NAN (Neighbor Aware Networking, aka Wi-Fi Aware) so less work must be in firmware. Driver API ---------- - Add numerical ID for devlink instances (to avoid having to create fake bus/device pairs just to have an ID). Support shared devlink instances which span multiple PFs. - Add standard counters for reporting pause storm events (implement in mlx5 and fbnic). - Add configuration API for completion writeback buffering (implement in mana). - Support driver-initiated change of RSS context sizes. - Support DPLL monitoring input frequency (implement in zl3073x). - Support per-port resources in devlink (implement in mlx5). Misc ---- - Expand the YAML spec for Netfilter. 
Drivers ------- - Software: - macvlan: support multicast rx for bridge ports with shared source MAC address - team: decouple receive and transmit enablement for IEEE 802.3ad LACP "independent control" - Ethernet high-speed NICs: - nVidia/Mellanox: - support high order pages in zero-copy mode (for payload coalescing) - support multiple packets in a page (for systems with 64kB pages) - Broadcom 25-400GE (bnxt): - implement XDP RSS hash metadata extraction - add software fallback for UDP GSO, lowering the IOMMU cost - Broadcom 800GE (bnge): - add link status and configuration handling - add various HW and SW statistics - Marvell/Cavium: - NPC HW block support for cn20k - Huawei (hinic3): - add mailbox / control queue - add rx VLAN offload - add driver info and link management - Ethernet NICs: - Marvell/Aquantia: - support reading SFP module info on some AQC100 cards - Realtek PCI (r8169): - add support for RTL8125cp - Realtek USB (r8152): - support for the RTL8157 5Gbit chip - add 2500baseT EEE status/configuration support - Ethernet NICs embedded and off-the-shelf IP: - Synopsys (stmmac): - cleanup and reorganize SerDes handling and PCS support - cleanup descriptor handling and per-platform data - cleanup and consolidate MDIO defines and handling - shrink driver memory use for internal structures - improve Tx IRQ coalescing - improve TCP segmentation handling - add support for Spacemit K3 - Cadence (macb): - support PHYs that have inband autoneg disabled with GEM - support IEEE 802.3az EEE - rework usrio capabilities and handling - AMD (xgbe): - improve power management for S0i3 - improve TX resilience for link-down handling - Virtual: - Google cloud vNIC: - support larger ring sizes in DQO-QPL mode - improve HW-GRO handling - support UDP GSO for DQO format - PCIe NTB: - support queue count configuration - Ethernet PHYs: - automatically disable PHY autonomous EEE if MAC is in charge - Broadcom: - add BCM84891/BCM84892 support - Micrel: - support for LAN9645X internal 
PHY - Realtek: - add RTL8224 pair order support - support PHY LEDs on RTL8211F-VD - support spread spectrum clocking (SSC) - Maxlinear: - add PHY-level statistics via ethtool - Ethernet switches: - Maxlinear (mxl862xx): - support for bridge offloading - support for VLANs - support driver statistics - Bluetooth: - large number of fixes and new device IDs - Mediatek: - support MT6639 (MT7927) - support MT7902 SDIO - WiFi: - Intel (iwlwifi): - UNII-9 and continuing UHR work - MediaTek (mt76): - mt7996/mt7925 MLO fixes/improvements - mt7996 NPU support (HW eth/wifi traffic offload) - Qualcomm (ath12k): - monitor mode support on IPQ5332 - basic hwmon temperature reporting - support IPQ5424 - Realtek: - add USB RX aggregation to improve performance - add USB TX flow control by tracking in-flight URBs - Cellular: - IPA v5.2 support Signed-off-by: Jakub Kicinski <kuba@kernel.org> -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmnelNoACgkQMUZtbf5S IrtWFw//WyiXuEiGawVQONnbu1dtR+3nw/cvNpSYi0IM66vbRUB9n+9fxm2MIyG4 4jI/c/X/fxIvUxEqGez3yPn5P7KqkQR8WRYwkxrMYKRpXeukN0IDk5Euew5DskCe wtBKNJOQWKdKXff0bLQoJ9dHWYuJ2IMRVil5M3fhUbeUOXeyJD7Yn1w2ICvJAbj+ T/Hw7sEtchNaHp6h6SbaQfahkUFHQG5peNoETkZF4UDF6ALGY29WH91GXeO2lrgN IxX203KtaavV0oU8T0oixZgOc57Ns081YfFL/F1JP2HV6lgkwhuq+zxCrRTi1c9M HPTXgwD7Z80Y74nM3YTLrPfoMOP8GLBZgdV3rUpwmteM26+gMTm+O1zHUur5ZoGy D6TaMFguPTIqiRyrARa9xY/J6r9TQkc2Wfu4bIuPndKFg8xPoepuEObODnh0+5Hg 4j4pdFhIo2huENhSg7kVb/yl+1q68SFwM3RqTmx+OhCa0AyjcKIKgt/UBhismdnG r8obxzb+nXeJc2rRDuwNMwlBlcMSbep27uGt64zeHMMXVhTVqOoytNaL/X/ZpH2m A0DscUrpHvb36IoDPtanc6irP+JOh5Xe7Nw5qhkgwsMc7hlf8SyyHB4OUBBaz1qA ETSnHlfwklRmXSpWqH2LyGXjdOQpDKP46+h0W3dttMD2/cRBqYo= =EhQZ -----END PGP SIGNATURE----- Merge tag 'net-next-7.1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next Pull networking updates from Jakub Kicinski: "Core & protocols: - Support HW queue leasing, allowing containers to be granted access to HW queues for zero-copy operations and AF_XDP - Number 
of code moves to help the compiler with inlining. Avoid output arguments for returning drop reason where possible - Rework drop handling within qdiscs to include more metadata about the reason and dropping qdisc in the tracepoints - Remove the rtnl_lock use from IP Multicast Routing - Pack size information into the Rx Flow Steering table pointer itself. This allows making the table itself a flat array of u32s, thus making the table allocation size a power of two - Report TCP delayed ack timer information via socket diag - Add ip_local_port_step_width sysctl to allow distributing the randomly selected ports more evenly throughout the allowed space - Add support for per-route tunsrc in IPv6 segment routing - Start work of switching sockopt handling to iov_iter - Improve dynamic recvbuf sizing in MPTCP, limit burstiness and avoid buffer size drifting up - Support MSG_EOR in MPTCP - Add stp_mode attribute to the bridge driver for STP mode selection. This addresses concerns about call_usermodehelper() usage - Remove UDP-Lite support (as announced in 2023) - Remove support for building IPv6 as a module. Remove the now unnecessary function calling indirection Cross-tree stuff: - Move Michael MIC code from generic crypto into wireless, it's considered insecure but some WiFi networks still need it Netfilter: - Switch nft_fib_ipv6 module to no longer need temporary dst_entry object allocations by using fib6_lookup() + RCU. Florian W reports this gets us ~13% higher packet rate - Convert IPVS's global __ip_vs_mutex to per-net service_mutex and switch the service tables to be per-net. 
Convert some code that walks the service lists to use RCU instead of the service_mutex - Add more opinionated input validation to lower security exposure - Make IPVS hash tables to be per-netns and resizable Wireless: - Finished assoc frame encryption/EPPKE/802.1X-over-auth - Radar detection improvements - Add 6 GHz incumbent signal detection APIs - Multi-link support for FILS, probe response templates and client probing - New APIs and mac80211 support for NAN (Neighbor Aware Networking, aka Wi-Fi Aware) so less work must be in firmware Driver API: - Add numerical ID for devlink instances (to avoid having to create fake bus/device pairs just to have an ID). Support shared devlink instances which span multiple PFs - Add standard counters for reporting pause storm events (implement in mlx5 and fbnic) - Add configuration API for completion writeback buffering (implement in mana) - Support driver-initiated change of RSS context sizes - Support DPLL monitoring input frequency (implement in zl3073x) - Support per-port resources in devlink (implement in mlx5) Misc: - Expand the YAML spec for Netfilter Drivers - Software: - macvlan: support multicast rx for bridge ports with shared source MAC address - team: decouple receive and transmit enablement for IEEE 802.3ad LACP "independent control" - Ethernet high-speed NICs: - nVidia/Mellanox: - support high order pages in zero-copy mode (for payload coalescing) - support multiple packets in a page (for systems with 64kB pages) - Broadcom 25-400GE (bnxt): - implement XDP RSS hash metadata extraction - add software fallback for UDP GSO, lowering the IOMMU cost - Broadcom 800GE (bnge): - add link status and configuration handling - add various HW and SW statistics - Marvell/Cavium: - NPC HW block support for cn20k - Huawei (hinic3): - add mailbox / control queue - add rx VLAN offload - add driver info and link management - Ethernet NICs: - Marvell/Aquantia: - support reading SFP module info on some AQC100 cards - Realtek PCI 
(r8169): - add support for RTL8125cp - Realtek USB (r8152): - support for the RTL8157 5Gbit chip - add 2500baseT EEE status/configuration support - Ethernet NICs embedded and off-the-shelf IP: - Synopsys (stmmac): - cleanup and reorganize SerDes handling and PCS support - cleanup descriptor handling and per-platform data - cleanup and consolidate MDIO defines and handling - shrink driver memory use for internal structures - improve Tx IRQ coalescing - improve TCP segmentation handling - add support for Spacemit K3 - Cadence (macb): - support PHYs that have inband autoneg disabled with GEM - support IEEE 802.3az EEE - rework usrio capabilities and handling - AMD (xgbe): - improve power management for S0i3 - improve TX resilience for link-down handling - Virtual: - Google cloud vNIC: - support larger ring sizes in DQO-QPL mode - improve HW-GRO handling - support UDP GSO for DQO format - PCIe NTB: - support queue count configuration - Ethernet PHYs: - automatically disable PHY autonomous EEE if MAC is in charge - Broadcom: - add BCM84891/BCM84892 support - Micrel: - support for LAN9645X internal PHY - Realtek: - add RTL8224 pair order support - support PHY LEDs on RTL8211F-VD - support spread spectrum clocking (SSC) - Maxlinear: - add PHY-level statistics via ethtool - Ethernet switches: - Maxlinear (mxl862xx): - support for bridge offloading - support for VLANs - support driver statistics - Bluetooth: - large number of fixes and new device IDs - Mediatek: - support MT6639 (MT7927) - support MT7902 SDIO - WiFi: - Intel (iwlwifi): - UNII-9 and continuing UHR work - MediaTek (mt76): - mt7996/mt7925 MLO fixes/improvements - mt7996 NPU support (HW eth/wifi traffic offload) - Qualcomm (ath12k): - monitor mode support on IPQ5332 - basic hwmon temperature reporting - support IPQ5424 - Realtek: - add USB RX aggregation to improve performance - add USB TX flow control by tracking in-flight URBs - Cellular: - IPA v5.2 support" * tag 'net-next-7.1' of 
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1561 commits) net: pse-pd: fix kernel-doc function name for pse_control_find_by_id() wireguard: device: use exit_rtnl callback instead of manual rtnl_lock in pre_exit wireguard: allowedips: remove redundant space tools: ynl: add sample for wireguard wireguard: allowedips: Use kfree_rcu() instead of call_rcu() MAINTAINERS: Add netkit selftest files selftests/net: Add additional test coverage in nk_qlease selftests/net: Split netdevsim tests from HW tests in nk_qlease tools/ynl: Make YnlFamily closeable as a context manager net: airoha: Add missing PPE configurations in airoha_ppe_hw_init() net: airoha: Fix VIP configuration for AN7583 SoC net: caif: clear client service pointer on teardown net: strparser: fix skb_head leak in strp_abort_strp() net: usb: cdc-phonet: fix skb frags[] overflow in rx_complete() selftests/bpf: add test for xdp_master_redirect with bond not up net, bpf: fix null-ptr-deref in xdp_master_redirect() for down master net: airoha: Remove PCE_MC_EN_MASK bit in REG_FE_PCE_CFG configuration sctp: disable BH before calling udp_tunnel_xmit_skb() sctp: fix missing encap_port propagation for GSO fragments net: airoha: Rely on net_device pointer in ETS callbacks ...	2026-04-14 18:36:10 -07:00
Linus Torvalds	23acda7c22	for-7.1/io_uring-20260411 Merge tag 'for-7.1/io_uring-20260411' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux Pull io_uring updates from Jens Axboe: - Add a callback driven main loop for io_uring, and BPF struct_ops on top to allow implementing custom event loop logic - Decouple IOPOLL from being a ring-wide all-or-nothing setting, allowing IOPOLL use cases to also issue certain white listed non-polled opcodes - Timeout improvements.
Migrate internal timeout storage from timespec64 to ktime_t for simpler arithmetic and avoid copying of timespec data - Zero-copy receive (zcrx) updates: - Add a device-less mode (ZCRX_REG_NODEV) for testing and experimentation where data flows through the copy fallback path - Fix two-step unregistration regression, DMA length calculations, xarray mark usage, and a potential 32-bit overflow in id shifting - Refactoring toward multi-area support: dedicated refill queue struct, consolidated DMA syncing, netmem array refilling format, and guard-based locking - Zero-copy transmit (zctx) cleanup: - Unify io_send_zc() and io_sendmsg_zc() into a single function - Add vectorized registered buffer send for IORING_OP_SEND_ZC - Add separate notification user_data via sqe->addr3 so notification and completion CQEs can be distinguished without extra reference counting - Switch struct io_ring_ctx internal bitfields to explicit flag bits with atomic-safe accessors, and annotate the known harmless races on those flags - Various optimizations caching ctx and other request fields in local variables to avoid repeated loads, and cleanups for tctx setup, ring fd registration, and read path early returns * tag 'for-7.1/io_uring-20260411' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux: (58 commits) io_uring: unify getting ctx from passed in file descriptor io_uring/register: don't get a reference to the registered ring fd io_uring/tctx: clean up __io_uring_add_tctx_node() error handling io_uring/tctx: have io_uring_alloc_task_context() return tctx io_uring/timeout: use 'ctx' consistently io_uring/rw: clean up __io_read() obsolete comment and early returns io_uring/zcrx: use correct mmap off constants io_uring/zcrx: use dma_len for chunk size calculation io_uring/zcrx: don't clear not allocated niovs io_uring/zcrx: don't use mark0 for allocating xarray io_uring: cast id to u64 before shifting in io_allocate_rbuf_ring() io_uring/zcrx: reject REG_NODEV with large rx_buf_size 
io_uring/cancel: validate opcode for IORING_ASYNC_CANCEL_OP io_uring/rsrc: use io_cache_free() to free node io_uring/zcrx: rename zcrx [un]register functions io_uring/zcrx: check ctrl op payload struct sizes io_uring/zcrx: cache fallback availability in zcrx ctx io_uring/zcrx: warn on a repeated area append io_uring/zcrx: consolidate dma syncing io_uring/zcrx: netmem array as refiling format ...	2026-04-13 16:22:30 -07:00
Linus Torvalds	ef3da345cc	vfs-7.1-rc1.misc Please consider pulling these changes from the signed vfs-7.1-rc1.misc tag. Thanks! Christian Merge tag 'vfs-7.1-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull misc vfs updates from Christian Brauner: "Features: - coredump: add tracepoint for coredump events - fs: hide file and bfile caches behind runtime const machinery Fixes: - fix architecture-specific compat_ftruncate64 implementations - dcache: Limit the minimal number of bucket to two - fs/omfs: reject s_sys_blocksize smaller than OMFS_DIR_START - fs/mbcache: cancel shrink work before destroying the cache - dcache: permit dynamic_dname()s up to NAME_MAX Cleanups: - remove or unexport unused fs_context infrastructure - trivial ->setattr cleanups - selftests/filesystems: Assume that TIOCGPTPEER is defined - writeback: fix kernel-doc function name mismatch for wb_put_many() - autofs: replace manual symlink buffer allocation in autofs_dir_symlink - init/initramfs.c: trivial fix: FSM -> Finite-state machine - fs: remove stale and duplicate forward declarations - readdir: Introduce dirent_size() - fs: Replace user_access_{begin/end} by scoped user access - kernel: acct: fix duplicate word in comment - fs: write a better comment in step_into() concerning .mnt assignment - fs: attr: fix comment formatting and spelling issues" * tag 'vfs-7.1-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (28 commits) dcache: permit dynamic_dname()s up to NAME_MAX fs: attr: fix comment formatting and spelling issues fs: hide file and bfile caches behind runtime const machinery fs: write a better comment in step_into() concerning .mnt assignment proc: rename proc_notify_change to proc_setattr proc: rename proc_setattr to
proc_nochmod_setattr affs: rename affs_notify_change to affs_setattr adfs: rename adfs_notify_change to adfs_setattr hfs: update comments on hfs_inode_setattr kernel: acct: fix duplicate word in comment fs: Replace user_access_{begin/end} by scoped user access readdir: Introduce dirent_size() coredump: add tracepoint for coredump events fs: remove do_sys_truncate fs: pass on FTRUNCATE_* flags to do_truncate fs: fix archiecture-specific compat_ftruncate64 fs: remove stale and duplicate forward declarations init/initramfs.c: trivial fix: FSM -> Finite-state machine autofs: replace manual symlink buffer allocation in autofs_dir_symlink fs/mbcache: cancel shrink work before destroying the cache ...	2026-04-13 14:20:11 -07:00
Jakub Kicinski	1508922588	Merge branch 'netkit-support-for-io_uring-zero-copy-and-af_xdp' Daniel Borkmann says: ==================== netkit: Support for io_uring zero-copy and AF_XDP Containers use virtual netdevs to route traffic from a physical netdev in the host namespace. They do not have access to the physical netdev in the host and thus can't use memory providers or AF_XDP that require reconfiguring/restarting queues in the physical netdev. This patchset adds the concept of queue leasing to virtual netdevs that allow containers to use memory providers and AF_XDP at native speed. Leased queues are bound to a real queue in a physical netdev and act as a proxy. Memory providers and AF_XDP operations take an ifindex and queue id, so containers would pass in an ifindex for a virtual netdev and a queue id of a leased queue, which then gets proxied to the underlying real queue. We have implemented support for this concept in netkit and tested the latter against Nvidia ConnectX-6 (mlx5) as well as Broadcom BCM957504 (bnxt_en) 100G NICs. For more details see the individual patches. ==================== Link: https://patch.msgid.link/20260402231031.447597-1-daniel@iogearbox.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-04-09 18:24:35 -07:00
David Wei	222b5566a0	net: Proxy netdev_queue_get_dma_dev for leased queues Extend netdev_queue_get_dma_dev to return the physical device of the real rxq for DMA in case the queue was leased. This allows memory providers like io_uring zero-copy or devmem to bind to the physically leased rxq via virtual devices such as netkit. Signed-off-by: David Wei <dw@davidwei.uk> Co-developed-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org> Link: https://patch.msgid.link/20260402231031.447597-8-daniel@iogearbox.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-04-09 18:21:46 -07:00
Daniel Borkmann	1e91c98bc9	net: Slightly simplify net_mp_{open,close}_rxq net_mp_open_rxq is currently not used in the tree as all callers are using __net_mp_open_rxq directly, and net_mp_close_rxq is only used once while all other locations use __net_mp_close_rxq. Consolidate into a single API, netif_mp_{open,close}_rxq, using the netif_ prefix to indicate that the caller is responsible for locking. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Co-developed-by: David Wei <dw@davidwei.uk> Signed-off-by: David Wei <dw@davidwei.uk> Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org> Link: https://patch.msgid.link/20260402231031.447597-6-daniel@iogearbox.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-04-09 18:21:46 -07:00
Jens Axboe	c5e9f6a96b	io_uring: unify getting ctx from passed in file descriptor io_uring_enter() and io_uring_register() end up having duplicated code for getting a ctx from a passed in file descriptor, for either a registered ring descriptor or a normal file descriptor. Move the io_uring_register_get_file() into io_uring.c and name it a bit more generically, and use it from both callsites rather than have that logic and handling duplicated. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-04-08 13:21:35 -06:00
Jens Axboe	b4d893d636	io_uring/register: don't get a reference to the registered ring fd This isn't necessary and was only done because the register path isn't a hot path and hence the extra ref/put doesn't matter, and to have the exit path be able to unconditionally put whatever file was gotten regardless of the type. In preparation for sharing this code with the main io_uring_enter(2) syscall, drop the reference and have the caller conditionally put the file if it was a normal file descriptor. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-04-08 13:21:35 -06:00
Jens Axboe	7880174e1e	io_uring/tctx: clean up __io_uring_add_tctx_node() error handling Refactor __io_uring_add_tctx_node() so that on error it never leaves current->io_uring pointing at a half-setup tctx. This moves the assignment of current->io_uring to the end of the function post any failure points. Separate out the node installation into io_tctx_install_node() to further clean this up. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-04-08 13:21:34 -06:00
Jens Axboe	2c453a4281	io_uring/tctx: have io_uring_alloc_task_context() return tctx Instead of having io_uring_alloc_task_context() return an int and assign tsk->io_uring, just have it return the task context directly. This enables cleaner error handling in callers, which may have failure points post calling io_uring_alloc_task_context(). Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-04-08 13:21:30 -06:00
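The two tctx commits above describe a general pattern: the allocator returns the new object directly, and the caller publishes it only after the last point where it can still fail. A minimal userspace sketch of that pattern follows. All names (`example_tctx`, `example_task`, `example_add_tctx_node`, the `fail_late` switch) are hypothetical stand-ins, not the kernel's structures or functions, and the sketch signals allocation failure with NULL, which may differ from what the kernel code does.

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical stand-ins; not the kernel's io_uring task context or task_struct. */
struct example_tctx {
	int nodes;
};

struct example_task {
	struct example_tctx *io_ctx;
};

/* Return the new context directly (NULL on failure) rather than
 * returning an int and assigning the task pointer as a side effect. */
static struct example_tctx *example_alloc_tctx(void)
{
	return calloc(1, sizeof(struct example_tctx));
}

/* fail_late simulates an error that happens after the allocation. */
static int example_add_tctx_node(struct example_task *tsk, int fail_late)
{
	struct example_tctx *tctx = tsk->io_ctx;
	int created = 0;

	if (!tctx) {
		tctx = example_alloc_tctx();
		if (!tctx)
			return -ENOMEM;
		created = 1;
	}

	if (fail_late) {
		/* Nothing was published yet, so cleanup stays local. */
		if (created)
			free(tctx);
		return -EINVAL;
	}

	tctx->nodes++;
	/* Publish only after every failure point has been passed. */
	tsk->io_ctx = tctx;
	return 0;
}
```

With this ordering, a failed call leaves `tsk->io_ctx` exactly as it was, so the task never points at a half-initialised context.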
Linus Torvalds	e41255ce7a	io_uring-7.0-20260403 Merge tag 'io_uring-7.0-20260403' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux Pull io_uring fixes from Jens Axboe: - A previous fix in this release covered the case of the rings being RCU protected during resize, but it missed a few spots. This covers the rest - Fix the cBPF filters when COW'ed, introduced in this merge window - Fix for an attempt to import a zero sized buffer - Fix for a missing clamp in importing bundle buffers * tag 'io_uring-7.0-20260403' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux: io_uring/bpf_filters: retain COW'ed settings on parse failures io_uring: protect remaining lockless ctx->rings accesses with RCU io_uring/rsrc: reject zero-length fixed buffer import io_uring/net: fix slab-out-of-bounds read in io_bundle_nbufs()	2026-04-03 11:58:04 -07:00
Yang Xiuwei	f847bf6d29	io_uring/timeout: use 'ctx' consistently There's already a local ctx variable, yet cq_timeouts accounting uses req->ctx. Use ctx consistently. Signed-off-by: Yang Xiuwei <yangxiuwei@kylinos.cn> Link: https://patch.msgid.link/20260402014952.260414-1-yangxiuwei@kylinos.cn Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-04-02 07:08:40 -06:00
Joanne Koong	c7f3aaf3e8	io_uring/rw: clean up __io_read() obsolete comment and early returns After commit `a9165b83c1` ("io_uring/rw: always setup io_async_rw for read/write requests") which moved the iovec allocation into the prep path and stores it in req->async_data where it now gets freed as part of the request lifecycle, this comment is now outdated. Remove it and clean up the goto as well. Signed-off-by: Joanne Koong <joannelkoong@gmail.com> Link: https://patch.msgid.link/20260401173511.4052303-1-joannelkoong@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-04-02 06:55:50 -06:00
Pavel Begunkov	4c6f93951b	io_uring/zcrx: use correct mmap off constants zcrx was using IORING_OFF_PBUF_SHIFT during first iterations, but there is now a separate constant it should use. Both are 16 so it doesn't change anything, but improve it for the future. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://patch.msgid.link/fe16ebe9ba4048a7e12f9b3b50880bd175b1ce03.1774780198.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-04-02 06:55:48 -06:00
Pavel Begunkov	7120b87bed	io_uring/zcrx: use dma_len for chunk size calculation Buffers are now dma-mapped earlier and we can sg_dma_len(), otherwise, since it's walking with for_each_sgtable_dma_sg(), it might wrongfully reject some configurations. As a bonus, it'd now be able to use larger chunks if dma addresses are coalesced e.g by iommu. Fixes: 8c0cab0b7bf7 ("io_uring/zcrx: always dma map in advance") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://patch.msgid.link/03b219af3f6cfdd1cf64679b8bab7461e47cc123.1774780198.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-04-02 06:55:47 -06:00
Pavel Begunkov	52dcd1776b	io_uring/zcrx: don't clear not allocated niovs Now that area->is_mapped is set earlier before niovs array is allocated, io_zcrx_free_area -> io_zcrx_unmap_area in an error path can try to clear dma addresses for unallocated niovs, fix it. Fixes: 8c0cab0b7bf7 ("io_uring/zcrx: always dma map in advance") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://patch.msgid.link/cbcb7749b5a001ecd4d1c303515ce9403215640c.1774780198.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-04-02 06:55:36 -06:00
Pavel Begunkov	8ae2837d5a	io_uring/zcrx: don't use mark0 for allocating xarray XA_MARK_0 is not compatible with xarray allocating entries, use XA_MARK_1. Fixes: fda90d43f4fac ("io_uring/zcrx: return back two step unregistration") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://patch.msgid.link/f232cfd3c466047d333b474dd2bddd246b6ebb82.1774780198.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-04-01 10:21:13 -06:00
Anas Iqbal	77d8c8d0f1	io_uring: cast id to u64 before shifting in io_allocate_rbuf_ring() Smatch warns: io_uring/zcrx.c:393 io_allocate_rbuf_ring() warn: should 'id << 16' be a 64 bit type? The expression 'id << IORING_OFF_PBUF_SHIFT' is evaluated using 32-bit arithmetic because id is a u32. This may overflow before being promoted to the 64-bit mmap_offset. Cast id to u64 before shifting to ensure the shift is performed in 64-bit arithmetic. Signed-off-by: Anas Iqbal <mohd.abd.6602@gmail.com> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://patch.msgid.link/52400e1b343691416bef3ed3ae287fb1a88d407f.1774780198.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-04-01 10:21:13 -06:00
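The overflow described in the commit above can be reproduced in plain C. The sketch below is a userspace illustration, not the kernel code: `EXAMPLE_OFF_SHIFT` is a hypothetical stand-in for `IORING_OFF_PBUF_SHIFT` (the log states the shift is 16), and both helper names are made up for the example.

```c
#include <stdint.h>

/* Hypothetical stand-in for the kernel constant; the log says the value is 16. */
#define EXAMPLE_OFF_SHIFT 16

/* The shift is evaluated in 32-bit arithmetic, so the high bits are
 * discarded before the result is widened to 64 bits. */
static uint64_t example_offset_narrow(uint32_t id)
{
	return id << EXAMPLE_OFF_SHIFT;
}

/* Casting first makes the shift itself a 64-bit operation. */
static uint64_t example_offset_wide(uint32_t id)
{
	return (uint64_t)id << EXAMPLE_OFF_SHIFT;
}
```

For `id = 0x10000` the narrow variant wraps to 0, while the wide variant yields `0x100000000`. For small ids the two agree, which is why such a bug can go unnoticed until a static checker like Smatch flags it.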
Pavel Begunkov	a9d008489f	io_uring/zcrx: reject REG_NODEV with large rx_buf_size The copy fallback path doesn't care about the actual niov size and only uses first PAGE_SIZE bytes, and any additional space will be wasted. Since ZCRX_REG_NODEV solely relies on the copy path, it doesn't make sense to support non-standard rx_buf_len. Reject it for now, and re-enable once improved. Fixes: c11728021d5cd ("io_uring/zcrx: implement device-less mode for zcrx") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://patch.msgid.link/3e7652d9c27f8ac5d2b141e3af47971f2771fb05.1774780198.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-04-01 10:21:13 -06:00
Amir Mohammad Jahangirzad	85a58309c0	io_uring/cancel: validate opcode for IORING_ASYNC_CANCEL_OP io_async_cancel_prep() reads the opcode selector from sqe->len and stores it in cancel->opcode, which is an 8-bit field. Since sqe->len is a 32-bit value, values larger than U8_MAX are implicitly truncated. This can cause unintended opcode matches when the truncated value corresponds to a valid io_uring opcode. For example, submitting a value such as 0x10b will be truncated to 0x0b (IORING_OP_TIMEOUT), allowing a cancel request to match operations it did not intend to target. Validate the opcode value before assigning it to the 8-bit field and reject values outside the valid io_uring opcode range. Signed-off-by: Amir Mohammad Jahangirzad <a.jahangirzad@gmail.com> Link: https://patch.msgid.link/20260331232113.615972-1-a.jahangirzad@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-04-01 10:21:13 -06:00
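The truncation described above is ordinary C narrowing: storing a 32-bit value into an 8-bit field keeps only the low byte. The following userspace sketch shows the unchecked and checked variants. It is an illustration under assumptions: `example_cancel`, both prep helpers, the bound `EXAMPLE_OP_LAST`, and the choice of `-EINVAL` as the error are hypothetical and not taken from the kernel patch.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical upper bound for valid opcodes; the real limit lives in the kernel. */
#define EXAMPLE_OP_LAST 64u

/* Hypothetical stand-in for the cancel request state with its 8-bit opcode. */
struct example_cancel {
	uint8_t opcode;
};

/* Unchecked: 0x10b silently becomes 0x0b when stored in the 8-bit field. */
static void example_prep_unchecked(struct example_cancel *c, uint32_t len)
{
	c->opcode = (uint8_t)len;
}

/* Checked: reject out-of-range selectors before narrowing.
 * The error code is an assumption made for this sketch. */
static int example_prep_checked(struct example_cancel *c, uint32_t len)
{
	if (len >= EXAMPLE_OP_LAST)
		return -EINVAL;
	c->opcode = (uint8_t)len;
	return 0;
}
```

Validating against the opcode range, rather than only against `U8_MAX`, also rejects values that fit in a byte but do not name any opcode.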
Jackie Liu	19a8cc6cda	io_uring/rsrc: use io_cache_free() to free node Replace kfree(node) with io_cache_free() in io_buffer_register_bvec() to match all other error paths that free nodes allocated via io_rsrc_node_alloc(). The node is allocated through io_cache_alloc() internally, so it should be returned to the cache via io_cache_free() for proper object reuse. Signed-off-by: Jackie Liu <liuyun01@kylinos.cn> Link: https://patch.msgid.link/20260331104509.7055-1-liu.yun@linux.dev [axboe: remove fixes tag, it's not a fix, it's a cleanup] Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-04-01 10:21:13 -06:00
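Why the matching free routine matters for reuse can be seen in a generic object cache. The sketch below is a simplified userspace model and an assumption about how such caches commonly work. It does not reproduce `io_cache_alloc()` or `io_cache_free()`, and every name in it is hypothetical.

```c
#include <stdlib.h>

#define EXAMPLE_CACHE_MAX 4

/* Hypothetical fixed-size cache of previously freed objects. */
struct example_cache {
	void *slots[EXAMPLE_CACHE_MAX];
	unsigned int nr;
	size_t elem_size;
};

/* Prefer a cached object; fall back to the general allocator. */
static void *example_cache_alloc(struct example_cache *c)
{
	if (c->nr)
		return c->slots[--c->nr];
	return malloc(c->elem_size);
}

/* Return the object to the cache if there is room, else release it. */
static void example_cache_free(struct example_cache *c, void *obj)
{
	if (c->nr < EXAMPLE_CACHE_MAX) {
		c->slots[c->nr++] = obj;
		return;
	}
	free(obj);
}
```

In this model, calling plain `free()` on a cached object is still memory-safe, but the object never re-enters the cache, so the next allocation goes back to the general allocator instead of reusing it.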
Pavel Begunkov	7c713dd007	io_uring/zcrx: rename zcrx [un]register functions Drop "ifqs" from function names, as it refers to an interface queue and there might be none once a device-less mode is introduced. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://patch.msgid.link/657874acd117ec30fa6f45d9d844471c753b5a0f.1774261953.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-04-01 10:21:13 -06:00
Pavel Begunkov	de6ed1b323	io_uring/zcrx: check ctrl op payload struct sizes Add a build check that ctrl payloads are of the same size and don't grow struct zcrx_ctrl. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://patch.msgid.link/af66caf9776d18e9ff880ab828eb159a6a03caf5.1774261953.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-04-01 10:21:13 -06:00
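A compile-time size check of the kind this commit describes can be sketched with C11 `_Static_assert` (kernel code has its own macros for this, such as `BUILD_BUG_ON()`). The struct layouts below are invented for illustration and are not the real `struct zcrx_ctrl` or its payloads.

```c
#include <stdint.h>

/* Hypothetical payloads for two control operations. */
struct example_ctrl_flush {
	uint64_t resv[6];
};

struct example_ctrl_export {
	uint32_t fd;
	uint32_t resv[11];
};

/* Hypothetical control struct: a small header plus a union of payloads. */
struct example_ctrl {
	uint32_t op;
	uint32_t resv;
	union {
		struct example_ctrl_flush flush;
		struct example_ctrl_export export_op;
	};
};

/* Compilation fails if a payload grows or the payloads diverge in size. */
_Static_assert(sizeof(struct example_ctrl_flush) ==
	       sizeof(struct example_ctrl_export),
	       "ctrl payloads must have the same size");
_Static_assert(sizeof(struct example_ctrl) ==
	       8 + sizeof(struct example_ctrl_flush),
	       "payloads must not grow the ctrl struct");
```

Because the check runs at build time, a later change that enlarges one payload breaks compilation immediately instead of silently changing a user-visible structure size.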
Pavel Begunkov	5c727ce042	io_uring/zcrx: cache fallback availability in zcrx ctx Store a flag in struct io_zcrx_ifq telling if the backing memory is normal page or dmabuf based. It was looking it up from the area, however it logically allocates from the zcrx ctx and not a particular area, and once we add more than one area it'll become a mess. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://patch.msgid.link/65e75408a7758fe7e60fae89b7a8d5ae4857f515.1774261953.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-04-01 10:21:13 -06:00

2129 Commits