-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmnPokUQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpjW6D/91Xg/mGvYUBVvwEhP0ydPncuAsThnkoDHY
6Pu+VxawKW480yAC06nktAeDgJNnpFpJXatPEtk2n8r7Ol3Cx0sDWdQjzoKSlBC7
9wj+MVpCcU970Gb1G6PNLKQoW+DxKuD9Iq6Ph434uCx/bgC2EKthj0vYpssoU48S
OxyFGBTjhbgnmiZaAEMHpLC/LJP27eH24QQbobeVWyY7C6jy6YI0WQaoG4Qt+UMd
S2XdFe97xrVaCVS3E5X5BAyHCcMX4e1D6/Y7bNDGG3Ke673RuUJHhqvk8P1NJnTI
CaMlfoGhNw36FpkzTYIvoZlkCFl48axXmscRcekTg4d9ssnY9aSFVY+xMSHmkhKu
zs1r1tZK970xUbQK0NAoD9T+LsFKU1S0PaEaCL2KMHwz9vG0uY7iUYteKqdM8L/f
jUpYcxn9R6AhdeL77eEu3w6vCdMqP2+OgDv1uEpyJv6oWSdhfI38+EIwmMoEq+Az
BkDipYNh4lAiI23qbS9CDe5aam6pv+hwecDn3x7MZVpGZ6cJjs43QWwk+jZz+KQj
gacQu01q/TN7rpyaFYhkxPGKHVs259/uSLJY647ORgpJNXg4a+6DlB7/4YdW35il
O4gnKECSflmoePm7B4QFh8Q89XPma74hVqtDB7opz0xL3PQMq07EQKQmoNP7RTOp
GLW71uD2MA==
=q/wK
-----END PGP SIGNATURE-----
Merge tag 'io_uring-7.0-20260403' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux
Pull io_uring fixes from Jens Axboe:
- A previous fix in this release covered the case of the rings being
RCU protected during resize, but it missed a few spots. This covers
the rest
- Fix the cBPF filters when COW'ed, introduced in this merge window
- Fix for an attempt to import a zero sized buffer
- Fix for a missing clamp in importing bundle buffers
* tag 'io_uring-7.0-20260403' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux:
io_uring/bpf_filters: retain COW'ed settings on parse failures
io_uring: protect remaining lockless ctx->rings accesses with RCU
io_uring/rsrc: reject zero-length fixed buffer import
io_uring/net: fix slab-out-of-bounds read in io_bundle_nbufs()
If io_parse_restrictions() fails, it ends up clearing any restrictions
currently set. The intent is only to clear whatever it already applied,
but it ends up clearing everything, including whatever settings may have
been applied in a copy-on-write fashion already. Ensure that those are
retained.
Link: https://lore.kernel.org/io-uring/CAK8a0jzF-zaO5ZmdOrmfuxrhXuKg5m5+RDuO7tNvtj=kUYbW7Q@mail.gmail.com/
Reported-by: antonius <bluedragonsec2023@gmail.com>
Fixes: ed82f35b92 ("io_uring: allow registration of per-task restrictions")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Commit 9618908026 addressed one case of ctx->rings being potentially
accessed while a resize is happening on the ring, but there are still
a few others that need handling. Add a helper for retrieving the
rings associated with an io_uring context, and add some sanity checking
to that to catch bad uses. ->rings_rcu is always valid, as long as it's
used within RCU read lock. Any use of ->rings_rcu or ->rings inside
either ->uring_lock or ->completion_lock is sane as well.
Do the minimum fix for the current kernel, but set it up such that this
basic infra can be extended for later kernels to make this harder to
mess up in the future.
Thanks to Junxi Qian for finding and debugging this issue.
Cc: stable@vger.kernel.org
Fixes: 79cfe9e59c ("io_uring/register: add IORING_REGISTER_RESIZE_RINGS")
Reviewed-by: Junxi Qian <qjx1298677004@gmail.com>
Tested-by: Junxi Qian <qjx1298677004@gmail.com>
Link: https://lore.kernel.org/io-uring/20260330172348.89416-1-qjx1298677004@gmail.com/
Signed-off-by: Jens Axboe <axboe@kernel.dk>
validate_fixed_range() admits buf_addr at the exact end of the
registered region when len is zero, because the check uses strict
greater-than (buf_end > imu->ubuf + imu->len). io_import_fixed()
then computes offset == imu->len, which causes the bvec skip logic
to advance past the last bio_vec entry and read bv_offset from
out-of-bounds slab memory.
Return early from io_import_fixed() when len is zero. A zero-length
import has no data to transfer and should not walk the bvec array
at all.
BUG: KASAN: slab-out-of-bounds in io_import_reg_buf+0x697/0x7f0
Read of size 4 at addr ffff888002bcc254 by task poc/103
Call Trace:
io_import_reg_buf+0x697/0x7f0
io_write_fixed+0xd9/0x250
__io_issue_sqe+0xad/0x710
io_issue_sqe+0x7d/0x1100
io_submit_sqes+0x86a/0x23c0
__do_sys_io_uring_enter+0xa98/0x1590
Allocated by task 103:
The buggy address is located 12 bytes to the right of
allocated 584-byte region [ffff888002bcc000, ffff888002bcc248)
Fixes: 8622b20f23 ("io_uring: add validate_fixed_range() for validate fixed buffer")
Signed-off-by: Qi Tang <tpluszz77@gmail.com>
Link: https://patch.msgid.link/20260329164936.240871-1-tpluszz77@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
sqe->len is __u32 but gets stored into sr->len which is int. When
userspace passes sqe->len values exceeding INT_MAX (e.g. 0xFFFFFFFF),
sr->len overflows to a negative value. This negative value propagates
through the bundle recv/send path:
1. io_recv(): sel.val = sr->len (ssize_t gets -1)
2. io_recv_buf_select(): arg.max_len = sel->val (size_t gets
0xFFFFFFFFFFFFFFFF)
3. io_ring_buffers_peek(): buf->len is not clamped because max_len
is astronomically large
4. iov[].iov_len = 0xFFFFFFFF flows into io_bundle_nbufs()
5. io_bundle_nbufs(): min_t(int, 0xFFFFFFFF, ret) yields -1,
causing ret to increase instead of decrease, creating an
infinite loop that reads past the allocated iov[] array
This results in a slab-out-of-bounds read in io_bundle_nbufs() from
the kmalloc-64 slab, as nbufs increments past the allocated iovec
entries.
BUG: KASAN: slab-out-of-bounds in io_bundle_nbufs+0x128/0x160
Read of size 8 at addr ffff888100ae05c8 by task exp/145
Call Trace:
io_bundle_nbufs+0x128/0x160
io_recv_finish+0x117/0xe20
io_recv+0x2db/0x1160
Fix this by rejecting negative sr->len values early in both
io_sendmsg_prep() and io_recvmsg_prep(). Since sqe->len is __u32,
any value > INT_MAX indicates overflow and is not a valid length.
Fixes: a05d1f625c ("io_uring/net: support bundles for send")
Cc: stable@vger.kernel.org
Signed-off-by: Junxi Qian <qjx1298677004@gmail.com>
Link: https://patch.msgid.link/20260329153909.279046-1-qjx1298677004@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmnGbO4QHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpgrFD/9QbhCy3qf8WSfBWYlo9DzxDjaDjYJw5Oa+
4LijRxG5rK4Oj9po0ayR76fEjXYiTJGH5mrrt2avrgk9xtQmMlTwZNbh4imTpgWG
CILQKaWccgOZdL16l3laQ3zdkq1YR9YgzQPQvcrUxk2DfFwikNCZKm/+D/EW1jpu
Eo0YyqF0mTGHGG8PfWvTfd79Axt4yuXvjq6KG+usrAkgxFTDLx0uxLfSr++NQg7O
iM6LVxF2RfgKGF/FlInEKdf3NDDXcJaOegpw+vsZXieQuFXdep6arOj6SwX2t5bc
LrjTHDRWo3L8fb5B++PfTriyK9GGcoSDMa6QI38qb0CmoEoZuFx6OR+n7Bof7eSv
vUNeBhLA2U15RCCuasOZrFrqCRpPu7hLaEORpVCY3WuR4nqOw6+vNy+BZUdWq5pZ
RcDGxJIpvr2/Xn/3bJdJwhT4hjwq61z67EbJHeSnNYMpS6sV6GrT8PDFsxyqNfw/
Qu4c1hvA9bQsHA2iiZVAKb1GnooBL5bZFCKdtCUVyv0jJNnpL6Ccd1cgm/KowMQK
qYfCvH2ewslB5EGvjZ6Jix0bjqBBnWzJNk8RfoyvLFWpq8vauvrTK3Yc/SyNaB8n
9V/Nlqi/Rur+2g17flFUU/UMvHH6Jl++AYRNswrA9RZNirvOiG5mqE7b4RhdT44i
KUC04XpFRw==
=ldfF
-----END PGP SIGNATURE-----
Merge tag 'io_uring-7.0-20260327' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux
Pull io_uring fixes from Jens Axboe:
"Just two small fixes, both fixing regressions added in the fdinfo code
in 6.19 with the SQE mixed size support"
* tag 'io_uring-7.0-20260327' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux:
io_uring/fdinfo: fix OOB read in SQE_MIXED wrap check
io_uring/fdinfo: fix SQE_MIXED SQE displaying
__io_uring_show_fdinfo() iterates over pending SQEs and, for 128-byte
SQEs on an IORING_SETUP_SQE_MIXED ring, needs to detect when the second
half of the SQE would be past the end of the sq_sqes array. The current
check tests (++sq_head & sq_mask) == 0, but sq_head is only incremented
when a 128-byte SQE is encountered, not on every iteration. The actual
array index is sq_idx = (i + sq_head) & sq_mask, which can be sq_mask
(the last slot) while the wrap check passes.
Fix by checking sq_idx directly. Keep the sq_head increment so the loop
still skips the second half of the 128-byte SQE on the next iteration.
Fixes: 1cba30bf9f ("io_uring: add support for IORING_SETUP_SQE_MIXED")
Signed-off-by: Nicholas Carlini <nicholas@carlini.com>
Link: https://patch.msgid.link/20260327021823.3138396-1-nicholas@carlini.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
When displaying pending SQEs for a MIXED ring, each 128-byte SQE
increments sq_head to skip the second slot, but the loop counter is not
adjusted. This can cause the loop to read past sq_tail by one entry for
each 128-byte SQE encountered, displaying SQEs that haven't been made
consumable yet by the application.
Match the kernel's own consumption logic in io_init_req() which
decrements what's left when consuming the extra slot.
Fixes: 1cba30bf9f ("io_uring: add support for IORING_SETUP_SQE_MIXED")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmm9O2oQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgprb2D/9EM91W+zo0M+6Xjs+A/YaHETDdSUUQwyuT
B8ZNHEHwkB3TdLHxPvFDopf8bC8IFy1T3GhGkHgq/Uvh7hx9oYTvVrV1xjXCqW/W
mpP5bgguJYlOTwFW2lYFWQTI5CZDletOeWbng00ecAZPoVVSjJx+KNVFm3DntYSR
WeO20Y/OWXZ6vFu4b7825lHFYgVVPdO1iy2X5HYx4XHlWOjj7LK7KW2UDai4lPIF
w1IjsOk8Cv7Wj38qYB2d3dMlgryER0dF1Hf2N+1awh1FFjlsiZ+YbLnrTiFyKwa0
zip6ejDXwpevS6/FmCa6u19/DHxQ0WgvEZOwQlMyggCrIdaqN01LMGhCkleTev72
demfEKA71avcG1gWmJLX9aNGfkY0rXT0Q9ArLk8FTPXnImnXQdJDfeLuql0ExMrL
2MWkhJS4Dvp4MDERq0WwOdPrAMiDrG9O6DnHQBZYZjJuuqljF2LWb6Dbj7V9gO+M
ePADlnKhR4o0MKvTcFqeM33kKIkdwSNzDvcLO+gnAeDxNFQsCyDSm8erGj+LEMfN
LA2668ZWF24plKmWbEq1IZaw8P9CM5AqD6+QCIB+XPge9JNxzpeUuAfTkwuNp+bz
V5vDw3RQ8p65cxhhD34rqor/KGN5WlSOYTrp9aU4BNIse9S/o8XAyPcR3ywZChYJ
OHvT3W4O3g==
=xEWR
-----END PGP SIGNATURE-----
Merge tag 'io_uring-7.0-20260320' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux
Pull io_uring fixes from Jens Axboe:
- A bit of a work-around for AF_UNIX recv multishot, as the in-kernel
implementation doesn't properly signal EOF. We'll likely rework this
one going forward, but the fix is sufficient for now
- Two fixes for incrementally consumed buffers, for non-pollable files
and for 0 byte reads
* tag 'io_uring-7.0-20260320' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux:
io_uring/kbuf: propagate BUF_MORE through early buffer commit path
io_uring/kbuf: fix missing BUF_MORE for incremental buffers at EOF
io_uring/poll: fix multishot recv missing EOF on wakeup race
When io_should_commit() returns true (eg for non-pollable files), buffer
commit happens at buffer selection time and sel->buf_list is set to
NULL. When __io_put_kbufs() generates CQE flags at completion time, it
calls __io_put_kbuf_ring() which finds a NULL buffer_list and hence
cannot determine whether the buffer was consumed or not. This means that
IORING_CQE_F_BUF_MORE is never set for non-pollable input with
incrementally consumed buffers.
Likewise for io_buffers_select(), which always commits upfront and
discards the return value of io_kbuf_commit().
Add REQ_F_BUF_MORE to store the result of io_kbuf_commit() during early
commit. Then __io_put_kbuf_ring() can check this flag and set
IORING_F_BUF_MORE accordingy.
Reported-by: Martin Michaelis <code@mgjm.de>
Cc: stable@vger.kernel.org
Fixes: ae98dbf43d ("io_uring/kbuf: add support for incremental buffer consumption")
Link: https://github.com/axboe/liburing/issues/1553
Signed-off-by: Jens Axboe <axboe@kernel.dk>
For a zero length transfer, io_kbuf_inc_commit() is called with !len.
Since we never enter the while loop to consume the buffers,
io_kbuf_inc_commit() ends up returning true, consuming the buffer. But
if no data was consumed, by definition it cannot have consumed the
buffer. Return false for that case.
Reported-by: Martin Michaelis <code@mgjm.de>
Cc: stable@vger.kernel.org
Fixes: ae98dbf43d ("io_uring/kbuf: add support for incremental buffer consumption")
Link: https://github.com/axboe/liburing/issues/1553
Signed-off-by: Jens Axboe <axboe@kernel.dk>
When a socket send and shutdown() happen back-to-back, both fire
wake-ups before the receiver's task_work has a chance to run. The first
wake gets poll ownership (poll_refs=1), and the second bumps it to 2.
When io_poll_check_events() runs, it calls io_poll_issue() which does a
recv that reads the data and returns IOU_RETRY. The loop then drains all
accumulated refs (atomic_sub_return(2) -> 0) and exits, even though only
the first event was consumed. Since the shutdown is a persistent state
change, no further wakeups will happen, and the multishot recv can hang
forever.
Check specifically for HUP in the poll loop, and ensure that another
loop is done to check for status if more than a single poll activation
is pending. This ensures we don't lose the shutdown event.
Cc: stable@vger.kernel.org
Fixes: dbc2564cfe ("io_uring: let fast poll support multishot")
Reported-by: Francis Brosseau <francis@malagauche.com>
Link: https://github.com/axboe/liburing/issues/1549
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmmzLoUQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpsCCEADLLNNmeOjRIK1+FkvMcJkxWhctw5SMN2eQ
eBDW3S3oBvy4eWhXrKilIABkvVHDlbAu/GR0aOqm28Dm4fg/ltRjKBODWlPOpSaG
3HauLNo62Kjgbr2jTLVbt6omrcytgwkxm3wvZQf0H4SM3fT0s7yKw1wlcu67+hyJ
/xM5zOyqYa+JjplgOMs2chPCKykwYTbJx66FTNxB61ZKo5xrodkY9sr0wdxsyuFV
ctdcwVtOvtGBrBeH+QUC1K2e47/FzqM4NvrA53svcH8RjZ4xd3E4THa4St3+AUoR
tzaXevHXndRN7tMUOIRhLpbr6Z+dkVPwR4w7NDTv2B9tzn4+RglhNfJDouUhW38e
pryAkVNJN0aNe3bTmoAV8ND/v+fJIaN5y3KFwz8gky6qK9bG5e50kzA4/tcuwEjF
e5m7uT4OPlDXmTlYAgpp6dL7htDsDUi8MEjoTNzyfr5pjhygWPtbOWAR5TLrLrof
E3dH3c8H5gDOM+cSdHQVP3OdjruuTAEQc4Ggevf5BxJEiXvmGSc27bMgg7b+vdNI
9AakmCVKL0O8rI/BkJ7QDVFQoRIq5nDe3VnZilX+MY/3MfzW+kBG/2OyW/zrfwmy
F1+bA3/LQA7OZnCQnLHP1XYI5zq48rxkWA9G1ODaOJ/smyrxBM5sfIFtpWx27dkr
ZWQXB+Biug==
=w1nw
-----END PGP SIGNATURE-----
Merge tag 'io_uring-7.0-20260312' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux
Pull io_uring fixes from Jens Axboe:
- Fix an inverted true/false comment on task_no_new_privs, from the
BPF filtering changes merged in this release
- Use the migration disabling way of running the BPF filters, as the
io_uring side doesn't do that already
- Fix an issue with ->rings stability under resize, both for local
task_work additions and for eventfd signaling
- Fix an issue with SQE mixed mode, where a bounds check wasn't correct
for having a 128b SQE
- Fix an issue where a legacy provided buffer group is changed to to
ring mapped one while legacy buffers from that group are in flight
* tag 'io_uring-7.0-20260312' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux:
io_uring/kbuf: check if target buffer list is still legacy on recycle
io_uring: fix physical SQE bounds check for SQE_MIXED 128-byte ops
io_uring/eventfd: use ctx->rings_rcu for flags checking
io_uring: ensure ctx->rings is stable for task work flags manipulation
io_uring/bpf_filter: use bpf_prog_run_pin_on_cpu() to prevent migration
io_uring/register: fix comment about task_no_new_privs
There's a gap between when the buffer was grabbed and when it
potentially gets recycled, where if the list is empty, someone could've
upgraded it to a ring provided type. This can happen if the request
is forced via io-wq. The legacy recycling is missing checking if the
buffer_list still exists, and if it's of the correct type. Add those
checks.
Cc: stable@vger.kernel.org
Fixes: c7fb19428d ("io_uring: add support for ring mapped supplied buffers")
Reported-by: Keenan Dong <keenanat2000@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
When IORING_SETUP_SQE_MIXED is used without IORING_SETUP_NO_SQARRAY,
the boundary check for 128-byte SQE operations in io_init_req()
validated the logical SQ head position rather than the physical SQE
index.
The existing check:
!(ctx->cached_sq_head & (ctx->sq_entries - 1))
ensures the logical position isn't at the end of the ring, which is
correct for NO_SQARRAY rings where physical == logical. However, when
sq_array is present, an unprivileged user can remap any logical
position to an arbitrary physical index via sq_array. Setting
sq_array[N] = sq_entries - 1 places a 128-byte operation at the last
physical SQE slot, causing the 128-byte memcpy in
io_uring_cmd_sqe_copy() to read 64 bytes past the end of the SQE
array.
Replace the cached_sq_head alignment check with a direct validation
of the physical SQE index, which correctly handles both sq_array and
NO_SQARRAY cases.
Fixes: 1cba30bf9f ("io_uring: add support for IORING_SETUP_SQE_MIXED")
Signed-off-by: Tom Ryan <ryan36005@gmail.com>
Link: https://patch.msgid.link/20260310052003.72871-1-ryan36005@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Similarly to what commit e78f7b70e837 did for local task work additions,
use ->rings_rcu under RCU rather than dereference ->rings directly. See
that commit for more details.
Cc: stable@vger.kernel.org
Fixes: 79cfe9e59c ("io_uring/register: add IORING_REGISTER_RESIZE_RINGS")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
If DEFER_TASKRUN | SETUP_TASKRUN is used and task work is added while
the ring is being resized, it's possible for the OR'ing of
IORING_SQ_TASKRUN to happen in the small window of swapping into the
new rings and the old rings being freed.
Prevent this by adding a 2nd ->rings pointer, ->rings_rcu, which is
protected by RCU. The task work flags manipulation is inside RCU
already, and if the resize ring freeing is done post an RCU synchronize,
then there's no need to add locking to the fast path of task work
additions.
Note: this is only done for DEFER_TASKRUN, as that's the only setup mode
that supports ring resizing. If this ever changes, then they too need to
use the io_ctx_mark_taskrun() helper.
Link: https://lore.kernel.org/io-uring/20260309062759.482210-1-naup96721@gmail.com/
Cc: stable@vger.kernel.org
Fixes: 79cfe9e59c ("io_uring/register: add IORING_REGISTER_RESIZE_RINGS")
Reported-by: Hao-Yu Yang <naup96721@gmail.com>
Suggested-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Since the caller, __io_uring_run_bpf_filters(), doesn't prevent
migration, it should use the migration disabling variant for running
the BPF program.
Fixes: d42eb05e60 ("io_uring: add support for BPF filtering for opcode restrictions")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The actual code is right, but the comment is the wrong way around.
Fixes: ed82f35b92 ("io_uring: allow registration of per-task restrictions")
Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmmqPTAQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpueIEACHF0uws+uZSEsy9LUyC9ha8+5YN9szIJ3K
QUGxa9pmCnQG5K50KpxyEYP6buaJDy1smJGgD2obkeGncC4w6xKK2kQTQV1U+C1C
YA+7B/3HLhz5AWS6GIbRy6VZ599I4evlF8W79dX8BTnF8Y1ddkSuUnKx//q0AoQZ
hr3foglcFlchy8JuQ2/MpxzfOouvNMdMmeUN4O+t8iXDrmePFYIOgrLcT+ObgC5D
SXWx2cc3hMJ35hcSzedMWEBFcXnkX9nfh8Hd/+uPRcKsIwS8kCo6z01GoT/BCPRA
jdrxAfoYSL16HPfq6GU52n6iCaRd+5NK+tt/ECCzTxGL32Hadrr+nxw4O7g3Q96u
07zeaqHSoTGUchtlqrGjALQLP2yxdACEjxMh3rfdStRv3x3bbbVVDdioVEzPukCr
EBA+AbqaaG3LIYXwcY+15zx5NrAfeBAP1RjLgoV0s2ch4ghEqvnZGY4NLBDkcQ2R
97tM9+OdecBrsnlQr5GBoDbwpqc2pDEqSjkYDwoXqvqXs0DrMRq2MQw1Hjjh7Z7G
FZx1KNTiLB/YQ0sSyMcUKnH+qBA0FxwN/C6dDnRjj4dH5YsoeG/GhsS3B00a+0yE
S3MKrsf+uN21OYLVPSTEN6qS+02ZvK6E/Aw7/fk2IV60JMeM5KvCccmxa53dKls8
iyEJ7nVLOg==
=xyKA
-----END PGP SIGNATURE-----
Merge tag 'io_uring-7.0-20260305' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux
Pull io_uring fixes from Jens Axboe:
- Fix a typo in the mock_file help text
- Fix a comment regarding IORING_SETUP_TASKRUN_FLAG in the
io_uring.h UAPI header
- Use READ_ONCE() for reading refill queue entries
- Reject SEND_VECTORIZED for fixed buffer sends, as it isn't
implemented. Currently this flag is silently ignored
This is in preparation for making these work, but first we
need a fixup so that older kernels will correctly reject them
- Ensure "0" means default for the rx page size
* tag 'io_uring-7.0-20260305' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux:
io_uring/zcrx: use READ_ONCE with user shared RQEs
io_uring/mock: Fix typo in help text
io_uring/net: reject SEND_VECTORIZED when unsupported
io_uring: correct comment for IORING_SETUP_TASKRUN_FLAG
io_uring/zcrx: don't set rx_page_size when not requested
Refill queue entries are shared with the user space, use READ_ONCE when
reading them.
Fixes: 34a3e60821 ("io_uring/zcrx: implement zerocopy receive pp memory provider");
Cc: stable@vger.kernel.org
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
IORING_SEND_VECTORIZED with registered buffers is not implemented but
could be. Don't silently ignore the flag in this case but reject it with
an error. It only affects sendzc as normal sends don't support
registered buffers.
Fixes: 6f02527729 ("io_uring/net: Allow to do vectorized send")
Cc: stable@vger.kernel.org
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The rx_buf_len parameter was recently added to the Rx zero-copy
implementation. The expectation is that when not set system will
maintain previous behavior and use the default buffer size (PAGE_SIZE).
This works correctly at the iouring level, but we don't preserve
the same "zero means default" semantics when registering the memory
provider on the netdev. mp_param.rx_page_size is unconditionally
set to PAGE_SIZE. This causes __net_mp_open_rxq() to check for
QCFG_RX_PAGE_SIZE support in the driver, and return -EOPNOTSUPP
for drivers that don't advertise it -- even though the user never
asked for large buffers.
Only set mp_param.rx_page_size when rx_buf_len was explicitly provided,
so that the default page size path works on all zcrx-capable drivers.
mlx5 and fbnic only support 4kB pages in the current release.
Fixes: 795663b4d1 ("io_uring/zcrx: implement large rx buffer support")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmmhxlEQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpot9EACmBXos2lXMkiFN5TcSOLKW9Lh7PNAuDUns
oflfPVkS/F17muPnQIHzn5wqHpbWijx2uH0ehUXNJi+6U7OBdvZyZvEfdslh2ewZ
TAPnTFDZZbvdPqnGw7MJ1fLqGOB0RLoMJ72GkwIPhV6SQqmfu6U89ppplyMxybLK
MPLOx3j8HK/pX/3uLEyOpZ6XIfZjGyGiiMj8lEN+UZ5a9b5G3W87LnGPTRitl6SL
j5QTC5abGVk0vOqEPjm6Qws1icU9MumNAqTBTGL12WVlDRw0bQfSsZQzwySLXQpc
pbT3CUurt3jgU873S8xPnbc0v5g/YMLuGv7morMRHJ13h4g1lTR3Q5u5+KX3RTBQ
/I+R6C0uptLity+mBymdf0ZSgFOG7J7vfI6MzlqjUXSUqCfF3QVqrw5GrSDiDbI7
oO+JjSExM01w6kf1dcxk1ROJxxNiNWPpuP958poDydpD9jnp+Mr6jHuN66Es0Ctd
fBvVMa1w2MzYopIeclN4KNvuPQV+HxITMt9RgNW+6iXo7Fwqy1SCyGbuOr6t1f4Y
SD8s6xVYw47OilvGyaFZrERIq0xe8SjoCaNeVgDk9sX8g2XCZL1Utop9FbkzgPWW
WzH6Y9by3UobHoCB2iehIbglaQtCKSwJou76Z7nXi15luZNL5MlQbQzYEaDcSlea
i1dBMOmi+w==
=lbcV
-----END PGP SIGNATURE-----
Merge tag 'io_uring-7.0-20260227' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux
Pull io_uring fixes from Jens Axboe:
"Just two minor patches in here, ensuring the use of READ_ONCE() for
sqe field reading is consistent across the codebase. There were two
missing cases, now they are covered too"
* tag 'io_uring-7.0-20260227' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux:
io_uring/timeout: READ_ONCE sqe->addr
io_uring/cmd_net: use READ_ONCE() for ->addr3 read
We should use READ_ONCE when reading from a SQE, make sure timeout gets
a stable timespec address.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Any SQE read should use READ_ONCE(), to ensure the result is read once
and only once. Doesn't really matter for this case, but it's better to
keep these 100% consistent and always use READ_ONCE() for the prep side
of SQE handling.
Fixes: 5d24321e4c ("io_uring: Introduce getsockname io_uring cmd")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This is the exact same thing as the 'alloc_obj()' version, only much
smaller because there are a lot fewer users of the *alloc_flex()
interface.
As with alloc_obj() version, this was done entirely with mindless brute
force, using the same script, except using 'flex' in the pattern rather
than 'objs*'.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This was done entirely with mindless brute force, using
git grep -l '\<k[vmz]*alloc_objs*(.*, GFP_KERNEL)' |
xargs sed -i 's/\(alloc_objs*(.*\), GFP_KERNEL)/\1)/'
to convert the new alloc_obj() users that had a simple GFP_KERNEL
argument to just drop that argument.
Note that due to the extreme simplicity of the scripting, any slightly
more complex cases spread over multiple lines would not be triggered:
they definitely exist, but this covers the vast bulk of the cases, and
the resulting diff is also then easier to check automatically.
For the same reason the 'flex' versions will be done as a separate
conversion.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRSPkdeREjth1dHnSE2KwveOeQkuwUCaZl14wAKCRA2KwveOeQk
uz8aAQCBFLYlij3Y3ivVADkBxuVF3xECaznFya41ENYsBwlHdwEArXqMyNrw+DiG
TvWCK/tiddNmGIRpI2sxBFzyRpsHfAY=
=rVD3
-----END PGP SIGNATURE-----
Merge tag 'kmalloc_obj-treewide-v7.0-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux
Pull kmalloc_obj conversion from Kees Cook:
"This does the tree-wide conversion to kmalloc_obj() and friends using
coccinelle, with a subsequent small manual cleanup of whitespace
alignment that coccinelle does not handle.
This uncovered a clang bug in __builtin_counted_by_ref(), so the
conversion is preceded by disabling that for current versions of
clang. The imminent clang 22.1 release has the fix.
I've done allmodconfig build tests for x86_64, arm64, i386, and arm. I
did defconfig builds for alpha, m68k, mips, parisc, powerpc, riscv,
s390, sparc, sh, arc, csky, xtensa, hexagon, and openrisc"
* tag 'kmalloc_obj-treewide-v7.0-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
kmalloc_obj: Clean up after treewide replacements
treewide: Replace kmalloc with kmalloc_obj for non-scalar types
compiler_types: Disable __builtin_counted_by_ref for Clang
This is the result of running the Coccinelle script from
scripts/coccinelle/api/kmalloc_objs.cocci. The script is designed to
avoid scalar types (which need careful case-by-case checking), and
instead replace kmalloc-family calls that allocate struct or union
object instances:
Single allocations: kmalloc(sizeof(TYPE), ...)
are replaced with: kmalloc_obj(TYPE, ...)
Array allocations: kmalloc_array(COUNT, sizeof(TYPE), ...)
are replaced with: kmalloc_objs(TYPE, COUNT, ...)
Flex array allocations: kmalloc(struct_size(PTR, FAM, COUNT), ...)
are replaced with: kmalloc_flex(*PTR, FAM, COUNT, ...)
(where TYPE may also be *VAR)
The resulting allocations no longer return "void *", instead returning
"TYPE *".
Signed-off-by: Kees Cook <kees@kernel.org>
io_should_commit(), io_uring_classic_poll(), and io_do_iopoll() compare
struct io_kiocb's opcode against IORING_OP_URING_CMD to implement
special treatment for uring_cmds. The recently added opcode
IORING_OP_URING_CMD128 is meant to be equivalent to IORING_OP_URING_CMD,
so treat it the same way in these functions.
Fixes: 1cba30bf9f ("io_uring: add support for IORING_SETUP_SQE_MIXED")
Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Reviewed-by: Anuj Gupta <anuj20.g@samsung.com>
Reviewed-by: Kanchan Joshi <joshi.k@samsung.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The io_zcrx_put_niov_uref() function uses a non-atomic
check-then-decrement pattern (atomic_read followed by separate
atomic_dec) to manipulate user_refs. This is serialized against other
callers by rq_lock, but io_zcrx_scrub() modifies the same counter with
atomic_xchg() WITHOUT holding rq_lock.
On SMP systems, the following race exists:
CPU0 (refill, holds rq_lock) CPU1 (scrub, no rq_lock)
put_niov_uref:
atomic_read(uref) - 1
// window opens
atomic_xchg(uref, 0) - 1
return_niov_freelist(niov) [PUSH #1]
// window closes
atomic_dec(uref) - wraps to -1
returns true
return_niov(niov)
return_niov_freelist(niov) [PUSH #2: DOUBLE-FREE]
The same niov is pushed to the freelist twice, causing free_count to
exceed nr_iovs. Subsequent freelist pushes then perform an out-of-bounds
write (a u32 value) past the kvmalloc'd freelist array into the adjacent
slab object.
Fix this by replacing the non-atomic read-then-dec in
io_zcrx_put_niov_uref() with an atomic_try_cmpxchg loop that atomically
tests and decrements user_refs. This makes the operation safe against
concurrent atomic_xchg from scrub without requiring scrub to acquire
rq_lock.
Fixes: 34a3e60821 ("io_uring/zcrx: implement zerocopy receive pp memory provider")
Cc: stable@vger.kernel.org
Signed-off-by: Kai Aizen <kai@snailsploit.com>
[pavel: removed a warning and a comment]
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmmTrLsQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgplYaEACgWcIcGa9/nWq1x02uN7Zi9vHWpDJqgEhq
JCLpMLdn3ZG6Ksn8RAfI4dKAKZKS7MuXDrpoXgchQ8LQjpssN6kTj2TlKdZR8Je3
NNWfkPnLUp/t3MN/V0vZiX5NQaJVCNblbcnauDzlN+6WkWku5p1wkwYwy3I7NPJ4
P7HHqFJAOwhyBpk/Nr3sQEDnKIn/vOiedyOuO+3HB6rlmnSmjY1cQ+FUSaOI+rNQ
D3i9TMEojHYhMDt76ql2YdKcksBu6HaZQ6JNpIiN9iqNB+96e+X2bcLPyfwkuHwC
N7G1IMfyTsuV7JWktcZP+AT8WK4Qf45fuUN/1EkKEL9MWF2TUMob8toQ0GXRCb22
NqSC1JyeVJ/sSnKzb2Z4wY4+BgRMo83ME3l6hi6QckWXfFyTAQe70JyUnu4w11qn
62astpZXVRSfvbH3vT76BWTa+5HUZExQgLRgor19BTeVY4ihh+muaoMH6An6jf6i
ZnqUSsn7nFB20MEudVqhgiKTvqVic2Atsl6JD4wjwWs5nEP9wzmmCSEGd3Nkrrji
HPWN4zu+1qczDZxmCJAj3w29cRO/vZCNpFARlSCMcXNOQsZaFWVaaQlzt26ZMhTi
AyMav25X8fNCERvGP++uo7cKzDGCuhhIR6y5GlXZ6yTHsGTcSgooW/NNz6Ik2jUW
Bwa5GBK36A==
=TgoD
-----END PGP SIGNATURE-----
Merge tag 'io_uring-7.0-20260216' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux
Pull more io_uring updates from Jens Axboe:
"This is a mix of cleanups and fixes. No major fixes in here, just a
bunch of little fixes. Some of them marked for stable as it fixes
behavioral issues
- Fix an issue with SOCKET_URING_OP_SETSOCKOPT for netlink sockets,
due to a too restrictive check on it having an ioctl handler
- Remove a redundant SQPOLL check in ring creation
- Kill dead accounting for zero-copy send, which doesn't use ->buf
or ->len post the initial setup
- Fix missing clamp of the allocation hint, which could cause
allocations to fall outside of the range the application asked
for. Still within the allowed limits.
- Fix for IORING_OP_PIPE's handling of direct descriptors
- Tweak to the API for the newly added BPF filters, making them
more future proof in terms of how applications deal with them
- A few fixes for zcrx, fixing a few error handling conditions
- Fix for zcrx request flag checking
- Add support for querying the zcrx page size
- Improve the NO_SQARRAY static branch inc/dec, avoiding busy
conditions causing too much traffic
- Various little cleanups"
* tag 'io_uring-7.0-20260216' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux:
io_uring/bpf_filter: pass in expected filter payload size
io_uring/bpf_filter: move filter size and populate helper into struct
io_uring/cancel: de-unionize file and user_data in struct io_cancel_data
io_uring/rsrc: improve regbuf iov validation
io_uring: remove unneeded io_send_zc accounting
io_uring/cmd_net: fix too strict requirement on ioctl
io_uring: delay sqarray static branch disablement
io_uring/query: add query.h copyright notice
io_uring/query: return support for custom rx page size
io_uring/zcrx: check unsupported flags on import
io_uring/zcrx: fix post open error handling
io_uring/zcrx: fix sgtable leak on mapping failures
io_uring: use the right type for creds iteration
io_uring/openclose: fix io_pipe_fixed() slot tracking for specific slots
io_uring/filetable: clamp alloc_hint to the configured alloc range
io_uring/rsrc: replace reg buffer bit field with flags
io_uring/zcrx: improve types for size calculation
io_uring/tctx: avoid modifying loop variable in io_ring_add_registered_file
io_uring: simplify IORING_SETUP_DEFER_TASKRUN && !SQPOLL check
It's quite possible that opcodes that have payloads attached to them,
like IORING_OP_OPENAT/OPENAT2 or IORING_OP_SOCKET, that these paylods
can change over time. For example, on the openat/openat2 side, the
struct open_how argument is extensible, and could be extended in the
future to allow further arguments to be passed in.
Allow registration of a cBPF filter to give the size of the filter as
seen by userspace. If that filter is for an opcode that takes extra
payload data, allow it if the application payload expectation is the
same size than the kernels. If that is the case, the kernel supports
filtering on the payload that the application expects. If the size
differs, the behavior depends on the IO_URING_BPF_FILTER_SZ_STRICT flag:
1) If IO_URING_BPF_FILTER_SZ_STRICT is set and the size expectation
differs, fail the attempt to load the filter.
2) If IO_URING_BPF_FILTER_SZ_STRICT isn't set, allow the filter if
the userspace pdu size is smaller than what the kernel offers.
3) Regardless if IO_URING_BPF_FILTER_SZ_STRICT, fail loading the filter
if the userspace pdu size is bigger than what the kernel supports.
An attempt to load a filter due to sizing will error with -EMSGSIZE.
For that error, the registration struct will have filter->pdu_size
populated with the pdu size that the kernel uses.
Reported-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Rather than open-code this logic in io_uring_populate_bpf_ctx() with
a switch, move it to the issue side definitions. Outside of making this
easier to extend in the future, it's also a prep patch for using the
pdu size for a given opcode filter elsewhere.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
By having them share the same space in struct io_cancel_data, it ends up
disallowing IORING_ASYNC_CANCEL_FD|IORING_ASYNC_CANCEL_USERDATA from
working. Eg you cannot match on both a file and user_data for
cancelation purposes. This obviously isn't a common use case as nobody
has reported this, but it does result in -ENOENT potentially being
returned when trying to match on both, rather than actually doing what
the API says it would.
Fixes: 4bf94615b8 ("io_uring: allow IORING_OP_ASYNC_CANCEL with 'fd' key")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Deduplicate io_buffer_validate() calls by moving the checks into
io_sqe_buffer_register(). Now we also don't need special handling in
io_buffer_validate() passing through buffer removal requests. I also
was using it as a cleanup before some other changes.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
zc->len and zc->buf are not actually used once you get to the retry
stage. The buffer remains in kmsg->msg.msg_iter, which is setup in
io_send_setup.
Note: it still seems needed in io_send due to io_send_select_buffer
needing it (for the len parameter).
Signed-off-by: Dylan Yudaken <dyudaken@gmail.com>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Attempting SOCKET_URING_OP_SETSOCKOPT on an AF_NETLINK socket resulted
in an -EOPNOTSUPP, as AF_NETLINK doesn't have an ioctl in its struct
proto, but only in struct proto_ops.
Prior to the blamed commit, io_uring_cmd_sock() only had two cmd_op
operations, both requiring ioctl, thus the check was warranted.
Since then, 4 new cmd_op operations have been added, none of which
depend on ioctl. This patch moves the ioctl check, so it only applies
to the original operations.
AFAICT, the ioctl requirement was unintentional, and it wasn't
visible in the blamed patch within 3 lines of context.
Cc: stable@vger.kernel.org
Fixes: a5d2f99aff ("io_uring/cmd: Introduce SOCKET_URING_OP_GETSOCKOPT")
Signed-off-by: Asbjørn Sloth Tønnesen <ast@fiberby.net>
Reviewed-by: Gabriel Krisman Bertazi <krisman@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
io_key_has_sqarray static branch can be easily switched on/off by the
user every time patching the kernel. That can be very disruptive as it
might require heavy synchronisation across all CPUs. Use deferred static
keys, which can rate-limit it by deferring, batching and potentially
effectively eliminating dec+inc pairs.
Fixes: 9b296c625a ("io_uring: static_key for !IORING_SETUP_NO_SQARRAY")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Add an ability to query if the zcrx rx page size setting is available.
Note, even when the API is supported by io_uring, the registration can
still get rejected for various reasons, e.g. when the NIC or the driver
doesn't support it, when the particular specified size is unsupported,
when the memory area doesn't satisfy all requirements, etc.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The imoorted zcrx registration path checks for ZCRX_REG_IMPORT, as it
should, but doesn't reject any unsupported flags. Fix that.
Cc: stable@vger.kernel.org
Fixes: 00d9148127 ("io_uring/zcrx: share an ifq between rings")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Closing a queue doesn't guarantee that all associated page pools are
terminated right away, let the refcounting do the work instead of
releasing the zcrx ctx directly.
Cc: stable@vger.kernel.org
Fixes: e0793de24a ("io_uring/zcrx: set pp memory provider for an rx queue")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
In an unlikely case when io_populate_area_dma() fails, which could only
happen on a PAGE_POOL_32BIT_ARCH_WITH_64BIT_DMA machine,
io_zcrx_map_area() will have an initialised and not freed table. It was
supposed to be cleaned up in the error path, but !is_mapped prevents
that.
Fixes: 439a98b972 ("io_uring/zcrx: deduplicate area mapping")
Cc: stable@vger.kernel.org
Reported-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmmGJ4AQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpnL0D/9rEC0FGoGM3473v2SUPEkcpSIIab3KMZl5
YW/GcPeE8i/jMTfa/QJjTpp24ke1m+AuMYmEUm+KDHNXNQTjJQVnSFRPxER5dLJl
goarFuk5JVDdzovL8QZuYGQHY3o1G54QLuBpVypBBqZdOQFTk8UwOdLFk4JrHMM2
PkUBUf9lZq9KiH6jdwn3v4qpZsZq93IubZ2dSncDcuZ3l2FbiZG88C3pp7wd+w1Z
VM+xTLkqS3OhXiLVfbmRVM//3PgQZU4bO6k+vjWhlztC+5+ELcuiPyN6nd++6lLw
LHH55T/xzko+TVc7kW7NTZ79MjBrjFKSNq4M/LV5SXSEpUAlqMoEakXVupVrBv+Z
gheHHHas3t8FWKwIkjEH6iV1TyGOxZzxaiQ8MurzQ5v3FC2PdfH+Uqisf56LTNKY
yAI0Ka8JWBv0sImgGB5J5ZzMP+xPxp4oyLEqKGLkUSi2OGtOeSHXwEiNbbJuTJBQ
Q+xGdTXukG+zlMnTIcq0fGFaT0hXYs8a6ZiF7vyaYdW9R6/WdfDlS/YN+gM+JgrD
EH2k29NE4kZi6cPFayfmgnfh10leiyMYt020b0GR4VpCHtthz7k+oasLB400qGrs
X6wad+Y1YfdEpD1SW447ZrM8JmcOVrLCj/yIodyV30Kw0v3QSjKLcp2LoxwvPpRi
Crs4Sb74Lg==
=d42R
-----END PGP SIGNATURE-----
Merge tag 'for-7.0/io_uring-zcrx-large-buffers-20260206' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux
Pull io_uring large rx buffer support from Jens Axboe:
"Now that the networking updates are upstream, here's the support for
large buffers for zcrx.
Using larger (bigger than 4K) rx buffers can increase the effiency of
zcrx. For example, it's been shown that using 32K buffers can decrease
CPU usage by ~30% compared to 4K buffers"
* tag 'for-7.0/io_uring-zcrx-large-buffers-20260206' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux:
io_uring/zcrx: implement large rx buffer support
In io_ring_ctx_wait_and_kill(), struct creds *creds is used to
iterate and prune credentials. But the correct type is struct cred.
This doesn't matter as the variable isn't used at all, only the index
is used. But it's confusing using a type that isn't valid, so fix it
up.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
__io_fixed_fd_install() returns 0 on success for non-alloc mode
(specific slot), not the slot index. io_pipe_fixed() used this return
value directly as the slot index in fds[], which can cause the reported
values returned via copy_to_user() to be incorrect, or the error path
operating on the incorrect direct descriptor.
Fix by computing the actual 0-based slot index (slot - 1) for specific
slot mode, while preserving the existing behavior for auto-alloc mode
where __io_fixed_fd_install() already returns the allocated index.
Cc: stable@vger.kernel.org
Fixes: 53db8a71ec ("io_uring: add support for IORING_OP_PIPE")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Explicit fixed file install/remove operations on slots outside the
configured alloc range can corrupt alloc_hint via io_file_bitmap_set()
and io_file_bitmap_clear(), which unconditionally update alloc_hint to
the bit position. This causes subsequent auto-allocations to fall
outside the configured range.
For example, if the alloc range is [10, 20) and a file is removed at
slot 2, alloc_hint gets set to 2. The next auto-alloc then starts
searching from slot 2, potentially returning a slot below the range.
Fix this by clamping alloc_hint to [file_alloc_start, file_alloc_end)
at the top of io_file_bitmap_get() before starting the search.
Cc: stable@vger.kernel.org
Fixes: 6e73dffbb9 ("io_uring: let to set a range for file slot allocation")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
I'll need a flag in the registered buffer struct for dmabuf work, and
it'll be more convenient to have a flags field rather than bit fields,
especially for io_mapped_ubuf initialisation.
We might want to add more flags in the future as well. For example, it
might be useful for debugging and potentially optimisations to split out
a flag indicating the shape of the buffer to gate iov_iter_advance()
walks vs bit/mask arithmetics. It can also be combined with the
direction mask field.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Make sure io_import_umem() promotes the type to long before calculating
the area size. While the area size is capped at 1GB by
io_validate_user_buf_range() and fits into an "int", it's still too
error prone.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>