Commit Graph

8871 Commits

Author SHA1 Message Date
Ming Lei
e4c4bfec2b ublk: fix canceling flag handling in batch I/O recovery
Two issues with ubq->canceling flag handling:

1) In ublk_queue_reset_io_flags(), ubq->canceling is set outside
   cancel_lock, violating the locking requirement. Move it inside
   the spinlock-protected section.

2) In ublk_batch_unprep_io(), when rolling back after a batch prep
   failure, if the queue became ready during prep (which cleared
   canceling), the flag is not restored when the queue becomes
   not-ready again. This allows new requests to be queued to
   uninitialized IO slots.

Fix by restoring ubq->canceling = true under cancel_lock when the
queue transitions from ready to not-ready during rollback.

Reported-by: Jens Axboe <axboe@kernel.dk>
Fixes: 3f38507855 ("ublk: fix batch I/O recovery -ENODEV error")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-01-23 05:11:03 -07:00
Ming Lei
dbc635c4be ublk: move ublk_mark_io_ready() out of __ublk_fetch()
ublk_batch_prep_io() calls __ublk_fetch() while holding io->lock
spinlock. When the last IO makes the device ready, ublk_mark_io_ready()
tries to acquire ub->cancel_mutex which can sleep, causing a
sleeping-while-atomic bug.

Fix by moving ublk_mark_io_ready() out of __ublk_fetch() and into the
callers (ublk_fetch and ublk_batch_prep_io) after the spinlock is
released.

Reported-by: Jens Axboe <axboe@kernel.dk>
Fixes: b256795b36 ("ublk: handle UBLK_U_IO_PREP_IO_CMDS")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-01-23 05:11:02 -07:00
Ming Lei
3f38507855 ublk: fix batch I/O recovery -ENODEV error
During recovery with batch I/O, UBLK_U_IO_FETCH_IO_CMDS command fails with
-ENODEV because ublk_batch_attach() rejects them when ubq->canceling is set.
The canceling flag remains set until all queues are ready.

Fix this by tracking per-queue readiness and clearing ubq->canceling as
soon as each individual queue becomes ready, rather than waiting for all
queues. This allows subsequent UBLK_U_IO_FETCH_IO_CMDS commands to succeed
during recovery.

Changes:
- Add ubq->nr_io_ready to track I/Os ready per queue
- Add ub->nr_queue_ready to track number of ready queues
- Add ublk_queue_ready() helper to check queue readiness
- Redefine ublk_dev_ready() based on queue count instead of I/O count
- Clear ubq->canceling immediately when queue becomes ready
- Add ublk_queue_reset_io_flags() to reset per-queue flags

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-01-22 20:05:41 -07:00
Ming Lei
7aa78d4a3c ublk: implement batch request completion via blk_mq_end_request_batch()
Reduce overhead when completing multiple requests in batch I/O mode by
accumulating them in an io_comp_batch structure and completing them
together via blk_mq_end_request_batch(). This minimizes per-request
completion overhead and improves performance for high IOPS workloads.

The implementation adds an io_comp_batch pointer to struct ublk_io and
initializes it in __ublk_fetch(). For batch I/O, the pointer is set to
the batch structure in ublk_batch_commit_io(). The __ublk_complete_rq()
function uses io->iob to call blk_mq_add_to_batch() for batch mode.
After processing all batch I/Os, the completion callback is invoked in
ublk_handle_batch_commit_cmd() to complete all accumulated requests
efficiently.

So far just covers direct completion. For deferred completion(zero copy,
auto buffer reg), ublk_io_release() is often delayed in freeing buffer
consumer io_uring request's code path, so this patch often doesn't work,
also it is hard to pass the per-task 'struct io_comp_batch' for deferred
completion.

Reviewed-by: Caleb Sander Mateos <csander@purestorage.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-01-22 20:05:40 -07:00
Ming Lei
e2723e6ce6 ublk: add new feature UBLK_F_BATCH_IO
Add new feature UBLK_F_BATCH_IO which replaces the following two
per-io commands:

	- UBLK_U_IO_FETCH_REQ

	- UBLK_U_IO_COMMIT_AND_FETCH_REQ

with three per-queue batch io uring_cmd:

	- UBLK_U_IO_PREP_IO_CMDS

	- UBLK_U_IO_COMMIT_IO_CMDS

	- UBLK_U_IO_FETCH_IO_CMDS

Then ublk can deliver batch io commands to ublk server in single
multishort uring_cmd, also allows to prepare & commit multiple
commands in batch style via single uring_cmd, communication cost is
reduced a lot.

This feature also doesn't limit task context any more for all supported
commands, so any allowed uring_cmd can be issued in any task context.
ublk server implementation becomes much easier.

Meantime load balance becomes much easier to support with this feature.
The command `UBLK_U_IO_FETCH_IO_CMDS` can be issued from multiple task
contexts, so each task can adjust this command's buffer length or number
of inflight commands for controlling how much load is handled by current
task.

Later, priority parameter will be added to command `UBLK_U_IO_FETCH_IO_CMDS`
for improving load balance support.

UBLK_U_IO_NEED_GET_DATA isn't supported in batch io yet, but it may be
enabled in future via its batch pair.

Reviewed-by: Caleb Sander Mateos <csander@purestorage.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-01-22 20:05:40 -07:00
Ming Lei
29d0a927f9 ublk: abort requests filled in event kfifo
In case of BATCH_IO, any request filled in event kfifo, they don't get
chance to be dispatched any more when releasing ublk char device, so
we have to abort them too.

Add ublk_abort_batch_queue() for aborting this kind of requests.

Reviewed-by: Caleb Sander Mateos <csander@purestorage.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-01-22 20:05:40 -07:00
Ming Lei
3ac4796b88 ublk: refactor ublk_queue_rq() and add ublk_batch_queue_rq()
Extract common request preparation and cancellation logic into
__ublk_queue_rq_common() helper function. Add dedicated
ublk_batch_queue_rq() for batch mode operations to eliminate runtime check
in ublk_queue_rq().

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Caleb Sander Mateos <csander@purestorage.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-01-22 20:05:40 -07:00
Ming Lei
a4d8837553 ublk: add UBLK_U_IO_FETCH_IO_CMDS for batch I/O processing
Add UBLK_U_IO_FETCH_IO_CMDS command to enable efficient batch processing
of I/O requests. This multishot uring_cmd allows the ublk server to fetch
multiple I/O commands in a single operation, significantly reducing
submission overhead compared to individual FETCH_REQ* commands.

Key Design Features:

1. Multishot Operation: One UBLK_U_IO_FETCH_IO_CMDS can fetch many I/O
   commands, with the batch size limited by the provided buffer length.

2. Dynamic Load Balancing: Multiple fetch commands can be submitted
   simultaneously, but only one is active at any time. This enables
   efficient load distribution across multiple server task contexts.

3. Implicit State Management: The implementation uses three key variables
   to track state:
   - evts_fifo: Queue of request tags awaiting processing
   - fcmd_head: List of available fetch commands
   - active_fcmd: Currently active fetch command (NULL = none active)

   States are derived implicitly:
   - IDLE: No fetch commands available
   - READY: Fetch commands available, none active
   - ACTIVE: One fetch command processing events

4. Lockless Reader Optimization: The active fetch command can read from
   evts_fifo without locking (single reader guarantee), while writers
   (ublk_queue_rq/ublk_queue_rqs) use evts_lock protection. The memory
   barrier pairing plays key role for the single lockless reader
   optimization.

Implementation Details:

- ublk_queue_rq() and ublk_queue_rqs() save request tags to evts_fifo
- __ublk_acquire_fcmd() selects an available fetch command when
  events arrive and no command is currently active
- ublk_batch_dispatch() moves tags from evts_fifo to the fetch command's
  buffer and posts completion via io_uring_mshot_cmd_post_cqe()
- State transitions are coordinated via evts_lock to maintain consistency

Reviewed-by: Caleb Sander Mateos <csander@purestorage.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-01-22 20:05:40 -07:00
Ming Lei
7a1bb41947 ublk: add batch I/O dispatch infrastructure
Add infrastructure for delivering I/O commands to ublk server in batches,
preparing for the upcoming UBLK_U_IO_FETCH_IO_CMDS feature.

Key components:

- struct ublk_batch_fetch_cmd: Represents a batch fetch uring_cmd that will
  receive multiple I/O tags in a single operation, using io_uring's
  multishot command for efficient ublk IO delivery.

- ublk_batch_dispatch(): Batch version of ublk_dispatch_req() that:
  * Pulls multiple request tags from the events FIFO (lock-free reader)
  * Prepares each I/O for delivery (including auto buffer registration)
  * Delivers tags to userspace via single uring_cmd notification
  * Handles partial failures by restoring undelivered tags to FIFO

The batch approach significantly reduces notification overhead by aggregating
multiple I/O completions into single uring_cmd, while maintaining the same
I/O processing semantics as individual operations.

Error handling ensures system consistency: if buffer selection or CQE
posting fails, undelivered tags are restored to the FIFO for retry,
meantime IO state has to be restored.

This runs in task work context, scheduled via io_uring_cmd_complete_in_task()
or called directly from ->uring_cmd(), enabling efficient batch processing
without blocking the I/O submission path.

Reviewed-by: Caleb Sander Mateos <csander@purestorage.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-01-22 20:05:40 -07:00
Ming Lei
f1f99ddf60 ublk: add io events fifo structure
Add ublk io events fifo structure and prepare for supporting command
batch, which will use io_uring multishot uring_cmd for fetching one
batch of io commands each time.

One nice feature of kfifo is to allow multiple producer vs single
consumer. We just need lock the producer side, meantime the single
consumer can be lockless.

The producer is actually from ublk_queue_rq() or ublk_queue_rqs(), so
lock contention can be eased by setting proper blk-mq nr_queues.

Reviewed-by: Caleb Sander Mateos <csander@purestorage.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-01-22 20:05:40 -07:00
Ming Lei
1e500e106d ublk: handle UBLK_U_IO_COMMIT_IO_CMDS
Handle UBLK_U_IO_COMMIT_IO_CMDS by walking the uring_cmd fixed buffer:

- read each element into one temp buffer in batch style

- parse and apply each element for committing io result

Reviewed-by: Caleb Sander Mateos <csander@purestorage.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-01-22 20:05:40 -07:00
Ming Lei
b256795b36 ublk: handle UBLK_U_IO_PREP_IO_CMDS
This commit implements the handling of the UBLK_U_IO_PREP_IO_CMDS command,
which allows userspace to prepare a batch of I/O requests.

The core of this change is the `ublk_walk_cmd_buf` function, which iterates
over the elements in the uring_cmd fixed buffer. For each element, it parses
the I/O details, finds the corresponding `ublk_io` structure, and prepares it
for future dispatch.

Add per-io lock for protecting concurrent delivery and committing.

Reviewed-by: Caleb Sander Mateos <csander@purestorage.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-01-22 20:05:40 -07:00
Ming Lei
e86f89ab24 ublk: add new batch command UBLK_U_IO_PREP_IO_CMDS & UBLK_U_IO_COMMIT_IO_CMDS
Add new command UBLK_U_IO_PREP_IO_CMDS, which is the batch version of
UBLK_IO_FETCH_REQ.

Add new command UBLK_U_IO_COMMIT_IO_CMDS, which is for committing io command
result only, still the batch version.

The new command header type is `struct ublk_batch_io`.

This patch doesn't actually implement these commands yet, just validates the
SQE fields.

Reviewed-by: Caleb Sander Mateos <csander@purestorage.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-01-22 20:05:40 -07:00
Ming Lei
7ba62f5969 ublk: prepare for not tracking task context for command batch
batch io is designed to be independent of task context, and we will not
track task context for batch io feature.

Reviewed-by: Caleb Sander Mateos <csander@purestorage.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-01-22 20:05:40 -07:00
Ming Lei
fb027d5694 ublk: define ublk_ch_batch_io_fops for the coming feature F_BATCH_IO
Introduces the basic structure for a batched I/O feature in the ublk driver.
It adds placeholder functions and a new file operations structure,
ublk_ch_batch_io_fops, which will be used for fetching and committing I/O
commands in batches. Currently, the feature is disabled.

Reviewed-by: Caleb Sander Mateos <csander@purestorage.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-01-22 20:05:40 -07:00
Seamus Connor
47bdf1d29c ublk: fix ublksrv pid handling for pid namespaces
When ublksrv runs inside a pid namespace, START/END_RECOVERY compared
the stored init-ns tgid against the userspace pid (getpid vnr), so the
check failed and control ops could not proceed. Compare against the
caller’s init-ns tgid and store that value, then translate it back to
the caller’s pid namespace when reporting GET_DEV_INFO so ublk list
shows a sensible pid.

Testing: start/recover in a pid namespace; `ublk list` shows
reasonable pid values in init, child, and sibling namespaces.

Fixes: c2c8089f32 ("ublk: validate ublk server pid")
Signed-off-by: Seamus Connor <sconnor@purestorage.com>
Reviewed-by: Caleb Sander Mateos <csander@purestorage.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-01-21 07:46:14 -07:00
Alejandro Colomar
436debc9ca array_size.h: add ARRAY_END()
Patch series "Add ARRAY_END(), and use it to fix off-by-one bugs", v6.

Add ARRAY_END(), and use it to fix off-by-one bugs

ARRAY_END() is a macro to calculate a pointer to one past the last element
of an array argument.  This is a very common pointer, which is used to
iterate over all elements of an array:

        for (T *p = a; p < ARRAY_END(a); p++)
                ...

Of course, this pointer should never be dereferenced.  A pointer one past
the last element of an array should not be dereferenced; it's perfectly
fine to hold such a pointer --and a good thing to do--, but the only thing
it should be used for is comparing it with other pointers derived from the
same array.

Due to how special these pointers are, it would be good to use consistent
naming.  It's common to name such a pointer 'end' --in fact, we have many
such cases in the kernel--.  C++ even standardized this name with
std::end().  Let's try naming such pointers 'end', and try also avoid
using 'end' for pointers that are not the result of ARRAY_END().

It has been incorrectly suggested that these pointers are dangerous, and
that they should never be used, suggesting to use something like

	#define ARRAY_LAST(a)  ((a) + ARRAY_SIZE(a) - 1)

	for (T *p = a; p <= ARRAY_LAST(a); p++)
		...

This is bogus, as it doesn't scale down to arrays of 0 elements.  In the
case of an array of 0 elements, ARRAY_LAST() would underflow the pointer,
which not only it can't be dereferenced, it can't even be held (it
produces Undefined Behavior).  That would be a footgun.  Such arrays don't
exist per the ISO C standard; however, GCC supports them as an extension
(with partial support, though; GCC has a few bugs which need to be fixed).

This patch set fixes a few places where it was intended to use the array
end (that is, one past the last element), but accidentally a pointer to
the last element was used instead, thus wasting one byte.

It also replaces other places where the array end was correctly calculated
with ARRAY_SIZE(), by using the simpler ARRAY_END().

Also, there was one drivers/ file that already defined this macro.  We
remove that definition, to not conflict with this one.


This patch (of 4):

ARRAY_END() returns a pointer one past the end of the last element in the
array argument.  This pointer is useful for iterating over the elements of
an array:

	for (T *p = a, p < ARRAY_END(a); p++)
		...

Link: https://lkml.kernel.org/r/cover.1765449750.git.alx@kernel.org
Link: https://lkml.kernel.org/r/5973cfb674192bc8e533485dbfb54e3062896be1.1765449750.git.alx@kernel.org
Signed-off-by: Alejandro Colomar <alx@kernel.org>
Cc: Kees Cook <kees@kernel.org>
Cc: Christopher Bazley <chris.bazley.wg14@gmail.com>
Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Cc: Marco Elver <elver@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Alexander Potapenko <glider@google.com>
Cc: Dmitriy Vyukov <dvyukov@google.com>
Cc: Jann Horn <jannh@google.com>
Cc: Maciej W. Rozycki <macro@orcam.me.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-20 19:44:19 -08:00
Heiko Carstens
a9853ac1c3 zram: remove KMSG_COMPONENT macro
The KMSG_COMPONENT macro is a leftover of the s390 specific "kernel
message catalog" from 2008 [1] which never made it upstream.

The macro was added to s390 code to allow for an out-of-tree patch which
used this to generate unique message ids.  Also this out-of-tree doesn't
exist anymore.

The pattern of how the KMSG_COMPONENT is used was partially also used for
non s390 specific code, for whatever reasons.

Remove the macro in order to get rid of a pointless indirection.

Link: https://lkml.kernel.org/r/20251126143602.2207435-1-hca@linux.ibm.com
Link: https://lwn.net/Articles/292650/ [1]
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-20 19:24:48 -08:00
Sergey Senozhatsky
657a81fe3b zram: drop pp_in_progress
pp_in_progress makes sure that only one post-processing (writeback or
recomrpession) is active at any given time.  Functionality wise it,
basically, shadows zram init_lock, when init_lock is acquired in writer
mode.

Switch recompress_store() and writeback_store() to take zram init_lock in
writer mode, like all store() sysfs handlers should do, so that we can
drop pp_in_progress.  Recompression and writeback can be somewhat slow, so
holding init_lock in writer mode can block zram attrs reads, but in
reality the only zram attrs reads that take place are mm_stat reads, and
usually it's the same process that reads mm_stat and does recompression or
writeback.

Link: https://lkml.kernel.org/r/20251216071342.687993-1-senozhatsky@chromium.org
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Suggested-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Brian Geffon <bgeffon@google.com>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-20 19:24:46 -08:00
Sergey Senozhatsky
8b05d2d8af zram: fixup read_block_state()
ac_time is now in seconds, do not use ktime_to_timespec64()

[akpm@linux-foundation.org: remove now-unused local `ts']
[akpm@linux-foundation.org: fix build]
Link: https://lkml.kernel.org/r/20260115033031.3818977-1-senozhatsky@chromium.org
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Reported-by: Chris Mason <clm@meta.com>
Closes: https://lkml.kernel.org/r/20260114124522.1326519-1-clm@meta.com
Cc: Brian Geffon <bgeffon@google.com>
Cc: David Stevens <stevensd@google.com>
Cc: Minchan Kim <minchan@google.com>
Cc: Richard Chang <richardycc@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-20 19:24:38 -08:00
Sergey Senozhatsky
4932844eb8 zram: trivial fix of recompress_slot() coding styles
A minor fixup of 80-cols breakage in recompress_slot() comment and
zs_malloc() call.

Link: https://lkml.kernel.org/r/ff3254847dbdc6fbd2e3fed53c572a261d60b7b6.1765775954.git.senozhatsky@chromium.org
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Brian Geffon <bgeffon@google.com>
Cc: David Stevens <stevensd@google.com>
Cc: Minchan Kim <minchan@google.com>
Cc: Richard Chang <richardycc@google.com>
Cc: Chris Mason <clm@meta.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-20 19:24:38 -08:00
Sergey Senozhatsky
bde60fe747 zram: rename internal slot API
We have a somewhat confusing internal API naming.  E.g.  the following
code:

	zram_slot_lock()
	if (zram_allocated())
		zram_set_flag()
	zram_slot_unlock()

may look like it does something on zram device level, but in fact it tests
and sets slot entry flags, not the device ones.

Rename API to explicitly distinguish functions that operate on the slot
level from functions that operate on the zram device level.

While at it, fixup some coding styles.

[senozhatsky@chromium.org: fix up mark_slot_accessed()]
  Link: https://lkml.kernel.org/r/20260115031922.3813659-1-senozhatsky@chromium.org
Link: https://lkml.kernel.org/r/775a0b1a0ace5caf1f05965d8bc637c1192820fa.1765775954.git.senozhatsky@chromium.org
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Brian Geffon <bgeffon@google.com>
Cc: David Stevens <stevensd@google.com>
Cc: Minchan Kim <minchan@google.com>
Cc: Richard Chang <richardycc@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-20 19:24:38 -08:00
Sergey Senozhatsky
2e8ff2f51d zram: use u32 for entry ac_time tracking
We can reduce sizeof(zram_table_entry) on 64-bit systems by converting
flags and ac_time to u32.  Entry flags fit into u32, and for ac_time u32
gives us over a century of entry lifespan (approx 136 years) which is
plenty (zram uses system boot time (seconds)).

In struct zram_table_entry we use bytes aliasing, because bit-wait API
(for slot lock) requires a whole unsigned long word.

Link: https://lkml.kernel.org/r/d7c0b48450c70eeb5fd8acd6ecd23593f30dbf1f.1765775954.git.senozhatsky@chromium.org
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Suggested-by: David Stevens <stevensd@google.com>
Cc: Brian Geffon <bgeffon@google.com>
Cc: Minchan Kim <minchan@google.com>
Cc: Richard Chang <richardycc@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-20 19:24:38 -08:00
Sergey Senozhatsky
0327a86213 zram: consolidate device-attr declarations
Do not spread device attributes declarations across the file, move
io_stat, mm_stat, debug_stat to a common device-attr section.

Link: https://lkml.kernel.org/r/20251201094754.4149975-8-senozhatsky@chromium.org
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Brian Geffon <bgeffon@google.com>
Cc: David Stevens <stevensd@google.com>
Cc: Minchan Kim <minchan@google.com>
Cc: Richard Chang <richardycc@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-20 19:24:37 -08:00
Sergey Senozhatsky
0d38260c2a zram: switch to guard() for init_lock
Use init_lock guard() in sysfs store/show handlers, in order to simplify
and, more importantly, to modernize the code.

While at it, fix up more coding styles.

Link: https://lkml.kernel.org/r/20251201094754.4149975-7-senozhatsky@chromium.org
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Brian Geffon <bgeffon@google.com>
Cc: David Stevens <stevensd@google.com>
Cc: Minchan Kim <minchan@google.com>
Cc: Richard Chang <richardycc@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-20 19:24:37 -08:00
Sergey Senozhatsky
7ad688c0cd zram: rename zram_free_page()
We don't free page in zram_free_page(), not all slots even have any memory
associated with them (e.g.  ZRAM_SAME).  We free the slot (or reset it),
rename the function accordingly.

Link: https://lkml.kernel.org/r/20251201094754.4149975-6-senozhatsky@chromium.org
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Brian Geffon <bgeffon@google.com>
Cc: David Stevens <stevensd@google.com>
Cc: Minchan Kim <minchan@google.com>
Cc: Richard Chang <richardycc@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-20 19:24:37 -08:00
Sergey Senozhatsky
910bbb441c zram: move bd_stat to writeback section
Move bd_stat function and attribute declaration to
existing CONFIG_WRITEBACK ifdef-sections.

Link: https://lkml.kernel.org/r/20251201094754.4149975-5-senozhatsky@chromium.org
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Brian Geffon <bgeffon@google.com>
Cc: David Stevens <stevensd@google.com>
Cc: Minchan Kim <minchan@google.com>
Cc: Richard Chang <richardycc@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-20 19:24:37 -08:00
Richard Chang
4c1d61389e zram: introduce writeback_compressed device attribute
Introduce witeback_compressed device attribute to toggle compressed
writeback (decompression on demand) feature.

[senozhatsky@chromium.org: rewrote original patch, added documentation]
Link: https://lkml.kernel.org/r/20251201094754.4149975-3-senozhatsky@chromium.org
Signed-off-by: Richard Chang <richardycc@google.com>
Co-developed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Brian Geffon <bgeffon@google.com>
Cc: David Stevens <stevensd@google.com>
Cc: Minchan Kim <minchan@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-20 19:24:36 -08:00
Richard Chang
d38fab605c zram: introduce compressed data writeback
Patch series "zram: introduce compressed data writeback", v2.

As writeback becomes more common there is another shortcoming that needs
to be addressed - compressed data writeback.  Currently zram does
uncompressed data writeback which is not optimal due to potential CPU and
battery wastage.  This series changes suboptimal uncompressed writeback to
a more optimal compressed data writeback.


This patch (of 7):

zram stores all written back slots raw, which implies that during
writeback zram first has to decompress slots (except for ZRAM_HUGE slots,
which are raw already).  The problem with this approach is that not every
written back page gets read back (either via read() or via page-fault),
which means that zram basically wastes CPU cycles and battery
decompressing such slots.  This changes with introduction of decompression
on demand, in other words decompression on read()/page-fault.

One caveat of decompression on demand is that async read is completed in
IRQ context, while zram decompression is sleepable.  To workaround this,
read-back decompression is offloaded to a preemptible context - system
high-prio work-queue.

At this point compressed writeback is still disabled, a follow up patch
will introduce a new device attribute which will make it possible to
toggle compressed writeback per-device.

[senozhatsky@chromium.org: rewrote original implementation]
Link: https://lkml.kernel.org/r/20251201094754.4149975-1-senozhatsky@chromium.org
Link: https://lkml.kernel.org/r/20251201094754.4149975-2-senozhatsky@chromium.org
Signed-off-by: Richard Chang <richardycc@google.com>
Co-developed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Suggested-by: Minchan Kim <minchan@google.com>
Suggested-by: Brian Geffon <bgeffon@google.com>
Cc: David Stevens <stevensd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-20 19:24:36 -08:00
Linus Torvalds
d3eeb99bbc block-6.19-20260116
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmlrBfMQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgphVYEADD88/MTkm0x/Ydwq5WjoJ1yfYsJXFpO+ZW
 UZpsKGhPbmnu1d+541vlzMJS7OkIE3OsCHn8sXTagrKP862aDDZ7r71cG1OPF49U
 7ICK660OPr0s1E0sVEC0O3tB1C1jJf+cKgPmpDiz9Lyj43J23mGhXfuWyX4hpx3W
 3HVJBFb5KBaTV4hiFKGwNnPGNgRyWMt2tj5hSKGTZ8UU4jhi+CtWju64HUOqL8X6
 gPFKoqHx+qWwrnT7frd+B3ldVgrVUVY5hosrmqa9qKapEoxy+mZ897UZrV8mJc9X
 gbp+403kNMilMKOI1sE0ekU7aCpz7iAyWAX7+h5HTH19fmwlRVGrNZpqRINY06xR
 Wc9XdLgIrPwLdpngfO9YXFik6rLa5uhiXVYaIIaKwl0riORCxzKGpL3bROjYJKfh
 N0nLkvt55mO2BA4dxJd1h3Mz5Bwxme+B21nsNJf2pBSKBBfpW86t+Euqzm5xZwYW
 UYMbIfLnTGd7swLXhAzX67pz7dkkVTPd6cx6lOXtSq+WUS+Lng7c1tTTryuF2o/X
 LmHDTvsaWBkli4s8DKbxdoJQen8g9YE0jqvwJIgFecIlrhOl/6OjvI4VbKd18i56
 bkL/0RBtoNRfw8KyuXabZOX4h/NVSsO/7l1nPbqdiBXXuWTYUZFqvcuPD1Z76slt
 abUDIP8OQw==
 =8QRq
 -----END PGP SIGNATURE-----

Merge tag 'block-6.19-20260116' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux

Pull block fixes from Jens Axboe:

 - NVMe pull request via Keith:
     - Device quirk to disable faulty temperature (Ilikara)
     - TCP target null pointer fix from bad host protocol usage (Shivam)
     - Add apple,t8103-nvme-ans2 as a compatible apple controller
       (Janne)
     - FC tagset leak fix (Chaitanya)
     - TCP socket deadlock fix (Hannes)
     - Target name buffer overrun fix (Shin'ichiro)

 - Fix for an underflow for rnbd during device unmap

 - Zero the non-PI part of the auto integrity buffer

 - Fix for a configfs memory leak in the null block driver

* tag 'block-6.19-20260116' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux:
  rnbd-clt: fix refcount underflow in device unmap path
  nvme: fix PCIe subsystem reset controller state transition
  nvmet: do not copy beyond sybsysnqn string length
  nvmet-tcp: fixup hang in nvmet_tcp_listen_data_ready()
  null_blk: fix kmemleak by releasing references to fault configfs items
  block: zero non-PI portion of auto integrity buffer
  nvme-fc: release admin tagset if init fails
  nvme-apple: add "apple,t8103-nvme-ans2" as compatible
  nvme-tcp: fix NULL pointer dereferences in nvmet_tcp_build_pdu_iovec
  nvme-pci: disable secondary temp for Wodposit WPBSNM8
2026-01-16 20:59:46 -08:00
Chaitanya Kulkarni
ec19ed2b3e rnbd-clt: fix refcount underflow in device unmap path
During device unmapping (triggered by module unload or explicit unmap),
a refcount underflow occurs causing a use-after-free warning:

  [14747.574913] ------------[ cut here ]------------
  [14747.574916] refcount_t: underflow; use-after-free.
  [14747.574917] WARNING: lib/refcount.c:28 at refcount_warn_saturate+0x55/0x90, CPU#9: kworker/9:1/378
  [14747.574924] Modules linked in: rnbd_client(-) rtrs_client rnbd_server rtrs_server rtrs_core ...
  [14747.574998] CPU: 9 UID: 0 PID: 378 Comm: kworker/9:1 Tainted: G           O     N  6.19.0-rc3lblk-fnext+ #42 PREEMPT(voluntary)
  [14747.575005] Workqueue: rnbd_clt_wq unmap_device_work [rnbd_client]
  [14747.575010] RIP: 0010:refcount_warn_saturate+0x55/0x90
  [14747.575037]  Call Trace:
  [14747.575038]   <TASK>
  [14747.575038]   rnbd_clt_unmap_device+0x170/0x1d0 [rnbd_client]
  [14747.575044]   process_one_work+0x211/0x600
  [14747.575052]   worker_thread+0x184/0x330
  [14747.575055]   ? __pfx_worker_thread+0x10/0x10
  [14747.575058]   kthread+0x10d/0x250
  [14747.575062]   ? __pfx_kthread+0x10/0x10
  [14747.575066]   ret_from_fork+0x319/0x390
  [14747.575069]   ? __pfx_kthread+0x10/0x10
  [14747.575072]   ret_from_fork_asm+0x1a/0x30
  [14747.575083]   </TASK>
  [14747.575096] ---[ end trace 0000000000000000 ]---

Befor this patch :-

The bug is a double kobject_put() on dev->kobj during device cleanup.

Kobject Lifecycle:
  kobject_init_and_add()  sets kobj.kref = 1  (initialization)
  kobject_put()           sets kobj.kref = 0  (should be called once)

* Before this patch:

rnbd_clt_unmap_device()
  rnbd_destroy_sysfs()
    kobject_del(&dev->kobj)                   [remove from sysfs]
    kobject_put(&dev->kobj)                   PUT #1 (WRONG!)
      kref: 1 to 0
      rnbd_dev_release()
        kfree(dev)                            [DEVICE FREED!]

  rnbd_destroy_gen_disk()                     [use-after-free!]

  rnbd_clt_put_dev()
    refcount_dec_and_test(&dev->refcount)
    kobject_put(&dev->kobj)                   PUT #2 (UNDERFLOW!)
      kref: 0 to -1                           [WARNING!]

The first kobject_put() in rnbd_destroy_sysfs() prematurely frees the
device via rnbd_dev_release(), then the second kobject_put() in
rnbd_clt_put_dev() causes refcount underflow.

* After this patch :-

Remove kobject_put() from rnbd_destroy_sysfs(). This function should
only remove sysfs visibility (kobject_del), not manage object lifetime.

Call Graph (FIXED):

rnbd_clt_unmap_device()
  rnbd_destroy_sysfs()
    kobject_del(&dev->kobj)                   [remove from sysfs only]
                                              [kref unchanged: 1]

  rnbd_destroy_gen_disk()                     [device still valid]

  rnbd_clt_put_dev()
    refcount_dec_and_test(&dev->refcount)
    kobject_put(&dev->kobj)                   ONLY PUT (CORRECT!)
      kref: 1 to 0                            [BALANCED]
      rnbd_dev_release()
        kfree(dev)                            [CLEAN DESTRUCTION]

This follows the kernel pattern where sysfs removal (kobject_del) is
separate from object destruction (kobject_put).

Fixes: 581cf833ca ("block: rnbd: add .release to rnbd_dev_ktype")
Signed-off-by: Chaitanya Kulkarni <ckulkarnilinux@gmail.com>
Acked-by: Jack Wang <jinpu.wang@ionos.com>
Reviewed-by: Jack Wang <jinpu.wang@ionos.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-01-15 07:22:09 -07:00
Nilay Shroff
40b94ec7ed null_blk: fix kmemleak by releasing references to fault configfs items
When CONFIG_BLK_DEV_NULL_BLK_FAULT_INJECTION is enabled, the null-blk
driver sets up fault injection support by creating the timeout_inject,
requeue_inject, and init_hctx_fault_inject configfs items as children
of the top-level nullbX configfs group.

However, when the nullbX device is removed, the references taken to
these fault-config configfs items are not released. As a result,
kmemleak reports a memory leak, for example:

unreferenced object 0xc00000021ff25c40 (size 32):
  comm "mkdir", pid 10665, jiffies 4322121578
  hex dump (first 32 bytes):
    69 6e 69 74 5f 68 63 74 78 5f 66 61 75 6c 74 5f  init_hctx_fault_
    69 6e 6a 65 63 74 00 88 00 00 00 00 00 00 00 00  inject..........
  backtrace (crc 1a018c86):
    __kmalloc_node_track_caller_noprof+0x494/0xbd8
    kvasprintf+0x74/0xf4
    config_item_set_name+0xf0/0x104
    config_group_init_type_name+0x48/0xfc
    fault_config_init+0x48/0xf0
    0xc0080000180559e4
    configfs_mkdir+0x304/0x814
    vfs_mkdir+0x49c/0x604
    do_mkdirat+0x314/0x3d0
    sys_mkdir+0xa0/0xd8
    system_call_exception+0x1b0/0x4f0
    system_call_vectored_common+0x15c/0x2ec

Fix this by explicitly releasing the references to the fault-config
configfs items when dropping the reference to the top-level nullbX
configfs group.

Cc: stable@vger.kernel.org
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Fixes: bb4c19e030 ("block: null_blk: make fault-injection dynamically configurable per device")
Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-01-13 07:24:37 -07:00
Yoav Cohen
93ada1b3da ublk: add UBLK_CMD_TRY_STOP_DEV command
Add a best-effort stop command, UBLK_CMD_TRY_STOP_DEV, which only stops a
ublk device when it has no active openers.

Unlike UBLK_CMD_STOP_DEV, this command does not disrupt existing users.
New opens are blocked only after disk_openers has reached zero; if the
device is busy, the command returns -EBUSY and leaves it running.

The ub->block_open flag is used only to close a race with an in-progress
open and does not otherwise change open behavior.

Advertise support via the UBLK_F_SAFE_STOP_DEV feature flag.

Signed-off-by: Yoav Cohen <yoav@nvidia.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-01-12 15:07:31 -07:00
Yoav Cohen
9e386f49fa ublk: make ublk_ctrl_stop_dev return void
This function always returns 0, so there is no need to return a value.

Signed-off-by: Yoav Cohen <yoav@nvidia.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-01-12 15:07:31 -07:00
Caleb Sander Mateos
bfe1255712 ublk: optimize ublk_user_copy() on daemon task
ublk user copy syscalls may be issued from any task, so they take a
reference count on the struct ublk_io to check whether it is owned by
the ublk server and prevent a concurrent UBLK_IO_COMMIT_AND_FETCH_REQ
from completing the request. However, if the user copy syscall is issued
on the io's daemon task, a concurrent UBLK_IO_COMMIT_AND_FETCH_REQ isn't
possible, so the atomic reference count dance is unnecessary. Check for
UBLK_IO_FLAG_OWNED_BY_SRV to ensure the request is dispatched to the
sever and obtain the request from ublk_io's req field instead of looking
it up on the tagset. Skip the reference count increment and decrement.
Commit 8a8fe42d76 ("ublk: optimize UBLK_IO_REGISTER_IO_BUF on daemon
task") made an analogous optimization for ublk zero copy buffer
registration.

Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-01-12 09:16:38 -07:00
Stanley Zhang
b2503e936b ublk: support UBLK_F_INTEGRITY
Now that all the components of the ublk integrity feature have been
implemented, add UBLK_F_INTEGRITY to UBLK_F_ALL, conditional on block
layer integrity support (CONFIG_BLK_DEV_INTEGRITY). This allows ublk
servers to create ublk devices with UBLK_F_INTEGRITY set and
UBLK_U_CMD_GET_FEATURES to report the feature as supported.

Signed-off-by: Stanley Zhang <stazhang@purestorage.com>
[csander: make feature conditional on CONFIG_BLK_DEV_INTEGRITY]
Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-01-12 09:16:38 -07:00
Stanley Zhang
be82a89066 ublk: implement integrity user copy
Add a function ublk_copy_user_integrity() to copy integrity information
between a request and a user iov_iter. This mirrors the existing
ublk_copy_user_pages() but operates on request integrity data instead of
regular data. Check UBLKSRV_IO_INTEGRITY_FLAG in iocb->ki_pos in
ublk_user_copy() to choose between copying data or integrity data.

[csander: change offset units from data bytes to integrity data bytes,
 fix CONFIG_BLK_DEV_INTEGRITY=n build, rebase on user copy refactor]

Signed-off-by: Stanley Zhang <stazhang@purestorage.com>
Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-01-12 09:16:38 -07:00
Caleb Sander Mateos
fd5a005fa6 ublk: move offset check out of __ublk_check_and_get_req()
__ublk_check_and_get_req() checks that the passed in offset is within
the data length of the specified ublk request. However, only user copy
(ublk_check_and_get_req()) supports accessing ublk request data at a
nonzero offset. Zero-copy buffer registration (ublk_register_io_buf())
always passes 0 for the offset, so the check is unnecessary. Move the
check from __ublk_check_and_get_req() to ublk_check_and_get_req().

Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-01-12 09:16:33 -07:00
Caleb Sander Mateos
ca80afd870 ublk: inline ublk_check_and_get_req() into ublk_user_copy()
ublk_check_and_get_req() has a single callsite in ublk_user_copy(). It
takes a ton of arguments in order to pass local variables from
ublk_user_copy() to ublk_check_and_get_req() and vice versa. And more
are about to be added. Combine the functions to reduce the argument
passing noise.

Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-01-12 09:15:05 -07:00
Caleb Sander Mateos
5bfbbc9938 ublk: split out ublk_user_copy() helper
ublk_ch_read_iter() and ublk_ch_write_iter() are nearly identical except
for the iter direction. Split out a helper function ublk_user_copy() to
reduce the code duplication as these functions are about to get larger.

Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-01-12 09:15:05 -07:00
Caleb Sander Mateos
fc652d415c ublk: split out ublk_copy_user_bvec() helper
Factor a helper function ublk_copy_user_bvec() out of
ublk_copy_user_pages(). It will be used for copying integrity data too.

Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-01-12 09:15:05 -07:00
Caleb Sander Mateos
f82f0a16a8 ublk: set UBLK_IO_F_INTEGRITY in ublksrv_io_desc
Indicate to the ublk server when an incoming request has integrity data
by setting UBLK_IO_F_INTEGRITY in the ublksrv_io_desc's op_flags field.

Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-01-12 09:15:05 -07:00
Stanley Zhang
98bf225685 ublk: support UBLK_PARAM_TYPE_INTEGRITY in device creation
Add a feature flag UBLK_F_INTEGRITY for a ublk server to request
integrity/metadata support when creating a ublk device. The ublk server
can also check for the feature flag on the created device or the result
of UBLK_U_CMD_GET_FEATURES to tell if the ublk driver supports it.
UBLK_F_INTEGRITY requires UBLK_F_USER_COPY, as user copy is the only
data copy mode initially supported for integrity data.
Add UBLK_PARAM_TYPE_INTEGRITY and struct ublk_param_integrity to struct
ublk_params to specify the integrity params of a ublk device.
UBLK_PARAM_TYPE_INTEGRITY requires UBLK_F_INTEGRITY and a nonzero
metadata_size. The LBMD_PI_CAP_* and LBMD_PI_CSUM_* values from the
linux/fs.h UAPI header are used for the flags and csum_type fields.
If the UBLK_PARAM_TYPE_INTEGRITY flag is set, validate the integrity
parameters and apply them to the blk_integrity limits.
The struct ublk_param_integrity validations are based on the checks in
blk_validate_integrity_limits(). Any invalid parameters should be
rejected before being applied to struct blk_integrity.

[csander: drop redundant pi_tuple_size field, use block metadata UAPI
 constants, add param validation]

Signed-off-by: Stanley Zhang <stazhang@purestorage.com>
Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-01-12 09:15:05 -07:00
Caleb Sander Mateos
e859e7c26a ublk: move ublk flag check functions earlier
ublk_dev_support_user_copy() will be used in ublk_validate_params().
Move these functions next to ublk_{dev,queue}_is_zoned() to avoid
needing to forward-declare them.

Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-01-12 09:15:05 -07:00
Jens Axboe
5df832ba5f Merge branch 'block-6.19' into for-7.0/block
Merge in fixes that went to 6.19 after for-7.0/block was branched.
Pending ublk changes depend on particularly the async scan work.

* block-6.19:
  block: zero non-PI portion of auto integrity buffer
  ublk: fix use-after-free in ublk_partition_scan_work
  blk-mq: avoid stall during boot due to synchronize_rcu_expedited
  loop: add missing bd_abort_claiming in loop_set_status
  block: don't merge bios with different app_tags
  blk-rq-qos: Remove unlikely() hints from QoS checks
  loop: don't change loop device under exclusive opener in loop_set_status
  block, bfq: update outdated comment
  blk-mq: skip CPU offline notify on unmapped hctx
  selftests/ublk: fix Makefile to rebuild on header changes
  selftests/ublk: add test for async partition scan
  ublk: scan partition in async way
  block,bfq: fix aux stat accumulation destination
  md: Fix forward incompatibility from configurable logical block size
  md: Fix logical_block_size configuration being overwritten
  md: suspend array while updating raid_disks via sysfs
  md/raid5: fix possible null-pointer dereferences in raid5_store_group_thread_cnt()
  md: Fix static checker warning in analyze_sbs
2026-01-11 13:16:36 -07:00
Linus Torvalds
cb2076b091 block-6.19-20260109
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmlhS3UQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpqN1EACZI7gCMHL+CI5utvaQVPoZbyDf3jED73KO
 NwDyLKl/frGW2njbM/hcSSH0SITGYnrN0+KGr9JFIIu/AMnl+0prl74DrPjUsQ3x
 b9FwHYcjgQxPEIR39KxqSGAJTrxNxGFyS0OaTg91OMKg8Ze57WlkDRtRIJBpsTB4
 I2OUrMC34fVvjSTzefErB/eNsY3xAO8aFpWbBGD2h/GpH0f3SgGTAu7JH6Hj1Zfw
 kFWyMMSc/JkGB7wSOLxDB2IepS7PkLwlRaU6rHV3xzI1DXs24oUT8E20VU8JMedf
 WLQpzNSfqKws6KQa9LIywMo/bwA4dh3FogUJ6MflJKZoGCiMQnps4f18L6EI+w9L
 NpDCWkNgNwd6siDbTBZebd8YlqkWJYJ7NPwTl9dBdczX4DWsfej0exC2UPgN3B7R
 MQNKuP/+oC7y92igMAXIgFRQIwriVNFCsW/Q3oZSDTJSmaDc7CvONNaLnRom0sen
 1uPt/8w7bz8PkUlVUt6SFl0+KaCXX3mFUnEDiY7+du7nSUeyo1BEL6tm46q7gybC
 lRjyDWp5mz/a/JL3tmiOtavVbnyZ1iy03Nd5HfULUhsARJAQKbE+hAvBEhZGq2F2
 A4FKJgzRd7u5dBcaGLNf8H6UVml600ZX9GPkjH35tVXkqB6z87mQTfJmT6ViLKLU
 vM8AfGWbLQ==
 =DaMl
 -----END PGP SIGNATURE-----

Merge tag 'block-6.19-20260109' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux

Pull block fixes from Jens Axboe:

 - Kill unlikely checks for blk-rq-qos. These checks are really
   all-or-nothing, either the branch is taken all the time, or it's not.
   Depending on the configuration, either one of those cases may be
   true. Just remove the annotation

 - Fix for merging bios with different app tags set

 - Fix for a recently introduced slowdown due to RCU synchronization

 - Fix for a status change on loop while it's in use, and then a later
   fix for that fix

 - Fix for the async partition scanning in ublk

* tag 'block-6.19-20260109' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux:
  ublk: fix use-after-free in ublk_partition_scan_work
  blk-mq: avoid stall during boot due to synchronize_rcu_expedited
  loop: add missing bd_abort_claiming in loop_set_status
  block: don't merge bios with different app_tags
  blk-rq-qos: Remove unlikely() hints from QoS checks
  loop: don't change loop device under exclusive opener in loop_set_status
2026-01-09 15:42:46 -10:00
Ming Lei
f0d385f668 ublk: fix use-after-free in ublk_partition_scan_work
A race condition exists between the async partition scan work and device
teardown that can lead to a use-after-free of ub->ub_disk:

1. ublk_ctrl_start_dev() schedules partition_scan_work after add_disk()
2. ublk_stop_dev() calls ublk_stop_dev_unlocked() which does:
   - del_gendisk(ub->ub_disk)
   - ublk_detach_disk() sets ub->ub_disk = NULL
   - put_disk() which may free the disk
3. The worker ublk_partition_scan_work() then dereferences ub->ub_disk
   leading to UAF

Fix this by using ublk_get_disk()/ublk_put_disk() in the worker to hold
a reference to the disk during the partition scan. The spinlock in
ublk_get_disk() synchronizes with ublk_detach_disk() ensuring the worker
either gets a valid reference or sees NULL and exits early.

Also change flush_work() to cancel_work_sync() to avoid running the
partition scan work unnecessarily when the disk is already detached.

Fixes: 7fc4da6a30 ("ublk: scan partition in async way")
Reported-by: Ruikai Peng <ruikai@pwno.io>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-01-09 06:55:30 -07:00
Tetsuo Handa
2704024d83 loop: add missing bd_abort_claiming in loop_set_status
Commit 08e136ebd1 ("loop: don't change loop device under exclusive
opener in loop_set_status") forgot to call bd_abort_claiming() when
mutex_lock_killable() failed.

Fixes: 08e136ebd1 ("loop: don't change loop device under exclusive opener in loop_set_status")
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-01-07 08:04:42 -07:00
Raphael Pinsonneault-Thibeault
08e136ebd1 loop: don't change loop device under exclusive opener in loop_set_status
loop_set_status() is allowed to change the loop device while there
are other openers of the device, even exclusive ones.

In this case, it causes a KASAN: slab-out-of-bounds Read in
ext4_search_dir(), since when looking for an entry in an inlined
directory, e_value_offs is changed underneath the filesystem by
loop_set_status().

Fix the problem by forbidding loop_set_status() from modifying the loop
device while there are exclusive openers of the device. This is similar
to the fix in loop_configure() by commit 33ec3e53e7 ("loop: Don't
change loop device under exclusive opener") alongside commit ecbe6bc000
("block: use bd_prepare_to_claim directly in the loop driver").

Reported-by: syzbot+3ee481e21fd75e14c397@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=3ee481e21fd75e14c397
Tested-by: syzbot+3ee481e21fd75e14c397@syzkaller.appspotmail.com
Tested-by: Yongpeng Yang <yangyongpeng@xiaomi.com>
Signed-off-by: Raphael Pinsonneault-Thibeault <rpthibeault@gmail.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-01-06 05:30:18 -07:00
Md Haris Iqbal
69d26698e4 rnbd-srv: Zero the rsp buffer before using it
Before using the data buffer to send back the response message, zero it
completely. This prevents any stray bytes to be picked up by the client
side when there the message is exchanged between different protocol
versions.

Signed-off-by: Md Haris Iqbal <haris.iqbal@ionos.com>
Signed-off-by: Jack Wang <jinpu.wang@ionos.com>
Signed-off-by: Grzegorz Prajsner <grzegorz.prajsner@ionos.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-01-06 05:28:10 -07:00
Florian-Ewald Mueller
4ac9690d4b rnbd-srv: Fix server side setting of bi_size for special IOs
On rnbd-srv, the bi_size of the bio is set during the bio_add_page
function, to which datalen is passed. But for special IOs like DISCARD
and WRITE_ZEROES, datalen is 0, since there is no data to write. For
these special IOs, use the bi_size of the rnbd_msg_io.

Fixes: f6f84be089 ("block/rnbd-srv: Add sanity check and remove redundant assignment")
Signed-off-by: Florian-Ewald Mueller <florian-ewald.mueller@ionos.com>
Signed-off-by: Md Haris Iqbal <haris.iqbal@ionos.com>
Signed-off-by: Grzegorz Prajsner <grzegorz.prajsner@ionos.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-01-06 05:28:10 -07:00
Jack Wang
e1384543e8 rnbd-srv: fix the trace format for flags
The __print_flags helper meant for bitmask, while the rnbd_rw_flags is
mixed with bitmask and enum, to avoid confusion, just print the data
as it is.

Signed-off-by: Jack Wang <jinpu.wang@ionos.com>
Reviewed-by: Md Haris Iqbal <haris.iqbal@ionos.com>
Signed-off-by: Grzegorz Prajsner <grzegorz.prajsner@ionos.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-01-06 05:28:10 -07:00
Md Haris Iqbal
ef63e9ef76 block/rnbd-proto: Check and retain the NOUNMAP flag for requests
The NOUNMAP flag is in combination with WRITE_ZEROES flag to indicate
that the upper layers wants the sectors zeroed, but does not want it to
get freed. This instruction is especially important for storage stacks
which involves a layer capable of thin provisioning.

This commit makes RNBD block device transfer and retain this NOUNMAP flag
for requests, so it can be passed onto the backend device on the server
side.

Since it is a change in the wire protocol, bump the minor version of
protocol.

Signed-off-by: Md Haris Iqbal <haris.iqbal@ionos.com>
Signed-off-by: Jack Wang <jinpu.wang@ionos.com>
Signed-off-by: Grzegorz Prajsner <grzegorz.prajsner@ionos.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-01-06 05:28:10 -07:00
Zhu Yanjun
581cf833ca block: rnbd: add .release to rnbd_dev_ktype
Every ktype must provides a .release function that will be called after
the last kobject_put.

Signed-off-by: Zhu Yanjun <yanjun.zhu@linux.dev>
Reviewed-by: Md Haris Iqbal <haris.iqbal@ionos.com>
Signed-off-by: Grzegorz Prajsner <grzegorz.prajsner@ionos.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-01-06 05:28:10 -07:00
Md Haris Iqbal
483cbec342 block/rnbd-proto: Handle PREFLUSH flag properly for IOs
In RNBD client, for a WRITE request of size 0, with only the REQ_PREFLUSH
bit set, while converting from bio_opf to rnbd_opf, we do REQ_OP_WRITE to
RNBD_OP_WRITE, and then check if the rq is flush through function
op_is_flush. That function checks both REQ_PREFLUSH and REQ_FUA flag, and
if any of them is set, the RNBD_F_FUA is set.
On the RNBD server side, while converting the RNBD flags to req flags, if
the RNBD_F_FUA flag is set, we just set the REQ_FUA flag. This means we
have lost the PREFLUSH flag, and added the REQ_FUA flag in its place.

This commits adds a new RNBD_F_PREFLUSH flag, and also adds separate
handling for REQ_PREFLUSH flag. On the server side, if the RNBD_F_PREFLUSH
is present, the REQ_PREFLUSH is added to the bio.

Since it is a change in the wire protocol, bump the minor version of
protocol.
The change is backwards compatible, and does not change the functionality
if either the client or the server is running older/newer versions.
If the client side is running the older version, both REQ_PREFLUSH and
REQ_FUA is converted to RNBD_F_FUA. The server running newer one would
still add only the REQ_FUA flag which is what happens when both client and
server is running the older version.
If the client side is running the newer version, just like before a
RNBD_F_FUA is added, but now a RNBD_F_PREFLUSH is also added to the
rnbd_opf. In case the server is running the older version the
RNBD_F_PREFLUSH is ignored, and only the RNBD_F_FUA is processed.

Signed-off-by: Md Haris Iqbal <haris.iqbal@ionos.com>
Reviewed-by: Jack Wang <jinpu.wang@ionos.com>
Reviewed-by: Florian-Ewald Mueller <florian-ewald.mueller@ionos.com>
Signed-off-by: Grzegorz Prajsner <grzegorz.prajsner@ionos.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-01-06 05:28:10 -07:00
Linus Torvalds
bea82c80a5 block-6.19-20260102
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmlX7MMQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpvuYEACG0VFYmcqmB4JZygecJB3xaxhbVIrCbjFv
 Vmc0XNTkcCpjYAv1jpkS5F3nkJhzZlFNn9xOaP/O8E+6tSctFIre7qjMRpxZM3yl
 GA+MqPI+zNbpYMgsoAH/XTASTVfaTEPOlaoAPQeo8Ey3JRw3Ko1IDNU7zIYK94Xl
 rSAeT65W7vJ+HBjctBoCZYMsE2x0Sn0yrVctkL1mMusQwIg6oMhJ1w1p36P17Mc1
 YgLWQYtfK+eogdTM0Jh9RvDtVJL3WT1I2Ii3KBdCgryY7iSxFXvM0pm1lrOBH+kI
 4bKHTylBnjfmxv7dlz3jHwRmahwdXDk7rpq1EMPygDSj835h3SgAFz3rm9nCUjNI
 xWyEZeN6z4ykdOlqJ6ghTnZTroRdM/12HbSV46n69tczxepG3Mn1i3gBd4UQhn5T
 z6aqa7akIsynlzOnLgrwQjxgVhtfAHptrgAg7g7Kz9hq9xTAEPc2f9Nq7glmLP6f
 wPMoy2lla69vk4Tlzh8TZpTHRPLYLHTtL5OQPM6dnyQ6MzWm2/PHJ/MNfV7/o+VR
 W61BYXUz6d2q81c/I16VWVQvJ0nUa3v7hUGCLUeimQUg+ulyIlMX4wrOI7iYTFTy
 V/4c3DHKEh9y/ptmCgv0jDZdwSoUYvXkn0vFe0fcF3q/T7xea4dok8mcXLcKhMuc
 xPFtx92dhQ==
 =4NB3
 -----END PGP SIGNATURE-----

Merge tag 'block-6.19-20260102' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux

Pull block fixes from Jens Axboe:

 - Scan partition tables asynchronously for ublk, similarly to how nvme
   does it. This avoids potential deadlocks, which is why nvme does it
   that way too. Includes a set of selftests as well.

 - MD pull request via Yu:
     - Fix null-pointer dereference in raid5 sysfs group_thread_cnt
       store (Tuo Li)
     - Fix possible mempool corruption during raid1 raid_disks update
       via sysfs (FengWei Shih)
     - Fix logical_block_size configuration being overwritten during
       super_1_validate() (Li Nan)
     - Fix forward incompatibility with configurable logical block size:
       arrays assembled on new kernels could not be assembled on older
       kernels (v6.18 and before) due to non-zero reserved pad rejection
       (Li Nan)
     - Fix static checker warning about iterator not incremented (Li Nan)

 - Skip CPU offlining notifications on unmapped hardware queues

 - bfq-iosched block stats fix

 - Fix outdated comment in bfq-iosched

* tag 'block-6.19-20260102' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux:
  block, bfq: update outdated comment
  blk-mq: skip CPU offline notify on unmapped hctx
  selftests/ublk: fix Makefile to rebuild on header changes
  selftests/ublk: add test for async partition scan
  ublk: scan partition in async way
  block,bfq: fix aux stat accumulation destination
  md: Fix forward incompatibility from configurable logical block size
  md: Fix logical_block_size configuration being overwritten
  md: suspend array while updating raid_disks via sysfs
  md/raid5: fix possible null-pointer dereferences in raid5_store_group_thread_cnt()
  md: Fix static checker warning in analyze_sbs
2026-01-02 12:15:59 -08:00
Christophe JAILLET
9e371032cb null_blk: Constify struct configfs_item_operations and configfs_group_operations
'struct configfs_item_operations' and 'configfs_group_operations' are not
modified in this driver.

Constifying these structures moves some data to a read-only section, so
increases overall security, especially when the structure holds some
function pointers.

On a x86_64, with allmodconfig:
Before:
======
   text	   data	    bss	    dec	    hex	filename
 100263	  37808	   2752	 140823	  22617	drivers/block/null_blk/main.o

After:
=====
   text	   data	    bss	    dec	    hex	filename
 100423	  37648	   2752	 140823	  22617	drivers/block/null_blk/main.o

Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-12-29 08:19:15 -07:00
Thorsten Blum
e1418af766 brd: replace simple_strtol with kstrtoul in ramdisk_size
Replace simple_strtol() with the recommended kstrtoul() for parsing the
'ramdisk_size=' boot parameter. Unlike simple_strtol(), which returns a
long, kstrtoul() converts the string directly to an unsigned long and
avoids implicit casting.

Check the return value of kstrtoul() and reject invalid values. This
adds error handling while preserving behavior for existing values, and
removes use of the deprecated simple_strtol() helper. The current code
silently sets 'rd_size = 0' if parsing fails, instead of leaving the
default value (CONFIG_BLK_DEV_RAM_SIZE) unchanged.

Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-12-28 15:54:38 -07:00
Tamir Duberstein
4cef2fcda3 rnull: replace kernel::c_str! with C-Strings
C-String literals were added in Rust 1.77. Replace instances of
`kernel::c_str!` with C-String literals where possible.

Signed-off-by: Tamir Duberstein <tamird@gmail.com>
Reviewed-by: Daniel Almeida <daniel.almeida@collabora.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-12-28 15:54:38 -07:00
Ming Lei
7fc4da6a30 ublk: scan partition in async way
Implement async partition scan to avoid IO hang when reading partition
tables. Similar to nvme_partition_scan_work(), partition scanning is
deferred to a work queue to prevent deadlocks.

When partition scan happens synchronously during add_disk(), IO errors
can cause the partition scan to wait while holding ub->mutex, which
can deadlock with other operations that need the mutex.

Changes:
- Add partition_scan_work to ublk_device structure
- Implement ublk_partition_scan_work() to perform async scan
- Always suppress sync partition scan during add_disk()
- Schedule async work after add_disk() for trusted daemons
- Add flush_work() in ublk_stop_dev() before grabbing ub->mutex

Reviewed-by: Caleb Sander Mateos <csander@purestorage.com>
Reported-by: Yoav Cohen <yoav@nvidia.com>
Closes: https://lore.kernel.org/linux-block/DM4PR12MB63280C5637917C071C2F0D65A9A8A@DM4PR12MB6328.namprd12.prod.outlook.com/
Fixes: 71f28f3136 ("ublk_drv: add io_uring based userspace block driver")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-12-28 09:25:26 -07:00
Linus Torvalds
3f0e9c8cef block-6.19-20251226
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmlOrqkQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpmIoEAC25bhRVlbwHPGTXiT2tFpxb3AQofmcxNZK
 2Ooj525wcHvgPoh6shBiPDoH8tDoUJaqg+ZUtMjA3gYWmbhL719G5n3TuJSE/I6m
 AUwP0pAU+dtJbDRxTfYD7JjSGidKd74WlLHAoF5SiDW1U1DMfyYPfQO+p+MhuGJV
 Olo+oFVdWst4+aV8elxY/SgRISI+3VIRwe+zIlPofRTrvkhh1qYcDKzUZmrxDCRl
 TP4csqKanuTrTZaWTG6tTrIEfyuvsfxDSst/slZziUlEx3k+WnB6WHUbWufWMFp6
 zRuK+NGKrwrLb/K0J1kSGeGsc2Ptkx6Sn+dOp3zNsrH1chihZvr7k+q8z5kBpv7i
 mwu1CbreGRI3pgQYMIdO3UD7PZKEj/5XJ36cHPYWQwG056vrtD2X7dMfGFfJ4SdE
 KBSm3tXDiBLqbP6g2lZ6BixeUB/nHoDqhZtfsP/guw8coVyGkN4w0ND0/mIbbD0p
 VQ47HSQB4zfE6NHlr4tcnDDxEg4V+9d3H1yva0KRwVXjrEsTUcMVSoRwA99JUQoi
 VLmQ2sGn8qEGWxVHPGXvF6gEHnU8eoQfZD2oG2gZPyw145HKeuml4VtbB6wy0J0n
 GHWY8cLcq5y2quWfHDYEbd/g9yedE+Y8eVkHgPbHmGjwCLZOt3pltKmfpjw5otkt
 xdNoodcOTw==
 =DwyX
 -----END PGP SIGNATURE-----

Merge tag 'block-6.19-20251226' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux

Pull block fixes from Jens Axboe:

 - Fix for a signedness issue introduced in this kernel release for rnbd

 - Fix up user copy references for ublk when the server exits

* tag 'block-6.19-20251226' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux:
  block: rnbd-clt: Fix signedness bug in init_dev()
  ublk: clean up user copy references on ublk server exit
2025-12-26 11:44:35 -08:00
Dan Carpenter
1ddb815fdf block: rnbd-clt: Fix signedness bug in init_dev()
The "dev->clt_device_id" variable is set using ida_alloc_max() which
returns an int and in particular it returns negative error codes.
Change the type from u32 to int to fix the error checking.

Fixes: c9b5645fd8 ("block: rnbd-clt: Fix leaked ID in init_dev()")
Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-12-20 12:56:48 -07:00
Caleb Sander Mateos
daa24603d9 ublk: clean up user copy references on ublk server exit
If a ublk server process releases a ublk char device file, any requests
dispatched to the ublk server but not yet completed will retain a ref
value of UBLK_REFCOUNT_INIT. Before commit e63d2228ef ("ublk: simplify
aborting ublk request"), __ublk_fail_req() would decrement the reference
count before completing the failed request. However, that commit
optimized __ublk_fail_req() to call __ublk_complete_rq() directly
without decrementing the request reference count.
The leaked reference count incorrectly allows user copy and zero copy
operations on the completed ublk request. It also triggers the
WARN_ON_ONCE(refcount_read(&io->ref)) warnings in ublk_queue_reinit()
and ublk_deinit_queue().
Commit c5c5eb24ed ("ublk: avoid ublk_io_release() called after ublk
char dev is closed") already fixed the issue for ublk devices using
UBLK_F_SUPPORT_ZERO_COPY or UBLK_F_AUTO_BUF_REG. However, the reference
count leak also affects UBLK_F_USER_COPY, the other reference-counted
data copy mode. Fix the condition in ublk_check_and_reset_active_ref()
to include all reference-counted data copy modes. This ensures that any
ublk requests still owned by the ublk server when it exits have their
reference counts reset to 0.

Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Fixes: e63d2228ef ("ublk: simplify aborting ublk request")
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-12-20 12:56:37 -07:00
Linus Torvalds
d8ba32c5a4 block-6.19-20251218
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmlEvbAQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgphTLD/9Gx4wnirRvzo1Vz77Odhqqbipww/gjDV0G
 HASzF4KczfusTei4NABfuSVA4/LS28csZOYq0ZeDmFVg+YPQcmr4DNw/eJzzi1Xh
 TDmYuf8y7jAJjwWmX0fgVF2uDg3/kp3tWKfBxSFqpqHrZUkOLr8IY3y8/uJYKakm
 r8K9HsS7hyPhx4xdfRQixUil5TgQxBtCgwTX+JRv/K9AMnEVHkBkJ+m4GSmjJ3n8
 tr5lL8fQmIQcsSWqTvMH/m2xeLG7gr9LE33wfu2sCLvNJ4CAYFGT9aMf2F0nGbzK
 Uu5XOyq1rJ8vKWJuDVfFsjOca4fyyvaKX4bqycv89Z7MVzQOXsjbs+4sHFycsfYC
 P+OBmxLb2+rowNsawyzObT9wnd37xptl+tF5nXL0LL6W4PWHmUgnDuM6Vg5zpNv+
 n0nO6y7I0iL9o4hgqN8kUOr5e7IjvJccJbratOA/jVXmAw1I/Gg6HDo+W7rEqsVP
 0hrlE+RuJSRypbCO2pcvviahaDaCWo3gQmp9TZCBmgqgoiI0WStPAgYrG424+69Y
 Hoq5HBmtN4GV3EnU0Ybuawzl2OKtQwwr9DrRpvrxdN9qm4E44CD6eTmE5fEnrTnX
 Rcs5VosBAavPTbl8LFDtj2h6qPak+5VHHVZwimLNdOyCLTNf7D/kiAjmOHRBbmdt
 AopIuvCipA==
 =lk7H
 -----END PGP SIGNATURE-----

Merge tag 'block-6.19-20251218' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux

Pull block fixes from Jens Axboe:

 - ublk selftests for missing coverage

 - two fixes for the block integrity code

 - fix for the newly added newly added PR read keys ioctl, limiting the
   memory that can be allocated

 - work around for a deadlock that can occur with ublk, where partition
   scanning ends up recursing back into file closure, which needs the
   same mutex grabbed. Not the prettiest thing in the world, but an
   acceptable work-around until we can eliminate the reliance on
   disk->open_mutex for this

 - fix for a race between enabling writeback throttling and new IO
   submissions

 - move a bit of bio flag handling code. No changes, but needed for a
   patchset for a future kernel

 - fix for an init time id leak failure in rnbd

 - loop/zloop state check fix

* tag 'block-6.19-20251218' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux:
  block: validate interval_exp integrity limit
  block: validate pi_offset integrity limit
  block: rnbd-clt: Fix leaked ID in init_dev()
  ublk: fix deadlock when reading partition table
  block: add allocation size check in blkdev_pr_read_keys()
  Documentation: admin-guide: blockdev: replace zone_capacity with zone_capacity_mb when creating devices
  zloop: use READ_ONCE() to read lo->lo_state in queue_rq path
  loop: use READ_ONCE() to read lo->lo_state without locking
  block: fix race between wbt_enable_default and IO submission
  selftests: ublk: add user copy test cases
  selftests: ublk: add support for user copy to kublk
  selftests: ublk: forbid multiple data copy modes
  selftests: ublk: don't share backing files between ublk servers
  selftests: ublk: use auto_zc for PER_IO_DAEMON tests in stress_04
  selftests: ublk: fix fio arguments in run_io_and_recover()
  selftests: ublk: remove unused ios map in seq_io.bt
  selftests: ublk: correct last_rw map type in seq_io.bt
  selftests: ublk: fix overflow in ublk_queue_auto_zc_fallback()
  block: move around bio flagging helpers
2025-12-20 09:48:56 -08:00
Thomas Fourier
c9b5645fd8 block: rnbd-clt: Fix leaked ID in init_dev()
If kstrdup() fails in init_dev(), then the newly allocated ID is lost.

Fixes: 64e8a6ece1 ("block/rnbd-clt: Dynamically alloc buffer for pathname & blk_symlink_name")
Signed-off-by: Thomas Fourier <fourier.thomas@gmail.com>
Acked-by: Jack Wang <jinpu.wang@ionos.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-12-18 08:10:28 -07:00
Ming Lei
c258f5c450 ublk: fix deadlock when reading partition table
When one process(such as udev) opens ublk block device (e.g., to read
the partition table via bdev_open()), a deadlock[1] can occur:

1. bdev_open() grabs disk->open_mutex
2. The process issues read I/O to ublk backend to read partition table
3. In __ublk_complete_rq(), blk_update_request() or blk_mq_end_request()
   runs bio->bi_end_io() callbacks
4. If this triggers fput() on file descriptor of ublk block device, the
   work may be deferred to current task's task work (see fput() implementation)
5. This eventually calls blkdev_release() from the same context
6. blkdev_release() tries to grab disk->open_mutex again
7. Deadlock: same task waiting for a mutex it already holds

The fix is to run blk_update_request() and blk_mq_end_request() with bottom
halves disabled. This forces blkdev_release() to run in kernel work-queue
context instead of current task work context, and allows ublk server to make
forward progress, and avoids the deadlock.

Fixes: 71f28f3136 ("ublk_drv: add io_uring based userspace block driver")
Link: https://github.com/ublk-org/ublksrv/issues/170 [1]
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Caleb Sander Mateos <csander@purestorage.com>
[axboe: rewrite comment in ublk]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-12-17 19:40:28 -07:00
Yongpeng Yang
4b2b03151e zloop: use READ_ONCE() to read lo->lo_state in queue_rq path
In the queue_rq path, zlo->state is accessed without locking, and direct
access may read stale data. This patch uses READ_ONCE() to read
zlo->state and data_race() to silence code checkers, and changes all
assignments to use WRITE_ONCE().

Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Yongpeng Yang <yangyongpeng@xiaomi.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-12-15 09:32:42 -07:00
Yongpeng Yang
54891a96b7 loop: use READ_ONCE() to read lo->lo_state without locking
When lo->lo_mutex is not held, direct access may read stale data. This
patch uses READ_ONCE() to read lo->lo_state and data_race() to silence
code checkers, and changes all assignments to use WRITE_ONCE().

Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Signed-off-by: Yongpeng Yang <yangyongpeng@xiaomi.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-12-15 09:32:42 -07:00
Linus Torvalds
0dfb36b2dc We have a patch that adds an initial set of tracepoints to the MDS
client from Max, a fix that hardens osdmap parsing code from myself
 (marked for stable) and a few assorted fixups.
 -----BEGIN PGP SIGNATURE-----
 
 iQFHBAABCAAxFiEEydHwtzie9C7TfviiSn/eOAIR84sFAmk8YxITHGlkcnlvbW92
 QGdtYWlsLmNvbQAKCRBKf944AhHzi702B/9Mfj0HQSAaWCyMz6GJAv8+1ijVTSbG
 nTFeMmipYZhPn188OvWqHsf1cw9rWY0fzenIEW05Tk3YmMvYdeRmCcOkZdeG06xw
 um8XxX4L8315E0b98CCpQpVa02ux7XoNtBPjeHccl8PEErJQgQJrQ3Cc/C8kk5U5
 a0KlfeVRXYWkOPJva3+wosOu0t9QgJ9ABt5stqcvYDfdkKfvatQMBN3N1nNRKkFH
 yhNPv+nRtypSk8jiHhKeeCVmosC0L7MnKuO593vDr761cY5mKrgKFqc2LLlPF6r0
 /p13s5SG6X38RUegWCjcK4XRJAzVcR/tyod8LkVVp8d5DC/ptcyy8H+I
 =h1gC
 -----END PGP SIGNATURE-----

Merge tag 'ceph-for-6.19-rc1' of https://github.com/ceph/ceph-client

Pull ceph updates from Ilya Dryomov:
 "We have a patch that adds an initial set of tracepoints to the MDS
  client from Max, a fix that hardens osdmap parsing code from myself
  (marked for stable) and a few assorted fixups"

* tag 'ceph-for-6.19-rc1' of https://github.com/ceph/ceph-client:
  rbd: stop selecting CRC32, CRYPTO, and CRYPTO_AES
  ceph: stop selecting CRC32, CRYPTO, and CRYPTO_AES
  libceph: make decode_pool() more resilient against corrupted osdmaps
  libceph: Amend checking to fix `make W=1` build breakage
  ceph: Amend checking to fix `make W=1` build breakage
  ceph: add trace points to the MDS client
  libceph: fix log output race condition in OSD client
2025-12-14 15:24:10 +12:00
Linus Torvalds
35ebee7e72 block-6.19-20251211
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmk7RxcQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpnClD/94oSzn0ViI+kmtPcqiHVilGCQYIaBiQuUN
 N+Z3XiLCPUgPDeWxycbFflQ2pmuODXzOC5XZddC0hitxj4jIqL8jBwI0T/WOPXyw
 A0g8S7/Ny3Le5FftBy7duqjIDycXGYhKYD9sQEvSBTf0yu3QpT4hPRveuPouSPkz
 d4H73j+9VMrLRyXGuALhhdwIaqu6/QRtArjc1Yickisi5dEqpwSrHk0CQEe1zJgs
 wgeItEwfnDVdU0wNiLxSJY0HsTzYXtyYWAT5KiFPCPkHYZd1tadcwZ3D9aLF/oH8
 LzLAX19QrTX11lVXP7FbipClYE5gprKDe4qPTExXQrPD7j3Ba4LWIl4QXZ2A4LPE
 Epam6R+ugOyly2+S2dea1lByoKafviRm4CqR3Ixr+S8ayTUser3oy6I1xGEi9v7D
 qF4LJ1ziLWz1kWoLdoOyJCDv0W3vK1U1Rflt24woOLZNpw2S20q7+vwwLQHoWxnY
 GBDRMi3NjCXH4qCJOaly5tnLNTzdxh0h64WsbjO+DGXOnOr39wH6TN4czkW4PPR5
 IwFpP7HurRJMivoSHP50tRqbFLXETlAdceYV8HuhNYhlCIY6NaQbYr7PKzNk/GcJ
 e2/AkRNgJf5GRzemrfkCtndBCsyi1IMsFN0GXbhH6Xr705Lpkpf5qs77Mexg3+TK
 laf5G/vCDw==
 =h5ad
 -----END PGP SIGNATURE-----

Merge tag 'block-6.19-20251211' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux

Pull block fixes from Jens Axboe:

 - Always initialize DMA state, fixing a potentially nasty issue on the
   block side

 - btrfs zoned write fix with cached zone reports

 - Fix corruption issues in bcache with chained bio's, and further make
   it clear that the chained IO handler is simply a marker, it's not
   code meant to be executed

 - Kill old code dealing with synchronous IO polling in the block layer,
   that has been dead for a long time. Only async polling is supported
   these days

 - Fix a lockdep issue in tag_set management, moving it to RCU

 - Fix an issue with ublks bio_vec iteration

 - Don't unconditionally enforce blocking issue of ublk control
   commands, allow some of them with non-blocking issue as they
   do not block

* tag 'block-6.19-20251211' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux:
  blk-mq-dma: always initialize dma state
  blk-mq: delete task running check in blk_hctx_poll()
  block: fix cached zone reports on devices with native zone append
  block: Use RCU in blk_mq_[un]quiesce_tagset() instead of set->tag_list_lock
  ublk: don't mutate struct bio_vec in iteration
  block: prohibit calls to bio_chain_endio
  bcache: fix improper use of bi_end_io
  ublk: allow non-blocking ctrl cmds in IO_URING_F_NONBLOCK issue
2025-12-12 22:04:18 +12:00
Ilya Dryomov
21c1466ea2 rbd: stop selecting CRC32, CRYPTO, and CRYPTO_AES
None of the RBD code directly requires CRC32, CRYPTO, or CRYPTO_AES.
These options are needed by CEPH_LIB code and they are selected there
directly.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Dongsheng Yang <dongsheng.yang@linux.dev>
2025-12-10 11:50:54 +01:00
Caleb Sander Mateos
db339b4067 ublk: don't mutate struct bio_vec in iteration
__bio_for_each_segment() uses the returned struct bio_vec's bv_len field
to advance the struct bvec_iter at the end of each loop iteration. So
it's incorrect to modify it during the loop. Don't assign to bv_len (or
bv_offset, for that matter) in ublk_copy_user_pages().

Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Fixes: e87d66ab27 ("ublk: use rq_for_each_segment() for user copy")
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-12-09 10:20:18 -07:00
Linus Torvalds
4482ebb297 block-6.19-20251208
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmk3KZsQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpkNYD/91yqAJeehx2Heq3dWj9L8hDuETQelj/g9j
 gtZCiriAPy+bb1/BmWjK+BmvjtBt+g3a4Cwi6tVj4F1zoE46IPeLhO+2iJTEBiBq
 AhRtEf/MFXFK3qUnTpEnS8w3CtsXejOTB81VQ+6BysSu+B708m/1AQHv2HocZ37R
 jivrzfCsEdBr+ISwYw/EG5KcDBVTFo/JdXIhs7k4Z8bBfa3P5ye4EhKjORtgbFNU
 5nXb78SZoWNCZF143YV++9MpZc3M2jzkzrk1CTLsUHhOxWg4T/6wTXfPGZc/W4m8
 UBhs03u/gMJnKHhlZd4kpZWDito1TQZTdY2f5sBsysRQqeT7bwDK/1xiQ1nllZiP
 oYbeD6t65yMAlELwNFXo7y/DNcS2VLBMvChIX6p1gweEzyf23YneoHYyN5agEQlN
 9C4EdcYzZRt0DwtHlIRtKvDk2LZzkJAcLau3D6ahU/DPLOawyWZKmvGiU+sSyJjF
 bEIO5c/+MLqkAgLAGaFgA4twFF1aYH9ssmJerDxprarkf1jtlOBLvUQ391Gtb5Hd
 B1yugmIgEwLbCFzhk9FlCtv2nQcWRCElnaeqv+Lv+xCBVPGCLm2qIHoTqmvHZPCd
 GbN/h0XLdgUboYPCFWVAX72/4K/cv+fQQcb+a7tiq6vMKcgJ/2I1szFGpFqz7azB
 hyiK0v3x2g==
 =r1xa
 -----END PGP SIGNATURE-----

Merge tag 'block-6.19-20251208' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux

Pull block updates from Jens Axboe:
 "Followup set of fixes and updates for block for the 6.19 merge window.

  NVMe had some late minute debates which lead to dropping some patches
  from that tree, which is why the initial PR didn't have NVMe included.
  It's here now. This pull request contains:

   - NVMe pull request via Keith:
       - Subsystem usage cleanups (Max)
       - Endpoint device fixes (Shin'ichiro)
       - Debug statements (Gerd)
       - FC fabrics cleanups and fixes (Daniel)
       - Consistent alloc API usages (Israel)
       - Code comment updates (Chu)
       - Authentication retry fix (Justin)

   - Fix a memory leak in the discard ioctl code, if the task is being
     interrupted by a signal at just the wrong time

   - Zoned write plugging fixes

   - Add ioctls for for persistent reservations

   - Enable per-cpu bio caching by default

   - Various little fixes and tweaks"

* tag 'block-6.19-20251208' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux: (27 commits)
  nvme-fabrics: add ENOKEY to no retry criteria for authentication failures
  nvme-auth: use kvfree() for memory allocated with kvcalloc()
  nvmet-tcp: use kvcalloc for commands array
  nvmet-rdma: use kvcalloc for commands and responses arrays
  nvme: fix typo error in nvme target
  nvmet-fc: use pr_* print macros instead of dev_*
  nvmet-fcloop: remove unused lsdir member.
  nvmet-fcloop: check all request and response have been processed
  nvme-fc: check all request and response have been processed
  block: fix memory leak in __blkdev_issue_zero_pages
  block: fix comment for op_is_zone_mgmt() to include RESET_ALL
  block: Clear BLK_ZONE_WPLUG_PLUGGED when aborting plugged BIOs
  blk-mq: Abort suspend when wakeup events are pending
  blk-mq: add blk_rq_nr_bvec() helper
  block: add IOC_PR_READ_RESERVATION ioctl
  block: add IOC_PR_READ_KEYS ioctl
  nvme: reject invalid pr_read_keys() num_keys values
  scsi: sd: reject invalid pr_read_keys() num_keys values
  block: enable per-cpu bio cache by default
  block: use bio_alloc_bioset for passthru IO by default
  ...
2025-12-09 08:53:24 +09:00
Caleb Sander Mateos
87213b0d84 ublk: allow non-blocking ctrl cmds in IO_URING_F_NONBLOCK issue
Handling most of the ublksrv_ctrl_cmd opcodes require locking a mutex,
so ublk_ctrl_uring_cmd() bails out with EAGAIN when called with the
IO_URING_F_NONBLOCK issue flag. However, several opcodes can be handled
without blocking:
- UBLK_CMD_GET_QUEUE_AFFINITY
- UBLK_CMD_GET_DEV_INFO
- UBLK_CMD_GET_DEV_INFO2
- UBLK_U_CMD_GET_FEATURES

Handle these opcodes synchronously instead of returning EAGAIN so
io_uring doesn't need to issue the command via the worker thread pool.

Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-12-08 13:32:30 -07:00
Linus Torvalds
7203ca412f Significant patch series in this merge are as follows:
- The 10 patch series "__vmalloc()/kvmalloc() and no-block support" from
   Uladzislau Rezki reworks the vmalloc() code to support non-blocking
   allocations (GFP_ATOIC, GFP_NOWAIT).
 
 - The 2 patch series "ksm: fix exec/fork inheritance" from xu xin fixes
   a rare case where the KSM MMF_VM_MERGE_ANY prctl state is not inherited
   across fork/exec.
 
 - The 4 patch series "mm/zswap: misc cleanup of code and documentations"
   from SeongJae Park does some light maintenance work on the zswap code.
 
 - The 5 patch series "mm/page_owner: add debugfs files 'show_handles'
   and 'show_stacks_handles'" from Mauricio Faria de Oliveira enhances the
   /sys/kernel/debug/page_owner debug feature.  It adds unique identifiers
   to differentiate the various stack traces so that userspace monitoring
   tools can better match stack traces over time.
 
 - The 2 patch series "mm/page_alloc: pcp->batch cleanups" from Joshua
   Hahn makes some minor alterations to the page allocator's per-cpu-pages
   feature.
 
 - The 2 patch series "Improve UFFDIO_MOVE scalability by removing
   anon_vma lock" from Lokesh Gidra addresses a scalability issue in
   userfaultfd's UFFDIO_MOVE operation.
 
 - The 2 patch series "kasan: cleanups for kasan_enabled() checks" from
   Sabyrzhan Tasbolatov performs some cleanup in the KASAN code.
 
 - The 2 patch series "drivers/base/node: fold node register and
   unregister functions" from Donet Tom cleans up the NUMA node handling
   code a little.
 
 - The 4 patch series "mm: some optimizations for prot numa" from Kefeng
   Wang provides some cleanups and small optimizations to the NUMA
   allocation hinting code.
 
 - The 5 patch series "mm/page_alloc: Batch callers of
   free_pcppages_bulk" from Joshua Hahn addresses long lock hold times at
   boot on large machines.  These were causing (harmless) softlockup
   warnings.
 
 - The 2 patch series "optimize the logic for handling dirty file folios
   during reclaim" from Baolin Wang removes some now-unnecessary work from
   page reclaim.
 
 - The 10 patch series "mm/damon: allow DAMOS auto-tuned for per-memcg
   per-node memory usage" from SeongJae Park enhances the DAMOS auto-tuning
   feature.
 
 - The 2 patch series "mm/damon: fixes for address alignment issues in
   DAMON_LRU_SORT and DAMON_RECLAIM" from Quanmin Yan fixes DAMON_LRU_SORT
   and DAMON_RECLAIM with certain userspace configuration.
 
 - The 15 patch series "expand mmap_prepare functionality, port more
   users" from Lorenzo Stoakes enhances the new(ish)
   file_operations.mmap_prepare() method and ports additional callsites
   from the old ->mmap() over to ->mmap_prepare().
 
 - The 8 patch series "Fix stale IOTLB entries for kernel address space"
   from Lu Baolu fixes a bug (and possible security issue on non-x86) in
   the IOMMU code.  In some situations the IOMMU could be left hanging onto
   a stale kernel pagetable entry.
 
 - The 4 patch series "mm/huge_memory: cleanup __split_unmapped_folio()"
   from Wei Yang cleans up and optimizes the folio splitting code.
 
 - The 5 patch series "mm, swap: misc cleanup and bugfix" from Kairui
   Song implements some cleanups and a minor fix in the swap discard code.
 
 - The 8 patch series "mm/damon: misc documentation fixups" from SeongJae
   Park does as advertised.
 
 - The 9 patch series "mm/damon: support pin-point targets removal" from
   SeongJae Park permits userspace to remove a specific monitoring target
   in the middle of the current targets list.
 
 - The 2 patch series "mm: MISC follow-up patches for linux/pgalloc.h"
   from Harry Yoo implements a couple of cleanups related to mm header file
   inclusion.
 
 - The 2 patch series "mm/swapfile.c: select swap devices of default
   priority round robin" from Baoquan He improves the selection of swap
   devices for NUMA machines.
 
 - The 3 patch series "mm: Convert memory block states (MEM_*) macros to
   enums" from Israel Batista changes the memory block labels from macros
   to enums so they will appear in kernel debug info.
 
 - The 3 patch series "ksm: perform a range-walk to jump over holes in
   break_ksm" from Pedro Demarchi Gomes addresses an inefficiency when KSM
   unmerges an address range.
 
 - The 22 patch series "mm/damon/tests: fix memory bugs in kunit tests"
   from SeongJae Park fixes leaks and unhandled malloc() failures in DAMON
   userspace unit tests.
 
 - The 2 patch series "some cleanups for pageout()" from Baolin Wang
   cleans up a couple of minor things in the page scanner's
   writeback-for-eviction code.
 
 - The 2 patch series "mm/hugetlb: refactor sysfs/sysctl interfaces" from
   Hui Zhu moves hugetlb's sysfs/sysctl handling code into a new file.
 
 - The 9 patch series "introduce VM_MAYBE_GUARD and make it sticky" from
   Lorenzo Stoakes makes the VMA guard regions available in /proc/pid/smaps
   and improves the mergeability of guarded VMAs.
 
 - The 2 patch series "mm: perform guard region install/remove under VMA
   lock" from Lorenzo Stoakes reduces mmap lock contention for callers
   performing VMA guard region operations.
 
 - The 2 patch series "vma_start_write_killable" from Matthew Wilcox
   starts work in permitting applications to be killed when they are
   waiting on a read_lock on the VMA lock.
 
 - The 11 patch series "mm/damon/tests: add more tests for online
   parameters commit" from SeongJae Park adds additional userspace testing
   of DAMON's "commit" feature.
 
 - The 9 patch series "mm/damon: misc cleanups" from SeongJae Park does
   that.
 
 - The 2 patch series "make VM_SOFTDIRTY a sticky VMA flag" from Lorenzo
   Stoakes addresses the possible loss of a VMA's VM_SOFTDIRTY flag when
   that VMA is merged with another.
 
 - The 16 patch series "mm: support device-private THP" from Balbir Singh
   introduces support for Transparent Huge Page (THP) migration in zone
   device-private memory.
 
 - The 3 patch series "Optimize folio split in memory failure" from Zi
   Yan optimizes folio split operations in the memory failure code.
 
 - The 2 patch series "mm/huge_memory: Define split_type and consolidate
   split support checks" from Wei Yang provides some more cleanups in the
   folio splitting code.
 
 - The 16 patch series "mm: remove is_swap_[pte, pmd]() + non-swap
   entries, introduce leaf entries" from Lorenzo Stoakes cleans up our
   handling of pagetable leaf entries by introducing the concept of
   'software leaf entries', of type softleaf_t.
 
 - The 4 patch series "reparent the THP split queue" from Muchun Song
   reparents the THP split queue to its parent memcg.  This is in
   preparation for addressing the long-standing "dying memcg" problem,
   wherein dead memcg's linger for too long, consuming memory resources.
 
 - The 3 patch series "unify PMD scan results and remove redundant
   cleanup" from Wei Yang does a little cleanup in the hugepage collapse
   code.
 
 - The 6 patch series "zram: introduce writeback bio batching" from
   Sergey Senozhatsky improves zram writeback efficiency by introducing
   batched bio writeback support.
 
 - The 4 patch series "memcg: cleanup the memcg stats interfaces" from
   Shakeel Butt cleans up our handling of the interrupt safety of some
   memcg stats.
 
 - The 4 patch series "make vmalloc gfp flags usage more apparent" from
   Vishal Moola cleans up vmalloc's handling of incoming GFP flags.
 
 - The 6 patch series "mm: Add soft-dirty and uffd-wp support for RISC-V"
   from Chunyan Zhang teches soft dirty and userfaultfd write protect
   tracking to use RISC-V's Svrsw60t59b extension.
 
 - The 5 patch series "mm: swap: small fixes and comment cleanups" from
   Youngjun Park fixes a small bug and cleans up some of the swap code.
 
 - The 4 patch series "initial work on making VMA flags a bitmap" from
   Lorenzo Stoakes starts work on converting the vma struct's flags to a
   bitmap, so we stop running out of them, especially on 32-bit.
 
 - The 2 patch series "mm/swapfile: fix and cleanup swap list iterations"
   from Youngjun Park addresses a possible bug in the swap discard code and
   cleans things up a little.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCaTEb0wAKCRDdBJ7gKXxA
 jjfIAP94W4EkCCwNOupnChoG+YWw/JW21anXt5NN+i5svn1yugEAwzvv6A+cAFng
 o+ug/fyrfPZG7PLp2R8WFyGIP0YoBA4=
 =IUzS
 -----END PGP SIGNATURE-----

Merge tag 'mm-stable-2025-12-03-21-26' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull MM updates from Andrew Morton:

  "__vmalloc()/kvmalloc() and no-block support" (Uladzislau Rezki)
     Rework the vmalloc() code to support non-blocking allocations
     (GFP_ATOIC, GFP_NOWAIT)

  "ksm: fix exec/fork inheritance" (xu xin)
     Fix a rare case where the KSM MMF_VM_MERGE_ANY prctl state is not
     inherited across fork/exec

  "mm/zswap: misc cleanup of code and documentations" (SeongJae Park)
     Some light maintenance work on the zswap code

  "mm/page_owner: add debugfs files 'show_handles' and 'show_stacks_handles'" (Mauricio Faria de Oliveira)
     Enhance the /sys/kernel/debug/page_owner debug feature by adding
     unique identifiers to differentiate the various stack traces so
     that userspace monitoring tools can better match stack traces over
     time

  "mm/page_alloc: pcp->batch cleanups" (Joshua Hahn)
     Minor alterations to the page allocator's per-cpu-pages feature

  "Improve UFFDIO_MOVE scalability by removing anon_vma lock" (Lokesh Gidra)
     Address a scalability issue in userfaultfd's UFFDIO_MOVE operation

  "kasan: cleanups for kasan_enabled() checks" (Sabyrzhan Tasbolatov)

  "drivers/base/node: fold node register and unregister functions" (Donet Tom)
     Clean up the NUMA node handling code a little

  "mm: some optimizations for prot numa" (Kefeng Wang)
     Cleanups and small optimizations to the NUMA allocation hinting
     code

  "mm/page_alloc: Batch callers of free_pcppages_bulk" (Joshua Hahn)
     Address long lock hold times at boot on large machines. These were
     causing (harmless) softlockup warnings

  "optimize the logic for handling dirty file folios during reclaim" (Baolin Wang)
     Remove some now-unnecessary work from page reclaim

  "mm/damon: allow DAMOS auto-tuned for per-memcg per-node memory usage" (SeongJae Park)
     Enhance the DAMOS auto-tuning feature

  "mm/damon: fixes for address alignment issues in DAMON_LRU_SORT and DAMON_RECLAIM" (Quanmin Yan)
     Fix DAMON_LRU_SORT and DAMON_RECLAIM with certain userspace
     configuration

  "expand mmap_prepare functionality, port more users" (Lorenzo Stoakes)
     Enhance the new(ish) file_operations.mmap_prepare() method and port
     additional callsites from the old ->mmap() over to ->mmap_prepare()

  "Fix stale IOTLB entries for kernel address space" (Lu Baolu)
     Fix a bug (and possible security issue on non-x86) in the IOMMU
     code. In some situations the IOMMU could be left hanging onto a
     stale kernel pagetable entry

  "mm/huge_memory: cleanup __split_unmapped_folio()" (Wei Yang)
     Clean up and optimize the folio splitting code

  "mm, swap: misc cleanup and bugfix" (Kairui Song)
     Some cleanups and a minor fix in the swap discard code

  "mm/damon: misc documentation fixups" (SeongJae Park)

  "mm/damon: support pin-point targets removal" (SeongJae Park)
     Permit userspace to remove a specific monitoring target in the
     middle of the current targets list

  "mm: MISC follow-up patches for linux/pgalloc.h" (Harry Yoo)
     A couple of cleanups related to mm header file inclusion

  "mm/swapfile.c: select swap devices of default priority round robin" (Baoquan He)
     improve the selection of swap devices for NUMA machines

  "mm: Convert memory block states (MEM_*) macros to enums" (Israel Batista)
     Change the memory block labels from macros to enums so they will
     appear in kernel debug info

  "ksm: perform a range-walk to jump over holes in break_ksm" (Pedro Demarchi Gomes)
     Address an inefficiency when KSM unmerges an address range

  "mm/damon/tests: fix memory bugs in kunit tests" (SeongJae Park)
     Fix leaks and unhandled malloc() failures in DAMON userspace unit
     tests

  "some cleanups for pageout()" (Baolin Wang)
     Clean up a couple of minor things in the page scanner's
     writeback-for-eviction code

  "mm/hugetlb: refactor sysfs/sysctl interfaces" (Hui Zhu)
     Move hugetlb's sysfs/sysctl handling code into a new file

  "introduce VM_MAYBE_GUARD and make it sticky" (Lorenzo Stoakes)
     Make the VMA guard regions available in /proc/pid/smaps and
     improves the mergeability of guarded VMAs

  "mm: perform guard region install/remove under VMA lock" (Lorenzo Stoakes)
     Reduce mmap lock contention for callers performing VMA guard region
     operations

  "vma_start_write_killable" (Matthew Wilcox)
     Start work on permitting applications to be killed when they are
     waiting on a read_lock on the VMA lock

  "mm/damon/tests: add more tests for online parameters commit" (SeongJae Park)
     Add additional userspace testing of DAMON's "commit" feature

  "mm/damon: misc cleanups" (SeongJae Park)

  "make VM_SOFTDIRTY a sticky VMA flag" (Lorenzo Stoakes)
     Address the possible loss of a VMA's VM_SOFTDIRTY flag when that
     VMA is merged with another

  "mm: support device-private THP" (Balbir Singh)
     Introduce support for Transparent Huge Page (THP) migration in zone
     device-private memory

  "Optimize folio split in memory failure" (Zi Yan)

  "mm/huge_memory: Define split_type and consolidate split support checks" (Wei Yang)
     Some more cleanups in the folio splitting code

  "mm: remove is_swap_[pte, pmd]() + non-swap entries, introduce leaf entries" (Lorenzo Stoakes)
     Clean up our handling of pagetable leaf entries by introducing the
     concept of 'software leaf entries', of type softleaf_t

  "reparent the THP split queue" (Muchun Song)
     Reparent the THP split queue to its parent memcg. This is in
     preparation for addressing the long-standing "dying memcg" problem,
     wherein dead memcg's linger for too long, consuming memory
     resources

  "unify PMD scan results and remove redundant cleanup" (Wei Yang)
     A little cleanup in the hugepage collapse code

  "zram: introduce writeback bio batching" (Sergey Senozhatsky)
     Improve zram writeback efficiency by introducing batched bio
     writeback support

  "memcg: cleanup the memcg stats interfaces" (Shakeel Butt)
     Clean up our handling of the interrupt safety of some memcg stats

  "make vmalloc gfp flags usage more apparent" (Vishal Moola)
     Clean up vmalloc's handling of incoming GFP flags

  "mm: Add soft-dirty and uffd-wp support for RISC-V" (Chunyan Zhang)
     Teach soft dirty and userfaultfd write protect tracking to use
     RISC-V's Svrsw60t59b extension

  "mm: swap: small fixes and comment cleanups" (Youngjun Park)
     Fix a small bug and clean up some of the swap code

  "initial work on making VMA flags a bitmap" (Lorenzo Stoakes)
     Start work on converting the vma struct's flags to a bitmap, so we
     stop running out of them, especially on 32-bit

  "mm/swapfile: fix and cleanup swap list iterations" (Youngjun Park)
     Address a possible bug in the swap discard code and clean things
     up a little

[ This merge also reverts commit ebb9aeb980 ("vfio/nvgrace-gpu:
  register device memory for poison handling") because it looks
  broken to me, I've asked for clarification   - Linus ]

* tag 'mm-stable-2025-12-03-21-26' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (321 commits)
  mm: fix vma_start_write_killable() signal handling
  mm/swapfile: use plist_for_each_entry in __folio_throttle_swaprate
  mm/swapfile: fix list iteration when next node is removed during discard
  fs/proc/task_mmu.c: fix make_uffd_wp_huge_pte() huge pte handling
  mm/kfence: add reboot notifier to disable KFENCE on shutdown
  memcg: remove inc/dec_lruvec_kmem_state helpers
  selftests/mm/uffd: initialize char variable to Null
  mm: fix DEBUG_RODATA_TEST indentation in Kconfig
  mm: introduce VMA flags bitmap type
  tools/testing/vma: eliminate dependency on vma->__vm_flags
  mm: simplify and rename mm flags function for clarity
  mm: declare VMA flags by bit
  zram: fix a spelling mistake
  mm/page_alloc: optimize lowmem_reserve max lookup using its semantic monotonicity
  mm/vmscan: skip increasing kswapd_failures when reclaim was boosted
  pagemap: update BUDDY flag documentation
  mm: swap: remove scan_swap_map_slots() references from comments
  mm: swap: change swap_alloc_slow() to void
  mm, swap: remove redundant comment for read_swap_cache_async
  mm, swap: use SWP_SOLIDSTATE to determine if swap is rotational
  ...
2025-12-05 13:52:43 -08:00
Chaitanya Kulkarni
71075d25ca blk-mq: add blk_rq_nr_bvec() helper
Add a new helper function blk_rq_nr_bvec() that returns the number of
bvecs in a request. This count represents the number of iterations
rq_for_each_bvec() would perform on a request.

Drivers need to pre-allocate bvec arrays before iterating through
a request's bvecs. Currently, they manually count bvecs using
rq_for_each_bvec() in a loop, which is repetitive. The new helper
centralizes this logic.

This pattern exists in loop and zloop drivers, where multi-bio requests
require copying bvecs into a contiguous array before creating
an iov_iter for file operations.

Update loop and zloop drivers to use the new helper, eliminating
duplicate code.

This patch also provides a clear API to avoid any potential misuse of
blk_nr_phys_segments() for calculating the bvecs since, one bvec can
have more than one segments and use of blk_nr_phys_segments() can
lead to extra memory allocation :-

[ 6155.673749] nullb_bio: 128K bio as ONE bvec: sector=0, size=131072
[ 6155.673846] null_blk: #### null_handle_data_transfer:1375
[ 6155.673850] null_blk: nr_bvec=1 blk_rq_nr_phys_segments=2
[ 6155.674263] null_blk: #### null_handle_data_transfer:1375
[ 6155.674267] null_blk: nr_bvec=1 blk_rq_nr_phys_segments=1

Reviewed-by: Niklas Cassel <cassel@kernel.org>
Signed-off-by: Chaitanya Kulkarni <ckulkarnilinux@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-12-04 07:19:26 -07:00
Linus Torvalds
cc25df3e2e for-6.19/block-20251201
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmktsoMQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpuiUD/92eivL+HmOh10o8trvxajB0yuyqfSjHHrL
 g+xUbF4s9bgAg/v+Upx7sTY8jdrTcMjKov+G9T6uPvBMqVmeVdZckA1PSAKQaIX1
 Zb7nS2LnO7F6JKbwpwVrrIaqVbcz8MfGIIMbN4yNNEOMCwdIVMp4fo7trPBknJNx
 WddNSGUFlIF3NqSI8AflSS/pYnGm+McfBHXBpJAKipI3iquKKubHv+FX9kLp7Tn4
 x27ZoCWOHglIBTJXU0mmXCVsLF8b5BA8DQcGtT62azb8+l0cRTkaHY0DFAv5BvhG
 TqcjrKdmR0cGSNt+nEmFrujE3atBRl0G0kiHA80YgA1MTtYzdPaUVOUtM9k/rEem
 gpiGMDpBypdxyJAyijPSaVJdfcg0psOlYbhIR4N2wbj/dq8268h+cWzXlF1spgVt
 /7ygoaCmfMNbTy9rKThTjH+es787AVXUAXXaPHhIFsnCKUj8xQl4pT7XltmgYeWx
 1/XD1NEJeLHHog5upAVlGX3H5tbvP1nIICxbZa9mDOJX1rwxxI7/s/RucPjbNXuY
 AiaKPTfxtB9+Ihd2HrJ/76RVMkckcOBc4GIKoFfwuKDbcdLXQ5FcZCmVRoI1V9SV
 KsH7JBgihLwR9XWKE1vp9+CBNe1Qlu3K4IjG/E7CNLeuDntIBu73ihqGP/DqV6Bq
 RX1Dc0OyAQ==
 =m22w
 -----END PGP SIGNATURE-----

Merge tag 'for-6.19/block-20251201' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux

Pull block updates from Jens Axboe:

 - Fix head insertion for mq-deadline, a regression from when priority
   support was added

 - Series simplifying and improving the ublk user copy code

 - Various ublk related cleanups

 - Fixup REQ_NOWAIT handling in loop/zloop, clearing NOWAIT when the
   request is punted to a thread for handling

 - Merge and then later revert loop dio nowait support, as it ended up
   causing excessive stack usage for when the inline issue code needs to
   dip back into the full file system code

 - Improve auto integrity code, making it less deadlock prone

 - Speedup polled IO handling, but manually managing the hctx lookups

 - Fixes for blk-throttle for SSD devices

 - Small series with fixes for the S390 dasd driver

 - Add support for caching zones, avoiding unnecessary report zone
   queries

 - MD pull requests via Yu:
      - fix null-ptr-dereference regression for dm-raid0
      - fix IO hang for raid5 when array is broken with IO inflight
      - remove legacy 1s delay to speed up system shutdown
      - change maintainer's email address
      - data can be lost if array is created with different lbs devices,
        fix this problem and record lbs of the array in metadata
      - fix rcu protection for md_thread
      - fix mddev kobject lifetime regression
      - enable atomic writes for md-linear
      - some cleanups

 - bcache updates via Coly
      - remove useless discard and cache device code
      - improve usage of per-cpu workqueues

 - Reorganize the IO scheduler switching code, fixing some lockdep
   reports as well

 - Improve the block layer P2P DMA support

 - Add support to the block tracing code for zoned devices

 - Segment calculation improves, and memory alignment flexibility
   improvements

 - Set of prep and cleanups patches for ublk batching support. The
   actual batching hasn't been added yet, but helps shrink down the
   workload of getting that patchset ready for 6.20

 - Fix for how the ps3 block driver handles segments offsets

 - Improve how block plugging handles batch tag allocations

 - nbd fixes for use-after-free of the configuration on device clear/put

 - Set of improvements and fixes for zloop

 - Add Damien as maintainer of the block zoned device code handling

 - Various other fixes and cleanups

* tag 'for-6.19/block-20251201' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux: (162 commits)
  block/rnbd: correct all kernel-doc complaints
  blk-mq: use queue_hctx in blk_mq_map_queue_type
  md: remove legacy 1s delay in md_notify_reboot
  md/raid5: fix IO hang when array is broken with IO inflight
  md: warn about updating super block failure
  md/raid0: fix NULL pointer dereference in create_strip_zones() for dm-raid
  sbitmap: fix all kernel-doc warnings
  ublk: add helper of __ublk_fetch()
  ublk: pass const pointer to ublk_queue_is_zoned()
  ublk: refactor auto buffer register in ublk_dispatch_req()
  ublk: add `union ublk_io_buf` with improved naming
  ublk: add parameter `struct io_uring_cmd *` to ublk_prep_auto_buf_reg()
  kfifo: add kfifo_alloc_node() helper for NUMA awareness
  blk-mq: fix potential uaf for 'queue_hw_ctx'
  blk-mq: use array manage hctx map instead of xarray
  ublk: prevent invalid access with DEBUG
  s390/dasd: Use scnprintf() instead of sprintf()
  s390/dasd: Move device name formatting into separate function
  s390/dasd: Remove unnecessary debugfs_create() return checks
  s390/dasd: Fix gendisk parent after copy pair swap
  ...
2025-12-03 19:26:18 -08:00
Linus Torvalds
0abcfd8983 for-6.19/io_uring-20251201
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmktsm0QHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpiLvD/0dptgeJyLHKchOtRHzi/UvtM/EuNFKJrvI
 LBWCyIMjygxsVfPR41Lave9SE3UpcavF8Mg/EddasTci8VlMcDF8zPxWLb289Lz2
 tkp/wOVuyYmDhNXKmKNW59NOPTd0NosEJFTZI4VhMudwx+UtAHELJGfBWW5hRyQB
 Md+UwZ2+J9HbYd19mToaDFxz7jpIPLEE4BYUGtljveRUdpnxhyFGGUS2+CQXZt/5
 lnRvJmmEv4nSGH9ZRksix1xnV6KvJM0UwYQhrWvXhgwyiKu47zG7ONpd39KqoaRw
 Fw+6zZd0t7nyyuZkk15cKNnBLnjilnsCzmdcPq0Cuvkmbf6y1hlhEQQTGWXTKfJx
 zCZxEZcnCC4wL0CBQjZjS38AEMfH2p76M/36+NTWtlYCibY7qUtd9ndpUr49BYGo
 o4qfT0HMpI1PHuUvpZwpMcf4OX5qvtLmavT9vt78uqmtM+Aryzzuy3bI3S2SGjNe
 if/cNHnZc8Z06hUqdEit5NW+lYzj642AoF/j7qH9ADDH+VXRWaCdK/iI8tPaEpDV
 Rw6j442eVugS5tDPoTjdO8jsJ9+OCNNV1t/Jxy+Or+zrGdq7lfg4mnzEia1/izy5
 8MnSubRy6LEd+I5PnK/9y9mPIwFMIFgULi+mUjucAhJjRj5beiG74eR6+jBAdyp1
 GhFvN6fwdw==
 =4g/f
 -----END PGP SIGNATURE-----

Merge tag 'for-6.19/io_uring-20251201' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux

Pull io_uring updates from Jens Axboe:

 - Unify how task_work cancelations are detected, placing it in the
   task_work running state rather than needing to check the task state

 - Series cleaning up and moving the cancelation code to where it
   belongs, in cancel.c

 - Cleanup of waitid and futex argument handling

 - Add support for mixed sized SQEs. 6.18 added support for mixed sized
   CQEs, improving flexibility and efficiency of workloads that need big
   CQEs. This adds similar support for SQEs, where the occasional need
   for a 128b SQE doesn't necessitate having all SQEs be 128b in size

 - Introduce zcrx and SQ/CQ layout queries. The former returns what zcrx
   features are available. And both return the ring size information to
   help with allocation size calculation for user provided rings like
   IORING_SETUP_NO_MMAP and IORING_MEM_REGION_TYPE_USER

 - Zcrx updates for 6.19. It includes a bunch of small patches,
   IORING_REGISTER_ZCRX_CTRL and RQ flushing and David's work on sharing
   zcrx b/w multiple io_uring instances

 - Series cleaning up ring initializations, notable deduplicating ring
   size and offset calculations. It also moves most of the checking
   before doing any allocations, making the code simpler

 - Add support for getsockname and getpeername, which is mostly a
   trivial hookup after a bit of refactoring on the networking side

 - Various fixes and cleanups

* tag 'for-6.19/io_uring-20251201' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux: (68 commits)
  io_uring: Introduce getsockname io_uring cmd
  socket: Split out a getsockname helper for io_uring
  socket: Unify getsockname and getpeername implementation
  io_uring/query: drop unused io_handle_query_entry() ctx arg
  io_uring/kbuf: remove obsolete buf_nr_pages and update comments
  io_uring/register: use correct location for io_rings_layout
  io_uring/zcrx: share an ifq between rings
  io_uring/zcrx: add io_fill_zcrx_offsets()
  io_uring/zcrx: export zcrx via a file
  io_uring/zcrx: move io_zcrx_scrub() and dependencies up
  io_uring/zcrx: count zcrx users
  io_uring/zcrx: add sync refill queue flushing
  io_uring/zcrx: introduce IORING_REGISTER_ZCRX_CTRL
  io_uring/zcrx: elide passing msg flags
  io_uring/zcrx: use folio_nr_pages() instead of shift operation
  io_uring/zcrx: convert to use netmem_desc
  io_uring/query: introduce rings info query
  io_uring/query: introduce zcrx query
  io_uring: move cq/sq user offset init around
  io_uring: pre-calculate scq layout
  ...
2025-12-03 18:58:57 -08:00
Linus Torvalds
8f7aa3d3c7 Networking changes for 6.19.
Core & protocols
 ----------------
 
  - Replace busylock at the Tx queuing layer with a lockless list. Resulting
    in a 300% (4x) improvement on heavy TX workloads, sending twice the
    number of packets per second, for half the cpu cycles.
 
  - Allow constantly busy flows to migrate to a more suitable CPU/NIC
    queue. Normally we perform queue re-selection when flow comes out
    of idle, but under extreme circumstances the flows may be constantly
    busy. Add sysctl to allow periodic rehashing even if it'd risk packet
    reordering.
 
  - Optimize the NAPI skb cache, make it larger, use it in more paths.
 
  - Attempt returning Tx skbs to the originating CPU (like we already did
    for Rx skbs).
 
  - Various data structure layout and prefetch optimizations from Eric.
 
  - Remove ktime_get() from the recvmsg() fast path, ktime_get() is sadly
    quite expensive on recent AMD machines.
 
  - Extend threaded NAPI polling to allow the kthread busy poll for packets.
 
  - Make MPTCP use Rx backlog processing. This lowers the lock pressure,
    improving the Rx performance.
 
  - Support memcg accounting of MPTCP socket memory.
 
  - Allow admin to opt sockets out of global protocol memory accounting
    (using a sysctl or BPF-based policy). The global limits are a poor fit
    for modern container workloads, where limits are imposed using cgroups.
 
  - Improve heuristics for when to kick off AF_UNIX garbage collection.
 
  - Allow users to control TCP SACK compression, and default to 33% of RTT.
 
  - Add tcp_rcvbuf_low_rtt sysctl to let datacenter users avoid unnecessarily
    aggressive rcvbuf growth and overshot when the connection RTT is low.
 
  - Preserve skb metadata space across skb_push / skb_pull operations.
 
  - Support for IPIP encapsulation in the nftables flowtable offload.
 
  - Support appending IP interface information to ICMP messages (RFC 5837).
 
  - Support setting max record size in TLS (RFC 8449).
 
  - Remove taking rtnl_lock from RTM_GETNEIGHTBL and RTM_SETNEIGHTBL.
 
  - Use a dedicated lock (and RCU) in MPLS, instead of rtnl_lock.
 
  - Let users configure the number of write buffers in SMC.
 
  - Add new struct sockaddr_unsized for sockaddr of unknown length,
    from Kees.
 
  - Some conversions away from the crypto_ahash API, from Eric Biggers.
 
  - Some preparations for slimming down struct page.
 
  - YAML Netlink protocol spec for WireGuard.
 
  - Add a tool on top of YAML Netlink specs/lib for reporting commonly
    computed derived statistics and summarized system state.
 
 Driver API
 ----------
 
  - Add CAN XL support to the CAN Netlink interface.
 
  - Add uAPI for reporting PHY Mean Square Error (MSE) diagnostics,
    as defined by the OPEN Alliance's "Advanced diagnostic features
    for 100BASE-T1 automotive Ethernet PHYs" specification.
 
  - Add DPLL phase-adjust-gran pin attribute (and implement it in zl3073x).
 
  - Refactor xfrm_input lock to reduce contention when NIC offloads IPsec
    and performs RSS.
 
  - Add info to devlink params whether the current setting is the default
    or a user override. Allow resetting back to default.
 
  - Add standard device stats for PSP crypto offload.
 
  - Leverage DSA frame broadcast to implement simple HSR frame duplication
    for a lot of switches without dedicated HSR offload.
 
  - Add uAPI defines for 1.6Tbps link modes.
 
 Device drivers
 --------------
 
  - Add Motorcomm YT921x gigabit Ethernet switch support.
 
  - Add MUCSE driver for N500/N210 1GbE NIC series.
 
  - Convert drivers to support dedicated ops for timestamping control,
    and away from the direct IOCTL handling. While at it support GET
    operations for PHY timestamping.
 
  - Add (and convert most drivers to) a dedicated ethtool callback
    for reading the Rx ring count.
 
  - Significant refactoring efforts in the STMMAC driver, which supports
    Synopsys turn-key MAC IP integrated into a ton of SoCs.
 
  - Ethernet high-speed NICs:
    - Broadcom (bnxt):
      - support PPS in/out on all pins
    - Intel (100G, ice, idpf):
      - ice: implement standard ethtool and timestamping stats
      - i40e: support setting the max number of MAC addresses per VF
      - iavf: support RSS of GTP tunnels for 5G and LTE deployments
    - nVidia/Mellanox (mlx5):
      - reduce downtime on interface reconfiguration
      - disable being an XDP redirect target by default (same as other
        drivers) to avoid wasting resources if feature is unused
    - Meta (fbnic):
      - add support for Linux-managed PCS on 25G, 50G, and 100G links
    - Wangxun:
      - support Rx descriptor merge, and Tx head writeback
      - support Rx coalescing offload
      - support 25G SPF and 40G QSFP modules
 
  - Ethernet virtual:
    - Google (gve):
      - allow ethtool to configure rx_buf_len
      - implement XDP HW RX Timestamping support for DQ descriptor format
    - Microsoft vNIC (mana):
      - support HW link state events
      - handle hardware recovery events when probing the device
 
  - Ethernet NICs consumer, and embedded:
    - usbnet: add support for Byte Queue Limits (BQL)
    - AMD (amd-xgbe):
      - add device selftests
    - NXP (enetc):
      - add i.MX94 support
    - Broadcom integrated MACs (bcmgenet, bcmasp):
      - bcmasp: add support for PHY-based Wake-on-LAN
    - Broadcom switches (b53):
      - support port isolation
      - support BCM5389/97/98 and BCM63XX ARL formats
    - Lantiq/MaxLinear switches:
      - support bridge FDB entries on the CPU port
      - use regmap for register access
      - allow user to enable/disable learning
      - support Energy Efficient Ethernet
      - support configuring RMII clock delays
      - add tagging driver for MaxLinear GSW1xx switches
    - Synopsys (stmmac):
      - support using the HW clock in free running mode
      - add Eswin EIC7700 support
      - add Rockchip RK3506 support
      - add Altera Agilex5 support
    - Cadence (macb):
      - cleanup and consolidate descriptor and DMA address handling
      - add EyeQ5 support
    - TI:
      - icssg-prueth: support AF_XDP
    - Airoha access points:
      - add missing Ethernet stats and link state callback
      - add AN7583 support
      - support out-of-order Tx completion processing
    - Power over Ethernet:
      - pd692x0: preserve PSE configuration across reboots
      - add support for TPS23881B devices
 
  - Ethernet PHYs:
    - Open Alliance OATC14 10BASE-T1S PHY cable diagnostic support
    - Support 50G SerDes and 100G interfaces in Linux-managed PHYs
    - micrel:
      - support for non PTP SKUs of lan8814
      - enable in-band auto-negotiation on lan8814
    - realtek:
      - cable testing support on RTL8224
      - interrupt support on RTL8221B
    - motorcomm: support for PHY LEDs on YT853
    - microchip: support for LAN867X Rev.D0 PHYs w/ SQI and cable diag
    - mscc: support for PHY LED control
 
  - CAN drivers:
    - m_can: add support for optional reset and system wake up
    - remove can_change_mtu() obsoleted by core handling
    - mcp251xfd: support GPIO controller functionality
 
  - Bluetooth:
    - add initial support for PASTa
 
  - WiFi:
    - split ieee80211.h file, it's way too big
    - improvements in VHT radiotap reporting, S1G, Channel Switch
      Announcement handling, rate tracking in mesh networks
    - improve multi-radio monitor mode support, and add a cfg80211 debugfs
      interface for it
    - HT action frame handling on 6 GHz
    - initial chanctx work towards NAN
    - MU-MIMO sniffer improvements
 
  - WiFi drivers:
    - RealTek (rtw89):
      - support USB devices RTL8852AU and RTL8852CU
      - initial work for RTL8922DE
      - improved injection support
    - Intel:
      - iwlwifi: new sniffer API support
    - MediaTek (mt76):
      - WED support for >32-bit DMA
      - airoha NPU support
      - regdomain improvements
      - continued WiFi7/MLO work
    - Qualcomm/Atheros:
      - ath10k: factory test support
      - ath11k: TX power insertion support
      - ath12k: BSS color change support
      - ath12k: statistics improvements
    - brcmfmac: Acer A1 840 tablet quirk
    - rtl8xxxu: 40 MHz connection fixes/support
 
 Signed-off-by: Jakub Kicinski <kuba@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmkveRQACgkQMUZtbf5S
 IrvY7A/+Nb0o4BxLHjPkAl1m3t3q2d0Y29B7SNkwnwEtxAV8EkNeZ3GWrdtDnTQY
 MYhmc7LEzvz8/lihapr7UJkcokzSASUV54hbez5jDBKC8EEoyUk8FdWDPerwlcRI
 zmCFNAVFyh9GX8i7wcrzKbDTHT5+GZLbSlGl9U5mhLsDdRlJgH7d8PJ7vWcmtLFY
 XN0paDyaeHfCl8wReWNAYx4C/I0ODOvlscpO0tnAKhB0ngJbQCKY2t6tn3rOYdif
 ZSQ5KwVRnJtQ4fYOFMOy9+FSCjVXtyrxF8KLxD+mqom2ZhmO00UpOMl09tqhq3uT
 WnvwoHUVBt6F+iITHwg5kMgIDPUq1kpUvL4S4UbVSuUm9ZKD+4KRU2ZHRBYMx+MU
 bsqmtY8/IULClUoRz+tZhltA8eb0NEqNZE2JPOFDiJHn1YiCCkFwxibhir893oM3
 sB7x65D7LQI2ty2BBGVGYnwYDPtyaxOA/s3WTwPvLEi3+Y/TGNIIrS9lBLA4U+Yr
 Gi93WQGVjttMmVyaHgXBUGmi3L52hvolm0AZ8zSRGrnIEpecjhly2KfYuaOzuxXC
 IHEQ6AFLdRh6JzafXGb/mQwGCHNmhwsY8A49i94fakWQamaL/L6A+1dyPu4LXMqi
 NwqCmlVb/LKGlfNG+V4wT27srJ+yBA2Vk3tpR1sZQQytFh0LKHI=
 =UoDR
 -----END PGP SIGNATURE-----

Merge tag 'net-next-6.19' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next

Pull networking updates from Jakub Kicinski:
 "Core & protocols:

   - Replace busylock at the Tx queuing layer with a lockless list.

     Resulting in a 300% (4x) improvement on heavy TX workloads, sending
     twice the number of packets per second, for half the cpu cycles.

   - Allow constantly busy flows to migrate to a more suitable CPU/NIC
     queue.

     Normally we perform queue re-selection when flow comes out of idle,
     but under extreme circumstances the flows may be constantly busy.

     Add sysctl to allow periodic rehashing even if it'd risk packet
     reordering.

   - Optimize the NAPI skb cache, make it larger, use it in more paths.

   - Attempt returning Tx skbs to the originating CPU (like we already
     did for Rx skbs).

   - Various data structure layout and prefetch optimizations from Eric.

   - Remove ktime_get() from the recvmsg() fast path, ktime_get() is
     sadly quite expensive on recent AMD machines.

   - Extend threaded NAPI polling to allow the kthread busy poll for
     packets.

   - Make MPTCP use Rx backlog processing. This lowers the lock
     pressure, improving the Rx performance.

   - Support memcg accounting of MPTCP socket memory.

   - Allow admin to opt sockets out of global protocol memory accounting
     (using a sysctl or BPF-based policy). The global limits are a poor
     fit for modern container workloads, where limits are imposed using
     cgroups.

   - Improve heuristics for when to kick off AF_UNIX garbage collection.

   - Allow users to control TCP SACK compression, and default to 33% of
     RTT.

   - Add tcp_rcvbuf_low_rtt sysctl to let datacenter users avoid
     unnecessarily aggressive rcvbuf growth and overshot when the
     connection RTT is low.

   - Preserve skb metadata space across skb_push / skb_pull operations.

   - Support for IPIP encapsulation in the nftables flowtable offload.

   - Support appending IP interface information to ICMP messages (RFC
     5837).

   - Support setting max record size in TLS (RFC 8449).

   - Remove taking rtnl_lock from RTM_GETNEIGHTBL and RTM_SETNEIGHTBL.

   - Use a dedicated lock (and RCU) in MPLS, instead of rtnl_lock.

   - Let users configure the number of write buffers in SMC.

   - Add new struct sockaddr_unsized for sockaddr of unknown length,
     from Kees.

   - Some conversions away from the crypto_ahash API, from Eric Biggers.

   - Some preparations for slimming down struct page.

   - YAML Netlink protocol spec for WireGuard.

   - Add a tool on top of YAML Netlink specs/lib for reporting commonly
     computed derived statistics and summarized system state.

  Driver API:

   - Add CAN XL support to the CAN Netlink interface.

   - Add uAPI for reporting PHY Mean Square Error (MSE) diagnostics, as
     defined by the OPEN Alliance's "Advanced diagnostic features for
     100BASE-T1 automotive Ethernet PHYs" specification.

   - Add DPLL phase-adjust-gran pin attribute (and implement it in
     zl3073x).

   - Refactor xfrm_input lock to reduce contention when NIC offloads
     IPsec and performs RSS.

   - Add info to devlink params whether the current setting is the
     default or a user override. Allow resetting back to default.

   - Add standard device stats for PSP crypto offload.

   - Leverage DSA frame broadcast to implement simple HSR frame
     duplication for a lot of switches without dedicated HSR offload.

   - Add uAPI defines for 1.6Tbps link modes.

  Device drivers:

   - Add Motorcomm YT921x gigabit Ethernet switch support.

   - Add MUCSE driver for N500/N210 1GbE NIC series.

   - Convert drivers to support dedicated ops for timestamping control,
     and away from the direct IOCTL handling. While at it support GET
     operations for PHY timestamping.

   - Add (and convert most drivers to) a dedicated ethtool callback for
     reading the Rx ring count.

   - Significant refactoring efforts in the STMMAC driver, which
     supports Synopsys turn-key MAC IP integrated into a ton of SoCs.

   - Ethernet high-speed NICs:
      - Broadcom (bnxt):
         - support PPS in/out on all pins
      - Intel (100G, ice, idpf):
         - ice: implement standard ethtool and timestamping stats
         - i40e: support setting the max number of MAC addresses per VF
         - iavf: support RSS of GTP tunnels for 5G and LTE deployments
      - nVidia/Mellanox (mlx5):
         - reduce downtime on interface reconfiguration
         - disable being an XDP redirect target by default (same as
           other drivers) to avoid wasting resources if feature is
           unused
      - Meta (fbnic):
         - add support for Linux-managed PCS on 25G, 50G, and 100G links
      - Wangxun:
         - support Rx descriptor merge, and Tx head writeback
         - support Rx coalescing offload
         - support 25G SPF and 40G QSFP modules

   - Ethernet virtual:
      - Google (gve):
         - allow ethtool to configure rx_buf_len
         - implement XDP HW RX Timestamping support for DQ descriptor
           format
      - Microsoft vNIC (mana):
         - support HW link state events
         - handle hardware recovery events when probing the device

   - Ethernet NICs consumer, and embedded:
      - usbnet: add support for Byte Queue Limits (BQL)
      - AMD (amd-xgbe):
         - add device selftests
      - NXP (enetc):
         - add i.MX94 support
      - Broadcom integrated MACs (bcmgenet, bcmasp):
         - bcmasp: add support for PHY-based Wake-on-LAN
      - Broadcom switches (b53):
         - support port isolation
         - support BCM5389/97/98 and BCM63XX ARL formats
      - Lantiq/MaxLinear switches:
         - support bridge FDB entries on the CPU port
         - use regmap for register access
         - allow user to enable/disable learning
         - support Energy Efficient Ethernet
         - support configuring RMII clock delays
         - add tagging driver for MaxLinear GSW1xx switches
      - Synopsys (stmmac):
         - support using the HW clock in free running mode
         - add Eswin EIC7700 support
         - add Rockchip RK3506 support
         - add Altera Agilex5 support
      - Cadence (macb):
         - cleanup and consolidate descriptor and DMA address handling
         - add EyeQ5 support
      - TI:
         - icssg-prueth: support AF_XDP
      - Airoha access points:
         - add missing Ethernet stats and link state callback
         - add AN7583 support
         - support out-of-order Tx completion processing
      - Power over Ethernet:
         - pd692x0: preserve PSE configuration across reboots
         - add support for TPS23881B devices

   - Ethernet PHYs:
      - Open Alliance OATC14 10BASE-T1S PHY cable diagnostic support
      - Support 50G SerDes and 100G interfaces in Linux-managed PHYs
      - micrel:
         - support for non PTP SKUs of lan8814
         - enable in-band auto-negotiation on lan8814
      - realtek:
         - cable testing support on RTL8224
         - interrupt support on RTL8221B
      - motorcomm: support for PHY LEDs on YT853
      - microchip: support for LAN867X Rev.D0 PHYs w/ SQI and cable diag
      - mscc: support for PHY LED control

   - CAN drivers:
      - m_can: add support for optional reset and system wake up
      - remove can_change_mtu() obsoleted by core handling
      - mcp251xfd: support GPIO controller functionality

   - Bluetooth:
      - add initial support for PASTa

   - WiFi:
      - split ieee80211.h file, it's way too big
      - improvements in VHT radiotap reporting, S1G, Channel Switch
        Announcement handling, rate tracking in mesh networks
      - improve multi-radio monitor mode support, and add a cfg80211
        debugfs interface for it
      - HT action frame handling on 6 GHz
      - initial chanctx work towards NAN
      - MU-MIMO sniffer improvements

   - WiFi drivers:
      - RealTek (rtw89):
         - support USB devices RTL8852AU and RTL8852CU
         - initial work for RTL8922DE
         - improved injection support
      - Intel:
         - iwlwifi: new sniffer API support
      - MediaTek (mt76):
         - WED support for >32-bit DMA
         - airoha NPU support
         - regdomain improvements
         - continued WiFi7/MLO work
      - Qualcomm/Atheros:
         - ath10k: factory test support
         - ath11k: TX power insertion support
         - ath12k: BSS color change support
         - ath12k: statistics improvements
      - brcmfmac: Acer A1 840 tablet quirk
      - rtl8xxxu: 40 MHz connection fixes/support"

* tag 'net-next-6.19' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1381 commits)
  net: page_pool: sanitise allocation order
  net: page pool: xa init with destroy on pp init
  net/mlx5e: Support XDP target xmit with dummy program
  net/mlx5e: Update XDP features in switch channels
  selftests/tc-testing: Test CAKE scheduler when enqueue drops packets
  net/sched: sch_cake: Fix incorrect qlen reduction in cake_drop
  wireguard: netlink: generate netlink code
  wireguard: uapi: generate header with ynl-gen
  wireguard: uapi: move flag enums
  wireguard: uapi: move enum wg_cmd
  wireguard: netlink: add YNL specification
  selftests: drv-net: Fix tolerance calculation in devlink_rate_tc_bw.py
  selftests: drv-net: Fix and clarify TC bandwidth split in devlink_rate_tc_bw.py
  selftests: drv-net: Set shell=True for sysfs writes in devlink_rate_tc_bw.py
  selftests: drv-net: Use Iperf3Runner in devlink_rate_tc_bw.py
  selftests: drv-net: introduce Iperf3Runner for measurement use cases
  selftests: drv-net: Add devlink_rate_tc_bw.py to TEST_PROGS
  net: ps3_gelic_net: Use napi_alloc_skb() and napi_gro_receive()
  Documentation: net: dsa: mention simple HSR offload helpers
  Documentation: net: dsa: mention availability of RedBox
  ...
2025-12-03 17:24:33 -08:00
Linus Torvalds
784faa8eca Rust changes for v6.19
Toolchain and infrastructure:
 
  - Add support for 'syn'.
 
      Syn is a parsing library for parsing a stream of Rust tokens into a
      syntax tree of Rust source code.
 
      Currently this library is geared toward use in Rust procedural
      macros, but contains some APIs that may be useful more generally.
 
    'syn' allows us to greatly simplify writing complex macros such as
    'pin-init' (Benno has already prepared the 'syn'-based version). We
    will use it in the 'macros' crate too.
 
    'syn' is the most downloaded Rust crate (according to crates.io), and
    it is also used by the Rust compiler itself. While the amount of code
    is substantial, there should not be many updates needed for these
    crates, and even if there are, they should not be too big, e.g. +7k
    -3k lines across the 3 crates in the last year.
 
    'syn' requires two smaller dependencies: 'quote' and 'proc-macro2'.
    I only modified their code to remove a third dependency
    ('unicode-ident') and to add the SPDX identifiers. The code can be
    easily verified to exactly match upstream with the provided scripts.
 
    They are all licensed under "Apache-2.0 OR MIT", like the other
    vendored 'alloc' crate we had for a while.
 
    Please see the merge commit with the cover letter for more context.
 
  - Allow 'unreachable_pub' and 'clippy::disallowed_names' for doctests.
 
    Examples (i.e. doctests) may want to do things like show public items
    and use names such as 'foo'.
 
    Nevertheless, we still try to keep examples as close to real code as
    possible (this is part of why running Clippy on doctests is important
    for us, e.g. for safety comments, which userspace Rust does not
    support yet but we are stricter).
 
 'kernel' crate:
 
  - Replace our custom 'CStr' type with 'core::ffi::CStr'.
 
    Using the standard library type reduces our custom code footprint,
    and we retain needed custom functionality through an extension trait
    and a new 'fmt!' macro which replaces the previous 'core' import.
 
    This started in 6.17 and continued in 6.18, and we finally land the
    replacement now. This required quite some stamina from Tamir, who
    split the changes in steps to prepare for the flag day change here.
 
  - Replace 'kernel::c_str!' with C string literals.
 
    C string literals were added in Rust 1.77, which produce '&CStr's
    (the 'core' one), so now we can write:
 
        c"hi"
 
    instead of:
 
        c_str!("hi")
 
  - Add 'num' module for numerical features.
 
    It includes the 'Integer' trait, implemented for all primitive
    integer types.
 
    It also includes the 'Bounded' integer wrapping type: an integer
    value that requires only the 'N' less significant bits of the wrapped
    type to be encoded:
 
        // An unsigned 8-bit integer, of which only the 4 LSBs are used.
        let v = Bounded::<u8, 4>:🆕:<15>();
        assert_eq!(v.get(), 15);
 
    'Bounded' is useful to e.g. enforce guarantees when working with
    bitfields that have an arbitrary number of bits.
 
    Values can be constructed from simple non-constant expressions or,
    for more complex ones, validated at runtime.
 
    'Bounded' also comes with comparison and arithmetic operations (with
    both their backing type and other 'Bounded's with a compatible
    backing type), casts to change the backing type, extending/shrinking
    and infallible/fallible conversions from/to primitives as applicable.
 
  - 'rbtree' module: add immutable cursor ('Cursor').
 
    It enables to use just an immutable tree reference where appropriate.
    The existing fully-featured mutable cursor is renamed to 'CursorMut'.
 
 kallsyms:
 
  - Fix wrong "big" kernel symbol type read from procfs.
 
 'pin-init' crate:
 
  - A couple minor fixes (Benno asked me to pick these patches up for
    him this cycle).
 
 Documentation:
 
  - Quick Start guide: add Debian 13 (Trixie).
 
    Debian Stable is now able to build Linux, since Debian 13 (released
    2025-08-09) packages Rust 1.85.0, which is recent enough.
 
    We are planning to propose that the minimum supported Rust version in
    Linux follows Debian Stable releases, with Debian 13 being the first
    one we upgrade to, i.e. Rust 1.85.
 
 MAINTAINERS:
 
  - Add entry for the new 'num' module.
 
  - Remove Alex as Rust maintainer: he hasn't had the time to contribute
    for a few years now, so it is a no-op change in practice.
 
 And a few other cleanups and improvements.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEPjU5OPd5QIZ9jqqOGXyLc2htIW0FAmks0WkACgkQGXyLc2ht
 IW3QQg/+LpBmrz0ZSKH24kcX3x/hpfA2Erd4cmn+qjXev9RSAM1bt9xf3dsAhhFd
 BStUpf0aglHOSEuWAvNHqEb5+yu+6qy6EFXXqH0ASexHK93t77Jbztjtf3SMlykT
 N/lSJ+LWw2WiRT0NRWoTfKaEWzZQ8j9fi9Jb/IGNZGdNMryisVUYWqzLwNupPuK+
 RMcEitHdO2NWjyodk2GGRyYQ7+XxQgbXZoxtgeubPSrrmGuGTXV42RlQKC2KHPx3
 gz6CwcO3Xd0bGHHSgP32QDtGRJtniO8iXBKxiooT+ys+M83fTKbwNrIrW3tHdheY
 765qsd/NvUmAkcgTCoLqj5biU6LCsepyimNg1vf4pYFohBoTaGeN+UqzbXBrSjy2
 pmrgxwMRVHsYz+zoSKAVKJl7ASba5BXFdI4Whgfqwwc9So/X7uyujIYXGbRoznCV
 W5vu7OUboBy26NvcsPrf6BqWcsJEpGV/M4z2UBRjAoJTRGQMcm/ckuo/GfYm3yW+
 bUW62UmVCdY5crpo7XPH/G4ZGBR/k3p9dLVt8OJxEoTlfw4KDE5BszJoXmejZqdi
 9LEMhzTWwoFp9NspQuEGdYdfGRlfG6XXqrwGZtQI+dlc4RvFEgBBu2Lxotq+Ods0
 EfCVCJQjWmyCodVdJ/QqbCRFuXtOFLr/hPdWnvlrRxVkPtF2CDw=
 =9nM+
 -----END PGP SIGNATURE-----

Merge tag 'rust-6.19' of git://git.kernel.org/pub/scm/linux/kernel/git/ojeda/linux

Pull Rust updates from Miguel Ojeda:
 "Toolchain and infrastructure:

   - Add support for 'syn'.

     Syn is a parsing library for parsing a stream of Rust tokens into a
     syntax tree of Rust source code.

     Currently this library is geared toward use in Rust procedural
     macros, but contains some APIs that may be useful more generally.

     'syn' allows us to greatly simplify writing complex macros such as
     'pin-init' (Benno has already prepared the 'syn'-based version). We
     will use it in the 'macros' crate too.

     'syn' is the most downloaded Rust crate (according to crates.io),
     and it is also used by the Rust compiler itself. While the amount
     of code is substantial, there should not be many updates needed for
     these crates, and even if there are, they should not be too big,
     e.g. +7k -3k lines across the 3 crates in the last year.

     'syn' requires two smaller dependencies: 'quote' and 'proc-macro2'.
     I only modified their code to remove a third dependency
     ('unicode-ident') and to add the SPDX identifiers. The code can be
     easily verified to exactly match upstream with the provided
     scripts.

     They are all licensed under "Apache-2.0 OR MIT", like the other
     vendored 'alloc' crate we had for a while.

     Please see the merge commit with the cover letter for more context.

   - Allow 'unreachable_pub' and 'clippy::disallowed_names' for
     doctests.

     Examples (i.e. doctests) may want to do things like show public
     items and use names such as 'foo'.

     Nevertheless, we still try to keep examples as close to real code
     as possible (this is part of why running Clippy on doctests is
     important for us, e.g. for safety comments, which userspace Rust
     does not support yet but we are stricter).

  'kernel' crate:

   - Replace our custom 'CStr' type with 'core::ffi::CStr'.

     Using the standard library type reduces our custom code footprint,
     and we retain needed custom functionality through an extension
     trait and a new 'fmt!' macro which replaces the previous 'core'
     import.

     This started in 6.17 and continued in 6.18, and we finally land the
     replacement now. This required quite some stamina from Tamir, who
     split the changes in steps to prepare for the flag day change here.

   - Replace 'kernel::c_str!' with C string literals.

     C string literals were added in Rust 1.77, which produce '&CStr's
     (the 'core' one), so now we can write:

         c"hi"

     instead of:

         c_str!("hi")

   - Add 'num' module for numerical features.

     It includes the 'Integer' trait, implemented for all primitive
     integer types.

     It also includes the 'Bounded' integer wrapping type: an integer
     value that requires only the 'N' least significant bits of the
     wrapped type to be encoded:

         // An unsigned 8-bit integer, of which only the 4 LSBs are used.
         let v = Bounded::<u8, 4>:🆕:<15>();
         assert_eq!(v.get(), 15);

     'Bounded' is useful to e.g. enforce guarantees when working with
     bitfields that have an arbitrary number of bits.

     Values can also be constructed from simple non-constant expressions
     or, for more complex ones, validated at runtime.

     'Bounded' also comes with comparison and arithmetic operations
     (with both their backing type and other 'Bounded's with a
     compatible backing type), casts to change the backing type,
     extending/shrinking and infallible/fallible conversions from/to
     primitives as applicable.

   - 'rbtree' module: add immutable cursor ('Cursor').

     It enables to use just an immutable tree reference where
     appropriate. The existing fully-featured mutable cursor is renamed
     to 'CursorMut'.

  kallsyms:

   - Fix wrong "big" kernel symbol type read from procfs.

  'pin-init' crate:

   - A couple minor fixes (Benno asked me to pick these patches up for
     him this cycle).

  Documentation:

   - Quick Start guide: add Debian 13 (Trixie).

     Debian Stable is now able to build Linux, since Debian 13 (released
     2025-08-09) packages Rust 1.85.0, which is recent enough.

     We are planning to propose that the minimum supported Rust version
     in Linux follows Debian Stable releases, with Debian 13 being the
     first one we upgrade to, i.e. Rust 1.85.

  MAINTAINERS:

   - Add entry for the new 'num' module.

   - Remove Alex as Rust maintainer: he hasn't had the time to
     contribute for a few years now, so it is a no-op change in
     practice.

  And a few other cleanups and improvements"

* tag 'rust-6.19' of git://git.kernel.org/pub/scm/linux/kernel/git/ojeda/linux: (53 commits)
  rust: macros: support `proc-macro2`, `quote` and `syn`
  rust: syn: enable support in kbuild
  rust: syn: add `README.md`
  rust: syn: remove `unicode-ident` dependency
  rust: syn: add SPDX License Identifiers
  rust: syn: import crate
  rust: quote: enable support in kbuild
  rust: quote: add `README.md`
  rust: quote: add SPDX License Identifiers
  rust: quote: import crate
  rust: proc-macro2: enable support in kbuild
  rust: proc-macro2: add `README.md`
  rust: proc-macro2: remove `unicode_ident` dependency
  rust: proc-macro2: add SPDX License Identifiers
  rust: proc-macro2: import crate
  rust: kbuild: support using libraries in `rustc_procmacro`
  rust: kbuild: support skipping flags in `rustc_test_library`
  rust: kbuild: add proc macro library support
  rust: kbuild: simplify `--cfg` handling
  rust: kbuild: introduce `core-flags` and `core-skip_flags`
  ...
2025-12-03 14:16:49 -08:00
Linus Torvalds
1d18101a64 kernel-6.19-rc1.cred
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCaSmOZQAKCRCRxhvAZXjc
 orJLAP9UD+dX6cicJDkzFZowDakmoIQkR5ZSDwChSlmvLcmquwEAlSq4svVd9Bdl
 7kOFUk71DqhVHrPAwO7ap0BxehokEAA=
 =Cli6
 -----END PGP SIGNATURE-----

Merge tag 'kernel-6.19-rc1.cred' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull cred guard updates from Christian Brauner:
 "This contains substantial credential infrastructure improvements
  adding guard-based credential management that simplifies code and
  eliminates manual reference counting in many subsystems.

  Features:

   - Kernel Credential Guards

     Add with_kernel_creds() and scoped_with_kernel_creds() guards that
     allow using the kernel credentials without allocating and copying
     them. This was requested by Linus after seeing repeated
     prepare_kernel_creds() calls that duplicate the kernel credentials
     only to drop them again later.

     The new guards completely avoid the allocation and never expose the
     temporary variable to hold the kernel credentials anywhere in
     callers.

   - Generic Credential Guards

     Add scoped_with_creds() guards for the common override_creds() and
     revert_creds() pattern. This builds on earlier work that made
     override_creds()/revert_creds() completely reference count free.

   - Prepare Credential Guards

     Add prepare credential guards for the more complex pattern of
     preparing a new set of credentials and overriding the current
     credentials with them:
      - prepare_creds()
      - modify new creds
      - override_creds()
      - revert_creds()
      - put_cred()

  Cleanups:

   - Make init_cred static since it should not be directly accessed

   - Add kernel_cred() helper to properly access the kernel credentials

   - Fix scoped_class() macro that was introduced two cycles ago

   - coredump: split out do_coredump() from vfs_coredump() for cleaner
     credential handling

   - coredump: move revert_cred() before coredump_cleanup()

   - coredump: mark struct mm_struct as const

   - coredump: pass struct linux_binfmt as const

   - sev-dev: use guard for path"

* tag 'kernel-6.19-rc1.cred' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (36 commits)
  trace: use override credential guard
  trace: use prepare credential guard
  coredump: use override credential guard
  coredump: use prepare credential guard
  coredump: split out do_coredump() from vfs_coredump()
  coredump: mark struct mm_struct as const
  coredump: pass struct linux_binfmt as const
  coredump: move revert_cred() before coredump_cleanup()
  sev-dev: use override credential guards
  sev-dev: use prepare credential guard
  sev-dev: use guard for path
  cred: add prepare credential guard
  net/dns_resolver: use credential guards in dns_query()
  cgroup: use credential guards in cgroup_attach_permissions()
  act: use credential guards in acct_write_process()
  smb: use credential guards in cifs_get_spnego_key()
  nfs: use credential guards in nfs_idmap_get_key()
  nfs: use credential guards in nfs_local_call_write()
  nfs: use credential guards in nfs_local_call_read()
  erofs: use credential guards
  ...
2025-12-01 13:45:41 -08:00
Randy Dunlap
d211a28035 block/rnbd: correct all kernel-doc complaints
Fix all kernel-doc warnings in rnbd-proto.h:
- use correct enum name in kdoc comment
- mark several struct members as "/* private: */" so that no kdoc is
  required for them
- don't use "/**" for a non-kernel-doc comment
- use the correct struct member name for "dev_name"
- use " *" for a blank kernel-doc line

Fixes these warnings:
Warning: drivers/block/rnbd/rnbd-proto.h:41 expecting prototype for
 enum rnbd_msg_types. Prototype was for enum rnbd_msg_type instead
Warning: drivers/block/rnbd/rnbd-proto.h:50 struct member '__padding'
 not described in 'rnbd_msg_hdr'
Warning: drivers/block/rnbd/rnbd-proto.h:53 This comment starts with
 '/**', but isn't a kernel-doc comment.
 * We allow to map RO many times and RW only once. We allow to map yet another
Warning: drivers/block/rnbd/rnbd-proto.h:81 struct member 'reserved'
 not described in 'rnbd_msg_sess_info'
Warning: drivers/block/rnbd/rnbd-proto.h:92 struct member 'reserved'
 not described in 'rnbd_msg_sess_info_rsp'
Warning: drivers/block/rnbd/rnbd-proto.h:107 struct member 'resv1'
 not described in 'rnbd_msg_open'
Warning: drivers/block/rnbd/rnbd-proto.h:107 struct member 'dev_name'
 not described in 'rnbd_msg_open'
Warning: drivers/block/rnbd/rnbd-proto.h:107 struct member 'reserved'
 not described in 'rnbd_msg_open'
Warning: drivers/block/rnbd/rnbd-proto.h:158 struct member 'reserved'
 not described in 'rnbd_msg_open_rsp'
Warning: drivers/block/rnbd/rnbd-proto.h:189 bad line:

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Acked-by: Jack Wang <jinpu.wang@ionos.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-12-01 07:19:50 -07:00
Chu Guangqing
8f4338b114 zram: fix a spelling mistake
The spelling of the word "relases" is incorrect; it should be "releases".

Link: https://lkml.kernel.org/r/20251125020522.1913-1-chuguangqing@inspur.com
Signed-off-by: Chu Guangqing <chuguangqing@inspur.com>
Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-11-29 10:41:08 -08:00
Ming Lei
28d7a371f0 ublk: add helper of __ublk_fetch()
Add helper __ublk_fetch() for refactoring ublk_fetch().

Meantime move ublk_config_io_buf() out of __ublk_fetch() to make
the code structure cleaner.

Reviewed-by: Caleb Sander Mateos <csander@purestorage.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-28 09:20:13 -07:00
Ming Lei
3443bab2f8 ublk: pass const pointer to ublk_queue_is_zoned()
Pass const pointer to ublk_queue_is_zoned() because it is readonly.

Reviewed-by: Caleb Sander Mateos <csander@purestorage.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-28 09:20:13 -07:00
Ming Lei
0a9beafa7c ublk: refactor auto buffer register in ublk_dispatch_req()
Refactor auto buffer register code and prepare for supporting batch IO
feature, and the main motivation is to put 'ublk_io' operation code
together, so that per-io lock can be applied for the code block.

The key changes are:
- Rename ublk_auto_buf_reg() as ublk_do_auto_buf_reg()
- Introduce an enum `auto_buf_reg_res` to represent the result of
  the buffer registration attempt (FAIL, FALLBACK, OK).
- Split the existing `ublk_do_auto_buf_reg` function into two:
  - `__ublk_do_auto_buf_reg`: Performs the actual buffer registration
    and returns the `auto_buf_reg_res` status.
  - `ublk_do_auto_buf_reg`: A wrapper that calls the internal function
    and handles the I/O preparation based on the result.
- Introduce `ublk_prep_auto_buf_reg_io` to encapsulate the logic for
  preparing the I/O for completion after buffer registration.
- Pass the `tag` directly to `ublk_auto_buf_reg_fallback` to avoid
  recalculating it.

This refactoring makes the control flow clearer and isolates the different
stages of the auto buffer registration process.

Reviewed-by: Caleb Sander Mateos <csander@purestorage.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-28 09:20:13 -07:00
Ming Lei
8d61ece156 ublk: add union ublk_io_buf with improved naming
Add `union ublk_io_buf` for naming the anonymous union of struct ublk_io's
addr and buf fields, meantime apply it to `struct ublk_io` for storing either
ublk auto buffer register data or ublk server io buffer address.

The union uses clear field names:
- `addr`: for regular ublk server io buffer addresses
- `auto_reg`: for ublk auto buffer registration data

This eliminates confusing access patterns and improves code readability.

Reviewed-by: Caleb Sander Mateos <csander@purestorage.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-28 09:20:13 -07:00
Ming Lei
3035b9b46b ublk: add parameter struct io_uring_cmd * to ublk_prep_auto_buf_reg()
Add parameter `struct io_uring_cmd *` to ublk_prep_auto_buf_reg() and
prepare for reusing this helper for the coming UBLK_BATCH_IO feature,
which can fetch & commit one batch of io commands via single uring_cmd.

Reviewed-by: Caleb Sander Mateos <csander@purestorage.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-28 09:20:13 -07:00
Kevin Brodsky
c6a45ee760 ublk: prevent invalid access with DEBUG
ublk_ch_uring_cmd_local() may jump to the out label before
initialising the io pointer. This will cause trouble if DEBUG is
defined, because the pr_devel() call dereferences io. Clang reports:

drivers/block/ublk_drv.c:2403:6: error: variable 'io' is used uninitialized whenever 'if' condition is true [-Werror,-Wsometimes-uninitialized]
 2403 |         if (tag >= ub->dev_info.queue_depth)
      |             ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
drivers/block/ublk_drv.c:2492:32: note: uninitialized use occurs here
 2492 |                         __func__, cmd_op, tag, ret, io->flags);
      |

Fix this by initialising io to NULL and checking it before
dereferencing it.

Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com>
Fixes: 71f28f3136 ("ublk_drv: add io_uring based userspace block driver")
Reviewed-by: Caleb Sander Mateos <csander@purestorage.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-26 10:59:18 -07:00
Jens Axboe
96f03c8cb2 Revert "Merge branch 'loop-aio-nowait' into for-6.19/block"
This reverts commit f43fdeb9a3, reversing
changes made to 2c6d792d4b.

There are concerns that doing inline submits can cause excessive
stack usage, particularly when going back into the filesystem. Revert
the loop dio nowait change for now.

Link: https://lore.kernel.org/linux-block/aSP3SG_KaROJTBHx@infradead.org/
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-24 20:53:19 -07:00
Yuwen Chen
04d31610a7 zram: fix the issue that the write - back limits might overflow
When the page size exceeds 4KB, if bd_wb_limit is set to a value that is
not aligned with the page size, it will cause a numerical wrap-around
issue for bd_wb_limit.  For example, when the page size is set to 16KB and
bd_wb_limit is set to 3, after one write-back operation, the value of
bd_wb_limit will become -1.  More seriously, since bd_wb_limit is an
unsigned number, its value may become as large as 2^64 - 1.

The core reason for this problem is that the unit of bd_wb_limit is 4KB. 
For example, when a write-back occurs on a system with a page size of
16KB, 4 needs to be subtracted from bd_wb_limit.  This operation takes
place in the zram_account_writeback_submit function.

This patch fixes the issue by limiting bd_wb_limit to be an integer
multiple of PAGE_SIZE / 4096.

Link: https://lkml.kernel.org/r/tencent_5936CFE72BAB2BA76887BB69DCC1B5E67C05@qq.com
Fixes: 1d69a3f8ae ("zram: idle writeback fixes and cleanup")
Signed-off-by: Yuwen Chen <ywen.chen@foxmail.com>
Acked-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Brian Geffon <bgeffon@google.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Richard Chang <richardycc@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-11-24 15:08:55 -08:00
Sergey Senozhatsky
1b1a4e4d67 zram: read slot block idx under slot lock
Read slot's block id under slot-lock.  We release the slot-lock for bdev
read so, technically, slot still can get freed in the meantime, but at
least we will read bdev block (page) that holds previous know slot data,
not from slot->handle bdev block, which can be anything at that point.

Link: https://lkml.kernel.org/r/20251122074029.3948921-7-senozhatsky@chromium.org
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Brian Geffon <bgeffon@google.com>
Cc: Minchan Kim <minchan@google.com>
Cc: Richard Chang <richardycc@google.com>
Cc: Yuwen Chen <ywen.chen@foxmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-11-24 15:08:53 -08:00
Sergey Senozhatsky
e87ddea345 zram: rework bdev block allocation
First, writeback bdev ->bitmap bits are set only from one context, as we
can have only one single task performing writeback, so we cannot race with
anything else.  Remove retry path.

Second, we always check ZRAM_WB flag to distinguish writtenback slots, so
we should not confuse 0 bdev block index and 0 handle.  We can use first
bdev block (0 bit) for writeback as well.

While at it, give functions slightly more accurate names, as we don't
alloc/free anything there, we reserve a block for async writeback or
release the block.

Link: https://lkml.kernel.org/r/20251122074029.3948921-6-senozhatsky@chromium.org
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Reviewed-by: Brian Geffon <bgeffon@google.com>
Cc: Minchan Kim <minchan@google.com>
Cc: Richard Chang <richardycc@google.com>
Cc: Yuwen Chen <ywen.chen@foxmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-11-24 15:08:53 -08:00
Sergey Senozhatsky
a4f506c569 zram: drop wb_limit_lock
We don't need wb_limit_lock.  Writeback limit setters take an exclusive
write zram init_lock, while wb_limit modifications happen only from a
single task and under zram read init_lock.  No concurrent wb_limit
modifications are possible (we permit only one post-processing task at a
time).  Add lockdep assertions to wb_limit mutators.

While at it, fixup coding styles.

Link: https://lkml.kernel.org/r/20251122074029.3948921-5-senozhatsky@chromium.org
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Reviewed-by: Brian Geffon <bgeffon@google.com>
Cc: Minchan Kim <minchan@google.com>
Cc: Richard Chang <richardycc@google.com>
Cc: Yuwen Chen <ywen.chen@foxmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-11-24 15:08:53 -08:00
Sergey Senozhatsky
7c929664fd zram: take write lock in wb limit store handlers
Write device attrs handlers should take write zram init_lock.  While at
it, fixup coding styles.

Link: https://lkml.kernel.org/r/20251122074029.3948921-4-senozhatsky@chromium.org
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Reviewed-by: Brian Geffon <bgeffon@google.com>
Cc: Minchan Kim <minchan@google.com>
Cc: Richard Chang <richardycc@google.com>
Cc: Yuwen Chen <ywen.chen@foxmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-11-24 15:08:53 -08:00
Sergey Senozhatsky
e828cccb72 zram: add writeback batch size device attr
Introduce writeback_batch_size device attribute so that the maximum number
of in-flight writeback bio requests can be configured at run-time
per-device.  This essentially enables batched bio writeback.

Link: https://lkml.kernel.org/r/20251122074029.3948921-3-senozhatsky@chromium.org
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Reviewed-by: Brian Geffon <bgeffon@google.com>
Cc: Minchan Kim <minchan@google.com>
Cc: Richard Chang <richardycc@google.com>
Cc: Yuwen Chen <ywen.chen@foxmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-11-24 15:08:53 -08:00
Sergey Senozhatsky
f405066a1f zram: introduce writeback bio batching
Patch series "zram: introduce writeback bio batching", v6.

As writeback is becoming more and more common the longstanding limitations
of zram writeback throughput are becoming more visible.  Introduce
writeback bio batching so that multiple writeback bios can be processed
simultaneously.


This patch (of 6):

As was stated in a comment [1] a single page writeback IO is not
efficient, but it works.  It's time to address this throughput limitation
as writeback becomes used more often.  Introduce batched (multiple) bio
writeback support to take advantage of parallel requests processing and
better requests scheduling.

Approach used in this patch doesn't use a dedicated kthread like in [2],
or blk-plug like in [3].  Dedicated kthread adds complexity, which can be
avoided.  Apart from that not all zram setups use writeback, so having
numerous per-device kthreads (on systems that create multiple zram
devices) hanging around is not the most optimal thing to do.  blk-plug, on
the other hand, works best when request are sequential, which doesn't
particularly fit zram writebck IO patterns: zram writeback IO patterns are
expected to be random, due to how bdev block reservation/release are
handled.  blk-plug approach also works in cycles: idle IO, when zram sets
up requests in a batch, is followed by bursts of IO, when zram submits the
entire batch.

Instead we use a batch of requests and submit new bio as soon as one of
the in-flight requests completes.

For the time being the writeback batch size (maximum number of in-flight
bio requests) is set to 32 for all devices.  A follow up patch adds a
writeback_batch_size device attribute, so the batch size becomes run-time
configurable.

Link: https://lkml.kernel.org/r/20251122074029.3948921-1-senozhatsky@chromium.org
Link: https://lkml.kernel.org/r/20251122074029.3948921-2-senozhatsky@chromium.org
Link: https://lore.kernel.org/all/20181203024045.153534-6-minchan@kernel.org/ [1]
Link: https://lore.kernel.org/all/20250731064949.1690732-1-richardycc@google.com/ [2]
Link: https://lore.kernel.org/all/tencent_78FC2C4FE16BA1EBAF0897DB60FCD675ED05@qq.com/ [3]
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Co-developed-by: Yuwen Chen <ywen.chen@foxmail.com>
Co-developed-by: Richard Chang <richardycc@google.com>
Suggested-by: Minchan Kim <minchan@google.com>
Cc: Brian Geffon <bgeffon@google.com>
Cc: Richard Chang <richardycc@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-11-24 15:08:52 -08:00
Chaitanya Kulkarni
e8f0abdd49 zloop: clear nowait flag in workqueue context
The zloop driver advertises REQ_NOWAIT support through BLK_FEAT_NOWAIT
(enabled by default for all blk-mq devices), and honors the nowait
behavior throughout zloop_queue_rq().

However, actual I/O to the backing file is performed in a workqueue,
where blocking is allowed.

To avoid imposing unnecessary non-blocking constraints in this blocking
context, clear the REQ_NOWAIT flag before processing the request in the
workqueue context.

Signed-off-by: Chaitanya Kulkarni <ckulkarnilinux@gmail.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-20 07:42:32 -07:00
Chaitanya Kulkarni
b11e483a1c loop: clear nowait flag in workqueue context
The loop driver advertises REQ_NOWAIT support through BLK_FEAT_NOWAIT
(enabled by default for all blk-mq devices), and honors the nowait
behavior throughout loop_queue_rq().

However, actual I/O to the backing file is performed in a workqueue,
where blocking is allowed.

To avoid imposing unnecessary non-blocking constraints in this blocking
context, clear the REQ_NOWAIT flag before processing the request in the
workqueue context.

Signed-off-by: Chaitanya Kulkarni <ckulkarnilinux@gmail.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-20 07:42:32 -07:00
Damien Le Moal
a9637ab93c zloop: fix zone append check in zloop_rw()
While commit cf28f6f923 ("zloop: fail zone append operations that are
targeting full zones") added a check in zloop_rw() that a zone append is
not issued to a full zone, commit e3a96ca904 ("zloop: simplify checks
for writes to sequential zones") inadvertently removed the check to
verify that there is enough unwritten space in a zone for an incoming
zone append opration.

Re-add this check in zloop_rw() to make sure we do not write beyond the
end of a zone. Of note is that this same check is already present in the
function zloop_set_zone_append_sector() when ordered zone append is in
use.

Reported-by: Hans Holmberg <Hans.Holmberg@wdc.com>
Fixes: e3a96ca904 ("zloop: simplify checks for writes to sequential zones")
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Hans Holmberg <hans.holmberg@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-19 07:39:42 -07:00
Ming Lei
837ed30396 loop: add hint for handling aio via IOCB_NOWAIT
Add hint for using IOCB_NOWAIT to handle loop aio command for avoiding
to cause write(especially randwrite) perf regression on sparse backed file.

Try IOCB_NOWAIT in the following situations:

- backing file is block device

OR

- READ aio command

OR

- there isn't any queued blocking async WRITEs, because NOWAIT won't cause
contention with blocking WRITE, which often implies exclusive lock

With this simple policy, perf regression of randwrite/write on sparse
backing file is fixed.

Link: https://lore.kernel.org/dm-devel/7d6ae2c9-df8e-50d0-7ad6-b787cb3cfab4@redhat.com/
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-18 06:49:52 -07:00
Ming Lei
0ba93a906d loop: try to handle loop aio command via NOWAIT IO first
Try to handle loop aio command via NOWAIT IO first, then we can avoid to
queue the aio command into workqueue. This is usually one big win in
case that FS block mapping is stable, Mikulas verified [1] that this way
improves IO perf by close to 5X in 12jobs sequential read/write test,
in which FS block mapping is just stable.

Fallback to workqueue in case of -EAGAIN. This way may bring a little
cost from the 1st retry, but when running the following write test over
loop/sparse_file, the actual effect on randwrite is obvious:

```
truncate -s 4G 1.img    #1.img is created on XFS/virtio-scsi
losetup -f 1.img --direct-io=on
fio --direct=1 --bs=4k --runtime=40 --time_based --numjobs=1 --ioengine=libaio \
	--iodepth=16 --group_reporting=1 --filename=/dev/loop0 -name=job --rw=$RW
```

- RW=randwrite: obvious IOPS drop observed
- RW=write: a little drop(%5 - 10%)

This perf drop on randwrite over sparse file will be addressed in the
following patch.

BLK_MQ_F_BLOCKING has to be set for calling into .read_iter() or .write_iter()
which might sleep even though it is NOWAIT, and the only effect is that rcu read
lock is replaced with srcu read lock.

Link: https://lore.kernel.org/linux-block/a8e5c76a-231f-07d1-a394-847de930f638@redhat.com/ [1]
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-18 06:49:52 -07:00
Ming Lei
f4788ae9d7 loop: move command blkcg/memcg initialization into loop_queue_work
Move loop command blkcg/memcg initialization into loop_queue_work,
and prepare for supporting to handle loop io command by IOCB_NOWAIT.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-18 06:49:52 -07:00
Ming Lei
c66e9708f9 loop: add lo_submit_rw_aio()
Refactor lo_rw_aio() by extracting the I/O submission logic into a new
helper function lo_submit_rw_aio(). This further improves code organization
by separating the I/O preparation, submission, and completion handling into
distinct phases.

Prepare for using NOWAIT to improve loop performance.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-18 06:49:51 -07:00
Ming Lei
fd858d1ca9 loop: add helper lo_rw_aio_prep()
Add helper lo_rw_aio_prep() to separate the preparation phase(setting up bio
vectors and initializing the iocb structure) from the actual I/O execution
in the loop block driver.

Prepare for using NOWAIT to improve loop performance.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-18 06:49:51 -07:00
Ming Lei
c3e6c11147 loop: add helper lo_cmd_nr_bvec()
Add lo_cmd_nr_bvec() and prepare for refactoring lo_rw_aio().

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-18 06:49:51 -07:00
Sukrut Heroorkar
2c6d792d4b drbd: turn bitmap I/O comments into regular block comments
W=1 build warns because the bitmap I/O comments use '/**', which
marks them as kernel-doc comments even though these functions do not
document an external API.

Convert these comments to regular block comments so kernel-doc no
longer parses them.

Signed-off-by: Sukrut Heroorkar <hsukrut3@gmail.com>
Acked-by: Christoph Böhmwalder <christoph.boehmwalder@linbit.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-18 06:48:33 -07:00
Damien Le Moal
fcc6eaa3a0 zloop: introduce the ordered_zone_append configuration parameter
The zone append operation processing for zloop devices is similar to any
other command, that is, the operation is processed as a command work
item, without any special serialization between the work items (beside
the zone mutex for mutually exclusive code sections).

This processing is fine and gives excellent performance. However, it has
a side effect: zone append operation are very often reordered and
processed in a sequence that is very different from their issuing order
by the user. This effect is very visible using an XFS file system on top
of a zloop device. A simple file write leads to many file extents as the
data writes using zone append are reordered and so result in the
physical order being different than the file logical order.
E.g. executing:

$ dd if=/dev/zero of=/mnt/test bs=1M count=10 && sync
$ xfs_bmap /mnt/test
/mnt/test:
	0: [0..4095]: 2162688..2166783
	1: [4096..6143]: 2168832..2170879
	2: [6144..8191]: 2166784..2168831
	3: [8192..10239]: 2170880..2172927
	4: [10240..12287]: 2174976..2177023
	5: [12288..14335]: 2172928..2174975
	6: [14336..20479]: 2177024..2183167

For 10 IOs, 6 extents are created.

This is fine and actually allows to exercise XFS zone garbage collection
very well. However, this also makes debugging/working on XFS data
placement harder as the underlying device will most of the time reorder
IOs, resulting in many file extents.

Allow a user to mitigate this with the new ordered_zone_append
configuration parameter. For a zloop device created with this parameter
specified, the sector of a zone append command is set early, when the
command is submitted by the block layer with the zloop_queue_rq()
function, instead of in the zloop_rw() function which is exectued later
in the command work item context. This change ensures that more often
than not, zone append operations data end up being written in the same
order as the command submission by the user.

In the case of XFS, this leads to far less file data extents. E.g., for
the previous example, we get a single file data extent for the written
file.

$ dd if=/dev/zero of=/mnt/test bs=1M count=10 && sync
$ xfs_bmap /mnt/test
/mnt/test:
	0: [0..20479]: 2162688..2183167

Since we cannot use a mutex in the context of the zloop_queue_rq()
function to atomically set a zone append operation sector to the target
zone write pointer location and increment that the write pointer, a new
per-zone spinlock is introduced to protect a zone write pointer access
and modifications. To check a zone write pointer location and set a zone
append operation target sector to that value, the function
zloop_set_zone_append_sector() is introduced and called from
zloop_queue_rq().

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-17 09:40:09 -07:00
Damien Le Moal
9236c5fdd5 zloop: introduce the zone_append configuration parameter
A zloop zoned block device declares to the block layer that it supports
zone append operations. That is, a zloop device ressembles an NVMe ZNS
devices supporting zone append.

This native support is fine but it does not allow exercising the block
layer zone write plugging emulation of zone append, as is done with SCSI
or ATA SMR HDDs.

Introduce the zone_append configuration parameter to allow creating a
zloop device without native support for zone append, thus relying on the
block layer zone append emulation. If not specified, zone append support
is enabled by default. Otherwise, a value of 0 disables native zone
append and a value of 1 enables it.

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-17 09:40:09 -07:00
Damien Le Moal
e3a96ca904 zloop: simplify checks for writes to sequential zones
The function zloop_rw() already checks early that a request is fully
contained within the target zone. So this check does not need to be done
again for regular writes to sequential zones. Furthermore, since zone
append operations are always directed to the zone write pointer
location, we do not need to check for their alignment to that value
after setting it. So turn the "if" checking the write pointer alignment
into an "else if".

While at it, improve the comment describing the write pointer
modification and how this value is corrected in case of error.

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-17 09:40:09 -07:00
Damien Le Moal
cf28f6f923 zloop: fail zone append operations that are targeting full zones
zloop_rw() will fail any regular write operation that targets a full
sequential zone. The check for this is indirect and achieved by checking
the write pointer alignment of the write operation. But this check is
ineffective for zone append operations since these are alwasy
automatically directed at a zone write pointer.

Prevent zone append operations from being executed in a full zone with
an explicit check of the zone condition.

Fixes: eb0570c7df ("block: new zoned loop block device driver")
Cc: stable@vger.kernel.org
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-17 09:40:09 -07:00
Damien Le Moal
866d65745b zloop: make the write pointer of full zones invalid
The write pointer of zones that are in the full condition is always
invalid. Reflect that fact by setting the write pointer of full zones
to ULLONG_MAX.

Fixes: eb0570c7df ("block: new zoned loop block device driver")
Cc: stable@vger.kernel.org
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-17 09:40:09 -07:00
Rene Rebe
82d2048102 floppy: fix for PAGE_SIZE != 4KB
For years I wondered why the floppy driver does not just work on
sparc64, e.g:

root@SUNW_375_0066:# disktype /dev/fd0
disktype: Can't open /dev/fd0: No such device or address

[  525.341906] disktype: attempt to access beyond end of device
fd0: rw=0, sector=0, nr_sectors = 16 limit=8
[  525.341991] floppy: error 10 while reading block 0

Turns out floppy.c __floppy_read_block_0 tries to read one page for
the first test read to determine the disk size and thus fails if that
is greater than 4k. Adjust minimum MAX_DISK_SIZE to PAGE_SIZE to fix
floppy on sparc64 and likely all other PAGE_SIZE != 4KB configs.

Cc: stable@vger.kernel.org
Signed-off-by: René Rebe <rene@exactco.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-17 08:22:00 -07:00
Rene Rebe
79bd8c9814 ps3disk: use memcpy_{from,to}_bvec index
With 6e0a48552b (ps3disk: use memcpy_{from,to}_bvec) converting
ps3disk to new bvec helpers, incrementing the offset was accidently
lost, corrupting consecutive buffers. Restore index for non-corrupted
data transfers.

Fixes: 6e0a48552b (ps3disk: use memcpy_{from,to}_bvec)
Signed-off-by: René Rebe <rene@exactco.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-14 09:10:16 -07:00
Keith Busch
3749ea4dee null_blk: fix zone read length beyond write pointer
Fix up the divisor calculating the number of zone sectors being read and
handle a read that straddles the zone write pointer. The length is
rounded up a sector boundary, so be sure to truncate any excess bytes
off to avoid copying past the data segment.

Fixes: 3451cf34f5 ("null_blk: allow byte aligned memory offsets")
Signed-off-by: Keith Busch <kbusch@kernel.org>
Tested-by: Bart van Assche <bvanassche@acm.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-12 10:02:56 -07:00
Caleb Sander Mateos
727a440278 ublk: return unsigned from ublk_{,un}map_io()
ublk_map_io() and ublk_unmap_io() never return negative values, and
their return values are stored in variables of type unsigned. Clarify
that they can't fail by making their return types unsigned.

Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-11 07:57:20 -07:00
Caleb Sander Mateos
6b0a29933f ublk: remove unnecessary checks in ublk_check_and_get_req()
ub = iocb->ki_filp->private_data cannot be NULL, as it's set in
ublk_ch_open() before it returns succesfully. req->mq_hctx cannot be
NULL as any inflight ublk request must belong to some queue. And
req->mq_hctx->driver_data cannot be NULL as it's set to the ublk_queue
pointer in ublk_init_hctx(). So drop the unnecessary checks.

Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-11 07:57:05 -07:00
Zheng Qixing
1649714b93 nbd: defer config unlock in nbd_genl_connect
There is one use-after-free warning when running NBD_CMD_CONNECT and
NBD_CLEAR_SOCK:

nbd_genl_connect
  nbd_alloc_and_init_config // config_refs=1
  nbd_start_device // config_refs=2
  set NBD_RT_HAS_CONFIG_REF			open nbd // config_refs=3
  recv_work done // config_refs=2
						NBD_CLEAR_SOCK // config_refs=1
						close nbd // config_refs=0
  refcount_inc -> uaf

------------[ cut here ]------------
refcount_t: addition on 0; use-after-free.
WARNING: CPU: 24 PID: 1014 at lib/refcount.c:25 refcount_warn_saturate+0x12e/0x290
 nbd_genl_connect+0x16d0/0x1ab0
 genl_family_rcv_msg_doit+0x1f3/0x310
 genl_rcv_msg+0x44a/0x790

The issue can be easily reproduced by adding a small delay before
refcount_inc(&nbd->config_refs) in nbd_genl_connect():

        mutex_unlock(&nbd->config_lock);
        if (!ret) {
                set_bit(NBD_RT_HAS_CONFIG_REF, &config->runtime_flags);
+               printk("before sleep\n");
+               mdelay(5 * 1000);
+               printk("after sleep\n");
                refcount_inc(&nbd->config_refs);
                nbd_connect_reply(info, nbd->index);
        }

Fixes: e46c7287b1 ("nbd: add a basic netlink interface")
Signed-off-by: Zheng Qixing <zhengqixing@huawei.com>
Reviewed-by: Yu Kuai <yukuai@fnnas.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-11 07:50:15 -07:00
Zheng Qixing
9517b82d8d nbd: defer config put in recv_work
There is one uaf issue in recv_work when running NBD_CLEAR_SOCK and
NBD_CMD_RECONFIGURE:
  nbd_genl_connect     // conf_ref=2 (connect and recv_work A)
  nbd_open	       // conf_ref=3
  recv_work A done     // conf_ref=2
  NBD_CLEAR_SOCK       // conf_ref=1
  nbd_genl_reconfigure // conf_ref=2 (trigger recv_work B)
  close nbd	       // conf_ref=1
  recv_work B
    config_put         // conf_ref=0
    atomic_dec(&config->recv_threads); -> UAF

Or only running NBD_CLEAR_SOCK:
  nbd_genl_connect   // conf_ref=2
  nbd_open 	     // conf_ref=3
  NBD_CLEAR_SOCK     // conf_ref=2
  close nbd
    nbd_release
      config_put     // conf_ref=1
  recv_work
    config_put 	     // conf_ref=0
    atomic_dec(&config->recv_threads); -> UAF

Commit 87aac3a80a ("nbd: call nbd_config_put() before notifying the
waiter") moved nbd_config_put() to run before waking up the waiter in
recv_work, in order to ensure that nbd_start_device_ioctl() would not
be woken up while nbd->task_recv was still uncleared.

However, in nbd_start_device_ioctl(), after being woken up it explicitly
calls flush_workqueue() to make sure all current works are finished.
Therefore, there is no need to move the config put ahead of the wakeup.

Move nbd_config_put() to the end of recv_work, so that the reference is
held for the whole lifetime of the worker thread. This makes sure the
config cannot be freed while recv_work is still running, even if clear
+ reconfigure interleave.

In addition, we don't need to worry about recv_work dropping the last
nbd_put (which causes deadlock):

path A (netlink with NBD_CFLAG_DESTROY_ON_DISCONNECT):
  connect  // nbd_refs=1 (trigger recv_work)
  open nbd // nbd_refs=2
  NBD_CLEAR_SOCK
  close nbd
    nbd_release
      nbd_disconnect_and_put
        flush_workqueue // recv_work done
      nbd_config_put
        nbd_put // nbd_refs=1
      nbd_put // nbd_refs=0
        queue_work

path B (netlink without NBD_CFLAG_DESTROY_ON_DISCONNECT):
  connect  // nbd_refs=2 (trigger recv_work)
  open nbd // nbd_refs=3
  NBD_CLEAR_SOCK // conf_refs=2
  close nbd
    nbd_release
      nbd_config_put // conf_refs=1
      nbd_put // nbd_refs=2
  recv_work done // conf_refs=0, nbd_refs=1
  rmmod // nbd_refs=0

Reported-by: syzbot+56fbf4c7ddf65e95c7cc@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/6907edce.a70a0220.37351b.0014.GAE@google.com/T/
Fixes: 87aac3a80a ("nbd: make the config put is called before the notifying the waiter")
Depends-on: e2daec488c ("nbd: Fix hungtask when nbd_config_put")
Signed-off-by: Zheng Qixing <zhengqixing@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-08 06:37:54 -07:00
Cong Zhang
0739c2c6a0 virtio_blk: NULL out vqs to avoid double free on failed resume
The vblk->vqs releases during freeze. If resume fails before vblk->vqs
is allocated, later freeze/remove may attempt to free vqs again.
Set vblk->vqs to NULL after freeing to avoid double free.

Signed-off-by: Cong Zhang <cong.zhang@oss.qualcomm.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-06 16:32:58 -07:00
Keith Busch
3451cf34f5 null_blk: allow byte aligned memory offsets
Allowing byte aligned memory provides a nice testing ground for
direct-io.

Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Tested-by: Hans Holmberg <hans.holmberg@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-06 16:28:55 -07:00
Keith Busch
262a3dd04e null_blk: single kmap per bio segment
Rather than kmap the the request bio segment for each sector, do
the mapping just once.

Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Tested-by: Hans Holmberg <hans.holmberg@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-06 16:27:36 -07:00
Keith Busch
8459283819 null_blk: consistently use blk_status_t
No need to mix errno and blk_status_t error types. Just use the standard
block layer type.

Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Tested-by: Hans Holmberg <hans.holmberg@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-06 16:27:21 -07:00
Keith Busch
1165d20f4d null_blk: simplify copy_from_nullb
It always returns success, so the code that saves the errors status, but
proceeds without checking it looks a bit odd. Clean this up.

Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Tested-by: Hans Holmberg <hans.holmberg@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-06 16:27:21 -07:00
Caleb Sander Mateos
e87d66ab27 ublk: use rq_for_each_segment() for user copy
ublk_advance_io_iter() and ublk_copy_io_pages() currently open-code the
iteration over the request's bvecs. Switch to the rq_for_each_segment()
macro provided by blk-mq to avoid reaching into the bio internals and
simplify the code.

Suggested-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-06 16:26:04 -07:00
Caleb Sander Mateos
2299ceec36 ublk: use copy_{to,from}_iter() for user copy
ublk_copy_user_pages()/ublk_copy_io_pages() currently uses
iov_iter_get_pages2() to extract the pages from the iov_iter and
memcpy()s between the bvec_iter and the iov_iter's pages one at a time.
Switch to using copy_to_iter()/copy_from_iter() instead. This avoids the
user page reference count increments and decrements and needing to split
the memcpy() at user page boundaries. It also simplifies the code
considerably.
Ming reports a 40% throughput improvement when issuing I/O to the
selftests null ublk server with zero-copy disabled.

Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-06 16:26:04 -07:00
Jakub Kicinski
1ec9871fbb Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR (net-6.18-rc5).

Conflicts:

drivers/net/wireless/ath/ath12k/mac.c
  9222582ec5 ("Revert "wifi: ath12k: Fix missing station power save configuration"")
  6917e268c4 ("wifi: ath12k: Defer vdev bring-up until CSA finalize to avoid stale beacon")
https://lore.kernel.org/11cece9f7e36c12efd732baa5718239b1bf8c950.camel@sipsolutions.net

Adjacent changes:

drivers/net/ethernet/intel/Kconfig
  b1d16f7c00 ("libie: depend on DEBUG_FS when building LIBIE_FWLOG")
  93f53db9f9 ("ice: switch to Page Pool")

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-06 09:27:40 -08:00
Shankari Anand
ba13710ddd rust: block: update ARef and AlwaysRefCounted imports from sync::aref
Update call sites in the block subsystem to import `ARef` and
`AlwaysRefCounted` from `sync::aref` instead of `types`.

This aligns with the ongoing effort to move `ARef` and
`AlwaysRefCounted` to sync.

Suggested-by: Benno Lossin <lossin@kernel.org>
Link: https://github.com/Rust-for-Linux/linux/issues/1173
Signed-off-by: Shankari Anand <shankari.ak0208@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-05 18:24:10 -07:00
Damien Le Moal
fdb9aed869 block: introduce disk_report_zone()
Commit b76b840fd9 ("dm: Fix dm-zoned-reclaim zone write pointer
alignment") introduced an indirect call for the callback function of a
report zones executed with blkdev_report_zones(). This is necessary so
that the function disk_zone_wplug_sync_wp_offset() can be called to
refresh a zone write plug zone write pointer offset after a write error.
However, this solution makes following the path of a zone information
harder to understand.

Clean this up by introducing the new blk_report_zones_args structure to
define a zone report callback and its private data and introduce the
helper function disk_report_zone() which calls both
disk_zone_wplug_sync_wp_offset() and the zone report user callback
function for all zones of a zone report. This helper function must be
called by all block device drivers that implement the report zones
block operation in order to correctly report a zone information.

All block device drivers supporting the report_zones block operation are
updated to use this new scheme.

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-05 08:07:21 -07:00
Kees Cook
85cb0757d7 net: Convert proto_ops connect() callbacks to use sockaddr_unsized
Update all struct proto_ops connect() callback function prototypes from
"struct sockaddr *" to "struct sockaddr_unsized *" to avoid lying to the
compiler about object sizes. Calls into struct proto handlers gain casts
that will be removed in the struct proto conversion patch.

No binary changes expected.

Signed-off-by: Kees Cook <kees@kernel.org>
Link: https://patch.msgid.link/20251104002617.2752303-3-kees@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-04 19:10:32 -08:00
Kees Cook
0e50474fa5 net: Convert proto_ops bind() callbacks to use sockaddr_unsized
Update all struct proto_ops bind() callback function prototypes from
"struct sockaddr *" to "struct sockaddr_unsized *" to avoid lying to the
compiler about object sizes. Calls into struct proto handlers gain casts
that will be removed in the struct proto conversion patch.

No binary changes expected.

Signed-off-by: Kees Cook <kees@kernel.org>
Link: https://patch.msgid.link/20251104002617.2752303-2-kees@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-04 19:10:32 -08:00
Christian Brauner
4601b7923d
nbd: don't copy kernel creds
No need to copy kernel credentials.

Link: https://patch.msgid.link/20251103-work-creds-init_cred-v1-6-cb3ec8711a6a@kernel.org
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-04 12:36:16 +01:00
Ming Lei
c28ba6b6c5 ublk: use struct_size() for allocation
Convert ublk_queue to use struct_size() for allocation.

Changes in this commit:

1. Update ublk_init_queue() to use struct_size(ubq, ios, depth)
   instead of manual size calculation (sizeof(struct ublk_queue) +
   depth * sizeof(struct ublk_io)).

This provides better type safety and makes the code more maintainable
by using standard kernel macro for flexible array handling.

Meantime annotate ublk_queue.ios by __counted_by().

Reviewed-by: Caleb Sander Mateos <csander@purestorage.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-03 08:34:59 -07:00
Ming Lei
529d4d6327 ublk: implement NUMA-aware memory allocation
Implement NUMA-friendly memory allocation for ublk driver to improve
performance on multi-socket systems.

This commit includes the following changes:

1. Rename __queues to queues, dropping the __ prefix since the field is
   now accessed directly throughout the codebase rather than only through
   the ublk_get_queue() helper.

2. Remove the queue_size field from struct ublk_device as it is no longer
   needed.

3. Move queue allocation and deallocation into ublk_init_queue() and
   ublk_deinit_queue() respectively, improving encapsulation. This
   simplifies ublk_init_queues() and ublk_deinit_queues() to just
   iterate and call the per-queue functions.

4. Add ublk_get_queue_numa_node() helper function to determine the
   appropriate NUMA node for a queue by finding the first CPU mapped
   to that queue via tag_set.map[HCTX_TYPE_DEFAULT].mq_map[] and
   converting it to a NUMA node using cpu_to_node(). This function is
   called internally by ublk_init_queue() to determine the allocation
   node.

5. Allocate each queue structure on its local NUMA node using
   kvzalloc_node() in ublk_init_queue().

6. Allocate the I/O command buffer on the same NUMA node using
   alloc_pages_node().

This reduces memory access latency on multi-socket NUMA systems by
ensuring each queue's data structures are local to the CPUs that
access them.

Reviewed-by: Caleb Sander Mateos <csander@purestorage.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-03 08:34:59 -07:00
Ming Lei
011af85ccd ublk: reorder tag_set initialization before queue allocation
Move ublk_add_tag_set() before ublk_init_queues() in the device
initialization path. This allows us to use the blk-mq CPU-to-queue
mapping established by the tag_set to determine the appropriate
NUMA node for each queue allocation.

The error handling paths are also reordered accordingly.

Reviewed-by: Caleb Sander Mateos <csander@purestorage.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-03 08:34:59 -07:00
Caleb Sander Mateos
20fb3d05a3 io_uring/uring_cmd: avoid double indirect call in task work dispatch
io_uring task work dispatch makes an indirect call to struct io_kiocb's
io_task_work.func field to allow running arbitrary task work functions.
In the uring_cmd case, this calls io_uring_cmd_work(), which immediately
makes another indirect call to struct io_uring_cmd's task_work_cb field.
Change the uring_cmd task work callbacks to functions whose signatures
match io_req_tw_func_t. Add a function io_uring_cmd_from_tw() to convert
from the task work's struct io_tw_req argument to struct io_uring_cmd *.
Define a constant IO_URING_CMD_TASK_WORK_ISSUE_FLAGS to avoid
manufacturing issue_flags in the uring_cmd task work callbacks. Now
uring_cmd task work dispatch makes a single indirect call to the
uring_cmd implementation's callback. This also allows removing the
task_work_cb field from struct io_uring_cmd, freeing up 8 bytes for
future storage.
Since fuse_uring_send_in_task() now has access to the io_tw_token_t,
check its cancel field directly instead of relying on the
IO_URING_F_TASK_DEAD issue flag.

Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-03 08:31:26 -07:00
Shi Hao
77220f6d18 drbd: replace kmap() with kmap_local_page() in receiver path
Use kmap_local_page() instead of kmap() to avoid
CPU contention.

kmap() uses a global set of mapping slots that can cause contention
between multiple CPUs, while kmap_local_page() uses per-CPU slots
eliminating this contention. It also ensures non-sleeping operation
and provides better cache locality.

Convert kmap() to kmap_local_page() as it aligns with ongoing
kernel efforts to modernize kmap() usage for better multi-core
scalability.

Signed-off-by: Shi Hao <i.shihao.999@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-03 08:15:54 -07:00
Linus Torvalds
a5beb58e53 block-6.18-20251031
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmkE0BoQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpvHID/0bh5wEXK/IFMIDdiyqdbr2GsoMfHxiM2k0
 OSeQdwMgEGPY2frB8SirTBZWPIskOFdgSbRQyuYiu5XpdbsRJY+JdtkaGYp1L+Fc
 4I4C1ZpK4Kdlw+nbBrcUxdedYKx4ZN/00otZWV2K2ZpJFn1ZCLhyInZZ8ZbosKEn
 HeAW54YLu+q3pO9BSbJBO97FP38AZAOqkT9suUDkQYUUnNivejFKV0qbKlRm5v4H
 fQLU2sfT1J78DHdhJ1Gdk+uNKzVuYxR7lJRC+1c0yi2fZN3VGNRYlTk1f4VX3mOn
 RcRaUr4r9LMZc9K2IYEpQgAyuznttokWI0SkklFVTFDZwa1KmsIZjEccNXvESDXN
 vSxUXuZtgePo2qijK0F8VoPqgQRLBoP5MeAfp+VlkUWAu49zljwrIZXuZl0xuHpT
 JIEzbzvk+KfPS/gKtQdWxuN3eqZvv596SxnWnzGMg17zmhsj2kEZ9BF4Q+9BNVMZ
 NdK0jmdsBA3iTI8xVy2ajEY6U2W3KDdkSKPWR2SDg+vBd/qu3VBmrC9ptr1AoYpO
 54UOyBtIAumMyaOAUDSGiKC4KSbgMWUhN2uBFC8uWvuh733Z333xb9BnY1T7D624
 cfacmSzkoXKUACmcLaod2+MDJlSXhxmOtVN65euxst8ZGQsSah1TpY4Tr/2UHKnO
 ru+vbsqJyQ==
 =rC6y
 -----END PGP SIGNATURE-----

Merge tag 'block-6.18-20251031' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux

Pull block fixes from Jens Axboe:

 - Fix blk-crypto reporting EIO when EINVAL is the correct error code

 - Two bug fixes for the block zone support

 - NVME pull request via Keith:
      - Target side authentication fixup
      - Peer-to-peer metadata fixup

 - null_blk DMA alignment fix

* tag 'block-6.18-20251031' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux:
  null_blk: set dma alignment to logical block size
  blk-crypto: use BLK_STS_INVAL for alignment errors
  block: make REQ_OP_ZONE_OPEN a write operation
  block: fix op_is_zone_mgmt() to handle REQ_OP_ZONE_RESET_ALL
  nvme-pci: use blk_map_iter for p2p metadata
  nvmet-auth: update sc_c in host response
2025-10-31 12:57:19 -07:00
Hans Holmberg
0d92a3eaa6 null_blk: set dma alignment to logical block size
This driver assumes that bio vectors are memory aligned to the logical
block size, so set the queue limit to reflect that.

Unless we set up the limit based on the logical block size, we will go
out of page bounds in copy_to_nullb / copy_from_nullb.

Apparently this wasn't noticed so far because none of the tests generate
such buffers, but since commit 851c4c96db ("xfs: implement
XFS_IOC_DIOINFO in terms of vfs_getattr") xfstests generates unaligned
I/O, which now lead to memory corruption when using null_blk devices
with 4k block size.

Fixes: bf8d08532b ("iomap: add support for dma aligned direct-io")
Fixes: b1a000d3b8 ("block: relax direct io memory alignment")
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Hans Holmberg <hans.holmberg@wdc.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-10-31 09:03:12 -06:00
Linus Torvalds
d2818517e3 block-6.18-20251023
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmj62psQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpjMiD/0cTxemB2tYk5Nd1QIAda8cwO0fn1jLgamH
 tjQfy0uq4kxzSY4QWWx8HkA8sEybAOpAwP2u+F3RN/CsW3//TMA+H8JGW1h0k5OG
 dh+0asF0iru9euyAePTLUExOw2V3VgEajjvt/2ezkjussNki6vcXBoIzGfeZKQ5E
 MSx6LTbnpzAy+SUydYFpLFtFcokXzUyp/TKZY+QgsIzsqo/ReUm3Caa/KbxQBPQm
 7MhpUpnTdI1PjYZZE/Y/p4iWtesCSpiSOayYKhtBQX4FzMo12MZw5nRkJkliLUvm
 EtPuSYBSCQEKnYVlfCqLuVd8r7drgMgwZOmNhOsdtUHLigtkPolxQOqQKniX3u70
 ycMqn3b1BdEFSqVe/eXhIRZ3YCL3xEAJUYTBRvwbf7XVC804F8VV+CqAey835A4D
 IIcIh8vYrkw0HD5HP3aILKlWPHilArDqjcuU260Qd9i79EV7zVRUJrySc0mZ9zK9
 XVKX0csETx1SrdH9vRlwBaeJzQyF9J18fuYMD7JV1dK0FhkEX6+pF5dY5rE6S+0r
 /tjZgEwSS4siQYhsOM+q3J/ZoLMP2RmW10rYYcKiS9NrqYm1b5VNenBVm7bX6SQO
 P29JtDJG374ygPiFn7opMbY79LDJ8JNS5g+vq0en8HtEGwtWuFM3vVCYyerPkiCl
 IgVxWd/xYQ==
 =Zqjz
 -----END PGP SIGNATURE-----

Merge tag 'block-6.18-20251023' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux

Pull block fixes from Jens Axboe:

 - Fix dma alignment for PI

 - Fix selinux bogosity with nbd, where sendmsg would get rejected

* tag 'block-6.18-20251023' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux:
  block: require LBA dma_alignment when using PI
  nbd: override creds to kernel when calling sock_{send,recv}msg()
2025-10-24 12:48:19 -07:00
Ondrej Mosnacek
81ccca3121 nbd: override creds to kernel when calling sock_{send,recv}msg()
sock_{send,recv}msg() internally calls security_socket_{send,recv}msg(),
which does security checks (e.g. SELinux) for socket access against the
current task. However, _sock_xmit() in drivers/block/nbd.c may be called
indirectly from a userspace syscall, where the NBD socket access would
be incorrectly checked against the calling userspace task (which simply
tries to read/write a file that happens to reside on an NBD device).

To fix this, temporarily override creds to kernel ones before calling
the sock_*() functions. This allows the security modules to recognize
this as internal access by the kernel, which will normally be allowed.

A way to trigger the issue is to do the following (on a system with
SELinux set to enforcing):

    ### Create nbd device:
    truncate -s 256M /tmp/testfile
    nbd-server localhost:10809 /tmp/testfile

    ### Connect to the nbd server:
    nbd-client localhost

    ### Create mdraid array
    mdadm --create -l 1 -n 2 /dev/md/testarray /dev/nbd0 missing

After these steps, assuming the SELinux policy doesn't allow the
unexpected access pattern, errors will be visible on the kernel console:

[  142.204243] nbd0: detected capacity change from 0 to 524288
[  165.189967] md: async del_gendisk mode will be removed in future, please upgrade to mdadm-4.5+
[  165.252299] md/raid1:md127: active with 1 out of 2 mirrors
[  165.252725] md127: detected capacity change from 0 to 522240
[  165.255434] block nbd0: Send control failed (result -13)
[  165.255718] block nbd0: Request send failed, requeueing
[  165.256006] block nbd0: Dead connection, failed to find a fallback
[  165.256041] block nbd0: Receive control failed (result -32)
[  165.256423] block nbd0: shutting down sockets
[  165.257196] I/O error, dev nbd0, sector 2048 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 2
[  165.257736] Buffer I/O error on dev md127, logical block 0, async page read
[  165.258263] I/O error, dev nbd0, sector 2048 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 2
[  165.259376] Buffer I/O error on dev md127, logical block 0, async page read
[  165.259920] I/O error, dev nbd0, sector 2048 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 2
[  165.260628] Buffer I/O error on dev md127, logical block 0, async page read
[  165.261661] ldm_validate_partition_table(): Disk read failed.
[  165.262108] I/O error, dev nbd0, sector 2048 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 2
[  165.262769] Buffer I/O error on dev md127, logical block 0, async page read
[  165.263697] I/O error, dev nbd0, sector 2048 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 2
[  165.264412] Buffer I/O error on dev md127, logical block 0, async page read
[  165.265412] I/O error, dev nbd0, sector 2048 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 2
[  165.265872] Buffer I/O error on dev md127, logical block 0, async page read
[  165.266378] I/O error, dev nbd0, sector 2048 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 2
[  165.267168] Buffer I/O error on dev md127, logical block 0, async page read
[  165.267564]  md127: unable to read partition table
[  165.269581] I/O error, dev nbd0, sector 0 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 2
[  165.269960] Buffer I/O error on dev nbd0, logical block 0, async page read
[  165.270316] I/O error, dev nbd0, sector 0 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 2
[  165.270913] Buffer I/O error on dev nbd0, logical block 0, async page read
[  165.271253] I/O error, dev nbd0, sector 0 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 2
[  165.271809] Buffer I/O error on dev nbd0, logical block 0, async page read
[  165.272074] ldm_validate_partition_table(): Disk read failed.
[  165.272360]  nbd0: unable to read partition table
[  165.289004] ldm_validate_partition_table(): Disk read failed.
[  165.289614]  nbd0: unable to read partition table

The corresponding SELinux denial on Fedora/RHEL will look like this
(assuming it's not silenced):
type=AVC msg=audit(1758104872.510:116): avc:  denied  { write } for  pid=1908 comm="mdadm" laddr=::1 lport=32772 faddr=::1 fport=10809 scontext=system_u:system_r:mdadm_t:s0-s0:c0.c1023 tcontext=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 tclass=tcp_socket permissive=0

The respective backtrace looks like this:
@security[mdadm, -13,
        handshake_exit+221615650
        handshake_exit+221615650
        handshake_exit+221616465
        security_socket_sendmsg+5
        sock_sendmsg+106
        handshake_exit+221616150
        sock_sendmsg+5
        __sock_xmit+162
        nbd_send_cmd+597
        nbd_handle_cmd+377
        nbd_queue_rq+63
        blk_mq_dispatch_rq_list+653
        __blk_mq_do_dispatch_sched+184
        __blk_mq_sched_dispatch_requests+333
        blk_mq_sched_dispatch_requests+38
        blk_mq_run_hw_queue+239
        blk_mq_dispatch_plug_list+382
        blk_mq_flush_plug_list.part.0+55
        __blk_flush_plug+241
        __submit_bio+353
        submit_bio_noacct_nocheck+364
        submit_bio_wait+84
        __blkdev_direct_IO_simple+232
        blkdev_read_iter+162
        vfs_read+591
        ksys_read+95
        do_syscall_64+92
        entry_SYSCALL_64_after_hwframe+120
]: 1

The issue has started to appear since commit 060406c61c ("block: add
plug while submitting IO").

Cc: Ming Lei <ming.lei@redhat.com>
Link: https://bugzilla.redhat.com/show_bug.cgi?id=2348878
Fixes: 060406c61c ("block: add plug while submitting IO")
Signed-off-by: Ondrej Mosnacek <omosnace@redhat.com>
Acked-by: Paul Moore <paul@paul-moore.com>
Acked-by: Stephen Smalley <stephen.smalley.work@gmail.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Tested-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-10-20 09:27:58 -06:00
Tamir Duberstein
5cc5d805e3 rnull: use kernel::fmt
Reduce coupling to implementation details of the formatting machinery by
avoiding direct use for `core`'s formatting traits and macros.

This backslid in commit d969d504bc ("rnull: enable configuration via
`configfs`") and commit 34585dc649 ("rnull: add soft-irq completion
support").

Acked-by: Andreas Hindborg <a.hindborg@kernel.org>
Signed-off-by: Tamir Duberstein <tamird@gmail.com>
Link: https://patch.msgid.link/20251018-cstr-core-v18-5-9378a54385f8@gmail.com
Signed-off-by: Miguel Ojeda <ojeda@kernel.org>
2025-10-20 04:04:23 +02:00
Linus Torvalds
1b1391b9c4 block-6.18-20251009
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmjoDmUQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgptyjD/94YYv1sabG9M6UHq7j9lOAgruqXaaEMOw+
 Blnm4ejuLNcM8FMCBcuvbhp3ktzT7v1/bWal7FLnmujuKBfhAe+t2AVcHFWUQie2
 CIfMjc3p77U/bwL5wt0O5WFqu1UPDVe+qzrppqRYduTxvPKk9Fi6mqpYCXKlYN7K
 FhINsytoZp/CvTdf5EDSsPv2r4W85OhrPeq0VjYufFBD1wxXD94ii8WAvyfsl20s
 0gIfdlfa2vaNVwH1kdCd+IeATrSBpyCZKGEVTzcHYoo/1MgfNFigrJ8GUA5c+DLM
 fmNE+E+wFtobq5WBmbrtmAxtBnzzV49HS1OT1amUktuq87ryiY5Svn6vFAqEJQl6
 2HLE9nNN2PBdPMAmQ57u1bvp/3nGD0mk/hC1666MTDxHpxg5c6cugCSlJGVG+uC9
 ShLgi8bWV6RXelso0qMaSmNNCA8dskxJg/YDJ06AViTSuW8Y1+adoXddCjE7jne9
 3lci/r2WiuwqTJuub9D7LUtC7VhbCY19VVkgDE64VB2+CjR8B9AlLVG3sGl1HDOY
 EFAddJ3lAEOz5F1H2AzcOBPqqeBfuipr6lEpdb9+6hNu5wRILAHtme8W76c4PtuF
 PRk/3JYcHE77DZlFeE+iN8n0y1tNdWR/6QzWIOsGcNlUyeGGV/zvgGOodtFRpHt2
 t7Eue56EFw==
 =/1jf
 -----END PGP SIGNATURE-----

Merge tag 'block-6.18-20251009' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux

Pull block fixes from Jens Axboe:

 - Don't include __GFP_NOWARN for loop worker allocation, as it already
   uses GFP_NOWAIT which has __GFP_NOWARN set already

 - Small series cleaning up the recent bio_iov_iter_get_pages() changes

 - loop fix for leaking the backing reference file, if validation fails

 - Update of a comment pertaining to disk/partition stat locking

* tag 'block-6.18-20251009' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux:
  loop: remove redundant __GFP_NOWARN flag
  block: move bio_iov_iter_get_bdev_pages to block/fops.c
  iomap: open code bio_iov_iter_get_bdev_pages
  block: rename bio_iov_iter_get_pages_aligned to bio_iov_iter_get_pages
  block: remove bio_iov_iter_get_pages
  block: Update a comment of disk statistics
  loop: fix backing file reference leak on validation error
2025-10-10 10:37:13 -07:00
Pedro Demarchi Gomes
455281c0ef loop: remove redundant __GFP_NOWARN flag
GFP_NOWAIT already includes __GFP_NOWARN, so let's remove the
redundant __GFP_NOWARN.

Signed-off-by: Pedro Demarchi Gomes <pedrodemargomes@gmail.com>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-10-08 06:27:53 -06:00
Linus Torvalds
8804d970fa Summary of significant series in this pull request:
- The 3 patch series "mm, swap: improve cluster scan strategy" from
   Kairui Song improves performance and reduces the failure rate of swap
   cluster allocation.
 
 - The 4 patch series "support large align and nid in Rust allocators"
   from Vitaly Wool permits Rust allocators to set NUMA node and large
   alignment when perforning slub and vmalloc reallocs.
 
 - The 2 patch series "mm/damon/vaddr: support stat-purpose DAMOS" from
   Yueyang Pan extend DAMOS_STAT's handling of the DAMON operations sets
   for virtual address spaces for ops-level DAMOS filters.
 
 - The 3 patch series "execute PROCMAP_QUERY ioctl under per-vma lock"
   from Suren Baghdasaryan reduces mmap_lock contention during reads of
   /proc/pid/maps.
 
 - The 2 patch series "mm/mincore: minor clean up for swap cache
   checking" from Kairui Song performs some cleanup in the swap code.
 
 - The 11 patch series "mm: vm_normal_page*() improvements" from David
   Hildenbrand provides code cleanup in the pagemap code.
 
 - The 5 patch series "add persistent huge zero folio support" from
   Pankaj Raghav provides a block layer speedup by optionalls making the
   huge_zero_pagepersistent, instead of releasing it when its refcount
   falls to zero.
 
 - The 3 patch series "kho: fixes and cleanups" from Mike Rapoport adds a
   few touchups to the recently added Kexec Handover feature.
 
 - The 10 patch series "mm: make mm->flags a bitmap and 64-bit on all
   arches" from Lorenzo Stoakes turns mm_struct.flags into a bitmap.  To
   end the constant struggle with space shortage on 32-bit conflicting with
   64-bit's needs.
 
 - The 2 patch series "mm/swapfile.c and swap.h cleanup" from Chris Li
   cleans up some swap code.
 
 - The 7 patch series "selftests/mm: Fix false positives and skip
   unsupported tests" from Donet Tom fixes a few things in our selftests
   code.
 
 - The 7 patch series "prctl: extend PR_SET_THP_DISABLE to only provide
   THPs when advised" from David Hildenbrand "allows individual processes
   to opt-out of THP=always into THP=madvise, without affecting other
   workloads on the system".
 
   It's a long story - the [1/N] changelog spells out the considerations.
 
 - The 11 patch series "Add and use memdesc_flags_t" from Matthew Wilcox
   gets us started on the memdesc project.  Please see
   https://kernelnewbies.org/MatthewWilcox/Memdescs and
   https://blogs.oracle.com/linux/post/introducing-memdesc.
 
 - The 3 patch series "Tiny optimization for large read operations" from
   Chi Zhiling improves the efficiency of the pagecache read path.
 
 - The 5 patch series "Better split_huge_page_test result check" from Zi
   Yan improves our folio splitting selftest code.
 
 - The 2 patch series "test that rmap behaves as expected" from Wei Yang
   adds some rmap selftests.
 
 - The 3 patch series "remove write_cache_pages()" from Christoph Hellwig
   removes that function and converts its two remaining callers.
 
 - The 2 patch series "selftests/mm: uffd-stress fixes" from Dev Jain
   fixes some UFFD selftests issues.
 
 - The 3 patch series "introduce kernel file mapped folios" from Boris
   Burkov introduces the concept of "kernel file pages".  Using these
   permits btrfs to account its metadata pages to the root cgroup, rather
   than to the cgroups of random inappropriate tasks.
 
 - The 2 patch series "mm/pageblock: improve readability of some
   pageblock handling" from Wei Yang provides some readability improvements
   to the page allocator code.
 
 - The 11 patch series "mm/damon: support ARM32 with LPAE" from SeongJae
   Park teaches DAMON to understand arm32 highmem.
 
 - The 4 patch series "tools: testing: Use existing atomic.h for
   vma/maple tests" from Brendan Jackman performs some code cleanups and
   deduplication under tools/testing/.
 
 - The 2 patch series "maple_tree: Fix testing for 32bit compiles" from
   Liam Howlett fixes a couple of 32-bit issues in
   tools/testing/radix-tree.c.
 
 - The 2 patch series "kasan: unify kasan_enabled() and remove
   arch-specific implementations" from Sabyrzhan Tasbolatov moves KASAN
   arch-specific initialization code into a common arch-neutral
   implementation.
 
 - The 3 patch series "mm: remove zpool" from Johannes Weiner removes
   zspool - an indirection layer which now only redirects to a single thing
   (zsmalloc).
 
 - The 2 patch series "mm: task_stack: Stack handling cleanups" from
   Pasha Tatashin makes a couple of cleanups in the fork code.
 
 - The 37 patch series "mm: remove nth_page()" from David Hildenbrand
   makes rather a lot of adjustments at various nth_page() callsites,
   eventually permitting the removal of that undesirable helper function.
 
 - The 2 patch series "introduce kasan.write_only option in hw-tags" from
   Yeoreum Yun creates a KASAN read-only mode for ARM, using that
   architecture's memory tagging feature.  It is felt that a read-only mode
   KASAN is suitable for use in production systems rather than debug-only.
 
 - The 3 patch series "mm: hugetlb: cleanup hugetlb folio allocation"
   from Kefeng Wang does some tidying in the hugetlb folio allocation code.
 
 - The 12 patch series "mm: establish const-correctness for pointer
   parameters" from Max Kellermann makes quite a number of the MM API
   functions more accurate about the constness of their arguments.  This
   was getting in the way of subsystems (in this case CEPH) when they
   attempt to improving their own const/non-const accuracy.
 
 - The 7 patch series "Cleanup free_pages() misuse" from Vishal Moola
   fixes a number of code sites which were confused over when to use
   free_pages() vs __free_pages().
 
 - The 3 patch series "Add Rust abstraction for Maple Trees" from Alice
   Ryhl makes the mapletree code accessible to Rust.  Required by nouveau
   and by its forthcoming successor: the new Rust Nova driver.
 
 - The 2 patch series "selftests/mm: split_huge_page_test:
   split_pte_mapped_thp improvements" from David Hildenbrand adds a fix and
   some cleanups to the thp selftesting code.
 
 - The 14 patch series "mm, swap: introduce swap table as swap cache
   (phase I)" from Chris Li and Kairui Song is the first step along the
   path to implementing "swap tables" - a new approach to swap allocation
   and state tracking which is expected to yield speed and space
   improvements.  This patchset itself yields a 5-20% performance benefit
   in some situations.
 
 - The 3 patch series "Some ptdesc cleanups" from Matthew Wilcox utilizes
   the new memdesc layer to clean up the ptdesc code a little.
 
 - The 3 patch series "Fix va_high_addr_switch.sh test failure" from
   Chunyu Hu fixes some issues in our 5-level pagetable selftesting code.
 
 - The 2 patch series "Minor fixes for memory allocation profiling" from
   Suren Baghdasaryan addresses a couple of minor issues in relatively new
   memory allocation profiling feature.
 
 - The 3 patch series "Small cleanups" from Matthew Wilcox has a few
   cleanups in preparation for more memdesc work.
 
 - The 2 patch series "mm/damon: add addr_unit for DAMON_LRU_SORT and
   DAMON_RECLAIM" from Quanmin Yan makes some changes to DAMON in
   furtherance of supporting arm highmem.
 
 - The 2 patch series "selftests/mm: Add -Wunreachable-code and fix
   warnings" from Muhammad Anjum adds that compiler check to selftests code
   and fixes the fallout, by removing dead code.
 
 - The 10 patch series "Improvements to Victim Process Thawing and OOM
   Reaper Traversal Order" from zhongjinji makes a number of improvements
   in the OOM killer: mainly thawing a more appropriate group of victim
   threads so they can release resources.
 
 - The 5 patch series "mm/damon: misc fixups and improvements for 6.18"
   from SeongJae Park is a bunch of small and unrelated fixups for DAMON.
 
 - The 7 patch series "mm/damon: define and use DAMON initialization
   check function" from SeongJae Park implement reliability and
   maintainability improvements to a recently-added bug fix.
 
 - The 2 patch series "mm/damon/stat: expose auto-tuned intervals and
   non-idle ages" from SeongJae Park provides additional transparency to
   userspace clients of the DAMON_STAT information.
 
 - The 2 patch series "Expand scope of khugepaged anonymous collapse"
   from Dev Jain removes some constraints on khubepaged's collapsing of
   anon VMAs.  It also increases the success rate of MADV_COLLAPSE against
   an anon vma.
 
 - The 2 patch series "mm: do not assume file == vma->vm_file in
   compat_vma_mmap_prepare()" from Lorenzo Stoakes moves us further towards
   removal of file_operations.mmap().  This patchset concentrates upon
   clearing up the treatment of stacked filesystems.
 
 - The 6 patch series "mm: Improve mlock tracking for large folios" from
   Kiryl Shutsemau provides some fixes and improvements to mlock's tracking
   of large folios.  /proc/meminfo's "Mlocked" field became more accurate.
 
 - The 2 patch series "mm/ksm: Fix incorrect accounting of KSM counters
   during fork" from Donet Tom fixes several user-visible KSM stats
   inaccuracies across forks and adds selftest code to verify these
   counters.
 
 - The 2 patch series "mm_slot: fix the usage of mm_slot_entry" from Wei
   Yang addresses some potential but presently benign issues in KSM's
   mm_slot handling.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCaN3cywAKCRDdBJ7gKXxA
 jtaPAQDmIuIu7+XnVUK5V11hsQ/5QtsUeLHV3OsAn4yW5/3dEQD/UddRU08ePN+1
 2VRB0EwkLAdfMWW7TfiNZ+yhuoiL/AA=
 =4mhY
 -----END PGP SIGNATURE-----

Merge tag 'mm-stable-2025-10-01-19-00' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull MM updates from Andrew Morton:

 - "mm, swap: improve cluster scan strategy" from Kairui Song improves
   performance and reduces the failure rate of swap cluster allocation

 - "support large align and nid in Rust allocators" from Vitaly Wool
   permits Rust allocators to set NUMA node and large alignment when
   perforning slub and vmalloc reallocs

 - "mm/damon/vaddr: support stat-purpose DAMOS" from Yueyang Pan extend
   DAMOS_STAT's handling of the DAMON operations sets for virtual
   address spaces for ops-level DAMOS filters

 - "execute PROCMAP_QUERY ioctl under per-vma lock" from Suren
   Baghdasaryan reduces mmap_lock contention during reads of
   /proc/pid/maps

 - "mm/mincore: minor clean up for swap cache checking" from Kairui Song
   performs some cleanup in the swap code

 - "mm: vm_normal_page*() improvements" from David Hildenbrand provides
   code cleanup in the pagemap code

 - "add persistent huge zero folio support" from Pankaj Raghav provides
   a block layer speedup by optionalls making the
   huge_zero_pagepersistent, instead of releasing it when its refcount
   falls to zero

 - "kho: fixes and cleanups" from Mike Rapoport adds a few touchups to
   the recently added Kexec Handover feature

 - "mm: make mm->flags a bitmap and 64-bit on all arches" from Lorenzo
   Stoakes turns mm_struct.flags into a bitmap. To end the constant
   struggle with space shortage on 32-bit conflicting with 64-bit's
   needs

 - "mm/swapfile.c and swap.h cleanup" from Chris Li cleans up some swap
   code

 - "selftests/mm: Fix false positives and skip unsupported tests" from
   Donet Tom fixes a few things in our selftests code

 - "prctl: extend PR_SET_THP_DISABLE to only provide THPs when advised"
   from David Hildenbrand "allows individual processes to opt-out of
   THP=always into THP=madvise, without affecting other workloads on the
   system".

   It's a long story - the [1/N] changelog spells out the considerations

 - "Add and use memdesc_flags_t" from Matthew Wilcox gets us started on
   the memdesc project. Please see

      https://kernelnewbies.org/MatthewWilcox/Memdescs and
      https://blogs.oracle.com/linux/post/introducing-memdesc

 - "Tiny optimization for large read operations" from Chi Zhiling
   improves the efficiency of the pagecache read path

 - "Better split_huge_page_test result check" from Zi Yan improves our
   folio splitting selftest code

 - "test that rmap behaves as expected" from Wei Yang adds some rmap
   selftests

 - "remove write_cache_pages()" from Christoph Hellwig removes that
   function and converts its two remaining callers

 - "selftests/mm: uffd-stress fixes" from Dev Jain fixes some UFFD
   selftests issues

 - "introduce kernel file mapped folios" from Boris Burkov introduces
   the concept of "kernel file pages". Using these permits btrfs to
   account its metadata pages to the root cgroup, rather than to the
   cgroups of random inappropriate tasks

 - "mm/pageblock: improve readability of some pageblock handling" from
   Wei Yang provides some readability improvements to the page allocator
   code

 - "mm/damon: support ARM32 with LPAE" from SeongJae Park teaches DAMON
   to understand arm32 highmem

 - "tools: testing: Use existing atomic.h for vma/maple tests" from
   Brendan Jackman performs some code cleanups and deduplication under
   tools/testing/

 - "maple_tree: Fix testing for 32bit compiles" from Liam Howlett fixes
   a couple of 32-bit issues in tools/testing/radix-tree.c

 - "kasan: unify kasan_enabled() and remove arch-specific
   implementations" from Sabyrzhan Tasbolatov moves KASAN arch-specific
   initialization code into a common arch-neutral implementation

 - "mm: remove zpool" from Johannes Weiner removes zspool - an
   indirection layer which now only redirects to a single thing
   (zsmalloc)

 - "mm: task_stack: Stack handling cleanups" from Pasha Tatashin makes a
   couple of cleanups in the fork code

 - "mm: remove nth_page()" from David Hildenbrand makes rather a lot of
   adjustments at various nth_page() callsites, eventually permitting
   the removal of that undesirable helper function

 - "introduce kasan.write_only option in hw-tags" from Yeoreum Yun
   creates a KASAN read-only mode for ARM, using that architecture's
   memory tagging feature. It is felt that a read-only mode KASAN is
   suitable for use in production systems rather than debug-only

 - "mm: hugetlb: cleanup hugetlb folio allocation" from Kefeng Wang does
   some tidying in the hugetlb folio allocation code

 - "mm: establish const-correctness for pointer parameters" from Max
   Kellermann makes quite a number of the MM API functions more accurate
   about the constness of their arguments. This was getting in the way
   of subsystems (in this case CEPH) when they attempt to improving
   their own const/non-const accuracy

 - "Cleanup free_pages() misuse" from Vishal Moola fixes a number of
   code sites which were confused over when to use free_pages() vs
   __free_pages()

 - "Add Rust abstraction for Maple Trees" from Alice Ryhl makes the
   mapletree code accessible to Rust. Required by nouveau and by its
   forthcoming successor: the new Rust Nova driver

 - "selftests/mm: split_huge_page_test: split_pte_mapped_thp
   improvements" from David Hildenbrand adds a fix and some cleanups to
   the thp selftesting code

 - "mm, swap: introduce swap table as swap cache (phase I)" from Chris
   Li and Kairui Song is the first step along the path to implementing
   "swap tables" - a new approach to swap allocation and state tracking
   which is expected to yield speed and space improvements. This
   patchset itself yields a 5-20% performance benefit in some situations

 - "Some ptdesc cleanups" from Matthew Wilcox utilizes the new memdesc
   layer to clean up the ptdesc code a little

 - "Fix va_high_addr_switch.sh test failure" from Chunyu Hu fixes some
   issues in our 5-level pagetable selftesting code

 - "Minor fixes for memory allocation profiling" from Suren Baghdasaryan
   addresses a couple of minor issues in relatively new memory
   allocation profiling feature

 - "Small cleanups" from Matthew Wilcox has a few cleanups in
   preparation for more memdesc work

 - "mm/damon: add addr_unit for DAMON_LRU_SORT and DAMON_RECLAIM" from
   Quanmin Yan makes some changes to DAMON in furtherance of supporting
   arm highmem

 - "selftests/mm: Add -Wunreachable-code and fix warnings" from Muhammad
   Anjum adds that compiler check to selftests code and fixes the
   fallout, by removing dead code

 - "Improvements to Victim Process Thawing and OOM Reaper Traversal
   Order" from zhongjinji makes a number of improvements in the OOM
   killer: mainly thawing a more appropriate group of victim threads so
   they can release resources

 - "mm/damon: misc fixups and improvements for 6.18" from SeongJae Park
   is a bunch of small and unrelated fixups for DAMON

 - "mm/damon: define and use DAMON initialization check function" from
   SeongJae Park implement reliability and maintainability improvements
   to a recently-added bug fix

 - "mm/damon/stat: expose auto-tuned intervals and non-idle ages" from
   SeongJae Park provides additional transparency to userspace clients
   of the DAMON_STAT information

 - "Expand scope of khugepaged anonymous collapse" from Dev Jain removes
   some constraints on khubepaged's collapsing of anon VMAs. It also
   increases the success rate of MADV_COLLAPSE against an anon vma

 - "mm: do not assume file == vma->vm_file in compat_vma_mmap_prepare()"
   from Lorenzo Stoakes moves us further towards removal of
   file_operations.mmap(). This patchset concentrates upon clearing up
   the treatment of stacked filesystems

 - "mm: Improve mlock tracking for large folios" from Kiryl Shutsemau
   provides some fixes and improvements to mlock's tracking of large
   folios. /proc/meminfo's "Mlocked" field became more accurate

 - "mm/ksm: Fix incorrect accounting of KSM counters during fork" from
   Donet Tom fixes several user-visible KSM stats inaccuracies across
   forks and adds selftest code to verify these counters

 - "mm_slot: fix the usage of mm_slot_entry" from Wei Yang addresses
   some potential but presently benign issues in KSM's mm_slot handling

* tag 'mm-stable-2025-10-01-19-00' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (372 commits)
  mm: swap: check for stable address space before operating on the VMA
  mm: convert folio_page() back to a macro
  mm/khugepaged: use start_addr/addr for improved readability
  hugetlbfs: skip VMAs without shareable locks in hugetlb_vmdelete_list
  alloc_tag: fix boot failure due to NULL pointer dereference
  mm: silence data-race in update_hiwater_rss
  mm/memory-failure: don't select MEMORY_ISOLATION
  mm/khugepaged: remove definition of struct khugepaged_mm_slot
  mm/ksm: get mm_slot by mm_slot_entry() when slot is !NULL
  hugetlb: increase number of reserving hugepages via cmdline
  selftests/mm: add fork inheritance test for ksm_merging_pages counter
  mm/ksm: fix incorrect KSM counter handling in mm_struct during fork
  drivers/base/node: fix double free in register_one_node()
  mm: remove PMD alignment constraint in execmem_vmalloc()
  mm/memory_hotplug: fix typo 'esecially' -> 'especially'
  mm/rmap: improve mlock tracking for large folios
  mm/filemap: map entire large folio faultaround
  mm/fault: try to map the entire file folio in finish_fault()
  mm/rmap: mlock large folios in try_to_unmap_one()
  mm/rmap: fix a mlock race condition in folio_referenced_one()
  ...
2025-10-02 18:18:33 -07:00
Li Chen
98b7bf5433 loop: fix backing file reference leak on validation error
loop_change_fd() and loop_configure() call loop_check_backing_file()
to validate the new backing file. If validation fails, the reference
acquired by fget() was not dropped, leaking a file reference.

Fix this by calling fput(file) before returning the error.

Cc: stable@vger.kernel.org
Cc: Markus Elfring <Markus.Elfring@web.de>
CC: Yang Erkun <yangerkun@huawei.com>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Yu Kuai <yukuai1@huaweicloud.com>
Fixes: f5c84eff63 ("loop: Add sanity check for read/write_iter")
Signed-off-by: Li Chen <chenl311@chinatelecom.cn>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Yang Erkun <yangerkun@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-10-02 15:25:47 -06:00
Linus Torvalds
e1b1d03cee for-6.18/block-20250929
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmjbLCgQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpoY0D/9J+11BC88pBxCrLKv/V2TwCNokRMi0dU3L
 r3EUdA46k0oXmvb6ueZqIcfY2e+IX7rdQkaRbh1zRdsNejqHo4548C3ePWGdBAcM
 OdNEGfpehO0aD0td1+mK/NxoJMLhbs5QraPanz+SOkGZOKeF+vGCga5PUDivsr5J
 16T9yb7i+isENLdAc2RJbZVyAphqHQlo5GHi5ZIKOVi5cNt8GU/q2sQl7NYmGvHd
 aq37svvZHFOhLRajP959Fw9WOxEYITewzQ4UYf1FZjUodJUxO+vCnP0ooBQRlyu8
 1B4PYWwSE+Vn3GkQE0Om+mzo9AVPOiLmoAWGxdgJBMyEkZndocr46XEslXOufQ1Z
 T3Gu19G6jCxcyByNVhjVnaajYKmvSQAy1w75m4XlfqTRm4f9Om+LAJavUk3RuaOL
 7lXKQ7Ql1/Tby9Jmf8afjYYXXotNDNku6rz2P3qtOwAA26mNJfgVt0rO+8XGRDe9
 ioLbCkTjslYMc/Oh4jSsbrspsVALbaQMq/Dmah8k0EWb4QAHVgCJyGBoff3hOboI
 jD6B1enaKOQVgcjWcjm/FjOk3jv2h3v4X26YWQZTvEc/1PnSnST78Zi/ePhzDdmt
 sBALUAS37TfTgNMzrhbHl5Zs13k0C0XyANuayuKuo5hlNnC1wbdap+5FZJOmpuOB
 YT+VkYnaOA==
 =kOmc
 -----END PGP SIGNATURE-----

Merge tag 'for-6.18/block-20250929' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux

Pull block updates from Jens Axboe:

 - NVMe pull request via Keith:
     - FC target fixes (Daniel)
     - Authentication fixes and updates (Martin, Chris)
     - Admin controller handling (Kamaljit)
     - Target lockdep assertions (Max)
     - Keep-alive updates for discovery (Alastair)
     - Suspend quirk (Georg)

 - MD pull request via Yu:
     - Add support for a lockless bitmap.

       A key feature for the new bitmap are that the IO fastpath is
       lockless. If a user issues lots of write IO to the same bitmap
       bit in a short time, only the first write has additional overhead
       to update bitmap bit, no additional overhead for the following
       writes.

       By supporting only resync or recover written data, means in the
       case creating new array or replacing with a new disk, there is no
       need to do a full disk resync/recovery.

 - Switch ->getgeo() and ->bios_param() to using struct gendisk rather
   than struct block_device.

 - Rust block changes via Andreas. This series adds configuration via
   configfs and remote completion to the rnull driver. The series also
   includes a set of changes to the rust block device driver API: a few
   cleanup patches, and a few features supporting the rnull changes.

   The series removes the raw buffer formatting logic from
   `kernel::block` and improves the logic available in `kernel::string`
   to support the same use as the removed logic.

 - floppy arch cleanups

 - Reduce the number of dereferencing needed for ublk commands

 - Restrict supported sockets for nbd. Mostly done to eliminate a class
   of issues perpetually reported by syzbot, by using nonsensical socket
   setups.

 - A few s390 dasd block fixes

 - Fix a few issues around atomic writes

 - Improve DMA interation for integrity requests

 - Improve how iovecs are treated with regards to O_DIRECT aligment
   constraints.

   We used to require each segment to adhere to the constraints, now
   only the request as a whole needs to.

 - Clean up and improve p2p support, enabling use of p2p for metadata
   payloads

 - Improve locking of request lookup, using SRCU where appropriate

 - Use page references properly for brd, avoiding very long RCU sections

 - Fix ordering of recursively submitted IOs

 - Clean up and improve updating nr_requests for a live device

 - Various fixes and cleanups

* tag 'for-6.18/block-20250929' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux: (164 commits)
  s390/dasd: enforce dma_alignment to ensure proper buffer validation
  s390/dasd: Return BLK_STS_INVAL for EINVAL from do_dasd_request
  ublk: remove redundant zone op check in ublk_setup_iod()
  nvme: Use non zero KATO for persistent discovery connections
  nvmet: add safety check for subsys lock
  nvme-core: use nvme_is_io_ctrl() for I/O controller check
  nvme-core: do ioccsz/iorcsz validation only for I/O controllers
  nvme-core: add method to check for an I/O controller
  blk-cgroup: fix possible deadlock while configuring policy
  blk-mq: fix null-ptr-deref in blk_mq_free_tags() from error path
  blk-mq: Fix more tag iteration function documentation
  selftests: ublk: fix behavior when fio is not installed
  ublk: don't access ublk_queue in ublk_unmap_io()
  ublk: pass ublk_io to __ublk_complete_rq()
  ublk: don't access ublk_queue in ublk_need_complete_req()
  ublk: don't access ublk_queue in ublk_check_commit_and_fetch()
  ublk: don't pass ublk_queue to ublk_fetch()
  ublk: don't access ublk_queue in ublk_config_io_buf()
  ublk: don't access ublk_queue in ublk_check_fetch_buf()
  ublk: pass q_id and tag to __ublk_check_and_get_req()
  ...
2025-10-02 10:16:56 -07:00
Linus Torvalds
5832d26433 for-6.18/io_uring-20250929
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmjbLEcQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpnEUD/4/FgfQP2LFS/88BBF5ukZjRySe4wmyyZ2Q
 MFh2ehdxzkZxVXjbeA2wRAXdqjw2MbNhx8tzU9VrW7rweNDZxHbwi6jJIP7OAjxE
 4ZP0goAQj7P0TFyXC2KGj7k6dP20FkAltx5gGLVwsuOWDDrQKp2EykAcRnGYAD4W
 3yf+nojVr2bjHyO7dx8dM7jUDjMg7J8nmHD6zgHOlHRLblWwfzw907bhz+eBX/FI
 9kYvtX2c9MgY4Isa+43rZd5qvj9S3Cs8PD6tFPbq+n+3l7yWgMBTu/y+SNI8hupT
 W7CqjPcpvppFHhPkcXDA3yARnW7ccEx5aiQuvUCmRUioHtGwXvC63HMp8OjcQspV
 NNoIHYFsi1alzYq2kJLxY1IleWZ8j0hUkSSU8u7al8VIvtD43LGkv51xavxQUFjg
 BO9mLyS51H2agffySs4vhHJE82lZizvmh/RJfSJ0ezALzE2k42MrximX1D1rBJE6
 KPOhCiPt/jqpQMyqDYnY10FgTXQVwgPIVH1JLpo611tPFHlGW8Y4YxxR1Xduh5JX
 jbGLEjVREsDZ7EHrimLNLmJRAQpyQujv/yhf7k96gWBelVwVuISQLI4Ca5IeVQyk
 9yifgLXNGddgAwj0POMFeKXSm2We9nrrPDYLCKrsBMSN96/3SLveJC7fkW88aUZr
 ye4/K8Y3vA==
 =uc/3
 -----END PGP SIGNATURE-----

Merge tag 'for-6.18/io_uring-20250929' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux

Pull io_uring updates from Jens Axboe:

 - Store ring provided buffers locally for the users, rather than stuff
   them into struct io_kiocb.

   These types of buffers must always be fully consumed or recycled in
   the current context, and leaving them in struct io_kiocb is hence not
   a good ideas as that struct has a vastly different life time.

   Basically just an architecture cleanup that can help prevent issues
   with ring provided buffers in the future.

 - Support for mixed CQE sizes in the same ring.

   Before this change, a CQ ring either used the default 16b CQEs, or it
   was setup with 32b CQE using IORING_SETUP_CQE32. For use cases where
   a few 32b CQEs were needed, this caused everything else to use big
   CQEs. This is wasteful both in terms of memory usage, but also memory
   bandwidth for the posted CQEs.

   With IORING_SETUP_CQE_MIXED, applications may use request types that
   post both normal 16b and big 32b CQEs on the same ring.

 - Add helpers for async data management, to make it harder for opcode
   handlers to mess it up.

 - Add support for multishot for uring_cmd, which ublk can use. This
   helps improve efficiency, by providing a persistent request type that
   can trigger multiple CQEs.

 - Add initial support for ring feature querying.

   We had basic support for probe operations, but the API isn't great.
   Rather than expand that, add support for QUERY which is easily
   expandable and can cover a lot more cases than the existing probe
   support. This will help applications get a better idea of what
   operations are supported on a given host.

 - zcrx improvements from Pavel:
        - Improve refill entry alignment for better caching
        - Various cleanups, especially around deduplicating normal
          memory vs dmabuf setup.
        - Generalisation of the niov size (Patch 12). It's still hard
          coded to PAGE_SIZE on init, but will let the user to specify
          the rx buffer length on setup.
        - Syscall / synchronous bufer return. It'll be used as a slow
          fallback path for returning buffers when the refill queue is
          full. Useful for tolerating slight queue size misconfiguration
          or with inconsistent load.
        - Accounting more memory to cgroups.
        - Additional independent cleanups that will also be useful for
          mutli-area support.

 - Various fixes and cleanups

* tag 'for-6.18/io_uring-20250929' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux: (68 commits)
  io_uring/cmd: drop unused res2 param from io_uring_cmd_done()
  io_uring: fix nvme's 32b cqes on mixed cq
  io_uring/query: cap number of queries
  io_uring/query: prevent infinite loops
  io_uring/zcrx: account niov arrays to cgroup
  io_uring/zcrx: allow synchronous buffer return
  io_uring/zcrx: introduce io_parse_rqe()
  io_uring/zcrx: don't adjust free cache space
  io_uring/zcrx: use guards for the refill lock
  io_uring/zcrx: reduce netmem scope in refill
  io_uring/zcrx: protect netdev with pp_lock
  io_uring/zcrx: rename dma lock
  io_uring/zcrx: make niov size variable
  io_uring/zcrx: set sgt for umem area
  io_uring/zcrx: remove dmabuf_offset
  io_uring/zcrx: deduplicate area mapping
  io_uring/zcrx: pass ifq to io_zcrx_alloc_fallback()
  io_uring/zcrx: check all niovs filled with dma addresses
  io_uring/zcrx: move area reg checks into io_import_area
  io_uring/zcrx: don't pass slot to io_zcrx_create_area
  ...
2025-10-02 09:56:23 -07:00
Linus Torvalds
f4e0ff7e45 Rust changes for v6.18
Toolchain and infrastructure:
 
  - Derive 'Zeroable' for all structs and unions generated by 'bindgen'
    where possible and corresponding cleanups. To do so, add the
    'pin-init' crate as a dependency to 'bindings' and 'uapi'.
 
    It also includes its first use in the 'cpufreq' module, with more to
    come in the next cycle.
 
  - Add warning to the 'rustdoc' target to detect broken 'srctree/' links
    and fix existing cases.
 
  - Remove support for unused (since v6.16) host '#[test]'s, simplifying
    the 'rusttest' target. Tests should generally run within KUnit.
 
 'kernel' crate:
 
  - Add 'ptr' module with a new 'Alignment' type, which is always a power
    of two and is used to validate that a given value is a valid
    alignment and to perform masking and alignment operations:
 
        // Checked at build time.
        assert_eq!(Alignment:🆕:<16>().as_usize(), 16);
 
        // Checked at runtime.
        assert_eq!(Alignment::new_checked(15), None);
 
        assert_eq!(Alignment::of::<u8>().log2(), 0);
 
        assert_eq!(0x25u8.align_down(Alignment:🆕:<0x10>()), 0x20);
        assert_eq!(0x5u8.align_up(Alignment:🆕:<0x10>()), Some(0x10));
        assert_eq!(u8::MAX.align_up(Alignment:🆕:<0x10>()), None);
 
    It also includes its first use in Nova.
 
  - Add 'core::mem::{align,size}_of{,_val}' to the prelude, matching
    Rust 1.80.0.
 
  - Keep going with the steps on our migration to the standard library
    'core::ffi::CStr' type (use 'kernel::{fmt, prelude::fmt!}' and use
    upstream method names).
 
  - 'error' module: improve 'Error::from_errno' and 'to_result'
    documentation, including examples/tests.
 
  - 'sync' module: extend 'aref' submodule documentation now that it
    exists, and more updates to complete the ongoing move of 'ARef' and
    'AlwaysRefCounted' to 'sync::aref'.
 
  - 'list' module: add an example/test for 'ListLinksSelfPtr' usage.
 
  - 'alloc' module:
 
    - Implement 'Box::pin_slice()', which constructs a pinned slice of
      elements.
 
    - Provide information about the minimum alignment guarantees of
      'Kmalloc', 'Vmalloc' and 'KVmalloc'.
 
    - Take minimum alignment guarantees of allocators for
      'ForeignOwnable' into account.
 
    - Remove the 'allocator_test' (including 'Cmalloc').
 
    - Add doctest for 'Vec::as_slice()'.
 
    - Constify various methods.
 
  - 'time' module:
 
    - Add methods on 'HrTimer' that can only be called with exclusive
      access to an unarmed timer, or from timer callback context.
 
    - Add arithmetic operations to 'Instant' and 'Delta'.
 
    - Add a few convenience and access methods to 'HrTimer' and
      'Instant'.
 
 'macros' crate:
 
  - Reduce collections in 'quote!' macro.
 
 And a few other cleanups and improvements.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEPjU5OPd5QIZ9jqqOGXyLc2htIW0FAmjZb3kACgkQGXyLc2ht
 IW2+PA//T23FOYFjN2M+N2qBFocL4qBK0nSjp1UnnTsJ7ohlLU3orApY/Nl2DJTq
 oO7SrWrdw6OVapvJN9IC2Jk0SfgFEiGu4L/eDg/xzkRmw89GGOOv+gp8gzz190mH
 vZS5Nbbvs1GOlALA0BxwJG0vXtAu1de284/v0CCzXms6tCxSaUSes0vB7JNNzBSW
 u73StEM5WlU3giGvnREl2lyX+jUFwG3mtfwpOuNavSYi3yO9Kg1pIIeP7ie/qrKF
 30D8X3VacO2JcGC1qpQPsFCSfIlNl0yjkVPpFi8mIQO/XEqcej3tlojXq9oyP9Tj
 tXcQL37ayBYnFSMPbelbOyTsgIyU9WwIJF+4V8u1H2C89k3f7/zqj+RMvW4y90Dc
 /43z0OwW/N5PzUQ/EyTY0JYfMeNcsOyVcGXYawD/0pZuHgOz39WHPJSdq+wMpZUy
 XESd6tr7ZHZudDsX+oq0hO1AI3pwkMvievFKG7ZtUiIcR9slv246M+WroOJcZUJ3
 6I9v/f/z9rxsIYExQA2rTHiJ0+GAuXZ5lH5x/owZEZmzN3WLCHwuMoaIp/oL6387
 y17yBpDtp6ar4B1KJILjuyjTF/kehazCNy7uiG1P8KTiCRUUTueIDYs257NMghg2
 VKkyfdABAwgnsdrGLQXgRmI09RHg0xqslgT11DiPmLVVxJYCeLI=
 =+VG2
 -----END PGP SIGNATURE-----

Merge tag 'rust-6.18' of git://git.kernel.org/pub/scm/linux/kernel/git/ojeda/linux

Pull rust updates from Miguel Ojeda:
 "Toolchain and infrastructure:

   - Derive 'Zeroable' for all structs and unions generated by 'bindgen'
     where possible and corresponding cleanups. To do so, add the
     'pin-init' crate as a dependency to 'bindings' and 'uapi'.

     It also includes its first use in the 'cpufreq' module, with more
     to come in the next cycle.

   - Add warning to the 'rustdoc' target to detect broken 'srctree/'
     links and fix existing cases.

   - Remove support for unused (since v6.16) host '#[test]'s,
     simplifying the 'rusttest' target. Tests should generally run
     within KUnit.

  'kernel' crate:

   - Add 'ptr' module with a new 'Alignment' type, which is always a
     power of two and is used to validate that a given value is a valid
     alignment and to perform masking and alignment operations:

         // Checked at build time.
         assert_eq!(Alignment:🆕:<16>().as_usize(), 16);

         // Checked at runtime.
         assert_eq!(Alignment::new_checked(15), None);

         assert_eq!(Alignment::of::<u8>().log2(), 0);

         assert_eq!(0x25u8.align_down(Alignment:🆕:<0x10>()), 0x20);
         assert_eq!(0x5u8.align_up(Alignment:🆕:<0x10>()), Some(0x10));
         assert_eq!(u8::MAX.align_up(Alignment:🆕:<0x10>()), None);

     It also includes its first use in Nova.

   - Add 'core::mem::{align,size}_of{,_val}' to the prelude, matching
     Rust 1.80.0.

   - Keep going with the steps on our migration to the standard library
     'core::ffi::CStr' type (use 'kernel::{fmt, prelude::fmt!}' and use
     upstream method names).

   - 'error' module: improve 'Error::from_errno' and 'to_result'
     documentation, including examples/tests.

   - 'sync' module: extend 'aref' submodule documentation now that it
     exists, and more updates to complete the ongoing move of 'ARef' and
     'AlwaysRefCounted' to 'sync::aref'.

   - 'list' module: add an example/test for 'ListLinksSelfPtr' usage.

   - 'alloc' module:

      - Implement 'Box::pin_slice()', which constructs a pinned slice of
        elements.

      - Provide information about the minimum alignment guarantees of
        'Kmalloc', 'Vmalloc' and 'KVmalloc'.

      - Take minimum alignment guarantees of allocators for
        'ForeignOwnable' into account.

      - Remove the 'allocator_test' (including 'Cmalloc').

      - Add doctest for 'Vec::as_slice()'.

      - Constify various methods.

   - 'time' module:

      - Add methods on 'HrTimer' that can only be called with exclusive
        access to an unarmed timer, or from timer callback context.

      - Add arithmetic operations to 'Instant' and 'Delta'.

      - Add a few convenience and access methods to 'HrTimer' and
        'Instant'.

  'macros' crate:

   - Reduce collections in 'quote!' macro.

  And a few other cleanups and improvements"

* tag 'rust-6.18' of git://git.kernel.org/pub/scm/linux/kernel/git/ojeda/linux: (58 commits)
  gpu: nova-core: use Alignment for alignment-related operations
  rust: add `Alignment` type
  rust: macros: reduce collections in `quote!` macro
  rust: acpi: use `core::ffi::CStr` method names
  rust: of: use `core::ffi::CStr` method names
  rust: net: use `core::ffi::CStr` method names
  rust: miscdevice: use `core::ffi::CStr` method names
  rust: kunit: use `core::ffi::CStr` method names
  rust: firmware: use `core::ffi::CStr` method names
  rust: drm: use `core::ffi::CStr` method names
  rust: cpufreq: use `core::ffi::CStr` method names
  rust: configfs: use `core::ffi::CStr` method names
  rust: auxiliary: use `core::ffi::CStr` method names
  drm/panic: use `core::ffi::CStr` method names
  rust: device: use `kernel::{fmt,prelude::fmt!}`
  rust: sync: use `kernel::{fmt,prelude::fmt!}`
  rust: seq_file: use `kernel::{fmt,prelude::fmt!}`
  rust: kunit: use `kernel::{fmt,prelude::fmt!}`
  rust: file: use `kernel::{fmt,prelude::fmt!}`
  rust: device: use `kernel::{fmt,prelude::fmt!}`
  ...
2025-09-30 19:12:49 -07:00
Caleb Sander Mateos
f85e254b51 ublk: remove redundant zone op check in ublk_setup_iod()
ublk_setup_iod() checks first whether the request is a zoned operation
issued to a device without zoned support and returns BLK_STS_IOERR if
so. However, such a request would already hit the default case in the
subsequent switch statement and fail the ublk_queue_is_zoned() check,
which also results in a return of BLK_STS_IOERR. So remove the redundant
early check for unsupported zone ops.

Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-24 05:48:05 -06:00