linux/drivers/block
Damien Le Moal fcc6eaa3a0 zloop: introduce the ordered_zone_append configuration parameter
Zone append operations on zloop devices are processed like any other
command: each operation is handled as a command work item, without any
special serialization between the work items (besides the zone mutex
for mutually exclusive code sections).

This processing is fine and gives excellent performance. However, it has
a side effect: zone append operations are very often reordered and
processed in a sequence that is very different from the order in which
the user issued them. This effect is very visible using an XFS file
system on top of a zloop device. A simple file write leads to many file
extents because the data writes using zone append are reordered, so the
physical order differs from the file logical order.
E.g. executing:

$ dd if=/dev/zero of=/mnt/test bs=1M count=10 && sync
$ xfs_bmap /mnt/test
/mnt/test:
	0: [0..4095]: 2162688..2166783
	1: [4096..6143]: 2168832..2170879
	2: [6144..8191]: 2166784..2168831
	3: [8192..10239]: 2170880..2172927
	4: [10240..12287]: 2174976..2177023
	5: [12288..14335]: 2172928..2174975
	6: [14336..20479]: 2177024..2183167

For 10 IOs, 7 extents are created.

This is fine and actually exercises XFS zone garbage collection very
well. However, it also makes debugging and working on XFS data
placement harder, as the underlying device will reorder IOs most of the
time, resulting in many file extents.

Allow a user to mitigate this with the new ordered_zone_append
configuration parameter. For a zloop device created with this parameter
specified, the sector of a zone append command is set early, when the
command is submitted by the block layer with the zloop_queue_rq()
function, instead of in the zloop_rw() function, which is executed later
in the command work item context. This change ensures that, more often
than not, zone append data ends up being written in the same order as
the commands were submitted by the user.
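
The effect of moving the sector assignment can be sketched with a small
userspace model. This is not the zloop code: every name below
(model_zone, model_cmd, model_run_unordered(), model_run_ordered(), ...)
is hypothetical, and the model assumes that the only thing that matters
for placement is the order in which commands read and advance the zone
write pointer.

```c
#include <stddef.h>

/* Hypothetical model of one zone: only the write pointer matters here. */
struct model_zone {
	unsigned long long wp;		/* next sector to be written */
};

/* Hypothetical model of a zone append command. */
struct model_cmd {
	unsigned int nr_sectors;	/* length of the append */
	unsigned long long sector;	/* target sector, once assigned */
};

/* Assign the target sector and advance the zone write pointer. */
static void model_assign_sector(struct model_zone *z, struct model_cmd *c)
{
	c->sector = z->wp;
	z->wp += c->nr_sectors;
}

/*
 * Default behavior in this model: the sector is assigned when the work
 * item runs, so exec_order (a permutation of the submission indexes)
 * decides the data placement.
 */
static void model_run_unordered(struct model_zone *z, struct model_cmd *cmds,
				const size_t *exec_order, size_t n)
{
	for (size_t i = 0; i < n; i++)
		model_assign_sector(z, &cmds[exec_order[i]]);
}

/*
 * Ordered behavior in this model: sectors are assigned in submission
 * order. The work items may still run in exec_order afterwards, but
 * the placement is already fixed, so exec_order is ignored.
 */
static void model_run_ordered(struct model_zone *z, struct model_cmd *cmds,
			      const size_t *exec_order, size_t n)
{
	(void)exec_order;
	for (size_t i = 0; i < n; i++)
		model_assign_sector(z, &cmds[i]);
}

/* Return 1 if the commands are contiguous in submission order. */
static int model_is_sequential(const struct model_cmd *cmds, size_t n)
{
	for (size_t i = 1; i < n; i++)
		if (cmds[i].sector !=
		    cmds[i - 1].sector + cmds[i - 1].nr_sectors)
			return 0;
	return 1;
}
```

With three equally sized commands executed in the order 0, 2, 1, the
unordered model places the second command after the third one, while
the ordered model keeps them contiguous in submission order. The real
driver only makes this outcome more likely, since submissions
themselves can still race.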

In the case of XFS, this leads to far fewer file data extents. E.g., for
the previous example, we get a single file data extent for the written
file.

$ dd if=/dev/zero of=/mnt/test bs=1M count=10 && sync
$ xfs_bmap /mnt/test
/mnt/test:
	0: [0..20479]: 2162688..2183167

Since we cannot use a mutex in the context of the zloop_queue_rq()
function to atomically set a zone append operation sector to the target
zone write pointer location and increment the write pointer, a new
per-zone spinlock is introduced to protect accesses to and modifications
of a zone write pointer. To check a zone write pointer location and set
a zone append operation target sector to that value, the function
zloop_set_zone_append_sector() is introduced and called from
zloop_queue_rq().
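
As a rough illustration of the locking pattern described above, the
following userspace sketch reserves a target sector under a spinlock.
It is not the driver implementation: the sketch_* names, the structure
layout and the "does not fit in the zone" check are assumptions made
for this example, and a C11 atomic_flag stands in for the kernel
spinlock_t.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Userspace stand-in for a kernel spinlock, built on a C11 atomic_flag. */
struct sketch_spinlock {
	atomic_flag locked;
};

static void sketch_spin_lock(struct sketch_spinlock *l)
{
	/* Busy-wait until the flag was previously clear; never sleeps. */
	while (atomic_flag_test_and_set_explicit(&l->locked,
						 memory_order_acquire))
		;
}

static void sketch_spin_unlock(struct sketch_spinlock *l)
{
	atomic_flag_clear_explicit(&l->locked, memory_order_release);
}

/* Hypothetical per-zone state: the lock only protects the write pointer. */
struct sketch_zone {
	struct sketch_spinlock wp_lock;
	unsigned long long start;	/* first sector of the zone */
	unsigned long long capacity;	/* writable sectors in the zone */
	unsigned long long wp;		/* current write pointer */
};

static void sketch_zone_init(struct sketch_zone *z, unsigned long long start,
			     unsigned long long capacity)
{
	atomic_flag_clear(&z->wp_lock.locked);
	z->start = start;
	z->capacity = capacity;
	z->wp = start;
}

/*
 * Read the write pointer, hand it out as the append target sector and
 * advance it, all in one critical section. Returns false, leaving the
 * zone untouched, if the append does not fit (an assumed policy).
 */
static bool sketch_set_zone_append_sector(struct sketch_zone *z,
					  unsigned long long nr_sectors,
					  unsigned long long *sector)
{
	bool ok = false;

	sketch_spin_lock(&z->wp_lock);
	if (nr_sectors <= z->start + z->capacity - z->wp) {
		*sector = z->wp;
		z->wp += nr_sectors;
		ok = true;
	}
	sketch_spin_unlock(&z->wp_lock);

	return ok;
}
```

A spinlock fits here because the critical section is a few loads and
stores and must not sleep, which is the constraint that rules out the
zone mutex in the submission path.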

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-17 09:40:09 -07:00
..
aoe Summary of significant series in this pull request: 2025-10-02 18:18:33 -07:00
drbd drbd: replace kmap() with kmap_local_page() in receiver path 2025-11-03 08:15:54 -07:00
mtip32xx block: switch ->getgeo() to struct gendisk 2025-08-13 02:59:29 -04:00
null_blk null_blk: fix zone read length beyond write pointer 2025-11-12 10:02:56 -07:00
rnbd drivers/block: WQ_PERCPU added to alloc_workqueue users 2025-09-09 09:11:31 -06:00
rnull rust: block: update ARef and AlwaysRefCounted imports from sync::aref 2025-11-05 18:24:10 -07:00
xen-blkback xen/blkback: convert timeouts to secs_to_jiffies() 2025-01-12 20:21:03 -08:00
zram Summary of significant series in this pull request: 2025-10-02 18:18:33 -07:00
amiflop.c block: switch ->getgeo() to struct gendisk 2025-08-13 02:59:29 -04:00
ataflop.c treewide: Switch/rename to timer_delete[_sync]() 2025-04-05 10:30:12 +02:00
brd.c brd: use page reference to protect page lifetime 2025-09-01 08:37:29 -06:00
floppy.c floppy: fix for PAGE_SIZE != 4KB 2025-11-17 08:22:00 -07:00
Kconfig rnull: move driver to separate directory 2025-09-02 05:23:56 -06:00
loop.c loop: remove redundant __GFP_NOWARN flag 2025-10-08 06:27:53 -06:00
Makefile rnull: move driver to separate directory 2025-09-02 05:23:56 -06:00
n64cart.c block: move the nonrot flag to queue_limits 2024-06-19 07:58:28 -06:00
nbd.c nbd: defer config unlock in nbd_genl_connect 2025-11-11 07:50:15 -07:00
ps3disk.c ps3disk: use memcpy_{from,to}_bvec index 2025-11-14 09:10:16 -07:00
ps3vram.c block: pass a queue_limits argument to blk_alloc_disk 2024-02-19 16:58:23 -07:00
rbd_types.h
rbd.c drivers/block: WQ_PERCPU added to alloc_workqueue users 2025-09-09 09:11:31 -06:00
sunvdc.c drivers/block: WQ_PERCPU added to alloc_workqueue users 2025-09-09 09:11:31 -06:00
swim_asm.S
swim.c block: switch ->getgeo() to struct gendisk 2025-08-13 02:59:29 -04:00
swim3.c treewide, timers: Rename from_timer() to timer_container_of() 2025-06-08 09:07:37 +02:00
ublk_drv.c ublk: return unsigned from ublk_{,un}map_io() 2025-11-11 07:57:20 -07:00
virtio_blk.c virtio_blk: NULL out vqs to avoid double free on failed resume 2025-11-06 16:32:58 -07:00
xen-blkfront.c block: switch ->getgeo() to struct gendisk 2025-08-13 02:59:29 -04:00
z2ram.c block: remove BLK_MQ_F_SHOULD_MERGE 2024-12-23 08:17:23 -07:00
zloop.c zloop: introduce the ordered_zone_append configuration parameter 2025-11-17 09:40:09 -07:00