linux/block
Damien Le Moal 1365b6904f block: allow submitting all zone writes from a single context
In order to maintain sequential write patterns per zone with zoned block
devices, zone write plugging issues only a single write BIO per zone at
any time. This works well but has the side effect that when large
sequential write streams are issued by the user and these streams cross
zone boundaries, the device ends up receiving a discontiguous set of
write commands for different zones. The same happens when a user writes
to multiple zones simultaneously at high queue depth: the device does
not see all sequential writes per zone and receives discontiguous
writes to different zones. While this does not affect the performance of
solid state zoned block devices, when using an SMR HDD, this pattern
change from sequential writes to discontiguous writes to different zones
significantly increases head seeks, which results in degraded write
throughput.

In order to reduce this seek overhead for rotational media devices,
introduce a per disk zone write plugs kernel thread to issue all write
BIOs to zones. This single zone write issuing context is enabled for
any zoned block device that has a request queue flagged with the new
QUEUE_ZONED_QD1_WRITES flag.

The flag QUEUE_ZONED_QD1_WRITES is visible as the sysfs queue attribute
zoned_qd1_writes for zoned devices. For regular block devices, this
attribute is not visible. For zoned block devices, a user can override
the default value: setting this attribute forces a global maximum write
queue depth of 1 for the device, and clearing it falls back to the
default behavior of zone write plugging, which limits writes to QD=1 per
sequential zone.
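
As an illustration of the attribute described above, here is a minimal
user-space sketch for reading and toggling it. The attribute name
zoned_qd1_writes comes from this commit. The /sys/block/<disk>/queue/
location is where queue attributes normally live, the "0"/"1" value
format is an assumption, and the helper names and the sysfs_root
parameter are hypothetical.

```python
from pathlib import Path

ATTR = "zoned_qd1_writes"  # attribute name taken from the commit message


def qd1_attr_path(disk, sysfs_root="/sys/block"):
    """Build the (assumed) sysfs path of the attribute for a disk name."""
    return Path(sysfs_root) / disk / "queue" / ATTR


def get_zoned_qd1_writes(disk, sysfs_root="/sys/block"):
    """Return True/False, or None when the attribute is not visible
    (for example for a regular, non-zoned block device)."""
    path = qd1_attr_path(disk, sysfs_root)
    if not path.exists():
        return None
    # Assumption: the attribute reads back as "0" or "1".
    return path.read_text().strip() == "1"


def set_zoned_qd1_writes(disk, enable, sysfs_root="/sys/block"):
    """Set or clear the attribute. Needs write permission on sysfs."""
    path = qd1_attr_path(disk, sysfs_root)
    if not path.exists():
        raise FileNotFoundError(f"{path}: attribute not visible")
    path.write_text("1" if enable else "0")
```

For example, set_zoned_qd1_writes("sdb", False) would request the
default per-zone QD=1 behavior for a zoned disk named sdb (a made-up
device name).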

Writing to a zoned block device flagged with QUEUE_ZONED_QD1_WRITES is
implemented using a list of zone write plugs that have a non-empty BIO
list. Listed zone write plugs are processed by the disk zone write plugs
worker kthread in FIFO order, and all BIOs of a zone write plug are
processed before switching to the next listed zone write plug. A newly
submitted BIO for a non-FULL zone write plug that is not yet listed
causes the addition of the zone write plug at the end of the disk list
of zone write plugs.
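
The listing and FIFO processing described above can be modeled in user
space as follows. This is only a toy model, not the kernel code: the
class and method names are hypothetical, and locking, reference
counting and the zone FULL condition are left out.

```python
from collections import deque


class ZoneWritePlug:
    """Toy zone write plug: a per-zone queue of pending write BIOs."""

    def __init__(self, zone_no):
        self.zone_no = zone_no
        self.bios = deque()
        self.listed = False  # True while the plug is on the disk list


class Disk:
    """Toy disk holding the FIFO list of plugs with a non-empty BIO list."""

    def __init__(self):
        self.plugs = {}
        self.wplugs_list = deque()

    def submit_write(self, zone_no, bio):
        plug = self.plugs.get(zone_no)
        if plug is None:
            plug = self.plugs[zone_no] = ZoneWritePlug(zone_no)
        plug.bios.append(bio)
        # A plug that is not yet listed is added at the end of the list.
        if not plug.listed:
            plug.listed = True
            self.wplugs_list.append(plug)

    def run_worker(self):
        """Drain listed plugs in FIFO order, finishing all BIOs of one
        plug before moving to the next. Returns the issue order."""
        issued = []
        while self.wplugs_list:
            plug = self.wplugs_list.popleft()
            while plug.bios:
                issued.append((plug.zone_no, plug.bios.popleft()))
            plug.listed = False
        return issued


def zone_switches(order):
    """Count how often consecutive writes target a different zone."""
    return sum(1 for a, b in zip(order, order[1:]) if a[0] != b[0])
```

In this model, writes submitted alternately to zones 0, 1, 0, 1 would
switch zones three times if issued in submission order, while the
worker issues both writes to zone 0 and then both writes to zone 1,
for a single switch.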

Since the write BIOs queued in a zone write plug BIO list are
necessarily sequential, for rotational media, using the single zone
write plugs kthread to issue all BIOs maintains a sequential write
pattern and thus reduces seek overhead and improves write throughput.
This processing essentially results in always writing to HDDs at QD=1,
which is not an issue for HDDs operating with write caching enabled.
Performance with write cache disabled is also not degraded thanks to
the efficient write handling of modern SMR HDDs.

A disk list of zone write plugs is defined using the new struct gendisk
zone_wplugs_list, and accesses to this list are protected using the
zone_wplugs_list_lock spinlock. The per disk kthread
(zone_wplugs_worker) code is implemented by the function
disk_zone_wplugs_worker(). A reference on listed zone write plugs is
always held until all BIOs of the zone write plug are processed by the
worker kthread. BIO issuing at QD=1 is driven using a completion
structure (zone_wplugs_worker_bio_done) and calls to blk_io_wait().
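
The completion-driven QD=1 issuing mentioned above can be pictured
with a small user-space analogy. This is a sketch only: a
threading.Event stands in for the kernel completion structure, the
fake device completes each write from a separate thread, and all names
are hypothetical.

```python
import threading


class QD1Issuer:
    """Issue writes one at a time, waiting for each completion."""

    def __init__(self, device_submit):
        self._submit = device_submit        # callable(bio, on_done)
        self._bio_done = threading.Event()  # stand-in for a completion
        self._in_flight = 0
        self.max_in_flight = 0

    def _on_done(self):
        # Called from the completion context of the (fake) device.
        self._in_flight -= 1
        self._bio_done.set()

    def issue_all(self, bios):
        for bio in bios:
            self._bio_done.clear()
            self._in_flight += 1
            self.max_in_flight = max(self.max_in_flight, self._in_flight)
            self._submit(bio, self._on_done)
            # Do not issue the next write until this one has completed.
            self._bio_done.wait()


def make_fake_device(log):
    """Return a submit function that completes each bio asynchronously
    and records the completion order in log."""

    def submit(bio, on_done):
        def complete():
            log.append(bio)
            on_done()

        threading.Thread(target=complete).start()

    return submit
```

Because the issuer waits for each completion before submitting the
next write, at most one write is in flight at any time, which is the
QD=1 property described above.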

With this change, performance when sequentially writing the zones of a
30 TB SMR SATA HDD connected to an AHCI adapter changes as follows
(1 MiB direct I/Os, results in MB/s):

                    +--------------------+
                    |   Write BW (MB/s)  |
 +------------------+----------+---------+
 | Sequential write | Baseline | Patched |
 |  Queue Depth     | 6.19-rc8 |         |
 +------------------+----------+---------+
 | 1                | 244      | 245     |
 | 2                | 244      | 245     |
 | 4                | 245      | 245     |
 | 8                | 242      | 245     |
 | 16               | 222      | 246     |
 | 32               | 211      | 245     |
 | 64               | 193      | 244     |
 | 128              | 112      | 246     |
 +------------------+----------+---------+

With the current code (baseline), as the sequential write stream crosses
a zone boundary, higher queue depth creates a gap between the
last IO to the previous zone and the first IOs to the following zones,
causing head seeks and degrading performance. Using the disk zone
write plugs worker thread, this pattern disappears and the maximum
throughput of the drive is maintained, leading to a throughput
improvement of over 100% at the highest queue depth tested (QD=128).
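
The relative gains implied by the sequential write table above can be
checked with a few lines of arithmetic. The numbers are copied from
the table; the function and variable names are only illustrative.

```python
# Queue depth -> (baseline MB/s, patched MB/s), from the table above.
SEQ_WRITE_BW = {
    1: (244, 245),
    2: (244, 245),
    4: (245, 245),
    8: (242, 245),
    16: (222, 246),
    32: (211, 245),
    64: (193, 244),
    128: (112, 246),
}


def gain_percent(baseline, patched):
    """Relative throughput change of patched over baseline, in percent."""
    return (patched - baseline) / baseline * 100.0


def seq_write_gains():
    """Return {queue depth: gain in percent, rounded to one decimal}."""
    return {
        qd: round(gain_percent(base, patched), 1)
        for qd, (base, patched) in SEQ_WRITE_BW.items()
    }


if __name__ == "__main__":
    for qd, gain in seq_write_gains().items():
        print(f"QD={qd:<3d} gain: {gain:6.1f}%")
```

At QD=128 this gives a gain of about 119.6%, while the gains at queue
depths of 8 and below stay under 2%.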

Using 16 fio jobs all writing to randomly chosen zones at QD=32 with 1
MiB direct IOs, write throughput also increases significantly.

                    +--------------------+
                    |   Write BW (MB/s)  |
 +------------------+----------+---------+
 |   Random write   | Baseline | Patched |
 |  Number of zones | 6.19-rc7 |         |
 +------------------+----------+---------+
 | 1                | 191      | 192     |
 | 2                | 101      | 128     |
 | 4                | 115      | 123     |
 | 8                | 90       | 120     |
 | 16               | 64       | 115     |
 | 32               | 58       | 105     |
 | 64               | 56       | 101     |
 | 128              | 55       | 99      |
 +------------------+----------+---------+

Tests using XFS show that buffered write speed with 8 jobs writing
files increases by 12% to 35% depending on the workload.

                    +--------------------+
                    |   Write BW (MB/s)  |
 +------------------+----------+---------+
 |     Workload     | Baseline | Patched |
 |                  | 6.19-rc7 |         |
 +------------------+----------+---------+
 | 256MiB file size | 212      | 238     |
 +------------------+----------+---------+
 | 4MiB .. 128 MiB  | 213      | 243     |
 | random file size |          |         |
 +------------------+----------+---------+
 | 2MiB .. 8 MiB    | 179      | 242     |
 | random file size |          |         |
 +------------------+----------+---------+

Performance gains are even more significant when using an HBA that
limits the maximum size of commands to a small value, e.g. HBAs
controlled with the mpi3mr driver limit commands to a maximum of 1 MiB.
In such a case, the write throughput gains are over 40%.

                    +--------------------+
                    |   Write BW (MB/s)  |
 +------------------+----------+---------+
 |     Workload     | Baseline | Patched |
 |                  | 6.19-rc7 |         |
 +------------------+----------+---------+
 | 256MiB file size | 175      | 245     |
 +------------------+----------+---------+
 | 4MiB .. 128 MiB  | 174      | 244     |
 | random file size |          |         |
 +------------------+----------+---------+
 | 2MiB .. 8 MiB    | 171      | 243     |
 | random file size |          |         |
 +------------------+----------+---------+

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-03-09 14:30:00 -06:00
partitions Convert 'alloc_obj' family to use the new default GFP_KERNEL argument 2026-02-21 17:09:51 -08:00
badblocks.c badblocks: Fix a nonsense WARN_ON() which checks whether a u64 variable < 0 2025-03-10 07:41:58 -06:00
bdev.c block: remove redundant kill_bdev() call in set_blocksize() 2026-02-04 09:28:18 -07:00
bfq-cgroup.c treewide: Replace kmalloc with kmalloc_obj for non-scalar types 2026-02-21 01:02:28 -08:00
bfq-iosched.c treewide: Replace kmalloc with kmalloc_obj for non-scalar types 2026-02-21 01:02:28 -08:00
bfq-iosched.h block, bfq: update outdated comment 2026-01-01 08:57:37 -07:00
bfq-wf2q.c
bio-integrity-auto.c Merge branch 'block-6.19' into for-7.0/block 2026-01-11 13:16:36 -07:00
bio-integrity.c Convert 'alloc_obj' family to use the new default GFP_KERNEL argument 2026-02-21 17:09:51 -08:00
bio.c Convert 'alloc_obj' family to use the new default GFP_KERNEL argument 2026-02-21 17:09:51 -08:00
blk-cgroup-fc-appid.c
blk-cgroup-rwstat.c
blk-cgroup-rwstat.h
blk-cgroup.c Convert 'alloc_obj' family to use the new default GFP_KERNEL argument 2026-02-21 17:09:51 -08:00
blk-cgroup.h block: initialize bio issue time in blk_mq_submit_bio() 2025-09-10 05:23:45 -06:00
blk-core.c blk-mq: add a new queue sysfs attribute async_depth 2026-02-03 07:45:36 -07:00
blk-crypto-fallback.c Convert remaining multi-line kmalloc_obj/flex GFP_KERNEL uses 2026-02-22 08:26:33 -08:00
blk-crypto-internal.h blk-crypto: handle the fallback above the block layer 2026-01-11 12:55:41 -07:00
blk-crypto-profile.c Convert more 'alloc_obj' cases to default GFP_KERNEL arguments 2026-02-21 20:03:00 -08:00
blk-crypto-sysfs.c Convert 'alloc_obj' family to use the new default GFP_KERNEL argument 2026-02-21 17:09:51 -08:00
blk-crypto.c blk-crypto: handle the fallback above the block layer 2026-01-11 12:55:41 -07:00
blk-flush.c block: pass io_comp_batch to rq_end_io_fn callback 2026-01-20 10:12:54 -07:00
blk-ia-ranges.c block: get rid of request queue ->sysfs_dir_lock 2025-01-29 07:16:47 -07:00
blk-integrity.c block: don't merge bios with different app_tags 2026-01-06 19:10:08 -07:00
blk-ioc.c copy_process: pass clone_flags as u64 across calltree 2025-09-01 15:31:34 +02:00
blk-iocost.c Convert 'alloc_obj' family to use the new default GFP_KERNEL argument 2026-02-21 17:09:51 -08:00
blk-iolatency.c Convert 'alloc_obj' family to use the new default GFP_KERNEL argument 2026-02-21 17:09:51 -08:00
blk-ioprio.c treewide: Replace kmalloc with kmalloc_obj for non-scalar types 2026-02-21 01:02:28 -08:00
blk-ioprio.h
blk-lib.c block: change return type to void 2026-02-12 04:23:53 -07:00
blk-map.c block-7.0-20260305 2026-03-06 08:36:18 -08:00
blk-merge.c for-7.0/block-stable-pages-20260206 2026-02-09 18:14:52 -08:00
blk-mq-cpumap.c blk-mq: add number of queue calc helper 2025-07-01 10:24:19 -06:00
blk-mq-debugfs.c block: allow submitting all zone writes from a single context 2026-03-09 14:30:00 -06:00
blk-mq-debugfs.h blk-mq-debugfs: remove blk_mq_debugfs_unregister_rqos() 2026-02-02 07:05:19 -07:00
blk-mq-dma.c block: fix partial IOVA mapping cleanup in blk_rq_dma_map_iova 2026-02-12 04:23:31 -07:00
blk-mq-sched.c Convert 'alloc_obj' family to use the new default GFP_KERNEL argument 2026-02-21 17:09:51 -08:00
blk-mq-sched.h blk-mq-sched: unify elevators checking for async requests 2026-02-03 07:45:36 -07:00
blk-mq-sysfs.c blk-mq: Move flush queue allocation into blk_mq_init_hctx() 2025-09-08 08:05:32 -06:00
blk-mq-tag.c blk-mq: use array manage hctx map instead of xarray 2025-11-28 09:09:19 -07:00
blk-mq.c block-7.0-20260305 2026-03-06 08:36:18 -08:00
blk-mq.h blk-mq: use queue_hctx in blk_mq_map_queue_type 2025-12-01 07:18:31 -07:00
blk-pm.c block: force noio scope in blk_mq_freeze_queue 2025-01-31 07:20:08 -07:00
blk-pm.h
blk-rq-qos.c blk-mq-debugfs: remove blk_mq_debugfs_unregister_rqos() 2026-02-02 07:05:19 -07:00
blk-rq-qos.h blk-rq-qos: Remove unlikely() hints from QoS checks 2026-01-06 19:08:23 -07:00
blk-settings.c block: validate interval_exp integrity limit 2025-12-18 09:51:49 -07:00
blk-stat.c Convert 'alloc_obj' family to use the new default GFP_KERNEL argument 2026-02-21 17:09:51 -08:00
blk-stat.h blk-stat: convert struct blk_stat_callback to kernel-doc 2026-02-16 10:21:06 -07:00
blk-sysfs.c block: allow submitting all zone writes from a single context 2026-03-09 14:30:00 -06:00
blk-throttle.c block/blk-throttle: Remove throtl_slice from struct throtl_data 2025-11-17 09:39:48 -07:00
blk-throttle.h blk-throttle: fix access race during throttle policy activation 2025-09-08 08:24:44 -06:00
blk-timeout.c
blk-wbt.c Convert 'alloc_obj' family to use the new default GFP_KERNEL argument 2026-02-21 17:09:51 -08:00
blk-wbt.h blk-wbt: factor out a helper wbt_set_lat() 2026-02-02 07:05:19 -07:00
blk-zoned.c block: allow submitting all zone writes from a single context 2026-03-09 14:30:00 -06:00
blk.h blk-mq: use NOIO context to prevent deadlock during debugfs creation 2026-02-16 10:47:25 -07:00
bsg-lib.c Convert 'alloc_obj' family to use the new default GFP_KERNEL argument 2026-02-21 17:09:51 -08:00
bsg.c Convert 'alloc_obj' family to use the new default GFP_KERNEL argument 2026-02-21 17:09:51 -08:00
disk-events.c Convert 'alloc_obj' family to use the new default GFP_KERNEL argument 2026-02-21 17:09:51 -08:00
early-lookup.c
elevator.c block: use trylock to avoid lockdep circular dependency in sysfs 2026-03-05 04:01:42 -07:00
elevator.h block: fix race between wbt_enable_default and IO submission 2025-12-12 12:51:11 -07:00
fops.c Convert 'alloc_obj' family to use the new default GFP_KERNEL argument 2026-02-21 17:09:51 -08:00
genhd.c Convert 'alloc_obj' family to use the new default GFP_KERNEL argument 2026-02-21 17:09:51 -08:00
holder.c Convert 'alloc_obj' family to use the new default GFP_KERNEL argument 2026-02-21 17:09:51 -08:00
ioctl.c block: allow IOC_PR_READ_* ioctls with BLK_OPEN_READ 2026-02-11 10:36:54 -07:00
ioprio.c block: remove test of incorrect io priority level 2025-05-08 09:04:12 -06:00
Kconfig block: Remove obsolete configs BLK_MQ_{PCI,VIRTIO} 2025-05-14 05:43:56 -06:00
Kconfig.iosched
kyber-iosched.c kyber: covert to use request_queue->async_depth 2026-02-03 07:45:36 -07:00
Makefile blk-mq: move the DMA mapping code to a separate file 2025-05-16 08:43:41 -06:00
mq-deadline.c mq-deadline: covert to use request_queue->async_depth 2026-02-03 07:45:36 -07:00
opal_proto.h sed-opal: add IOC_OPAL_REACTIVATE_LSP. 2026-03-09 14:29:59 -06:00
sed-opal.c sed-opal: add IOC_OPAL_GET_SUM_STATUS ioctl. 2026-03-09 14:29:59 -06:00
t10-pi.c block: rename tuple_size field in blk_integrity to metadata_size 2025-07-01 14:00:14 +02:00