Commit Graph

8237 Commits

Author SHA1 Message Date
Linus Torvalds
bea82c80a5 block-6.19-20260102

Merge tag 'block-6.19-20260102' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux

Pull block fixes from Jens Axboe:

 - Scan partition tables asynchronously for ublk, similarly to how nvme
   does it. This avoids potential deadlocks, which is why nvme does it
   that way too. Includes a set of selftests as well.

 - MD pull request via Yu:
     - Fix null-pointer dereference in raid5 sysfs group_thread_cnt
       store (Tuo Li)
     - Fix possible mempool corruption during raid1 raid_disks update
       via sysfs (FengWei Shih)
     - Fix logical_block_size configuration being overwritten during
       super_1_validate() (Li Nan)
     - Fix forward incompatibility with configurable logical block size:
       arrays assembled on new kernels could not be assembled on older
       kernels (v6.18 and before) due to non-zero reserved pad rejection
       (Li Nan)
     - Fix static checker warning about iterator not incremented (Li Nan)

 - Skip CPU offlining notifications on unmapped hardware queues

 - bfq-iosched block stats fix

 - Fix outdated comment in bfq-iosched

* tag 'block-6.19-20260102' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux:
  block, bfq: update outdated comment
  blk-mq: skip CPU offline notify on unmapped hctx
  selftests/ublk: fix Makefile to rebuild on header changes
  selftests/ublk: add test for async partition scan
  ublk: scan partition in async way
  block,bfq: fix aux stat accumulation destination
  md: Fix forward incompatibility from configurable logical block size
  md: Fix logical_block_size configuration being overwritten
  md: suspend array while updating raid_disks via sysfs
  md/raid5: fix possible null-pointer dereferences in raid5_store_group_thread_cnt()
  md: Fix static checker warning in analyze_sbs
2026-01-02 12:15:59 -08:00
Julia Lawall
69153e8b97 block, bfq: update outdated comment
The function bfq_bfqq_may_idle() was renamed as bfq_better_to_idle()
in commit 277a4a9b56 ("block, bfq: give a better name to
bfq_bfqq_may_idle").  Update the comment accordingly.

Signed-off-by: Julia Lawall <Julia.Lawall@inria.fr>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-01-01 08:57:37 -07:00
Cong Zhang
10845a105b blk-mq: skip CPU offline notify on unmapped hctx
If an hctx has no software ctx mapped, blk_mq_map_swqueue() never
allocates tags and leaves hctx->tags NULL. The CPU hotplug offline
notifier can still run for that hctx, so return early in that case
since such an hctx cannot hold any requests.
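
The early return can be pictured with a small user-space model. This is only a sketch: `toy_tags`, `toy_hctx` and `toy_offline_notify()` are hypothetical stand-ins, not the kernel's definitions, and only the "tags is NULL" condition is taken from the description above.

```c
#include <stddef.h>

/* Hypothetical stand-ins for the tag map and the hardware queue context. */
struct toy_tags { int nr_busy; };
struct toy_hctx { struct toy_tags *tags; /* NULL when no software ctx is mapped */ };

/*
 * Toy offline notifier: returns the number of requests that would have to
 * be drained. An unmapped hctx has no tags, so it cannot hold any requests.
 */
static int toy_offline_notify(const struct toy_hctx *hctx)
{
    if (!hctx->tags)
        return 0;               /* nothing mapped, nothing to drain */
    return hctx->tags->nr_busy;
}
```

Checking the pointer first means the rest of the handler never dereferences a tag map that was never allocated.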

Signed-off-by: Cong Zhang <cong.zhang@oss.qualcomm.com>
Fixes: bf0beec060 ("blk-mq: drain I/O when all CPUs in a hctx are offline")
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-12-30 09:02:22 -07:00
shechenglong
04bdb1a04d block,bfq: fix aux stat accumulation destination
Route bfqg_stats_add_aux() time accumulation into the destination
stats object instead of the source, aligning with other stat fields.
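
As an illustration of "accumulate into the destination", here is a reduced user-space sketch. The struct and function names are hypothetical and the real stats object has more fields; the point is only the direction of the accumulation.

```c
#include <stdint.h>

/* Hypothetical, heavily reduced stats object. */
struct toy_stats { uint64_t time; uint64_t ios; };

/* Fold the counters of 'from' into 'to'; 'from' is left untouched. */
static void toy_stats_add_aux(struct toy_stats *to, const struct toy_stats *from)
{
    to->ios  += from->ios;
    to->time += from->time;   /* every field lands in 'to', including time */
}
```

Declaring the source as `const` in this sketch lets the compiler reject an accidental write back into the source object.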

Reviewed-by: Yu Kuai <yukuai@fnnas.com>
Signed-off-by: shechenglong <shechenglong@xfusion.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-12-28 09:03:51 -07:00
Linus Torvalds
d8ba32c5a4 block-6.19-20251218

Merge tag 'block-6.19-20251218' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux

Pull block fixes from Jens Axboe:

 - ublk selftests for missing coverage

 - two fixes for the block integrity code

 - fix for the newly added PR read keys ioctl, limiting the
   memory that can be allocated

 - work around for a deadlock that can occur with ublk, where partition
   scanning ends up recursing back into file closure, which needs the
   same mutex grabbed. Not the prettiest thing in the world, but an
   acceptable work-around until we can eliminate the reliance on
   disk->open_mutex for this

 - fix for a race between enabling writeback throttling and new IO
   submissions

 - move a bit of bio flag handling code. No changes, but needed for a
   patchset for a future kernel

 - fix for an init time id leak failure in rnbd

 - loop/zloop state check fix

* tag 'block-6.19-20251218' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux:
  block: validate interval_exp integrity limit
  block: validate pi_offset integrity limit
  block: rnbd-clt: Fix leaked ID in init_dev()
  ublk: fix deadlock when reading partition table
  block: add allocation size check in blkdev_pr_read_keys()
  Documentation: admin-guide: blockdev: replace zone_capacity with zone_capacity_mb when creating devices
  zloop: use READ_ONCE() to read lo->lo_state in queue_rq path
  loop: use READ_ONCE() to read lo->lo_state without locking
  block: fix race between wbt_enable_default and IO submission
  selftests: ublk: add user copy test cases
  selftests: ublk: add support for user copy to kublk
  selftests: ublk: forbid multiple data copy modes
  selftests: ublk: don't share backing files between ublk servers
  selftests: ublk: use auto_zc for PER_IO_DAEMON tests in stress_04
  selftests: ublk: fix fio arguments in run_io_and_recover()
  selftests: ublk: remove unused ios map in seq_io.bt
  selftests: ublk: correct last_rw map type in seq_io.bt
  selftests: ublk: fix overflow in ublk_queue_auto_zc_fallback()
  block: move around bio flagging helpers
2025-12-20 09:48:56 -08:00
Caleb Sander Mateos
af65faf34f block: validate interval_exp integrity limit
Various code assumes that the integrity interval is at least 1 sector
and evenly divides the logical block size. Add these checks to
blk_validate_integrity_limits(). This guards against block drivers that
report invalid interval_exp values.
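
The two conditions can be sketched in user space as follows. This is not the code added to blk_validate_integrity_limits(); the function name is hypothetical, and the sketch assumes a 512-byte sector and an interval of `1 << interval_exp` bytes.

```c
#include <stdbool.h>

#define TOY_SECTOR_SHIFT 9  /* assumption: 512-byte sector */

/* True if the interval is at least one sector and divides the block size. */
static bool toy_interval_exp_valid(unsigned int interval_exp,
                                   unsigned int logical_block_size)
{
    if (logical_block_size == 0)
        return false;
    if (interval_exp < TOY_SECTOR_SHIFT || interval_exp >= 32)
        return false;           /* too small, or the shift would overflow */

    unsigned int interval = 1U << interval_exp;
    return logical_block_size % interval == 0;
}
```

With a 4096-byte logical block size, exponents 9 through 12 pass, while 8 (smaller than a sector) and 13 (larger than the block) are rejected.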

Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-12-18 09:51:49 -07:00
Caleb Sander Mateos
ccb8a3c08a block: validate pi_offset integrity limit
The PI tuple must be contained within the metadata value, so validate
that pi_offset + pi_tuple_size <= metadata_size. This guards against
block drivers that report invalid pi_offset values.
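
A minimal user-space sketch of the containment check is shown below. The function name is hypothetical, and the comparison is rearranged into a subtraction so that the sum cannot wrap; that rearrangement is a choice made for this sketch, not necessarily how the kernel writes it.

```c
#include <stdbool.h>

/* True if pi_offset + pi_tuple_size <= metadata_size, without overflow. */
static bool toy_pi_offset_valid(unsigned int pi_offset,
                                unsigned int pi_tuple_size,
                                unsigned int metadata_size)
{
    if (pi_tuple_size > metadata_size)
        return false;           /* the tuple alone does not fit */
    return pi_offset <= metadata_size - pi_tuple_size;
}
```

For example, an 8-byte tuple at offset 8 fits in 16 bytes of metadata, but the same tuple at offset 9 does not.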

Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-12-18 09:51:49 -07:00
Deepanshu Kartikey
a58383fa45 block: add allocation size check in blkdev_pr_read_keys()
blkdev_pr_read_keys() takes num_keys from userspace and uses it to
calculate the allocation size for keys_info via struct_size(). While
there is a check for SIZE_MAX (integer overflow), there is no upper
bound validation on the allocation size itself.

A malicious or buggy userspace can pass a large num_keys value that
doesn't trigger overflow but still results in an excessive allocation
attempt, causing a warning in the page allocator when the order exceeds
MAX_PAGE_ORDER.

Fix this by introducing PR_KEYS_MAX to limit the number of keys to
a sane value. This makes the SIZE_MAX check redundant, so remove it.
Also switch to kvzalloc/kvfree to handle larger allocations gracefully.
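
The idea of bounding the count before sizing the allocation can be sketched in user space as below. Everything here is an assumption for illustration: `TOY_PR_KEYS_MAX` is a made-up cap (the actual PR_KEYS_MAX value is not stated in this message), the struct layout is hypothetical, and `calloc()` stands in for kvzalloc().

```c
#include <stdint.h>
#include <stdlib.h>

#define TOY_PR_KEYS_MAX 1024u   /* hypothetical cap, not the kernel's value */

struct toy_keys_info {
    uint32_t num_keys;
    uint64_t keys[];            /* flexible array, sized by num_keys */
};

/* Returns a zeroed buffer for num_keys keys, or NULL if out of bounds. */
static struct toy_keys_info *toy_alloc_keys(uint32_t num_keys)
{
    if (num_keys > TOY_PR_KEYS_MAX)
        return NULL;            /* reject before any size arithmetic */

    size_t size = sizeof(struct toy_keys_info) +
                  (size_t)num_keys * sizeof(uint64_t);
    struct toy_keys_info *info = calloc(1, size);
    if (info)
        info->num_keys = num_keys;
    return info;
}
```

Because the count is capped first, the size computation cannot overflow, which is why a separate overflow check becomes redundant.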

Fixes: 22a1ffea5f ("block: add IOC_PR_READ_KEYS ioctl")
Tested-by: syzbot+660d079d90f8a1baf54d@syzkaller.appspotmail.com
Reported-by: syzbot+660d079d90f8a1baf54d@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=660d079d90f8a1baf54d
Link: https://lore.kernel.org/all/20251212013510.3576091-1-kartikey406@gmail.com/T/ [v1]
Signed-off-by: Deepanshu Kartikey <kartikey406@gmail.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-12-17 07:35:22 -07:00
Ming Lei
9869d3a6fe block: fix race between wbt_enable_default and IO submission
When wbt_enable_default() is moved out of queue freezing in elevator_change(),
it can cause the wbt inflight counter to become negative (-1), leading to hung
tasks in the writeback path. Tasks get stuck in wbt_wait() because the counter
is in an inconsistent state.

The issue occurs because wbt_enable_default() could race with IO submission,
allowing the counter to be decremented before proper initialization. This manifests
as:

  rq_wait[0]:
    inflight:             -1
    has_waiters:        True

rwb_enabled() checks the state, which can be updated exactly between wbt_wait()
(rq_qos_throttle()) and wbt_track() (rq_qos_track()); when that happens, the
inflight counter becomes negative.

This results in hung task warnings like:
  task:kworker/u24:39 state:D stack:0 pid:14767
  Call Trace:
    rq_qos_wait+0xb4/0x150
    wbt_wait+0xa9/0x100
    __rq_qos_throttle+0x24/0x40
    blk_mq_submit_bio+0x672/0x7b0
    ...

Fix this by:

1. Splitting wbt_enable_default() into:
   - __wbt_enable_default(): Returns true if wbt_init() should be called
   - wbt_enable_default(): Wrapper for existing callers (no init)
   - wbt_init_enable_default(): New function that checks and inits WBT

2. Using wbt_init_enable_default() in blk_register_queue() to ensure
   proper initialization during queue registration

3. Moving wbt_init() out of wbt_enable_default(), which is only used to
   re-enable disabled wbt from bfq and iocost, where wbt_init() isn't
   needed. This also avoids the original lock warning.

4. Removing the ELEVATOR_FLAG_ENABLE_WBT_ON_EXIT flag and its handling
   code since it's no longer needed

This ensures WBT is properly initialized before any IO can be submitted,
preventing the counter from going negative.
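
How the counter can reach -1 may be easier to see in a toy model. The sketch below is not kernel code and none of its names exist in the kernel; it assumes, based on the description above, that the throttle step accounts a request only when throttling is enabled, that the track step marks the request if throttling is enabled at that moment, and that completion un-accounts marked requests.

```c
#include <stdbool.h>

/* Toy model only: none of these names are the kernel's. */
struct toy_wbt { bool enabled; int inflight; };
struct toy_rq  { bool tracked; };

/* Throttle step: account the request only if throttling is enabled. */
static void toy_throttle(struct toy_wbt *w)
{
    if (w->enabled)
        w->inflight++;
}

/* Track step: mark the request so completion knows to un-account it. */
static void toy_track(const struct toy_wbt *w, struct toy_rq *rq)
{
    rq->tracked = w->enabled;
}

/* Completion: drop the accounting for tracked requests. */
static void toy_done(struct toy_wbt *w, const struct toy_rq *rq)
{
    if (rq->tracked)
        w->inflight--;
}

/* Replays one request; enable_midway flips the state between the steps. */
static int toy_replay(bool enabled_at_start, bool enable_midway)
{
    struct toy_wbt w = { .enabled = enabled_at_start, .inflight = 0 };
    struct toy_rq rq = { .tracked = false };

    toy_throttle(&w);
    if (enable_midway)
        w.enabled = true;       /* the racing enable */
    toy_track(&w, &rq);
    toy_done(&w, &rq);
    return w.inflight;
}
```

In this model, `toy_replay(false, true)` ends with an inflight count of -1, while a request that sees a stable state in both steps ends at 0. Initializing before any IO can be submitted removes the window in which the state flips mid-request.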

Cc: Nilay Shroff <nilay@linux.ibm.com>
Cc: Yu Kuai <yukuai@fnnas.com>
Cc: Guangwu Zhang <guazhang@redhat.com>
Fixes: 78c271344b ("block: move wbt_enable_default() out of queue freezing from sched ->exit()")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-12-12 12:51:11 -07:00
Linus Torvalds
35ebee7e72 block-6.19-20251211

Merge tag 'block-6.19-20251211' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux

Pull block fixes from Jens Axboe:

 - Always initialize DMA state, fixing a potentially nasty issue on the
   block side

 - btrfs zoned write fix with cached zone reports

 - Fix corruption issues in bcache with chained bio's, and further make
   it clear that the chained IO handler is simply a marker, it's not
   code meant to be executed

 - Kill old code dealing with synchronous IO polling in the block layer,
   that has been dead for a long time. Only async polling is supported
   these days

 - Fix a lockdep issue in tag_set management, moving it to RCU

 - Fix an issue with ublk's bio_vec iteration

 - Don't unconditionally enforce blocking issue of ublk control
   commands, allow some of them with non-blocking issue as they
   do not block

* tag 'block-6.19-20251211' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux:
  blk-mq-dma: always initialize dma state
  blk-mq: delete task running check in blk_hctx_poll()
  block: fix cached zone reports on devices with native zone append
  block: Use RCU in blk_mq_[un]quiesce_tagset() instead of set->tag_list_lock
  ublk: don't mutate struct bio_vec in iteration
  block: prohibit calls to bio_chain_endio
  bcache: fix improper use of bi_end_io
  ublk: allow non-blocking ctrl cmds in IO_URING_F_NONBLOCK issue
2025-12-12 22:04:18 +12:00
Keith Busch
a0750fae73 blk-mq-dma: always initialize dma state
Ensure the dma state is initialized when we're not using the contiguous
iova, otherwise the caller may be using a stale state from a previous
request that could use the coalesced iova allocation.
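
The stale-state hazard can be illustrated with a reduced user-space sketch. The struct, its fields and `toy_dma_start()` are hypothetical and do not mirror the kernel's DMA state; the sketch only shows why a caller-owned state object should be reset on every path.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical per-request mapping state that the caller reuses. */
struct toy_dma_state { bool uses_iova; unsigned long iova_base; };

/* Start a mapping; the state is reset regardless of which path is taken. */
static void toy_dma_start(struct toy_dma_state *state, bool can_use_iova)
{
    memset(state, 0, sizeof(*state));   /* always start from a clean state */
    if (can_use_iova) {
        state->uses_iova = true;
        state->iova_base = 0x1000;      /* arbitrary illustrative value */
    }
}
```

Without the unconditional reset, a second request that takes the non-iova path would still see `uses_iova` set by the previous request.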

Fixes: 2f6b2565d4 ("block: accumulate memory segment gaps per bio")
Reported-by: Sebastian Ott <sebott@redhat.com>
Tested-by: Sebastian Ott <sebott@redhat.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-12-10 13:41:11 -07:00
Fengnan Chang
f22ecf9c14 blk-mq: delete task running check in blk_hctx_poll()
blk_hctx_poll() always checks if the task is running or not, and returns
1 if the task is running. This is a leftover from when polled IO was
purely for synchronous IO, and doesn't make sense anymore when polled IO
is purely asynchronous. Similarly, marking the task as TASK_RUNNING is
also superfluous, as the task very much has to be running to enter the
function in the first place.

This check appears to exist for historical reasons: in very early
versions of this function, the caller would set the process state to
TASK_UNINTERRUPTIBLE.

Signed-off-by: Diangang Li <lidiangang@bytedance.com>
Signed-off-by: Fengnan Chang <changfengnan@bytedance.com>
[axboe: kill all remnants of task running, pointless now. massage message]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-12-10 02:40:52 -07:00
Johannes Thumshirn
2c38ec934d block: fix cached zone reports on devices with native zone append
When mounting a btrfs file system on virtio-blk which supports native
Zone Append there has been a WARN triggering in btrfs' space management
code.

Further looking into btrfs' zoned statistics uncovered the filesystem
expecting the zones to be used, but the write pointers being 0:
 # cat /sys/fs/btrfs/8eabd2e7-3294-4f9e-9b58-7e64135c8bf4/zoned_stats
 active block-groups: 4
         reclaimable: 0
         unused: 0
         need reclaim: false
 data relocation block-group: 1342177280
 active zones:
         start: 1073741824, wp: 0 used: 0, reserved: 0, unusable: 0
         start: 1342177280, wp: 0 used: 0, reserved: 0, unusable: 0
         start: 1610612736, wp: 0 used: 16384, reserved: 0, unusable: 18446744073709535232
         start: 1879048192, wp: 0 used: 131072, reserved: 0, unusable: 18446744073709420544

Looking at the blkzone report output for the zone in question
(1610612736) the write pointer on the device moved, but the filesystem
did not see a change on the write pointer:
 # blkzone report -c 1 -o 0x300000 /dev/vda
   start: 0x000300000, len 0x080000, cap 0x080000, wptr 0x000040 reset:0 non-seq:0, zcond: 2(oi) [type: 2(SEQ_WRITE_REQUIRED)]

The filesystem sees a write pointer of 0 because btrfs uses the cached
version of blkdev_report_zones(), and because virtio-blk supports native
zone append, blkdev_revalidate_zones() does not initialize the zone write
plugs in this case.

Not skipping the revalidation of sequential zones in the
blkdev_revalidate_zones() call chain fixes this issue.

Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Fixes: a6aa36e957 ("block: Remove zone write plugs when handling native zone append writes")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-12-09 22:35:54 -07:00
Mohamed Khalfella
59e25ef2b4 block: Use RCU in blk_mq_[un]quiesce_tagset() instead of set->tag_list_lock
blk_mq_{add,del}_queue_tag_set() functions add and remove queues from
tagset, the functions make sure that tagset and queues are marked as
shared when two or more queues are attached to the same tagset.
Initially a tagset starts as unshared and when the number of added
queues reaches two, blk_mq_add_queue_tag_set() marks it as shared along
with all the queues attached to it. When the number of attached queues
drops to 1, blk_mq_del_queue_tag_set() needs to mark both the tagset and
the remaining queues as unshared.

Both functions need to freeze the current queues in the tagset before
setting or unsetting the BLK_MQ_F_TAG_QUEUE_SHARED flag. While doing so,
both functions hold the set->tag_list_lock mutex, which makes sense as we
do not want queues to be added or deleted in the process. This used to
work fine until commit 98d81f0df7 ("nvme: use blk_mq_[un]quiesce_tagset")
made the nvme driver quiesce the tagset instead of quiescing individual
queues. blk_mq_quiesce_tagset() does the job and quiesces the queues in
set->tag_list while also holding set->tag_list_lock.

This results in deadlock between two threads with these stacktraces:

  __schedule+0x47c/0xbb0
  ? timerqueue_add+0x66/0xb0
  schedule+0x1c/0xa0
  schedule_preempt_disabled+0xa/0x10
  __mutex_lock.constprop.0+0x271/0x600
  blk_mq_quiesce_tagset+0x25/0xc0
  nvme_dev_disable+0x9c/0x250
  nvme_timeout+0x1fc/0x520
  blk_mq_handle_expired+0x5c/0x90
  bt_iter+0x7e/0x90
  blk_mq_queue_tag_busy_iter+0x27e/0x550
  ? __blk_mq_complete_request_remote+0x10/0x10
  ? __blk_mq_complete_request_remote+0x10/0x10
  ? __call_rcu_common.constprop.0+0x1c0/0x210
  blk_mq_timeout_work+0x12d/0x170
  process_one_work+0x12e/0x2d0
  worker_thread+0x288/0x3a0
  ? rescuer_thread+0x480/0x480
  kthread+0xb8/0xe0
  ? kthread_park+0x80/0x80
  ret_from_fork+0x2d/0x50
  ? kthread_park+0x80/0x80
  ret_from_fork_asm+0x11/0x20

  __schedule+0x47c/0xbb0
  ? xas_find+0x161/0x1a0
  schedule+0x1c/0xa0
  blk_mq_freeze_queue_wait+0x3d/0x70
  ? destroy_sched_domains_rcu+0x30/0x30
  blk_mq_update_tag_set_shared+0x44/0x80
  blk_mq_exit_queue+0x141/0x150
  del_gendisk+0x25a/0x2d0
  nvme_ns_remove+0xc9/0x170
  nvme_remove_namespaces+0xc7/0x100
  nvme_remove+0x62/0x150
  pci_device_remove+0x23/0x60
  device_release_driver_internal+0x159/0x200
  unbind_store+0x99/0xa0
  kernfs_fop_write_iter+0x112/0x1e0
  vfs_write+0x2b1/0x3d0
  ksys_write+0x4e/0xb0
  do_syscall_64+0x5b/0x160
  entry_SYSCALL_64_after_hwframe+0x4b/0x53

The top stacktrace shows nvme_timeout() being called to handle an nvme
command timeout. The timeout handler is trying to disable the controller
and, as a first step, it needs to call blk_mq_quiesce_tagset() to tell
blk-mq not to call queue callback handlers. The thread is stuck waiting
for set->tag_list_lock as it tries to walk the queues in set->tag_list.

The lock is held by the second thread, shown in the bottom stack, which is
waiting for one of the queues to be frozen. The queue usage counter will
only drop to zero after nvme_timeout() finishes, and this will never
happen because the first thread will wait for the mutex forever.

Given that [un]quiescing queue is an operation that does not need to
sleep, update blk_mq_[un]quiesce_tagset() to use RCU instead of taking
set->tag_list_lock, update blk_mq_{add,del}_queue_tag_set() to use RCU
safe list operations. Also, delete INIT_LIST_HEAD(&q->tag_set_list)
in blk_mq_del_queue_tag_set() because we can not re-initialize it while
the list is being traversed under RCU. The deleted queue will not be
added/deleted to/from a tagset and it will be freed in blk_free_queue()
after the end of RCU grace period.

Signed-off-by: Mohamed Khalfella <mkhalfella@purestorage.com>
Fixes: 98d81f0df7 ("nvme: use blk_mq_[un]quiesce_tagset")
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-12-09 10:24:55 -07:00
Shida Zhang
cfdeb588ae block: prohibit calls to bio_chain_endio
Now that all potential callers of bio_chain_endio have been
eliminated, completely prohibit any future calls to this function.

Suggested-by: Ming Lei <ming.lei@redhat.com>
Suggested-by: Andreas Gruenbacher <agruenba@redhat.com>
Suggested-by: Christoph Hellwig <hch@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Shida Zhang <zhangshida@kylinos.cn>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-12-09 10:20:03 -07:00
Linus Torvalds
4482ebb297 block-6.19-20251208

Merge tag 'block-6.19-20251208' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux

Pull block updates from Jens Axboe:
 "Followup set of fixes and updates for block for the 6.19 merge window.

  NVMe had some last minute debates which led to dropping some patches
  from that tree, which is why the initial PR didn't have NVMe included.
  It's here now. This pull request contains:

   - NVMe pull request via Keith:
       - Subsystem usage cleanups (Max)
       - Endpoint device fixes (Shin'ichiro)
       - Debug statements (Gerd)
       - FC fabrics cleanups and fixes (Daniel)
       - Consistent alloc API usages (Israel)
       - Code comment updates (Chu)
       - Authentication retry fix (Justin)

   - Fix a memory leak in the discard ioctl code, if the task is being
     interrupted by a signal at just the wrong time

   - Zoned write plugging fixes

   - Add ioctls for persistent reservations

   - Enable per-cpu bio caching by default

   - Various little fixes and tweaks"

* tag 'block-6.19-20251208' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux: (27 commits)
  nvme-fabrics: add ENOKEY to no retry criteria for authentication failures
  nvme-auth: use kvfree() for memory allocated with kvcalloc()
  nvmet-tcp: use kvcalloc for commands array
  nvmet-rdma: use kvcalloc for commands and responses arrays
  nvme: fix typo error in nvme target
  nvmet-fc: use pr_* print macros instead of dev_*
  nvmet-fcloop: remove unused lsdir member.
  nvmet-fcloop: check all request and response have been processed
  nvme-fc: check all request and response have been processed
  block: fix memory leak in __blkdev_issue_zero_pages
  block: fix comment for op_is_zone_mgmt() to include RESET_ALL
  block: Clear BLK_ZONE_WPLUG_PLUGGED when aborting plugged BIOs
  blk-mq: Abort suspend when wakeup events are pending
  blk-mq: add blk_rq_nr_bvec() helper
  block: add IOC_PR_READ_RESERVATION ioctl
  block: add IOC_PR_READ_KEYS ioctl
  nvme: reject invalid pr_read_keys() num_keys values
  scsi: sd: reject invalid pr_read_keys() num_keys values
  block: enable per-cpu bio cache by default
  block: use bio_alloc_bioset for passthru IO by default
  ...
2025-12-09 08:53:24 +09:00
Linus Torvalds
a3ebb59eee VFIO updates for v6.19-rc1

Merge tag 'vfio-v6.19-rc1' of https://github.com/awilliam/linux-vfio

Pull VFIO updates from Alex Williamson:

 - Move libvfio selftest artifacts in preparation of more tightly
   coupled integration with KVM selftests (David Matlack)

 - Fix comment typo in mtty driver (Chu Guangqing)

 - Support for new hardware revision in the hisi_acc vfio-pci variant
   driver where the migration registers can now be accessed via the PF.
   When enabled for this support, the full BAR can be exposed to the
   user (Longfang Liu)

 - Fix vfio cdev support for VF token passing, using the correct size
   for the kernel structure, thereby actually allowing userspace to
   provide a non-zero UUID token. Also set the match token callback for
   the hisi_acc, fixing VF token support for this vfio-pci variant
   driver (Raghavendra Rao Ananta)

 - Introduce internal callbacks on vfio devices to simplify and
   consolidate duplicate code for generating VFIO_DEVICE_GET_REGION_INFO
   data, removing various ioctl intercepts with a more structured
   solution (Jason Gunthorpe)

 - Introduce dma-buf support for vfio-pci devices, allowing MMIO regions
   to be exposed through dma-buf objects with lifecycle managed through
   move operations. This enables low-level interactions such as
   vfio-pci based SPDK drivers interacting directly with dma-buf capable
   RDMA devices to enable peer-to-peer operations. IOMMUFD is also now
   able to build upon this support to fill a long standing feature gap
   versus the legacy vfio type1 IOMMU backend with an implementation of
   P2P support for VM use cases that better manages the lifecycle of the
   P2P mapping (Leon Romanovsky, Jason Gunthorpe, Vivek Kasireddy)

 - Convert eventfd triggering for error and request signals to use RCU
   mechanisms in order to avoid a 3-way lockdep reported deadlock issue
   (Alex Williamson)

 - Fix a 32-bit overflow introduced via dma-buf support manifesting with
   large DMA buffers (Alex Mastro)

 - Convert nvgrace-gpu vfio-pci variant driver to insert mappings on
   fault rather than at mmap time. This conversion serves to make use
   of huge PFNMAPs, to avoid corrected RAS events during reset by now
   being subject to vfio-pci-core's use of unmap_mapping_range(), and
   to enable a device readiness test after reset (Ankit Agrawal)

 - Refactoring of vfio selftests to support multi-device tests and split
   code to provide better separation between IOMMU and device objects.
   This work also enables a new test suite addition to measure parallel
   device initialization latency (David Matlack)

* tag 'vfio-v6.19-rc1' of https://github.com/awilliam/linux-vfio: (65 commits)
  vfio: selftests: Add vfio_pci_device_init_perf_test
  vfio: selftests: Eliminate INVALID_IOVA
  vfio: selftests: Split libvfio.h into separate header files
  vfio: selftests: Move vfio_selftests_*() helpers into libvfio.c
  vfio: selftests: Rename vfio_util.h to libvfio.h
  vfio: selftests: Stop passing device for IOMMU operations
  vfio: selftests: Move IOVA allocator into iova_allocator.c
  vfio: selftests: Move IOMMU library code into iommu.c
  vfio: selftests: Rename struct vfio_dma_region to dma_region
  vfio: selftests: Upgrade driver logging to dev_err()
  vfio: selftests: Prefix logs with device BDF where relevant
  vfio: selftests: Eliminate overly chatty logging
  vfio: selftests: Support multiple devices in the same container/iommufd
  vfio: selftests: Introduce struct iommu
  vfio: selftests: Rename struct vfio_iommu_mode to iommu_mode
  vfio: selftests: Allow passing multiple BDFs on the command line
  vfio: selftests: Split run.sh into separate scripts
  vfio: selftests: Move run.sh into scripts directory
  vfio/nvgrace-gpu: wait for the GPU mem to be ready
  vfio/nvgrace-gpu: Inform devmem unmapped after reset
  ...
2025-12-04 18:42:48 -08:00
Shaurya Rane
f7e3f852a4 block: fix memory leak in __blkdev_issue_zero_pages
Move the fatal signal check before bio_alloc() to prevent a memory
leak when BLKDEV_ZERO_KILLABLE is set and a fatal signal is pending.

Previously, the bio was allocated before checking for a fatal signal.
If a signal was pending, the code would break out of the loop without
freeing or chaining the just-allocated bio, causing a memory leak.

This matches the pattern already used in __blkdev_issue_write_zeroes()
where the signal check precedes the allocation.
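
As a rough user-space sketch of the ordering described above (not the kernel code itself): the abort condition is tested before the allocation, so no freshly allocated object is ever left unreachable. The names fake_bio, issue_chunks(), release_chain() and the signal_at parameter are hypothetical stand-ins for the bio, the zeroing loop, and the pending fatal signal.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for a bio; only the chaining matters here. */
struct fake_bio {
	struct fake_bio *next;
};

/* Number of allocated-but-not-yet-freed objects, for leak checking. */
int live_bios;

/*
 * Build a chain of up to nr_chunks objects. Iteration 'signal_at'
 * simulates a pending fatal signal (pass -1 for "no signal").
 */
struct fake_bio *issue_chunks(int nr_chunks, int signal_at)
{
	struct fake_bio *chain = NULL;

	assert(nr_chunks >= 0);
	for (int i = 0; i < nr_chunks; i++) {
		struct fake_bio *bio;

		/* Check first: nothing has been allocated for this round yet. */
		if (i == signal_at)
			break;

		bio = malloc(sizeof(*bio));
		if (!bio)
			break;
		bio->next = chain;	/* every allocation is linked immediately */
		chain = bio;
		live_bios++;
	}
	return chain;
}

/* Free everything reachable from the chain head. */
void release_chain(struct fake_bio *chain)
{
	while (chain) {
		struct fake_bio *next = chain->next;

		free(chain);
		live_bios--;
		chain = next;
	}
}
```

If the check were placed after the allocation instead, the object allocated on the aborting iteration would be neither chained nor freed, which is the leak the commit describes.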

Fixes: bf86bcdb40 ("blk-lib: check for kill signal in ioctl BLKZEROOUT")
Reported-by: syzbot+527a7e48a3d3d315d862@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=527a7e48a3d3d315d862
Signed-off-by: Shaurya Rane <ssrane_b23@ee.vjti.ac.in>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Tested-by: syzbot+527a7e48a3d3d315d862@syzkaller.appspotmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-12-04 15:43:28 -07:00
Damien Le Moal
552c1149af block: Clear BLK_ZONE_WPLUG_PLUGGED when aborting plugged BIOs
Commit fe0418eb9b ("block: Prevent potential deadlocks in zone write
plug error recovery") added a WARN check in disk_put_zone_wplug() to
verify that when the last reference to a zone write plug is dropped,
this zone write plug does not have the BLK_ZONE_WPLUG_PLUGGED flag set,
that is, that it is not plugged.

However, the function disk_zone_wplug_abort(), which is called for zone
reset and zone finish operations, does not clear this flag after
emptying a zone write plug BIO list. This can cause the
disk_put_zone_wplug() warning to trigger if the user (erroneously, as
that is bad practice) issues zone reset or zone finish operations while
the target zone still has plugged BIOs.

Modify disk_zone_wplug_abort() to clear the BLK_ZONE_WPLUG_PLUGGED flag.
While at it, also add a lockdep annotation to ensure that this
function is called with the zone write plug spinlock held.

Fixes: fe0418eb9b ("block: Prevent potential deadlocks in zone write plug error recovery")
Cc: stable@vger.kernel.org
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Niklas Cassel <cassel@kernel.org>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-12-04 15:43:28 -07:00
Cong Zhang
c196bf43d7 blk-mq: Abort suspend when wakeup events are pending
During system suspend, wakeup-capable IRQs for a block device can be
delayed, which can cause blk_mq_hctx_notify_offline() to hang
indefinitely while waiting for pending requests to complete.
Skip the request waiting loop and abort suspend when wakeup events are
pending to prevent the deadlock.

Fixes: bf0beec060 ("blk-mq: drain I/O when all CPUs in a hctx are offline")
Signed-off-by: Cong Zhang <cong.zhang@oss.qualcomm.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-12-04 07:19:26 -07:00
Stefan Hajnoczi
3e2cb9ee76 block: add IOC_PR_READ_RESERVATION ioctl
Add a Persistent Reservations ioctl to read the current reservation.
This calls the pr_ops->read_reservation() function that was previously
added in commit c787f1baa5 ("block: Add PR callouts for read keys and
reservation") but was only used by the in-kernel SCSI target so far.

The IOC_PR_READ_RESERVATION ioctl is necessary so that userspace
applications that rely on Persistent Reservations ioctls have a way of
inspecting the current state. Cluster managers and validation tests need
this functionality.

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-12-04 07:19:26 -07:00
Stefan Hajnoczi
22a1ffea5f block: add IOC_PR_READ_KEYS ioctl
Add a Persistent Reservations ioctl to read the list of currently
registered reservation keys. This calls the pr_ops->read_keys() function
that was previously added in commit c787f1baa5 ("block: Add PR
callouts for read keys and reservation") but was only used by the
in-kernel SCSI target so far.

The IOC_PR_READ_KEYS ioctl is necessary so that userspace applications
that rely on Persistent Reservations ioctls have a way of inspecting the
current state. Cluster managers and validation tests need this
functionality.

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-12-04 07:19:26 -07:00
Fengnan Chang
48f22f8093 block: enable per-cpu bio cache by default
Since commit 12e4e8c7ab ("io_uring/rw: enable bio caches for
IRQ rw"), bio_put() is safe in both task and IRQ context, and
bio_alloc_bioset() is safe in task context and is never called from IRQ
context, so the per-CPU bio cache can be enabled by default.

Benchmarked with t/io_uring and ext4+nvme:
taskset -c 6 /root/fio/t/io_uring  -p0 -d128 -b4096 -s1 -c1 -F1 -B1 -R1
-X1 -n1 -P1  /mnt/testfile
Base IOPS is 562K, patched IOPS is 574K. The CPU usage of bio_alloc_bioset
decreases from 1.42% to 1.22%.

The worst case is allocating a bio on CPU A but freeing it on CPU B. Still
using t/io_uring and ext4+nvme:
base IOPS is 648K, patched IOPS is 647K.

Also tested ext4/xfs with fio using libaio/sync/io_uring on null_blk and
nvme: no obvious performance regression.

Signed-off-by: Fengnan Chang <changfengnan@bytedance.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-12-04 07:19:24 -07:00
Fengnan Chang
05ce4c584c block: use bio_alloc_bioset for passthru IO by default
Use bio_alloc_bioset for passthru IO by default, so that we can enable
the bio cache for irq and polled passthru IO later.

Signed-off-by: Fengnan Chang <changfengnan@bytedance.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-12-04 07:18:54 -07:00
Linus Torvalds
cc25df3e2e for-6.19/block-20251201
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmktsoMQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpuiUD/92eivL+HmOh10o8trvxajB0yuyqfSjHHrL
 g+xUbF4s9bgAg/v+Upx7sTY8jdrTcMjKov+G9T6uPvBMqVmeVdZckA1PSAKQaIX1
 Zb7nS2LnO7F6JKbwpwVrrIaqVbcz8MfGIIMbN4yNNEOMCwdIVMp4fo7trPBknJNx
 WddNSGUFlIF3NqSI8AflSS/pYnGm+McfBHXBpJAKipI3iquKKubHv+FX9kLp7Tn4
 x27ZoCWOHglIBTJXU0mmXCVsLF8b5BA8DQcGtT62azb8+l0cRTkaHY0DFAv5BvhG
 TqcjrKdmR0cGSNt+nEmFrujE3atBRl0G0kiHA80YgA1MTtYzdPaUVOUtM9k/rEem
 gpiGMDpBypdxyJAyijPSaVJdfcg0psOlYbhIR4N2wbj/dq8268h+cWzXlF1spgVt
 /7ygoaCmfMNbTy9rKThTjH+es787AVXUAXXaPHhIFsnCKUj8xQl4pT7XltmgYeWx
 1/XD1NEJeLHHog5upAVlGX3H5tbvP1nIICxbZa9mDOJX1rwxxI7/s/RucPjbNXuY
 AiaKPTfxtB9+Ihd2HrJ/76RVMkckcOBc4GIKoFfwuKDbcdLXQ5FcZCmVRoI1V9SV
 KsH7JBgihLwR9XWKE1vp9+CBNe1Qlu3K4IjG/E7CNLeuDntIBu73ihqGP/DqV6Bq
 RX1Dc0OyAQ==
 =m22w
 -----END PGP SIGNATURE-----

Merge tag 'for-6.19/block-20251201' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux

Pull block updates from Jens Axboe:

 - Fix head insertion for mq-deadline, a regression from when priority
   support was added

 - Series simplifying and improving the ublk user copy code

 - Various ublk related cleanups

 - Fixup REQ_NOWAIT handling in loop/zloop, clearing NOWAIT when the
   request is punted to a thread for handling

 - Merge and then later revert loop dio nowait support, as it ended up
   causing excessive stack usage for when the inline issue code needs to
   dip back into the full file system code

 - Improve auto integrity code, making it less deadlock prone

 - Speed up polled IO handling, by manually managing the hctx lookups

 - Fixes for blk-throttle for SSD devices

 - Small series with fixes for the S390 dasd driver

 - Add support for caching zones, avoiding unnecessary report zone
   queries

 - MD pull requests via Yu:
      - fix null-ptr-dereference regression for dm-raid0
      - fix IO hang for raid5 when array is broken with IO inflight
      - remove legacy 1s delay to speed up system shutdown
      - change maintainer's email address
      - data can be lost if array is created with different lbs devices,
        fix this problem and record lbs of the array in metadata
      - fix rcu protection for md_thread
      - fix mddev kobject lifetime regression
      - enable atomic writes for md-linear
      - some cleanups

 - bcache updates via Coly
      - remove useless discard and cache device code
      - improve usage of per-cpu workqueues

 - Reorganize the IO scheduler switching code, fixing some lockdep
   reports as well

 - Improve the block layer P2P DMA support

 - Add support to the block tracing code for zoned devices

 - Segment calculation improvements, and memory alignment flexibility
   improvements

 - Set of prep and cleanup patches for ublk batching support. The
   actual batching hasn't been added yet, but this helps shrink down the
   workload of getting that patchset ready for 6.20

 - Fix for how the ps3 block driver handles segment offsets

 - Improve how block plugging handles batch tag allocations

 - nbd fixes for use-after-free of the configuration on device clear/put

 - Set of improvements and fixes for zloop

 - Add Damien as maintainer of the block zoned device code handling

 - Various other fixes and cleanups

* tag 'for-6.19/block-20251201' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux: (162 commits)
  block/rnbd: correct all kernel-doc complaints
  blk-mq: use queue_hctx in blk_mq_map_queue_type
  md: remove legacy 1s delay in md_notify_reboot
  md/raid5: fix IO hang when array is broken with IO inflight
  md: warn about updating super block failure
  md/raid0: fix NULL pointer dereference in create_strip_zones() for dm-raid
  sbitmap: fix all kernel-doc warnings
  ublk: add helper of __ublk_fetch()
  ublk: pass const pointer to ublk_queue_is_zoned()
  ublk: refactor auto buffer register in ublk_dispatch_req()
  ublk: add `union ublk_io_buf` with improved naming
  ublk: add parameter `struct io_uring_cmd *` to ublk_prep_auto_buf_reg()
  kfifo: add kfifo_alloc_node() helper for NUMA awareness
  blk-mq: fix potential uaf for 'queue_hw_ctx'
  blk-mq: use array manage hctx map instead of xarray
  ublk: prevent invalid access with DEBUG
  s390/dasd: Use scnprintf() instead of sprintf()
  s390/dasd: Move device name formatting into separate function
  s390/dasd: Remove unnecessary debugfs_create() return checks
  s390/dasd: Fix gendisk parent after copy pair swap
  ...
2025-12-03 19:26:18 -08:00
Linus Torvalds
0abcfd8983 for-6.19/io_uring-20251201
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmktsm0QHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpiLvD/0dptgeJyLHKchOtRHzi/UvtM/EuNFKJrvI
 LBWCyIMjygxsVfPR41Lave9SE3UpcavF8Mg/EddasTci8VlMcDF8zPxWLb289Lz2
 tkp/wOVuyYmDhNXKmKNW59NOPTd0NosEJFTZI4VhMudwx+UtAHELJGfBWW5hRyQB
 Md+UwZ2+J9HbYd19mToaDFxz7jpIPLEE4BYUGtljveRUdpnxhyFGGUS2+CQXZt/5
 lnRvJmmEv4nSGH9ZRksix1xnV6KvJM0UwYQhrWvXhgwyiKu47zG7ONpd39KqoaRw
 Fw+6zZd0t7nyyuZkk15cKNnBLnjilnsCzmdcPq0Cuvkmbf6y1hlhEQQTGWXTKfJx
 zCZxEZcnCC4wL0CBQjZjS38AEMfH2p76M/36+NTWtlYCibY7qUtd9ndpUr49BYGo
 o4qfT0HMpI1PHuUvpZwpMcf4OX5qvtLmavT9vt78uqmtM+Aryzzuy3bI3S2SGjNe
 if/cNHnZc8Z06hUqdEit5NW+lYzj642AoF/j7qH9ADDH+VXRWaCdK/iI8tPaEpDV
 Rw6j442eVugS5tDPoTjdO8jsJ9+OCNNV1t/Jxy+Or+zrGdq7lfg4mnzEia1/izy5
 8MnSubRy6LEd+I5PnK/9y9mPIwFMIFgULi+mUjucAhJjRj5beiG74eR6+jBAdyp1
 GhFvN6fwdw==
 =4g/f
 -----END PGP SIGNATURE-----

Merge tag 'for-6.19/io_uring-20251201' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux

Pull io_uring updates from Jens Axboe:

 - Unify how task_work cancelations are detected, placing it in the
   task_work running state rather than needing to check the task state

 - Series cleaning up and moving the cancelation code to where it
   belongs, in cancel.c

 - Cleanup of waitid and futex argument handling

 - Add support for mixed sized SQEs. 6.18 added support for mixed sized
   CQEs, improving flexibility and efficiency of workloads that need big
   CQEs. This adds similar support for SQEs, where the occasional need
   for a 128b SQE doesn't necessitate having all SQEs be 128b in size

 - Introduce zcrx and SQ/CQ layout queries. The former returns what zcrx
   features are available. And both return the ring size information to
   help with allocation size calculation for user provided rings like
   IORING_SETUP_NO_MMAP and IORING_MEM_REGION_TYPE_USER

 - Zcrx updates for 6.19. It includes a bunch of small patches,
   IORING_REGISTER_ZCRX_CTRL and RQ flushing and David's work on sharing
   zcrx between multiple io_uring instances

 - Series cleaning up ring initializations, notably deduplicating ring
   size and offset calculations. It also moves most of the checking
   before doing any allocations, making the code simpler

 - Add support for getsockname and getpeername, which is mostly a
   trivial hookup after a bit of refactoring on the networking side

 - Various fixes and cleanups

* tag 'for-6.19/io_uring-20251201' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux: (68 commits)
  io_uring: Introduce getsockname io_uring cmd
  socket: Split out a getsockname helper for io_uring
  socket: Unify getsockname and getpeername implementation
  io_uring/query: drop unused io_handle_query_entry() ctx arg
  io_uring/kbuf: remove obsolete buf_nr_pages and update comments
  io_uring/register: use correct location for io_rings_layout
  io_uring/zcrx: share an ifq between rings
  io_uring/zcrx: add io_fill_zcrx_offsets()
  io_uring/zcrx: export zcrx via a file
  io_uring/zcrx: move io_zcrx_scrub() and dependencies up
  io_uring/zcrx: count zcrx users
  io_uring/zcrx: add sync refill queue flushing
  io_uring/zcrx: introduce IORING_REGISTER_ZCRX_CTRL
  io_uring/zcrx: elide passing msg flags
  io_uring/zcrx: use folio_nr_pages() instead of shift operation
  io_uring/zcrx: convert to use netmem_desc
  io_uring/query: introduce rings info query
  io_uring/query: introduce zcrx query
  io_uring: move cq/sq user offset init around
  io_uring: pre-calculate scq layout
  ...
2025-12-03 18:58:57 -08:00
Linus Torvalds
9368f0f941 vfs-6.19-rc1.inode
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCaSmOZAAKCRCRxhvAZXjc
 omMSAP9GLhavxyWQ24Q+49CNWWRQWDY1wTOiUK2BwtIvZ0YEcAD8D1dAiMckL5pC
 RwEAVA5p+y+qi+bZP0KXCBxQddoTIQM=
 =zo/J
 -----END PGP SIGNATURE-----

Merge tag 'vfs-6.19-rc1.inode' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull vfs inode updates from Christian Brauner:
 "Features:

   - Hide inode->i_state behind accessors. Open-coded accesses prevent
     asserting they are done correctly. One obvious aspect is locking,
     but significantly more can be checked. For example it can be
     detected when the code is clearing flags which are already missing,
     or is setting flags when it is illegal (e.g., I_FREEING when
     ->i_count > 0)

   - Provide accessors for ->i_state, converts all filesystems using
     coccinelle and manual conversions (btrfs, ceph, smb, f2fs, gfs2,
     overlayfs, nilfs2, xfs), and makes plain ->i_state access fail to
     compile

   - Rework I_NEW handling to operate without fences, simplifying the
     code after the accessor infrastructure is in place

  Cleanups:

   - Move wait_on_inode() from writeback.h to fs.h

   - Spell out fenced ->i_state accesses with explicit smp_wmb/smp_rmb
     for clarity

   - Cosmetic fixes to LRU handling

   - Push list presence check into inode_io_list_del()

   - Touch up predicts in __d_lookup_rcu()

   - ocfs2: retire ocfs2_drop_inode() and I_WILL_FREE usage

   - Assert on ->i_count in iput_final()

   - Assert ->i_lock held in __iget()

  Fixes:

   - Add missing fences to I_NEW handling"

* tag 'vfs-6.19-rc1.inode' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (22 commits)
  dcache: touch up predicts in __d_lookup_rcu()
  fs: push list presence check into inode_io_list_del()
  fs: cosmetic fixes to lru handling
  fs: rework I_NEW handling to operate without fences
  fs: make plain ->i_state access fail to compile
  xfs: use the new ->i_state accessors
  nilfs2: use the new ->i_state accessors
  overlayfs: use the new ->i_state accessors
  gfs2: use the new ->i_state accessors
  f2fs: use the new ->i_state accessors
  smb: use the new ->i_state accessors
  ceph: use the new ->i_state accessors
  btrfs: use the new ->i_state accessors
  Manual conversion to use ->i_state accessors of all places not covered by coccinelle
  Coccinelle-based conversion to use ->i_state accessors
  fs: provide accessors for ->i_state
  fs: spell out fenced ->i_state accesses with explicit smp_wmb/smp_rmb
  fs: move wait_on_inode() from writeback.h to fs.h
  fs: add missing fences to I_NEW handling
  ocfs2: retire ocfs2_drop_inode() and I_WILL_FREE usage
  ...
2025-12-01 09:02:34 -08:00
Linus Torvalds
b04b2e7a61 vfs-6.19-rc1.misc
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCaSmOZAAKCRCRxhvAZXjc
 onGCAQDEHKNEuZMhkyd3K5YsJtMzZlW/uXp4+Wddeob+5yQp0wEA09xN4CJNMwhP
 J6Kjaa80hWfrFacqSvyMUwQHHw6mngs=
 =5Mom
 -----END PGP SIGNATURE-----

Merge tag 'vfs-6.19-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull misc vfs updates from Christian Brauner:
 "Features:

   - Cheaper MAY_EXEC handling for path lookup. This elides MAY_WRITE
     permission checks during path lookup and adds the
     IOP_FASTPERM_MAY_EXEC flag so filesystems like btrfs can avoid
     expensive permission work.

   - Hide dentry_cache behind runtime const machinery.

   - Add German Maglione as virtiofs co-maintainer.

  Cleanups:

   - Tidy up and inline step_into() and walk_component() for improved
     code generation.

   - Re-enable IOCB_NOWAIT writes to files. This refactors file
     timestamp update logic, fixing a layering bypass in btrfs when
     updating timestamps on device files and improving FMODE_NOCMTIME
     handling in VFS now that nfsd started using it.

   - Path lookup optimizations extracting slowpaths into dedicated
     routines and adding branch prediction hints for mntput_no_expire(),
     fd_install(), lookup_slow(), and various other hot paths.

   - Enable clang's -fms-extensions flag, requiring a JFS rename to
     avoid conflicts.

   - Remove spurious exports in fs/file_attr.c.

   - Stop duplicating union pipe_index declaration. This depends on the
     shared kbuild branch that brings in -fms-extensions support which
     is merged into this branch.

   - Use MD5 library instead of crypto_shash in ecryptfs.

   - Use largest_zero_folio() in iomap_dio_zero().

   - Replace simple_strtol/strtoul with kstrtoint/kstrtouint in init and
     initrd code.

   - Various typo fixes.

  Fixes:

   - Fix emergency sync for btrfs. Btrfs requires an explicit sync_fs()
     call with wait == 1 to commit super blocks. The emergency sync path
     never passed this, leaving btrfs data uncommitted during emergency
     sync.

   - Use local kmap in watch_queue's post_one_notification().

   - Add hint prints in sb_set_blocksize() for LBS dependency on THP"

* tag 'vfs-6.19-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (35 commits)
  MAINTAINERS: add German Maglione as virtiofs co-maintainer
  fs: inline step_into() and walk_component()
  fs: tidy up step_into() & friends before inlining
  orangefs: use inode_update_timestamps directly
  btrfs: fix the comment on btrfs_update_time
  btrfs: use vfs_utimes to update file timestamps
  fs: export vfs_utimes
  fs: lift the FMODE_NOCMTIME check into file_update_time_flags
  fs: refactor file timestamp update logic
  include/linux/fs.h: trivial fix: regualr -> regular
  fs/splice.c: trivial fix: pipes -> pipe's
  fs: mark lookup_slow() as noinline
  fs: add predicts based on nd->depth
  fs: move mntput_no_expire() slowpath into a dedicated routine
  fs: remove spurious exports in fs/file_attr.c
  watch_queue: Use local kmap in post_one_notification()
  fs: touch up predicts in path lookup
  fs: move fd_install() slowpath into a dedicated routine and provide commentary
  fs: hide dentry_cache behind runtime const machinery
  fs: touch predicts in do_dentry_open()
  ...
2025-12-01 08:44:26 -08:00
Linus Torvalds
1885cdbfbb vfs-6.19-rc1.iomap
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCaSmOZAAKCRCRxhvAZXjc
 ooCXAQCwzX2GS/55QHV6JXBBoNxguuSQ5dCj91ZmTfHzij0xNAEAhKEBw7iMGX72
 c2/x+xYf+Pc6mAfxdus5RLMggqBFPAk=
 =jInB
 -----END PGP SIGNATURE-----

Merge tag 'vfs-6.19-rc1.iomap' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull iomap updates from Christian Brauner:
 "FUSE iomap Support for Buffered Reads:

    This adds iomap support for FUSE buffered reads and readahead. This
    enables granular uptodate tracking with large folios so only
    non-uptodate portions need to be read. Also fixes a race condition
    with large folios + writeback cache that could cause data corruption
    on partial writes followed by reads.

     - Refactored iomap read/readahead bio logic into helpers
     - Added caller-provided callbacks for read operations
     - Moved buffered IO bio logic into new file
     - FUSE now uses iomap for read_folio and readahead

  Zero Range Folio Batch Support:

    Add folio batch support for iomap_zero_range() to handle dirty
    folios over unwritten mappings. Fix raciness issues where dirty data
    could be lost during zero range operations.

     - filemap_get_folios_tag_range() helper for dirty folio lookup
     - Optional zero range dirty folio processing
     - XFS fills dirty folios on zero range of unwritten mappings
     - Removed old partial EOF zeroing optimization

  DIO Write Completions from Interrupt Context:

    Restore pre-iomap behavior where pure overwrite completions run
    inline rather than being deferred to workqueue. Reduces context
    switches for high-performance workloads like ScyllaDB.

     - Removed unused IOCB_DIO_CALLER_COMP code
     - Error completions always run in user context (fixes zonefs)
     - Reworked REQ_FUA selection logic
     - Inverted IOMAP_DIO_INLINE_COMP to IOMAP_DIO_OFFLOAD_COMP

  Buffered IO Cleanups:

    Some performance and code clarity improvements:

     - Replace manual bitmap scanning with find_next_bit()
     - Simplify read skip logic for writes
     - Optimize pending async writeback accounting
     - Better variable naming
     - Documentation for iomap_finish_folio_write() requirements

  Misaligned Vectors for Zoned XFS:

    Enables sub-block aligned vectors in XFS always-COW mode for zoned
    devices via new IOMAP_DIO_FSBLOCK_ALIGNED flag.

  Bug Fixes:

     - Allocate s_dio_done_wq for async reads (fixes syzbot report after
       error completion changes)
     - Fix iomap_read_end() for already uptodate folios (regression fix)"

* tag 'vfs-6.19-rc1.iomap' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (40 commits)
  iomap: allocate s_dio_done_wq for async reads as well
  iomap: fix iomap_read_end() for already uptodate folios
  iomap: invert the polarity of IOMAP_DIO_INLINE_COMP
  iomap: support write completions from interrupt context
  iomap: rework REQ_FUA selection
  iomap: always run error completions in user context
  fs, iomap: remove IOCB_DIO_CALLER_COMP
  iomap: use find_next_bit() for uptodate bitmap scanning
  iomap: use find_next_bit() for dirty bitmap scanning
  iomap: simplify when reads can be skipped for writes
  iomap: simplify ->read_folio_range() error handling for reads
  iomap: optimize pending async writeback accounting
  docs: document iomap writeback's iomap_finish_folio_write() requirement
  iomap: account for unaligned end offsets when truncating read range
  iomap: rename bytes_pending/bytes_accounted to bytes_submitted/bytes_not_submitted
  xfs: support sub-block aligned vectors in always COW mode
  iomap: add IOMAP_DIO_FSBLOCK_ALIGNED flag
  xfs: error tag to force zeroing on debug kernels
  iomap: remove old partial eof zeroing optimization
  xfs: fill dirty folios on zero range of unwritten mappings
  ...
2025-12-01 08:14:00 -08:00
Fengnan Chang
4d0e1f2139 blk-mq: use queue_hctx in blk_mq_map_queue_type
Some callers of blk_mq_map_queue_type(), such as
blk_mq_cpu_mapped_to_hctx(), no longer grab 'q_usage_counter', so we need
to protect 'queue_hw_ctx' through RCU.

All other functions were also checked; no more missed cases were found.

Fixes: 89e1fb7cef ("blk-mq: fix potential uaf for 'queue_hw_ctx'")
Reported-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Fengnan Chang <changfengnan@bytedance.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-12-01 07:18:31 -07:00
Fengnan Chang
89e1fb7cef blk-mq: fix potential uaf for 'queue_hw_ctx'
This just applies Kuai's patch in [1] with minor changes.

blk_mq_realloc_hw_ctxs() will free the 'queue_hw_ctx' (e.g. updating
submit_queues through configfs for null_blk), while it might still be
used from another context (e.g. switching the elevator to none):

t1					t2
elevator_switch
 blk_mq_unquiesce_queue
  blk_mq_run_hw_queues
   queue_for_each_hw_ctx
    // assembly code for hctx = (q)->queue_hw_ctx[i]
    mov    0x48(%rbp),%rdx -> read old queue_hw_ctx

					__blk_mq_update_nr_hw_queues
					 blk_mq_realloc_hw_ctxs
					  hctxs = q->queue_hw_ctx
					  q->queue_hw_ctx = new_hctxs
					  kfree(hctxs)
    movslq %ebx,%rax
    mov    (%rdx,%rax,8),%rdi ->uaf

This problem was found by code review, and I confirmed that the concurrent
scenario does exist (specifically, 'q->queue_hw_ctx' can be changed during
blk_mq_run_hw_queues()); however, the uaf problem hasn't been reproduced yet
without hacking the kernel.

Since the queue is frozen in __blk_mq_update_nr_hw_queues(), fix the
problem by protecting 'queue_hw_ctx' through rcu where it can be accessed
without grabbing 'q_usage_counter'.

[1] https://lore.kernel.org/all/20220225072053.2472431-1-yukuai3@huawei.com/

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Fengnan Chang <changfengnan@bytedance.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-28 09:09:19 -07:00
Fengnan Chang
d0c98769ee blk-mq: use array manage hctx map instead of xarray
After commit 4e5cc99e1e ("blk-mq: manage hctx map via xarray"), we use
an xarray instead of an array to store hctxs, but in poll mode, each call
to blk_mq_poll needs xa_load to find the corresponding hctx, which
introduces some cost. In my test, xa_load may cost 3.8% cpu.

This patch reverts that change, which eliminates the overhead of xa_load
and can result in a 3% performance improvement.

Signed-off-by: Fengnan Chang <changfengnan@bytedance.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-28 09:09:19 -07:00
Chaitanya Kulkarni
7d09a8e251 block: ignore __blkdev_issue_discard() return value
__blkdev_issue_discard() always returns 0, making the error check
in blkdev_issue_discard() dead code.

In function blkdev_issue_discard() initialize ret = 0, remove ret
assignment from __blkdev_issue_discard(), rely on bio == NULL check to
call submit_bio_wait(), preserve submit_bio_wait() error handling, and
preserve -EOPNOTSUPP to 0 mapping.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Chaitanya Kulkarni <ckulkarnilinux@gmail.com>
Reviewed-by: Anuj Gupta <anuj20.g@samsung.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-25 12:19:39 -07:00
shechenglong
3a64c46c40 block: fix typos in comments and strings in blk-core
This patch fixes multiple spelling mistakes in comments and documentation
in the file block/blk-core.c.

No functional changes intended.

Suggested-by: Christoph Hellwig <hch@infradead.org>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: shechenglong <shechenglong@xfusion.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-25 10:39:49 -07:00
John Garry
a74de0c366 block: Remove references to __device_add_disk()
Since commit d1254a8749 ("block: remove support for delayed queue
registrations"), function __device_add_disk() has been replaced with
device_add_disk(), so fix up comments.

Signed-off-by: John Garry <john.g.garry@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-25 10:36:37 -07:00
Leon Romanovsky
d4504262f7 PCI/P2PDMA: Simplify bus address mapping API
Update the pci_p2pdma_bus_addr_map() function to take a direct pointer
to the p2pdma_provider structure instead of the pci_p2pdma_map_state.
This simplifies the API by removing the need for callers to extract
the provider from the state structure.

The change updates all callers across the kernel (block layer, IOMMU,
DMA direct, and HMM) to pass the provider pointer directly, making
the code more explicit and reducing unnecessary indirection. This
also removes the runtime warning check since callers now have direct
control over which provider they use.

Tested-by: Alex Mastro <amastro@fb.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Acked-by: Ankit Agrawal <ankita@nvidia.com>
Link: https://lore.kernel.org/r/20251120-dmabuf-vfio-v9-2-d7f71607f371@nvidia.com
Signed-off-by: Alex Williamson <alex@shazbot.org>
2025-11-20 12:01:41 -07:00
David Laight
9420e720ad block: use min() instead of min_t()
min_t(unsigned int, a, b) casts an 'unsigned long' to 'unsigned int'.
Use min(a, b) instead as it promotes any 'unsigned int' to 'unsigned long'
and so cannot discard significant bits.

In this case the 'unsigned long' value is small enough that the result
is ok.

(Similarly for max_t() and clamp_t().)

Detected by an extra check added to min_t().
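
To illustrate the truncation being described, here is a small user-space sketch, not the kernel macros themselves. It uses uint64_t and uint32_t as stand-ins for 'unsigned long' and 'unsigned int' on a 64-bit build, and the helper names min_cast_first() and min_promote() are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Mimics min_t(unsigned int, a, b): both operands are cast to the
 * narrow type before comparing, so the high 32 bits of 'a' are lost.
 */
uint32_t min_cast_first(uint64_t a, uint32_t b)
{
	uint32_t a32 = (uint32_t)a;	/* significant bits may be discarded here */

	return a32 < b ? a32 : b;
}

/*
 * Mimics min(a, b) with mixed unsigned widths: the narrow operand is
 * promoted to the wide type, so no bits are discarded.
 */
uint64_t min_promote(uint64_t a, uint32_t b)
{
	uint64_t b64 = b;

	assert(b64 <= UINT32_MAX);	/* promotion never changes the value */
	return a < b64 ? a : b64;
}
```

With a = 0x100000001 and b = 16, min_cast_first() returns 1 because the cast keeps only the low 32 bits of a, while min_promote() returns 16. When the wide value already fits in 32 bits, as in the case this commit touches, both forms agree.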

Signed-off-by: David Laight <david.laight.linux@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-20 07:44:29 -07:00
Chengkaitao
8e1d91c258 block: remove the declaration of elevator_init_mq function
In commit 1e44bedbc9 ("block: unifying elevator change"), the
elevator_init_mq function was deleted, but its declaration in elevator.h
was overlooked. This patch fixes it.

Signed-off-by: Chengkaitao <chengkaitao@kylinos.cn>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-18 16:27:19 -07:00
Jens Axboe
caebce24f6 Revert "block: consider discard merge last"
This reverts commit 2516c246d0.

Suspected issues with discard merging post this patch, hence revert
it for now.

Link: https://lore.kernel.org/linux-block/26acdfdf-de13-430b-8c73-f890c7689a84@kernel.dk/
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-18 15:00:12 -07:00
Bart Van Assche
935a20d1be block: Remove queue freezing from several sysfs store callbacks
Freezing the request queue from inside sysfs store callbacks may cause a
deadlock in combination with the dm-multipath driver and the
queue_if_no_path option. Additionally, freezing the request queue slows
down system boot on systems where sysfs attributes are set synchronously.

Fix this by removing the blk_mq_freeze_queue() / blk_mq_unfreeze_queue()
calls from the store callbacks that do not strictly need these calls.
Add the __data_racy annotation to request_queue.rq_timeout to suppress
KCSAN data race reports about the rq_timeout reads.

This patch may cause a small delay in applying the new settings.

For all the attributes affected by this patch, I/O will complete
correctly whether the old or the new value of the attribute is used.

This patch affects the following sysfs attributes:
* io_poll_delay
* io_timeout
* nomerges
* read_ahead_kb
* rq_affinity

Here is an example of a deadlock triggered by running test srp/002
if this patch is not applied:

task:multipathd
Call Trace:
 <TASK>
 __schedule+0x8c1/0x1bf0
 schedule+0xdd/0x270
 schedule_preempt_disabled+0x1c/0x30
 __mutex_lock+0xb89/0x1650
 mutex_lock_nested+0x1f/0x30
 dm_table_set_restrictions+0x823/0xdf0
 __bind+0x166/0x590
 dm_swap_table+0x2a7/0x490
 do_resume+0x1b1/0x610
 dev_suspend+0x55/0x1a0
 ctl_ioctl+0x3a5/0x7e0
 dm_ctl_ioctl+0x12/0x20
 __x64_sys_ioctl+0x127/0x1a0
 x64_sys_call+0xe2b/0x17d0
 do_syscall_64+0x96/0x3a0
 entry_SYSCALL_64_after_hwframe+0x4b/0x53
 </TASK>
task:(udev-worker)
Call Trace:
 <TASK>
 __schedule+0x8c1/0x1bf0
 schedule+0xdd/0x270
 blk_mq_freeze_queue_wait+0xf2/0x140
 blk_mq_freeze_queue_nomemsave+0x23/0x30
 queue_ra_store+0x14e/0x290
 queue_attr_store+0x23e/0x2c0
 sysfs_kf_write+0xde/0x140
 kernfs_fop_write_iter+0x3b2/0x630
 vfs_write+0x4fd/0x1390
 ksys_write+0xfd/0x230
 __x64_sys_write+0x76/0xc0
 x64_sys_call+0x276/0x17d0
 do_syscall_64+0x96/0x3a0
 entry_SYSCALL_64_after_hwframe+0x4b/0x53
 </TASK>

Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Nilay Shroff <nilay@linux.ibm.com>
Cc: Martin Wilck <mwilck@suse.com>
Cc: Benjamin Marzinski <bmarzins@redhat.com>
Cc: stable@vger.kernel.org
Fixes: af28141498 ("block: freeze the queue in queue_attr_store")
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-18 15:00:11 -07:00
Xue He
152c331bcd block: plug attempts to batch allocate tags multiple times
When a batch of I/O is submitted under a plug and the initial batch tag
allocation returns fewer tags than needed, retry the batch allocation
until enough tags are available. This avoids falling back to frequent
single-tag allocations for the remaining requests.
-----------------------------------------------------------
HW:
16 CPUs/16 poll queues
Disk: Samsung PM9A3 Gen4 3.84T

CMD:
[global]
ioengine=io_uring
group_reporting=1
time_based=1
runtime=1m
refill_buffers=1
norandommap=1
randrepeat=0
fixedbufs=1
registerfiles=1
rw=randread
iodepth=128
iodepth_batch_submit=32
iodepth_batch_complete_min=32
iodepth_batch_complete_max=128
iodepth_low=32
bs=4k
numjobs=1
direct=1
hipri=1

[job1]
filename=/dev/nvme0n1
name=batch_test
------------------------------------------------------------
Perf:
base code: __blk_mq_alloc_requests() 1.47%
patch: __blk_mq_alloc_requests() 0.75%
------------------------------------------------------------

Signed-off-by: hexue <xue01.he@samsung.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-18 14:59:41 -07:00
Li Chen
3179a5f7f8 block: rate-limit capacity change info log
Loop devices under a heavy stress-ng loop stressor can trigger many
capacity change events in a short time. Each event prints an info
message from set_capacity_and_notify(), flooding the console and
contributing to soft lockups on slow consoles.

Switch the printk in set_capacity_and_notify() to
pr_info_ratelimited() so frequent capacity changes do not spam
the log while still reporting occasional changes.
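
As a rough illustration of what rate-limited logging does, here is a
userspace sketch of an interval/burst limiter. It is not the kernel's
implementation: struct ratelimit, ratelimit_allow() and the time units
are hypothetical names chosen for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical userspace model of an interval/burst rate limiter. */
struct ratelimit {
	uint64_t interval;	/* window length, in arbitrary time units */
	unsigned int burst;	/* messages allowed per window */
	uint64_t begin;		/* start of the current window */
	unsigned int printed;	/* messages emitted in the current window */
	unsigned int missed;	/* messages suppressed in the current window */
};

/* Return true if a message may be emitted at time 'now'. */
static bool ratelimit_allow(struct ratelimit *rl, uint64_t now)
{
	assert(rl != NULL);

	if (now - rl->begin >= rl->interval) {
		/* A new window starts: reset the counters. */
		rl->begin = now;
		rl->printed = 0;
		rl->missed = 0;
	}
	if (rl->printed < rl->burst) {
		rl->printed++;
		return true;
	}
	rl->missed++;
	return false;
}
```

With interval = 5 and burst = 2, the first two calls inside a window
return true and later calls return false until the window expires, so
occasional messages still get through.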

Cc: stable@vger.kernel.org
Signed-off-by: Li Chen <chenl311@chinatelecom.cn>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-17 11:30:24 -07:00
Linus Torvalds
e7c375b181 vfs-6.18-rc7.fixes
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCaRtBJwAKCRCRxhvAZXjc
 ou5CAQCJb5y2ULKklblICU1wR7Nr15WvTW7VVOcv44RJ22S3NgEAy4DLDBFBw8zC
 8e7Hp8gxbjsq8ZJmU088aobFcqbZOwk=
 =TAnu
 -----END PGP SIGNATURE-----

Merge tag 'vfs-6.18-rc7.fixes' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs

Pull vfs fixes from Christian Brauner:

 - Fix uninitialized variable in statmount_string()

 - Fix hostfs mounting when passing host root during boot

 - Fix dynamic lookup to fail on cell lookup failure

 - Fix missing file type when reading bfs inodes from disk

 - Enforce checking of sb_min_blocksize() calls and update all callers
   accordingly

 - Restore write access before closing files opened by open_exec() in
   binfmt_misc

 - Always freeze efivarfs during suspend/hibernate cycles

 - Fix statmount()'s and listmount()'s grab_requested_mnt_ns() helper to
   actually allow mount namespace file descriptor in addition to mount
   namespace ids

 - Fix tmpfs remount when noswap is specified

 - Switch Landlock to iput_not_last() to remove false-positives from
   might_sleep() annotations in iput()

 - Remove dead node_to_mnt_ns() code

 - Ensure that per-queue kobjects are successfully created

* tag 'vfs-6.18-rc7.fixes' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs:
  landlock: fix splats from iput() after it started calling might_sleep()
  fs: add iput_not_last()
  shmem: fix tmpfs reconfiguration (remount) when noswap is set
  fs/namespace: correctly handle errors returned by grab_requested_mnt_ns
  power: always freeze efivarfs
  binfmt_misc: restore write access before closing files opened by open_exec()
  block: add __must_check attribute to sb_min_blocksize()
  virtio-fs: fix incorrect check for fsvq->kobj
  xfs: check the return value of sb_min_blocksize() in xfs_fs_fill_super
  isofs: check the return value of sb_min_blocksize() in isofs_fill_super
  exfat: check return value of sb_min_blocksize in exfat_read_boot_sector
  vfat: fix missing sb_min_blocksize() return value checks
  mnt: Remove dead code which might prevent from building
  bfs: Reconstruct file type when loading from disk
  afs: Fix dynamic lookup to fail on cell lookup failure
  hostfs: Fix only passing host root in boot stage with new mount
  fs: Fix uninitialized 'offp' in statmount_string()
2025-11-17 09:11:27 -08:00
Guenter Roeck
6483faa393 block/blk-throttle: Remove throtl_slice from struct throtl_data
throtl_slice is now a constant. Remove the variable and use the constant
directly where needed.

Cc: Yu Kuai <yukuai@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Khazhismel Kumykov <khazhy@google.com>
Reviewed-by: Yu Kuai <yukuai@fnnas.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-17 09:39:48 -07:00
Guenter Roeck
20d0b359c7 block/blk-throttle: drop unneeded blk_stat_enable_accounting
After the removal of CONFIG_BLK_DEV_THROTTLING_LOW, it is no longer
necessary to enable block accounting, so remove the call to
blk_stat_enable_accounting(). With that, the track_bio_latency variable
is no longer used and can be deleted from struct throtl_data. Also,
including blk-stat.h is no longer necessary.

Fixes: bf20ab538c ("blk-throttle: remove CONFIG_BLK_DEV_THROTTLING_LOW")
Cc: Yu Kuai <yukuai@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Khazhismel Kumykov <khazhy@google.com>
Reviewed-by: Yu Kuai <yukuai@fnnas.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-17 09:39:48 -07:00
Guenter Roeck
f76581f9f1 block/blk-throttle: Fix throttle slice time for SSDs
Commit d61fcfa4bb ("blk-throttle: choose a small throtl_slice for SSD")
introduced device type specific throttle slices if BLK_DEV_THROTTLING_LOW
was enabled. Commit bf20ab538c ("blk-throttle: remove
CONFIG_BLK_DEV_THROTTLING_LOW") removed support for BLK_DEV_THROTTLING_LOW,
but left the device type specific throttle slices in place. This
effectively changed throttling behavior on systems with SSDs, which now use
a different and non-configurable slice time compared to non-SSD devices.
Practical impact is that throughput tests with low configured throttle
values (65536 bps) experience less than expected throughput on SSDs,
presumably due to rounding errors associated with the small throttle slice
time used for those devices. The same tests pass when setting the throttle
values to 65536 * 4 = 262144 bps.
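
To give a feel for the numbers involved, the sketch below computes how
many bytes fit into one throttle slice when the budget is truncated by
integer division. It does not reproduce the blk-throttle calculation;
bytes_per_slice() is a hypothetical helper, and the slice lengths of
100 ms (HD) and 20 ms (SSD) are assumptions not stated in this message.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Bytes that may be dispatched in one slice of 'slice_ms' milliseconds
 * at 'bps' bytes per second, truncated the way integer division does.
 */
static uint64_t bytes_per_slice(uint64_t bps, uint64_t slice_ms)
{
	/* Guard against overflow in the multiplication below. */
	assert(slice_ms > 0 && bps <= UINT64_MAX / slice_ms);

	return bps * slice_ms / 1000;
}
```

Under these assumptions, 65536 bps gives 1310 bytes per 20 ms slice and
6553 bytes per 100 ms slice, while 262144 bps gives 5242 bytes per 20 ms
slice. A small slice therefore leaves a budget below a single 4 KiB I/O,
which is where truncation matters most.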

The original code sets the throttle slice time to DFL_THROTL_SLICE_HD if
CONFIG_BLK_DEV_THROTTLING_LOW is disabled. Restore that code to fix the
problem. With that, DFL_THROTL_SLICE_SSD is no longer necessary. Revert to
the original code and re-introduce DFL_THROTL_SLICE to replace both
DFL_THROTL_SLICE_HD and DFL_THROTL_SLICE_SSD. This effectively reverts
commit d61fcfa4bb ("blk-throttle: choose a small throtl_slice for SSD").

While at it, also remove MAX_THROTL_SLICE since it is not used anymore.

Fixes: bf20ab538c ("blk-throttle: remove CONFIG_BLK_DEV_THROTTLING_LOW")
Cc: Yu Kuai <yukuai@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Khazhismel Kumykov <khazhy@google.com>
Reviewed-by: Yu Kuai <yukuai@fnnas.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-17 09:39:48 -07:00
Keith Busch
2516c246d0 block: consider discard merge last
If the next discard range is contiguous with the current range being
considered, it's cheaper to expand the current range than to append an
additional bio.

Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-17 09:39:36 -07:00
Leon Romanovsky
37f0c7a8df block-dma: properly take MMIO path
In commit eadaa8b255 ("dma-mapping: introduce new DMA attribute to
indicate MMIO memory"), the DMA_ATTR_MMIO attribute was added to describe
MMIO addresses, which must not be subject to any CPU cache flushing, as
an outcome of the discussion referenced in the Link tag below.

In the case of a PCI_P2PDMA_MAP_THRU_HOST_BRIDGE transfer, the blk-mq-dma
logic treated this as a regular page and relied on the "struct page" DMA
flow. That flow performs CPU cache flushing, which shouldn't be done here,
and doesn't set the IOMMU_MMIO flag in the DMA-IOMMU case.

As a solution, let's encode the peer-to-peer transaction type in the NVMe
IOD flags variable and provide it to the blk-mq-dma API.

Link: https://lore.kernel.org/all/f912c446-1ae9-4390-9c11-00dce7bf0fd3@arm.com/
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-14 05:09:56 -07:00
Leon Romanovsky
61d43b1731 nvme-pci: migrate to dma_map_phys instead of map_page
After the introduction of dma_map_phys(), there is no need to convert
from a physical address to a struct page in order to map a page. So let's
use it directly.

Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-14 05:09:56 -07:00
Nilay Shroff
d4c3ef56a1 block: define alloc_sched_data and free_sched_data methods for kyber
Currently, the Kyber elevator allocates its private data dynamically in
->init_sched and frees it in ->exit_sched. However, since ->init_sched
is invoked during elevator switch after acquiring both ->freeze_lock and
->elevator_lock, it may trigger the lockdep splat [1] due to dependency
on pcpu_alloc_mutex.

To resolve this, move the elevator data allocation and deallocation
logic from ->init_sched and ->exit_sched into the newly introduced
->alloc_sched_data and ->free_sched_data methods. These callbacks are
invoked before acquiring ->freeze_lock and ->elevator_lock, ensuring
that memory allocation happens safely without introducing additional
locking dependencies.

This change breaks the dependency chain involving pcpu_alloc_mutex and
prevents the reported lockdep warning.
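
The ordering described above can be sketched in userspace as follows.
This is only a model under stated assumptions: struct queue, its boolean
"lock" flags and switch_elevator() are hypothetical stand-ins, not the
kernel's data structures or callbacks.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical model: booleans stand in for the two kernel locks. */
struct queue {
	bool frozen;		/* stand-in for ->freeze_lock */
	bool elevator_locked;	/* stand-in for ->elevator_lock */
	void *sched_data;	/* scheduler private data */
};

/* Allocation step: must run with neither lock held. */
static void *alloc_sched_data(const struct queue *q)
{
	assert(!q->frozen && !q->elevator_locked);
	return calloc(1, 64);
}

static int switch_elevator(struct queue *q)
{
	void *data = alloc_sched_data(q);	/* 1: allocate, no locks held */
	void *old;

	if (!data)
		return -ENOMEM;

	q->frozen = true;			/* 2: take both locks */
	q->elevator_locked = true;
	old = q->sched_data;
	q->sched_data = data;			/* 3: only install the pointer */
	q->elevator_locked = false;		/* 4: drop both locks */
	q->frozen = false;

	free(old);				/* 5: free outside the locks */
	return 0;
}
```

The point of the design is that the allocator is never entered while the
locks are held, so no lock ordering between them and the allocator's own
locks can form.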

[1] https://lore.kernel.org/all/CAGVVp+VNW4M-5DZMNoADp6o2VKFhi7KxWpTDkcnVyjO0=-D5+A@mail.gmail.com/

Reported-by: Changhui Zhong <czhong@redhat.com>
Reported-by: Yi Zhang <yi.zhang@redhat.com>
Closes: https://lore.kernel.org/all/CAGVVp+VNW4M-5DZMNoADp6o2VKFhi7KxWpTDkcnVyjO0=-D5+A@mail.gmail.com/
Tested-by: Yi Zhang <yi.zhang@redhat.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Yu Kuai <yukuai@fnnas.com>
Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-13 09:27:49 -07:00
Nilay Shroff
0315476e78 block: use {alloc|free}_sched data methods
The previous patch introduced ->alloc_sched_data and
->free_sched_data methods. This patch builds upon that
by now using these methods during elevator switch and
nr_hw_queue update.

It's also ensured that scheduler-specific data is
allocated and freed through the new callbacks outside
of the ->freeze_lock and ->elevator_lock locking contexts,
thereby preventing any dependency on pcpu_alloc_mutex.

Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Yu Kuai <yukuai@fnnas.com>
Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-13 09:27:49 -07:00
Nilay Shroff
61019afdf6 block: introduce alloc_sched_data and free_sched_data elevator methods
The recent lockdep splat [1] highlights a potential deadlock risk
involving ->elevator_lock and ->freeze_lock dependencies on
pcpu_alloc_mutex. The trace shows that the issue occurs when the Kyber scheduler
allocates dynamic memory for its elevator data during initialization.

To address this, introduce two new elevator operation callbacks:
->alloc_sched_data and ->free_sched_data. The subsequent patch would
build upon these newly introduced methods to suppress lockdep splat[1].

[1] https://lore.kernel.org/all/CAGVVp+VNW4M-5DZMNoADp6o2VKFhi7KxWpTDkcnVyjO0=-D5+A@mail.gmail.com/

Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-13 09:27:49 -07:00
Nilay Shroff
04728ce909 block: move elevator tags into struct elevator_resources
This patch introduces a new structure, struct elevator_resources, to
group together all elevator-related resources that share the same
lifetime. As a first step, this change moves the elevator tag pointer
from struct elv_change_ctx into the new struct elevator_resources.

Additionally, rename blk_mq_alloc_sched_tags_batch() and
blk_mq_free_sched_tags_batch() to blk_mq_alloc_sched_res_batch() and
blk_mq_free_sched_res_batch(), respectively. Introduce two new wrapper
helpers, blk_mq_alloc_sched_res() and blk_mq_free_sched_res(), around
blk_mq_alloc_sched_tags() and blk_mq_free_sched_tags().

These changes pave the way for consolidating the allocation and freeing
of elevator-specific resources into common helper functions. This
refactoring improves encapsulation and prepares the code for future
extensions, allowing additional elevator-specific data to be added to
struct elevator_resources without cluttering struct elv_change_ctx.

Subsequent patches will extend struct elevator_resources to include
other elevator-related data.

Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Yu Kuai <yukuai@fnnas.com>
Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-13 09:27:49 -07:00
Nilay Shroff
232143b605 block: unify elevator tags and type xarrays into struct elv_change_ctx
Currently, the nr_hw_queues update path manages two disjoint xarrays —
one for elevator tags and another for elevator type — both used during
elevator switching. Maintaining these two parallel structures for the
same purpose adds unnecessary complexity and potential for mismatched
state.

This patch unifies both xarrays into a single structure, struct
elv_change_ctx, which holds all per-queue elevator change context. A
single xarray, named elv_tbl, now maps each queue (q->id) in a tagset
to its corresponding elv_change_ctx entry, encapsulating the elevator
tags, type and name references.

This unification simplifies the code, improves maintainability, and
clarifies ownership of per-queue elevator state.

Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Yu Kuai <yukuai@fnnas.com>
Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-13 09:27:49 -07:00
Damien Le Moal
881880b6f3 block: fix NULL pointer dereference in disk_report_zones()
Commit 2284eec5053d ("block: introduce blkdev_get_zone_info()")
introduced the report_active field in struct blk_report_zones_args so
that open and closed zones can be reported with the condition
BLK_ZONE_COND_ACTIVE in the case of a cached report zone.
However, the args pointer to a struct blk_report_zones_args that is
passed to disk_report_zones() can be NULL, e.g. in the case of internal
report zones operations for device mapper zoned targets.

Fix disk_report_zones() to make sure to check that the args is not null
before updating a zone condition for cached zone reports.

Fixes: 2284eec5053d ("block: introduce blkdev_get_zone_info()")
Reported-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-13 09:10:04 -07:00
Damien Le Moal
c2b8d20628 block: fix NULL pointer dereference in blk_zone_reset_all_bio_endio()
For zoned block devices that do not need zone write plugs (e.g. most
device mapper devices that support zones), the disk hash table of zone
write plugs is NULL. For such devices, blk_zone_reset_all_bio_endio()
should not attempt to scan this hash table as that causes a NULL pointer
dereference.

Fix this by checking that the disk does have zone write plugs using the
atomic counter. This is equivalent to checking for a non-NULL hash table
but has the advantage of also speeding up the execution of
blk_zone_reset_all_bio_endio() for devices that do use zone write plugs
but do not have any plug in the hash table (e.g. a disk with only full
zones).
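
A minimal userspace sketch of this check is shown below. The names
struct disk, struct wplug and reset_all_wp_offsets() are hypothetical,
and the flat pointer array is a simplification of the real hash table.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct wplug {
	unsigned int wp_offset;
};

/* Hypothetical model of a disk with an optional plug table. */
struct disk {
	atomic_uint nr_wplugs;	/* number of plugs currently in the table */
	struct wplug **hash;	/* NULL when the disk does not use plugs */
	size_t hash_size;
};

/* Reset every plug's write pointer offset; return the number of plugs reset. */
static size_t reset_all_wp_offsets(struct disk *d)
{
	size_t i, n = 0;

	/* Covers both a NULL table and an allocated but empty table. */
	if (atomic_load(&d->nr_wplugs) == 0)
		return 0;

	assert(d->hash != NULL);
	for (i = 0; i < d->hash_size; i++) {
		if (d->hash[i]) {
			d->hash[i]->wp_offset = 0;
			n++;
		}
	}
	return n;
}
```

Checking the counter first means a disk without a table is never
dereferenced, and a disk with an empty table skips the scan entirely.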

Fixes: efae226c2e ("block: handle zone management operations completions")
Reported-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-13 09:10:04 -07:00
Bart Van Assche
f233339188 blk-zoned: Move code from disk_zone_wplug_add_bio() into its caller
Move the following code into the only caller of disk_zone_wplug_add_bio():
 - The code for clearing the REQ_NOWAIT flag.
 - The code that sets the BLK_ZONE_WPLUG_PLUGGED flag.
 - The disk_zone_wplug_schedule_bio_work() call.

This patch moves all code that is related to REQ_NOWAIT or to bio
scheduling into a single function. Additionally, the 'schedule_bio_work'
variable is removed. No functionality has been changed.

Cc: Damien Le Moal <dlemoal@kernel.org>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-12 14:05:23 -07:00
Bart Van Assche
faa3be1a61 blk-zoned: Document disk_zone_wplug_schedule_bio_work() locking
Document that all callers hold this lock because the code in
disk_zone_wplug_schedule_bio_work() depends on this.

Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-12 14:05:23 -07:00
Bart Van Assche
fd0ae4754c blk-zoned: Fix a typo in a source code comment
Remove a superfluous parenthesis that was introduced by commit fa8555630b
("blk-zoned: Improve the queue reference count strategy documentation").

Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-12 14:05:23 -07:00
Baokun Li
50b2a4f19b bdev: add hint prints in sb_set_blocksize() for LBS dependency on THP
Support for block sizes greater than the page size depends on large
folios, which in turn require CONFIG_TRANSPARENT_HUGEPAGE to be enabled.

Because the code is wrapped in multiple layers of abstraction, this
dependency is rather obscure, so users may not realize it and may be
unsure how to enable LBS.

As suggested by Theodore, I have added hint messages in sb_set_blocksize
so that users can distinguish whether a mount failure with block size
larger than page size is due to lack of filesystem support or the absence
of CONFIG_TRANSPARENT_HUGEPAGE.

Suggested-by: Theodore Ts'o <tytso@mit.edu>
Link: https://patch.msgid.link/20251110043226.GD2988753@mit.edu
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Link: https://patch.msgid.link/20251110124714.1329978-1-libaokun@huaweicloud.com
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-12 12:19:09 +01:00
Kriish Sharma
6d7e3870af blk-mq-dma: fix kernel-doc function name for integrity DMA iterator
Documentation build reported:

  Warning: block/blk-mq-dma.c:373 expecting prototype for blk_rq_integrity_dma_map_iter_start(). Prototype was for blk_rq_integrity_dma_map_iter_next() instead

The kernel-doc comment above `blk_rq_integrity_dma_map_iter_next()` used
the wrong function name (`blk_rq_integrity_dma_map_iter_start`) in its
header. This patch corrects the function name in the kernel-doc block to
match the actual implementation, ensuring clean documentation builds.

Fixes: fec9b16dc5 ("blk-mq-dma: add scatter-less integrity data DMA mapping")
Signed-off-by: Kriish Sharma <kriish.sharma2006@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-11 08:37:33 -07:00
Keith Busch
fd9ecd0052 block: fix merging data-less bios
The data segment gaps the block layer tracks don't apply to bios that
don't have data. Skip calculating this to fix a NULL pointer access.

Fixes: 2f6b2565d4 ("block: accumulate memory segment gaps per bio")
Reported-by: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Yu Kuai <yukuai@fnnas.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-11 08:35:59 -07:00
Chaitanya Kulkarni
86afb1cdc2 block: add lockdep to queue_limits_commit_update()
queue_limits_commit_update() expects q->limits_lock to be held by
the caller (via queue_limits_start_update()).

The API pattern is:

  lim = queue_limits_start_update(q);  /* acquires lock */
  /* modify lim */
  queue_limits_commit_update(q, &lim); /* releases lock */

  OR

  lim = queue_limits_start_update(q);         /* acquires lock */
  /* modify lim */
  queue_limits_commit_update_frozen(q, &lim); /* freezes queue, releases lock */

Add lockdep_assert_held() to report incorrect API usage.
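
The effect of such an assertion can be modelled in userspace as below.
This is a sketch only: the boolean flag stands in for q->limits_lock,
assert() stands in for lockdep_assert_held(), and all names are
hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

struct limits {
	unsigned int max_sectors;
};

/* Hypothetical model: a flag stands in for q->limits_lock. */
struct queue {
	bool limits_lock_held;
	struct limits limits;
};

/* Take the "lock" and hand back a copy of the limits to modify. */
static struct limits limits_start_update(struct queue *q)
{
	assert(!q->limits_lock_held);
	q->limits_lock_held = true;
	return q->limits;
}

/* Apply the modified limits and drop the "lock". */
static int limits_commit_update(struct queue *q, const struct limits *lim)
{
	/* Stand-in for lockdep_assert_held(): catch a missing start_update. */
	assert(q->limits_lock_held);

	q->limits = *lim;
	q->limits_lock_held = false;
	return 0;
}
```

A caller that commits without first calling limits_start_update() trips
the assertion, which is the kind of misuse the patch aims to report.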

Signed-off-by: Chaitanya Kulkarni <ckulkarnilinux@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-11 07:51:08 -07:00
Caleb Sander Mateos
4cda40dce9 block: clean up indentation in blk_rq_map_iter_init()
blk_rq_map_iter_init() has one line with 7 spaces of indentation and
another that mixes 1 tab and 8 spaces. Convert both to tabs.

Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-08 06:38:18 -07:00
Damien Le Moal
25976c314f block: introduce bdev_zone_start()
Introduce the function bdev_zone_start() as a more explicit (and clear)
replacement for ALIGN_DOWN() to get the start sector of a zone
containing a particular sector of a zoned block device.

Use this new helper in blkdev_get_zone_info() and
blkdev_report_zones_cached().
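
For illustration, aligning a sector down to the start of its zone can be
sketched as follows. This assumes a power-of-two zone size; zone_start()
is a hypothetical userspace helper, not the kernel function.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Return the first sector of the zone containing 'sector'.
 * Assumes 'zone_sectors' is a power of two, so masking is equivalent
 * to aligning down.
 */
static uint64_t zone_start(uint64_t sector, uint64_t zone_sectors)
{
	assert(zone_sectors && (zone_sectors & (zone_sectors - 1)) == 0);

	return sector & ~(zone_sectors - 1);
}
```

With 524288-sector zones, sector 1000000 lies in the zone that starts at
sector 524288.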

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-07 09:28:08 -07:00
Damien Le Moal
e2b0ec7761 block: refactor disk_zone_wplug_sync_wp_offset()
The helper function blk_zone_wp_offset() is called from
disk_zone_wplug_sync_wp_offset(), and again called from
blk_revalidate_seq_zone() right after the call to
disk_zone_wplug_sync_wp_offset().

Change disk_zone_wplug_sync_wp_offset() to return the value obtained
with blk_zone_wp_offset() to avoid this double call, which slightly
simplifies blk_revalidate_seq_zone().

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-07 09:28:08 -07:00
Damien Le Moal
bbac6e0fa5 block: improve blk_zone_wp_offset()
blk_zone_wp_offset() is always called with a struct blk_zone obtained
from the device, that is, it will never see the BLK_ZONE_COND_ACTIVE
condition. However, handling this condition makes this function more
solid and will also avoid issues when propagating cached report requests
to underlying stacked devices is implemented. Add BLK_ZONE_COND_ACTIVE
as a new case in blk_zone_wp_offset() switch.

Also while at it, change the handling of the full condition to return
UINT_MAX for the zone write pointer to reflect the fact that the write
pointer of a full zone is invalid.
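
The resulting mapping from zone condition to write pointer offset can be
sketched like this. The enum and struct are hypothetical simplifications,
and the handling of conditions not mentioned in this message (empty,
read-only, offline, conventional) is an assumption.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

/* Hypothetical, simplified zone conditions. */
enum zone_cond {
	ZC_NOT_WP, ZC_EMPTY, ZC_IMP_OPEN, ZC_EXP_OPEN, ZC_CLOSED,
	ZC_ACTIVE, ZC_FULL, ZC_READONLY, ZC_OFFLINE,
};

struct zone {
	uint64_t start;		/* first sector of the zone */
	uint64_t wp;		/* write pointer sector */
	enum zone_cond cond;
};

/* Return the write pointer offset within the zone, or UINT_MAX if invalid. */
static unsigned int zone_wp_offset(const struct zone *z)
{
	switch (z->cond) {
	case ZC_IMP_OPEN:
	case ZC_EXP_OPEN:
	case ZC_CLOSED:
	case ZC_ACTIVE:		/* the new case: treated like open/closed */
		assert(z->wp >= z->start);
		return (unsigned int)(z->wp - z->start);
	case ZC_EMPTY:
		return 0;
	case ZC_FULL:		/* write pointer of a full zone is invalid */
	default:		/* assumption: no valid write pointer */
		return UINT_MAX;
	}
}
```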

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-07 09:28:08 -07:00
Christoph Hellwig
86a9ce21f5 block: don't return 1 for the fallback case in blkdev_get_zone_info
blkdev_do_report_zones returns the number of reported zones, but
blkdev_get_zone_info returns 0 or an errno.  Translate to the expected
return value in blkdev_report_zone_fallback.
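
The translation amounts to something like the sketch below. The helper
name is hypothetical, and returning -EIO when no zone was reported is an
assumption, not taken from this message.

```c
#include <assert.h>
#include <errno.h>

/*
 * Convert a "number of zones reported" result into a 0-or-errno status.
 * The fallback asks for a single zone, so at most one is expected.
 */
static int report_count_to_status(int nr_zones)
{
	assert(nr_zones <= 1);

	if (nr_zones < 0)
		return nr_zones;	/* already an errno */
	if (nr_zones == 0)
		return -EIO;		/* assumption: nothing reported is an error */
	return 0;			/* one zone reported: success */
}
```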

Fixes: b037d41762fd ("block: introduce blkdev_get_zone_info()")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-07 04:38:28 -07:00
Keith Busch
2f6b2565d4 block: accumulate memory segment gaps per bio
The blk-mq dma iterator has an optimization for requests that align to
the device's iommu merge boundary. This boundary may be larger than the
device's virtual boundary, but the code had been depending on that queue
limit to know ahead of time if the request is guaranteed to align to
that optimization.

Rather than rely on that queue limit, which many devices may not report,
save the lowest set bit of any boundary gap between each segment in the
bio while checking the segments. The request stores the value for
merging and quickly checking per io if the request can use iova
optimizations.
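
The idea can be sketched in userspace as follows. The names are
hypothetical, segments are modelled as plain address/length pairs, and
the lowest set bit is kept as a mask rather than a bit index for
simplicity.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct seg {
	uint64_t addr;
	uint32_t len;
};

/*
 * OR together the end of each segment and the start of the next one,
 * then keep only the lowest set bit. Returns 0 when there are no gaps.
 */
static uint64_t lowest_gap_mask(const struct seg *segs, size_t n)
{
	uint64_t gaps = 0;
	size_t i;

	for (i = 1; i < n; i++)
		gaps |= (segs[i - 1].addr + segs[i - 1].len) | segs[i].addr;

	return gaps & (~gaps + 1);
}

/* The request qualifies if every gap is aligned past the merge boundary. */
static bool can_use_iova(uint64_t gap_mask, uint64_t merge_boundary)
{
	/* gap_mask must be zero or a single bit. */
	assert((gap_mask & (gap_mask - 1)) == 0);

	return (gap_mask & merge_boundary) == 0;
}
```

With a merge boundary mask of 0xfff, segments whose interior boundaries
are all 4 KiB aligned qualify, while a single boundary at a 512-byte
offset disqualifies the request.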

Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-06 18:11:58 -07:00
Christoph Hellwig
15638d52cb block: fix cached zone reporting after zone append was used
No zone plugs are allocated when a zone is opened by calling Zone Append
on it.  This makes the cached zone reporting report incorrectly empty
zones if the file system is unmounted and report zones is called after
that, e.g. by xfstests test cases using the scratch device.

Fix this by recording if zone append was used on a device, and disable
cached reporting for the device until a ZONE_RESET_ALL happens that
guarantees all zones are empty.
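
In outline, the per-device state behaves like the sketch below. The
struct and function names are hypothetical; only the flag semantics come
from the description above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-device state: one flag for the whole disk. */
struct zoned_disk {
	bool zone_append_used;
};

/* A zone append was issued: cached zone state can no longer be trusted. */
static void note_zone_append(struct zoned_disk *d)
{
	assert(d != NULL);
	d->zone_append_used = true;
}

/* A reset of all zones guarantees every zone is empty again. */
static void note_zone_reset_all(struct zoned_disk *d)
{
	assert(d != NULL);
	d->zone_append_used = false;
}

/* Cached reporting is only allowed while no zone append has been seen. */
static bool cached_report_allowed(const struct zoned_disk *d)
{
	assert(d != NULL);
	return !d->zone_append_used;
}
```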

We could probably do even better using a per-zone flag, but the practical
use cases for zone reporting after the initial mount are rather limited,
so let's keep things simple for now.

Fixes: 31f0656a4a ("block: introduce blkdev_report_zones_cached()")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-06 16:15:27 -07:00
Christoph Hellwig
c6886cf610 block: don't leak disk->zones_cond for !disk_need_zone_resources
disk->zones_cond is allocated for all zoned devices, but
disk_free_zone_resources skips it when the zone write plug hash is not
allocated, leaking the allocation for non-mq devices that don't emulate
zone append.  This is reported by kmemleak-enabled xfstests for various
tests that use simple device mapper targets.

Fix this by moving all code that requires write plugs from
disk_free_zone_resources into disk_destroy_zone_wplugs_hash_table
and executing the rest of the code, including the disk->zones_cond
freeing, unconditionally.

Fixes: 6e945ffb65 ("block: use zone condition to determine conventional zones")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-06 16:15:27 -07:00
Damien Le Moal
1efbbc641e block: add zone write plug condition to debugfs zone_wplugs
Modify queue_zone_wplug_show() to include the condition of a zone write
plug in the zone_wplugs debugfs attribute of a zoned block device.
To improve readability and ease of use, rather than the zone condition
raw value, the zone condition name is given using blk_zone_cond_str().

Suggested-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-05 08:07:21 -07:00
Damien Le Moal
2b39d4a6c6 block: improve zone_wplugs debugfs attribute output
Make the output of the zone_wplugs debugfs attribute file more easily
readable by adding the name of the zone write plugs fields in the
output.

No functional changes.

Suggested-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-05 08:07:21 -07:00
Damien Le Moal
b30ffcdc0c block: introduce BLKREPORTZONESV2 ioctl
Introduce the new BLKREPORTZONESV2 ioctl command to allow user
applications access to the fast zone report implemented by
blkdev_report_zones_cached(). This new ioctl is defined as number 142
and is documented in include/uapi/linux/fs.h.

Unlike the existing BLKREPORTZONES ioctl, this new ioctl uses the flags
field of struct blk_zone_report also as an input. If the user sets the
BLK_ZONE_REP_CACHED flag as an input, then blkdev_report_zones_cached()
is used to generate the zone report using cached zone information. If
this flag is not set, then BLKREPORTZONESV2 behaves in the same manner
as BLKREPORTZONES and the zone report is generated by accessing the
zoned device.

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-05 08:07:21 -07:00
Damien Le Moal
31f0656a4a block: introduce blkdev_report_zones_cached()
Introduce the function blkdev_report_zones_cached() to provide a fast
report zone built using the blkdev_get_zone_info() function, which gets
zone information from a disk zones_cond array or zone write plugs.
For a large capacity SMR drive, such a fast zone report can complete
in a few milliseconds, compared to several seconds when the zone report
is obtained from the device.

The zone report is built in the same manner as with the regular
blkdev_report_zones() function, that is, the first zone reported is the
one containing the specified start sector and the report is limited to
the specified number of zones (nr_zones argument). The information for
each zone in the report is obtained using blkdev_get_zone_info().

For zoned devices that do not use zone write plug resources,
using blkdev_get_zone_info() is inefficient as the zone report would
be very slow, generated one zone at a time. To avoid this,
blkdev_report_zones_cached() falls back to calling
blkdev_do_report_zones() to execute a regular zone report. In this case,
the .report_active field of struct blk_report_zones_args is set to true
to report zone conditions using the BLK_ZONE_COND_ACTIVE condition in
place of the implicit open, explicit open and closed conditions.

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-05 08:07:21 -07:00
Damien Le Moal
f2284eec50 block: introduce blkdev_get_zone_info()
Introduce the function blkdev_get_zone_info() to obtain the information
for a single zone from cached zone data, that is, either from the zone
write plug for the target zone if it exists, or from the disk zones_cond
array otherwise.

Since sequential zones that do not have a zone write plug are either
full, empty or in a bad state (read-only or offline), the zone write
pointer can be inferred from the zone condition cached in the disk
zones_cond array. For sequential zones that have a zone write plug, the
zone condition and zone write pointer are obtained from the condition
and write pointer offset managed with the zone write plug. This allows
obtaining the information for a zone much more quickly than having to
execute a report zones command on the device.
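
As an illustration of the inference described above, here is a minimal
user-space sketch. The enum, the function name and the MODEL_WP_INVALID
marker are hypothetical; the exact write pointer values used for full and
bad-state zones are assumptions of this sketch, not taken from the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical subset of conditions for a zone without a write plug. */
enum model_zone_cond {
	MODEL_COND_EMPTY,
	MODEL_COND_FULL,
	MODEL_COND_READONLY,
	MODEL_COND_OFFLINE,
};

/* Assumed marker for "no meaningful write pointer". */
#define MODEL_WP_INVALID UINT64_MAX

/*
 * Infer the write pointer of a sequential zone that has no zone write
 * plug, using only its cached condition and its geometry.
 */
uint64_t model_infer_wp(enum model_zone_cond cond, uint64_t start,
			uint64_t len)
{
	assert(len > 0);

	switch (cond) {
	case MODEL_COND_EMPTY:
		return start;		/* nothing written yet */
	case MODEL_COND_FULL:
		return start + len;	/* assumed: end of the zone */
	default:
		return MODEL_WP_INVALID;	/* read-only or offline */
	}
}
```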

blkdev_get_zone_info() falls back to using a regular zone report if the
target zone is flagged as needing an update with the
BLK_ZONE_WPLUG_NEED_WP_UPDATE flag, or if the target device does not
use zone write plugs (i.e. a device mapper device). In this case, the
new function blkdev_report_zone_fallback() is used and the zone
condition is reported consistently with the cached report, that is, the
BLK_ZONE_COND_ACTIVE condition is used in place of the implicit open,
explicit open and closed conditions. This is achieved by adding the
.report_active field to struct blk_report_zones_args and by having
disk_report_zone() set the correct zone condition if .report_active is
true.
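
The effect of .report_active can be sketched as a simple mapping. The enum
below is a hypothetical model with arbitrary numeric values, not the
kernel's enum blk_zone_cond, and model_report_cond() is an illustrative
name.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of zone conditions (values are arbitrary). */
enum model_zone_cond {
	MODEL_COND_NOT_WP,
	MODEL_COND_EMPTY,
	MODEL_COND_IMP_OPEN,
	MODEL_COND_EXP_OPEN,
	MODEL_COND_CLOSED,
	MODEL_COND_FULL,
	MODEL_COND_ACTIVE,
};

/*
 * Return the condition to report for a zone. When @report_active is set,
 * the implicit open, explicit open and closed conditions are all reported
 * as "active"; every other condition is passed through unchanged.
 */
enum model_zone_cond model_report_cond(enum model_zone_cond cond,
				       bool report_active)
{
	assert(cond <= MODEL_COND_ACTIVE);

	if (!report_active)
		return cond;

	switch (cond) {
	case MODEL_COND_IMP_OPEN:
	case MODEL_COND_EXP_OPEN:
	case MODEL_COND_CLOSED:
		return MODEL_COND_ACTIVE;
	default:
		return cond;
	}
}
```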

In preparation for using blkdev_get_zone_info() in upcoming file systems
changes, also export this function as a GPL symbol.

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-05 08:07:21 -07:00
Damien Le Moal
1af3f4e0c4 block: refactor blkdev_report_zones() code
In preparation for implementing cached report zone, split the main part
of the code of blkdev_report_zones() into the helper function
blkdev_do_report_zones(), with this new helper taking as argument a
struct blk_report_zones_args pointer instead of a report callback
function and its private argument.

No functional changes.

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-05 08:07:21 -07:00
Damien Le Moal
0bf0e2e466 block: track zone conditions
The function blk_revalidate_zone_cond() already caches the condition of
all zones of a zoned block device in the zones_cond array of a gendisk.
However, the zone conditions are updated only when the device is scanned
or revalidated.

Implement tracking of the runtime changes to zone conditions using
the new cond field in struct blk_zone_wplug. The size of this structure
remains 112 Bytes as the new field replaces the 4 Bytes padding at the
end of the structure.

Because zones that do not have a zone write plug can be in the empty,
implicit open, explicit open or full condition, the zones_cond array of
a disk is used to track the conditions of zones that do not have a zone
write plug. The condition of such zone is updated in the disk zones_cond
array when a zone reset, reset all or finish operation is executed, and
also when a zone write plug is removed from the disk hash table when the
zone becomes full.

Since a device may automatically close an implicitly open zone when
writing to an empty or closed zone, if the total number of open zones
has reached the device limit, the BLK_ZONE_COND_IMP_OPEN and
BLK_ZONE_COND_CLOSED zone conditions cannot be precisely tracked. To
overcome this, the zone condition BLK_ZONE_COND_ACTIVE is introduced to
represent a zone that has the condition BLK_ZONE_COND_IMP_OPEN,
BLK_ZONE_COND_EXP_OPEN or BLK_ZONE_COND_CLOSED.  This follows the
definition of an active zone as defined in the NVMe Zoned Namespace
specifications. As such, for a zoned device that has a limit on the
maximum number of open zones, we will never have more zones in the
BLK_ZONE_COND_ACTIVE condition than the device limit. This is compatible
with the SCSI ZBC and ATA ZAC specifications for SMR HDDs as these
devices do not have a limit on the number of active zones.

The function disk_zone_wplug_set_wp_offset() is modified to use the new
helper disk_zone_wplug_update_cond() to update a zone write plug
condition whenever a zone write plug write offset is updated on
submission or merging of write BIOs to a zone.
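
One plausible way to derive a condition from a write pointer offset is
sketched below. The thresholds (offset zero means empty, offset at or
beyond the zone capacity means full, anything else is active) and all
model_* names are assumptions made for illustration; the actual helper may
differ.

```c
#include <assert.h>

/* Hypothetical reduced set of tracked conditions. */
enum model_zone_cond {
	MODEL_COND_EMPTY,
	MODEL_COND_ACTIVE,
	MODEL_COND_FULL,
};

/*
 * Derive the condition of a sequential zone from the write pointer
 * offset (in sectors from the zone start) tracked by its write plug.
 */
enum model_zone_cond model_cond_from_wp_offset(unsigned int wp_offset,
					       unsigned int zone_capacity)
{
	assert(zone_capacity > 0);

	if (wp_offset == 0)
		return MODEL_COND_EMPTY;
	if (wp_offset >= zone_capacity)
		return MODEL_COND_FULL;
	return MODEL_COND_ACTIVE;	/* partially written */
}
```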

The functions blk_zone_reset_bio_endio(), blk_zone_reset_all_bio_endio()
and blk_zone_finish_bio_endio() are modified to update the condition of
the zones targeted by reset, reset_all and finish operations, either
using disk_zone_wplug_set_wp_offset() for zones that have a
zone write plug, or using the disk_zone_set_cond() helper to update the
zones_cond array of the disk for zones that do not have a zone write
plug.

When a zone write plug is removed from the disk hash table (when the
zone becomes empty or full), the condition of struct blk_zone_wplug is
used to update the disk zones_cond array. Conversely, when a zone write
plug is added to the disk hash table, the zones_cond array is used to
initialize the zone write plug condition.

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-05 08:07:21 -07:00
Damien Le Moal
6e945ffb65 block: use zone condition to determine conventional zones
The conv_zones_bitmap field of struct gendisk is used to define a bitmap
to identify the conventional zones of a zoned block device. The bit for
a zone is set in this bitmap if the zone is a conventional one, that is,
if the zone type is BLK_ZONE_TYPE_CONVENTIONAL. For such zone, this
always corresponds to the zone condition BLK_ZONE_COND_NOT_WP.
In other words, conv_zones_bitmap tracks a single condition of the
zones of a zoned block device.

In preparation for tracking more zone conditions, change
conv_zones_bitmap into an array of zone conditions, using 1 byte per
zone. This increases the memory usage from 1 bit per zone to 1 byte per
zone, that is, from 16 KiB to about 100 KiB for a 30 TB SMR HDD with 256
MiB zones. This is a trade-off to allow fast cached report zones later
on top of this change.
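
The memory trade-off can be checked with a short calculation. The sketch
below assumes 30 TB means 30 * 10^12 bytes and ignores any allocator
rounding; the function names are illustrative only. Under those
assumptions the drive has 111,759 zones, so a bitmap needs 13,970 bytes
and a one-byte-per-zone array needs 111,759 bytes, the same order of
magnitude as the rounded figures quoted above.

```c
#include <assert.h>
#include <stdint.h>

/* Number of zones, counting a possibly smaller last zone. */
uint64_t model_nr_zones(uint64_t capacity_bytes, uint64_t zone_bytes)
{
	assert(zone_bytes > 0);
	return (capacity_bytes + zone_bytes - 1) / zone_bytes;
}

/* Memory needed for a bitmap with one bit per zone, in bytes. */
uint64_t model_bitmap_bytes(uint64_t nr_zones)
{
	return (nr_zones + 7) / 8;
}

/* Memory needed for a condition array with one byte per zone. */
uint64_t model_cond_array_bytes(uint64_t nr_zones)
{
	return nr_zones;
}
```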

Rename the conv_zones_bitmap field of struct gendisk to zones_cond. Add
a blk_revalidate_zone_cond() function to initialize the zones_cond array
of a disk during device scan and to update it on device revalidation.
Move the allocation of the zones_cond array to
disk_revalidate_zone_resources(), making sure that this array is always
allocated, even for devices that do not need zone write plugs (zone
resources), to ensure that bdev_zone_is_seq() can be re-implemented to
use the zone condition array in place of the conv zones bitmap.

Finally, the function bdev_zone_is_seq() is rewritten to use a test on
the condition of the target zone.

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-05 08:07:21 -07:00
Damien Le Moal
ca1a897fb2 block: reorganize struct blk_zone_wplug
Reorganize the fields of struct blk_zone_wplug to remove a hole after
the wp_offset field and avoid having the bio_work structure split
between 2 cache lines.

No functional changes.

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-05 08:07:21 -07:00
Damien Le Moal
fdb9aed869 block: introduce disk_report_zone()
Commit b76b840fd9 ("dm: Fix dm-zoned-reclaim zone write pointer
alignment") introduced an indirect call for the callback function of a
report zones executed with blkdev_report_zones(). This is necessary so
that the function disk_zone_wplug_sync_wp_offset() can be called to
refresh a zone write plug zone write pointer offset after a write error.
However, this solution makes the path taken by zone information
harder to follow.

Clean this up by introducing the new blk_report_zones_args structure to
define a zone report callback and its private data and introduce the
helper function disk_report_zone() which calls both
disk_zone_wplug_sync_wp_offset() and the zone report user callback
function for all zones of a zone report. This helper function must be
called by all block device drivers that implement the report zones
block operation in order to correctly report a zone information.
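
The shape of this scheme can be sketched in user space as follows. All
types and names below are hypothetical models: the write pointer
synchronization is reduced to a counter, and the real structures carry
more state than shown here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct model_zone {
	uint64_t start;
	uint64_t wp;
};

typedef int (*model_report_cb)(struct model_zone *zone, unsigned int idx,
			       void *data);

/* Bundles the user callback with its private data. */
struct model_report_args {
	model_report_cb cb;
	void *data;
};

struct model_disk {
	unsigned int nr_synced;	/* how many zones were synchronized */
};

/* Stand-in for the block layer's own per-zone bookkeeping. */
static void model_sync_wp_offset(struct model_disk *disk,
				 const struct model_zone *zone)
{
	(void)zone;
	disk->nr_synced++;
}

/*
 * Single entry point a driver calls for every zone it reports: first the
 * block layer bookkeeping, then the user callback if one was provided.
 */
int model_disk_report_zone(struct model_disk *disk, struct model_zone *zone,
			   unsigned int idx, struct model_report_args *args)
{
	assert(disk && zone);

	model_sync_wp_offset(disk, zone);
	if (args && args->cb)
		return args->cb(zone, idx, args->data);
	return 0;
}

/* Example user callback: count the zones it is shown. */
int model_count_zones(struct model_zone *zone, unsigned int idx, void *data)
{
	(void)zone;
	(void)idx;
	(*(unsigned int *)data)++;
	return 0;
}
```

The point of the single helper is that the bookkeeping step cannot be
skipped by a driver, whether or not the caller supplied a callback.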

All block device drivers supporting the report_zones block operation are
updated to use this new scheme.

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-05 08:07:21 -07:00
Damien Le Moal
e8ecb21f08 block: cleanup blkdev_report_zones()
The variable capacity is used only in one place and so can be removed
and get_capacity(disk) used directly instead.

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-05 08:07:21 -07:00
Damien Le Moal
bba4322e3f block: freeze queue when updating zone resources
Modify disk_update_zone_resources() to freeze the device queue before
updating the number of zones, zone capacity and other zone related
resources. The locking order resulting from the call to
queue_limits_commit_update_frozen() is preserved, that is, the queue
limits lock is first taken by calling queue_limits_start_update() before
freezing the queue, and the queue is unfrozen after executing
queue_limits_commit_update(), which replaces the call to
queue_limits_commit_update_frozen().

This change ensures that there are no in-flight I/Os when the zone
resources are updated due to a zone revalidation. In case of error when
the limits are applied, directly call disk_free_zone_resources() from
disk_update_zone_resources() while the disk queue is still frozen to
avoid needing to freeze & unfreeze the queue again in
blk_revalidate_disk_zones(), thus simplifying that function code a
little.

Fixes: 0b83c86b44 ("block: Prevent potential deadlock in blk_revalidate_disk_zones()")
Cc: stable@vger.kernel.org
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-05 08:07:21 -07:00
Damien Le Moal
efae226c2e block: handle zone management operations completions
The functions blk_zone_wplug_handle_reset_or_finish() and
blk_zone_wplug_handle_reset_all() both modify the zone write pointer
offset of zone write plugs that are the target of a reset, reset all or
finish zone management operation. However, these functions do this
modification before the BIO is executed. So if the zone operation fails,
the modified zone write pointer offsets become invalid.

Avoid this by modifying the zone write pointer offset of a zone write
plug that is the target of a zone management operation when the
operation completes. To do so, modify blk_zone_bio_endio() to call the
new function blk_zone_mgmt_bio_endio() which in turn calls the functions
blk_zone_reset_all_bio_endio(), blk_zone_reset_bio_endio() or
blk_zone_finish_bio_endio() depending on the operation of the completed
BIO, to modify a zone write plug write pointer offset accordingly.
These functions are called only if the BIO execution was successful.
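
The completion-time update can be modeled as below. This is a hedged
sketch: the names are hypothetical, only single-zone reset and finish are
modeled, and the assumption that a finish moves the offset to the zone
size is made for illustration.

```c
#include <assert.h>

enum model_zone_op {
	MODEL_OP_RESET,
	MODEL_OP_FINISH,
};

struct model_wplug {
	unsigned int wp_offset;	/* sectors written since the zone start */
};

/*
 * Completion handler for a zone management operation. The cached write
 * pointer offset changes only if the operation succeeded (status == 0),
 * so a failed operation leaves the cached state untouched.
 */
void model_zone_mgmt_endio(struct model_wplug *wplug, enum model_zone_op op,
			   int status, unsigned int zone_sectors)
{
	assert(wplug);

	if (status != 0)
		return;

	switch (op) {
	case MODEL_OP_RESET:
		wplug->wp_offset = 0;
		break;
	case MODEL_OP_FINISH:
		wplug->wp_offset = zone_sectors;
		break;
	}
}
```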

Fixes: dd291d77cc ("block: Introduce zone write plugging")
Cc: stable@vger.kernel.org
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-05 08:07:21 -07:00
Yongpeng Yang
8637fa89e6
block: add __must_check attribute to sb_min_blocksize()
When sb_min_blocksize() returns 0 and the return value is not checked,
it may lead to a situation where sb->s_blocksize is 0 when
accessing the filesystem super block. After commit a64e5a5960
("bdev: add back PAGE_SIZE block size validation for
sb_set_blocksize()"), this becomes more likely to happen when the
block device’s logical_block_size is larger than PAGE_SIZE and the
filesystem is unformatted. Add the __must_check attribute to ensure
callers always check the return value.
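
The effect of the attribute can be reproduced in user space with the
compiler attribute warn_unused_result, available in GCC and Clang. The
sketch below is a simplified model: model_min_blocksize() and
model_fill_super() are hypothetical names, and the rule that a size
larger than the page size yields 0 is an assumption made for
illustration.

```c
#include <assert.h>

#if defined(__GNUC__) || defined(__clang__)
#define MODEL_MUST_CHECK __attribute__((warn_unused_result))
#else
#define MODEL_MUST_CHECK
#endif

/*
 * Pick the larger of the requested and the logical block size. Returns 0
 * when the result cannot be used; ignoring the return value triggers a
 * compiler warning thanks to MODEL_MUST_CHECK.
 */
MODEL_MUST_CHECK unsigned int
model_min_blocksize(unsigned int requested, unsigned int logical_block_size,
		    unsigned int page_size)
{
	unsigned int size = requested > logical_block_size ?
			    requested : logical_block_size;

	assert(page_size > 0);
	if (size > page_size)
		return 0;
	return size;
}

/* A caller that checks the result instead of trusting it blindly. */
int model_fill_super(unsigned int *blocksize, unsigned int logical_block_size,
		     unsigned int page_size)
{
	unsigned int size = model_min_blocksize(1024, logical_block_size,
						page_size);

	if (!size)
		return -1;	/* leave *blocksize untouched on failure */
	*blocksize = size;
	return 0;
}
```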

Cc: stable@vger.kernel.org # v6.15
Suggested-by: Matthew Wilcox <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Yongpeng Yang <yangyongpeng@xiaomi.com>
Link: https://patch.msgid.link/20251104125009.2111925-6-yangyongpeng.storage@gmail.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-05 14:00:16 +01:00
Joanne Koong
b2f35ac414
iomap: add caller-provided callbacks for read and readahead
Add caller-provided callbacks for read and readahead so that iomap can be
used generically, especially by filesystems that are not block-based.

In particular, this:
* Modifies the read and readahead interface to take in a
  struct iomap_read_folio_ctx that is publicly defined as:

  struct iomap_read_folio_ctx {
	const struct iomap_read_ops *ops;
	struct folio *cur_folio;
	struct readahead_control *rac;
	void *read_ctx;
  };

  where struct iomap_read_ops is defined as:

  struct iomap_read_ops {
      int (*read_folio_range)(const struct iomap_iter *iter,
                             struct iomap_read_folio_ctx *ctx,
                             size_t len);
      void (*read_submit)(struct iomap_read_folio_ctx *ctx);
  };

  read_folio_range() reads in the folio range and must be provided by
  the caller. read_submit() is optional and is used for
  submitting any pending read requests.

* Modifies existing filesystems that use iomap for read and readahead to
  use the new API, through the new statically inlined helpers
  iomap_bio_read_folio() and iomap_bio_readahead(). There is no change
  in functionality for those filesystems.
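
How a required and an optional callback might be dispatched can be
sketched with reduced stand-in types. Everything named model_* below is
hypothetical, and calling read_submit() unconditionally after the range
read is an assumption of this sketch rather than a statement about the
iomap code.

```c
#include <assert.h>
#include <stddef.h>

struct model_iter {
	size_t pos;
};

struct model_read_ctx;

struct model_read_ops {
	/* Required: read @len bytes of the current range. */
	int (*read_folio_range)(const struct model_iter *iter,
				struct model_read_ctx *ctx, size_t len);
	/* Optional: submit any pending read requests. */
	void (*read_submit)(struct model_read_ctx *ctx);
};

struct model_read_ctx {
	const struct model_read_ops *ops;
	void *read_ctx;		/* private data owned by the caller */
};

/* Issue one range read, then flush pending requests if supported. */
int model_read(const struct model_iter *iter, struct model_read_ctx *ctx,
	       size_t len)
{
	int ret;

	assert(ctx && ctx->ops && ctx->ops->read_folio_range);

	ret = ctx->ops->read_folio_range(iter, ctx, len);
	if (ctx->ops->read_submit)
		ctx->ops->read_submit(ctx);
	return ret;
}

/* Example callback for a non-block-based user: just account the bytes. */
int model_count_bytes(const struct model_iter *iter,
		      struct model_read_ctx *ctx, size_t len)
{
	(void)iter;
	*(size_t *)ctx->read_ctx += len;
	return 0;
}
```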

Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-05 12:57:23 +01:00
Christoph Hellwig
ec7f31b2a2 block: make bio auto-integrity deadlock safe
The current block layer automatic integrity protection allocates the
actual integrity buffer, which has three problems:

 - because it happens at the bottom of the I/O stack and doesn't use a
   mempool it can deadlock under load
 - because the data size in a bio is almost unbounded when using large
   folios it can relatively easily exceed the maximum kmalloc size
 - even when it does not exceed the maximum kmalloc size, it could
   exceed the maximum segment size of the device

Fix this by limiting the I/O size so that we can allocate at least a
2MiB integrity buffer, i.e. 128MiB for 8 byte PI and 512 byte integrity
intervals, and create a mempool as a last resort for this maximum size,
mirroring the scheme used for bvecs.  As a nice upside none of this
can fail now, so we remove the error handling and open code the
trivial addition of the bip vec.
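
The relation between the integrity buffer size and the resulting I/O size
limit is plain arithmetic, sketched below with a hypothetical helper name:
a 2 MiB buffer holds 262,144 tuples of 8 bytes, and each tuple protects
one 512-byte interval, which gives 128 MiB of data.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Largest amount of data that an integrity buffer of the given size can
 * protect, with @pi_bytes of protection information per interval of
 * @interval_bytes of data.
 */
uint64_t model_max_data_bytes(uint64_t integrity_buf_bytes,
			      unsigned int pi_bytes,
			      unsigned int interval_bytes)
{
	assert(pi_bytes > 0 && interval_bytes > 0);
	return integrity_buf_bytes / pi_bytes * interval_bytes;
}
```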

The new allocation helpers sit outside of bio-integrity-auto.c because
I plan to reuse them for file system based PI in the near future.

Fixes: 7ba1ba12ee ("block: Block layer data integrity support")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anuj Gupta <anuj20.g@samsung.com>
Reviewed-by: Kanchan Joshi <joshi.k@samsung.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-04 12:41:50 -07:00
Christoph Hellwig
eef09f742b block: blocking mempool_alloc doesn't fail
So remove the error check for it in bio_integrity_prep.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anuj Gupta <anuj20.g@samsung.com>
Reviewed-by: Kanchan Joshi <joshi.k@samsung.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-04 12:40:46 -07:00
Caleb Sander Mateos
20fb3d05a3 io_uring/uring_cmd: avoid double indirect call in task work dispatch
io_uring task work dispatch makes an indirect call to struct io_kiocb's
io_task_work.func field to allow running arbitrary task work functions.
In the uring_cmd case, this calls io_uring_cmd_work(), which immediately
makes another indirect call to struct io_uring_cmd's task_work_cb field.
Change the uring_cmd task work callbacks to functions whose signatures
match io_req_tw_func_t. Add a function io_uring_cmd_from_tw() to convert
from the task work's struct io_tw_req argument to struct io_uring_cmd *.
Define a constant IO_URING_CMD_TASK_WORK_ISSUE_FLAGS to avoid
manufacturing issue_flags in the uring_cmd task work callbacks. Now
uring_cmd task work dispatch makes a single indirect call to the
uring_cmd implementation's callback. This also allows removing the
task_work_cb field from struct io_uring_cmd, freeing up 8 bytes for
future storage.
Since fuse_uring_send_in_task() now has access to the io_tw_token_t,
check its cancel field directly instead of relying on the
IO_URING_F_TASK_DEAD issue flag.

Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-03 08:31:26 -07:00
Linus Torvalds
a5beb58e53 block-6.18-20251031
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmkE0BoQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpvHID/0bh5wEXK/IFMIDdiyqdbr2GsoMfHxiM2k0
 OSeQdwMgEGPY2frB8SirTBZWPIskOFdgSbRQyuYiu5XpdbsRJY+JdtkaGYp1L+Fc
 4I4C1ZpK4Kdlw+nbBrcUxdedYKx4ZN/00otZWV2K2ZpJFn1ZCLhyInZZ8ZbosKEn
 HeAW54YLu+q3pO9BSbJBO97FP38AZAOqkT9suUDkQYUUnNivejFKV0qbKlRm5v4H
 fQLU2sfT1J78DHdhJ1Gdk+uNKzVuYxR7lJRC+1c0yi2fZN3VGNRYlTk1f4VX3mOn
 RcRaUr4r9LMZc9K2IYEpQgAyuznttokWI0SkklFVTFDZwa1KmsIZjEccNXvESDXN
 vSxUXuZtgePo2qijK0F8VoPqgQRLBoP5MeAfp+VlkUWAu49zljwrIZXuZl0xuHpT
 JIEzbzvk+KfPS/gKtQdWxuN3eqZvv596SxnWnzGMg17zmhsj2kEZ9BF4Q+9BNVMZ
 NdK0jmdsBA3iTI8xVy2ajEY6U2W3KDdkSKPWR2SDg+vBd/qu3VBmrC9ptr1AoYpO
 54UOyBtIAumMyaOAUDSGiKC4KSbgMWUhN2uBFC8uWvuh733Z333xb9BnY1T7D624
 cfacmSzkoXKUACmcLaod2+MDJlSXhxmOtVN65euxst8ZGQsSah1TpY4Tr/2UHKnO
 ru+vbsqJyQ==
 =rC6y
 -----END PGP SIGNATURE-----

Merge tag 'block-6.18-20251031' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux

Pull block fixes from Jens Axboe:

 - Fix blk-crypto reporting EIO when EINVAL is the correct error code

 - Two bug fixes for the block zone support

 - NVME pull request via Keith:
      - Target side authentication fixup
      - Peer-to-peer metadata fixup

 - null_blk DMA alignment fix

* tag 'block-6.18-20251031' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux:
  null_blk: set dma alignment to logical block size
  blk-crypto: use BLK_STS_INVAL for alignment errors
  block: make REQ_OP_ZONE_OPEN a write operation
  block: fix op_is_zone_mgmt() to handle REQ_OP_ZONE_RESET_ALL
  nvme-pci: use blk_map_iter for p2p metadata
  nvmet-auth: update sc_c in host response
2025-10-31 12:57:19 -07:00
Carlos Llamas
0b39ca4572 blk-crypto: use BLK_STS_INVAL for alignment errors
Make __blk_crypto_bio_prep() propagate BLK_STS_INVAL when IO segments
fail the data unit alignment check.
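
A user-space sketch of such an alignment check is shown below. The segment
structure and function name are hypothetical, and checking both the offset
and the length of each segment against the data unit size is an assumption
about what the check covers.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct model_segment {
	unsigned int offset;
	unsigned int len;
};

/*
 * Return 0 if every segment starts and ends on a data unit boundary,
 * -EINVAL otherwise (the misalignment is a caller error, not an I/O
 * error). @data_unit_size must be a power of two.
 */
int model_check_alignment(const struct model_segment *segs, size_t nr_segs,
			  unsigned int data_unit_size)
{
	size_t i;

	assert(data_unit_size && !(data_unit_size & (data_unit_size - 1)));

	for (i = 0; i < nr_segs; i++) {
		if ((segs[i].offset | segs[i].len) & (data_unit_size - 1))
			return -EINVAL;
	}
	return 0;
}
```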

This was flagged by an LTP test that expects EINVAL when performing an
O_DIRECT read with a misaligned buffer [1].

Cc: Eric Biggers <ebiggers@kernel.org>
Cc: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/all/aP-c5gPjrpsn0vJA@google.com/ [1]
Signed-off-by: Carlos Llamas <cmllamas@google.com>
Reviewed-by: Eric Biggers <ebiggers@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-10-30 08:52:57 -06:00
Linus Torvalds
d2818517e3 block-6.18-20251023
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmj62psQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpjMiD/0cTxemB2tYk5Nd1QIAda8cwO0fn1jLgamH
 tjQfy0uq4kxzSY4QWWx8HkA8sEybAOpAwP2u+F3RN/CsW3//TMA+H8JGW1h0k5OG
 dh+0asF0iru9euyAePTLUExOw2V3VgEajjvt/2ezkjussNki6vcXBoIzGfeZKQ5E
 MSx6LTbnpzAy+SUydYFpLFtFcokXzUyp/TKZY+QgsIzsqo/ReUm3Caa/KbxQBPQm
 7MhpUpnTdI1PjYZZE/Y/p4iWtesCSpiSOayYKhtBQX4FzMo12MZw5nRkJkliLUvm
 EtPuSYBSCQEKnYVlfCqLuVd8r7drgMgwZOmNhOsdtUHLigtkPolxQOqQKniX3u70
 ycMqn3b1BdEFSqVe/eXhIRZ3YCL3xEAJUYTBRvwbf7XVC804F8VV+CqAey835A4D
 IIcIh8vYrkw0HD5HP3aILKlWPHilArDqjcuU260Qd9i79EV7zVRUJrySc0mZ9zK9
 XVKX0csETx1SrdH9vRlwBaeJzQyF9J18fuYMD7JV1dK0FhkEX6+pF5dY5rE6S+0r
 /tjZgEwSS4siQYhsOM+q3J/ZoLMP2RmW10rYYcKiS9NrqYm1b5VNenBVm7bX6SQO
 P29JtDJG374ygPiFn7opMbY79LDJ8JNS5g+vq0en8HtEGwtWuFM3vVCYyerPkiCl
 IgVxWd/xYQ==
 =Zqjz
 -----END PGP SIGNATURE-----

Merge tag 'block-6.18-20251023' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux

Pull block fixes from Jens Axboe:

 - Fix dma alignment for PI

 - Fix selinux bogosity with nbd, where sendmsg would get rejected

* tag 'block-6.18-20251023' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux:
  block: require LBA dma_alignment when using PI
  nbd: override creds to kernel when calling sock_{send,recv}msg()
2025-10-24 12:48:19 -07:00
Johannes Thumshirn
4ae8efb4f9 blktrace: handle BLKTRACESETUP2 ioctl
Handle the BLKTRACESETUP2 ioctl, requesting an extended version of the
blktrace protocol from user-space.

Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-10-22 11:14:06 -06:00
Christoph Hellwig
4c8cf6bd28 block: require LBA dma_alignment when using PI
The block layer PI generation / verification code expects the bio_vecs
to have at least LBA size (or more correctly integrity interval)
granularity.  With the direct I/O alignment relaxation in 2022, user
space can now feed bios with less alignment than that, leading to
scribbling outside the PI buffers.  Apparently this wasn't noticed so far
because none of the tests generate such buffers, but since 851c4c96db
("xfs: implement XFS_IOC_DIOINFO in terms of vfs_getattr"), xfstests
generic/013 by default generates such I/O now that the relaxed alignment
is advertised by the XFS_IOC_DIOINFO ioctl.

Fix this by increasing the required alignment when using PI, although
handling arbitrary alignment in the long run would be even nicer.
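
The idea of raising the required alignment can be modeled as follows. This
is a sketch under assumptions: alignment is expressed as a mask of the
form 2^n - 1, the function names are hypothetical, and the buffer check is
a simplification of what a direct I/O path would test.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Return the alignment mask to enforce. With PI enabled the mask is
 * raised to at least the LBA size minus one; otherwise it is unchanged.
 */
unsigned int model_dma_alignment_mask(unsigned int cur_mask,
				      unsigned int lba_size, bool has_pi)
{
	assert(lba_size && !(lba_size & (lba_size - 1)));

	if (has_pi && cur_mask < lba_size - 1)
		return lba_size - 1;
	return cur_mask;
}

/* A buffer is acceptable if its address and length honor the mask. */
bool model_buffer_ok(uintptr_t addr, unsigned int len, unsigned int mask)
{
	return !((addr | len) & mask);
}
```

In this model a 512-byte buffer at address 0x1004 passes a 4-byte
alignment check but is rejected once PI raises the mask to 511.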

Fixes: bf8d08532b ("iomap: add support for dma aligned direct-io")
Fixes: b1a000d3b8 ("block: relax direct io memory alignment")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-10-22 10:02:54 -06:00
Keith Busch
5c5028ee59 block: rename min_segment_size
Despite its name, the block layer is fine with segments smaller than the
"min_segment_size" limit. The value is an optimization limit indicating
the largest segment that can be used without considering boundary
limits. Smaller segments can take a fast path, so give it a name that
reflects that: max_fast_segment_size.

Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-10-22 07:39:39 -06:00
Mateusz Guzik
b4dbfd8653
Coccinelle-based conversion to use ->i_state accessors
All places were patched by coccinelle with the default expecting that
->i_lock is held, afterwards entries got fixed up by hand to use
unlocked variants as needed.

The script:
@@
expression inode, flags;
@@

- inode->i_state & flags
+ inode_state_read(inode) & flags

@@
expression inode, flags;
@@

- inode->i_state &= ~flags
+ inode_state_clear(inode, flags)

@@
expression inode, flag1, flag2;
@@

- inode->i_state &= ~flag1 & ~flag2
+ inode_state_clear(inode, flag1 | flag2)

@@
expression inode, flags;
@@

- inode->i_state |= flags
+ inode_state_set(inode, flags)

@@
expression inode, flags;
@@

- inode->i_state = flags
+ inode_state_assign(inode, flags)

@@
expression inode, flags;
@@

- flags = inode->i_state
+ flags = inode_state_read(inode)

@@
expression inode, flags;
@@

- READ_ONCE(inode->i_state) & flags
+ inode_state_read(inode) & flags

Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-10-20 20:22:26 +02:00
Mehdi Ben Hadj Khelifa
e5a82249d8 blk-mq: use struct_size() in kmalloc()
Change struct size calculation to use struct_size()
to align with new recommended practices[1] which quotes:
"Another common case to avoid is calculating the size of a structure with
a trailing array of others structures, as in:

header = kzalloc(sizeof(*header) + count * sizeof(*header->item),
                 GFP_KERNEL);

Instead, use the helper:

header = kzalloc(struct_size(header, item, count), GFP_KERNEL);"
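
For readers outside the kernel, the idea behind the helper can be
approximated in user space. The sketch below is not the kernel macro:
model_struct_size() and struct model_header are hypothetical, and it
relies on the GCC/Clang overflow builtins. It saturates to SIZE_MAX on
overflow so that the allocation is refused instead of being undersized.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

struct model_header {
	size_t count;
	int item[];	/* flexible array member */
};

/*
 * Size of a structure of @base bytes followed by @count elements of
 * @elem bytes each, or SIZE_MAX if the computation overflows.
 */
size_t model_struct_size(size_t base, size_t elem, size_t count)
{
	size_t bytes;

	if (__builtin_mul_overflow(elem, count, &bytes) ||
	    __builtin_add_overflow(bytes, base, &bytes))
		return SIZE_MAX;
	return bytes;
}

/* Allocate a zeroed header with room for @count items, or NULL. */
struct model_header *model_header_alloc(size_t count)
{
	size_t bytes = model_struct_size(sizeof(struct model_header),
					 sizeof(int), count);
	struct model_header *header;

	if (bytes == SIZE_MAX)
		return NULL;	/* overflow: refuse rather than truncate */

	header = calloc(1, bytes);
	if (header)
		header->count = count;
	return header;
}
```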

Signed-off-by: Mehdi Ben Hadj Khelifa <mehdi.benhadjkhelifa@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-10-20 10:38:56 -06:00
Bart Van Assche
d60055cf52 block/mq-deadline: Switch back to a single dispatch list
Commit c807ab520f ("block/mq-deadline: Add I/O priority support")
modified the behavior of request flag BLK_MQ_INSERT_AT_HEAD from
dispatching a request before other requests into dispatching a request
before other requests with the same I/O priority. This is not correct since
BLK_MQ_INSERT_AT_HEAD is used when requeuing requests and also when a flush
request is inserted.  Both types of requests should be dispatched as soon
as possible. Hence, make the mq-deadline I/O scheduler again ignore the I/O
priority for BLK_MQ_INSERT_AT_HEAD requests.
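
The intended dispatch order can be illustrated with a heavily simplified
model. All names below are hypothetical, requests are plain integers, and
deadlines, sorting and starvation handling are left out entirely: the
sketch only shows a single dispatch list that is served before any
per-priority queue.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define MODEL_NR_PRIO	3
#define MODEL_LIST_MAX	8

struct model_list {
	int ids[MODEL_LIST_MAX];
	unsigned int len;
};

static bool model_push_front(struct model_list *l, int id)
{
	if (l->len == MODEL_LIST_MAX)
		return false;
	memmove(&l->ids[1], &l->ids[0], l->len * sizeof(l->ids[0]));
	l->ids[0] = id;
	l->len++;
	return true;
}

static bool model_push_back(struct model_list *l, int id)
{
	if (l->len == MODEL_LIST_MAX)
		return false;
	l->ids[l->len++] = id;
	return true;
}

static int model_pop_front(struct model_list *l)
{
	int id;

	assert(l->len > 0);
	id = l->ids[0];
	l->len--;
	memmove(&l->ids[0], &l->ids[1], l->len * sizeof(l->ids[0]));
	return id;
}

struct model_dd {
	struct model_list dispatch;		/* shared by all priorities */
	struct model_list prio[MODEL_NR_PRIO];	/* 0 is the highest */
};

/* Head insertions bypass the priority queues entirely. */
bool model_dd_insert(struct model_dd *dd, int id, unsigned int prio,
		     bool at_head)
{
	assert(prio < MODEL_NR_PRIO);

	if (at_head)
		return model_push_front(&dd->dispatch, id);
	return model_push_back(&dd->prio[prio], id);
}

/* Return the next request id, or -1 if nothing is queued. */
int model_dd_dispatch(struct model_dd *dd)
{
	unsigned int p;

	if (dd->dispatch.len)
		return model_pop_front(&dd->dispatch);
	for (p = 0; p < MODEL_NR_PRIO; p++) {
		if (dd->prio[p].len)
			return model_pop_front(&dd->prio[p]);
	}
	return -1;
}
```

In this model a low-priority request inserted at the head is dispatched
before a high-priority request that was queued normally, which is the
behavior the commit restores.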

Cc: Damien Le Moal <dlemoal@kernel.org>
Cc: Yu Kuai <yukuai@kernel.org>
Reported-by: chengkaitao <chengkaitao@kylinos.cn>
Closes: https://lore.kernel.org/linux-block/20251009155253.14611-1-pilgrimtao@gmail.com/
Fixes: c807ab520f ("block/mq-deadline: Add I/O priority support")
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-10-20 10:37:42 -06:00
Bart Van Assche
93a358af59 block/mq-deadline: Introduce dd_start_request()
Prepare for adding a second caller of this function. No functionality
has been changed.

Cc: Damien Le Moal <dlemoal@kernel.org>
Cc: Yu Kuai <yukuai@kernel.org>
Cc: chengkaitao <chengkaitao@kylinos.cn>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-10-20 10:37:42 -06:00
Linus Torvalds
0c8df15f75 block-6.18-20251016
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmjxoHoQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgprKCD/4irkkA7mBorYyNROXMwANOUg+2pl20xp/X
 8reZsIZKWztUS18Emfg2jS2NIXP6LIFc3ZehfJX/9FrM26B9URH9cq/F/D/mHc/+
 G4qfT5HUR5Eyav0qCP+pbru53irUOUWSUKKgrWRR8gDY9BcT7apjV8pULd/1PAfo
 3XLfY2o39u68TBzmcwZvDudtBFcBfSan/JCIiW6IMxHWerHhV+IEJG5ABncFo8n9
 +Ep5uOVWYQanM1lvat+Zy+aiWz0Fb0yYzXvtDatcGsAxfxJIf2Bs8ryZMAxgw7yk
 B9Jsd5kGTw9Tfn/H7kl2P4RGQ0gGr91dl0FmaUkDMXTyZcsz/Nq2PbwiiJaESp/4
 Ixk3m9QjXpA6ofxAeorXFtTo98obnKklZLpCPzV5sqslzpGSWXdsbPmHOB5A4XcH
 M2QT/uM2eZbUtHUkymoUBMJTcqCfUsL827+Z6DGLl+Rrb0bjRvunlVCoxoTPuAeg
 ulOpuWd888Gy3X8lT7vBhY/9iWyljFwja/suiFx3f29e2DULXiXDTXrrA0GIxcO+
 l7PA7BgcMf/0lJfo2gpGtPZvHYvtFXoRwxGaIGbusXixgg/dLY2LQ64BYPpu0JU8
 Ph3xsL3pgLLPFVMBarHqwKoSb/4avOvzMaM7xGQQICg+0Gx2x9YPpLosaMgbdYPY
 OFPSLihwUg==
 =qiT3
 -----END PGP SIGNATURE-----

Merge tag 'block-6.18-20251016' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux

Pull block fixes from Jens Axboe:

 - NVMe pull request via Keith:
     - iostats accounting fixed on multipath retries (Amit)
     - secure concatenation response fixup (Martin)
     - tls partial record fixup (Wilfred)

 - Fix for a lockdep reported issue with the elevator lock and
   blk group frozen operations

 - Fix for a regression in this merge window, where updating
   'nr_requests' would not do the right thing for queues with
   shared tags

* tag 'block-6.18-20251016' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux:
  nvme/tcp: handle tls partially sent records in write_space()
  block: Remove elevator_lock usage from blkg_conf frozen operations
  blk-mq: fix stale tag depth for shared sched tags in blk_mq_update_nr_requests()
  nvme-auth: update sc_c in host response
  nvme-multipath: Skip nr_active increments in RETRY disposition
2025-10-17 08:31:26 -07:00
Ming Lei
08823e89e3 block: Remove elevator_lock usage from blkg_conf frozen operations
Remove the acquisition and release of q->elevator_lock in the
blkg_conf_open_bdev_frozen() and blkg_conf_exit_frozen() functions. The
elevator lock is no longer needed in these code paths since commit
78c271344b ("block: move wbt_enable_default() out of queue freezing
from sched ->exit()") which introduces `disk->rqos_state_mutex` for
protecting wbt state changes, so there is no need to abuse elevator_lock
for this purpose.

This change helps to solve the lockdep warning reported by Yu Kuai [1].

Pass blktests/throtl with lockdep enabled.

Links: https://lore.kernel.org/linux-block/e5e7ac3f-2063-473a-aafb-4d8d43e5576e@yukuai.org.cn/ [1]
Fixes: commit 78c271344b ("block: move wbt_enable_default() out of queue freezing from sched ->exit()")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-10-15 08:00:19 -06:00
Yu Kuai
dc96cefef0 blk-mq: fix stale tag depth for shared sched tags in blk_mq_update_nr_requests()
Commit 7f2799c546 ("blk-mq: cleanup shared tags case in
blk_mq_update_nr_requests()") moves blk_mq_tag_update_sched_shared_tags()
before q->nr_requests is updated, however, it's still using the old
q->nr_requests to resize tag depth.

Fix this problem by passing in expected new tag depth.

Fixes: 7f2799c546 ("blk-mq: cleanup shared tags case in blk_mq_update_nr_requests()")
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Reported-by: Chris Mason <clm@meta.com>
Link: https://lore.kernel.org/linux-block/20251014130507.4187235-2-clm@meta.com/
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-10-15 07:49:19 -06:00
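
The ordering problem described in the blk-mq commit above can be sketched in
isolation. This is a minimal illustration only: struct demo_queue and every
demo_* function are hypothetical stand-ins, not the kernel's structures or
the actual patch. It shows why a helper that runs before a field is updated
should receive the new value as an argument instead of reading the field.

```c
#include <assert.h>

/* Hypothetical stand-in for a request queue; not the kernel's struct. */
struct demo_queue {
	unsigned int nr_requests;	/* currently configured depth */
	unsigned int tag_depth;		/* depth the shared tags were resized to */
};

/* Buggy shape: reads q->nr_requests, which the caller has not updated yet. */
static void demo_resize_shared_tags_stale(struct demo_queue *q)
{
	q->tag_depth = q->nr_requests;
}

/* Fixed shape: the caller passes the expected new depth explicitly. */
static void demo_resize_shared_tags(struct demo_queue *q, unsigned int new_depth)
{
	q->tag_depth = new_depth;
}

static void demo_update_nr_requests_buggy(struct demo_queue *q, unsigned int nr)
{
	demo_resize_shared_tags_stale(q);	/* resize runs first ... */
	q->nr_requests = nr;			/* ... the field changes afterwards */
}

static void demo_update_nr_requests_fixed(struct demo_queue *q, unsigned int nr)
{
	assert(nr > 0);				/* illustrative precondition */
	demo_resize_shared_tags(q, nr);		/* resize sees the new depth */
	q->nr_requests = nr;
}
```

In the buggy variant the tag depth keeps the old value after the update; in
the fixed variant both fields agree.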
Linus Torvalds
1b1391b9c4 block-6.18-20251009
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmjoDmUQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgptyjD/94YYv1sabG9M6UHq7j9lOAgruqXaaEMOw+
 Blnm4ejuLNcM8FMCBcuvbhp3ktzT7v1/bWal7FLnmujuKBfhAe+t2AVcHFWUQie2
 CIfMjc3p77U/bwL5wt0O5WFqu1UPDVe+qzrppqRYduTxvPKk9Fi6mqpYCXKlYN7K
 FhINsytoZp/CvTdf5EDSsPv2r4W85OhrPeq0VjYufFBD1wxXD94ii8WAvyfsl20s
 0gIfdlfa2vaNVwH1kdCd+IeATrSBpyCZKGEVTzcHYoo/1MgfNFigrJ8GUA5c+DLM
 fmNE+E+wFtobq5WBmbrtmAxtBnzzV49HS1OT1amUktuq87ryiY5Svn6vFAqEJQl6
 2HLE9nNN2PBdPMAmQ57u1bvp/3nGD0mk/hC1666MTDxHpxg5c6cugCSlJGVG+uC9
 ShLgi8bWV6RXelso0qMaSmNNCA8dskxJg/YDJ06AViTSuW8Y1+adoXddCjE7jne9
 3lci/r2WiuwqTJuub9D7LUtC7VhbCY19VVkgDE64VB2+CjR8B9AlLVG3sGl1HDOY
 EFAddJ3lAEOz5F1H2AzcOBPqqeBfuipr6lEpdb9+6hNu5wRILAHtme8W76c4PtuF
 PRk/3JYcHE77DZlFeE+iN8n0y1tNdWR/6QzWIOsGcNlUyeGGV/zvgGOodtFRpHt2
 t7Eue56EFw==
 =/1jf
 -----END PGP SIGNATURE-----

Merge tag 'block-6.18-20251009' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux

Pull block fixes from Jens Axboe:

 - Don't include __GFP_NOWARN for loop worker allocation, as it uses
   GFP_NOWAIT, which already has __GFP_NOWARN set

 - Small series cleaning up the recent bio_iov_iter_get_pages() changes

 - loop fix for leaking the backing reference file, if validation fails

 - Update of a comment pertaining to disk/partition stat locking

* tag 'block-6.18-20251009' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux:
  loop: remove redundant __GFP_NOWARN flag
  block: move bio_iov_iter_get_bdev_pages to block/fops.c
  iomap: open code bio_iov_iter_get_bdev_pages
  block: rename bio_iov_iter_get_pages_aligned to bio_iov_iter_get_pages
  block: remove bio_iov_iter_get_pages
  block: Update a comment of disk statistics
  loop: fix backing file reference leak on validation error
2025-10-10 10:37:13 -07:00
Christoph Hellwig
506aa235f6 block: move bio_iov_iter_get_bdev_pages to block/fops.c
Keep bio_iov_iter_get_bdev_pages local with the callers, as blindly
looking at the bdev logical block size is often not the best idea
unless on a block device.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-10-07 08:05:44 -06:00
Christoph Hellwig
82dd5d763c block: rename bio_iov_iter_get_pages_aligned to bio_iov_iter_get_pages
Now that the bio_iov_iter_get_pages name is free again, use it instead of
the more complicated one.  Also drop the unused export.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-10-07 08:05:44 -06:00
Christoph Hellwig
1ed06c8350 block: remove bio_iov_iter_get_pages
Switch the only caller to bio_iov_iter_get_pages_aligned, and explain why
it does not have any alignment requirements.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-10-07 08:05:44 -06:00
Linus Torvalds
8804d970fa Summary of significant series in this pull request:
- The 3 patch series "mm, swap: improve cluster scan strategy" from
   Kairui Song improves performance and reduces the failure rate of swap
   cluster allocation.
 
 - The 4 patch series "support large align and nid in Rust allocators"
   from Vitaly Wool permits Rust allocators to set NUMA node and large
   alignment when performing slub and vmalloc reallocs.
 
 - The 2 patch series "mm/damon/vaddr: support stat-purpose DAMOS" from
   Yueyang Pan extends DAMOS_STAT's handling of the DAMON operations sets
   for virtual address spaces for ops-level DAMOS filters.
 
 - The 3 patch series "execute PROCMAP_QUERY ioctl under per-vma lock"
   from Suren Baghdasaryan reduces mmap_lock contention during reads of
   /proc/pid/maps.
 
 - The 2 patch series "mm/mincore: minor clean up for swap cache
   checking" from Kairui Song performs some cleanup in the swap code.
 
 - The 11 patch series "mm: vm_normal_page*() improvements" from David
   Hildenbrand provides code cleanup in the pagemap code.
 
 - The 5 patch series "add persistent huge zero folio support" from
   Pankaj Raghav provides a block layer speedup by optionally making the
   huge_zero_page persistent, instead of releasing it when its refcount
   falls to zero.
 
 - The 3 patch series "kho: fixes and cleanups" from Mike Rapoport adds a
   few touchups to the recently added Kexec Handover feature.
 
 - The 10 patch series "mm: make mm->flags a bitmap and 64-bit on all
   arches" from Lorenzo Stoakes turns mm_struct.flags into a bitmap, to
   end the constant struggle with space shortage on 32-bit conflicting with
   64-bit's needs.
 
 - The 2 patch series "mm/swapfile.c and swap.h cleanup" from Chris Li
   cleans up some swap code.
 
 - The 7 patch series "selftests/mm: Fix false positives and skip
   unsupported tests" from Donet Tom fixes a few things in our selftests
   code.
 
 - The 7 patch series "prctl: extend PR_SET_THP_DISABLE to only provide
   THPs when advised" from David Hildenbrand "allows individual processes
   to opt-out of THP=always into THP=madvise, without affecting other
   workloads on the system".
 
   It's a long story - the [1/N] changelog spells out the considerations.
 
 - The 11 patch series "Add and use memdesc_flags_t" from Matthew Wilcox
   gets us started on the memdesc project.  Please see
   https://kernelnewbies.org/MatthewWilcox/Memdescs and
   https://blogs.oracle.com/linux/post/introducing-memdesc.
 
 - The 3 patch series "Tiny optimization for large read operations" from
   Chi Zhiling improves the efficiency of the pagecache read path.
 
 - The 5 patch series "Better split_huge_page_test result check" from Zi
   Yan improves our folio splitting selftest code.
 
 - The 2 patch series "test that rmap behaves as expected" from Wei Yang
   adds some rmap selftests.
 
 - The 3 patch series "remove write_cache_pages()" from Christoph Hellwig
   removes that function and converts its two remaining callers.
 
 - The 2 patch series "selftests/mm: uffd-stress fixes" from Dev Jain
   fixes some UFFD selftests issues.
 
 - The 3 patch series "introduce kernel file mapped folios" from Boris
   Burkov introduces the concept of "kernel file pages".  Using these
   permits btrfs to account its metadata pages to the root cgroup, rather
   than to the cgroups of random inappropriate tasks.
 
 - The 2 patch series "mm/pageblock: improve readability of some
   pageblock handling" from Wei Yang provides some readability improvements
   to the page allocator code.
 
 - The 11 patch series "mm/damon: support ARM32 with LPAE" from SeongJae
   Park teaches DAMON to understand arm32 highmem.
 
 - The 4 patch series "tools: testing: Use existing atomic.h for
   vma/maple tests" from Brendan Jackman performs some code cleanups and
   deduplication under tools/testing/.
 
 - The 2 patch series "maple_tree: Fix testing for 32bit compiles" from
   Liam Howlett fixes a couple of 32-bit issues in
   tools/testing/radix-tree.c.
 
 - The 2 patch series "kasan: unify kasan_enabled() and remove
   arch-specific implementations" from Sabyrzhan Tasbolatov moves KASAN
   arch-specific initialization code into a common arch-neutral
   implementation.
 
 - The 3 patch series "mm: remove zpool" from Johannes Weiner removes
   zpool - an indirection layer which now only redirects to a single thing
   (zsmalloc).
 
 - The 2 patch series "mm: task_stack: Stack handling cleanups" from
   Pasha Tatashin makes a couple of cleanups in the fork code.
 
 - The 37 patch series "mm: remove nth_page()" from David Hildenbrand
   makes rather a lot of adjustments at various nth_page() callsites,
   eventually permitting the removal of that undesirable helper function.
 
 - The 2 patch series "introduce kasan.write_only option in hw-tags" from
   Yeoreum Yun creates a KASAN write-only mode for ARM, using that
   architecture's memory tagging feature.  It is felt that a write-only mode
   KASAN is suitable for use in production systems rather than debug-only.
 
 - The 3 patch series "mm: hugetlb: cleanup hugetlb folio allocation"
   from Kefeng Wang does some tidying in the hugetlb folio allocation code.
 
 - The 12 patch series "mm: establish const-correctness for pointer
   parameters" from Max Kellermann makes quite a number of the MM API
   functions more accurate about the constness of their arguments.  This
   was getting in the way of subsystems (in this case CEPH) when they
   attempted to improve their own const/non-const accuracy.
 
 - The 7 patch series "Cleanup free_pages() misuse" from Vishal Moola
   fixes a number of code sites which were confused over when to use
   free_pages() vs __free_pages().
 
 - The 3 patch series "Add Rust abstraction for Maple Trees" from Alice
   Ryhl makes the mapletree code accessible to Rust.  Required by nouveau
   and by its forthcoming successor: the new Rust Nova driver.
 
 - The 2 patch series "selftests/mm: split_huge_page_test:
   split_pte_mapped_thp improvements" from David Hildenbrand adds a fix and
   some cleanups to the thp selftesting code.
 
 - The 14 patch series "mm, swap: introduce swap table as swap cache
   (phase I)" from Chris Li and Kairui Song is the first step along the
   path to implementing "swap tables" - a new approach to swap allocation
   and state tracking which is expected to yield speed and space
   improvements.  This patchset itself yields a 5-20% performance benefit
   in some situations.
 
 - The 3 patch series "Some ptdesc cleanups" from Matthew Wilcox utilizes
   the new memdesc layer to clean up the ptdesc code a little.
 
 - The 3 patch series "Fix va_high_addr_switch.sh test failure" from
   Chunyu Hu fixes some issues in our 5-level pagetable selftesting code.
 
 - The 2 patch series "Minor fixes for memory allocation profiling" from
   Suren Baghdasaryan addresses a couple of minor issues in the relatively new
   memory allocation profiling feature.
 
 - The 3 patch series "Small cleanups" from Matthew Wilcox has a few
   cleanups in preparation for more memdesc work.
 
 - The 2 patch series "mm/damon: add addr_unit for DAMON_LRU_SORT and
   DAMON_RECLAIM" from Quanmin Yan makes some changes to DAMON in
   furtherance of supporting arm highmem.
 
 - The 2 patch series "selftests/mm: Add -Wunreachable-code and fix
   warnings" from Muhammad Anjum adds that compiler check to selftests code
   and fixes the fallout, by removing dead code.
 
 - The 10 patch series "Improvements to Victim Process Thawing and OOM
   Reaper Traversal Order" from zhongjinji makes a number of improvements
   in the OOM killer: mainly thawing a more appropriate group of victim
   threads so they can release resources.
 
 - The 5 patch series "mm/damon: misc fixups and improvements for 6.18"
   from SeongJae Park is a bunch of small and unrelated fixups for DAMON.
 
 - The 7 patch series "mm/damon: define and use DAMON initialization
   check function" from SeongJae Park implements reliability and
   maintainability improvements to a recently-added bug fix.
 
 - The 2 patch series "mm/damon/stat: expose auto-tuned intervals and
   non-idle ages" from SeongJae Park provides additional transparency to
   userspace clients of the DAMON_STAT information.
 
 - The 2 patch series "Expand scope of khugepaged anonymous collapse"
   from Dev Jain removes some constraints on khugepaged's collapsing of
   anon VMAs.  It also increases the success rate of MADV_COLLAPSE against
   an anon vma.
 
 - The 2 patch series "mm: do not assume file == vma->vm_file in
   compat_vma_mmap_prepare()" from Lorenzo Stoakes moves us further towards
   removal of file_operations.mmap().  This patchset concentrates upon
   clearing up the treatment of stacked filesystems.
 
 - The 6 patch series "mm: Improve mlock tracking for large folios" from
   Kiryl Shutsemau provides some fixes and improvements to mlock's tracking
   of large folios.  /proc/meminfo's "Mlocked" field became more accurate.
 
 - The 2 patch series "mm/ksm: Fix incorrect accounting of KSM counters
   during fork" from Donet Tom fixes several user-visible KSM stats
   inaccuracies across forks and adds selftest code to verify these
   counters.
 
 - The 2 patch series "mm_slot: fix the usage of mm_slot_entry" from Wei
   Yang addresses some potential but presently benign issues in KSM's
   mm_slot handling.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCaN3cywAKCRDdBJ7gKXxA
 jtaPAQDmIuIu7+XnVUK5V11hsQ/5QtsUeLHV3OsAn4yW5/3dEQD/UddRU08ePN+1
 2VRB0EwkLAdfMWW7TfiNZ+yhuoiL/AA=
 =4mhY
 -----END PGP SIGNATURE-----

Merge tag 'mm-stable-2025-10-01-19-00' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull MM updates from Andrew Morton:

 - "mm, swap: improve cluster scan strategy" from Kairui Song improves
   performance and reduces the failure rate of swap cluster allocation

 - "support large align and nid in Rust allocators" from Vitaly Wool
   permits Rust allocators to set NUMA node and large alignment when
   performing slub and vmalloc reallocs

 - "mm/damon/vaddr: support stat-purpose DAMOS" from Yueyang Pan extends
   DAMOS_STAT's handling of the DAMON operations sets for virtual
   address spaces for ops-level DAMOS filters

 - "execute PROCMAP_QUERY ioctl under per-vma lock" from Suren
   Baghdasaryan reduces mmap_lock contention during reads of
   /proc/pid/maps

 - "mm/mincore: minor clean up for swap cache checking" from Kairui Song
   performs some cleanup in the swap code

 - "mm: vm_normal_page*() improvements" from David Hildenbrand provides
   code cleanup in the pagemap code

 - "add persistent huge zero folio support" from Pankaj Raghav provides
   a block layer speedup by optionally making the
   huge_zero_page persistent, instead of releasing it when its refcount
   falls to zero

 - "kho: fixes and cleanups" from Mike Rapoport adds a few touchups to
   the recently added Kexec Handover feature

 - "mm: make mm->flags a bitmap and 64-bit on all arches" from Lorenzo
   Stoakes turns mm_struct.flags into a bitmap, to end the constant
   struggle with space shortage on 32-bit conflicting with 64-bit's
   needs

 - "mm/swapfile.c and swap.h cleanup" from Chris Li cleans up some swap
   code

 - "selftests/mm: Fix false positives and skip unsupported tests" from
   Donet Tom fixes a few things in our selftests code

 - "prctl: extend PR_SET_THP_DISABLE to only provide THPs when advised"
   from David Hildenbrand "allows individual processes to opt-out of
   THP=always into THP=madvise, without affecting other workloads on the
   system".

   It's a long story - the [1/N] changelog spells out the considerations

 - "Add and use memdesc_flags_t" from Matthew Wilcox gets us started on
   the memdesc project. Please see

      https://kernelnewbies.org/MatthewWilcox/Memdescs and
      https://blogs.oracle.com/linux/post/introducing-memdesc

 - "Tiny optimization for large read operations" from Chi Zhiling
   improves the efficiency of the pagecache read path

 - "Better split_huge_page_test result check" from Zi Yan improves our
   folio splitting selftest code

 - "test that rmap behaves as expected" from Wei Yang adds some rmap
   selftests

 - "remove write_cache_pages()" from Christoph Hellwig removes that
   function and converts its two remaining callers

 - "selftests/mm: uffd-stress fixes" from Dev Jain fixes some UFFD
   selftests issues

 - "introduce kernel file mapped folios" from Boris Burkov introduces
   the concept of "kernel file pages". Using these permits btrfs to
   account its metadata pages to the root cgroup, rather than to the
   cgroups of random inappropriate tasks

 - "mm/pageblock: improve readability of some pageblock handling" from
   Wei Yang provides some readability improvements to the page allocator
   code

 - "mm/damon: support ARM32 with LPAE" from SeongJae Park teaches DAMON
   to understand arm32 highmem

 - "tools: testing: Use existing atomic.h for vma/maple tests" from
   Brendan Jackman performs some code cleanups and deduplication under
   tools/testing/

 - "maple_tree: Fix testing for 32bit compiles" from Liam Howlett fixes
   a couple of 32-bit issues in tools/testing/radix-tree.c

 - "kasan: unify kasan_enabled() and remove arch-specific
   implementations" from Sabyrzhan Tasbolatov moves KASAN arch-specific
   initialization code into a common arch-neutral implementation

 - "mm: remove zpool" from Johannes Weiner removes zpool - an
   indirection layer which now only redirects to a single thing
   (zsmalloc)

 - "mm: task_stack: Stack handling cleanups" from Pasha Tatashin makes a
   couple of cleanups in the fork code

 - "mm: remove nth_page()" from David Hildenbrand makes rather a lot of
   adjustments at various nth_page() callsites, eventually permitting
   the removal of that undesirable helper function

 - "introduce kasan.write_only option in hw-tags" from Yeoreum Yun
   creates a KASAN write-only mode for ARM, using that architecture's
   memory tagging feature. It is felt that a write-only mode KASAN is
   suitable for use in production systems rather than debug-only

 - "mm: hugetlb: cleanup hugetlb folio allocation" from Kefeng Wang does
   some tidying in the hugetlb folio allocation code

 - "mm: establish const-correctness for pointer parameters" from Max
   Kellermann makes quite a number of the MM API functions more accurate
   about the constness of their arguments. This was getting in the way
   of subsystems (in this case CEPH) when they attempted to improve
   their own const/non-const accuracy

 - "Cleanup free_pages() misuse" from Vishal Moola fixes a number of
   code sites which were confused over when to use free_pages() vs
   __free_pages()

 - "Add Rust abstraction for Maple Trees" from Alice Ryhl makes the
   mapletree code accessible to Rust. Required by nouveau and by its
   forthcoming successor: the new Rust Nova driver

 - "selftests/mm: split_huge_page_test: split_pte_mapped_thp
   improvements" from David Hildenbrand adds a fix and some cleanups to
   the thp selftesting code

 - "mm, swap: introduce swap table as swap cache (phase I)" from Chris
   Li and Kairui Song is the first step along the path to implementing
   "swap tables" - a new approach to swap allocation and state tracking
   which is expected to yield speed and space improvements. This
   patchset itself yields a 5-20% performance benefit in some situations

 - "Some ptdesc cleanups" from Matthew Wilcox utilizes the new memdesc
   layer to clean up the ptdesc code a little

 - "Fix va_high_addr_switch.sh test failure" from Chunyu Hu fixes some
   issues in our 5-level pagetable selftesting code

 - "Minor fixes for memory allocation profiling" from Suren Baghdasaryan
   addresses a couple of minor issues in the relatively new memory
   allocation profiling feature

 - "Small cleanups" from Matthew Wilcox has a few cleanups in
   preparation for more memdesc work

 - "mm/damon: add addr_unit for DAMON_LRU_SORT and DAMON_RECLAIM" from
   Quanmin Yan makes some changes to DAMON in furtherance of supporting
   arm highmem

 - "selftests/mm: Add -Wunreachable-code and fix warnings" from Muhammad
   Anjum adds that compiler check to selftests code and fixes the
   fallout, by removing dead code

 - "Improvements to Victim Process Thawing and OOM Reaper Traversal
   Order" from zhongjinji makes a number of improvements in the OOM
   killer: mainly thawing a more appropriate group of victim threads so
   they can release resources

 - "mm/damon: misc fixups and improvements for 6.18" from SeongJae Park
   is a bunch of small and unrelated fixups for DAMON

 - "mm/damon: define and use DAMON initialization check function" from
   SeongJae Park implements reliability and maintainability improvements
   to a recently-added bug fix

 - "mm/damon/stat: expose auto-tuned intervals and non-idle ages" from
   SeongJae Park provides additional transparency to userspace clients
   of the DAMON_STAT information

 - "Expand scope of khugepaged anonymous collapse" from Dev Jain removes
   some constraints on khugepaged's collapsing of anon VMAs. It also
   increases the success rate of MADV_COLLAPSE against an anon vma

 - "mm: do not assume file == vma->vm_file in compat_vma_mmap_prepare()"
   from Lorenzo Stoakes moves us further towards removal of
   file_operations.mmap(). This patchset concentrates upon clearing up
   the treatment of stacked filesystems

 - "mm: Improve mlock tracking for large folios" from Kiryl Shutsemau
   provides some fixes and improvements to mlock's tracking of large
   folios. /proc/meminfo's "Mlocked" field became more accurate

 - "mm/ksm: Fix incorrect accounting of KSM counters during fork" from
   Donet Tom fixes several user-visible KSM stats inaccuracies across
   forks and adds selftest code to verify these counters

 - "mm_slot: fix the usage of mm_slot_entry" from Wei Yang addresses
   some potential but presently benign issues in KSM's mm_slot handling

* tag 'mm-stable-2025-10-01-19-00' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (372 commits)
  mm: swap: check for stable address space before operating on the VMA
  mm: convert folio_page() back to a macro
  mm/khugepaged: use start_addr/addr for improved readability
  hugetlbfs: skip VMAs without shareable locks in hugetlb_vmdelete_list
  alloc_tag: fix boot failure due to NULL pointer dereference
  mm: silence data-race in update_hiwater_rss
  mm/memory-failure: don't select MEMORY_ISOLATION
  mm/khugepaged: remove definition of struct khugepaged_mm_slot
  mm/ksm: get mm_slot by mm_slot_entry() when slot is !NULL
  hugetlb: increase number of reserving hugepages via cmdline
  selftests/mm: add fork inheritance test for ksm_merging_pages counter
  mm/ksm: fix incorrect KSM counter handling in mm_struct during fork
  drivers/base/node: fix double free in register_one_node()
  mm: remove PMD alignment constraint in execmem_vmalloc()
  mm/memory_hotplug: fix typo 'esecially' -> 'especially'
  mm/rmap: improve mlock tracking for large folios
  mm/filemap: map entire large folio faultaround
  mm/fault: try to map the entire file folio in finish_fault()
  mm/rmap: mlock large folios in try_to_unmap_one()
  mm/rmap: fix a mlock race condition in folio_referenced_one()
  ...
2025-10-02 18:18:33 -07:00
Linus Torvalds
e1b1d03cee for-6.18/block-20250929
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmjbLCgQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpoY0D/9J+11BC88pBxCrLKv/V2TwCNokRMi0dU3L
 r3EUdA46k0oXmvb6ueZqIcfY2e+IX7rdQkaRbh1zRdsNejqHo4548C3ePWGdBAcM
 OdNEGfpehO0aD0td1+mK/NxoJMLhbs5QraPanz+SOkGZOKeF+vGCga5PUDivsr5J
 16T9yb7i+isENLdAc2RJbZVyAphqHQlo5GHi5ZIKOVi5cNt8GU/q2sQl7NYmGvHd
 aq37svvZHFOhLRajP959Fw9WOxEYITewzQ4UYf1FZjUodJUxO+vCnP0ooBQRlyu8
 1B4PYWwSE+Vn3GkQE0Om+mzo9AVPOiLmoAWGxdgJBMyEkZndocr46XEslXOufQ1Z
 T3Gu19G6jCxcyByNVhjVnaajYKmvSQAy1w75m4XlfqTRm4f9Om+LAJavUk3RuaOL
 7lXKQ7Ql1/Tby9Jmf8afjYYXXotNDNku6rz2P3qtOwAA26mNJfgVt0rO+8XGRDe9
 ioLbCkTjslYMc/Oh4jSsbrspsVALbaQMq/Dmah8k0EWb4QAHVgCJyGBoff3hOboI
 jD6B1enaKOQVgcjWcjm/FjOk3jv2h3v4X26YWQZTvEc/1PnSnST78Zi/ePhzDdmt
 sBALUAS37TfTgNMzrhbHl5Zs13k0C0XyANuayuKuo5hlNnC1wbdap+5FZJOmpuOB
 YT+VkYnaOA==
 =kOmc
 -----END PGP SIGNATURE-----

Merge tag 'for-6.18/block-20250929' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux

Pull block updates from Jens Axboe:

 - NVMe pull request via Keith:
     - FC target fixes (Daniel)
     - Authentication fixes and updates (Martin, Chris)
     - Admin controller handling (Kamaljit)
     - Target lockdep assertions (Max)
     - Keep-alive updates for discovery (Alastair)
     - Suspend quirk (Georg)

 - MD pull request via Yu:
     - Add support for a lockless bitmap.

       A key feature of the new bitmap is that the IO fastpath is
       lockless. If a user issues lots of write IO to the same bitmap
       bit in a short time, only the first write has the additional
       overhead of updating the bitmap bit; the following writes have
       no additional overhead.

       Because only written data needs to be resynced or recovered,
       there is no need to do a full disk resync/recovery when creating
       a new array or replacing a disk with a new one.

 - Switch ->getgeo() and ->bios_param() to using struct gendisk rather
   than struct block_device.

 - Rust block changes via Andreas. This series adds configuration via
   configfs and remote completion to the rnull driver. The series also
   includes a set of changes to the rust block device driver API: a few
   cleanup patches, and a few features supporting the rnull changes.

   The series removes the raw buffer formatting logic from
   `kernel::block` and improves the logic available in `kernel::string`
   to support the same use as the removed logic.

 - floppy arch cleanups

 - Reduce the number of dereferences needed for ublk commands

 - Restrict supported sockets for nbd. Mostly done to eliminate a class
   of issues perpetually reported by syzbot, by using nonsensical socket
   setups.

 - A few s390 dasd block fixes

 - Fix a few issues around atomic writes

 - Improve DMA iteration for integrity requests

 - Improve how iovecs are treated with regards to O_DIRECT alignment
   constraints.

   We used to require each segment to adhere to the constraints, now
   only the request as a whole needs to.

 - Clean up and improve p2p support, enabling use of p2p for metadata
   payloads

 - Improve locking of request lookup, using SRCU where appropriate

 - Use page references properly for brd, avoiding very long RCU sections

 - Fix ordering of recursively submitted IOs

 - Clean up and improve updating nr_requests for a live device

 - Various fixes and cleanups

* tag 'for-6.18/block-20250929' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux: (164 commits)
  s390/dasd: enforce dma_alignment to ensure proper buffer validation
  s390/dasd: Return BLK_STS_INVAL for EINVAL from do_dasd_request
  ublk: remove redundant zone op check in ublk_setup_iod()
  nvme: Use non zero KATO for persistent discovery connections
  nvmet: add safety check for subsys lock
  nvme-core: use nvme_is_io_ctrl() for I/O controller check
  nvme-core: do ioccsz/iorcsz validation only for I/O controllers
  nvme-core: add method to check for an I/O controller
  blk-cgroup: fix possible deadlock while configuring policy
  blk-mq: fix null-ptr-deref in blk_mq_free_tags() from error path
  blk-mq: Fix more tag iteration function documentation
  selftests: ublk: fix behavior when fio is not installed
  ublk: don't access ublk_queue in ublk_unmap_io()
  ublk: pass ublk_io to __ublk_complete_rq()
  ublk: don't access ublk_queue in ublk_need_complete_req()
  ublk: don't access ublk_queue in ublk_check_commit_and_fetch()
  ublk: don't pass ublk_queue to ublk_fetch()
  ublk: don't access ublk_queue in ublk_config_io_buf()
  ublk: don't access ublk_queue in ublk_check_fetch_buf()
  ublk: pass q_id and tag to __ublk_check_and_get_req()
  ...
2025-10-02 10:16:56 -07:00
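
The "only the first write pays" behaviour described for the lockless MD
bitmap in the pull request above can be illustrated with a generic
test-before-set pattern. This is a sketch under assumptions: demo_mark_dirty
and DEMO_BITS_PER_WORD are hypothetical names, C11 atomics stand in for
kernel primitives, and none of this is taken from the md code itself.

```c
#include <assert.h>
#include <limits.h>
#include <stdatomic.h>
#include <stdbool.h>

#define DEMO_BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)

/*
 * Mark one bitmap bit dirty without taking a lock.
 * Returns true only for the call that actually set the bit.
 */
static bool demo_mark_dirty(_Atomic unsigned long *bitmap, unsigned long bit)
{
	assert(bitmap);

	_Atomic unsigned long *word = &bitmap[bit / DEMO_BITS_PER_WORD];
	unsigned long mask = 1UL << (bit % DEMO_BITS_PER_WORD);

	/* Fast path: a plain load; later writers to the same bit stop here. */
	if (atomic_load_explicit(word, memory_order_acquire) & mask)
		return false;

	/* Slow path: atomic OR; only one racing caller sees the bit clear. */
	unsigned long old = atomic_fetch_or_explicit(word, mask,
						     memory_order_acq_rel);
	return !(old & mask);
}
```

A caller would do the expensive bookkeeping (for example, persisting the
bit) only when the function returns true.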
Linus Torvalds
5832d26433 for-6.18/io_uring-20250929
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmjbLEcQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpnEUD/4/FgfQP2LFS/88BBF5ukZjRySe4wmyyZ2Q
 MFh2ehdxzkZxVXjbeA2wRAXdqjw2MbNhx8tzU9VrW7rweNDZxHbwi6jJIP7OAjxE
 4ZP0goAQj7P0TFyXC2KGj7k6dP20FkAltx5gGLVwsuOWDDrQKp2EykAcRnGYAD4W
 3yf+nojVr2bjHyO7dx8dM7jUDjMg7J8nmHD6zgHOlHRLblWwfzw907bhz+eBX/FI
 9kYvtX2c9MgY4Isa+43rZd5qvj9S3Cs8PD6tFPbq+n+3l7yWgMBTu/y+SNI8hupT
 W7CqjPcpvppFHhPkcXDA3yARnW7ccEx5aiQuvUCmRUioHtGwXvC63HMp8OjcQspV
 NNoIHYFsi1alzYq2kJLxY1IleWZ8j0hUkSSU8u7al8VIvtD43LGkv51xavxQUFjg
 BO9mLyS51H2agffySs4vhHJE82lZizvmh/RJfSJ0ezALzE2k42MrximX1D1rBJE6
 KPOhCiPt/jqpQMyqDYnY10FgTXQVwgPIVH1JLpo611tPFHlGW8Y4YxxR1Xduh5JX
 jbGLEjVREsDZ7EHrimLNLmJRAQpyQujv/yhf7k96gWBelVwVuISQLI4Ca5IeVQyk
 9yifgLXNGddgAwj0POMFeKXSm2We9nrrPDYLCKrsBMSN96/3SLveJC7fkW88aUZr
 ye4/K8Y3vA==
 =uc/3
 -----END PGP SIGNATURE-----

Merge tag 'for-6.18/io_uring-20250929' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux

Pull io_uring updates from Jens Axboe:

 - Store ring provided buffers locally for the users, rather than stuff
   them into struct io_kiocb.

   These types of buffers must always be fully consumed or recycled in
   the current context, and leaving them in struct io_kiocb is hence not
   a good idea, as that struct has a vastly different lifetime.

   Basically just an architecture cleanup that can help prevent issues
   with ring provided buffers in the future.

 - Support for mixed CQE sizes in the same ring.

   Before this change, a CQ ring either used the default 16b CQEs, or it
   was set up with 32b CQEs using IORING_SETUP_CQE32. For use cases where
   a few 32b CQEs were needed, this caused everything else to use big
   CQEs. This is wasteful in terms of both memory usage and memory
   bandwidth for the posted CQEs.

   With IORING_SETUP_CQE_MIXED, applications may use request types that
   post both normal 16b and big 32b CQEs on the same ring.

 - Add helpers for async data management, to make it harder for opcode
   handlers to mess it up.

 - Add support for multishot for uring_cmd, which ublk can use. This
   helps improve efficiency, by providing a persistent request type that
   can trigger multiple CQEs.

 - Add initial support for ring feature querying.

   We had basic support for probe operations, but the API isn't great.
   Rather than expand that, add support for QUERY which is easily
   expandable and can cover a lot more cases than the existing probe
   support. This will help applications get a better idea of what
   operations are supported on a given host.

 - zcrx improvements from Pavel:
        - Improve refill entry alignment for better caching
        - Various cleanups, especially around deduplicating normal
          memory vs dmabuf setup.
        - Generalisation of the niov size (Patch 12). It's still hard
          coded to PAGE_SIZE on init, but will let the user specify
          the rx buffer length on setup.
        - Syscall / synchronous buffer return. It'll be used as a slow
          fallback path for returning buffers when the refill queue is
          full. Useful for tolerating slight queue size misconfiguration
          or with inconsistent load.
        - Accounting more memory to cgroups.
        - Additional independent cleanups that will also be useful for
          multi-area support.

 - Various fixes and cleanups

* tag 'for-6.18/io_uring-20250929' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux: (68 commits)
  io_uring/cmd: drop unused res2 param from io_uring_cmd_done()
  io_uring: fix nvme's 32b cqes on mixed cq
  io_uring/query: cap number of queries
  io_uring/query: prevent infinite loops
  io_uring/zcrx: account niov arrays to cgroup
  io_uring/zcrx: allow synchronous buffer return
  io_uring/zcrx: introduce io_parse_rqe()
  io_uring/zcrx: don't adjust free cache space
  io_uring/zcrx: use guards for the refill lock
  io_uring/zcrx: reduce netmem scope in refill
  io_uring/zcrx: protect netdev with pp_lock
  io_uring/zcrx: rename dma lock
  io_uring/zcrx: make niov size variable
  io_uring/zcrx: set sgt for umem area
  io_uring/zcrx: remove dmabuf_offset
  io_uring/zcrx: deduplicate area mapping
  io_uring/zcrx: pass ifq to io_zcrx_alloc_fallback()
  io_uring/zcrx: check all niovs filled with dma addresses
  io_uring/zcrx: move area reg checks into io_import_area
  io_uring/zcrx: don't pass slot to io_zcrx_create_area
  ...
2025-10-02 09:56:23 -07:00
Linus Torvalds
18b19abc37 namespace-6.18-rc1
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCaNZQgQAKCRCRxhvAZXjc
 oiFXAQCpbLvkWbld9wLgxUBhq+q+kw5NvGxzpvqIhXwJB9F9YAEA44/Wevln4xGx
 +kRUbP+xlRQqenIYs2dLzVHzAwAdfQ4=
 =EO4Y
 -----END PGP SIGNATURE-----

Merge tag 'namespace-6.18-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull namespace updates from Christian Brauner:
 "This contains a larger set of changes around the generic namespace
  infrastructure of the kernel.

  Each specific namespace type (net, cgroup, mnt, ...) embeds a struct
  ns_common which carries the reference count of the namespace and so
  on.

  We open-coded and cargo-culted so many quirks for each namespace type
  that it just wasn't scalable anymore. So given there's a bunch of new
  changes coming in that area I've started cleaning all of this up.

  The core change is to make it possible to correctly initialize every
  namespace uniformly and derive the correct initialization settings
  from the type of the namespace such as namespace operations, namespace
  type and so on. This leaves the new ns_common_init() function with a
  single parameter which is the specific namespace type which derives
  the correct parameters statically. This also means the compiler will
  yell as soon as someone does something remotely fishy.

  The ns_common_init() addition also allows us to remove ns_alloc_inum()
  and drops any special-casing of the initial network namespace in the
  network namespace initialization code that Linus complained about.

  Another part is reworking the reference counting. The reference
  counting was open-coded and copy-pasted for each namespace type even
  though they all followed the same rules. This also removes all open
  accesses to the reference count and makes it private and only uses a
  very small set of dedicated helpers to manipulate them just like we do
  for e.g., files.
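
A private reference count that is only touched through a small set of helpers can be sketched in userspace C roughly as follows. The names demo_ns_common, demo_ns_ref_init(), demo_ns_ref_get(), demo_ns_ref_put() and demo_ns_ref_read() are hypothetical and only echo the ns_ref_*() naming; the kernel's real helpers and refcount type are different.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the common part embedded in every namespace. */
struct demo_ns_common {
	atomic_uint private_ref; /* never touched directly, only via helpers */
};

/* A freshly initialized namespace starts with one reference. */
static inline void demo_ns_ref_init(struct demo_ns_common *ns)
{
	atomic_init(&ns->private_ref, 1);
}

/* Take a reference only if the object is still alive (count > 0). */
static inline bool demo_ns_ref_get(struct demo_ns_common *ns)
{
	unsigned int old = atomic_load(&ns->private_ref);

	while (old != 0) {
		/* On failure, old is reloaded with the current value. */
		if (atomic_compare_exchange_weak(&ns->private_ref, &old, old + 1))
			return true;
	}
	return false;
}

/* Drop a reference; returns true if the caller dropped the last one. */
static inline bool demo_ns_ref_put(struct demo_ns_common *ns)
{
	return atomic_fetch_sub(&ns->private_ref, 1) == 1;
}

/* Read-only view of the count, e.g. for debugging. */
static inline unsigned int demo_ns_ref_read(struct demo_ns_common *ns)
{
	return atomic_load(&ns->private_ref);
}
```

Because every namespace type goes through the same few helpers, a change to the counting rules only has to be made in one place.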

  In addition this generalizes the mount namespace iteration
  infrastructure introduced a few cycles ago. As a reminder, the vfs
  makes it possible to iterate sequentially and bidirectionally through
  all mount namespaces on the system or all mount namespaces that the
  caller holds privilege over. This allows userspace to iterate over all
  mounts in all mount namespaces using the listmount() and statmount()
  system calls.

  Each mount namespace has a unique identifier for the lifetime of the
  system that is exposed to userspace. The network namespace also has a
  unique identifier working exactly the same way. This extends the
  concept to all other namespace types.

  The new nstree type makes it possible to look up namespaces purely by
  their identifier and to walk the namespace list sequentially and
  bidirectionally for all namespace types, allowing userspace to iterate
  through all namespaces. Looking up namespaces in the namespace tree
  works completely locklessly.

  This also means we can move the mount namespace onto the generic
  infrastructure and remove a bunch of code and members from struct
  mnt_namespace itself.

  There's a bunch of stuff coming on top of this in the future but for
  now this uses the generic namespace tree to extend a concept
  introduced first for pidfs a few cycles ago. For a while now we have
  supported pidfs file handles for pidfds. This has proven to be very
  useful.

  This extends the concept to cover namespaces as well. It is possible
  to encode and decode namespace file handles using the common
  name_to_handle_at() and open_by_handle_at() apis.

  As with pidfs file handles, namespace file handles are exhaustive,
  meaning it is not required to actually hold a reference to nsfs in
  order to decode aka open_by_handle_at() a namespace file handle.
  Instead the FD_NSFS_ROOT constant can be passed which will let the
  kernel grab a reference to the root of nsfs internally and thus decode
  the file handle.

  Namespace file descriptors can already be derived from pidfds which
  means they aren't subject to overmount protection bugs. IOW, it's
  irrelevant if the caller would not have access to an appropriate
  /proc/<pid>/ns/ directory as they could always just derive the
  namespace based on a pidfd already.

  It has the same advantage as pidfds. It's possible to reliably and for
  the lifetime of the system refer to a namespace without pinning any
  resources and to compare them trivially.

  Permission checking is kept simple. If the caller is located in the
  namespace the file handle refers to, they are able to open it;
  otherwise they must hold privilege over the owning namespace of the
  relevant namespace.

  The namespace file handle layout is exposed as uapi and has a stable
  and extensible format. For now it simply contains the namespace
  identifier, the namespace type, and the inode number. The stable
  format means that userspace may construct its own namespace file
  handles without going through name_to_handle_at() as they are already
  allowed for pidfs and cgroup file handles"
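
A minimal sketch of what constructing and comparing such a handle by hand could look like. The struct name, field names and field widths below are assumptions made for illustration; only the three pieces of information (identifier, type, inode number) come from the text above, and the authoritative layout is whatever the uapi header defines.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical layout: namespace identifier, namespace type, inode number. */
struct demo_ns_file_handle {
	uint64_t ns_id;
	uint32_t ns_type;
	uint32_t ns_inum;
};

/* Fill a handle by hand instead of asking the kernel to encode one. */
static inline struct demo_ns_file_handle
demo_ns_handle_make(uint64_t id, uint32_t type, uint32_t inum)
{
	struct demo_ns_file_handle h;

	memset(&h, 0, sizeof(h));
	h.ns_id = id;
	h.ns_type = type;
	h.ns_inum = inum;
	return h;
}

/* Two handles refer to the same namespace if all fields match. */
static inline int demo_ns_handle_equal(const struct demo_ns_file_handle *a,
				       const struct demo_ns_file_handle *b)
{
	return a->ns_id == b->ns_id &&
	       a->ns_type == b->ns_type &&
	       a->ns_inum == b->ns_inum;
}
```

Comparing two handles is a plain field comparison and pins no kernel resources, which is the property the pull message highlights.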

* tag 'namespace-6.18-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (65 commits)
  ns: drop assert
  ns: move ns type into struct ns_common
  nstree: make struct ns_tree private
  ns: add ns_debug()
  ns: simplify ns_common_init() further
  cgroup: add missing ns_common include
  ns: use inode initializer for initial namespaces
  selftests/namespaces: verify initial namespace inode numbers
  ns: rename to __ns_ref
  nsfs: port to ns_ref_*() helpers
  net: port to ns_ref_*() helpers
  uts: port to ns_ref_*() helpers
  ipv4: use check_net()
  net: use check_net()
  net-sysfs: use check_net()
  user: port to ns_ref_*() helpers
  time: port to ns_ref_*() helpers
  pid: port to ns_ref_*() helpers
  ipc: port to ns_ref_*() helpers
  cgroup: port to ns_ref_*() helpers
  ...
2025-09-29 11:20:29 -07:00
Linus Torvalds
722df25ddf kernel-6.18-rc1.clone3
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCaNZgMQAKCRCRxhvAZXjc
 ornXAP954dZjz+OJw6lJLCf0j9TXJOczGHvK3oW5ZD9KnqtTdwEA7p1A6WMOKJyl
 8VtTgCS0yNt8QlznUnsSDfVm0jXVGAY=
 =tUXG
 -----END PGP SIGNATURE-----

Merge tag 'kernel-6.18-rc1.clone3' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull copy_process updates from Christian Brauner:
 "This contains the changes to enable support for clone3() on nios2
  which apparently is still a thing.

  The more exciting part of this is that it cleans up the inconsistency
  in how the 64-bit flag argument is passed from copy_process() into the
  various other copy_*() helpers"
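
Why the width of the flag argument matters can be shown with a small userspace sketch. CLONE_CLEAR_SIGHAND lives above bit 31, so a helper that receives the flags through a 32-bit type never sees it. The demo_* names are hypothetical, and uint32_t merely models an architecture where sizeof(unsigned long) < sizeof(u64).

```c
#include <stdbool.h>
#include <stdint.h>

/* Same value as CLONE_CLEAR_SIGHAND in the Linux uapi headers. */
#define DEMO_CLONE_CLEAR_SIGHAND 0x100000000ULL

/* Models a copy_*() helper that takes the flags as a 32-bit unsigned long. */
static inline bool demo_sees_flag_narrow(uint32_t clone_flags)
{
	/* The upper 32 bits were already lost when the argument was passed. */
	return (clone_flags & DEMO_CLONE_CLEAR_SIGHAND) != 0;
}

/* Models the same helper after the flags are passed as u64 throughout. */
static inline bool demo_sees_flag_u64(uint64_t clone_flags)
{
	return (clone_flags & DEMO_CLONE_CLEAR_SIGHAND) != 0;
}
```

Passing the same 64-bit value to both helpers yields false from the narrow one and true from the u64 one.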

[ Fixed up rv ltl_monitor 32-bit support as per Sasha Levin in the merge ]

* tag 'kernel-6.18-rc1.clone3' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  nios2: implement architecture-specific portion of sys_clone3
  arch: copy_thread: pass clone_flags as u64
  copy_process: pass clone_flags as u64 across calltree
  copy_sighand: Handle architectures where sizeof(unsigned long) < sizeof(u64)
2025-09-29 10:36:50 -07:00
Linus Torvalds
b7ce6fa90f vfs-6.18-rc1.misc
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCaNZQMQAKCRCRxhvAZXjc
 omNLAQCgrwzd9sa1JTlixweu3OAxQlSEbLuMpEv7Ztm+B7Wz0AD9HtwPC44Kev03
 GbMcB2DCFLC4evqYECj6IG7NBmoKsAs=
 =1ICf
 -----END PGP SIGNATURE-----

Merge tag 'vfs-6.18-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull misc vfs updates from Christian Brauner:
 "This contains the usual selections of misc updates for this cycle.

  Features:

   - Add "initramfs_options" parameter to set initramfs mount options.
     This allows adding specific mount options to the rootfs, e.g., to
     limit the memory size

   - Add RWF_NOSIGNAL flag for pwritev2()

     This flag prevents the SIGPIPE signal from being raised when
     writing to disconnected pipes or sockets. The flag is handled
     directly by the pipe filesystem and converted to the existing
     MSG_NOSIGNAL flag for sockets

   - Allow passing a pid namespace as a procfs mount option

     Ever since the introduction of pid namespaces, procfs has had very
     implicit behaviour surrounding them (the pidns used by a procfs
     mount is auto-selected based on the mounting process's active
     pidns, and the pidns itself is basically hidden once the mount has
     been constructed)

     This implicit behaviour has historically meant that userspace was
     required to do some special dances in order to configure the pidns
     of a procfs mount as desired. Examples include:

     * In order to bypass the mnt_too_revealing() check, Kubernetes
       creates a procfs mount from an empty pidns so that user
       namespaced containers can be nested (without this, the nested
       containers would fail to mount procfs)

       But this requires forking off a helper process because you cannot
       just one-shot this using mount(2)

     * Container runtimes in general need to fork into a container
       before configuring its mounts, which can lead to security issues
       in the case of shared-pidns containers (a privileged process in
       the pidns can interact with your container runtime process)

       While SUID_DUMP_DISABLE and user namespaces make this less of an
       issue, the strict need for this due to a minor uAPI wart is kind
       of unfortunate

       Things would be much easier if there was a way for userspace to
       just specify the pidns they want. So this pull request contains
       changes to implement a new "pidns" argument which can be set
       using fsconfig(2):

           fsconfig(procfd, FSCONFIG_SET_FD, "pidns", NULL, nsfd);
           fsconfig(procfd, FSCONFIG_SET_STRING, "pidns", "/proc/self/ns/pid", 0);

       or classic mount(2) / mount(8):

           // mount -t proc -o pidns=/proc/self/ns/pid proc /tmp/proc
           mount("proc", "/tmp/proc", "proc", MS_..., "pidns=/proc/self/ns/pid");

  Cleanups:

   - Remove the last references to EXPORT_OP_ASYNC_LOCK

   - Make file_remove_privs_flags() static

   - Remove redundant __GFP_NOWARN when GFP_NOWAIT is used

   - Use try_cmpxchg() in start_dir_add()

   - Use try_cmpxchg() in sb_init_done_wq()

   - Replace offsetof() with struct_size() in ioctl_file_dedupe_range()

   - Remove vfs_ioctl() export

   - Replace rwlock with spinlock in epoll code, as rwlock causes
     priority inversion on PREEMPT_RT kernels

   - Make ns_entries in fs/proc/namespaces const

   - Use a switch() statement in init_special_inode() just like we do
     in may_open()

   - Use struct_size() in dir_add() in the initramfs code

   - Use str_plural() in rd_load_image()

   - Replace strcpy() with strscpy() in find_link()

   - Rename generic_delete_inode() to inode_just_drop() and
     generic_drop_inode() to inode_generic_drop()

   - Remove unused arguments from fcntl_{g,s}et_rw_hint()

  Fixes:

   - Document @name parameter for name_contains_dotdot() helper

   - Fix spelling mistake

   - Always return zero from replace_fd() instead of the file descriptor
     number

   - Limit the size for copy_file_range() in compat mode to prevent a
     signed overflow

   - Fix debugfs mount options not being applied

   - Verify the inode mode when loading it from disk in minixfs

   - Verify the inode mode when loading it from disk in cramfs

   - Don't trigger automounts with RESOLVE_NO_XDEV

     If openat2() was called with RESOLVE_NO_XDEV it didn't traverse
     through automounts, but could still trigger them

   - Add FL_RECLAIM flag to show_fl_flags() macro so it appears in
     tracepoints

   - Fix unused variable warning in rd_load_image() on s390

   - Make INITRAMFS_PRESERVE_MTIME depend on BLK_DEV_INITRD

   - Use ns_capable_noaudit() when determining net sysctl permissions

   - Don't call path_put() under namespace semaphore in listmount() and
     statmount()"

* tag 'vfs-6.18-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (38 commits)
  fcntl: trim arguments
  listmount: don't call path_put() under namespace semaphore
  statmount: don't call path_put() under namespace semaphore
  pid: use ns_capable_noaudit() when determining net sysctl permissions
  fs: rename generic_delete_inode() and generic_drop_inode()
  init: INITRAMFS_PRESERVE_MTIME should depend on BLK_DEV_INITRD
  initramfs: Replace strcpy() with strscpy() in find_link()
  initrd: Use str_plural() in rd_load_image()
  initramfs: Use struct_size() helper to improve dir_add()
  initrd: Fix unused variable warning in rd_load_image() on s390
  fs: use the switch statement in init_special_inode()
  fs/proc/namespaces: make ns_entries const
  filelock: add FL_RECLAIM to show_fl_flags() macro
  eventpoll: Replace rwlock with spinlock
  selftests/proc: add tests for new pidns APIs
  procfs: add "pidns" mount option
  pidns: move is-ancestor logic to helper
  openat2: don't trigger automounts with RESOLVE_NO_XDEV
  namei: move cross-device check to __traverse_mounts
  namei: remove LOOKUP_NO_XDEV check from handle_mounts
  ...
2025-09-29 09:03:07 -07:00
Linus Torvalds
3a654ee549 block-6.17-20250925
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmjWKd4QHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgphViEADJF/LndvCUkcNFagHcmAHoJWXajelT/c2w
 yUfiP5kH3nyjC+eXUNT819QyXNxgr7jhJWtOSnWXGcX28aHEx1BuCQSbCW7L0VxR
 DtEfipBeSDt/ixMMokgV0fG/je9D8s4xHLUqFW3tni91jK73G9JxXZoQzj1lQ5t8
 qNDGH4Z5BZvX/ClxWXZf1tIPWy1Bbjx8YncJQCDxg9C1fOJXlP9NuWBk7iL5svPV
 trgIExix3YoyjOk9d5/P6604wbzTffF8CGoPJEC08LZxxjXkob/ipsd6+Wv1aCyF
 3RNIX5bsoN/u0uabyfh5imYxGOkesqqK96sTz+pOExTALNtwKerdmTV790tq73yG
 EeNDAHkRve5xBAgHwRlbU9sH6mQypzhiR7DaLXe/INKp6rUOMOhH3JgYwycrzvnC
 bDgI0kbs1IrEk/rr1yGupu0Fqav30yWlgQ13vVu3rwp2cGabTnmoOPl34siWjEc9
 XL1Q0ftsBtXPxOomKYIDatBCiN8i33/KZ5/IGhDH6qO2or0ydzjK1yEeuDcIDClg
 HCfGCnh5Rs11c4iiMjSBSwmWjwwbciZ4XV5XM9VqcqRDeKg0XHb/jaRYYM3UGu37
 nTUmJlLN/S90S9tYTd9b7iUlpq5PKZV3TvSW1QeZVxelTgRw7MIwyWGz4qH6XUc3
 Im0RjM/ZiA==
 =AKR9
 -----END PGP SIGNATURE-----

Merge tag 'block-6.17-20250925' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux

Pull block fixes from Jens Axboe:
 "A regression fix for this series where an attempt to silence an EOD
  error got messed up a bit, and then a change of git trees for the
  block and io_uring trees.

  Switching the git trees to kernel.org now, as I've just about had it
  trying to battle AI bots that bring the box to its knees, continually.
  At least I don't have to maintain the kernel.org side"

* tag 'block-6.17-20250925' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux:
  MAINTAINERS: update io_uring and block tree git trees
  block: fix EOD return for device with nr_sectors == 0
2025-09-26 09:46:51 -07:00
Yu Kuai
5d726c4dbe blk-cgroup: fix possible deadlock while configuring policy
The following deadlock warning can easily be triggered with lockdep enabled:

WARNING: possible circular locking dependency detected
6.17.0-rc3-00124-ga12c2658ced0 #1665 Not tainted
------------------------------------------------------
check/1334 is trying to acquire lock:
ff1100011d9d0678 (&q->sysfs_lock){+.+.}-{4:4}, at: blk_unregister_queue+0x53/0x180

but task is already holding lock:
ff1100011d9d00e0 (&q->q_usage_counter(queue)#3){++++}-{0:0}, at: del_gendisk+0xba/0x110

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

-> #2 (&q->q_usage_counter(queue)#3){++++}-{0:0}:
       blk_queue_enter+0x40b/0x470
       blkg_conf_prep+0x7b/0x3c0
       tg_set_limit+0x10a/0x3e0
       cgroup_file_write+0xc6/0x420
       kernfs_fop_write_iter+0x189/0x280
       vfs_write+0x256/0x490
       ksys_write+0x83/0x190
       __x64_sys_write+0x21/0x30
       x64_sys_call+0x4608/0x4630
       do_syscall_64+0xdb/0x6b0
       entry_SYSCALL_64_after_hwframe+0x76/0x7e

-> #1 (&q->rq_qos_mutex){+.+.}-{4:4}:
       __mutex_lock+0xd8/0xf50
       mutex_lock_nested+0x2b/0x40
       wbt_init+0x17e/0x280
       wbt_enable_default+0xe9/0x140
       blk_register_queue+0x1da/0x2e0
       __add_disk+0x38c/0x5d0
       add_disk_fwnode+0x89/0x250
       device_add_disk+0x18/0x30
       virtblk_probe+0x13a3/0x1800
       virtio_dev_probe+0x389/0x610
       really_probe+0x136/0x620
       __driver_probe_device+0xb3/0x230
       driver_probe_device+0x2f/0xe0
       __driver_attach+0x158/0x250
       bus_for_each_dev+0xa9/0x130
       driver_attach+0x26/0x40
       bus_add_driver+0x178/0x3d0
       driver_register+0x7d/0x1c0
       __register_virtio_driver+0x2c/0x60
       virtio_blk_init+0x6f/0xe0
       do_one_initcall+0x94/0x540
       kernel_init_freeable+0x56a/0x7b0
       kernel_init+0x2b/0x270
       ret_from_fork+0x268/0x4c0
       ret_from_fork_asm+0x1a/0x30

-> #0 (&q->sysfs_lock){+.+.}-{4:4}:
       __lock_acquire+0x1835/0x2940
       lock_acquire+0xf9/0x450
       __mutex_lock+0xd8/0xf50
       mutex_lock_nested+0x2b/0x40
       blk_unregister_queue+0x53/0x180
       __del_gendisk+0x226/0x690
       del_gendisk+0xba/0x110
       sd_remove+0x49/0xb0 [sd_mod]
       device_remove+0x87/0xb0
       device_release_driver_internal+0x11e/0x230
       device_release_driver+0x1a/0x30
       bus_remove_device+0x14d/0x220
       device_del+0x1e1/0x5a0
       __scsi_remove_device+0x1ff/0x2f0
       scsi_remove_device+0x37/0x60
       sdev_store_delete+0x77/0x100
       dev_attr_store+0x1f/0x40
       sysfs_kf_write+0x65/0x90
       kernfs_fop_write_iter+0x189/0x280
       vfs_write+0x256/0x490
       ksys_write+0x83/0x190
       __x64_sys_write+0x21/0x30
       x64_sys_call+0x4608/0x4630
       do_syscall_64+0xdb/0x6b0
       entry_SYSCALL_64_after_hwframe+0x76/0x7e

other info that might help us debug this:

Chain exists of:
  &q->sysfs_lock --> &q->rq_qos_mutex --> &q->q_usage_counter(queue)#3

 Possible unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  lock(&q->q_usage_counter(queue)#3);
                               lock(&q->rq_qos_mutex);
                               lock(&q->q_usage_counter(queue)#3);
  lock(&q->sysfs_lock);

The root cause is that q_usage_counter is grabbed with rq_qos_mutex
held in blkg_conf_prep(), while in other contexts the queue is frozen
before rq_qos_mutex is taken.

The blk_queue_enter() call in blkg_conf_prep() is used to protect against
policy deactivation, which is already protected by blkcg_mutex. Hence
replace blk_queue_enter() with blkcg_mutex to fix this problem. Since
blkcg_mutex is taken after the queue is frozen during policy
deactivation, also convert blkg_alloc() to use GFP_NOIO.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-23 05:22:14 -06:00
Yu Kuai
670bfe6838 blk-mq: fix null-ptr-deref in blk_mq_free_tags() from error path
blk_mq_free_tags() can be called after blk_mq_init_tags(), while
tags->page_list is still not initialized, causing null-ptr-deref.

Fix this problem by initializing tags->page_list in blk_mq_init_tags().
Also free the tags directly on the error path, since there is no SRCU
barrier on that path.

Fixes: ad0d05dbdd ("blk-mq: Defer freeing of tags page_list to SRCU callback")
Reported-by: syzbot+5c5d41e80248d610221f@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/68d1b079.a70a0220.1b52b.0000.GAE@google.com/
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-23 01:35:52 -06:00
Bart Van Assche
fea55691ac blk-mq: Fix more tag iteration function documentation
Commit 8ab30a3319 ("blk-mq: Drop busy_iter_fn blk_mq_hw_ctx argument")
removed the hctx argument from the callback functions called by
bt_for_each() and blk_mq_queue_tag_busy_iter(). Commit 2dd6532e95
("blk-mq: Drop 'reserved' arg of busy_tag_iter_fn") removed the
'reserved' argument of the busy_tag_iter_fn function pointer type. Bring
the documentation of the tag iteration functions in sync with these
changes.

Cc: John Garry <john.g.garry@oracle.com>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-23 00:20:31 -06:00
Caleb Sander Mateos
ef9f603fd3 io_uring/cmd: drop unused res2 param from io_uring_cmd_done()
Commit 79525b51ac ("io_uring: fix nvme's 32b cqes on mixed cq") split
out a separate io_uring_cmd_done32() helper for ->uring_cmd()
implementations that return 32-byte CQEs. The res2 value passed to
io_uring_cmd_done() is now unused because __io_uring_cmd_done() ignores
it when is_cqe32 is passed as false. So drop the parameter from
io_uring_cmd_done() to simplify the callers and clarify that it's not
possible to return an extra value beyond the 32-bit CQE result.

Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-23 00:15:02 -06:00
Jens Axboe
ab073abf6d block: fix EOD return for device with nr_sectors == 0
A recent commit skipped dumping the usual "attempt to access beyond end
of device" message if the device size is 0 sectors, as that's a common
pattern for devices that have been hot removed. But while it stopped
that message, it also prevented returning -EIO for that condition.
Reinstate the -EIO return, while retaining the quiet operation for
triggering EOD for a device with 0 sectors.
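
The intended behaviour can be sketched in userspace C as below: the range check always fails with -EIO when the access runs past the end of the device, and only the warning is suppressed for a zero-sized device. demo_check_eod() and demo_eod_warnings are hypothetical names, and the counter stands in for the kernel's rate-limited log message.

```c
#include <errno.h>
#include <stdint.h>

/* Counts how often the (modelled) "beyond end of device" warning fires. */
static int demo_eod_warnings;

/*
 * Returns 0 if [sector, sector + nr) fits on a device of max_sectors
 * sectors, -EIO otherwise. A zero-sized device fails quietly.
 */
static inline int demo_check_eod(uint64_t max_sectors, uint64_t sector,
				 uint64_t nr)
{
	if (nr == 0)
		return 0;
	if (nr <= max_sectors && sector <= max_sectors - nr)
		return 0;

	/* Stay quiet for hot-removed (0 sector) devices, but still fail. */
	if (max_sectors != 0)
		demo_eod_warnings++;
	return -EIO;
}
```

The regression described above corresponds to returning 0 in the quiet branch instead of falling through to -EIO.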

Reported-by: syzbot+4b12286339fe4c2700c1@syzkaller.appspotmail.com
Reported-by: Sahil Chandna <chandna.linuxkernel@gmail.com>
Fixes: d0a2b527d8 ("block: tone down bio_check_eod")
Tested-by: Sahil Chandna <chandna.linuxkernel@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-22 09:35:24 -06:00
Christian Brauner
7914f15c5e
Merge branch 'no-rebase-mnt_ns_tree_remove'
Bring in the fix for removing a mount namespace from the mount namespace
rbtree and list.

Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-09-19 14:26:14 +02:00
Christian Brauner
fa8ee8627b
block: use extensible_ioctl_valid()
Use the new extensible_ioctl_valid() helper which is equivalent to what
is done here.

Reviewed-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-09-19 14:26:05 +02:00
Yu Kuai
336aec7b06 blk-throttle: fix throtl_data leak during disk release
Tightening the throttle activation check in blk_throtl_activated() to
require both q->td presence and policy bit set introduced a memory leak
during disk release:

blkg_destroy_all() clears the policy bit first during queue deactivation,
causing subsequent blk_throtl_exit() to skip throtl_data cleanup when
blk_throtl_activated() fails policy check.

Ideally we should avoid modifying the activation check in blk_throtl_exit(),
because it is intuitive that blk-throtl starts from blk_throtl_init() and
ends in blk_throtl_exit(). However, calling blk_throtl_exit() before
blkg_destroy_all() would make a long-standing deadlock problem easier to
trigger[1]. Hence fix this problem by checking whether q->td is NULL in
blk_throtl_exit(), and remove the policy deactivation as well since it's
useless.

[1] https://lore.kernel.org/all/CAHj4cs9p9H5yx+ywsb3CMUdbqGPhM+8tuBvhW=9ADiCjAqza9w@mail.gmail.com/#t

Fixes: bd9fd5be6b ("blk-throttle: fix access race during throttle policy activation")
Reported-by: Yi Zhang <yi.zhang@redhat.com>
Closes: https://lore.kernel.org/all/CAHj4cs-p-ZwBEKigBj7T6hQCOo-H68-kVwCrV6ZvRovrr9Z+HA@mail.gmail.com/
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-17 07:27:29 -06:00
Bart Van Assche
0b507305a0 blk-mq: Fix the blk_mq_tagset_busy_iter() documentation
Commit 2dd6532e95 ("blk-mq: Drop 'reserved' arg of busy_tag_iter_fn")
removed the 'reserved' argument from tag iteration callback functions.
Bring the blk_mq_tagset_busy_iter() documentation in sync with that
change.

Cc: Jens Axboe <axboe@kernel.dk>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: John Garry <john.g.garry@oracle.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-17 07:26:29 -06:00
John Garry
da7b97ba0d block: relax atomic write boundary vs chunk size check
blk_validate_atomic_write_limits() ensures that any boundary fits into
and is aligned to any chunk size.

However, it should also be possible to fit the chunk size into any
boundary. That check is already made in
blk_stack_atomic_writes_boundary_head().

Relax the check in blk_validate_atomic_write_limits() by reusing (and
renaming) blk_stack_atomic_writes_boundary_head().
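
One way to read the relaxed rule is that the larger of the two values must be a whole multiple of the smaller one, in either direction. The sketch below encodes that reading; demo_boundary_chunk_compatible() is a hypothetical name, and treating 0 as "no limit set" is an assumption, so the actual kernel helper may differ in detail.

```c
#include <stdbool.h>

/*
 * Returns true if an atomic write boundary and a chunk size can coexist:
 * either the boundary fits evenly into the chunk, or the chunk fits
 * evenly into the boundary. A value of 0 is taken to mean "not set".
 */
static inline bool demo_boundary_chunk_compatible(unsigned int chunk_sectors,
						  unsigned int boundary_sectors)
{
	if (!chunk_sectors || !boundary_sectors)
		return true;
	if (boundary_sectors > chunk_sectors)
		return boundary_sectors % chunk_sectors == 0;
	return chunk_sectors % boundary_sectors == 0;
}
```

Under this reading a 128-sector chunk with a 32-sector boundary passes, and so does the reverse, while 96 and 64 do not.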

Signed-off-by: John Garry <john.g.garry@oracle.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-16 12:29:10 -06:00
John Garry
f2d8c5a2f7 block: fix stacking of atomic writes when atomics are not supported
Atomic writes support may not always be possible when stacking devices
which support atomic writes. One such case is a different atomic write
boundary between stacked devices (which is not supported).

In the case that atomic writes cannot be supported, the top device queue
HW limits are set to 0.

However, in blk_stack_atomic_writes_limits(), we detect that we are
stacking the first bottom device by checking the top device
atomic_write_hw_max value == 0. This gets confused with the case of atomic
writes not supported, above.

Make the distinction between stacking the first bottom device and no
atomics supported by initializing stacked device atomic_write_hw_max =
UINT_MAX and checking that for stacking the first bottom device.
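
The sentinel idea can be sketched as follows. This is a heavily simplified model with hypothetical names (demo_limits, demo_stack_init(), demo_stack_one()); the real stacking code also deals with boundaries, units and minimums.

```c
#include <limits.h>

/* Only the one field needed to show the sentinel. */
struct demo_limits {
	unsigned int atomic_write_hw_max;
};

/* UINT_MAX means "no bottom device stacked yet"; 0 means "unsupported". */
static inline void demo_stack_init(struct demo_limits *top)
{
	top->atomic_write_hw_max = UINT_MAX;
}

/* Stack one bottom device; bottom_max == 0 means it lacks atomic writes. */
static inline void demo_stack_one(struct demo_limits *top,
				  unsigned int bottom_max)
{
	if (top->atomic_write_hw_max == UINT_MAX) {
		/* First bottom device: inherit its limit, even if it is 0. */
		top->atomic_write_hw_max = bottom_max;
		return;
	}
	if (top->atomic_write_hw_max == 0 || bottom_max == 0) {
		/* Once unsupported, the stacked device stays unsupported. */
		top->atomic_write_hw_max = 0;
		return;
	}
	if (bottom_max < top->atomic_write_hw_max)
		top->atomic_write_hw_max = bottom_max;
}
```

With 0 as the "first device" marker, stacking an unsupported device followed by a supported one would wrongly re-enable atomic writes; with UINT_MAX the limit stays 0.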

Fixes: d7f36dc446 ("block: Support atomic writes limits for stacked devices")
Signed-off-by: John Garry <john.g.garry@oracle.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-16 12:29:10 -06:00
John Garry
bfd4037296 block: update validation of atomic writes boundary for stacked devices
In commit 63d092d1c1 ("block: use chunk_sectors when evaluating stacked
atomic write limits"), a chunk sectors limit check was missed in
blk_stack_atomic_writes_boundary_head(), so update that function to
do the proper check.

Fixes: 63d092d1c1 ("block: use chunk_sectors when evaluating stacked atomic write limits")
Signed-off-by: John Garry <john.g.garry@oracle.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-16 12:29:10 -06:00
chengkaitao
74b1db8684 block/mq-deadline: Remove the redundant rb_entry_rq in the deadline_from_pos().
In commit fde02699c2, the "if (blk_rq_is_seq_zoned_write(rq))" check
was removed, but the "rb_entry_rq(node)" call and some other code were
inadvertently left behind. Remove them.

Signed-off-by: chengkaitao <chengkaitao@kylinos.cn>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Li Nan <linan122@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-15 13:00:05 -06:00
Mateusz Guzik
f99b391778
fs: rename generic_delete_inode() and generic_drop_inode()
generic_delete_inode() is rather misleading for what the routine is
doing. inode_just_drop() should be much clearer.

The new naming is inconsistent with generic_drop_inode(), so rename that
one as well, with inode_ as the prefix.

No functional changes.

Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-09-15 16:09:42 +02:00
Pankaj Raghav
ea5e101fb6 block: use largest_zero_folio in __blkdev_issue_zero_pages()
Use largest_zero_folio() in __blkdev_issue_zero_pages().  On systems with
CONFIG_PERSISTENT_HUGE_ZERO_FOLIO enabled, we will end up sending larger
bvecs instead of multiple small ones.

Noticed a 4% increase in performance on a commercial NVMe SSD which does
not support OP_WRITE_ZEROES.  The device's MDTS was 128K.  The performance
gains might be bigger if the device supports bigger MDTS.

Link: https://lkml.kernel.org/r/20250811084113.647267-6-kernel@pankajraghav.com
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
Acked-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Cc: Kiryl Shutsemau <kirill@shutemov.name>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: "Darrick J. Wong" <djwong@kernel.org>
Cc: Dev Jain <dev.jain@arm.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Luis Chamberlain <mcgrof@kernel.org>
Cc: Mariano Pache <npache@redhat.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: "Ritesh Harjani (IBM)" <ritesh.list@gmail.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-13 16:54:54 -07:00
Yu Kuai
9784041145 blk-mq: remove blk_mq_tag_update_depth()
This helper is not used now.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-10 05:25:56 -06:00
Yu Kuai
b86433721f blk-mq: fix potential deadlock while nr_requests grown
Allocating and freeing sched_tags while the queue is frozen can
deadlock[1]. This is a long-standing problem, hence allocate memory
before freezing the queue and free memory after the queue is unfrozen.
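
The ordering can be sketched in userspace C as below. Everything here is a model with hypothetical names: the frozen flag stands in for queue freezing, and calloc()/free() stand in for the sched_tags allocation and release.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>

struct demo_queue {
	bool frozen;
	unsigned int nr_requests;
	int *sched_tags;         /* stand-in for the real tag structures */
	int allocs_while_frozen; /* should stay 0 with the fixed ordering */
};

static inline int *demo_alloc_tags(struct demo_queue *q, unsigned int nr)
{
	if (q->frozen)
		q->allocs_while_frozen++; /* the pattern that risks deadlock */
	return calloc(nr, sizeof(int));
}

static inline int demo_grow_nr_requests(struct demo_queue *q, unsigned int nr)
{
	int *new_tags = demo_alloc_tags(q, nr); /* 1) allocate before freeze */
	int *old_tags;

	if (!new_tags)
		return -ENOMEM;

	q->frozen = true;          /* 2) freeze: only swap pointers here */
	old_tags = q->sched_tags;
	q->sched_tags = new_tags;
	q->nr_requests = nr;
	q->frozen = false;         /* 3) unfreeze */

	free(old_tags);            /* 4) free after unfreeze */
	return 0;
}
```

Nothing inside the frozen section can block on memory reclaim, which is the point of moving the allocation and the free outside of it.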

[1] https://lore.kernel.org/all/0659ea8d-a463-47c8-9180-43c719e106eb@linux.ibm.com/
Fixes: e3a2b3f931 ("blk-mq: allow changing of queue depth through sysfs")

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-10 05:25:56 -06:00
Yu Kuai
6293e336f6 blk-mq-sched: add new parameter nr_requests in blk_mq_alloc_sched_tags()
This helper only supports allocating the default number of requests;
add a new parameter to support a specific number of requests.

Prepare to fix a potential deadlock in the case where nr_requests grows.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-10 05:25:56 -06:00
Yu Kuai
e632004044 blk-mq: split bitmap grow and resize case in blk_mq_update_nr_requests()
No functional changes are intended, make code cleaner and prepare to fix
the grow case in following patches.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-10 05:25:56 -06:00
Yu Kuai
7f2799c546 blk-mq: cleanup shared tags case in blk_mq_update_nr_requests()
For shared tags case, all hctx->sched_tags/tags are the same, it doesn't
make sense to call into blk_mq_tag_update_depth() multiple times for the
same tags.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-10 05:25:56 -06:00
Yu Kuai
626ff4f8eb blk-mq: convert to serialize updating nr_requests with update_nr_hwq_lock
request_queue->nr_requests can be changed by:

a) switch elevator by updating nr_hw_queues
b) switch elevator by elevator sysfs attribute
c) configure queue sysfs attribute nr_requests

Current lock order is:

1) update_nr_hwq_lock, case a,b
2) freeze_queue
3) elevator_lock, case a,b,c

Updating nr_requests is already serialized by elevator_lock. However,
in case c, we have to allocate new sched_tags if nr_requests grows, and
doing this with elevator_lock held and the queue frozen carries the
risk of deadlock.

Hence use update_nr_hwq_lock instead. This makes it possible to allocate
memory if tags grow, while still preventing nr_requests from being
changed concurrently.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-10 05:25:56 -06:00
Yu Kuai
b46d4c447d blk-mq: check invalid nr_requests in queue_requests_store()
queue_requests_store() is the only caller of
blk_mq_update_nr_requests(), and blk_mq_update_nr_requests() is the
only caller of blk_mq_tag_update_depth(); however, they all check the
nr_requests value input by the user.

Make the code cleaner by moving all the checks to the top function:

1) nr_requests > reserved tags;
2) if there is elevator, 4 <= nr_requests <= 2048;
3) if elevator is none, 4 <= nr_requests <= tag_set->queue_depth;

Meanwhile, case 2 is the only case where tags can grow and -ENOMEM might
be returned.
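
The three checks listed above can be sketched as a small user-space model.
This is only an illustration under stated assumptions, not the kernel code:
validate_nr_requests() and its parameters are hypothetical names, the limits
4 and 2048 are taken from the list above, and the sketch simply rejects any
out-of-range value (the message does not say whether the kernel rejects or
clamps).

```python
import errno

MIN_RQ = 4            # lower bound from the list above
MAX_SCHED_RQ = 2048   # upper bound when an elevator is attached


def validate_nr_requests(nr, reserved_tags, has_elevator, queue_depth):
    """Return 0 if nr passes the three checks, else -EINVAL (model only)."""
    # 1) nr_requests must exceed the number of reserved tags
    if nr <= reserved_tags:
        return -errno.EINVAL
    # 2) with an elevator the upper bound is 2048,
    # 3) without one it is the tag set's queue depth
    upper = MAX_SCHED_RQ if has_elevator else queue_depth
    if nr < MIN_RQ or nr > upper:
        return -errno.EINVAL
    return 0
```

For instance, validate_nr_requests(256, 1, True, 64) passes, while the same
value without an elevator fails because it exceeds the queue depth of 64.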

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-10 05:25:56 -06:00
Yu Kuai
8bd7195fea blk-mq: remove useless checkings in blk_mq_update_nr_requests()
1) queue_requests_store() is the only caller of
blk_mq_update_nr_requests(), where the queue is already frozen, so there
is no need to check mq_freeze_depth;
2) q->tag_set must be set for a request-based device, and queue_is_mq() is
already checked in blk_mq_queue_attr_visible(), so there is no need to
check q->tag_set.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-10 05:25:56 -06:00
Yu Kuai
dc1dd13d44 blk-mq: remove useless checking in queue_requests_store()
blk_mq_queue_attr_visible() already checked queue_is_mq(), no need to
check this again in queue_requests_store().

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-10 05:25:56 -06:00
Yu Kuai
b2f5974079 block: fix ordering of recursive split IO
Currently, a split bio is chained to the original bio, and the original
bio is resubmitted to the tail of current->bio_list, waiting for the
split bio to be issued. However, if the split bio gets split again, the
IO order is messed up. On the one hand, this causes performance
degradation, especially for mdraid with large IO sizes; on the other
hand, it causes write errors for zoned block devices[1].

For example, in raid456 IO is first split by max_sectors from
md_submit_bio(), and then later split again by chunksize for internal
handling.

Assume max_sectors is 1M and chunksize is 512k:

1) issue a 2M IO:

bio issuing: 0+2M
current->bio_list: NULL

2) md_submit_bio() split by max_sectors:

bio issuing: 0+1M
current->bio_list: 1M+1M

3) chunk_aligned_read() split by chunksize:

bio issuing: 0+512k
current->bio_list: 1M+1M -> 512k+512k

4) after the first bio is issued, __submit_bio_noacct() continues issuing
the next bio:

bio issuing: 1M+1M
current->bio_list: 512k+512k
bio issued: 0+512k

5) chunk_aligned_read() split by chunksize:

bio issuing: 1M+512k
current->bio_list: 512k+512k -> 1536k+512k
bio issued: 0+512k

6) no split afterwards, finally the issue order is:

0+512k -> 1M+512k -> 512k+512k -> 1536k+512k

This behaviour causes a large IO read on raid456 to end up as small
discontinuous IOs on the underlying disks. Fix this problem by placing
the split bio at the head of current->bio_list.
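
The walkthrough above can be reproduced with a small user-space model. This
is a simplified sketch, not the kernel implementation: issue_order() is a
hypothetical name, current->bio_list is modelled as a single deque, bios are
(offset, length) pairs in KB, and the two split levels stand in for
md_submit_bio() and chunk_aligned_read().

```python
from collections import deque


def issue_order(total_kb, max_kb, chunk_kb, at_head):
    """Return the order in which (offset_kb, length_kb) bios get issued."""
    bio_list = deque([(0, total_kb)])   # models current->bio_list
    issued = []
    while bio_list:
        offset, length = bio_list.popleft()
        # Two split levels: first by max_sectors, then by chunk size.
        for limit in (max_kb, chunk_kb):
            if length > limit:
                remainder = (offset + limit, length - limit)
                if at_head:
                    bio_list.appendleft(remainder)   # after the fix
                else:
                    bio_list.append(remainder)       # before the fix
                length = limit
        issued.append((offset, length))
    return issued


print(issue_order(2048, 1024, 512, at_head=False))
# [(0, 512), (1024, 512), (512, 512), (1536, 512)]
print(issue_order(2048, 1024, 512, at_head=True))
# [(0, 512), (512, 512), (1024, 512), (1536, 512)]
```

With tail insertion the model yields the same discontinuous order as step 6;
with head insertion the four 512k pieces are issued sequentially.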

Test script: test on 8 disk raid5 with 64k chunksize
dd if=/dev/md0 of=/dev/null bs=4480k iflag=direct

Test results:
Before this patch
1) iostat results:
Device            r/s     rMB/s   rrqm/s  %rrqm r_await rareq-sz  aqu-sz  %util
md0           52430.00   3276.87     0.00   0.00    0.62    64.00   32.60  80.10
sd*           4487.00    409.00  2054.00  31.40    0.82    93.34    3.68  71.20
2) blktrace G stage:
  8,0    0   486445    11.357392936   843  G   R 14071424 + 128 [dd]
  8,0    0   486451    11.357466360   843  G   R 14071168 + 128 [dd]
  8,0    0   486454    11.357515868   843  G   R 14071296 + 128 [dd]
  8,0    0   486468    11.357968099   843  G   R 14072192 + 128 [dd]
  8,0    0   486474    11.358031320   843  G   R 14071936 + 128 [dd]
  8,0    0   486480    11.358096298   843  G   R 14071552 + 128 [dd]
  8,0    0   486490    11.358303858   843  G   R 14071808 + 128 [dd]
3) io seek for sdx:
Note that io seek is computed from the blktrace D stage, as the statistic:
ABS((offset of next IO) - (offset + len of previous IO))

Read|Write seek
cnt 55175, zero cnt 25079
    >=(KB) .. <(KB)     : count       ratio |distribution                            |
         0 .. 1         : 25079       45.5% |########################################|
         1 .. 2         : 0            0.0% |                                        |
         2 .. 4         : 0            0.0% |                                        |
         4 .. 8         : 0            0.0% |                                        |
         8 .. 16        : 0            0.0% |                                        |
        16 .. 32        : 0            0.0% |                                        |
        32 .. 64        : 12540       22.7% |#####################                   |
        64 .. 128       : 2508         4.5% |#####                                   |
       128 .. 256       : 0            0.0% |                                        |
       256 .. 512       : 10032       18.2% |#################                       |
       512 .. 1024      : 5016         9.1% |#########                               |
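
As an illustration of the seek statistic defined above, the following sketch
applies the formula to the first three G-stage entries of the "before" trace.
seek_distances() is a hypothetical helper, and because the histogram is built
from the D stage rather than the G stage, these values are not expected to
match it; the point is only the arithmetic.

```python
def seek_distances(ios):
    """ios: (offset, length) pairs in sectors, in issue order.

    Returns ABS(next offset - (previous offset + previous length)) for
    each consecutive pair.
    """
    return [abs(next_off - (prev_off + prev_len))
            for (prev_off, prev_len), (next_off, _) in zip(ios, ios[1:])]


before = [(14071424, 128), (14071168, 128), (14071296, 128)]
print(seek_distances(before))   # [384, 0]
```

A result of 0 means the next IO starts exactly where the previous one ended,
which corresponds to the "zero cnt" figure in the histogram.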

After this patch:
1) iostat results:
Device            r/s     rMB/s   rrqm/s  %rrqm r_await rareq-sz  aqu-sz  %util
md0           87965.00   5271.88     0.00   0.00    0.16    61.37   14.03  90.60
sd*           6020.00    658.44  5117.00  45.95    0.44   112.00    2.68  86.50
2) blktrace G stage:
  8,0    0   206296     5.354894072   664  G   R 7156992 + 128 [dd]
  8,0    0   206305     5.355018179   664  G   R 7157248 + 128 [dd]
  8,0    0   206316     5.355204438   664  G   R 7157504 + 128 [dd]
  8,0    0   206319     5.355241048   664  G   R 7157760 + 128 [dd]
  8,0    0   206333     5.355500923   664  G   R 7158016 + 128 [dd]
  8,0    0   206344     5.355837806   664  G   R 7158272 + 128 [dd]
  8,0    0   206353     5.355960395   664  G   R 7158528 + 128 [dd]
  8,0    0   206357     5.356020772   664  G   R 7158784 + 128 [dd]
3) io seek for sdx
Read|Write seek
cnt 28644, zero cnt 21483
    >=(KB) .. <(KB)     : count       ratio |distribution                            |
         0 .. 1         : 21483       75.0% |########################################|
         1 .. 2         : 0            0.0% |                                        |
         2 .. 4         : 0            0.0% |                                        |
         4 .. 8         : 0            0.0% |                                        |
         8 .. 16        : 0            0.0% |                                        |
        16 .. 32        : 0            0.0% |                                        |
        32 .. 64        : 7161        25.0% |##############                          |

BTW, this looks like a long-standing problem from day one, and large
sequential IO reads are a pretty common case, e.g. video playback.

Even with this patch, IO in this test case is merged to at most 128k
due to the block layer plug limit BLK_PLUG_FLUSH_SIZE; increasing that
limit can give even better performance. However, we'll figure out how to
do this properly later.

[1] https://lore.kernel.org/all/e40b076d-583d-406b-b223-005910a9f46f@acm.org/

Fixes: d89d87965d ("When stacked block devices are in-use (e.g. md or dm), the recursive calls")
Reported-by: Tie Ren <tieren@fnnas.com>
Closes: https://lore.kernel.org/all/7dro5o7u5t64d6bgiansesjavxcuvkq5p2pok7dtwkav7b7ape@3isfr44b6352/
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-10 05:23:46 -06:00
Yu Kuai
0b64682e78 block: skip unnecessary checks for split bio
Lots of checks are already done while submitting this bio the first
time, and there is no need to check them again when this bio is
resubmitted after split.

Hence open-code should_fail_bio() and blk_throtl_bio(), which are still
necessary, in bio_submit_split_bioset().

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-10 05:23:46 -06:00
Yu Kuai
e3290419d9 blk-crypto: convert to use bio_submit_split_bioset()
Unify bio split code, prepare to fix ordering of split IO.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-10 05:23:46 -06:00
Yu Kuai
e37b5596a1 block: factor out a helper bio_submit_split_bioset()
No functional changes are intended. Some drivers like mdraid split bios
during internal processing; prepare to unify the bio split code.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-10 05:23:45 -06:00
Yu Kuai
06d712d297 blk-crypto: fix missing blktrace bio split events
trace_block_split() is missing, so blktrace cannot catch BIO split
events, which makes it harder to analyze the BIO sequence.

Cc: stable@vger.kernel.org
Fixes: 488f6682c8 ("block: blk-crypto-fallback for Inline Encryption")
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-10 05:23:45 -06:00
Yu Kuai
ea3d1f104d blk-mq: add QUEUE_FLAG_BIO_ISSUE_TIME
bio->issue_time_ns is initialized for every bio; however, it's only used
by blk-iolatency. Add a new queue_flag and only set this flag when
blk-iolatency is enabled, so that the extra blk_time_get_ns() call is
avoided for disks on which blk-iolatency is not enabled.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-10 05:23:45 -06:00
Yu Kuai
1f963bdd64 block: initialize bio issue time in blk_mq_submit_bio()
bio->issue_time_ns is only used by blk-iolatency, which can only be
enabled for rq-based disk, hence it's not necessary to initialize
the time for bio-based disk.

Meanwhile, if a bio is split by blk_crypto_fallback_split_bio_if_needed(),
the issue time is not initialized for the new split bio; this is fixed
as well.

Note that the next patch optimizes this further, so that the bio issue
time is only recorded when blk-iolatency is actually enabled for the disk.

Fixes: 488f6682c8 ("block: blk-crypto-fallback for Inline Encryption")
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-10 05:23:45 -06:00
Yu Kuai
1733e88874 block: cleanup bio_issue
Now that bio->bi_issue is only used by blk-iolatency to get bio issue
time, replace bio_issue with u64 time directly and remove bio_issue to
make code cleaner.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-10 05:23:45 -06:00
Keith Busch
d0d1d52231 blk-map: provide the bdev to bio if one exists
We can now safely provide a block device when extracting user pages for
driver and user passthrough commands. Set the bdev so the caller doesn't
have to do that later. This has an additional benefit of being able to
extract P2P pages in the passthrough path.

Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-09 10:35:28 -06:00
Keith Busch
d57447ffb5 blk-mq-dma: bring back p2p request flags
We only need to consider data and metadata dma mapping types separately.
The request and bio integrity payload have enough flag bits to
internally track the mapping type for each. Use these so the caller
doesn't need to track them, and provide separate request and integrity
helpers to the common code. This will make it easier to scale new
mappings, like the proposed MMIO attribute, without burdening the caller
to track such things.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-09 10:33:35 -06:00
Keith Busch
05ceea5d3e blk-integrity: enable p2p source and destination
Set the extraction flags to allow p2p pages for the metadata buffer if
the block device allows it. Similar to data payloads, ensure the bio
does not use merging if we see a p2p page.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-09 10:33:27 -06:00
Keith Busch
69d7ed5b9e blk-integrity: use simpler alignment check
We're checking length and addresses against the same alignment value, so
use the simpler iterator check.

Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-09 10:27:01 -06:00
Keith Busch
5ff3f74e14 block: simplify direct io validity check
The block layer checks all the segments for validity later, so no need
for an early check. Just reduce it to a simple position and total length
check, and defer the more invasive segment checks to the block layer.

Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-09 10:27:01 -06:00
Keith Busch
20a0e6276e block: align the bio after building it
Instead of ensuring each vector is block size aligned while constructing
the bio, just ensure the entire size is aligned after it's built. This
makes getting bio pages more flexible: it accepts device-valid io
vectors that would otherwise be rejected by alignment checks.
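
The difference between the two checks can be sketched as follows. This is a
minimal model with hypothetical helper names and made-up vector lengths; it
assumes the device itself has no stricter per-vector constraint.

```python
def per_vector_aligned(vec_lens, block_size):
    """Old-style check: every vector length must be block size aligned."""
    return all(n % block_size == 0 for n in vec_lens)


def total_aligned(vec_lens, block_size):
    """New-style check: only the total size of the built bio must be aligned."""
    return sum(vec_lens) % block_size == 0


vecs = [1000, 3096]                    # hypothetical io vector lengths in bytes
print(per_vector_aligned(vecs, 512))   # False
print(total_aligned(vecs, 512))        # True
```

The two vectors add up to 4096 bytes, so the bio as a whole is valid for a
512-byte logical block size even though neither vector is aligned on its own.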

Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-09 10:27:01 -06:00
Keith Busch
743bf2e0c4 block: add size alignment to bio_iov_iter_get_pages
The block layer tries to align bio vectors to the block device's logical
block size. Some cases don't have a block device, or we may need to
align to something larger, which we can't derive from the queue
limits. Have the caller specify what they want, or allow any length
alignment if nothing was specified. Since the most common use case
relies on the block device's limits, a helper function is provided.

Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-09 10:27:01 -06:00
Keith Busch
fec2e70572 block: check for valid bio while splitting
We're already iterating every segment, so check these for valid IO
lengths at the same time. Individual segment lengths will not be checked
on passthrough commands. The read/write command segments must be sized
to the dma alignment.

Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-09 10:27:01 -06:00
Christoph Hellwig
d86eaa0f3c block: remove the bi_inline_vecs variable sized array from struct bio
Bios are embedded into other structures, and at least sparse is unhappy
about embedding structures with variable sized arrays.  There's no
real need for the array anyway: we can replace it with a helper pointing
to the memory just behind the bio, and with the previous cleanups there
are very few sites doing anything special with it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: John Garry <john.g.garry@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-09 07:31:59 -06:00
Christoph Hellwig
70a6f71b1a block: add a bio_init_inline helper
Just a simpler wrapper around bio_init for callers that want to
initialize a bio with inline bvecs.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: John Garry <john.g.garry@oracle.com>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-09 07:31:59 -06:00
Linus Torvalds
f777d1112e vfs-6.17-rc6.fixes
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCaL6SyQAKCRCRxhvAZXjc
 ouTGAQDGiTnaENiOzRhzNl1XONTRv8a1uV0pxg4W3fNdiRlxgQEA/O90/+nM48KC
 pdV3WHz5eGfcnMTpqgHxK6HYgwklJAY=
 =oKnm
 -----END PGP SIGNATURE-----

Merge tag 'vfs-6.17-rc6.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull vfs fixes from Christian Brauner:
 "fuse:

   - Prevent opening of non-regular backing files.

     Fuse doesn't support non-regular files anyway.

   - Check whether copy_file_range() returns a larger size than
     requested.

   - Prevent overflow in copy_file_range() as fuse currently only
     supports 32-bit sized copies.

   - Cache the blocksize value if the server returned a new value as
     inode->i_blkbits isn't modified directly anymore.

   - Fix i_blkbits handling for iomap partial writes.

     By default i_blkbits is set to PAGE_SIZE which causes iomap to mark
     the whole folio as uptodate even on a partial write. But fuseblk
     filesystems support choosing a blocksize smaller than PAGE_SIZE
     risking data corruption. Simply enforce PAGE_SIZE as blocksize for
     fuseblk's internal inode for now.

   - Prevent out-of-bounds access in fuse_dev_write() when the number of
     bytes to be retrieved is truncated to the fc->max_pages limit.

  virtiofs:

   - Fix page faults for DAX page addresses.

  Misc:

   - Tighten file handle decoding from userns.

     Check that the decoded dentry itself has a valid idmapping in the
     user namespace.

   - Fix mount-notify selftests.

   - Fix some indentation errors.

   - Add an FMODE_ flag to indicate IOCB_HAS_METADATA availability.

     This will be moved to an FOP_* flag, but that needs a bit more
     rework, which is not suitable for a fix.

   - Don't silently ignore metadata for sync read/write.

   - Don't pointlessly log warning when reading coredump sysctls"

* tag 'vfs-6.17-rc6.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  fuse: virtio_fs: fix page fault for DAX page address
  selftests/fs/mount-notify: Fix compilation failure.
  fhandle: use more consistent rules for decoding file handle from userns
  fuse: Block access to folio overlimit
  fuse: fix fuseblk i_blkbits for iomap partial writes
  fuse: reflect cached blocksize if blocksize was changed
  fuse: prevent overflow in copy_file_range return value
  fuse: check if copy_file_range() returns larger than requested size
  fuse: do not allow mapping a non-regular backing file
  coredump: don't pointlessly check and spew warnings
  fs: fix indentation style
  block: don't silently ignore metadata for sync read/write
  fs: add a FMODE_ flag to indicate IOCB_HAS_METADATA availability
2025-09-08 07:53:01 -07:00
Han Guangjiang
bd9fd5be6b blk-throttle: fix access race during throttle policy activation
On repeated cold boots we occasionally hit a NULL pointer crash in
blk_should_throtl() when throttling is consulted before the throttle
policy is fully enabled for the queue. Checking only q->td != NULL is
insufficient during early initialization, so blkg_to_pd() for the
throttle policy can still return NULL and blkg_to_tg() becomes NULL,
which later gets dereferenced.

 Unable to handle kernel NULL pointer dereference
 at virtual address 0000000000000156
 ...
 pc : submit_bio_noacct+0x14c/0x4c8
 lr : submit_bio_noacct+0x48/0x4c8
 sp : ffff800087f0b690
 x29: ffff800087f0b690 x28: 0000000000005f90 x27: ffff00068af393c0
 x26: 0000000000080000 x25: 000000000002fbc0 x24: ffff000684ddcc70
 x23: 0000000000000000 x22: 0000000000000000 x21: 0000000000000000
 x20: 0000000000080000 x19: ffff000684ddcd08 x18: ffffffffffffffff
 x17: 0000000000000000 x16: ffff80008132a550 x15: 0000ffff98020fff
 x14: 0000000000000000 x13: 1fffe000d11d7021 x12: ffff000688eb810c
 x11: ffff00077ec4bb80 x10: ffff000688dcb720 x9 : ffff80008068ef60
 x8 : 00000a6fb8a86e85 x7 : 000000000000111e x6 : 0000000000000002
 x5 : 0000000000000246 x4 : 0000000000015cff x3 : 0000000000394500
 x2 : ffff000682e35e40 x1 : 0000000000364940 x0 : 000000000000001a
 Call trace:
  submit_bio_noacct+0x14c/0x4c8
  verity_map+0x178/0x2c8
  __map_bio+0x228/0x250
  dm_submit_bio+0x1c4/0x678
  __submit_bio+0x170/0x230
  submit_bio_noacct_nocheck+0x16c/0x388
  submit_bio_noacct+0x16c/0x4c8
  submit_bio+0xb4/0x210
  f2fs_submit_read_bio+0x4c/0xf0
  f2fs_mpage_readpages+0x3b0/0x5f0
  f2fs_readahead+0x90/0xe8

Tighten blk_throtl_activated() to also require that the throttle policy
bit is set on the queue:

  return q->td != NULL &&
         test_bit(blkcg_policy_throtl.plid, q->blkcg_pols);

This prevents blk_should_throtl() from accessing throttle group state
until policy data has been attached to blkgs.
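
The window that the tightened check closes can be sketched in user space.
This is a toy model under assumptions: the Queue class, THROTL_PLID and both
helper functions are hypothetical stand-ins for q->td, the throttle policy id
and blk_throtl_activated(), and the bitmap is modelled as a set.

```python
THROTL_PLID = 0   # hypothetical id standing in for blkcg_policy_throtl.plid


class Queue:
    def __init__(self):
        self.td = None            # stands in for q->td
        self.blkcg_pols = set()   # stands in for the q->blkcg_pols bitmap


def activated_old(q):
    """Old check: only q->td != NULL."""
    return q.td is not None


def activated_new(q):
    """Tightened check: q->td set and the policy bit enabled on the queue."""
    return q.td is not None and THROTL_PLID in q.blkcg_pols


q = Queue()
q.td = object()                     # td allocated, policy not yet enabled
window = (activated_old(q), activated_new(q))
q.blkcg_pols.add(THROTL_PLID)       # policy data now attached
after = (activated_old(q), activated_new(q))
print(window, after)                # (True, False) (True, True)
```

Inside the window the old check already reports the policy as active, which
is when the NULL throttle group would be dereferenced; the new check does not.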

Fixes: a3166c5170 ("blk-throttle: delay initialization until configuration")
Co-developed-by: Liang Jie <liangjie@lixiang.com>
Signed-off-by: Liang Jie <liangjie@lixiang.com>
Signed-off-by: Han Guangjiang <hanguangjiang@lixiang.com>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-08 08:24:44 -06:00
Ming Lei
995412e23b blk-mq: Replace tags->lock with SRCU for tag iterators
Replace the spinlock in blk_mq_find_and_get_req() with an SRCU read lock
around the tag iterators.

This is done by:

- Holding the SRCU read lock in blk_mq_queue_tag_busy_iter(),
blk_mq_tagset_busy_iter(), and blk_mq_hctx_has_requests().

- Removing the now-redundant tags->lock from blk_mq_find_and_get_req().

This change fixes a lockup issue in scsi_host_busy() in the case of
shost->host_blocked.

It also avoids taking the big tags->lock when reading the disk sysfs
attribute `inflight`.

Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-08 08:05:32 -06:00
Ming Lei
135b8521f2 blk-mq: Defer freeing flush queue to SRCU callback
The freeing of the flush queue/request in blk_mq_exit_hctx() can race with
tag iterators that may still be accessing it. To prevent a potential
use-after-free, the deallocation should be deferred until after a grace
period. This way, we can replace the big tags->lock in the tag iterator
code path with SRCU to solve the issue.

This patch introduces an SRCU-based deferred freeing mechanism for the
flush queue.

The changes include:
- Adding a `rcu_head` to `struct blk_flush_queue`.
- Creating a new callback function, `blk_free_flush_queue_callback`,
  to handle the actual freeing.
- Replacing the direct call to `blk_free_flush_queue()` in
  `blk_mq_exit_hctx()` with `call_srcu()`, using the `tags_srcu`
  instance to ensure synchronization with tag iterators.

Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-08 08:05:32 -06:00
Ming Lei
ad0d05dbdd blk-mq: Defer freeing of tags page_list to SRCU callback
Tag iterators can race with the freeing of the request pages
(tags->page_list), potentially leading to use-after-free issues.

Defer the freeing of the page list and the tags structure itself until
after an SRCU grace period has passed. This ensures that any concurrent
tag iterators have completed before the memory is released. This way, we
can replace the big tags->lock in the tag iterator code path with SRCU
to solve the issue.

This is achieved by:
- Adding a new `srcu_struct tags_srcu` to `blk_mq_tag_set` to protect
  tag map iteration.
- Adding an `rcu_head` to `struct blk_mq_tags` to be used with
  `call_srcu`.
- Moving the page list freeing logic and the `kfree(tags)` call into a
  new callback function, `blk_mq_free_tags_callback`.
- In `blk_mq_free_tags`, invoking `call_srcu` to schedule the new
  callback for deferred execution.

The read-side protection for the tag iterators will be added in a
subsequent patch.
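
The deferred-free pattern described above can be illustrated with a toy
model. This is a deliberately simplified sketch, not how SRCU works
internally: ToySrcu is a hypothetical class that runs callbacks as soon as
no reader is active, whereas real SRCU only waits for readers that started
before call_srcu() and runs callbacks asynchronously.

```python
class ToySrcu:
    """Toy grace-period model: callbacks run once no reader is active."""

    def __init__(self):
        self.readers = 0
        self.pending = []

    def read_lock(self):
        self.readers += 1

    def read_unlock(self):
        self.readers -= 1
        self._run_callbacks()

    def call_srcu(self, callback):
        # Queue the free; it must not run while a reader may still
        # be looking at the old data.
        self.pending.append(callback)
        self._run_callbacks()

    def _run_callbacks(self):
        if self.readers == 0:
            callbacks, self.pending = self.pending, []
            for callback in callbacks:
                callback()


freed = []
tags_srcu = ToySrcu()
tags_srcu.read_lock()                           # a tag iterator starts
tags_srcu.call_srcu(lambda: freed.append("tags"))
freed_while_reading = list(freed)               # [] : free is deferred
tags_srcu.read_unlock()                         # iterator done
print(freed_while_reading, freed)               # [] ['tags']
```

The free is queued while the iterator is still running and only happens once
the reader has left its critical section.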

Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-08 08:05:32 -06:00
Ming Lei
9ad8e5af32 blk-mq: Pass tag_set to blk_mq_free_rq_map/tags
To prepare for converting the tag->rqs freeing to be SRCU-based, the
tag_set is needed in the freeing helper functions.

This patch adds 'struct blk_mq_tag_set *' as the first parameter to
blk_mq_free_rq_map() and blk_mq_free_tags(), and updates all their call
sites.

This allows access to the tag_set's SRCU structure in the next step,
which will be used to free the tag maps after a grace period.

No functional change is intended in this patch.

Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-08 08:05:32 -06:00
Ming Lei
aba19ee71c blk-mq: Move flush queue allocation into blk_mq_init_hctx()
Move flush queue allocation into blk_mq_init_hctx() and its release into
blk_mq_exit_hctx(), to prepare for replacing tags->lock with SRCU for
draining inflight request walking. blk_mq_exit_hctx() is the last chance
for us to get a valid `tag_set` reference, and we need to add one SRCU to
`tag_set` for freeing the flush request via call_srcu().

It is safe to move flush queue & request release into blk_mq_exit_hctx(),
because blk_mq_clear_flush_rq_mapping() clears the flush request
reference in the driver tags inflight request table; meanwhile, inflight
request walking is drained.

Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-08 08:05:32 -06:00
Yu Kuai
ba28afbd9e blk-mq: fix blk_mq_tags double free while nr_requests grown
When the user triggers tags growth via the queue sysfs attribute
nr_requests, hctx->sched_tags is freed directly and replaced with newly
allocated tags, see blk_mq_tag_update_depth().

The problem is that hctx->sched_tags comes from elevator->et->tags, while
et->tags still points to the freed tags; hence a later elevator exit will
try to free the tags again, causing a kernel panic.

Fix this problem by replacing et->tags with the newly allocated tags as well.
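
The stale-pointer problem can be modelled in a few lines. This is a sketch
with hypothetical names: Tags, free_tags(), grow(), elevator_exit() and run()
are illustrations only, and the per-hctx array behind et->tags is reduced to
a single entry.

```python
class Tags:
    def __init__(self, depth):
        self.depth = depth
        self.freed = False


def free_tags(tags):
    if tags.freed:
        raise RuntimeError("double free")
    tags.freed = True


def grow(hctx, et, new_depth, fix):
    """Model of growing sched_tags through the nr_requests attribute."""
    free_tags(hctx["sched_tags"])
    hctx["sched_tags"] = Tags(new_depth)
    if fix:
        et["tags"] = hctx["sched_tags"]   # the fix: keep et->tags in sync


def elevator_exit(et):
    free_tags(et["tags"])


def run(fix):
    tags = Tags(64)
    hctx, et = {"sched_tags": tags}, {"tags": tags}
    grow(hctx, et, 128, fix)
    try:
        elevator_exit(et)
    except RuntimeError as exc:
        return str(exc)
    return "ok"


print(run(fix=False))   # double free
print(run(fix=True))    # ok
```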

Note that there are still some long-term problems that will require some
refactoring to be fixed thoroughly[1].

[1] https://lore.kernel.org/all/20250815080216.410665-1-yukuai1@huaweicloud.com/
Fixes: f5a6604f7a ("block: fix lockdep warning caused by lock dependency in elv_iosched_store")

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Li Nan <linan122@huawei.com>
Link: https://lore.kernel.org/r/20250821060612.1729939-3-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-05 13:52:52 -06:00
Yu Kuai
7d337eef4a blk-mq: fix elevator depth_updated method
Current depth_updated has some problems:

1) depth_updated() will be called for each hctx, while all elevators
will update async_depth for the disk level, this is not related to hctx;
2) In blk_mq_update_nr_requests(), if a previous hctx update succeeds and
this hctx update fails, q->nr_requests will not be updated, while
async_depth is already updated with the new nr_requests in the previous
depth_updated();
3) All elevators are using q->nr_requests to calculate async_depth now,
however, q->nr_requests is still the old value when depth_updated() is
called from blk_mq_update_nr_requests();

These problems first appeared in the error path, then in mq-deadline, and
recently in bfq and kyber. Fix them by:

- pass in request_queue instead of hctx;
- move depth_updated() after q->nr_requests is updated in
  blk_mq_update_nr_requests();
- add depth_updated() call inside init_sched() method to initialize
  async_depth;
- remove init_hctx() method for mq-deadline and bfq that is useless now;
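
Problem 3 and the reordering that addresses it can be sketched as follows.
This is an illustrative model only: the Queue class and the update helpers
are hypothetical, and the 3/4 ratio used for async_depth is an assumption
made for the example, since each elevator derives its own value.

```python
class Queue:
    def __init__(self, nr_requests):
        self.nr_requests = nr_requests
        self.async_depth = None


def depth_updated(q):
    # Hypothetical formula: derive async_depth from q.nr_requests.
    q.async_depth = max(1, q.nr_requests * 3 // 4)


def update_old(q, nr):
    depth_updated(q)        # runs while q.nr_requests is still the old value
    q.nr_requests = nr


def update_new(q, nr):
    q.nr_requests = nr      # update first ...
    depth_updated(q)        # ... then let the elevator recompute


old_q, new_q = Queue(64), Queue(64)
update_old(old_q, 128)
update_new(new_q, 128)
print(old_q.async_depth, new_q.async_depth)   # 48 96
```

In the old order async_depth is derived from the stale value 64, while the
new order derives it from the updated value 128.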

Fixes: 77f1e0a52d ("bfq: update internal depth state when queue depth changes")
Fixes: 39823b47bb ("block/mq-deadline: Fix the tag reservation code")
Fixes: 42e6c6ce03 ("lib/sbitmap: convert shallow_depth from one word to the whole sbitmap")
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Li Nan <linan122@huawei.com>
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Link: https://lore.kernel.org/r/20250821060612.1729939-2-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-05 13:52:52 -06:00
Jens Axboe
4dbe13c784 switching ->getgeo() from struct block_device to struct gendisk
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQQqUNBr3gm4hGXdBJlZ7Krx/gZQ6wUCaLifHQAKCRBZ7Krx/gZQ
 64qlAPsGU9cVg8tVcbbuf767MXyuQZkUPeA5AWnSkm0jfQzaKAEAmsF4+KsjOFRR
 EmdjHBlN5kk6a0TWzXcADlieJ/ccNA4=
 =Tr1Q
 -----END PGP SIGNATURE-----

Merge tag 'pull-getgeo' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs into for-6.18/block

Pull struct block_device getgeo changes from Al.

"switching ->getgeo() from struct block_device to struct gendisk

 Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>"

* tag 'pull-getgeo' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  block: switch ->getgeo() to struct gendisk
  scsi: switch ->bios_param() to passing gendisk
  scsi: switch scsi_bios_ptable() and scsi_partsize() to gendisk
2025-09-03 15:15:43 -06:00
Qianfeng Rong
b0b4518c99 block: use int to store blk_stack_limits() return value
Change the 'ret' variable in blk_stack_limits() from unsigned int to int,
as it needs to store the negative value -1.

Storing negative error codes in an unsigned type, or performing equality
comparisons (e.g., ret == -1), doesn't cause an issue at runtime [1] but
can be confusing.  Additionally, assigning negative error codes to an
unsigned type may trigger a GCC warning when the -Wsign-conversion flag is
enabled.
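
Why the equality comparison still works can be shown with a little modular
arithmetic. This sketch models C's conversion to unsigned int in Python,
assuming a 32-bit unsigned int; as_unsigned() is a hypothetical helper.

```python
UINT_BITS = 32   # assumption: unsigned int is 32 bits wide


def as_unsigned(value):
    """Model C's conversion of a signed value to unsigned int."""
    return value % (1 << UINT_BITS)


ret = as_unsigned(-1)             # what 'unsigned int ret = -1;' stores
print(ret)                        # 4294967295
print(ret == as_unsigned(-1))     # True
```

In C the -1 in 'ret == -1' is converted to unsigned before the comparison,
so both sides hold the same value and the test behaves as intended, even
though the stored value is no longer negative.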

No effect on runtime.

Link: https://lore.kernel.org/all/x3wogjf6vgpkisdhg3abzrx7v7zktmdnfmqeih5kosszmagqfs@oh3qxrgzkikf/ #1
Signed-off-by: Qianfeng Rong <rongqianfeng@vivo.com>
Reviewed-by: John Garry <john.g.garry@oracle.com>
Fixes: fe0b393f2c ("block: Correct handling of bottom device misaligment")
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20250902130930.68317-1-rongqianfeng@vivo.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-02 19:19:25 -06:00
Simon Schuster
edd3cb05c0 copy_process: pass clone_flags as u64 across calltree
With the introduction of clone3 in commit 7f192e3cd3 ("fork: add
clone3") the effective bit width of clone_flags on all architectures was
increased from 32-bit to 64-bit, with a new type of u64 for the flags.
However, for most consumers of clone_flags the interface was not
changed from the previous type of unsigned long.

While this works fine as long as none of the new 64-bit flag bits
(CLONE_CLEAR_SIGHAND and CLONE_INTO_CGROUP) are evaluated, this is still
undesirable in terms of the principle of least surprise.

Thus, this commit fixes all relevant interfaces of callees to
sys_clone3/copy_process (excluding the architecture-specific
copy_thread) to consistently pass clone_flags as u64, so that
no truncation to 32-bit integers occurs on 32-bit architectures.

Signed-off-by: Simon Schuster <schuster.simon@siemens-energy.com>
Link: https://lore.kernel.org/20250901-nios2-implement-clone3-v2-2-53fcf5577d57@siemens-energy.com
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-09-01 15:31:34 +02:00
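
The truncation that the copy_process commit above guards against can be sketched in plain C. The two flag values are the ones defined in the UAPI header linux/sched.h; the functions `wants_cgroup_narrow()` and `wants_cgroup_wide()` are hypothetical and only model a callee that receives the flags as a 32-bit value versus a 64-bit one.

```c
#include <stdbool.h>
#include <stdint.h>

/* Flag values as defined in the Linux UAPI header <linux/sched.h>. */
#define CLONE_CLEAR_SIGHAND 0x100000000ULL
#define CLONE_INTO_CGROUP   0x200000000ULL

/* Hypothetical callee with a 32-bit parameter, standing in for
 * unsigned long on a 32-bit architecture. The upper 32 bits are
 * already gone by the time the body runs. */
bool wants_cgroup_narrow(uint32_t clone_flags)
{
	return (clone_flags & CLONE_INTO_CGROUP) != 0;
}

/* Hypothetical callee that receives the full 64-bit flags. */
bool wants_cgroup_wide(uint64_t clone_flags)
{
	return (clone_flags & CLONE_INTO_CGROUP) != 0;
}
```

With the narrow parameter the flag can never be observed, whatever the caller passed; with the 64-bit parameter it survives the call.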
Christian Brauner
e23654f5b1 fuse fixes for 6.17-rc5
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQSQHSd0lITzzeNWNm3h3BK/laaZPAUCaLVsHgAKCRDh3BK/laaZ
 PDQ9AQDkWIqblrxfuL/Ji9d18XR4FFMN3PHxt046AjFwFVbL4gD/Q8bIDTmJNyts
 YAGcgSkDpa2Q/UT9yZ/8IFdmidnVqwk=
 =5/3p
 -----END PGP SIGNATURE-----
gpgsig -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCaLV6IAAKCRCRxhvAZXjc
 olPYAP9rPVDs2VyektXPfHowxX53nxuYcbGcowPzIn68f9cp0AD/XfyPHfYIKFRc
 Mk/u+Ns438FpHDzBu84y51giQbc2IwM=
 =IEzO
 -----END PGP SIGNATURE-----

Merge tag 'fuse-fixes-6.17-rc5' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse into vfs.fixes

fuse fixes for 6.17-rc5

* tag 'fuse-fixes-6.17-rc5' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse: (6 commits)
  fuse: Block access to folio overlimit
  fuse: fix fuseblk i_blkbits for iomap partial writes
  fuse: reflect cached blocksize if blocksize was changed
  fuse: prevent overflow in copy_file_range return value
  fuse: check if copy_file_range() returns larger than requested size
  fuse: do not allow mapping a non-regular backing file

Link: https://lore.kernel.org/CAJfpeguEVMMyw_zCb+hbOuSxdE2Z3Raw=SJsq=Y56Ae6dn2W3g@mail.gmail.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-09-01 12:48:28 +02:00
Li Nan
4c7ef92f6d blk-mq: check kobject state_in_sysfs before deleting in blk_mq_unregister_hctx
In __blk_mq_update_nr_hw_queues() the return value of
blk_mq_sysfs_register_hctxs() is not checked. If sysfs creation for hctx
fails, later changing the number of hw_queues or removing disk will
trigger the following warning:

  kernfs: can not remove 'nr_tags', no directory
  WARNING: CPU: 2 PID: 637 at fs/kernfs/dir.c:1707 kernfs_remove_by_name_ns+0x13f/0x160
  Call Trace:
   remove_files.isra.1+0x38/0xb0
   sysfs_remove_group+0x4d/0x100
   sysfs_remove_groups+0x31/0x60
   __kobject_del+0x23/0xf0
   kobject_del+0x17/0x40
   blk_mq_unregister_hctx+0x5d/0x80
   blk_mq_sysfs_unregister_hctxs+0x94/0xd0
   blk_mq_update_nr_hw_queues+0x124/0x760
   nullb_update_nr_hw_queues+0x71/0xf0 [null_blk]
   nullb_device_submit_queues_store+0x92/0x120 [null_blk]

kobject_del() was called unconditionally even if sysfs creation failed.
Fix it by checking the kobject creation status before deleting it.

Fixes: 477e19dedc ("blk-mq: adjust debugfs and sysfs register when updating nr_hw_queues")
Signed-off-by: Li Nan <linan122@huawei.com>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20250826084854.1030545-1-linan666@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-08-28 19:21:07 -06:00
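
The guard described in the blk_mq_unregister_hctx commit above follows a common pattern: only undo what was actually done. A minimal user-space sketch, assuming a simplified `struct fake_kobject` that keeps just a `state_in_sysfs` flag; the `fake_*` helpers are hypothetical and are not the kernel's kobject API.

```c
#include <stdbool.h>

/* Hypothetical, simplified stand-in for struct kobject. */
struct fake_kobject {
	bool state_in_sysfs;	/* set only when the add step succeeded */
};

/* Returns 0 on success and -1 on failure; 'fail' simulates a sysfs error. */
int fake_kobject_add(struct fake_kobject *kobj, bool fail)
{
	if (fail)
		return -1;
	kobj->state_in_sysfs = true;
	return 0;
}

/* Returns true if the object was removed, false if removal was skipped. */
bool fake_unregister(struct fake_kobject *kobj)
{
	if (!kobj->state_in_sysfs)
		return false;	/* never added, so there is nothing to delete */
	kobj->state_in_sysfs = false;
	return true;
}
```

If the add step fails, a later `fake_unregister()` becomes a no-op instead of trying to remove entries that were never created.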
Nilay Shroff
e3ef9445cd block: validate QoS before calling __rq_qos_done_bio()
If a bio has BIO_QOS_xxx set, it doesn't guarantee that q->rq_qos is
also present, at least for stacked block devices. For instance, with
NVMe multipath enabled, the bottom device may have QoS enabled while
the top device doesn't. So always validate that QoS is enabled and
q->rq_qos is present before calling __rq_qos_done_bio().

Fixes: 370ac285f2 ("block: avoid cpu_hotplug_lock depedency on freeze_lock")
Reported-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com>
Closes: https://lore.kernel.org/all/3a07b752-06a4-4eee-b302-f4669feb859d@linux.ibm.com/
Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
Link: https://lore.kernel.org/r/20250826163128.1952394-1-nilay@linux.ibm.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-08-26 10:34:08 -06:00
Bart Van Assche
198f36f902 blk-zoned: Fix a lockdep complaint about recursive locking
If preparing a write bio fails then blk_zone_wplug_bio_work() calls
bio_endio() with zwplug->lock held. If a device mapper driver is stacked
on top of the zoned block device then this results in nested locking of
zwplug->lock. The resulting lockdep complaint is a false positive
because this is nested locking and not recursive locking. Suppress this
false positive by calling blk_zone_wplug_bio_io_error() without holding
zwplug->lock. This is safe because no code in
blk_zone_wplug_bio_io_error() depends on zwplug->lock being held. This
patch suppresses the following lockdep complaint:

WARNING: possible recursive locking detected
--------------------------------------------
kworker/3:0H/46 is trying to acquire lock:
ffffff882968b830 (&zwplug->lock){-...}-{2:2}, at: blk_zone_write_plug_bio_endio+0x64/0x1f0

but task is already holding lock:
ffffff88315bc230 (&zwplug->lock){-...}-{2:2}, at: blk_zone_wplug_bio_work+0x8c/0x48c

other info that might help us debug this:
 Possible unsafe locking scenario:

       CPU0
       ----
  lock(&zwplug->lock);
  lock(&zwplug->lock);

 *** DEADLOCK ***

 May be due to missing lock nesting notation

3 locks held by kworker/3:0H/46:
 #0: ffffff8809486758 ((wq_completion)sdd_zwplugs){+.+.}-{0:0}, at: process_one_work+0x1bc/0x65c
 #1: ffffffc085de3d70 ((work_completion)(&zwplug->bio_work)){+.+.}-{0:0}, at: process_one_work+0x1e4/0x65c
 #2: ffffff88315bc230 (&zwplug->lock){-...}-{2:2}, at: blk_zone_wplug_bio_work+0x8c/0x48c

stack backtrace:
CPU: 3 UID: 0 PID: 46 Comm: kworker/3:0H Tainted: G        W  OE      6.12.38-android16-5-maybe-dirty-4k #1 8b362b6f76e3645a58cd27d86982bce10d150025
Tainted: [W]=WARN, [O]=OOT_MODULE, [E]=UNSIGNED_MODULE
Hardware name: Spacecraft board based on MALIBU (DT)
Workqueue: sdd_zwplugs blk_zone_wplug_bio_work
Call trace:
 dump_backtrace+0xfc/0x17c
 show_stack+0x18/0x28
 dump_stack_lvl+0x40/0xa0
 dump_stack+0x18/0x24
 print_deadlock_bug+0x38c/0x398
 __lock_acquire+0x13e8/0x2e1c
 lock_acquire+0x134/0x2b4
 _raw_spin_lock_irqsave+0x5c/0x80
 blk_zone_write_plug_bio_endio+0x64/0x1f0
 bio_endio+0x9c/0x240
 __dm_io_complete+0x214/0x260
 clone_endio+0xe8/0x214
 bio_endio+0x218/0x240
 blk_zone_wplug_bio_work+0x204/0x48c
 process_one_work+0x26c/0x65c
 worker_thread+0x33c/0x498
 kthread+0x110/0x134
 ret_from_fork+0x10/0x20

Cc: stable@vger.kernel.org
Cc: Damien Le Moal <dlemoal@kernel.org>
Cc: Christoph Hellwig <hch@lst.de>
Fixes: dd291d77cc ("block: Introduce zone write plugging")
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://lore.kernel.org/r/20250825182720.1697203-1-bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-08-26 08:27:24 -06:00
Bart Van Assche
f5d10e6915 block: Move a misplaced comment in queue_wb_lat_store()
blk_mq_quiesce_queue() does not wait for pending I/O to finish. Freezing
a queue waits for pending I/O to finish. Hence move the comment that
refers to waiting for pending I/O above the call that freezes the
request queue. This patch moves this comment back to the position where
it was when this comment was introduced. See also commit c125311d96
("blk-wbt: don't maintain inflight counts if disabled").

Cc: Christoph Hellwig <hch@lst.de>
Cc: Nilay Shroff <nilay@linux.ibm.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20250825151424.1653910-1-bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-08-25 14:43:29 -06:00
Keith Busch
c16b52a0a0 blk-integrity: use iterator for mapping sg
Modify blk_rq_map_integrity_sg to use the blk-mq mapping iterator. This
produces more efficient code and converges the integrity mapping
implementations to reduce future maintenance burdens.

The function implementation moves from blk-integrity.c to blk-mq-dma.c
in order to use the types and functions private to that file.

Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20250813153153.3260897-8-kbusch@meta.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-08-25 07:44:39 -06:00
Keith Busch
fec9b16dc5 blk-mq-dma: add scatter-less integrity data DMA mapping
Similar to regular data, introduce more efficient integrity mapping
helpers that do away with the scatterlist structure. This uses the
block mapping iterator to add IOVA segments if IOMMU is enabled, or maps
directly if not. This also supports P2P segments if integrity data ever
wants to allocate that type of memory.

Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Kanchan Joshi <joshi.k@samsung.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20250813153153.3260897-7-kbusch@meta.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-08-25 07:44:39 -06:00
Keith Busch
e2be2ba6d2 blk-mq-dma: move common dma start code to a helper
In preparing for dma mapping integrity metadata, move the common dma
setup to a helper.

Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Kanchan Joshi <joshi.k@samsung.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20250813153153.3260897-6-kbusch@meta.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-08-25 07:44:39 -06:00
Keith Busch
7092639031 blk-mq: remove REQ_P2PDMA flag
It's not serving any particular purpose. pci_p2pdma_state() already has
all the appropriate checks, so the config and flag checks are not
guarding anything.

Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Kanchan Joshi <joshi.k@samsung.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20250813153153.3260897-5-kbusch@meta.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-08-25 07:44:39 -06:00
Keith Busch
dae75dead2 blk-mq-dma: provide the bio_vec array being iterated
This will make it easier to add different sources of the bvec array,
like for upcoming integrity support, rather than assume to use the bio's
bi_io_vec. It also makes iterating "special" payloads more in common
with iterating normal payloads.

Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Kanchan Joshi <joshi.k@samsung.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20250813153153.3260897-3-kbusch@meta.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-08-25 07:44:38 -06:00
Keith Busch
7a6fc1634c blk-mq-dma: create blk_map_iter type
The req_iterator happens to have similar fields to what the dma
iterator needs, but we're not necessarily iterating a request's
bi_io_vec. Create a new type that can be amended for additional future
use.

Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Kanchan Joshi <joshi.k@samsung.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20250813153153.3260897-2-kbusch@meta.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-08-25 07:44:38 -06:00
Nilay Shroff
370ac285f2 block: avoid cpu_hotplug_lock depedency on freeze_lock
A recent lockdep[1] splat observed while running blktest block/005
reveals a potential deadlock caused by the cpu_hotplug_lock dependency
on ->freeze_lock. This dependency was introduced by commit 033b667a82
("block: blk-rq-qos: guard rq-qos helpers by static key").

That change added a static key to avoid fetching q->rq_qos when
neither blk-wbt nor blk-iolatency is configured. The static key
dynamically patches kernel text to a NOP when disabled, eliminating
overhead of fetching q->rq_qos in the I/O hot path. However, enabling
a static key at runtime requires acquiring both cpu_hotplug_lock and
jump_label_mutex. When this happens after the queue has already been
frozen (i.e., while holding ->freeze_lock), it creates a locking
dependency from cpu_hotplug_lock to ->freeze_lock, which leads to a
potential deadlock reported by lockdep [1].

To resolve this, replace the static key mechanism with q->queue_flags:
QUEUE_FLAG_QOS_ENABLED. This flag is evaluated in the fast path before
accessing q->rq_qos. If the flag is set, we proceed to fetch q->rq_qos;
otherwise, the access is skipped.

Since q->queue_flags is commonly accessed in IO hotpath and resides in
the first cacheline of struct request_queue, checking it imposes minimal
overhead while eliminating the deadlock risk.

This change avoids the lockdep splat without introducing performance
regressions.

[1] https://lore.kernel.org/linux-block/4fdm37so3o4xricdgfosgmohn63aa7wj3ua4e5vpihoamwg3ui@fq42f5q5t5ic/

Reported-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Closes: https://lore.kernel.org/linux-block/4fdm37so3o4xricdgfosgmohn63aa7wj3ua4e5vpihoamwg3ui@fq42f5q5t5ic/
Fixes: 033b667a82 ("block: blk-rq-qos: guard rq-qos helpers by static key")
Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20250814082612.500845-4-nilay@linux.ibm.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-08-21 07:11:11 -06:00
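
The flag-guarded fast path described in the cpu_hotplug_lock commit above can be sketched in user-space C. Everything prefixed `fake_`/`FAKE_` is hypothetical, including the bit position chosen for the flag; the sketch only shows the order of checks (flag first, then the pointer), not the kernel's actual helpers.

```c
#include <stddef.h>

/* Illustrative bit position; the kernel's value may differ. */
#define FAKE_QUEUE_FLAG_QOS_ENABLED (1UL << 0)

struct fake_rq_qos {
	int done_calls;		/* counts how often the QoS hook ran */
};

struct fake_queue {
	unsigned long queue_flags;
	struct fake_rq_qos *rq_qos;
};

/* Returns 1 if the QoS hook ran and 0 if it was skipped. */
int fake_rq_qos_done(struct fake_queue *q)
{
	/* Cheap flag test first: no pointer fetch when QoS is off. */
	if (!(q->queue_flags & FAKE_QUEUE_FLAG_QOS_ENABLED))
		return 0;
	/* The flag alone is not trusted; the pointer is validated too. */
	if (q->rq_qos == NULL)
		return 0;
	q->rq_qos->done_calls++;
	return 1;
}
```

Checking a plain flag word needs no text patching, so enabling it does not involve the locks a static key would take. The extra NULL check mirrors the stacked-device case, where the flag state and the pointer may not agree.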
Nilay Shroff
ade1beea1c block: decrement block_rq_qos static key in rq_qos_del()
rq_qos_add() increments the block_rq_qos static key when a QoS
policy is attached. When a QoS policy is removed via rq_qos_del(),
we must symmetrically decrement the static key. If this removal drops
the last QoS policy from the queue (q->rq_qos becomes NULL), the
static branch can be disabled and the jump label patched to a NOP,
avoiding overhead on the hot path.

This change ensures rq_qos_add()/rq_qos_del() keep the
block_rq_qos static key balanced and prevents leaving the branch
permanently enabled after the last policy is removed.

Fixes: 033b667a82 ("block: blk-rq-qos: guard rq-qos helpers by static key")
Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20250814082612.500845-3-nilay@linux.ibm.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-08-21 07:11:11 -06:00
Nilay Shroff
275332877e block: skip q->rq_qos check in rq_qos_done_bio()
If a bio has BIO_QOS_THROTTLED or BIO_QOS_MERGED set,
it implicitly guarantees that q->rq_qos is present.
Avoid re-checking q->rq_qos in this case and call
__rq_qos_done_bio() directly as a minor optimization.

Suggested-by: Yu Kuai <yukuai1@huaweicloud.com>

Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20250814082612.500845-2-nilay@linux.ibm.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-08-21 07:11:11 -06:00
Ming Lei
2d82f3bd89 blk-mq: fix lockdep warning in __blk_mq_update_nr_hw_queues
Commit 5989bfe6ac ("block: restore two stage elevator switch while
running nr_hw_queue update") reintroduced a lockdep warning by calling
blk_mq_freeze_queue_nomemsave() before switching the I/O scheduler.

The function blk_mq_elv_switch_none() calls elevator_change_done().
Running this while the queue is frozen causes a lockdep warning.

Fix this by reordering the operations: first, switch the I/O scheduler
to 'none', and then freeze the queue. This ensures that elevator_change_done()
is not called on an already frozen queue. And this way is safe because
elevator_set_none() does freeze queue before switching to none.

Also we still have to rely on blk_mq_elv_switch_back() for switching
back, and it has to cover unfrozen queue case.

Cc: Nilay Shroff <nilay@linux.ibm.com>
Cc: Yu Kuai <yukuai3@huawei.com>
Fixes: 5989bfe6ac ("block: restore two stage elevator switch while running nr_hw_queue update")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Link: https://lore.kernel.org/r/20250815131737.331692-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-08-21 05:34:19 -06:00
Christoph Hellwig
2729a60bbf block: don't silently ignore metadata for sync read/write
The block fops don't try to handle metadata for synchronous requests,
probably because the completion handler looks at dio->iocb which is not
valid for synchronous requests.

But silently ignoring metadata (or warning in case of
__blkdev_direct_IO_simple) is a really bad idea as that can cause
silent data corruption if a user ever shows up.

Instead, handle metadata for synchronous requests. The completion
handler can simply check bio_integrity(): the block layer default
integrity will already have been freed at this point, so bio_integrity()
will only return true for user mapped integrity.

Fixes: 3d8b5a22d4 ("block: add support to pass user meta buffer")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/20250819082517.2038819-3-hch@lst.de
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-08-20 11:13:01 +02:00
Christoph Hellwig
d072148a86 fs: add a FMODE_ flag to indicate IOCB_HAS_METADATA availability
Currently the kernel will happily route io_uring requests with metadata
to file operations that don't support it.  Add a FMODE_ flag to guard
that.

Fixes: 4de2ce04c8 ("fs: introduce IOCB_HAS_METADATA for metadata")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/20250819082517.2038819-2-hch@lst.de
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-08-20 11:12:58 +02:00
Christoph Hellwig
d0a2b527d8 block: tone down bio_check_eod
bdev_nr_sectors() == 0 is a pattern used for block devices that have
been hot removed, don't spam the log about them.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20250818101102.1604551-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-08-18 13:27:05 -06:00
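
The behaviour described in the bio_check_eod commit above can be sketched as a range check that still fails the request but stays quiet when the device reports zero sectors. `fake_check_eod()` and its parameters are hypothetical; the real function operates on a struct bio and logs through the kernel's rate-limited printk helpers.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Returns 0 if [sector, sector + nr_sectors) fits inside a device of
 * 'maxsector' sectors, and -1 otherwise. '*warn' reports whether a log
 * message would be emitted: not for a zero-sized (hot removed) device.
 */
int fake_check_eod(uint64_t maxsector, uint64_t sector,
		   uint32_t nr_sectors, bool *warn)
{
	*warn = false;
	if (nr_sectors &&
	    (nr_sectors > maxsector || sector > maxsector - nr_sectors)) {
		*warn = (maxsector != 0);
		return -1;
	}
	return 0;
}
```

The subtraction is only evaluated when `nr_sectors <= maxsector`, so it cannot wrap around.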
Christoph Hellwig
f4ae174403 block: remove newlines from the warnings in blk_validate_integrity_limits
Otherwise they are very hard to read in the kernel log.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Anuj Gupta <anuj20.g@samsung.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20250818045456.1482889-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-08-18 10:17:49 -06:00
Christoph Hellwig
61ca3b891b block: handle pi_tuple_size in queue_limits_stack_integrity
queue_limits_stack_integrity needs to handle the new pi_tuple_size field,
otherwise stacking PI-capable devices will always fail.

Fixes: 76e45252a4 ("block: introduce pi_tuple_size field in blk_integrity")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Anuj Gupta <anuj20.g@samsung.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20250818045456.1482889-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-08-18 10:17:49 -06:00
Julian Sun
8f5845e074 block: restore default wbt enablement
The commit 245618f8e4 ("block: protect wbt_lat_usec using
q->elevator_lock") protected wbt_enable_default() with
q->elevator_lock; however, it also placed wbt_enable_default()
before blk_queue_flag_set(QUEUE_FLAG_REGISTERED, q);, resulting
in wbt failing to be enabled.

Moreover, the protection of wbt_enable_default() by q->elevator_lock
was removed in commit 78c271344b ("block: move wbt_enable_default()
out of queue freezing from sched ->exit()"), so we can directly fix
this issue by placing wbt_enable_default() after
blk_queue_flag_set(QUEUE_FLAG_REGISTERED, q);.

Additionally, this issue also causes the inability to read the
wbt_lat_usec file, and the scenario is as follows:

root@q:/sys/block/sda/queue# cat wbt_lat_usec
cat: wbt_lat_usec: Invalid argument

root@q:/data00/sjc/linux# ls /sys/kernel/debug/block/sda/rqos
cannot access '/sys/kernel/debug/block/sda/rqos': No such file or directory

root@q:/data00/sjc/linux# find /sys -name wbt
/sys/kernel/debug/tracing/events/wbt

After testing with this patch, wbt can be enabled normally.

Signed-off-by: Julian Sun <sunjunchao@bytedance.com>
Cc: stable@vger.kernel.org
Fixes: 245618f8e4 ("block: protect wbt_lat_usec using q->elevator_lock")
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250812154257.57540-1-sunjunchao@bytedance.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-08-13 05:33:48 -06:00
Al Viro
4fc8728aa3 block: switch ->getgeo() to struct gendisk
Instances are happier that way and it makes more sense anyway -
the only part of the result that is related to partition we are given
is the start sector, and that has been filled in by the caller.

Everything else is a function of the disk.  Only one instance
(DASD) is ever looking at anything other than bdev->bd_disk and
that one is trivial to adjust.

Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-08-13 02:59:29 -04:00
Tang Yizhou
bccdfcd56d blk-wbt: Eliminate ambiguity in the comments of struct rq_wb
In the current implementation, the last_issue and last_comp members of
struct rq_wb are used only by read requests and not by non-throttled write
requests. Therefore, eliminate the ambiguity here.

Signed-off-by: Tang Yizhou <yizhou.tang@shopee.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20250727173959.160835-3-yizhou.tang@shopee.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-08-11 10:21:38 -06:00
Tang Yizhou
d8b96a7962 blk-wbt: Optimize wbt_done() for non-throttled writes
In the current implementation, the sync_cookie and last_cookie members of
struct rq_wb are used only by read requests and not by non-throttled write
requests. Based on this, we can optimize wbt_done() by removing one if
condition check for non-throttled write requests.

Signed-off-by: Tang Yizhou <yizhou.tang@shopee.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20250727173959.160835-2-yizhou.tang@shopee.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-08-11 10:21:38 -06:00
Zheng Qixing
343dc5423b block: fix kobject double initialization in add_disk
Device-mapper can call add_disk() multiple times for the same gendisk
due to its two-phase creation process (dm create + dm load). This leads
to kobject double initialization errors when the underlying iSCSI devices
become temporarily unavailable and then reappear.

However, if the first add_disk() call fails and is retried, the queue_kobj
gets initialized twice, causing:

kobject: kobject (ffff88810c27bb90): tried to init an initialized object,
something is seriously wrong.
 Call Trace:
  <TASK>
  dump_stack_lvl+0x5b/0x80
  kobject_init.cold+0x43/0x51
  blk_register_queue+0x46/0x280
  add_disk_fwnode+0xb5/0x280
  dm_setup_md_queue+0x194/0x1c0
  table_load+0x297/0x2d0
  ctl_ioctl+0x2a2/0x480
  dm_ctl_ioctl+0xe/0x20
  __x64_sys_ioctl+0xc7/0x110
  do_syscall_64+0x72/0x390
  entry_SYSCALL_64_after_hwframe+0x76/0x7e

Fix this by separating kobject initialization from sysfs registration:
 - Initialize queue_kobj early during gendisk allocation
 - add_disk() only adds the already-initialized kobject to sysfs
 - del_gendisk() removes from sysfs but doesn't destroy the kobject
 - Final cleanup happens when the disk is released

Fixes: 2bd85221a6 ("block: untangle request_queue refcounting from sysfs")
Reported-by: Li Lingfeng <lilingfeng3@huawei.com>
Closes: https://lore.kernel.org/all/83591d0b-2467-433c-bce0-5581298eb161@huawei.com/
Signed-off-by: Zheng Qixing <zhengqixing@huawei.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Link: https://lore.kernel.org/r/20250808053609.3237836-1-zhengqixing@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-08-11 08:00:49 -06:00
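
The split between one-time initialization and repeatable registration in the add_disk commit above can be modelled in a few lines. `struct fake_disk` and the `fake_*` functions are hypothetical and only track how often each step runs; they are not the gendisk or kobject API.

```c
#include <stdbool.h>

struct fake_disk {
	int init_count;		/* how many times the kobject was initialized */
	bool in_sysfs;		/* whether it is currently registered */
};

/* Allocation time: initialize exactly once. */
void fake_alloc_disk(struct fake_disk *d)
{
	d->init_count = 1;
	d->in_sysfs = false;
}

/* Registration only; may be retried after a failure without re-init. */
int fake_add_disk(struct fake_disk *d, bool fail)
{
	if (fail)
		return -1;
	d->in_sysfs = true;
	return 0;
}

/* Removes from sysfs but leaves the initialized object intact. */
void fake_del_disk(struct fake_disk *d)
{
	d->in_sysfs = false;
}
```

Because registration no longer initializes anything, a failed add followed by a retry leaves the initialization count at one.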
Qianfeng Rong
196447c712 blk-cgroup: remove redundant __GFP_NOWARN
Commit 16f5dfbc85 ("gfp: include __GFP_NOWARN in GFP_NOWAIT") made
GFP_NOWAIT implicitly include __GFP_NOWARN.

Therefore, explicit __GFP_NOWARN combined with GFP_NOWAIT (e.g.,
`GFP_NOWAIT | __GFP_NOWARN`) is now redundant.  Let's clean up these
redundant flags across subsystems.

Signed-off-by: Qianfeng Rong <rongqianfeng@vivo.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20250809141358.168781-1-rongqianfeng@vivo.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-08-11 07:59:40 -06:00
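
Why the explicit flag is redundant in the __GFP_NOWARN cleanup above comes down to bitwise OR being idempotent. The sketch below uses made-up bit values and `FAKE_` names; only the relationship "the composite mask already contains the warning bit" is taken from the commit message.

```c
/* Illustrative bit values, not the kernel's. */
#define FAKE_GFP_KSWAPD_RECLAIM 0x1u
#define FAKE_GFP_NOWARN         0x2u

/* Assumed shape after the referenced change: NOWAIT includes NOWARN. */
#define FAKE_GFP_NOWAIT (FAKE_GFP_KSWAPD_RECLAIM | FAKE_GFP_NOWARN)

/* Old call-site style: ORs in a bit that is already set. */
unsigned int mask_with_explicit_nowarn(void)
{
	return FAKE_GFP_NOWAIT | FAKE_GFP_NOWARN;
}

/* Cleaned-up call-site style. */
unsigned int mask_plain(void)
{
	return FAKE_GFP_NOWAIT;
}
```

Both functions return the same mask, so dropping the explicit flag changes no behaviour.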
Qianfeng Rong
8f3e4e87b0 block, bfq: remove redundant __GFP_NOWARN
Commit 16f5dfbc85 ("gfp: include __GFP_NOWARN in GFP_NOWAIT") made
GFP_NOWAIT implicitly include __GFP_NOWARN.

Therefore, explicit __GFP_NOWARN combined with GFP_NOWAIT (e.g.,
`GFP_NOWAIT | __GFP_NOWARN`) is now redundant.  Let's clean up these
redundant flags across subsystems.

Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Qianfeng Rong <rongqianfeng@vivo.com>
Link: https://lore.kernel.org/r/20250811081135.374315-1-rongqianfeng@vivo.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-08-11 07:59:22 -06:00
Linus Torvalds
2988dfed8a block-6.17-20250808
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmiWLjoQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpvveD/9vbvp3XaF0LagRJLH0fcdhcxL7Z+IHD+7U
 v5vICMeoeBhhhOtPJ0y+h/9LMLQWFYDFl6drkY0atSSxp/CK6CB25qFhIDsoA6Qk
 RBM/qZ64z4Uxvlc+VQmCqI2EMc/ZrYtrcr7jsornwORoTSEKXVHdyO5k7Q9002Sw
 XNWc0bZKIibFlgOk12Wnd8ZS5RWHw1uViUcreojcGVZAVR+BuHNGGoa3xq0bLiHU
 ERbQXfjaN28R+eo4E1euCtdf++7tW2kFjClrDmLcszdb27E2+MWMA6AKMiSTBE2k
 2e2TvJUcGZs1s8atqSIIjBtmwQW3rKws33zODLMONzOP8CIErcaniHxyDSaxJIJr
 kjsdKnwlziL3xVnwQcpgnVOPvvDSKZ4OKEqx8rAuYTqiknpz3uhbt/7EqumuPLHr
 e7Rz0MnFolrVN7KZOHQ5CPJIezkEAOAEpItLdfc5cfLS06pbeTN3j+dJZp+tUohi
 WP/K3l2N3C5pkXA0ilAzshRF20Rwv/09M85BoqWocTLBJY7WqyIKXywCNdX81wkv
 tpbQvp2MpPkJXUIbAh5484BOfCfx9vkYVm2cam2UxXJhR6VfrQCjYfXIjfpqF4jp
 q7xxNesUezrOqB2Q/cKxw8dKOaRtO1XzVnmwutBrcKgqqLezMwUTDDjQYe8l6p1Z
 40E74tsJwQ==
 =EQ7g
 -----END PGP SIGNATURE-----

Merge tag 'block-6.17-20250808' of git://git.kernel.dk/linux

Pull more block updates from Jens Axboe:

 - MD pull request via Yu:
      - mddev null-ptr-dereference fix, by Erkun
      - md-cluster fail to remove the faulty disk regression fix, by
        Heming
      - minor cleanup, by Li Nan and Jinchao
      - mdadm lifetime regression fix reported by syzkaller, by Yu Kuai

 - NVMe pull request via Christoph
      - add support for getting the FDP feature in fabrics passthru path
        (Nitesh Shetty)
      - add capability to connect to an administrative controller
        (Kamaljit Singh)
      - fix a leak on sgl setup error (Keith Busch)
      - initialize discovery subsys after debugfs is initialized
        (Mohamed Khalfella)
      - fix various comment typos (Bjorn Helgaas)
      - remove unneeded semicolons (Jiapeng Chong)

 - nvmet debugfs ordering issue fix

 - Fix UAF in the tag_set in zloop

 - Ensure sbitmap shallow depth covers entire set

 - Reduce lock roundtrips in io context lookup

 - Move scheduler tags alloc/free out of elevator and freeze lock, to
   fix some lockdep found issues

 - Improve robustness of queue limits checking

 - Fix a regression with IO priorities, if no io context exists

* tag 'block-6.17-20250808' of git://git.kernel.dk/linux: (26 commits)
  lib/sbitmap: make sbitmap_get_shallow() internal
  lib/sbitmap: convert shallow_depth from one word to the whole sbitmap
  nvmet: exit debugfs after discovery subsystem exits
  block, bfq: Reorder struct bfq_iocq_bfqq_data
  md: make rdev_addable usable for rcu mode
  md/raid1: remove struct pool_info and related code
  md/raid1: change r1conf->r1bio_pool to a pointer type
  block: ensure discard_granularity is zero when discard is not supported
  zloop: fix KASAN use-after-free of tag set
  block: Fix default IO priority if there is no IO context
  nvme: fix various comment typos
  nvme-auth: remove unneeded semicolon
  nvme-pci: fix leak on sgl setup error
  nvmet: initialize discovery subsys after debugfs is initialized
  nvme: add capability to connect to an administrative controller
  nvmet: add support for FDP in fabrics passthru path
  md: rename recovery_cp to resync_offset
  md/md-cluster: handle REMOVE message earlier
  md: fix create on open mddev lifetime regression
  block: fix potential deadlock while running nr_hw_queue update
  ...
2025-08-09 08:47:28 +03:00
Yu Kuai
42e6c6ce03 lib/sbitmap: convert shallow_depth from one word to the whole sbitmap
Currently elevators record an internal 'async_depth' to throttle
asynchronous requests, and they all calculate shallow_depth based on
sb->shift, on the assumption that sb->shift describes the available
tags in one word.

However, sb->shift does not describe the available tags in the last
word, see __map_depth:

if (index == sb->map_nr - 1)
  return sb->depth - (index << sb->shift);

As a consequence, if the last word is used, more tags can be obtained
than expected. For example, assume nr_requests=256 and there are four
words; in the worst case, if the user sets nr_requests=32, then the first
word is also the last word, and still using bits per word, which is 64,
to calculate async_depth is wrong.

On the other hand, due to cgroup QoS, bfq can allow only one request
to be allocated, and setting shallow_depth=1 will still allow as many
requests to be allocated as there are words.

Fix these problems by applying shallow_depth to the whole sbitmap instead
of per word. Also change kyber, mq-deadline and bfq to follow this.
A new helper __map_depth_with_shallow() is introduced to calculate the
available bits in each word.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20250807032413.1469456-2-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-08-07 06:30:17 -06:00
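
The last-word case quoted from __map_depth in the sbitmap commit above can be reproduced in a standalone sketch. `struct fake_sbitmap`, `fake_map_depth()` and `fake_bits_per_word()` are simplified, hypothetical stand-ins that keep only the three fields the quoted snippet uses; this is not the kernel's struct sbitmap, and the new helper mentioned in the commit is not modelled here.

```c
/* Hypothetical reduction of an sbitmap to the fields used below. */
struct fake_sbitmap {
	unsigned int depth;	/* total number of usable bits */
	unsigned int shift;	/* log2 of bits per full word */
	unsigned int map_nr;	/* number of words */
};

/* What a per-word calculation assumes every word holds. */
unsigned int fake_bits_per_word(const struct fake_sbitmap *sb)
{
	return 1U << sb->shift;
}

/* Usable bits in word 'index'; the last word may be only partly used. */
unsigned int fake_map_depth(const struct fake_sbitmap *sb, unsigned int index)
{
	if (index == sb->map_nr - 1)
		return sb->depth - (index << sb->shift);
	return 1U << sb->shift;
}
```

With depth 32, shift 6 and a single word, `fake_bits_per_word()` yields 64 while `fake_map_depth()` for word 0 yields 32. A limit derived from the former can therefore exceed what the word actually holds.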
Christophe JAILLET
407728da41 block, bfq: Reorder struct bfq_iocq_bfqq_data
The size of struct bfq_iocq_bfqq_data can be reduced by moving a few
fields around.

On x86_64, with allmodconfig, this shrinks the size from 144 to 128
bytes. The main benefit is to reduce the size of struct bfq_io_cq from
1360 to 1232 bytes.

This structure is stored in a dedicated slab cache. So reducing its size
improves cache usage.

Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/79394db1befaa658e8066b8e3348073ce27d9d26.1754119538.git.christophe.jaillet@wanadoo.fr
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-08-04 09:22:44 -06:00
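
How moving fields around shrinks a structure, as in the bfq_iocq_bfqq_data commit above, can be seen with two toy structs. The field names and types are invented for illustration and have nothing to do with the actual bfq layout; exact sizes depend on the ABI, so the sketch only relies on the reordered struct being smaller.

```c
#include <stddef.h>
#include <stdint.h>

/* Small fields interleaved with 8-byte fields: padding after each one. */
struct loose_layout {
	uint8_t  a;
	uint64_t b;
	uint8_t  c;
	uint64_t d;
};

/* Same fields, large ones first: the small ones share one padded tail. */
struct tight_layout {
	uint64_t b;
	uint64_t d;
	uint8_t  a;
	uint8_t  c;
};

size_t loose_size(void)
{
	return sizeof(struct loose_layout);
}

size_t tight_size(void)
{
	return sizeof(struct tight_layout);
}
```

On a typical x86_64 ABI the first layout occupies 32 bytes and the second 24, although other ABIs may give different absolute numbers.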
Linus Torvalds
beace86e61 Summary of significant series in this pull request:
- The 4 patch series "mm: ksm: prevent KSM from breaking merging of new
   VMAs" from Lorenzo Stoakes addresses an issue with KSM's
   PR_SET_MEMORY_MERGE mode: newly mapped VMAs were not eligible for
   merging with existing adjacent VMAs.
 
 - The 4 patch series "mm/damon: introduce DAMON_STAT for simple and
   practical access monitoring" from SeongJae Park adds a new kernel module
   which simplifies the setup and usage of DAMON in production
   environments.
 
 - The 6 patch series "stop passing a writeback_control to swap/shmem
   writeout" from Christoph Hellwig is a cleanup to the writeback code
   which removes a couple of pointers from struct writeback_control.
 
 - The 7 patch series "drivers/base/node.c: optimization and cleanups"
   from Donet Tom contains largely uncorrelated cleanups to the NUMA node
   setup and management code.
 
 - The 4 patch series "mm: userfaultfd: assorted fixes and cleanups" from
   Tal Zussman does some maintenance work on the userfaultfd code.
 
 - The 5 patch series "Readahead tweaks for larger folios" from Ryan
   Roberts implements some tuneups for pagecache readahead when it is
   reading into order>0 folios.
 
 - The 4 patch series "selftests/mm: Tweaks to the cow test" from Mark
   Brown provides some cleanups and consistency improvements to the
   selftests code.
 
 - The 4 patch series "Optimize mremap() for large folios" from Dev Jain
   does that.  A 37% reduction in execution time was measured in a
   memset+mremap+munmap microbenchmark.
 
 - The 5 patch series "Remove zero_user()" from Matthew Wilcox expunges
   zero_user() in favor of the more modern memzero_page().
 
 - The 3 patch series "mm/huge_memory: vmf_insert_folio_*() and
   vmf_insert_pfn_pud() fixes" from David Hildenbrand addresses some warts
   which David noticed in the huge page code.  These were not known to be
   causing any issues at this time.
 
 - The 3 patch series "mm/damon: use alloc_migrate_target() for
   DAMOS_MIGRATE_{HOT,COLD}" from SeongJae Park provides some cleanup and
   consolidation work in DAMON.
 
 - The 3 patch series "use vm_flags_t consistently" from Lorenzo Stoakes
   uses vm_flags_t in places where we were inappropriately using other
   types.
 
 - The 3 patch series "mm/memfd: Reserve hugetlb folios before
   allocation" from Vivek Kasireddy increases the reliability of large page
   allocation in the memfd code.
 
 - The 14 patch series "mm: Remove pXX_devmap page table bit and pfn_t
   type" from Alistair Popple removes several now-unneeded PFN_* flags.
 
 - The 5 patch series "mm/damon: decouple sysfs from core" from SeongJae
   Park implements some cleanup and maintainability work in the DAMON
   sysfs layer.
 
 - The 5 patch series "madvise cleanup" from Lorenzo Stoakes does quite a
   lot of cleanup/maintenance work in the madvise() code.
 
 - The 4 patch series "madvise anon_name cleanups" from Vlastimil Babka
   provides additional cleanups on top of Lorenzo's effort.
 
 - The 11 patch series "Implement numa node notifier" from Oscar Salvador
   creates a standalone notifier for NUMA node memory state changes.
   Previously these were lumped under the more general memory on/offline
   notifier.
 
 - The 6 patch series "Make MIGRATE_ISOLATE a standalone bit" from Zi Yan
   cleans up the pageblock isolation code and fixes a potential issue which
   doesn't seem to cause any problems in practice.
 
 - The 5 patch series "selftests/damon: add python and drgn based DAMON
   sysfs functionality tests" from SeongJae Park adds additional drgn- and
   python-based DAMON selftests which are more comprehensive than the
   existing selftest suite.
 
 - The 5 patch series "Misc rework on hugetlb faulting path" from Oscar
   Salvador fixes a rather obscure deadlock in the hugetlb fault code and
   follows that fix with a series of cleanups.
 
 - The 3 patch series "cma: factor out allocation logic from
   __cma_declare_contiguous_nid" from Mike Rapoport rationalizes and cleans
   up the highmem-specific code in the CMA allocator.
 
 - The 28 patch series "mm/migration: rework movable_ops page migration
   (part 1)" from David Hildenbrand provides cleanups and
   future-preparedness to the migration code.
 
 - The 2 patch series "mm/damon: add trace events for auto-tuned
   monitoring intervals and DAMOS quota" from SeongJae Park adds some
   tracepoints to some DAMON auto-tuning code.
 
 - The 6 patch series "mm/damon: fix misc bugs in DAMON modules" from
   SeongJae Park does that.
 
 - The 6 patch series "mm/damon: misc cleanups" from SeongJae Park also
   does what it claims.
 
 - The 4 patch series "mm: folio_pte_batch() improvements" from David
   Hildenbrand cleans up the large folio PTE batching code.
 
 - The 13 patch series "mm/damon/vaddr: Allow interleaving in
   migrate_{hot,cold} actions" from SeongJae Park facilitates dynamic
   alteration of DAMON's inter-node allocation policy.
 
 - The 3 patch series "Remove unmap_and_put_page()" from Vishal Moola
   provides a couple of page->folio conversions.
 
 - The 4 patch series "mm: per-node proactive reclaim" from Davidlohr
   Bueso implements a per-node control of proactive reclaim - beyond the
   current memcg-based implementation.
 
 - The 14 patch series "mm/damon: remove damon_callback" from SeongJae
   Park replaces the damon_callback interface with a more general and
   powerful damon_call()+damos_walk() interface.
 
 - The 10 patch series "mm/mremap: permit mremap() move of multiple VMAs"
   from Lorenzo Stoakes implements a number of mremap cleanups (of course)
   in preparation for adding new mremap() functionality: newly permit the
   remapping of multiple VMAs when the user is specifying MREMAP_FIXED.  It
   still excludes some specialized situations where this cannot be
   performed reliably.
 
 - The 3 patch series "drop hugetlb_free_pgd_range()" from Anthony Yznaga
   switches some sparc hugetlb code over to the generic version and removes
   the thus-unneeded hugetlb_free_pgd_range().
 
 - The 4 patch series "mm/damon/sysfs: support periodic and automated
   stats update" from SeongJae Park augments the present
   userspace-requested update of DAMON sysfs monitoring files.  Automatic
   update is now provided, along with a tunable to control the update
   interval.
 
 - The 4 patch series "Some randome fixes and cleanups to swapfile" from
   Kemeng Shi does what it claims.
 
 - The 4 patch series "mm: introduce snapshot_page" from Luiz Capitulino
   and David Hildenbrand provides (and uses) a means by which debug-style
   functions can grab a copy of a pageframe and inspect it locklessly
   without tripping over the races inherent in operating on the live
   pageframe directly.
 
 - The 6 patch series "use per-vma locks for /proc/pid/maps reads" from
   Suren Baghdasaryan addresses the large contention issues which can be
   triggered by reads from that procfs file.  Latencies are reduced by more
   than half in some situations.  The series also introduces several new
   selftests for the /proc/pid/maps interface.
 
 - The 6 patch series "__folio_split() clean up" from Zi Yan cleans up
   __folio_split()!
 
 - The 7 patch series "Optimize mprotect() for large folios" from Dev
   Jain provides some quite large (>3x) speedups to mprotect() when dealing
   with large folios.
 
 - The 2 patch series "selftests/mm: reuse FORCE_READ to replace "asm
   volatile("" : "+r" (XXX));" and some cleanup" from wang lian does some
   cleanup work in the selftests code.
 
 - The 3 patch series "tools/testing: expand mremap testing" from Lorenzo
   Stoakes extends the mremap() selftest in several ways, including adding
   more checking of Lorenzo's recently added "permit mremap() move of
   multiple VMAs" feature.
 
 - The 22 patch series "selftests/damon/sysfs.py: test all parameters"
   from SeongJae Park extends the DAMON sysfs interface selftest so that it
   tests all possible user-requested parameters, rather than the present
   minimal subset.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCaIqcCgAKCRDdBJ7gKXxA
 jkVBAQCCn9DR1QP0CRk961ot0cKzOgioSc0aA03DPb2KXRt2kQEAzDAz0ARurFhL
 8BzbvI0c+4tntHLXvIlrC33n9KWAOQM=
 =XsFy
 -----END PGP SIGNATURE-----

Merge tag 'mm-stable-2025-07-30-15-25' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull MM updates from Andrew Morton:
 "As usual, many cleanups. The below blurbiage describes 42 patchsets.
  21 of those are partially or fully cleanup work. "cleans up",
  "cleanup", "maintainability", "rationalizes", etc.

  I never knew the MM code was so dirty.

  "mm: ksm: prevent KSM from breaking merging of new VMAs" (Lorenzo Stoakes)
     addresses an issue with KSM's PR_SET_MEMORY_MERGE mode: newly
     mapped VMAs were not eligible for merging with existing adjacent
     VMAs.

  "mm/damon: introduce DAMON_STAT for simple and practical access monitoring" (SeongJae Park)
     adds a new kernel module which simplifies the setup and usage of
     DAMON in production environments.

  "stop passing a writeback_control to swap/shmem writeout" (Christoph Hellwig)
     is a cleanup to the writeback code which removes a couple of
     pointers from struct writeback_control.

  "drivers/base/node.c: optimization and cleanups" (Donet Tom)
     contains largely uncorrelated cleanups to the NUMA node setup and
     management code.

  "mm: userfaultfd: assorted fixes and cleanups" (Tal Zussman)
     does some maintenance work on the userfaultfd code.

  "Readahead tweaks for larger folios" (Ryan Roberts)
     implements some tuneups for pagecache readahead when it is reading
     into order>0 folios.

  "selftests/mm: Tweaks to the cow test" (Mark Brown)
     provides some cleanups and consistency improvements to the
     selftests code.

  "Optimize mremap() for large folios" (Dev Jain)
     does that. A 37% reduction in execution time was measured in a
     memset+mremap+munmap microbenchmark.

  "Remove zero_user()" (Matthew Wilcox)
     expunges zero_user() in favor of the more modern memzero_page().

  "mm/huge_memory: vmf_insert_folio_*() and vmf_insert_pfn_pud() fixes" (David Hildenbrand)
     addresses some warts which David noticed in the huge page code.
     These were not known to be causing any issues at this time.

  "mm/damon: use alloc_migrate_target() for DAMOS_MIGRATE_{HOT,COLD}" (SeongJae Park)
     provides some cleanup and consolidation work in DAMON.

  "use vm_flags_t consistently" (Lorenzo Stoakes)
     uses vm_flags_t in places where we were inappropriately using other
     types.

  "mm/memfd: Reserve hugetlb folios before allocation" (Vivek Kasireddy)
     increases the reliability of large page allocation in the memfd
     code.

  "mm: Remove pXX_devmap page table bit and pfn_t type" (Alistair Popple)
     removes several now-unneeded PFN_* flags.

  "mm/damon: decouple sysfs from core" (SeongJae Park)
     implements some cleanup and maintainability work in the DAMON
     sysfs layer.

  "madvise cleanup" (Lorenzo Stoakes)
     does quite a lot of cleanup/maintenance work in the madvise() code.

  "madvise anon_name cleanups" (Vlastimil Babka)
     provides additional cleanups on top of Lorenzo's effort.

  "Implement numa node notifier" (Oscar Salvador)
     creates a standalone notifier for NUMA node memory state changes.
     Previously these were lumped under the more general memory
     on/offline notifier.

  "Make MIGRATE_ISOLATE a standalone bit" (Zi Yan)
     cleans up the pageblock isolation code and fixes a potential issue
     which doesn't seem to cause any problems in practice.

  "selftests/damon: add python and drgn based DAMON sysfs functionality tests" (SeongJae Park)
     adds additional drgn- and python-based DAMON selftests which are
     more comprehensive than the existing selftest suite.

  "Misc rework on hugetlb faulting path" (Oscar Salvador)
     fixes a rather obscure deadlock in the hugetlb fault code and
     follows that fix with a series of cleanups.

  "cma: factor out allocation logic from __cma_declare_contiguous_nid" (Mike Rapoport)
     rationalizes and cleans up the highmem-specific code in the CMA
     allocator.

  "mm/migration: rework movable_ops page migration (part 1)" (David Hildenbrand)
     provides cleanups and future-preparedness to the migration code.

  "mm/damon: add trace events for auto-tuned monitoring intervals and DAMOS quota" (SeongJae Park)
     adds some tracepoints to some DAMON auto-tuning code.

  "mm/damon: fix misc bugs in DAMON modules" (SeongJae Park)
     does that.

  "mm/damon: misc cleanups" (SeongJae Park)
     also does what it claims.

  "mm: folio_pte_batch() improvements" (David Hildenbrand)
     cleans up the large folio PTE batching code.

  "mm/damon/vaddr: Allow interleaving in migrate_{hot,cold} actions" (SeongJae Park)
     facilitates dynamic alteration of DAMON's inter-node allocation
     policy.

  "Remove unmap_and_put_page()" (Vishal Moola)
     provides a couple of page->folio conversions.

  "mm: per-node proactive reclaim" (Davidlohr Bueso)
     implements a per-node control of proactive reclaim - beyond the
     current memcg-based implementation.

  "mm/damon: remove damon_callback" (SeongJae Park)
     replaces the damon_callback interface with a more general and
     powerful damon_call()+damos_walk() interface.

  "mm/mremap: permit mremap() move of multiple VMAs" (Lorenzo Stoakes)
     implements a number of mremap cleanups (of course) in preparation
     for adding new mremap() functionality: newly permit the remapping
     of multiple VMAs when the user is specifying MREMAP_FIXED. It still
     excludes some specialized situations where this cannot be performed
     reliably.

  "drop hugetlb_free_pgd_range()" (Anthony Yznaga)
     switches some sparc hugetlb code over to the generic version and
     removes the thus-unneeded hugetlb_free_pgd_range().

  "mm/damon/sysfs: support periodic and automated stats update" (SeongJae Park)
     augments the present userspace-requested update of DAMON sysfs
     monitoring files. Automatic update is now provided, along with a
     tunable to control the update interval.

  "Some randome fixes and cleanups to swapfile" (Kemeng Shi)
     does what it claims.

  "mm: introduce snapshot_page" (Luiz Capitulino and David Hildenbrand)
     provides (and uses) a means by which debug-style functions can grab
     a copy of a pageframe and inspect it locklessly without tripping
     over the races inherent in operating on the live pageframe
     directly.

  "use per-vma locks for /proc/pid/maps reads" (Suren Baghdasaryan)
     addresses the large contention issues which can be triggered by
     reads from that procfs file. Latencies are reduced by more than
     half in some situations. The series also introduces several new
     selftests for the /proc/pid/maps interface.

  "__folio_split() clean up" (Zi Yan)
     cleans up __folio_split()!

  "Optimize mprotect() for large folios" (Dev Jain)
     provides some quite large (>3x) speedups to mprotect() when dealing
     with large folios.

  "selftests/mm: reuse FORCE_READ to replace "asm volatile("" : "+r" (XXX));" and some cleanup" (wang lian)
     does some cleanup work in the selftests code.

  "tools/testing: expand mremap testing" (Lorenzo Stoakes)
     extends the mremap() selftest in several ways, including adding
     more checking of Lorenzo's recently added "permit mremap() move of
     multiple VMAs" feature.

  "selftests/damon/sysfs.py: test all parameters" (SeongJae Park)
     extends the DAMON sysfs interface selftest so that it tests all
     possible user-requested parameters, rather than the present minimal
     subset"

* tag 'mm-stable-2025-07-30-15-25' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (370 commits)
  MAINTAINERS: add missing headers to mempory policy & migration section
  MAINTAINERS: add missing file to cgroup section
  MAINTAINERS: add MM MISC section, add missing files to MISC and CORE
  MAINTAINERS: add missing zsmalloc file
  MAINTAINERS: add missing files to page alloc section
  MAINTAINERS: add missing shrinker files
  MAINTAINERS: move memremap.[ch] to hotplug section
  MAINTAINERS: add missing mm_slot.h file THP section
  MAINTAINERS: add missing interval_tree.c to memory mapping section
  MAINTAINERS: add missing percpu-internal.h file to per-cpu section
  mm/page_alloc: remove trace_mm_alloc_contig_migrate_range_info()
  selftests/damon: introduce _common.sh to host shared function
  selftests/damon/sysfs.py: test runtime reduction of DAMON parameters
  selftests/damon/sysfs.py: test non-default parameters runtime commit
  selftests/damon/sysfs.py: generalize DAMON context commit assertion
  selftests/damon/sysfs.py: generalize monitoring attributes commit assertion
  selftests/damon/sysfs.py: generalize DAMOS schemes commit assertion
  selftests/damon/sysfs.py: test DAMOS filters commitment
  selftests/damon/sysfs.py: generalize DAMOS scheme commit assertion
  selftests/damon/sysfs.py: test DAMOS destinations commitment
  ...
2025-07-31 14:57:54 -07:00
Christoph Hellwig
fad6551fcf block: ensure discard_granularity is zero when discard is not supported
Documentation/ABI/stable/sysfs-block states:

  What: /sys/block/<disk>/queue/discard_granularity
  [...]
  A discard_granularity of 0 means that the device does not support
  discard functionality.

but this got broken when sorting out the block limits updates.  Fix this
by setting the discard_granularity limit to zero when the combined
max_discard_sectors is zero.

Fixes: 3c407dc723 ("block: default the discard granularity to sector size")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20250731152228.873923-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-07-31 15:01:35 -06:00
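
The rule restored by the discard_granularity commit above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the kernel's limits validation code: struct toy_limits, its field names and toy_validate_discard() are hypothetical, and the "combined" limit is assumed here to be the smaller of a hardware value and a user-set cap.

```c
#include <assert.h>

/* Hypothetical subset of queue limits, named for this example only. */
struct toy_limits {
	unsigned int max_hw_discard_sectors;   /* what the device reports */
	unsigned int max_user_discard_sectors; /* cap set by the user */
	unsigned int max_discard_sectors;      /* combined, effective limit */
	unsigned int discard_granularity;      /* in bytes; 0 = no discard */
};

static unsigned int toy_min(unsigned int a, unsigned int b)
{
	return a < b ? a : b;
}

/*
 * Combine the two discard limits (assumed: take the minimum) and make sure
 * a device that cannot discard also advertises a granularity of zero.
 */
static void toy_validate_discard(struct toy_limits *lim)
{
	assert(lim);

	lim->max_discard_sectors = toy_min(lim->max_hw_discard_sectors,
					   lim->max_user_discard_sectors);
	if (!lim->max_discard_sectors)
		lim->discard_granularity = 0;
}
```

A reader of the granularity value can then treat zero as "discard not supported", which is what the quoted sysfs ABI text promises.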
Nilay Shroff
04225d13ae block: fix potential deadlock while running nr_hw_queue update
Move scheduler tags (sched_tags) allocation and deallocation outside
both the ->elevator_lock and ->freeze_lock when updating nr_hw_queues.
This change breaks the dependency chain from the percpu allocator lock
to the elevator lock, helping to prevent potential deadlocks, as
observed in the reported lockdep splat[1].

This commit introduces batch allocation and deallocation helpers for
sched_tags, which are now used from within the
__blk_mq_update_nr_hw_queues() routine while iterating through the tagset.

With this change, all sched_tags memory management is handled entirely
outside the ->elevator_lock and the ->freeze_lock context, thereby
eliminating the lock dependency that could otherwise manifest during
nr_hw_queues updates.

[1] https://lore.kernel.org/all/0659ea8d-a463-47c8-9180-43c719e106eb@linux.ibm.com/

Reported-by: Stefan Haberland <sth@linux.ibm.com>
Closes: https://lore.kernel.org/all/0659ea8d-a463-47c8-9180-43c719e106eb@linux.ibm.com/
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
Link: https://lore.kernel.org/r/20250730074614.2537382-4-nilay@linux.ibm.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-07-30 06:20:51 -06:00
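
The locking pattern described in the nr_hw_queues commit above (allocate and free outside the lock, and only swap pointers while holding it) can be sketched in userspace C with a pthread mutex. This is an analogy under stated assumptions, not the blk-mq code: struct toy_queue, toy_resize_tags() and the use of calloc() in place of the kernel allocators are all hypothetical.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical queue; `lock` stands in for ->elevator_lock. */
struct toy_queue {
	pthread_mutex_t lock;
	int *tags;
	size_t nr_tags;
};

/*
 * Replace the tag array with one of nr_tags entries (nr_tags must be > 0).
 * Returns 0 on success and -1 if the allocation fails.
 */
static int toy_resize_tags(struct toy_queue *q, size_t nr_tags)
{
	int *new_tags, *old_tags;

	assert(q && nr_tags > 0);

	/* 1. Allocate before taking the lock, so the allocator's own
	 *    locks are never acquired while q->lock is held. */
	new_tags = calloc(nr_tags, sizeof(*new_tags));
	if (!new_tags)
		return -1;

	/* 2. Only the cheap pointer swap happens under the lock. */
	pthread_mutex_lock(&q->lock);
	old_tags = q->tags;
	q->tags = new_tags;
	q->nr_tags = nr_tags;
	pthread_mutex_unlock(&q->lock);

	/* 3. Free the old array after the lock has been dropped. */
	free(old_tags);
	return 0;
}
```

Because no allocation or free runs inside the critical section, no ordering between the allocator's internal locks and q->lock is ever established, which is the kind of dependency the lockdep report pointed at.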