linux/block
Chaitanya Kulkarni daa6c79858 block: clear BIO_QOS flags in blk_steal_bios()
When a bio goes through the rq_qos infrastructure on a path's request
queue, it gets BIO_QOS_THROTTLED or BIO_QOS_MERGED flags set. These
flags indicate that rq_qos_done_bio() should be called on completion
to update rq_qos accounting.

During path failover in nvme_failover_req(), the bio's bi_bdev is
redirected from the failed path's disk to the multipath head's disk
via bio_set_dev(). However, the BIO_QOS flags are not cleared.

When the bio eventually completes (either successfully via a new path
or with an error via bio_io_error()), rq_qos_done_bio() checks for
these flags and calls __rq_qos_done_bio(q->rq_qos, bio) where q is
obtained from the bio's current bi_bdev - which is now the multipath
head's queue, not the original path's queue.

The multipath head's queue does not have rq_qos enabled (q->rq_qos is
NULL), but the code assumes that if BIO_QOS_* flags are set, q->rq_qos
must be valid.

This breaks when a bio is moved between queues during NVMe multipath
failover, leading to a NULL pointer dereference.

Execution Context timeline :-

   * =====> dd process context
   [USER] dd process
     [SYSCALL] write() - dd process context
       submit_bio()
       nvme_ns_head_submit_bio() - path selection
       blk_mq_submit_bio()  #### QOS FLAGS SET HERE

        [USER] dd waits or returns

          ==== I/O in flight on NVMe hardware =====

   ===== End of submission path ====
   ------------------------------------------------------

   * dd ====> Interrupt context;
   [IRQ] NVMe completion interrupt
       nvme_irq()
        nvme_complete_rq()
         nvme_failover_req() ### BIO MOVED TO HEAD
            spin_lock_irqsave (atomic section)
            bio_set_dev() changes bi_bdev
            ### BUG: QOS flags NOT cleared
            kblockd_schedule_work()

   * Interrupt context =====> kblockd workqueue
   [WQ] kblockd workqueue - kworker process
       nvme_requeue_work()
        submit_bio_noacct()
         nvme_ns_head_submit_bio()
          nvme_find_path() returns NULL
           bio_io_error()
            bio_endio()
             rq_qos_done_bio()  ### CRASH ###

   KERNEL PANIC / OOPS

Crash from blktests nvme/058 (rapid namespace remapping):

[ 1339.636033] BUG: kernel NULL pointer dereference, address: 0000000000000000
[ 1339.641025] nvme nvme4: rescanning namespaces.
[ 1339.642064] #PF: supervisor read access in kernel mode
[ 1339.642067] #PF: error_code(0x0000) - not-present page
[ 1339.642070] PGD 0 P4D 0
[ 1339.642073] Oops: Oops: 0000 [#1] SMP NOPTI
[ 1339.642078] CPU: 35 UID: 0 PID: 4579 Comm: kworker/35:2H
               Tainted: G   O     N  6.17.0-rc3nvme+ #5 PREEMPT(voluntary)
[ 1339.642084] Tainted: [O]=OOT_MODULE, [N]=TEST
[ 1339.673446] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
           BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
[ 1339.682359] Workqueue: kblockd nvme_requeue_work [nvme_core]
[ 1339.686613] RIP: 0010:__rq_qos_done_bio+0xd/0x40
[ 1339.690161] Code: 75 dd 5b 5d 41 5c c3 cc cc cc cc 66 90 90 90 90 90 90 90
                     90 90 90 90 90 90 90 90 90 90 0f 1f 44 00 00 55 48 89 f5
             53 48 89 fb <48> 8b 03 48 8b 40 30 48 85 c0 74 0b 48 89 ee
             48 89 df ff d0 0f 1f
[ 1339.703691] RSP: 0018:ffffc900066f3c90 EFLAGS: 00010202
[ 1339.706844] RAX: ffff888148b9ef00 RBX: 0000000000000000 RCX: 0000000000000000
[ 1339.711136] RDX: 00000000000001c0 RSI: ffff8882aaab8a80 RDI: 0000000000000000
[ 1339.715691] RBP: ffff8882aaab8a80 R08: 0000000000000000 R09: 0000000000000000
[ 1339.720472] R10: 0000000000000000 R11: fefefefefefefeff R12: ffff8882aa3b6010
[ 1339.724650] R13: 0000000000000000 R14: ffff8882338bcef0 R15: ffff8882aa3b6020
[ 1339.729029] FS:  0000000000000000(0000) GS:ffff88985c0cf000(0000) knlGS:0000000000000000
[ 1339.734525] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1339.738563] CR2: 0000000000000000 CR3: 0000000111045000 CR4: 0000000000350ef0
[ 1339.742750] DR0: ffffffff845ccbec DR1: ffffffff845ccbed DR2: ffffffff845ccbee
[ 1339.745630] DR3: ffffffff845ccbef DR6: 00000000ffff0ff0 DR7: 0000000000000600
[ 1339.748488] Call Trace:
[ 1339.749512]  <TASK>
[ 1339.750449]  bio_endio+0x71/0x2e0
[ 1339.751833]  nvme_ns_head_submit_bio+0x290/0x320 [nvme_core]
[ 1339.754073]  __submit_bio+0x222/0x5e0
[ 1339.755623]  ? rcu_is_watching+0xd/0x40
[ 1339.757201]  ? submit_bio_noacct_nocheck+0x131/0x370
[ 1339.759210]  submit_bio_noacct_nocheck+0x131/0x370
[ 1339.761189]  ? submit_bio_noacct+0x20/0x620
[ 1339.762849]  nvme_requeue_work+0x4b/0x60 [nvme_core]
[ 1339.764828]  process_one_work+0x20e/0x630
[ 1339.766528]  worker_thread+0x184/0x330
[ 1339.768129]  ? __pfx_worker_thread+0x10/0x10
[ 1339.769942]  kthread+0x10a/0x250
[ 1339.771263]  ? __pfx_kthread+0x10/0x10
[ 1339.772776]  ? __pfx_kthread+0x10/0x10
[ 1339.774381]  ret_from_fork+0x273/0x2e0
[ 1339.775948]  ? __pfx_kthread+0x10/0x10
[ 1339.777504]  ret_from_fork_asm+0x1a/0x30
[ 1339.779163]  </TASK>

Fix this by clearing both BIO_QOS_THROTTLED and BIO_QOS_MERGED flags
when bios are redirected to the multipath head in nvme_failover_req().
This is consistent with the existing code that clears REQ_POLLED and
REQ_NOWAIT flags when the bio changes queues.

Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://patch.msgid.link/20260226031243.87200-3-kch@nvidia.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-03-10 07:11:09 -06:00
..
partitions Convert 'alloc_obj' family to use the new default GFP_KERNEL argument 2026-02-21 17:09:51 -08:00
badblocks.c badblocks: Fix a nonsense WARN_ON() which checks whether a u64 variable < 0 2025-03-10 07:41:58 -06:00
bdev.c block: remove redundant kill_bdev() call in set_blocksize() 2026-02-04 09:28:18 -07:00
bfq-cgroup.c treewide: Replace kmalloc with kmalloc_obj for non-scalar types 2026-02-21 01:02:28 -08:00
bfq-iosched.c treewide: Replace kmalloc with kmalloc_obj for non-scalar types 2026-02-21 01:02:28 -08:00
bfq-iosched.h block, bfq: update outdated comment 2026-01-01 08:57:37 -07:00
bfq-wf2q.c
bio-integrity-auto.c block: prepare generation / verification helpers for fs usage 2026-03-09 07:47:02 -06:00
bio-integrity-fs.c block: add fs_bio_integrity helpers 2026-03-09 07:47:02 -06:00
bio-integrity.c block: factor out a bio_integrity_setup_default helper 2026-03-09 07:47:02 -06:00
bio.c Merge branch 'for-7.1/block-integrity' into for-7.1/block 2026-03-09 14:30:14 -06:00
blk-cgroup-fc-appid.c
blk-cgroup-rwstat.c blk-cgroup: use group allocation/free of per-cpu counters API 2024-04-03 09:10:17 -06:00
blk-cgroup-rwstat.h blk-cgroup: rwstat: fix kernel-doc warnings in header file 2025-01-13 07:47:09 -07:00
blk-cgroup.c Convert 'alloc_obj' family to use the new default GFP_KERNEL argument 2026-02-21 17:09:51 -08:00
blk-cgroup.h block: initialize bio issue time in blk_mq_submit_bio() 2025-09-10 05:23:45 -06:00
blk-core.c blk-mq: add a new queue sysfs attribute async_depth 2026-02-03 07:45:36 -07:00
blk-crypto-fallback.c Convert remaining multi-line kmalloc_obj/flex GFP_KERNEL uses 2026-02-22 08:26:33 -08:00
blk-crypto-internal.h blk-crypto: handle the fallback above the block layer 2026-01-11 12:55:41 -07:00
blk-crypto-profile.c Convert more 'alloc_obj' cases to default GFP_KERNEL arguments 2026-02-21 20:03:00 -08:00
blk-crypto-sysfs.c Convert 'alloc_obj' family to use the new default GFP_KERNEL argument 2026-02-21 17:09:51 -08:00
blk-crypto.c blk-crypto: handle the fallback above the block layer 2026-01-11 12:55:41 -07:00
blk-flush.c block: pass io_comp_batch to rq_end_io_fn callback 2026-01-20 10:12:54 -07:00
blk-ia-ranges.c block: get rid of request queue ->sysfs_dir_lock 2025-01-29 07:16:47 -07:00
blk-integrity.c block: don't merge bios with different app_tags 2026-01-06 19:10:08 -07:00
blk-ioc.c copy_process: pass clone_flags as u64 across calltree 2025-09-01 15:31:34 +02:00
blk-iocost.c Convert 'alloc_obj' family to use the new default GFP_KERNEL argument 2026-02-21 17:09:51 -08:00
blk-iolatency.c Convert 'alloc_obj' family to use the new default GFP_KERNEL argument 2026-02-21 17:09:51 -08:00
blk-ioprio.c treewide: Replace kmalloc with kmalloc_obj for non-scalar types 2026-02-21 01:02:28 -08:00
blk-ioprio.h blk-ioprio: remove per-disk structure 2024-07-28 16:47:51 -06:00
blk-lib.c block: change return type to void 2026-02-12 04:23:53 -07:00
blk-map.c block-7.0-20260305 2026-03-06 08:36:18 -08:00
blk-merge.c for-7.0/block-stable-pages-20260206 2026-02-09 18:14:52 -08:00
blk-mq-cpumap.c blk-mq: add number of queue calc helper 2025-07-01 10:24:19 -06:00
blk-mq-debugfs.c block: allow submitting all zone writes from a single context 2026-03-09 14:30:00 -06:00
blk-mq-debugfs.h blk-mq-debugfs: remove blk_mq_debugfs_unregister_rqos() 2026-02-02 07:05:19 -07:00
blk-mq-dma.c block: fix partial IOVA mapping cleanup in blk_rq_dma_map_iova 2026-02-12 04:23:31 -07:00
blk-mq-sched.c Convert 'alloc_obj' family to use the new default GFP_KERNEL argument 2026-02-21 17:09:51 -08:00
blk-mq-sched.h blk-mq-sched: unify elevators checking for async requests 2026-02-03 07:45:36 -07:00
blk-mq-sysfs.c blk-mq: Move flush queue allocation into blk_mq_init_hctx() 2025-09-08 08:05:32 -06:00
blk-mq-tag.c blk-mq: use array manage hctx map instead of xarray 2025-11-28 09:09:19 -07:00
blk-mq.c block: clear BIO_QOS flags in blk_steal_bios() 2026-03-10 07:11:09 -06:00
blk-mq.h blk-mq: use queue_hctx in blk_mq_map_queue_type 2025-12-01 07:18:31 -07:00
blk-pm.c block: force noio scope in blk_mq_freeze_queue 2025-01-31 07:20:08 -07:00
blk-pm.h
blk-rq-qos.c blk-mq-debugfs: remove blk_mq_debugfs_unregister_rqos() 2026-02-02 07:05:19 -07:00
blk-rq-qos.h blk-rq-qos: Remove unlikely() hints from QoS checks 2026-01-06 19:08:23 -07:00
blk-settings.c block: make max_integrity_io_size public 2026-03-09 07:47:02 -06:00
blk-stat.c Convert 'alloc_obj' family to use the new default GFP_KERNEL argument 2026-02-21 17:09:51 -08:00
blk-stat.h blk-stat: convert struct blk_stat_callback to kernel-doc 2026-02-16 10:21:06 -07:00
blk-sysfs.c block: default to QD=1 writes for blk-mq rotational zoned devices 2026-03-09 14:30:00 -06:00
blk-throttle.c block/blk-throttle: Remove throtl_slice from struct throtl_data 2025-11-17 09:39:48 -07:00
blk-throttle.h blk-throttle: fix access race during throttle policy activation 2025-09-08 08:24:44 -06:00
blk-timeout.c
blk-wbt.c Convert 'alloc_obj' family to use the new default GFP_KERNEL argument 2026-02-21 17:09:51 -08:00
blk-wbt.h blk-wbt: factor out a helper wbt_set_lat() 2026-02-02 07:05:19 -07:00
blk-zoned.c block: allow submitting all zone writes from a single context 2026-03-09 14:30:00 -06:00
blk.h block: prepare generation / verification helpers for fs usage 2026-03-09 07:47:02 -06:00
bsg-lib.c Convert 'alloc_obj' family to use the new default GFP_KERNEL argument 2026-02-21 17:09:51 -08:00
bsg.c Convert 'alloc_obj' family to use the new default GFP_KERNEL argument 2026-02-21 17:09:51 -08:00
disk-events.c Convert 'alloc_obj' family to use the new default GFP_KERNEL argument 2026-02-21 17:09:51 -08:00
early-lookup.c wrapper for access to ->bd_partno 2024-05-02 17:48:09 -04:00
elevator.c block: use trylock to avoid lockdep circular dependency in sysfs 2026-03-05 04:01:42 -07:00
elevator.h block: fix race between wbt_enable_default and IO submission 2025-12-12 12:51:11 -07:00
fops.c Convert 'alloc_obj' family to use the new default GFP_KERNEL argument 2026-02-21 17:09:51 -08:00
genhd.c Convert 'alloc_obj' family to use the new default GFP_KERNEL argument 2026-02-21 17:09:51 -08:00
holder.c Convert 'alloc_obj' family to use the new default GFP_KERNEL argument 2026-02-21 17:09:51 -08:00
ioctl.c block: allow IOC_PR_READ_* ioctls with BLK_OPEN_READ 2026-02-11 10:36:54 -07:00
ioprio.c block: remove test of incorrect io priority level 2025-05-08 09:04:12 -06:00
Kconfig block: Remove obsolete configs BLK_MQ_{PCI,VIRTIO} 2025-05-14 05:43:56 -06:00
Kconfig.iosched
kyber-iosched.c kyber: covert to use request_queue->async_depth 2026-02-03 07:45:36 -07:00
Makefile block: add fs_bio_integrity helpers 2026-03-09 07:47:02 -06:00
mq-deadline.c mq-deadline: covert to use request_queue->async_depth 2026-02-03 07:45:36 -07:00
opal_proto.h sed-opal: add IOC_OPAL_REACTIVATE_LSP. 2026-03-09 14:29:59 -06:00
sed-opal.c sed-opal: add IOC_OPAL_GET_SUM_STATUS ioctl. 2026-03-09 14:29:59 -06:00
t10-pi.c block: prepare generation / verification helpers for fs usage 2026-03-09 07:47:02 -06:00