* android12-5.10-2021-08: (429 commits)
ANDROID: Update symbol list for mtk
ANDROID: scheduler: export task_sched_runtime
FROMLIST: mm: slub: fix slub_debug disabling for list of slabs
FROMLIST: mm/madvise: add MADV_WILLNEED to process_madvise()
ANDROID: Update the exynos symbol list
FROMGIT: firmware: arm_scmi: Free mailbox channels if probe fails
ANDROID: GKI: gki_defconfig: Enable CONFIG_NFC
ANDROID: sched: Make uclamp changes depend on CAP_SYS_NICE
ANDROID: GKI: update xiaomi symbol list and ABI XML
ANDROID: ABI: update generic symbol list
ANDROID: scsi: ufs: Enable CONFIG_SCSI_UFS_HPB
ANDROID: scsi: ufs: Make CONFIG_SCSI_UFS_HPB compatible with the GKI
UPSTREAM: arm64: vdso: Avoid ISB after reading from cntvct_el0
ANDROID: GKI: Disable X86_MCE drivers
ANDROID: GKI: Update symbols to symbol list
ANDROID: ABI: update allowed list for exynos
FROMGIT: sched: Skip priority checks with SCHED_FLAG_KEEP_PARAMS
FROMGIT: sched: Don't report SCHED_FLAG_SUGOV in sched_getattr()
FROMGIT: sched/deadline: Fix reset_on_fork reporting of DL tasks
BACKPORT: FROMGIT: sched: Fix UCLAMP_FLAG_IDLE setting
...
Change-Id: I5e0600bb4ccd0333366b016b42332e1e79e56b61
Conflicts:
drivers/usb/gadget/configfs.c
include/linux/usb/gadget.h
elevator_get_default() uses the following algorithm to select an I/O
scheduler from inside add_disk():
- In case of a single hardware queue or sharing hardware queues across
multiple request queues (BLK_MQ_F_TAG_HCTX_SHARED), use mq-deadline.
- Otherwise, use 'none'.
This is a good choice for most but not for all block drivers. Make it
possible to override the selection of mq-deadline with a new flag,
namely BLK_MQ_F_NO_SCHED_BY_DEFAULT.
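The selection logic described above can be sketched as follows. This is a
simplified user-space model in Python, not the kernel code. The Queue class
and its field names are hypothetical stand-ins for the request queue and the
tag set flags, and treating BLK_MQ_F_NO_SCHED_BY_DEFAULT as "select 'none'"
is an assumption based on the description.

```python
from dataclasses import dataclass


@dataclass
class Queue:
    """Hypothetical model of the inputs to elevator_get_default()."""
    nr_hw_queues: int
    tag_hctx_shared: bool = False       # models BLK_MQ_F_TAG_HCTX_SHARED
    no_sched_by_default: bool = False   # models BLK_MQ_F_NO_SCHED_BY_DEFAULT


def default_elevator(q: Queue) -> str:
    """Return the name of the default I/O scheduler for this queue."""
    # Assumption: the new flag overrides the mq-deadline selection.
    if q.no_sched_by_default:
        return "none"
    # Single hardware queue, or hardware queues shared across request queues.
    if q.nr_hw_queues == 1 or q.tag_hctx_shared:
        return "mq-deadline"
    return "none"
```

A driver that sets the flag would get 'none' even with a single hardware
queue, which is the override this patch makes possible.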
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Bug: 194450129
(cherry picked from commit 90b7198001 git://git.kernel.dk/linux-block/ for-5.15/block)
Change-Id: I4fb658957c193f350e74bdb5876c20a8f628fcb1
Signed-off-by: Bart Van Assche <bvanassche@google.com>
This effectively locks down OWNERS approval to a small group to guard
the code base against unintentional breakages.
Bug: 194314089
Signed-off-by: Matthias Maennich <maennich@google.com>
Change-Id: Ifd1ea97639a622320ea83f901f6451e2e52b38d4
Since commit c5089591c3ba ("block, bfq: detect wakers and
unconditionally inject their I/O"), when the in-service bfq_queue, say
Q, is temporarily empty, BFQ checks whether there are I/O requests to
inject (also) from the waker bfq_queue for Q. To this end, the value
pointed to by bfqq->waker_bfqq->next_rq must be checked. However, the
current implementation mistakenly looks at bfqq->next_rq, which
instead points to the next request of the currently served queue.
This mistake evidently causes losses of throughput in scenarios with
waker bfq_queues.
This commit corrects this mistake.
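The difference between the two pointers can be illustrated with a small
Python model. This is only a sketch: BfqQueue and
injectable_waker_request() are hypothetical names, and only the next_rq
and waker_bfqq fields mirror the fields named above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class BfqQueue:
    """Hypothetical, heavily reduced model of a bfq_queue."""
    name: str
    next_rq: Optional[str] = None
    waker_bfqq: Optional["BfqQueue"] = None


def injectable_waker_request(bfqq: BfqQueue) -> Optional[str]:
    """Return the request that could be injected from the waker of bfqq."""
    waker = bfqq.waker_bfqq
    if waker is None:
        return None
    # Corrected check: look at the waker's next request.
    # The mistaken variant looked at bfqq.next_rq, which belongs to the
    # in-service queue and is empty exactly when injection is considered.
    return waker.next_rq
```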
Fixes: c5089591c3ba ("block, bfq: detect wakers and unconditionally inject their I/O")
Signed-off-by: Jia Cheng Hu <jia.jiachenghu@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Bug: 193744965
Change-Id: I2b31773e1f67e4828e33a32acfa29b59809be7e0
(cherry picked from commit d4fc3640ff)
Signed-off-by: Ed Tsai <ed.tsai@mediatek.com>
This vendor hook lets us initialize the payload of the request.
Bug: 188749221
Change-Id: I51d6a3010ac0ab36066dbe1368158592832112b7
Signed-off-by: Yang Yang <yang.yang@vivo.com>
This vendor hook lets us attach OEM data as payload to the request.
The payload is used by the OEM driver for debugging purposes.
Bug: 188749221
Change-Id: Iac598bd9cce836dac0efe9198a3e7752928f351a
Signed-off-by: Yang Yang <yang.yang@vivo.com>
Commit "block: Do not accept any requests while suspended" broke the UFS
driver. In the upstream kernel this has been fixed by commit b294ff3e34
("scsi: ufs: core: Enable power management for wlun"). Backporting that
commit or backporting the entire v5.14-rc1 UFS driver is too risky.
Hence revert the block layer patch that is incompatible with the v5.10
UFS driver power management code.
This reverts commit d55d15a332.
Bug: 193181075
Change-Id: Ic50d4e1df98d7ed393bf9797787225ae22e5d7a3
Signed-off-by: Bart Van Assche <bvanassche@google.com>
Add ANDROID_OEM_DATA for the OEM GKI implementation.
Bug: 188749221
Change-Id: I1feba2334aa34e3bc46eb9d0217118485405beb4
Signed-off-by: Yang Yang <yang.yang@vivo.com>
Add ANDROID_OEM_DATA for the OEM GKI implementation.
Bug: 188749221
Change-Id: Ide8378a898de01a34d8ca3c34472844cd4ffa71c
Signed-off-by: Yang Yang <yang.yang@vivo.com>
While one or more requests with a certain I/O priority are pending, do not
dispatch lower priority requests. Dispatch lower priority requests anyway
after the "aging" time has expired.
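The dispatch rule above can be sketched as a small Python model. This is
a simplification under stated assumptions, not the mq-deadline code:
PrioDispatcher and its methods are hypothetical, time is an abstract
number, and read/write batching and deadlines are left out.

```python
from collections import deque

PRIO_CLASSES = ("RT", "BE", "IDLE")  # highest to lowest priority


class PrioDispatcher:
    """Hypothetical model: strict priority dispatch with an aging escape."""

    def __init__(self, aging_expire):
        self.aging_expire = aging_expire
        self.queues = {prio: deque() for prio in PRIO_CLASSES}

    def insert(self, prio, name, now):
        # Remember the insertion time so that aging can be evaluated later.
        self.queues[prio].append((now, name))

    def dispatch(self, now):
        # 1) A lower priority request that has waited at least aging_expire
        #    is dispatched even if higher priority requests are pending.
        for prio in PRIO_CLASSES[1:]:
            queue = self.queues[prio]
            if queue and now - queue[0][0] >= self.aging_expire:
                return queue.popleft()[1]
        # 2) Otherwise dispatch strictly in priority order.
        for prio in PRIO_CLASSES:
            queue = self.queues[prio]
            if queue:
                return queue.popleft()[1]
        return None
```

With a large aging_expire the low-priority queue is starved while
high-priority work is pending; with aging_expire set to 0 every pending
low-priority request counts as aged.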
This patch has been tested as follows:
modprobe scsi_debug ndelay=1000000 max_queue=16 &&
sd='' &&
while [ -z "$sd" ]; do
sd=/dev/$(basename /sys/bus/pseudo/drivers/scsi_debug/adapter*/host*/target*/*/block/*)
done &&
echo $((100*1000)) > /sys/block/$sd/queue/iosched/aging_expire &&
cd /sys/fs/cgroup/blkio/ &&
echo $$ >cgroup.procs &&
echo restrict-to-be >blkio.prio.class &&
mkdir -p hipri &&
cd hipri &&
echo none-to-rt >blkio.prio.class &&
{ max-iops -a1 -d32 -j1 -e mq-deadline $sd >& ~/low-pri.txt & } &&
echo $$ >cgroup.procs &&
max-iops -a1 -d32 -j1 -e mq-deadline $sd >& ~/hi-pri.txt
Result:
* 11000 IOPS for the high-priority job
* 40 IOPS for the low-priority job
If the aging expiry time is changed from 100 s to 0, the IOPS results change
to 6712 and 6796 IOPS.
The max-iops script runs fio with the following arguments:
--bs=4K --gtod_reduce=1 --ioengine=libaio --ioscheduler=${arg_e} --runtime=60
--norandommap --rw=read --thread --buffered=0 --numjobs=${arg_j}
--iodepth=${arg_d} --iodepth_batch_submit=${arg_a}
--iodepth_batch_complete=$((arg_d / 2)) --name=${positional_argument_1}
--filename=${positional_argument_1}
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Cc: Himanshu Madhani <himanshu.madhani@oracle.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Change-Id: I99a0674b018d096ec96bbfa3008eedcfda5013da
BUG: 187357408
(cherry picked from commit 40d5d42992b0de3ae7961735ea15eef5bd385ebf git://git.kernel.dk/linux-block/ for-5.14/block)
Signed-off-by: Bart Van Assche <bvanassche@google.com>
Maintain statistics per cgroup and export these to user space. These
statistics are essential for verifying whether the proper I/O priorities
have been assigned to requests. An example of the statistics data with
this patch applied:
$ cat /sys/fs/cgroup/io.stat
11:2 rbytes=0 wbytes=0 rios=3 wios=0 dbytes=0 dios=0 [NONE] dispatched=0 inserted=0 merged=171 [RT] dispatched=0 inserted=0 merged=0 [BE] dispatched=0 inserted=0 merged=0 [IDLE] dispatched=0 inserted=0 merged=0
8:32 rbytes=2142720 wbytes=0 rios=105 wios=0 dbytes=0 dios=0 [NONE] dispatched=0 inserted=0 merged=171 [RT] dispatched=0 inserted=0 merged=0 [BE] dispatched=0 inserted=0 merged=0 [IDLE] dispatched=0 inserted=0 merged=0
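A line in this format could be parsed in user space roughly as follows.
This is a sketch that assumes the layout shown in the example above;
parse_io_stat_line() is a hypothetical helper and not part of any
existing tool.

```python
import re


def parse_io_stat_line(line):
    """Split one io.stat line into device, totals and per-priority stats."""
    device, rest = line.split(maxsplit=1)
    stats = {"device": device, "totals": {}, "per_prio": {}}
    section = stats["totals"]
    for token in rest.split():
        marker = re.fullmatch(r"\[(\w+)\]", token)
        if marker:
            # A bracketed token such as [RT] starts a new priority section.
            section = stats["per_prio"].setdefault(marker.group(1), {})
            continue
        key, value = token.split("=", 1)
        section[key] = int(value)
    return stats
```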
Cc: Damien Le Moal <damien.lemoal@wdc.com>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Cc: Himanshu Madhani <himanshu.madhani@oracle.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
BUG: 187357408
Change-Id: I8d976c62ba2c0397cbb18076f3e61d5ab246cbcf
(cherry picked from commit f5dc926252cb31739809f7d27a8cbc9941b4d36d git://git.kernel.dk/linux-block/ for-5.14/block)
Signed-off-by: Bart Van Assche <bvanassche@google.com>
Track I/O statistics per I/O priority and export these statistics to
debugfs. These statistics help developers of the deadline scheduler.
Cc: Damien Le Moal <damien.lemoal@wdc.com>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Cc: Himanshu Madhani <himanshu.madhani@oracle.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
BUG: 187357408
Change-Id: I8e91693dc1d015060737fa2fc15f5f2ebee2530c
(cherry picked from commit 9dc236caf2518c1e434be7a4f8fae60fb0be506a git://git.kernel.dk/linux-block/ for-5.14/block)
Signed-off-by: Bart Van Assche <bvanassche@google.com>
Maintain one dispatch list and one FIFO list per I/O priority class: RT, BE
and IDLE. Maintain statistics for each priority level. Split the debugfs
attributes per priority level as follows:
$ ls /sys/kernel/debug/block/.../sched/
async_depth dispatch2 read_next_rq write2_fifo_list
batching read0_fifo_list starved write_next_rq
dispatch0 read1_fifo_list write0_fifo_list
dispatch1 read2_fifo_list write1_fifo_list
Cc: Damien Le Moal <damien.lemoal@wdc.com>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Cc: Himanshu Madhani <himanshu.madhani@oracle.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
BUG: 187357408
Change-Id: I60451cfdb416ad27601dc3ffb4eb307fa6ff783f
(cherry picked from commit 5b701a6e040ff8626ecf29ac06de9689efc00754 git://git.kernel.dk/linux-block/ for-5.14/block)
Signed-off-by: Bart Van Assche <bvanassche@google.com>
When dispatching the first request of a batch, the deadline_move_request()
call clears .next_rq[] for the opposite data direction. .next_rq[] is not
restored when changing data direction. Fix this by not clearing .next_rq[]
and by keeping track of the data direction of a batch in a variable instead.
This patch is a micro-optimization because:
- The number of deadline_next_request() calls for the read direction is
halved.
- The number of times that deadline_next_request() returns NULL is reduced.
Cc: Damien Le Moal <damien.lemoal@wdc.com>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Cc: Himanshu Madhani <himanshu.madhani@oracle.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
BUG: 187357408
Change-Id: I582e99603a5443d75cf2b18a5daa2c93b5c66de3
(cherry picked from commit ea0fd2a525436ab5b9ada0f1953b0c0a29357311 git://git.kernel.dk/linux-block/ for-5.14/block)
Signed-off-by: Bart Van Assche <bvanassche@google.com>
For interactive workloads it is important that synchronous requests are
not delayed. Hence reserve 25% of scheduler tags for synchronous requests.
This patch still allows asynchronous requests to fill the hardware queues
since blk_mq_init_sched() makes sure that the number of scheduler requests
is double the hardware queue depth. From blk_mq_init_sched():
q->nr_requests = 2 * min_t(unsigned int, q->tag_set->queue_depth,
BLKDEV_MAX_RQ);
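A worked example of the arithmetic, as a Python sketch. The value 128 for
BLKDEV_MAX_RQ and the rounding in async_depth() are assumptions made for
illustration; only the factor 2 and the 25% reservation come from the
description above.

```python
BLKDEV_MAX_RQ = 128  # assumed value of the kernel constant


def scheduler_tags(hw_queue_depth):
    """Number of scheduler requests, mirroring the quoted expression."""
    return 2 * min(hw_queue_depth, BLKDEV_MAX_RQ)


def async_depth(nr_requests):
    """Scheduler tags usable by asynchronous requests (75% of the total)."""
    # Assumption: integer rounding down, with a floor of one tag.
    return max(1, 3 * nr_requests // 4)
```

For a hardware queue depth of 32 this gives 64 scheduler tags, of which
48 may be used by asynchronous requests. That is still more than the 32
hardware tags, so asynchronous requests alone can fill the hardware queue.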
Cc: Damien Le Moal <damien.lemoal@wdc.com>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Cc: Himanshu Madhani <himanshu.madhani@oracle.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
BUG: 187357408
Change-Id: Ib9cd753a39c8e5f5c45908001d69334130ef2067
(cherry picked from commit c970bc8292aaaf6f2d333d612e657df3a99f417c git://git.kernel.dk/linux-block/ for-5.14/block)
Signed-off-by: Bart Van Assche <bvanassche@google.com>
Define separate macros for integers and jiffies to improve readability.
Use sysfs_emit() and kstrtoint() instead of sprintf() and simple_strtol().
The former functions are the recommended functions.
Cc: Damien Le Moal <damien.lemoal@wdc.com>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Cc: Himanshu Madhani <himanshu.madhani@oracle.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
BUG: 187357408
Change-Id: I4e0fd35124cd0319fcace0d1d5e3c113b60a213c
(cherry picked from commit d9baee13f8cf66a8fac9ec67fdb85ce419fcce3a git://git.kernel.dk/linux-block/ for-5.14/block)
Signed-off-by: Bart Van Assche <bvanassche@google.com>
Modern compilers complain if an out-of-range value is passed to a function
argument that has an enumeration type. Let the compiler detect out-of-range
data direction arguments instead of verifying the data_dir argument at
runtime.
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Cc: Damien Le Moal <damien.lemoal@wdc.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
BUG: 187357408
Change-Id: I4ad8c106a86d17f3010e12e172702e77eca61e80
(cherry picked from commit d9baee13f8cf66a8fac9ec67fdb85ce419fcce3a git://git.kernel.dk/linux-block/ for-5.14/block)
Signed-off-by: Bart Van Assche <bvanassche@google.com>
Change "queue" into "sched" to make the function names better reflect the
purpose of these functions.
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Cc: Damien Le Moal <damien.lemoal@wdc.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
BUG: 187357408
Change-Id: I30825b379146dbaef4ff3f85148b2e788667a77c
(cherry picked from commit a6e57fe5ab09c250fc741294e6321270a4364fec git://git.kernel.dk/linux-block/ for-5.14/block)
Signed-off-by: Bart Van Assche <bvanassche@google.com>
Make __dd_dispatch_request() easier to read by removing two local
variables.
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Cc: Damien Le Moal <damien.lemoal@wdc.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
BUG: 187357408
Change-Id: I5567f7d02a2c628efb437058a1c103c7b123747a
(cherry picked from commit f005b6ff19d2a961a2c3ae9c5f49d48fda143469 git://git.kernel.dk/linux-block/ for-5.14/block)
Signed-off-by: Bart Van Assche <bvanassche@google.com>
Document the locking strategy by adding two lockdep_assert_held()
statements.
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Cc: Damien Le Moal <damien.lemoal@wdc.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
BUG: 187357408
Change-Id: Ie8cf0b0ae208c9cc87731a9c6d7df5e5e59332d5
(cherry picked from commit 91831ddfd7c6e3df9857526a76cfa88673ec0637 git://git.kernel.dk/linux-block/ for-5.14/block)
Signed-off-by: Bart Van Assche <bvanassche@google.com>
Make the code easier to read by adding more comments.
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
BUG: 187357408
Change-Id: If62eb600614d2883d72ee3bd7e7859ae66b24512
(cherry picked from commit 16c3afdb127bbff7d3552e076e568281765674b7 git://git.kernel.dk/linux-block/ for-5.14/block)
Signed-off-by: Bart Van Assche <bvanassche@google.com>
Introduce an rq-qos policy that assigns an I/O priority to requests based
on blk-cgroup configuration settings. This policy has the following
advantages over the ioprio_set() system call:
- This policy is cgroup based so it has all the advantages of cgroups.
- While ioprio_set() does not affect page cache writeback I/O, this rq-qos
controller affects page cache writeback I/O for filesystems that support
associating a cgroup with writeback I/O. See also
Documentation/admin-guide/cgroup-v2.rst.
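One way such a policy could map request priorities is sketched below in
Python. Everything here is an assumption made for illustration: the policy
names "no-change" and "all-to-idle", the class numbering, the 13-bit class
shift and the "never raise above the policy class" rule are modeled on the
ioprio encoding as I understand it and are not taken from this commit
message.

```python
IOPRIO_CLASS_SHIFT = 13  # assumed position of the class bits
NONE, RT, BE, IDLE = 0, 1, 2, 3  # assumed class numbering

# Assumed mapping from policy name to the class it enforces.
POLICY_CLASS = {"none-to-rt": RT, "restrict-to-be": BE, "all-to-idle": IDLE}


def ioprio_value(prio_class, level=0):
    """Pack a priority class and a level into one integer."""
    return (prio_class << IOPRIO_CLASS_SHIFT) | level


def ioprio_class(value):
    """Extract the priority class from a packed value."""
    return value >> IOPRIO_CLASS_SHIFT


def apply_policy(policy, ioprio):
    """Return the I/O priority of a request after applying a cgroup policy."""
    if policy == "no-change":
        return ioprio
    # A numerically larger value means a lower priority, so taking the
    # maximum can only keep or lower the priority of a request that
    # already has a class.
    return max(ioprio, ioprio_value(POLICY_CLASS[policy]))
```

Under these assumptions, "none-to-rt" turns requests without a class into
RT requests, and "restrict-to-be" demotes RT requests to BE while leaving
BE and IDLE requests alone.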
Cc: Damien Le Moal <damien.lemoal@wdc.com>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Cc: Himanshu Madhani <himanshu.madhani@oracle.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
BUG: 187357408
Change-Id: If51e608ad37ee7a3f57b507bb17900dcfcb263ed
(cherry picked from commit ee9d2a55c960f152b5710078bbe399a4c51eb0a9 git://git.kernel.dk/linux-block/ for-5.14/block)
Signed-off-by: Bart Van Assche <bvanassche@google.com>
rq_qos_id_to_name() is only used in blk-mq-debugfs.c so move that function
into blk-mq-debugfs.c.
Cc: Damien Le Moal <damien.lemoal@wdc.com>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Cc: Himanshu Madhani <himanshu.madhani@oracle.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
BUG: 187357408
Change-Id: If03083a13917bc2f88b6df7151e033a11ab1bc50
(cherry picked from commit f1a7f539c2720906fb10be0af3514b034e1a9fee git://git.kernel.dk/linux-block/ for-5.14/block)
Signed-off-by: Bart Van Assche <bvanassche@google.com>
Before adding more calls in this function, simplify the error path.
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Cc: Tejun Heo <tj@kernel.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Himanshu Madhani <himanshu.madhani@oracle.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
BUG: 187357408
Change-Id: I8568b87d1bebbd3841e42a79b7efe2d0a1bff2bc
(cherry picked from commit f1a7f539c2720906fb10be0af3514b034e1a9fee git://git.kernel.dk/linux-block/ for-5.14/block)
Signed-off-by: Bart Van Assche <bvanassche@google.com>
These entries were consecutive at the time of their introduction but are no
longer consecutive. Make them consecutive again. Additionally, modify the
help text since it refers to blk-mq and since the legacy block layer has
been removed.
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Himanshu Madhani <himanshu.madhani@oracle.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
BUG: 187357408
Change-Id: I568383377a3244efba9748adf0a2e90bd7660bb2
(cherry picked from commit fdc250ea26e44066d690bbe65a03fab512af0699 git://git.kernel.dk/linux-block/ for-5.14/block)
Signed-off-by: Bart Van Assche <bvanassche@google.com>
Since commit 01e99aeca3 'blk-mq: insert passthrough request into
hctx->dispatch directly', passthrough requests should no longer appear in
the I/O scheduler, so the blk_rq_is_passthrough checks in add-on I/O
schedulers are redundant.
(Notes: this patch passes a generic I/O load test with HDDs under a SAS
controller and HDDs under an AHCI controller, but that obviously does not
cover all cases. It is unclear whether passthrough requests can still
escape into the I/O scheduler from blk_mq_sched_insert_requests, which is
used by blk_mq_flush_plug_list and has lots of indirect callers.)
Signed-off-by: Lin Feng <linf@wangsu.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
BUG: 187357408
Change-Id: I97d85c38e584add44399295f3839994b694bc9ca
(cherry picked from commit 0856faaa220759a4fe4334f5c57a8661c94c14ce git://git.kernel.dk/linux-block/ for-5.14/block)
Signed-off-by: Bart Van Assche <bvanassche@google.com>
Currently when a non-mq aware IO scheduler (BFQ, mq-deadline) is used for
a queue with multiple HW queues, the performance is rather bad. The
problem is that these IO schedulers use queue-wide locking and their
dispatch function does not respect the hctx it is passed in and returns
any request it finds appropriate. Thus locality of request access is
broken and dispatch from multiple CPUs just contends on IO scheduler
locks. For these IO schedulers there's little point in dispatching from
multiple CPUs. Instead dispatch always only from a single CPU to limit
contention.
Below is a comparison of dbench runs on an XFS filesystem where the storage
is a RAID card with 64 HW queues and a single rotating disk attached to
it. BFQ is used as the IO scheduler:
clients         MQ                  SQ                 MQ-Patched
Amean 1      39.12 (0.00%)      43.29 * -10.67%*      36.09 *   7.74%*
Amean 2     128.58 (0.00%)     101.30 *  21.22%*      96.14 *  25.23%*
Amean 4     577.42 (0.00%)     494.47 *  14.37%*     508.49 *  11.94%*
Amean 8     610.95 (0.00%)     363.86 *  40.44%*     362.12 *  40.73%*
Amean 16    391.78 (0.00%)     261.49 *  33.25%*     282.94 *  27.78%*
Amean 32    324.64 (0.00%)     267.71 *  17.54%*     233.00 *  28.23%*
Amean 64    295.04 (0.00%)     253.02 *  14.24%*     242.37 *  17.85%*
Amean 512 10281.61 (0.00%)   10211.16 *   0.69%*   10447.53 *  -1.61%*
Numbers are times so lower is better. MQ is stock 5.10-rc6 kernel. SQ is
the same kernel with megaraid_sas.host_tagset_enable=0 so that the card
advertises just a single HW queue. MQ-Patched is a kernel with this
patch applied.
You can see multiple hardware queues heavily hurt performance in
combination with BFQ. The patch restores the performance.
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
BUG: 187357408
Change-Id: I53645eb48cb308cd3af81a1c5e718a6abec6a1f9
(cherry picked from commit fa56cac78af68bd93734c290a0ffd0716e871dba git://git.kernel.dk/linux-block/ for-5.14/block)
Signed-off-by: Bart Van Assche <bvanassche@google.com>
This reverts commit b445547ec1.
Since both mq-deadline and BFQ completely ignore the hctx that is passed to
their dispatch function and dispatch whatever request they deem fit,
checking whether any request for a particular hctx is queued is just
pointless: we'll very likely get a request from a different hctx anyway.
In the following commit we'll deal with lock contention in these IO
schedulers in the presence of multiple HW queues in a different way.
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Change-Id: Ibd7dbe69ae1799f2efce5788986e2f1aad88f66d
BUG: 187357408
(cherry picked from commit 2490aeca0081bb168e96fb7b1746d676be84369f git://git.kernel.dk/linux-block/ for-5.14/block)
Signed-off-by: Bart Van Assche <bvanassche@google.com>
* android12-5.10: (2274 commits)
FROMGIT: mm: slub: move sysfs slab alloc/free interfaces to debugfs
ANDROID: gki - CONFIG_NET_SCH_FQ=y
ANDROID: GKI: Kconfig.gki: Add GKI_HIDDEN_ETHERNET_CONFIGS
FROMLIST: media: Kconfig: Fix DVB_CORE can't be selected as module
ANDROID: Update ABI and symbol list
Revert "net: usb: cdc_ncm: don't spew notifications"
ANDROID: Fips 140: move fips symbols entirely in own list
ANDROID: core of xt_IDLETIMER send_nl_msg support
ANDROID: start to re-add xt_IDLETIMER send_nl_msg support
ANDROID: add fips140.ko symbols to module ABI
ANDROID: inject correct HMAC digest into fips140.ko at build time
ANDROID: crypto: fips140 - perform load time integrity check
FROMLIST: crypto: shash - stop comparing function pointers to avoid breaking CFI
ANDROID: arm64: module: preserve RELA sections for FIPS140 integrity selfcheck
ANDROID: arm64: simd: omit capability check in may_use_simd()
ANDROID: kbuild: lto: permit the use of .a archives in LTO modules
ANDROID: arm64: only permit certain alternatives in the FIPS140 module
ANDROID: crypto: lib/aes - add vendor hooks for AES library routines
ANDROID: crypto: lib/sha256 - add vendor hook for sha256() routine
UPSTREAM: KVM: arm64: Mark the host stage-2 memory pools static
...
Conflicts:
drivers/mmc/core/mmc_ops.c
drivers/usb/gadget/function/f_uac1.c
drivers/usb/gadget/function/f_uac2.c
drivers/usb/gadget/function/f_uvc.c
This reverts commit 59870a78d4.
Bring back the commit in 5.10.38 that broke the kabi.
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: Ic93df6ad7bb79dda947aedc31d23d69fcd97b7d7
Before we free the request queue, clear the flush request reference in
tags->rqs[] so that a potential UAF can be avoided.
Based on one patch written by David Jeffery.
Tested-by: John Garry <john.garry@huawei.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: David Jeffery <djeffery@redhat.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20210511152236.763464-5-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Change-Id: I631dbb5138e427246d2e717576fc44727daaa286
Bug: 188199752
(cherry picked from commit 51d4673e57d2613152fdb2ccfe917643472bb218 git://git.kernel.dk/linux-block/ for-5.14/block)
Signed-off-by: Bart Van Assche <bvanassche@google.com>
refcount_inc_not_zero() in bt_tags_iter() may still read a freed
request.
Fix the issue with the following approach:
1) hold a per-tags spinlock when reading ->rqs[tag] and calling
refcount_inc_not_zero() in bt_tags_iter()
2) clear stale requests referenced via ->rqs[tag] before freeing the
request pool; the per-tags spinlock is held while clearing a stale
->rqs[tag]
Once stale requests have been cleared, bt_tags_iter() can no longer
observe a freed request, and the clearing waits for pending request
references.
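The two steps can be modeled in user space as follows. This is only a
sketch of the locking idea in Python: Tags, Request, iter_get() and
clear_stale() are hypothetical names, a threading.Lock stands in for the
per-tags spinlock, and a plain integer stands in for the request refcount.

```python
import threading


class Request:
    """Hypothetical request with a plain integer reference count."""

    def __init__(self, name):
        self.name = name
        self.ref = 1  # 0 would mean the request is no longer in use


class Tags:
    """Hypothetical model of a tag map with a per-tags lock."""

    def __init__(self, depth):
        self.lock = threading.Lock()  # models the per-tags spinlock
        self.rqs = [None] * depth     # models ->rqs[]

    def iter_get(self, tag):
        """Step 1: read ->rqs[tag] and take a reference under the lock."""
        with self.lock:
            rq = self.rqs[tag]
            if rq is None or rq.ref == 0:
                return None
            rq.ref += 1
            return rq

    def clear_stale(self, pool):
        """Step 2: clear every ->rqs[] entry that points into pool."""
        with self.lock:
            for tag, rq in enumerate(self.rqs):
                if any(rq is stale for stale in pool):
                    self.rqs[tag] = None
```

Because both paths take the same lock, an iterator either sees the entry
before it is cleared and holds a reference, or sees None afterwards.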
The idea of clearing ->rqs[] is borrowed from John Garry's previous
patch and a recent patch from David Jeffery.
Tested-by: John Garry <john.garry@huawei.com>
Reviewed-by: David Jeffery <djeffery@redhat.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20210511152236.763464-4-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Change-Id: I740ddf3b83ea04ce0349b1b8055ac8b9db1d0557
Bug: 188199752
(cherry picked from commit 33238eb62b7575350be110adff231f32584b20f7 git://git.kernel.dk/linux-block/ for-5.14/block)
Signed-off-by: Bart Van Assche <bvanassche@google.com>
Grab rq->refcount before calling ->fn in blk_mq_tagset_busy_iter(); this
prevents the request from being re-used while ->fn is running. The
approach is the same as what we do when handling timeouts.
Fix request use-after-free (UAF) related to completion races or queue
releasing:
- If a rq is referenced before rq->q is frozen, the queue won't be
frozen until the request is released during iteration.
- If a rq is referenced after rq->q is frozen, refcount_inc_not_zero()
will return false, and we won't iterate over this request.
However, one request UAF is still not covered: refcount_inc_not_zero() may
read a freed request. This will be handled in the next patch.
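The "take a reference unless the count is already zero" pattern can be
sketched in Python as below. This is a user-space model, not the blk-mq
code: Request, get_unless_zero(), put() and busy_iter() are hypothetical
names, and a lock-protected integer stands in for the kernel refcount.

```python
import threading


class Request:
    """Hypothetical request with a lock-protected reference count."""

    def __init__(self, tag, refs=1):
        self.tag = tag
        self._refs = refs
        self._lock = threading.Lock()

    def get_unless_zero(self):
        """Take a reference only if the request is still in use."""
        with self._lock:
            if self._refs == 0:
                return False
            self._refs += 1
            return True

    def put(self):
        """Drop a reference; return True if this was the last one."""
        with self._lock:
            self._refs -= 1
            return self._refs == 0


def busy_iter(requests, fn):
    """Call fn on every request that can still be referenced."""
    visited = []
    for rq in requests:
        if not rq.get_unless_zero():
            continue  # already released, skip it
        try:
            fn(rq)
            visited.append(rq.tag)
        finally:
            rq.put()  # the request may be re-used only after this point
    return visited
```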
Tested-by: John Garry <john.garry@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20210511152236.763464-3-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Change-Id: I3e092c0487989ddf389308dc0da325b90e0bf7d4
Bug: 188199752
(cherry picked from commit 91af4d7b8930d9fd8767aee826c3ff4c1eaeec02 git://git.kernel.dk/linux-block/ for-5.14/block)
Signed-off-by: Bart Van Assche <bvanassche@google.com>
This reverts commit 54dbe2d2c1 as it
breaks the kernel abi at the moment. It will be restored at a later
point in time.
Bug: 161946584
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: Ida737ad962db2dc0ece0bd35ccb71e0db8e76fa2
Changes in 5.10.38
KEYS: trusted: Fix memory leak on object td
tpm: fix error return code in tpm2_get_cc_attrs_tbl()
tpm, tpm_tis: Extend locality handling to TPM2 in tpm_tis_gen_interrupt()
tpm, tpm_tis: Reserve locality in tpm_tis_resume()
KVM: x86/mmu: Remove the defunct update_pte() paging hook
KVM/VMX: Invoke NMI non-IST entry instead of IST entry
ACPI: PM: Add ACPI ID of Alder Lake Fan
PM: runtime: Fix unpaired parent child_count for force_resume
cpufreq: intel_pstate: Use HWP if enabled by platform firmware
kvm: Cap halt polling at kvm->max_halt_poll_ns
ath11k: fix thermal temperature read
fs: dlm: fix debugfs dump
fs: dlm: add errno handling to check callback
fs: dlm: check on minimum msglen size
fs: dlm: flush swork on shutdown
tipc: convert dest node's address to network order
ASoC: Intel: bytcr_rt5640: Enable jack-detect support on Asus T100TAF
net/mlx5e: Use net_prefetchw instead of prefetchw in MPWQE TX datapath
net: stmmac: Set FIFO sizes for ipq806x
ASoC: rsnd: core: Check convert rate in rsnd_hw_params
Bluetooth: Fix incorrect status handling in LE PHY UPDATE event
i2c: bail out early when RDWR parameters are wrong
ALSA: hdsp: don't disable if not enabled
ALSA: hdspm: don't disable if not enabled
ALSA: rme9652: don't disable if not enabled
ALSA: bebob: enable to deliver MIDI messages for multiple ports
Bluetooth: Set CONF_NOT_COMPLETE as l2cap_chan default
Bluetooth: initialize skb_queue_head at l2cap_chan_create()
net/sched: cls_flower: use ntohs for struct flow_dissector_key_ports
net: bridge: when suppression is enabled exclude RARP packets
Bluetooth: check for zapped sk before connecting
selftests/powerpc: Fix L1D flushing tests for Power10
powerpc/32: Statically initialise first emergency context
net: hns3: remediate a potential overflow risk of bd_num_list
net: hns3: add handling for xmit skb with recursive fraglist
ip6_vti: proper dev_{hold|put} in ndo_[un]init methods
ASoC: Intel: bytcr_rt5640: Add quirk for the Chuwi Hi8 tablet
ice: handle increasing Tx or Rx ring sizes
Bluetooth: btusb: Enable quirk boolean flag for Mediatek Chip.
ASoC: rt5670: Add a quirk for the Dell Venue 10 Pro 5055
i2c: Add I2C_AQ_NO_REP_START adapter quirk
MIPS: Loongson64: Use _CACHE_UNCACHED instead of _CACHE_UNCACHED_ACCELERATED
coresight: Do not scan for graph if none is present
IB/hfi1: Correct oversized ring allocation
mac80211: clear the beacon's CRC after channel switch
pinctrl: samsung: use 'int' for register masks in Exynos
rtw88: 8822c: add LC calibration for RTL8822C
mt76: mt7615: support loading EEPROM for MT7613BE
mt76: mt76x0: disable GTK offloading
mt76: mt7915: fix txpower init for TSSI off chips
fuse: invalidate attrs when page writeback completes
virtiofs: fix userns
cuse: prevent clone
iwlwifi: pcie: make cfg vs. trans_cfg more robust
powerpc/mm: Add cond_resched() while removing hpte mappings
ASoC: rsnd: call rsnd_ssi_master_clk_start() from rsnd_ssi_init()
Revert "iommu/amd: Fix performance counter initialization"
iommu/amd: Remove performance counter pre-initialization test
drm/amd/display: Force vsync flip when reconfiguring MPCC
selftests: Set CC to clang in lib.mk if LLVM is set
kconfig: nconf: stop endless search loops
ALSA: hda/realtek: Add quirk for Lenovo Ideapad S740
ASoC: Intel: sof_sdw: add quirk for new ADL-P Rvp
ALSA: hda/hdmi: fix race in handling acomp ELD notification at resume
sctp: Fix out-of-bounds warning in sctp_process_asconf_param()
flow_dissector: Fix out-of-bounds warning in __skb_flow_bpf_to_target()
powerpc/smp: Set numa node before updating mask
ASoC: rt286: Generalize support for ALC3263 codec
ethtool: ioctl: Fix out-of-bounds warning in store_link_ksettings_for_user()
net: sched: tapr: prevent cycle_time == 0 in parse_taprio_schedule
samples/bpf: Fix broken tracex1 due to kprobe argument change
powerpc/pseries: Stop calling printk in rtas_stop_self()
drm/amd/display: fixed divide by zero kernel crash during dsc enablement
drm/amd/display: add handling for hdcp2 rx id list validation
drm/amdgpu: Add mem sync flag for IB allocated by SA
mt76: mt7615: fix entering driver-own state on mt7663
crypto: ccp: Free SEV device if SEV init fails
wl3501_cs: Fix out-of-bounds warnings in wl3501_send_pkt
wl3501_cs: Fix out-of-bounds warnings in wl3501_mgmt_join
qtnfmac: Fix possible buffer overflow in qtnf_event_handle_external_auth
powerpc/iommu: Annotate nested lock for lockdep
iavf: remove duplicate free resources calls
net: ethernet: mtk_eth_soc: fix RX VLAN offload
selftests: mlxsw: Increase the tolerance of backlog buildup
selftests: mlxsw: Fix mausezahn invocation in ERSPAN scale test
kbuild: generate Module.symvers only when vmlinux exists
bnxt_en: Add PCI IDs for Hyper-V VF devices.
ia64: module: fix symbolizer crash on fdescr
watchdog: rename __touch_watchdog() to a better descriptive name
watchdog: explicitly update timestamp when reporting softlockup
watchdog/softlockup: remove logic that tried to prevent repeated reports
watchdog: fix barriers when printing backtraces from all CPUs
ASoC: rt286: Make RT286_SET_GPIO_* readable and writable
thermal: thermal_of: Fix error return code of thermal_of_populate_bind_params()
f2fs: move ioctl interface definitions to separated file
f2fs: fix compat F2FS_IOC_{MOVE,GARBAGE_COLLECT}_RANGE
f2fs: fix to allow migrating fully valid segment
f2fs: fix panic during f2fs_resize_fs()
f2fs: fix a redundant call to f2fs_balance_fs if an error occurs
remoteproc: qcom_q6v5_mss: Replace ioremap with memremap
remoteproc: qcom_q6v5_mss: Validate p_filesz in ELF loader
PCI: iproc: Fix return value of iproc_msi_irq_domain_alloc()
PCI: Release OF node in pci_scan_device()'s error path
ARM: 9064/1: hw_breakpoint: Do not directly check the event's overflow_handler hook
f2fs: fix to align to section for fallocate() on pinned file
f2fs: fix to update last i_size if fallocate partially succeeds
PCI: endpoint: Make *_get_first_free_bar() take into account 64 bit BAR
PCI: endpoint: Add helper API to get the 'next' unreserved BAR
PCI: endpoint: Make *_free_bar() to return error codes on failure
PCI: endpoint: Fix NULL pointer dereference for ->get_features()
f2fs: fix to avoid touching checkpointed data in get_victim()
f2fs: fix to cover __allocate_new_section() with curseg_lock
f2fs: Fix a hungtask problem in atomic write
f2fs: fix to avoid accessing invalid fio in f2fs_allocate_data_block()
rpmsg: qcom_glink_native: fix error return code of qcom_glink_rx_data()
NFS: nfs4_bitmask_adjust() must not change the server global bitmasks
NFS: Fix attribute bitmask in _nfs42_proc_fallocate()
NFSv4.2: Always flush out writes in nfs42_proc_fallocate()
NFS: Deal correctly with attribute generation counter overflow
PCI: endpoint: Fix missing destroy_workqueue()
pNFS/flexfiles: fix incorrect size check in decode_nfs_fh()
NFSv4.2 fix handling of sr_eof in SEEK's reply
SUNRPC: Move fault injection call sites
SUNRPC: Remove trace_xprt_transmit_queued
SUNRPC: Handle major timeout in xprt_adjust_timeout()
thermal/drivers/tsens: Fix missing put_device error
NFSv4.x: Don't return NFS4ERR_NOMATCHING_LAYOUT if we're unmounting
nfsd: ensure new clients break delegations
rtc: fsl-ftm-alarm: add MODULE_TABLE()
dmaengine: idxd: Fix potential null dereference on pointer status
dmaengine: idxd: fix dma device lifetime
dmaengine: idxd: fix cdev setup and free device lifetime issues
SUNRPC: fix ternary sign expansion bug in tracing
pwm: atmel: Fix duty cycle calculation in .get_state()
xprtrdma: Avoid Receive Queue wrapping
xprtrdma: Fix cwnd update ordering
xprtrdma: rpcrdma_mr_pop() already does list_del_init()
swiotlb: Fix the type of index
ceph: fix inode leak on getattr error in __fh_to_dentry
scsi: qla2xxx: Prevent PRLI in target mode
scsi: ufs: core: Do not put UFS power into LPM if link is broken
scsi: ufs: core: Cancel rpm_dev_flush_recheck_work during system suspend
scsi: ufs: core: Narrow down fast path in system suspend path
rtc: ds1307: Fix wday settings for rx8130
net: hns3: fix incorrect configuration for igu_egu_hw_err
net: hns3: initialize the message content in hclge_get_link_mode()
net: hns3: add check for HNS3_NIC_STATE_INITED in hns3_reset_notify_up_enet()
net: hns3: fix for vxlan gpe tx checksum bug
net: hns3: use netif_tx_disable to stop the transmit queue
net: hns3: disable phy loopback setting in hclge_mac_start_phy
sctp: do asoc update earlier in sctp_sf_do_dupcook_a
RISC-V: Fix error code returned by riscv_hartid_to_cpuid()
sunrpc: Fix misplaced barrier in call_decode
libbpf: Fix signed overflow in ringbuf_process_ring
block/rnbd-clt: Change queue_depth type in rnbd_clt_session to size_t
block/rnbd-clt: Check the return value of the function rtrs_clt_query
ethernet:enic: Fix a use after free bug in enic_hard_start_xmit
sctp: fix a SCTP_MIB_CURRESTAB leak in sctp_sf_do_dupcook_b
netfilter: xt_SECMARK: add new revision to fix structure layout
xsk: Fix for xp_aligned_validate_desc() when len == chunk_size
net: stmmac: Clear receive all(RA) bit when promiscuous mode is off
drm/radeon: Fix off-by-one power_state index heap overwrite
drm/radeon: Avoid power table parsing memory leaks
arm64: entry: factor irq triage logic into macros
arm64: entry: always set GIC_PRIO_PSR_I_SET during entry
khugepaged: fix wrong result value for trace_mm_collapse_huge_page_isolate()
mm/hugeltb: handle the error case in hugetlb_fix_reserve_counts()
mm/migrate.c: fix potential indeterminate pte entry in migrate_vma_insert_page()
ksm: fix potential missing rmap_item for stable_node
mm/gup: check every subpage of a compound page during isolation
mm/gup: return an error on migration failure
mm/gup: check for isolation errors
ethtool: fix missing NLM_F_MULTI flag when dumping
net: fix nla_strcmp to handle more then one trailing null character
smc: disallow TCP_ULP in smc_setsockopt()
netfilter: nfnetlink_osf: Fix a missing skb_header_pointer() NULL check
netfilter: nftables: Fix a memleak from userdata error path in new objects
can: mcp251xfd: mcp251xfd_probe(): add missing can_rx_offload_del() in error path
can: mcp251x: fix resume from sleep before interface was brought up
can: m_can: m_can_tx_work_queue(): fix tx_skb race condition
sched: Fix out-of-bound access in uclamp
sched/fair: Fix unfairness caused by missing load decay
fs/proc/generic.c: fix incorrect pde_is_permanent check
kernel: kexec_file: fix error return code of kexec_calculate_store_digests()
kernel/resource: make walk_system_ram_res() find all busy IORESOURCE_SYSTEM_RAM resources
kernel/resource: make walk_mem_res() find all busy IORESOURCE_MEM resources
netfilter: nftables: avoid overflows in nft_hash_buckets()
i40e: fix broken XDP support
i40e: Fix use-after-free in i40e_client_subtask()
i40e: fix the restart auto-negotiation after FEC modified
i40e: Fix PHY type identifiers for 2.5G and 5G adapters
mptcp: fix splat when closing unaccepted socket
f2fs: avoid unneeded data copy in f2fs_ioc_move_range()
ARC: entry: fix off-by-one error in syscall number validation
ARC: mm: PAE: use 40-bit physical page mask
ARC: mm: Use max_high_pfn as a HIGHMEM zone border
powerpc/64s: Fix crashes when toggling stf barrier
powerpc/64s: Fix crashes when toggling entry flush barrier
hfsplus: prevent corruption in shrinking truncate
squashfs: fix divide error in calculate_skip()
userfaultfd: release page in error path to avoid BUG_ON
kasan: fix unit tests with CONFIG_UBSAN_LOCAL_BOUNDS enabled
mm/hugetlb: fix F_SEAL_FUTURE_WRITE
blk-iocost: fix weight updates of inner active iocgs
arm64: mte: initialize RGSR_EL1.SEED in __cpu_setup
arm64: Fix race condition on PG_dcache_clean in __sync_icache_dcache()
btrfs: fix race leading to unpersisted data and metadata on fsync
drm/radeon/dpm: Disable sclk switching on Oland when two 4K 60Hz monitors are connected
drm/amd/display: Initialize attribute for hdcp_srm sysfs file
drm/i915: Avoid div-by-zero on gen2
kvm: exit halt polling on need_resched() as well
KVM: LAPIC: Accurately guarantee busy wait for timer to expire when using hv_timer
drm/msm/dp: initialize audio_comp when audio starts
KVM: x86: Cancel pvclock_gtod_work on module removal
KVM: x86: Prevent deadlock against tk_core.seq
dax: Add an enum for specifying dax wakup mode
dax: Add a wakeup mode parameter to put_unlocked_entry()
dax: Wake up all waiters after invalidating dax entry
xen/unpopulated-alloc: consolidate pgmap manipulation
xen/unpopulated-alloc: fix error return code in fill_list()
perf tools: Fix dynamic libbpf link
usb: dwc3: gadget: Free gadget structure only after freeing endpoints
iio: light: gp2ap002: Fix rumtime PM imbalance on error
iio: proximity: pulsedlight: Fix rumtime PM imbalance on error
iio: hid-sensors: select IIO_TRIGGERED_BUFFER under HID_SENSOR_IIO_TRIGGER
usb: fotg210-hcd: Fix an error message
hwmon: (occ) Fix poll rate limiting
usb: musb: Fix an error message
ACPI: scan: Fix a memory leak in an error handling path
kyber: fix out of bounds access when preempted
nvmet: add lba to sect conversion helpers
nvmet: fix inline bio check for bdev-ns
nvmet-rdma: Fix NULL deref when SEND is completed with error
f2fs: compress: fix to free compress page correctly
f2fs: compress: fix race condition of overwrite vs truncate
f2fs: compress: fix to assign cc.cluster_idx correctly
nbd: Fix NULL pointer in flush_workqueue
blk-mq: plug request for shared sbitmap
blk-mq: Swap two calls in blk_mq_exit_queue()
usb: dwc3: omap: improve extcon initialization
usb: dwc3: pci: Enable usb2-gadget-lpm-disable for Intel Merrifield
usb: xhci: Increase timeout for HC halt
usb: dwc2: Fix gadget DMA unmap direction
usb: core: hub: fix race condition about TRSMRCY of resume
usb: dwc3: gadget: Enable suspend events
usb: dwc3: gadget: Return success always for kick transfer in ep queue
usb: typec: ucsi: Retrieve all the PDOs instead of just the first 4
usb: typec: ucsi: Put fwnode in any case during ->probe()
xhci-pci: Allow host runtime PM as default for Intel Alder Lake xHCI
xhci: Do not use GFP_KERNEL in (potentially) atomic context
xhci: Add reset resume quirk for AMD xhci controller.
iio: gyro: mpu3050: Fix reported temperature value
iio: tsl2583: Fix division by a zero lux_val
cdc-wdm: untangle a circular dependency between callback and softint
xen/gntdev: fix gntdev_mmap() error exit path
KVM: x86: Emulate RDPID only if RDTSCP is supported
KVM: x86: Move RDPID emulation intercept to its own enum
KVM: nVMX: Always make an attempt to map eVMCS after migration
KVM: VMX: Do not advertise RDPID if ENABLE_RDTSCP control is unsupported
KVM: VMX: Disable preemption when probing user return MSRs
Revert "iommu/vt-d: Remove WO permissions on second-level paging entries"
Revert "iommu/vt-d: Preset Access/Dirty bits for IOVA over FL"
iommu/vt-d: Preset Access/Dirty bits for IOVA over FL
iommu/vt-d: Remove WO permissions on second-level paging entries
mm: fix struct page layout on 32-bit systems
MIPS: Reinstate platform `__div64_32' handler
MIPS: Avoid DIVU in `__div64_32' is result would be zero
MIPS: Avoid handcoded DIVU in `__div64_32' altogether
clocksource/drivers/timer-ti-dm: Prepare to handle dra7 timer wrap issue
clocksource/drivers/timer-ti-dm: Handle dra7 timer wrap errata i940
ARM: 9011/1: centralize phys-to-virt conversion of DT/ATAGS address
ARM: 9012/1: move device tree mapping out of linear region
ARM: 9020/1: mm: use correct section size macro to describe the FDT virtual address
ARM: 9027/1: head.S: explicitly map DT even if it lives in the first physical section
usb: typec: tcpm: Fix error while calculating PPS out values
kobject_uevent: remove warning in init_uevent_argv()
drm/i915/gt: Fix a double free in gen8_preallocate_top_level_pdp
drm/i915: Read C0DRB3/C1DRB3 as 16 bits again
drm/i915/overlay: Fix active retire callback alignment
drm/i915: Fix crash in auto_retire
clk: exynos7: Mark aclk_fsys1_200 as critical
media: rkvdec: Remove of_match_ptr()
i2c: mediatek: Fix send master code at more than 1MHz
dt-bindings: media: renesas,vin: Make resets optional on R-Car Gen1
dt-bindings: serial: 8250: Remove duplicated compatible strings
debugfs: Make debugfs_allow RO after init
ext4: fix debug format string warning
nvme: do not try to reconfigure APST when the controller is not live
ASoC: rsnd: check all BUSIF status when error
Linux 5.10.38
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: Ia32e01283b488a38be48015c58a0e481f09aaf65
[ Upstream commit 630ef623ed ]
If a tag set is shared across request queues (e.g. SCSI LUNs) then the
block layer core keeps track of the number of active request queues in
tags->active_queues. blk_mq_tag_busy() and blk_mq_tag_idle() update that
atomic counter if the hctx flag BLK_MQ_F_TAG_QUEUE_SHARED is set. Make
sure that blk_mq_exit_queue() calls blk_mq_tag_idle() before that flag is
cleared by blk_mq_del_queue_tag_set().
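The ordering requirement above can be illustrated with a small user-space model. This is only a sketch: the `Tags` and `Hctx` classes, the `shared` and `active` fields, and the `exit_queue()` helper are hypothetical stand-ins for the kernel structures, not the kernel implementation.

```python
from dataclasses import dataclass


@dataclass
class Tags:
    active_queues: int = 0  # models tags->active_queues


@dataclass
class Hctx:
    tags: Tags
    shared: bool = True    # models the BLK_MQ_F_TAG_QUEUE_SHARED flag
    active: bool = False   # models "this hctx is counted as active"


def tag_busy(h: Hctx) -> None:
    # The counter is only touched while the shared flag is set.
    if h.shared and not h.active:
        h.active = True
        h.tags.active_queues += 1


def tag_idle(h: Hctx) -> None:
    if h.shared and h.active:
        h.active = False
        h.tags.active_queues -= 1


def exit_queue(h: Hctx, idle_first: bool) -> None:
    if idle_first:
        tag_idle(h)        # counter is decremented while the flag is still set
        h.shared = False
    else:
        h.shared = False   # flag cleared first ...
        tag_idle(h)        # ... so this call no longer updates the counter


def leftover_active_queues(idle_first: bool) -> int:
    tags = Tags()
    h = Hctx(tags)
    tag_busy(h)
    exit_queue(h, idle_first)
    return tags.active_queues
```

In this model, clearing the flag before going idle leaves the counter at 1, while going idle first returns it to 0.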
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Hannes Reinecke <hare@suse.com>
Fixes: 0d2602ca30 ("blk-mq: improve support for shared tags maps")
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20210513171529.7977-1-bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 03f26d8f11 ]
With a shared sbitmap, requests are no longer held in the plug list
since commit 32bc15afed ("blk-mq: Facilitate a shared sbitmap per
tagset"). This makes request merging from the flush plug list and batched
submission impossible, which causes a performance regression.
Yanhui reports a performance regression when running a sequential IO
test (libaio, 16 jobs, depth 8 for each job) in a VM whose disk
is emulated with an image stored on xfs/megaraid_sas.
Fix the issue by restoring the original behavior, which allows requests
to be held in the plug list.
Cc: Yanhui Ma <yama@redhat.com>
Cc: John Garry <john.garry@huawei.com>
Cc: Bart Van Assche <bvanassche@acm.org>
Cc: kashyap.desai@broadcom.com
Fixes: 32bc15afed ("blk-mq: Facilitate a shared sbitmap per tagset")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20210514022052.1047665-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit efed9a3337 ]
__blk_mq_sched_bio_merge() gets the ctx and hctx for the current CPU and
passes the hctx to ->bio_merge(). kyber_bio_merge() then gets the ctx
for the current CPU again and uses that to get the corresponding Kyber
context in the passed hctx. However, the thread may be preempted between
the two calls to blk_mq_get_ctx(), and the ctx returned the second time
may no longer correspond to the passed hctx. This "works" accidentally
most of the time, but it can cause us to read garbage if the second ctx
came from an hctx with more ctx's than the first one (i.e., if
ctx->index_hw[hctx->type] > hctx->nr_ctx).
This manifested as this UBSAN array index out of bounds error reported
by Jakub:
UBSAN: array-index-out-of-bounds in ../kernel/locking/qspinlock.c:130:9
index 13106 is out of range for type 'long unsigned int [128]'
Call Trace:
dump_stack+0xa4/0xe5
ubsan_epilogue+0x5/0x40
__ubsan_handle_out_of_bounds.cold.13+0x2a/0x34
queued_spin_lock_slowpath+0x476/0x480
do_raw_spin_lock+0x1c2/0x1d0
kyber_bio_merge+0x112/0x180
blk_mq_submit_bio+0x1f5/0x1100
submit_bio_noacct+0x7b0/0x870
submit_bio+0xc2/0x3a0
btrfs_map_bio+0x4f0/0x9d0
btrfs_submit_data_bio+0x24e/0x310
submit_one_bio+0x7f/0xb0
submit_extent_page+0xc4/0x440
__extent_writepage_io+0x2b8/0x5e0
__extent_writepage+0x28d/0x6e0
extent_write_cache_pages+0x4d7/0x7a0
extent_writepages+0xa2/0x110
do_writepages+0x8f/0x180
__writeback_single_inode+0x99/0x7f0
writeback_sb_inodes+0x34e/0x790
__writeback_inodes_wb+0x9e/0x120
wb_writeback+0x4d2/0x660
wb_workfn+0x64d/0xa10
process_one_work+0x53a/0xa80
worker_thread+0x69/0x5b0
kthread+0x20b/0x240
ret_from_fork+0x1f/0x30
Only Kyber uses the hctx, so fix it by passing the request_queue to
->bio_merge() instead. BFQ and mq-deadline just use that, and Kyber can
map the queues itself to avoid the mismatch.
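The mismatch described above can be replayed in a simplified user-space model. Everything here is an assumption made for illustration: the CPU-to-hctx layout, the `Ctx` and `Hctx` classes, and the two merge helpers are hypothetical, and a Python `IndexError` stands in for the out-of-bounds read seen in the kernel.

```python
from dataclasses import dataclass, field


@dataclass
class Ctx:
    hctx_id: int   # hardware queue this software queue maps to
    index_hw: int  # position of this ctx inside that hardware queue


@dataclass
class Hctx:
    hctx_id: int
    nr_ctx: int
    kcqs: list = field(default_factory=list)  # one scheduler context per ctx

    def __post_init__(self):
        self.kcqs = [f"kcq{self.hctx_id}.{i}" for i in range(self.nr_ctx)]


# Assumed layout: hctx 0 serves CPU 0 only, hctx 1 serves CPUs 1..3.
hctxs = [Hctx(0, 1), Hctx(1, 3)]
ctxs = {0: Ctx(0, 0), 1: Ctx(1, 0), 2: Ctx(1, 1), 3: Ctx(1, 2)}


def buggy_bio_merge(hctx: Hctx, cpu_now: int) -> str:
    # The hctx was looked up before preemption; the ctx is looked up again
    # on whatever CPU the thread runs on now, so the two may not match.
    ctx = ctxs[cpu_now]
    return hctx.kcqs[ctx.index_hw]


def fixed_bio_merge(cpu_now: int) -> str:
    # Look up the ctx once and derive the hctx from it, so they always match.
    ctx = ctxs[cpu_now]
    return hctxs[ctx.hctx_id].kcqs[ctx.index_hw]
```

If the thread obtained hctx 0 on CPU 0 and is then migrated to CPU 3, `buggy_bio_merge(hctxs[0], 3)` indexes past the end of the smaller array, whereas `fixed_bio_merge(3)` resolves both lookups on the same CPU.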
Fixes: a6088845c2 ("block: kyber: make kyber more friendly with merging")
Reported-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Link: https://lore.kernel.org/r/c7598605401a48d5cfeadebb678abd10af22b83f.1620691329.git.osandov@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
commit e9f4eee9a0 upstream.
When the weight of an active iocg is updated, weight_updated() is called
which in turn calls __propagate_weights() to update the active and inuse
weights so that the effective hierarchical weights are updated accordingly.
The current implementation is incorrect for inner active nodes. For an
active leaf iocg, inuse can be any value between 1 and active and the
difference represents how much the iocg is donating. When weight is updated,
as long as inuse is clamped between 1 and the new weight, we're alright and
this is what __propagate_weights() currently implements.
However, that's not how an active inner node's inuse is set. An inner node's
inuse is solely determined by the ratio between the sums of inuse's and
active's of its children - ie. they're results of propagating the leaves'
active and inuse weights upwards. __propagate_weights() incorrectly applies
the same clamping as for a leaf when an active inner node's weight is
updated. Consider a hierarchy which looks like the following with saturating
workloads in AA and BB.
          R
        /   \
       A     B
       |     |
      AA    BB
1. For both A and B, active=100, inuse=100, hwa=0.5, hwi=0.5.
2. echo 200 > A/io.weight
3. __propagate_weights() updates A's active to 200 and leaves inuse at 100 as
it's already between 1 and the new active, making A:active=200,
A:inuse=100. As R's active_sum is updated along with A's active,
A:hwa=2/3, B:hwa=1/3. However, because the inuses didn't change, the
hwi's remain unchanged at 0.5.
4. The weight of A is now twice that of B but AA and BB still have the same
hwi of 0.5 and thus are doing the same amount of IOs.
Fix it by making __propagate_weights() always calculate the inuse of an
active inner iocg based on the ratio of child_inuse_sum to child_active_sum.
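The numbers in the example above can be reproduced with a small model. This is a sketch under stated assumptions, not the kernel code: the `Iocg` class and the `set_weight()` and `hw()` helpers are hypothetical, hierarchical weights are computed as the product of per-level active (or inuse) shares, and the inner-node inuse is rounded up.

```python
from fractions import Fraction


class Iocg:
    def __init__(self, weight, parent=None):
        self.active = weight
        self.inuse = weight
        self.parent = parent
        self.children = []
        if parent is not None:
            parent.children.append(self)


def child_sums(node):
    """Return (child_active_sum, child_inuse_sum) for an inner node."""
    return (sum(c.active for c in node.children),
            sum(c.inuse for c in node.children))


def hw(node):
    """Return (hwa, hwi): the node's share at each level, multiplied up to the root."""
    hwa = hwi = Fraction(1)
    while node.parent is not None:
        active_sum, inuse_sum = child_sums(node.parent)
        hwa *= Fraction(node.active, active_sum)
        hwi *= Fraction(node.inuse, inuse_sum)
        node = node.parent
    return hwa, hwi


def set_weight(node, weight, fixed):
    node.active = weight
    if fixed and node.children:
        # Inner node: derive inuse from the children's inuse/active ratio.
        active_sum, inuse_sum = child_sums(node)
        node.inuse = -(-weight * inuse_sum // active_sum)  # ceiling division
    else:
        # Leaf behaviour (and the old behaviour for inner nodes): clamp only.
        node.inuse = max(1, min(node.inuse, weight))


def scenario(fixed):
    r = Iocg(100)
    a, b = Iocg(100, r), Iocg(100, r)
    aa, bb = Iocg(100, a), Iocg(100, b)
    set_weight(a, 200, fixed)  # models: echo 200 > A/io.weight
    return hw(aa), hw(bb)
```

With clamping only, AA ends up with hwa=2/3 but hwi=1/2; with the ratio-based calculation, both hwa and hwi of AA become 2/3 and those of BB become 1/3.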
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Dan Schatzberg <dschatzberg@fb.com>
Fixes: 7caa47151a ("blkcg: implement blk-iocost")
Cc: stable@vger.kernel.org # v5.4+
Link: https://lore.kernel.org/r/YJsxnLZV1MnBcqjj@slm.duckdns.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
For a flush request, rq->end_io() may be called twice: once from
timeout handling (blk_mq_check_expired()) and once from normal
completion (__blk_mq_end_request()).
Move blk_account_io_flush() to after flush_rq->ref drops to zero, so that
IO accounting is done just once per flush request.
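The effect of moving the accounting call can be sketched in a few lines. The `FlushRq` class, the initial reference count of 2, and both handlers are hypothetical simplifications used only to show the ordering.

```python
class FlushRq:
    def __init__(self):
        self.ref = 2        # assumed: one reference per completion path
        self.accounted = 0  # how many times flush accounting ran


def end_io_before(rq: FlushRq) -> None:
    rq.accounted += 1       # accounting runs on every end_io() call
    rq.ref -= 1
    if rq.ref:
        return              # another path still holds a reference


def end_io_after(rq: FlushRq) -> None:
    rq.ref -= 1
    if rq.ref:
        return              # not the last reference: skip accounting
    rq.accounted += 1       # accounting runs only once the ref hits zero


def run(handler) -> int:
    rq = FlushRq()
    handler(rq)             # e.g. called from timeout handling
    handler(rq)             # e.g. called from normal completion
    return rq.accounted
```

In this model the old ordering counts the flush twice, while the new ordering counts it once.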
Fixes: b686631865 ("block: add iostat counters for flush requests")
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Tested-by: John Garry <john.garry@huawei.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20210511152236.763464-2-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Bug: 188199752
(cherry picked from commit 773cd5fb22e7c61c65c7528a3e1cb5bcbc1408ea git://git.kernel.dk/linux-block/ for-5.14/block)
Signed-off-by: Bart Van Assche <bvanassche@google.com>
Change-Id: I0cbcb35fce831daab99145d2401a6fdeb5d575b4
If a tag set is shared across request queues (e.g. SCSI LUNs) then the
block layer core keeps track of the number of active request queues in
tags->active_queues. blk_mq_tag_busy() and blk_mq_tag_idle() update that
atomic counter if the hctx flag BLK_MQ_F_TAG_QUEUE_SHARED is set. Make
sure that blk_mq_exit_queue() calls blk_mq_tag_idle() before that flag is
cleared by blk_mq_del_queue_tag_set().
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Hannes Reinecke <hare@suse.com>
Fixes: 0d2602ca30 ("blk-mq: improve support for shared tags maps")
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Bug: 181696921
Link: https://lore.kernel.org/linux-block/d1b1f123-9b47-8805-0b86-2fa9f6b19bb5@kernel.dk/T/#ma8a01d264c59074d3f9b69f833ffe408592cf837
Signed-off-by: Bart Van Assche <bvanassche@google.com>
Change-Id: I3f9ca00fbefec97d1e3c9df85365be0265b8cec6
Revert most of commit 25a6f60b717d ("BACKPORT: bio: limit bio max size")
because it has been reported to cause data corruption and because it has
been reverted upstream. See also
https://lore.kernel.org/linux-block/1620571445.2k94orj8ee.none@localhost/T/#t
Cc: Changheun Lee <nanich.lee@samsung.com>
Cc: Jaegeuk Kim <jaegeuk@google.com>
Bug: 182716953
Change-Id: I79bf39d1d1a0c13cb30ff4b0f1da2b43e1c817f0
Signed-off-by: Bart Van Assche <bvanassche@google.com>
The inflight count of partition 0 doesn't include inflight IOs to its
sub-partitions, because blk-mq currently calculates the inflight count of a
specific partition by simply comparing the value of the partition pointer.
Thus the following case is possible:
$ cat /sys/block/vda/inflight
0 0
$ cat /sys/block/vda/vda1/inflight
0 128
A single queue device (on a previous version, e.g. v3.10) does not have
this issue:
$ cat /sys/block/sda/sda3/inflight
0 33
$ cat /sys/block/sda/inflight
0 33
Partition 0 should be specially handled since it represents the whole
disk. This issue was introduced by commit bf0ddaba65 ("blk-mq: fix
sysfs inflight counter").
This patch also fixes the inflight statistics of part 0 in
/proc/diskstats. Before this patch, the inflight statistics of part 0
didn't include those of the sub-partitions. (I have marked the 'inflight'
field with asterisks.)
$ cat /proc/diskstats
259 0 nvme0n1 45974469 0 367814768 6445794 1 0 1 0 *0* 111062 6445794 0 0 0 0 0 0
259 2 nvme0n1p1 45974058 0 367797952 6445727 0 0 0 0 *33* 111001 6445727 0 0 0 0 0 0
This was introduced by commit f299b7c7a9 ("blk-mq: provide internal
in-flight variant").
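A minimal model of the two counting rules, assuming each in-flight request simply records the partition it was issued to. The `Part` class and both counting functions are hypothetical; only the rule "partition number 0 matches every request" is taken from the description above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Part:
    name: str
    partno: int  # 0 means the whole disk


def inflight_pointer_compare(requests, part: Part) -> int:
    # Old rule: a request counts only if it targets exactly this partition.
    return sum(1 for rq_part in requests if rq_part is part)


def inflight_part0_aware(requests, part: Part) -> int:
    # New rule: partition 0 represents the whole disk and counts everything.
    return sum(1 for rq_part in requests
               if part.partno == 0 or rq_part is part)


vda = Part("vda", 0)
vda1 = Part("vda1", 1)
requests = [vda1] * 128  # 128 requests in flight, all issued to vda1
```

With the pointer comparison, `vda` reports 0 while `vda1` reports 128, as in the sysfs output quoted above; with the partition-0 rule both report 128.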
Fixes: bf0ddaba65 ("blk-mq: fix sysfs inflight counter")
Fixes: f299b7c7a9 ("blk-mq: provide internal in-flight variant")
Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
[axboe: adapt for 5.11 partition change]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Bug: 187355247
Change-Id: I378b2cb7312a5e47d5e2ec7301dc392e6e7336d0
(cherry picked from commit b0d97557eb)
Signed-off-by: Bart Van Assche <bvanassche@google.com>
The bio size can grow up to 4GB when multi-page bvec is enabled,
but sometimes this leads to inefficient behavior.
In the case of large chunk direct I/O (a 32MB chunk read in user space),
all pages for the 32MB would be merged into one bio structure if the pages'
physical addresses are contiguous. This delays submission
until the merge completes. The bio max size should be limited to a proper size.
When a 32MB chunk read with the direct I/O option comes from userspace,
the current kernel behavior in the do_direct_IO() loop follows this timeline:
 | bio merge for 32MB. total 8,192 pages are merged.
 | total elapsed time is over 2ms.
 |------------------ ... ----------------------->|
                                                 | 8,192 pages merged a bio.
                                                 | at this time, first bio submit is done.
                                                 | 1 bio is split to 32 read request and issue.
                                                 |--------------->
                                                  |--------------->
                                                   |--------------->
                                                              ......
                                                                   |--------------->
                                                                    |--------------->|
                          total 19ms elapsed to complete 32MB read done from device. |
If the bio max size is limited to 1MB, the behavior changes as follows.
 | bio merge for 1MB. 256 pages are merged for each bio.
 | total 32 bio will be made.
 | total elapsed time is over 2ms. it's same.
 | but, first bio submit timing is fast. about 100us.
 |--->|--->|--->|---> ... -->|--->|--->|--->|--->|
      | 256 pages merged a bio.
      | at this time, first bio submit is done.
      | and 1 read request is issued for 1 bio.
      |--------------->
           |--------------->
                |--------------->
                                      ......
                                                 |--------------->
                                                  |--------------->|
        total 17ms elapsed to complete 32MB read done from device. |
As a result, read requests are issued earlier if the bio max size is limited.
With the current multi-page bvec behavior, a very large bio can be created,
which delays the issue of the first I/O request.
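The page and bio counts in the two timelines follow from simple arithmetic, sketched below. The 4 KiB page size is an assumption (it is what makes 32MB equal 8,192 pages), and `first_submit_delay_us()` additionally assumes that merge time is spread evenly across bios, which is a rough estimate rather than a measurement.

```python
PAGE_SIZE = 4096   # assumed page size in bytes
MIB = 1 << 20
GIB = 1 << 30


def pages_per_bio(io_bytes: int, bio_max_bytes: int) -> int:
    """Pages merged into one bio before it is submitted."""
    return min(io_bytes, bio_max_bytes) // PAGE_SIZE


def bio_count(io_bytes: int, bio_max_bytes: int) -> int:
    """Number of bios needed for the whole I/O (ceiling division)."""
    return -(-io_bytes // min(io_bytes, bio_max_bytes))


def first_submit_delay_us(total_merge_us: float, io_bytes: int,
                          bio_max_bytes: int) -> float:
    """Estimated time until the first bio is full and can be submitted."""
    return total_merge_us / bio_count(io_bytes, bio_max_bytes)
```

For a 32MB read, a 4GB limit gives one bio of 8,192 pages, while a 1MB limit gives 32 bios of 256 pages each. With a total merge time of 2ms, the even-spread estimate puts the first submission at about 62.5us, the same order of magnitude as the roughly 100us reported above.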
Signed-off-by: Changheun Lee <nanich.lee@samsung.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20210503095203.29076-1-nanich.lee@samsung.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Bug: 182716953
(cherry picked from commit cd2c7545ae)
Change-Id: Ie3876daa495535dc7f856ed9a281e65d72a437c1
Signed-off-by: Bart Van Assche <bvanassche@google.com>
* android12-5.10: (966 commits)
ANDROID: Support disabling symbol trimming
ANDROID: Incremental fs: Fix pseudo-file attributes
ANDROID: sched: Fix missing RQCF_UPDATED in migrate_tasks
FROMLIST: mm, thp: Relax the VM_DENYWRITE constraint on file-backed THPs
ANDROID: GKI: Update the generic symbol list
ANDROID: ABI: Add symbols for crypto
ANDROID: ABI: Update the ABI XML
Revert "ANDROID: GKI: Change UCLAMP_BUCKETS_COUNT to 20"
ANDROID: vendor_hooks: Add hook for binder
UPSTREAM: crypto: arm/blake2s - fix for big endian
UPSTREAM: crypto: arm/blake2b - drop unnecessary return statement
FROMGIT: kasan, arm64: tests supports for HW_TAGS async mode
FROMGIT: arm64: mte: Report async tag faults before suspend
FROMGIT: arm64: mte: Enable async tag check fault
FROMGIT: arm64: mte: Conditionally compile mte_enable_kernel_*()
ANDROID: ABI: Update the ABI xml
ANDROID: ABI: Update the generic symbol list
ANDROID: selinux: add vendor hook in selinux
FROMGIT: arm64: mte: Enable TCO in functions that can read beyond buffer limits
ANDROID: sched: Add vendor hooks for update_load_avg
...
Change-Id: I74731b47c1f6cd67cea9622113833b3f8c994544
Changes in 5.10.33
vhost-vdpa: protect concurrent access to vhost device iotlb
gpio: omap: Save and restore sysconfig
KEYS: trusted: Fix TPM reservation for seal/unseal
vdpa/mlx5: Set err = -ENOMEM in case dma_map_sg_attrs fails
pinctrl: lewisburg: Update number of pins in community
block: return -EBUSY when there are open partitions in blkdev_reread_part
pinctrl: core: Show pin numbers for the controllers with base = 0
arm64: dts: allwinner: Revert SD card CD GPIO for Pine64-LTS
bpf: Permits pointers on stack for helper calls
bpf: Allow variable-offset stack access
bpf: Refactor and streamline bounds check into helper
bpf: Tighten speculative pointer arithmetic mask
locking/qrwlock: Fix ordering in queued_write_lock_slowpath()
perf/x86/intel/uncore: Remove uncore extra PCI dev HSWEP_PCI_PCU_3
perf/x86/kvm: Fix Broadwell Xeon stepping in isolation_ucodes[]
perf auxtrace: Fix potential NULL pointer dereference
perf map: Fix error return code in maps__clone()
HID: google: add don USB id
HID: alps: fix error return code in alps_input_configured()
HID cp2112: fix support for multiple gpiochips
HID: wacom: Assign boolean values to a bool variable
soc: qcom: geni: shield geni_icc_get() for ACPI boot
dmaengine: xilinx: dpdma: Fix descriptor issuing on video group
dmaengine: xilinx: dpdma: Fix race condition in done IRQ
ARM: dts: Fix swapped mmc order for omap3
net: geneve: check skb is large enough for IPv4/IPv6 header
dmaengine: tegra20: Fix runtime PM imbalance on error
s390/entry: save the caller of psw_idle
arm64: kprobes: Restore local irqflag if kprobes is cancelled
xen-netback: Check for hotplug-status existence before watching
cavium/liquidio: Fix duplicate argument
kasan: fix hwasan build for gcc
csky: change a Kconfig symbol name to fix e1000 build error
ia64: fix discontig.c section mismatches
ia64: tools: remove duplicate definition of ia64_mf() on ia64
x86/crash: Fix crash_setup_memmap_entries() out-of-bounds access
net: hso: fix NULL-deref on disconnect regression
USB: CDC-ACM: fix poison/unpoison imbalance
Linux 5.10.33
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: I638db3c919ad938eaaaac3d687175252edcd7990
[ Upstream commit 68e6582e8f ]
The switch to go through blkdev_get_by_dev means we now ignore the
return value from bdev_disk_changed in __blkdev_get. Add a manual
check to restore the old semantics.
Fixes: 4601b4b130 ("block: reopen the device in blkdev_reread_part")
Reported-by: Karel Zak <kzak@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20210421160502.447418-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
* android12-5.10: (1647 commits)
FROMGIT: mm/page_owner: record the timestamp of all pages during free
UPSTREAM: mm/page_io: use pr_alert_ratelimited for swap read/write errors
ANDROID: roll back xt_IDLETIMER to 5.10.21 upstream/vanilla version
ANDROID: qcom: Add ip, rtnl and free related symbols
FROMGIT: power: supply: Fix build error when CONFIG_POWER_SUPPLY is not enabled.
FROMGIT: usb: dwc3: gadget: modify the scale in vbus_draw callback
BACKPORT: FROMLIST: usb: dwc3: gadget: Clear DEP flags after stop transfers in ep disable
FROMLIST: Makefile: fix GDB warning with CONFIG_RELR
ANDROID: refresh ABI XML before enabling KMI enforcement
Revert "Revert "ANDROID: GKI: Enable bounds sanitizer""
Revert "ANDROID: Revert "f2fs: fix to tag FIEMAP_EXTENT_MERGED in f2fs_fiemap()""
ANDROID: Enforce KMI stability
ANDROID: enable options prior to enforcing KMI
Revert "ANDROID: GKI: temporarily disable LTO/CFI"
ANDROID: gki_defconfig: Enable NET_CLS_{BASIC,TCINDEX,MATCHALL} & NET_ACT_{GACT,MIRRED}
FROMLIST: selftests: Add a MREMAP_DONTUNMAP selftest for shmem
FROMLIST: mm: Extend MREMAP_DONTUNMAP to non-anonymous mappings
ANDROID: GKI: enable CONFIG_CMA_SYSFS
ANDROID: make cma_sysfs experimental
FROMLIST: mm: cma: support sysfs
...
Change-Id: I6145eddeb253bea33164fc909e7790d30f17ef1f
Changes in 5.10.31
interconnect: core: fix error return code of icc_link_destroy()
gfs2: Flag a withdraw if init_threads() fails
KVM: arm64: Hide system instruction access to Trace registers
KVM: arm64: Disable guest access to trace filter controls
drm/imx: imx-ldb: fix out of bounds array access warning
gfs2: report "already frozen/thawed" errors
ftrace: Check if pages were allocated before calling free_pages()
tools/kvm_stat: Add restart delay
drm/tegra: dc: Don't set PLL clock to 0Hz
gpu: host1x: Use different lock classes for each client
XArray: Fix splitting to non-zero orders
block: only update parent bi_status when bio fail
radix tree test suite: Register the main thread with the RCU library
idr test suite: Take RCU read lock in idr_find_test_1
idr test suite: Create anchor before launching throbber
null_blk: fix command timeout completion handling
io_uring: don't mark S_ISBLK async work as unbounded
riscv,entry: fix misaligned base for excp_vect_table
block: don't ignore REQ_NOWAIT for direct IO
netfilter: x_tables: fix compat match/target pad out-of-bound write
perf map: Tighten snprintf() string precision to pass gcc check on some 32-bit arches
net: sfp: relax bitrate-derived mode check
net: sfp: cope with SFPs that set both LOS normal and LOS inverted
xen/events: fix setting irq affinity
Linux 5.10.31
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: I19a7cfbdaab23e578dd82c552aea86d367c2f40f
[ Upstream commit 3edf5346e4 ]
For multiple split bios, if one of the bios fails, the whole I/O
should return an error to the application. But we found a race
between bio_integrity_verify_fn and bio completion, which returns
I/O success to the application after one of the bios has failed. The race
is as follows:
split bio(READ)          kworker
nvme_complete_rq
blk_update_request //split error=0
  bio_endio
    bio_integrity_endio
      queue_work(kintegrityd_wq, &bip->bip_work);
                         bio_integrity_verify_fn
                         bio_endio //split bio
                          __bio_chain_endio
                             if (!parent->bi_status)
                               <interrupt entry>
                               nvme_irq
                                blk_update_request //parent error=7
                                req_bio_endio
                                   bio->bi_status = 7 //parent bio
                               <interrupt exit>
                               parent->bi_status = 0
                        parent->bi_end_io() // return bi_status=0
The bio has been split in two: split and parent. When the split
bio completes, it depends on a kworker to do endio, while
bio_integrity_verify_fn has been interrupted by the parent bio's
completion irq handler. Then the parent bio->bi_status that was
set in the irq handler is overwritten by the kworker.
In fact, even without the above race, we also need to consider
the concurrency between multiple split bios completing and updating
the same parent bi_status. Normally, multiple split bios will
be issued to the same hctx and complete from the same irq
vector. But if the queue map was updated between multiple split
bios, these bios may complete on different hw queues and different
irq vectors. Then the concurrent update of the parent bi_status may
cause the final status to be wrong.
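The interleaving above can be replayed deterministically in a small model. The `Bio` class and both `chain_endio_*` helpers are hypothetical; the `irq` callback is invoked between the status check and the store to mimic the interrupt window, and the new rule reflects the subject line "only update parent bi_status when bio fail".

```python
class Bio:
    def __init__(self, status: int = 0):
        self.bi_status = status


def chain_endio_old(split: Bio, parent: Bio, irq) -> None:
    should_store = not parent.bi_status   # check parent only
    irq()                                 # interrupt lands after the check
    if should_store:
        parent.bi_status = split.bi_status  # may store 0 over an error


def chain_endio_new(split: Bio, parent: Bio, irq) -> None:
    # Only propagate when the completing bio actually failed.
    should_store = bool(split.bi_status) and not parent.bi_status
    irq()
    if should_store:
        parent.bi_status = split.bi_status


def replay(chain_endio) -> int:
    split, parent = Bio(0), Bio(0)        # split bio completed successfully

    def irq():
        parent.bi_status = 7              # parent bio fails in the irq handler

    chain_endio(split, parent, irq)
    return parent.bi_status               # status seen by parent->bi_end_io()
```

In this replay the old rule reports status 0 to the caller even though the parent failed with 7, while the new rule preserves the 7.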
Suggested-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Yufen Yu <yuyufen@huawei.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20210331115359.1125679-1-yuyufen@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>