Commit Graph

8871 Commits

Author SHA1 Message Date
Ming Lei
f7700a4415 ublk: fix use-after-free in ublk_cancel_cmd()
When ublk_reset_ch_dev() clears io->cmd via ublk_queue_reinit()
concurrently with ublk_cancel_cmd(), ublk_cancel_cmd() can read a
stale pointer and pass it to io_uring_cmd_done(), causing a
use-after-free.

Fix by synchronizing the two paths with ubq->cancel_lock:

- ublk_cancel_cmd(): read and clear io->cmd under cancel_lock,
  then call io_uring_cmd_done() on the saved local copy outside
  the lock.

- ublk_reset_ch_dev(): hold cancel_lock across ublk_queue_reinit()
  so that io->cmd and io->flags are cleared atomically with respect
  to ublk_cancel_cmd().

Fixes: 216c8f5ef0 ("ublk: replace monitor with cancelable uring_cmd")
Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Link: https://patch.msgid.link/20260508123746.242018-1-tom.leiming@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-05-08 06:44:42 -06:00
Ming Lei
86f33ca9be ublk: validate physical_bs_shift, io_min_shift and io_opt_shift
ublk_validate_params() checks logical_bs_shift is within
[9, PAGE_SHIFT] but has no upper bound for physical_bs_shift,
io_min_shift, or io_opt_shift. A malicious userspace can set any
of these to a large value (e.g., 44), causing undefined behavior
from `1 << shift` in ublk_ctrl_start_dev() since the result is
stored in 32-bit unsigned int.

Cap all three at ilog2(SZ_256M) (28). 256M is big enough to cover
all practical block sizes, and originates from the maximum physical
block size possible in NVMe (lba_size * (1 + npwg), where npwg is
16-bit).

Also zero out ub->params with memset() when copy_from_user() fails
or ublk_validate_params() returns error, so that no stale or partial
params survive for a subsequent START_DEV to consume.

Fixes: 71f28f3136 ("ublk_drv: add io_uring based userspace block driver")
Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Link: https://patch.msgid.link/20260506082238.22363-1-tom.leiming@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-05-06 04:42:06 -06:00
Jens Axboe
845db023a8 ublk: don't issue uring_cmd from fallback task work
When ublk_ch_uring_cmd_cb() runs as fallback task work (e.g., because
the submitting task is exiting), the command should not be issued as
current is a kworker, not the daemon task. This can cause io->task
to capture the wrong task in __ublk_fetch(), leading to a task
mismatch warning in ublk_uring_cmd_cancel_fn().

Check tw.cancel and return -ECANCELED instead of issuing the command
from fallback context.

Fixes: 3421c7f68b ("ublk: make sure io cmd handled in submitter task context")
Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Link: https://patch.msgid.link/20260501112312.947327-1-tom.leiming@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-05-01 05:25:02 -06:00
Linus Torvalds
f3e3dbcea1 block-7.1-20260424
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmnrh6gQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpkE+D/4h1caXVwyXjeBQ+UQ0hXqRkPHgCym+NxAx
 W/ezIs2/Yldbe31x5YZqtYXfV+0SyZhd6ugDbRpOahol8+7wpa4E4wZNYfoxKKqF
 +q9SysEDRCAOOGkCtNvxpALIzmmmXEswrqwJOmge2J+vdNsZwFq8APMiGS0kEo3x
 nJrzOdxnpMHVY68yCB+LtRjggyptoJNMT14lIeFg1uXLAUK7CAg8Ax3qSzuAcW1m
 zWTkGUudrVtVWNEjKdR4kvPWtrkbNb8PY6ybKxJo+4z1XMrOkVlbIAEX6o6dOIqy
 hk5WKWNQIU82oxUi+FondGgq2yRHwuRY4fiDAz+eQ+SO6EcaUxQWqJgAD3gcNyRl
 C2ywNRFT3EZhYHSr8ec+kuNX3Bot5YvRP2BdFpQpmKLhGGs1JwV9pjsYiLr1bXOy
 fI6Wo1CsEnHM7I4rBFaji7W9ZhOXo/6IlehfDO2vf6iAAaWo1aDnJ4gOxNMcIYx3
 hCGWWiug1RadsPXKe9wIjGjR9ahUl48Ucs06fgN2kMjtPb09QsB66yUF7jJS9mvk
 Mdn9EltVqLIoyTt5vS/hUk/hO7VPzYaD/pL0yun4I9h7vi9hEyd/rXm3BIvuLQgJ
 7u7c00ozgZyNVbubUa3lOxJLW7L/ukxHsCOJQW+FW/uJUvqxC3g5LGYaxe9r6uE0
 KB1uNV7sKg==
 =0AP+
 -----END PGP SIGNATURE-----

Merge tag 'block-7.1-20260424' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux

Pull block fixes from Jens Axboe:

 - Series for zloop, fixing a variety of issues

 - t10-pi code cleanup

 - Fix for a merge window regression with the bio memory allocation mask

 - Fix for a merge window regression in ublk, caused by an issue with
   the maple tree iteration code at teardown

 - ublk self tests additions

 - Zoned device pgmap fixes

 - Various little cleanups and fixes

* tag 'block-7.1-20260424' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux: (21 commits)
  Revert "floppy: fix reference leak on platform_device_register() failure"
  ublk: avoid unpinning pages under maple tree spinlock
  ublk: refactor common helper ublk_shmem_remove_ranges()
  ublk: fix maple tree lockdep warning in ublk_buf_cleanup
  selftests: ublk: add ublk auto integrity test
  selftests: ublk: enable test_integrity_02.sh on fio 3.42
  selftests: ublk: remove unused argument to _cleanup
  block: only restrict bio allocation gfp mask asked to block
  block/blk-throttle: Add WQ_PERCPU to alloc_workqueue users
  block: Add WQ_PERCPU to alloc_workqueue users
  block: relax pgmap check in bio_add_page for compatible zone device pages
  block: add pgmap check to biovec_phys_mergeable
  floppy: fix reference leak on platform_device_register() failure
  ublk: use unchecked copy helpers for bio page data
  t10-pi: reduce ref tag code duplication
  zloop: remove irq-safe locking
  zloop: factor out zloop_mark_{full,empty} helpers
  zloop: set RQF_QUIET when completing requests on deleted devices
  zloop: improve the unaligned write pointer warning
  zloop: use vfs_truncate
  ...
2026-04-24 15:06:55 -07:00
Linus Torvalds
ac2dc6d574 We have a series from Alex which extends CephFS client metrics with
support for per-subvolume data I/O performance and latency tracking
 (metadata operations aren't included) and a good variety of fixes and
 cleanups across RBD and CephFS.
 -----BEGIN PGP SIGNATURE-----
 
 iQFHBAABCgAxFiEEydHwtzie9C7TfviiSn/eOAIR84sFAmnrq1YTHGlkcnlvbW92
 QGdtYWlsLmNvbQAKCRBKf944AhHzi+WFCACA2Yc6oj6W4yXX2LSGCFCN3FOanSb3
 6ZvPeSrmAALzwD9ZXdef6j50An6w05P7kXKmAyKTgmW2tpiRciJs6uT6y7By/aph
 uGZCaPoJWPDvTlo8d05MAVuyfoKH5eU8pwx2YiEMN5W6kfo7VJQze6BgLbvQt7yH
 ToIzzBLifYONH4vF3nfsHj/uCr38Cbpr6GWY8LIPo8QtInWKJTcwF7HWVVicCaMs
 yqf1t+/CWzlIsnnIQtp+aSxWlpoA5lAqWxGt3jSfd3eVTCAL8eDzw5fkbGMRJYgM
 paH3kZ+LuJWkRXe2ts/RMrXWLJF3ZWOVD6sWU6sfnXf+vBe4SkiwwcUt
 =Ooc5
 -----END PGP SIGNATURE-----

Merge tag 'ceph-for-7.1-rc1' of https://github.com/ceph/ceph-client

Pull ceph updates from Ilya Dryomov:
 "We have a series from Alex which extends CephFS client metrics with
  support for per-subvolume data I/O performance and latency tracking
  (metadata operations aren't included) and a good variety of fixes and
  cleanups across RBD and CephFS"

* tag 'ceph-for-7.1-rc1' of https://github.com/ceph/ceph-client:
  ceph: add subvolume metrics collection and reporting
  ceph: parse subvolume_id from InodeStat v9 and store in inode
  ceph: handle InodeStat v8 versioned field in reply parsing
  libceph: Fix slab-out-of-bounds access in auth message processing
  rbd: fix null-ptr-deref when device_add_disk() fails
  crush: cleanup in crush_do_rule() method
  ceph: clear s_cap_reconnect when ceph_pagelist_encode_32() fails
  ceph: only d_add() negative dentries when they are unhashed
  libceph: update outdated comment in ceph_sock_write_space()
  libceph: Remove obsolete session key alignment logic
  ceph: fix num_ops off-by-one when crypto allocation fails
  libceph: Prevent potential null-ptr-deref in ceph_handle_auth_reply()
2026-04-24 13:47:19 -07:00
Jens Axboe
895a9b3791 Revert "floppy: fix reference leak on platform_device_register() failure"
This reverts commit e784f2ea0b.

Jiri says the patch is buggy, and it looks like he is right revert it
for now.

Link: https://lore.kernel.org/linux-block/897f442d-4e04-4b70-b716-38fd10b8af36@kernel.org/
Reported-by: Jiri Slaby <jirislaby@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-04-23 05:07:37 -06:00
Ming Lei
309e02dccf ublk: avoid unpinning pages under maple tree spinlock
ublk_shmem_remove_ranges() calls unpin_user_pages() while holding the
maple tree spinlock (mas_lock). Although unpin_user_pages() is safe in
atomic context, holding the spinlock across potentially many page
unpinning operations is not ideal.

Split into __ublk_shmem_remove_ranges() which erases up to 64 ranges
under mas_lock, collecting base_pfn and nr_pages into a temporary
xarray. Then drop the lock and unpin pages outside spinlock context.
ublk_shmem_remove_ranges() loops until all matching ranges are
processed.

Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Link: https://patch.msgid.link/20260423033058.2805135-4-tom.leiming@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-04-23 04:52:47 -06:00
Ming Lei
ea1db795de ublk: refactor common helper ublk_shmem_remove_ranges()
Extract the shared walk+erase+unpin+kfree loop into
ublk_shmem_remove_ranges(). When buf_index >= 0, only ranges matching
that index are removed; when buf_index < 0, all ranges are removed.

Also extract ublk_unpin_range_pages() to share the page unpinning
loop.

Convert both __ublk_ctrl_unreg_buf() and ublk_buf_cleanup() to use
the new helper.

Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Link: https://patch.msgid.link/20260423033058.2805135-3-tom.leiming@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-04-23 04:52:47 -06:00
Ming Lei
47903faa5c ublk: fix maple tree lockdep warning in ublk_buf_cleanup
ublk_buf_cleanup() iterates the maple tree with mas_for_each()
without holding mas_lock, triggering a lockdep splat on
CONFIG_PROVE_RCU kernels since mas_find() internally uses
rcu_dereference_check() which requires either RCU or the tree lock.

Fix by holding mas_lock around the iteration, and call mas_erase()
before freeing each range to avoid dangling pointers in the tree.

Fixes: 5e864438e2 ("ublk: replace xarray with IDA for shmem buffer index allocation")
Reported-by: Jens Axboe <axboe@kernel.dk>
Closes: https://lore.kernel.org/linux-block/0349d72d-dff8-4f9f-b448-919fa5ae96da@kernel.dk/
Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Link: https://patch.msgid.link/20260423033058.2805135-2-tom.leiming@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-04-23 04:52:47 -06:00
Dawei Feng
d1fef92e41 rbd: fix null-ptr-deref when device_add_disk() fails
do_rbd_add() publishes the device with device_add() before calling
device_add_disk(). If device_add_disk() fails after device_add()
succeeds, the error path calls rbd_free_disk() directly and then later
falls through to rbd_dev_device_release(), which calls rbd_free_disk()
again. This double teardown can leave blk-mq cleanup operating on
invalid state and trigger a null-ptr-deref in
__blk_mq_free_map_and_rqs(), reached from blk_mq_free_tag_set().

Fix this by following the normal remove ordering: call device_del()
before rbd_dev_device_release() when device_add_disk() fails after
device_add(). That keeps the teardown sequence consistent and avoids
re-entering disk cleanup through the wrong path.

The bug was first flagged by an experimental analysis tool we are
developing for kernel memory-management bugs while analyzing
v6.13-rc1. The tool is still under development and is not yet publicly
available.

We reproduced the bug on v7.0 with a real Ceph backend and a QEMU x86_64
guest booted with KASAN and CONFIG_FAILSLAB enabled. The reproducer
confines failslab injections to the __add_disk() range and injects
fail-nth while mapping an RBD image through
/sys/bus/rbd/add_single_major.

On the unpatched kernel, fail-nth=4 reliably triggered the fault:

	Oops: general protection fault, probably for non-canonical address 0xdffffc0000000000: 0000 [#1] SMP KASAN NOPTI
	KASAN: null-ptr-deref in range [0x0000000000000000-0x0000000000000007]
	CPU: 0 UID: 0 PID: 273 Comm: bash Not tainted 7.0.0-01247-gd60bc1401583 #6 PREEMPT(lazy)
	Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.15.0-1 04/01/2014
	RIP: 0010:__blk_mq_free_map_and_rqs+0x8c/0x240
	Code: 00 00 48 8b 6b 60 41 89 f4 49 c1 e4 03 4c 01 e5 45 85 ed 0f 85 0a 01 00 00 48 b8 00 00 00 00 00 fc ff df 48 89 e9 48 c1 e9 03 <80> 3c 01 00 0f 85 31 01 00 00 4c 8b 6d 00 4d 85 ed 0f 84 e2 00 00
	RSP: 0018:ff1100000ab0fac8 EFLAGS: 00000246
	RAX: dffffc0000000000 RBX: ff1100000c4806a0 RCX: 0000000000000000
	RDX: 0000000000000002 RSI: 0000000000000000 RDI: ff1100000c4806f4
	RBP: 0000000000000000 R08: 0000000000000001 R09: ffe21c000189001b
	R10: ff1100000c4800df R11: ff1100006cf37be0 R12: 0000000000000000
	R13: 0000000000000000 R14: ff1100000c480700 R15: ff1100000c480004
	FS:  00007f0fbe8fe740(0000) GS:ff110000e5851000(0000) knlGS:0000000000000000
	CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
	CR2: 00007fe53473b2e0 CR3: 0000000012eef000 CR4: 00000000007516f0
	PKRU: 55555554
	Call Trace:
	 <TASK>
	 blk_mq_free_tag_set+0x77/0x460
	 do_rbd_add+0x1446/0x2b80
	 ? __pfx_do_rbd_add+0x10/0x10
	 ? lock_acquire+0x18c/0x300
	 ? find_held_lock+0x2b/0x80
	 ? sysfs_file_kobj+0xb6/0x1b0
	 ? __pfx_sysfs_kf_write+0x10/0x10
	 kernfs_fop_write_iter+0x2f4/0x4a0
	 vfs_write+0x98e/0x1000
	 ? expand_files+0x51f/0x850
	 ? __pfx_vfs_write+0x10/0x10
	 ksys_write+0xf2/0x1d0
	 ? __pfx_ksys_write+0x10/0x10
	 do_syscall_64+0x115/0x690
	 entry_SYSCALL_64_after_hwframe+0x77/0x7f
	RIP: 0033:0x7f0fbea15907
	Code: 10 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24
	RSP: 002b:00007ffe22346ea8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
	RAX: ffffffffffffffda RBX: 0000000000000058 RCX: 00007f0fbea15907
	RDX: 0000000000000058 RSI: 0000563ace6c0ef0 RDI: 0000000000000001
	RBP: 0000563ace6c0ef0 R08: 0000563ace6c0ef0 R09: 6b6435726d694141
	R10: 5250337279762f78 R11: 0000000000000246 R12: 0000000000000058
	R13: 00007f0fbeb1c780 R14: ff1100000c480700 R15: ff1100000c480004
	 </TASK>

With this fix applied, rerunning the reproducer over fail-nth=1..256
yields no KASAN reports.

[ idryomov: rename err_out_device_del -> err_out_device ]

Cc: stable@vger.kernel.org
Fixes: 27c97abc30 ("rbd: add add_disk() error handling")
Signed-off-by: Zilin Guan <zilin@seu.edu.cn>
Signed-off-by: Dawei Feng <dawei.feng@seu.edu.cn>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2026-04-22 01:40:23 +02:00
Linus Torvalds
40735a683b mm.git review status for linus..mm-stable
Everything:
 
 Total patches:       121
 Reviews/patch:       2.11
 Reviewed rate:       90%
 
 Excluding DAMON:
 
 Total patches:       113
 Reviews/patch:       2.25
 Reviewed rate:       96%
 
 - The 33 patch series "Eliminate Dying Memory Cgroup" from Qi Zheng and
   Muchun Song addresses the longstanding "dying memcg problem".  A
   situation wherein a no-longer-used memory control group will hang around
   for an extended period pointlessly consuming memory.  The [0/N]
   changelog has a good overview of this work.
 
 - The 3 patch series "fix unexpected type conversions and potential
   overflows" from Qi Zheng fixes a couple of potential 32-bit/64-bit
   issues which were identified during review of the "Eliminate Dying
   Memory Cgroup" series.
 
 - The 6 patch series "kho: history: track previous kernel version and
   kexec boot count" from Breno Leitao uses Kexec Handover (KHO) to pass
   the previous kernel's version string and the number of kexec reboots
   since the last cold boot to the next kernel, and prints it at boot time.
 
 - The 4 patch series "liveupdate: prevent double preservation" from
   Pasha Tatashin teaches LUO to avoid managing the same file across
   different active sessions.
 
 - The 10 patch series "liveupdate: Fix module unloading and unregister
   API" from Pasha Tatashin addresses an issue with how LUO handles module
   reference counting and unregistration during module unloading.
 
 - The 2 patch series "zswap pool per-CPU acomp_ctx simplifications" from
   Kanchana Sridhar simplifies and cleans up the zswap crypto compression
   handling and improves the lifecycle management of zswap pool's per-CPU
   acomp_ctx resources.
 
 - The 2 patch series "mm/damon/core: fix damon_call()/damos_walk() vs
   kdmond exit race" from SeongJae Park addresses unlikely but possible
   leaks and deadlocks in damon_call() and damon_walk().
 
 - The 2 patch series "mm/damon/core: validate damos_quota_goal->nid"
   from SeongJae Park fixes a couple of root-only wild pointer
   dereferences.
 
 - The 2 patch series "Docs/admin-guide/mm/damon: warn commit_inputs vs
   other params race" from SeongJae Park updates the DAMON documentation to
   warn operators about potential races which can occur if the
   commit_inputs parameter is altered at the wrong time.
 
 - The 3 patch series "Minor hmm_test fixes and cleanups" from Alistair
   Popple implements two bugfixes a cleanup for the HMM kernel selftests.
 
 - The 6 patch series "Modify memfd_luo code" from Chenghao Duan provides
   cleanups, simplifications and speedups in the memfd_lou code.
 
 - The 4 patch series "mm, kvm: allow uffd support in guest_memfd" from
   Mike Rapoport enables support for userfaultfd in guest_memfd.
 
 - The 6 patch series "selftests/mm: skip several tests when thp is not
   available" from Chunyu Hu fixes several issues in the selftests code
   which were causing breakage when the tests were run on CONFIG_THP=n
   kernels.
 
 - The 2 patch series "mm/mprotect: micro-optimization work" from Pedro
   Falcato implements a couple of nice speedups for mprotect().
 
 - The 3 patch series "MAINTAINERS: update KHO and LIVE UPDATE entries"
   from Pratyush Yadav reflects upcoming changes in the maintenance of KHO,
   LUO, memfd_luo, kexec, crash, kdump and probably other kexec-based
   things - they are being moved out of mm.git and into a new git tree.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCaeNL/wAKCRDdBJ7gKXxA
 jt7EAQCEEQvYYTjld+8HJKsCbavY4pEfci7z4SBiQyIPjRracQD/ZfjXnzL7ucc1
 b6q6G4TcslvIDBgzVkk9G2BVn2oCoAg=
 =3ozv
 -----END PGP SIGNATURE-----

Merge tag 'mm-stable-2026-04-18-02-14' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull more MM updates from Andrew Morton:

 - "Eliminate Dying Memory Cgroup" (Qi Zheng and Muchun Song)

   Address the longstanding "dying memcg problem". A situation wherein a
   no-longer-used memory control group will hang around for an extended
   period pointlessly consuming memory

 - "fix unexpected type conversions and potential overflows" (Qi Zheng)

   Fix a couple of potential 32-bit/64-bit issues which were identified
   during review of the "Eliminate Dying Memory Cgroup" series

 - "kho: history: track previous kernel version and kexec boot count"
   (Breno Leitao)

   Use Kexec Handover (KHO) to pass the previous kernel's version string
   and the number of kexec reboots since the last cold boot to the next
   kernel, and print it at boot time

 - "liveupdate: prevent double preservation" (Pasha Tatashin)

   Teach LUO to avoid managing the same file across different active
   sessions

 - "liveupdate: Fix module unloading and unregister API" (Pasha
   Tatashin)

   Address an issue with how LUO handles module reference counting and
   unregistration during module unloading

 - "zswap pool per-CPU acomp_ctx simplifications" (Kanchana Sridhar)

   Simplify and clean up the zswap crypto compression handling and
   improve the lifecycle management of zswap pool's per-CPU acomp_ctx
   resources

 - "mm/damon/core: fix damon_call()/damos_walk() vs kdmond exit race"
   (SeongJae Park)

   Address unlikely but possible leaks and deadlocks in damon_call() and
   damon_walk()

 - "mm/damon/core: validate damos_quota_goal->nid" (SeongJae Park)

   Fix a couple of root-only wild pointer dereferences

 - "Docs/admin-guide/mm/damon: warn commit_inputs vs other params race"
   (SeongJae Park)

   Update the DAMON documentation to warn operators about potential
   races which can occur if the commit_inputs parameter is altered at
   the wrong time

 - "Minor hmm_test fixes and cleanups" (Alistair Popple)

   Bugfixes and a cleanup for the HMM kernel selftests

 - "Modify memfd_luo code" (Chenghao Duan)

   Cleanups, simplifications and speedups to the memfd_lou code

 - "mm, kvm: allow uffd support in guest_memfd" (Mike Rapoport)

   Support for userfaultfd in guest_memfd

 - "selftests/mm: skip several tests when thp is not available" (Chunyu
   Hu)

   Fix several issues in the selftests code which were causing breakage
   when the tests were run on CONFIG_THP=n kernels

 - "mm/mprotect: micro-optimization work" (Pedro Falcato)

   A couple of nice speedups for mprotect()

 - "MAINTAINERS: update KHO and LIVE UPDATE entries" (Pratyush Yadav)

   Document upcoming changes in the maintenance of KHO, LUO, memfd_luo,
   kexec, crash, kdump and probably other kexec-based things - they are
   being moved out of mm.git and into a new git tree

* tag 'mm-stable-2026-04-18-02-14' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (121 commits)
  MAINTAINERS: add page cache reviewer
  mm/vmscan: avoid false-positive -Wuninitialized warning
  MAINTAINERS: update Dave's kdump reviewer email address
  MAINTAINERS: drop include/linux/liveupdate from LIVE UPDATE
  MAINTAINERS: drop include/linux/kho/abi/ from KHO
  MAINTAINERS: update KHO and LIVE UPDATE maintainers
  MAINTAINERS: update kexec/kdump maintainers entries
  mm/migrate_device: remove dead migration entry check in migrate_vma_collect_huge_pmd()
  selftests: mm: skip charge_reserved_hugetlb without killall
  userfaultfd: allow registration of ranges below mmap_min_addr
  mm/vmstat: fix vmstat_shepherd double-scheduling vmstat_update
  mm/hugetlb: fix early boot crash on parameters without '=' separator
  zram: reject unrecognized type= values in recompress_store()
  docs: proc: document ProtectionKey in smaps
  mm/mprotect: special-case small folios when applying permissions
  mm/mprotect: move softleaf code out of the main function
  mm: remove '!root_reclaim' checking in should_abort_scan()
  mm/sparse: fix comment for section map alignment
  mm/page_io: use sio->len for PSWPIN accounting in sio_read_complete()
  selftests/mm: transhuge_stress: skip the test when thp not available
  ...
2026-04-19 08:01:17 -07:00
Andrew Stellman
2f529e73d7 zram: reject unrecognized type= values in recompress_store()
recompress_store() parses the type= parameter with three if statements
checking for "idle", "huge", and "huge_idle".  An unrecognized value
silently falls through with mode left at 0, causing the recompression pass
to run with no slot filter — processing all slots instead of the
intended subset.

Add a !mode check after the type parsing block to return -EINVAL for
unrecognized values, consistent with the function's other parameter
validation.

Link: https://lore.kernel.org/20260407153027.42425-1-astellman@stellman-greene.com
Signed-off-by: Andrew Stellman <astellman@stellman-greene.com>
Suggested-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-18 00:10:55 -07:00
Sergey Senozhatsky
e3668b3713 zram: do not forget to endio for partial discard requests
As reported by Qu Wenruo and Avinesh Kumar, the following

 getconf PAGESIZE
 65536
 blkdiscard -p 4k /dev/zram0

takes literally forever to complete.  zram doesn't support partial
discards and just returns immediately w/o doing any discard work in such
cases.  The problem is that we forget to endio on our way out, so
blkdiscard sleeps forever in submit_bio_wait().  Fix this by jumping to
end_bio label, which does bio_endio().

Link: https://lore.kernel.org/20260331074255.777019-1-senozhatsky@chromium.org
Fixes: 0120dd6e4e ("zram: make zram_bio_discard more self-contained")
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Reported-by: Qu Wenruo <wqu@suse.com>
Closes: https://lore.kernel.org/linux-block/92361cd3-fb8b-482e-bc89-15ff1acb9a59@suse.com
Tested-by: Qu Wenruo <wqu@suse.com>
Reported-by: Avinesh Kumar <avinesh.kumar@suse.com>
Closes: https://bugzilla.suse.com/show_bug.cgi?id=1256530
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Brian Geffon <bgeffon@google.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Minchan Kim <minchan@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-18 00:10:52 -07:00
Guangshuo Li
e784f2ea0b floppy: fix reference leak on platform_device_register() failure
When platform_device_register() fails in do_floppy_init(), the embedded
struct device in floppy_device[drive] has already been initialized by
device_initialize(), but the failure path jumps to out_remove_drives
without dropping the device reference for the current drive.

Previously registered floppy devices are cleaned up in out_remove_drives,
but the device for the drive that fails registration is not, leading to
a reference leak.

The issue was identified by a static analysis tool I developed and
confirmed by manual review. Fix this by calling put_device() for the
current floppy device before jumping to the common cleanup path.

Fixes: 94fd0db7bf ("[PATCH] Floppy: Add cmos attribute to floppy driver")
Cc: stable@vger.kernel.org
Signed-off-by: Guangshuo Li <lgs201920130244@gmail.com>
Link: https://patch.msgid.link/20260415145708.3331818-1-lgs201920130244@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-04-17 14:36:40 -06:00
Ming Lei
a7c9fa7f66 ublk: use unchecked copy helpers for bio page data
Bio pages may originate from slab caches that lack a usercopy region
(e.g. jbd2 frozen metadata buffers allocated via jbd2_alloc()).
When CONFIG_HARDENED_USERCOPY is enabled, copy_to_iter() calls
check_copy_size() which rejects these slab pages, triggering a
kernel BUG in usercopy_abort().

This is a false positive: the data is ordinary block I/O content —
the same data the loop driver writes to its backing file via
vfs_iter_write().  The bvec length is always trusted, so the size
check in check_copy_size() is not needed either.

Switch to _copy_to_iter()/_copy_from_iter() which skip the
check_copy_size() wrapper while the underlying copy_to_user()
remains unchanged.

Acked-by: Caleb Sander Mateos <csander@purestorage.com>
Fixes: 2299ceec36 ("ublk: use copy_{to,from}_iter() for user copy")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://patch.msgid.link/20260415230246.808176-1-tom.leiming@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-04-17 14:36:15 -06:00
Linus Torvalds
334fbe734e mm.git review status for linus..mm-stable
Everything:
 
 Total patches:       368
 Reviews/patch:       1.56
 Reviewed rate:       74%
 
 Excluding DAMON:
 
 Total patches:       316
 Reviews/patch:       1.77
 Reviewed rate:       81%
 
 Excluding DAMON and zram:
 
 Total patches:       306
 Reviews/patch:       1.81
 Reviewed rate:       82%
 
 Excluding DAMON, zram and maple_tree:
 
 Total patches:       276
 Reviews/patch:       2.01
 Reviewed rate:       91%
 
 Significant patch series in this merge:
 
 - The 30 patch series "maple_tree: Replace big node with maple copy"
   from Liam Howlett is mainly prepararatory work for ongoing development
   but it does reduce stack usage and is an improvement.
 
 - The 12 patch series "mm, swap: swap table phase III: remove swap_map"
   from Kairui Song offers memory savings by removing the static swap_map.
   It also yields some CPU savings and implements several cleanups.
 
 - The 2 patch series "mm: memfd_luo: preserve file seals" from Pratyush
   Yadav adds file seal preservation to LUO's memfd code.
 
 - The 2 patch series "mm: zswap: add per-memcg stat for incompressible
   pages" from Jiayuan Chen adds additional userspace stats reportng to
   zswap.
 
 - The 4 patch series "arch, mm: consolidate empty_zero_page" from Mike
   Rapoport implements some cleanups for our handling of ZERO_PAGE() and
   zero_pfn.
 
 - The 2 patch series "mm/kmemleak: Improve scan_should_stop()
   implementation" from Zhongqiu Han provides an robustness improvement and
   some cleanups in the kmemleak code.
 
 - The 4 patch series "Improve khugepaged scan logic" from Vernon Yang
   "improves the khugepaged scan logic and reduces CPU consumption by
   prioritizing scanning tasks that access memory frequently".
 
 - The 2 patch series "Make KHO Stateless" from Jason Miu simplifies
   Kexec Handover by "transitioning KHO from an xarray-based metadata
   tracking system with serialization to a radix tree data structure that
   can be passed directly to the next kernel"
 
 - The 3 patch series "mm: vmscan: add PID and cgroup ID to vmscan
   tracepoints" from Thomas Ballasi and Steven Rostedt enhances vmscan's
   tracepointing.
 
 - The 5 patch series "mm: arch/shstk: Common shadow stack mapping helper
   and VM_NOHUGEPAGE" from Catalin Marinas is a cleanup for the shadow
   stack code: remove per-arch code in favour of a generic implementation.
 
 - The 2 patch series "Fix KASAN support for KHO restored vmalloc
   regions" from Pasha Tatashin fixes a WARN() which can be emitted the KHO
   restores a vmalloc area.
 
 - The 4 patch series "mm: Remove stray references to pagevec" from Tal
   Zussman provides several cleanups, mainly udpating references to "struct
   pagevec", which became folio_batch three years ago.
 
 - The 17 patch series "mm: Eliminate fake head pages from vmemmap
   optimization" from Kiryl Shutsemau simplifies the HugeTLB vmemmap
   optimization (HVO) by changing how tail pages encode their relationship
   to the head page.
 
 - The 2 patch series "mm/damon/core: improve DAMOS quota efficiency for
   core layer filters" from SeongJae Park improves two problematic
   behaviors of DAMOS that makes it less efficient when core layer filters
   are used.
 
 - The 3 patch series "mm/damon: strictly respect min_nr_regions" from
   SeongJae Park improves DAMON usability by extending the treatment of the
   min_nr_regions user-settable parameter.
 
 - The 3 patch series "mm/page_alloc: pcp locking cleanup" from Vlastimil
   Babka is a proper fix for a previously hotfixed SMP=n issue.  Code
   simplifications and cleanups ennsed.
 
 - The 16 patch series "mm: cleanups around unmapping / zapping" from
   David Hildenbrand implements "a bunch of cleanups around unmapping and
   zapping.  Mostly simplifications, code movements, documentation and
   renaming of zapping functions".
 
 - The 6 patch series "support batched checking of the young flag for
   MGLRU" from Baolin Wang supports batched checking of the young flag for
   MGLRU.  It's part cleanups; one benchmark shows large performance
   benefits for arm64.
 
 - The 5 patch series "memcg: obj stock and slab stat caching cleanups"
   from Johannes Weiner provides memcg cleanup and robustness improvements.
 
 - The 5 patch series "Allow order zero pages in page reporting" from
   Yuvraj Sakshith enhances page_reporting's free page reporting - it is
   presently and undesirably order-0 pages when reporting free memory.
 
 - The 6 patch series "mm: vma flag tweaks" from Lorenzo Stoakes is
   cleanup work following from the recent conversion of the VMA flags to a
   bitmap.
 
 - The 10 patch series "mm/damon: add optional debugging-purpose sanity
   checks" from SeongJae Park adds some more developer-facing debug checks
   into DAMON core.
 
 - The 2 patch series "mm/damon: test and document power-of-2
   min_region_sz requirement" from SeongJae Park adds an additional DAMON
   kunit test and makes some adjustments to the addr_unit parameter
   handling.
 
 - The 3 patch series "mm/damon/core: make passed_sample_intervals
   comparisons overflow-safe" from SeongJae Park fixes a hard-to-hit time
   overflow issue in DAMON core.
 
 - The 7 patch series "mm/damon: improve/fixup/update ratio calculation,
   test and documentation" from SeongJae Park is a "batch of misc/minor
   improvements and fixups" for DAMON.
 
 - The 4 patch series "mm: move vma_(kernel|mmu)_pagesize() out of
   hugetlb.c" from David Hildenbrand fixes a possible issue with dax-device
   when CONFIG_HUGETLB=n.  Some code movement was required.
 
 - The 6 patch series "zram: recompression cleanups and tweaks" from
   Sergey Senozhatsky provides "a somewhat random mix of fixups,
   recompression cleanups and improvements" in the zram code.
 
 - The 11 patch series "mm/damon: support multiple goal-based quota
   tuning algorithms" from SeongJae Park extend DAMOS quotas goal
   auto-tuning to support multiple tuning algorithms that users can select.
 
 - The 4 patch series "mm: thp: reduce unnecessary
   start_stop_khugepaged()" from Breno Leitao fixes the khugpaged sysfs
   handling so we no longer spam the logs with reams of junk when
   starting/stopping khugepaged.
 
 - The 3 patch series "mm: improve map count checks" from Lorenzo Stoakes
   provides some cleanups and slight fixes in the mremap, mmap and vma
   code.
 
 - The 5 patch series "mm/damon: support addr_unit on default monitoring
   targets for modules" from SeongJae Park extends the use of DAMON core's
   addr_unit tunable.
 
 - The 5 patch series "mm: khugepaged cleanups and mTHP prerequisites"
   from Nico Pache provides cleanups in the khugepaged and is a base for
   Nico's planned khugepaged mTHP support.
 
 - The 15 patch series "mm: memory hot(un)plug and SPARSEMEM cleanups"
   from David Hildenbrand implements code movement and cleanups in the
   memhotplug and sparsemem code.
 
 - The 2 patch series "mm: remove CONFIG_ARCH_ENABLE_MEMORY_HOTREMOVE and
   cleanup CONFIG_MIGRATION" from David Hildenbrand rationalizes some
   memhotplug Kconfig support.
 
 - The 6 patch series "change young flag check functions to return bool"
   from Baolin Wang is "a cleanup patchset to change all young flag check
   functions to return bool".
 
 - The 3 patch series "mm/damon/sysfs: fix memory leak and NULL
   dereference issues" from Josh Law and SeongJae Park fixes a few
   potential DAMON bugs.
 
 - The 25 patch series "mm/vma: convert vm_flags_t to vma_flags_t in vma
   code" from "converts a lot of the existing use of the legacy vm_flags_t
   data type to the new vma_flags_t type which replaces it".  Mainly in the
   vma code.
 
 - The 21 patch series "mm: expand mmap_prepare functionality and usage"
   from Lorenzo Stoakes "expands the mmap_prepare functionality, which is
   intended to replace the deprecated f_op->mmap hook which has been the
   source of bugs and security issues for some time".  Cleanups,
   documentation, extension of mmap_prepare into filesystem drivers.
 
 - The 13 patch series "mm/huge_memory: refactor zap_huge_pmd()" from
   Lorenzo Stoakes simplifies and cleans up zap_huge_pmd().  Additional
   cleanups around vm_normal_folio_pmd() and the softleaf functionality are
   performed.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCad3HDQAKCRDdBJ7gKXxA
 jrUQAPwNhPk5nPSxnyxjAeQtOBHqgCdnICeEismLajPKd9aYRgEA0s2XAu3tSUYi
 GrBnWImHG3s4ePQxVcPCegWTsOUrXgQ=
 =1Q7o
 -----END PGP SIGNATURE-----

Merge tag 'mm-stable-2026-04-13-21-45' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull MM updates from Andrew Morton:

 - "maple_tree: Replace big node with maple copy" (Liam Howlett)

   Mainly prepararatory work for ongoing development but it does reduce
   stack usage and is an improvement.

 - "mm, swap: swap table phase III: remove swap_map" (Kairui Song)

   Offers memory savings by removing the static swap_map. It also yields
   some CPU savings and implements several cleanups.

 - "mm: memfd_luo: preserve file seals" (Pratyush Yadav)

   File seal preservation to LUO's memfd code

 - "mm: zswap: add per-memcg stat for incompressible pages" (Jiayuan
   Chen)

   Additional userspace stats reportng to zswap

 - "arch, mm: consolidate empty_zero_page" (Mike Rapoport)

   Some cleanups for our handling of ZERO_PAGE() and zero_pfn

 - "mm/kmemleak: Improve scan_should_stop() implementation" (Zhongqiu
   Han)

   A robustness improvement and some cleanups in the kmemleak code

 - "Improve khugepaged scan logic" (Vernon Yang)

   Improve khugepaged scan logic and reduce CPU consumption by
   prioritizing scanning tasks that access memory frequently

 - "Make KHO Stateless" (Jason Miu)

   Simplify Kexec Handover by transitioning KHO from an xarray-based
   metadata tracking system with serialization to a radix tree data
   structure that can be passed directly to the next kernel

 - "mm: vmscan: add PID and cgroup ID to vmscan tracepoints" (Thomas
   Ballasi and Steven Rostedt)

   Enhance vmscan's tracepointing

 - "mm: arch/shstk: Common shadow stack mapping helper and
   VM_NOHUGEPAGE" (Catalin Marinas)

   Cleanup for the shadow stack code: remove per-arch code in favour of
   a generic implementation

 - "Fix KASAN support for KHO restored vmalloc regions" (Pasha Tatashin)

   Fix a WARN() which can be emitted the KHO restores a vmalloc area

 - "mm: Remove stray references to pagevec" (Tal Zussman)

   Several cleanups, mainly udpating references to "struct pagevec",
   which became folio_batch three years ago

 - "mm: Eliminate fake head pages from vmemmap optimization" (Kiryl
   Shutsemau)

   Simplify the HugeTLB vmemmap optimization (HVO) by changing how tail
   pages encode their relationship to the head page

 - "mm/damon/core: improve DAMOS quota efficiency for core layer
   filters" (SeongJae Park)

   Improve two problematic behaviors of DAMOS that makes it less
   efficient when core layer filters are used

 - "mm/damon: strictly respect min_nr_regions" (SeongJae Park)

   Improve DAMON usability by extending the treatment of the
   min_nr_regions user-settable parameter

 - "mm/page_alloc: pcp locking cleanup" (Vlastimil Babka)

   The proper fix for a previously hotfixed SMP=n issue. Code
   simplifications and cleanups ensued

 - "mm: cleanups around unmapping / zapping" (David Hildenbrand)

   A bunch of cleanups around unmapping and zapping. Mostly
   simplifications, code movements, documentation and renaming of
   zapping functions

 - "support batched checking of the young flag for MGLRU" (Baolin Wang)

   Batched checking of the young flag for MGLRU. It's part cleanups; one
   benchmark shows large performance benefits for arm64

 - "memcg: obj stock and slab stat caching cleanups" (Johannes Weiner)

   memcg cleanup and robustness improvements

 - "Allow order zero pages in page reporting" (Yuvraj Sakshith)

   Enhance free page reporting - it is presently and undesirably order-0
   pages when reporting free memory.

 - "mm: vma flag tweaks" (Lorenzo Stoakes)

   Cleanup work following from the recent conversion of the VMA flags to
   a bitmap

 - "mm/damon: add optional debugging-purpose sanity checks" (SeongJae
   Park)

   Add some more developer-facing debug checks into DAMON core

 - "mm/damon: test and document power-of-2 min_region_sz requirement"
   (SeongJae Park)

   An additional DAMON kunit test and makes some adjustments to the
   addr_unit parameter handling

 - "mm/damon/core: make passed_sample_intervals comparisons
   overflow-safe" (SeongJae Park)

   Fix a hard-to-hit time overflow issue in DAMON core

 - "mm/damon: improve/fixup/update ratio calculation, test and
   documentation" (SeongJae Park)

   A batch of misc/minor improvements and fixups for DAMON

 - "mm: move vma_(kernel|mmu)_pagesize() out of hugetlb.c" (David
   Hildenbrand)

   Fix a possible issue with dax-device when CONFIG_HUGETLB=n. Some code
   movement was required.

 - "zram: recompression cleanups and tweaks" (Sergey Senozhatsky)

   A somewhat random mix of fixups, recompression cleanups and
   improvements in the zram code

 - "mm/damon: support multiple goal-based quota tuning algorithms"
   (SeongJae Park)

   Extend DAMOS quotas goal auto-tuning to support multiple tuning
   algorithms that users can select

 - "mm: thp: reduce unnecessary start_stop_khugepaged()" (Breno Leitao)

   Fix the khugpaged sysfs handling so we no longer spam the logs with
   reams of junk when starting/stopping khugepaged

 - "mm: improve map count checks" (Lorenzo Stoakes)

   Provide some cleanups and slight fixes in the mremap, mmap and vma
   code

 - "mm/damon: support addr_unit on default monitoring targets for
   modules" (SeongJae Park)

   Extend the use of DAMON core's addr_unit tunable

 - "mm: khugepaged cleanups and mTHP prerequisites" (Nico Pache)

   Cleanups to khugepaged and is a base for Nico's planned khugepaged
   mTHP support

 - "mm: memory hot(un)plug and SPARSEMEM cleanups" (David Hildenbrand)

   Code movement and cleanups in the memhotplug and sparsemem code

 - "mm: remove CONFIG_ARCH_ENABLE_MEMORY_HOTREMOVE and cleanup
   CONFIG_MIGRATION" (David Hildenbrand)

   Rationalize some memhotplug Kconfig support

 - "change young flag check functions to return bool" (Baolin Wang)

   Cleanups to change all young flag check functions to return bool

 - "mm/damon/sysfs: fix memory leak and NULL dereference issues" (Josh
   Law and SeongJae Park)

   Fix a few potential DAMON bugs

 - "mm/vma: convert vm_flags_t to vma_flags_t in vma code" (Lorenzo
   Stoakes)

   Convert a lot of the existing use of the legacy vm_flags_t data type
   to the new vma_flags_t type which replaces it. Mainly in the vma
   code.

 - "mm: expand mmap_prepare functionality and usage" (Lorenzo Stoakes)

   Expand the mmap_prepare functionality, which is intended to replace
   the deprecated f_op->mmap hook which has been the source of bugs and
   security issues for some time. Cleanups, documentation, extension of
   mmap_prepare into filesystem drivers

 - "mm/huge_memory: refactor zap_huge_pmd()" (Lorenzo Stoakes)

   Simplify and clean up zap_huge_pmd(). Additional cleanups around
   vm_normal_folio_pmd() and the softleaf functionality are performed.

* tag 'mm-stable-2026-04-13-21-45' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (369 commits)
  mm: fix deferred split queue races during migration
  mm/khugepaged: fix issue with tracking lock
  mm/huge_memory: add and use has_deposited_pgtable()
  mm/huge_memory: add and use normal_or_softleaf_folio_pmd()
  mm: add softleaf_is_valid_pmd_entry(), pmd_to_softleaf_folio()
  mm/huge_memory: separate out the folio part of zap_huge_pmd()
  mm/huge_memory: use mm instead of tlb->mm
  mm/huge_memory: remove unnecessary sanity checks
  mm/huge_memory: deduplicate zap deposited table call
  mm/huge_memory: remove unnecessary VM_BUG_ON_PAGE()
  mm/huge_memory: add a common exit path to zap_huge_pmd()
  mm/huge_memory: handle buggy PMD entry in zap_huge_pmd()
  mm/huge_memory: have zap_huge_pmd return a boolean, add kdoc
  mm/huge: avoid big else branch in zap_huge_pmd()
  mm/huge_memory: simplify vma_is_specal_huge()
  mm: on remap assert that input range within the proposed VMA
  mm: add mmap_action_map_kernel_pages[_full]()
  uio: replace deprecated mmap hook with mmap_prepare in uio_info
  drivers: hv: vmbus: replace deprecated mmap hook with mmap_prepare
  mm: allow handling of stacked mmap_prepare hooks in more drivers
  ...
2026-04-15 12:59:16 -07:00
Christoph Hellwig
64b437c4a9 zloop: remove irq-safe locking
All of zloop runs in user context, so drop the irq-safe locking.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://patch.msgid.link/20260414081811.549755-7-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-04-15 13:58:37 -06:00
Christoph Hellwig
ec5c045f6c zloop: factor out zloop_mark_{full,empty} helpers
Move a few chunks of duplicated code into helpers.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://patch.msgid.link/20260414081811.549755-6-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-04-15 13:58:36 -06:00
Christoph Hellwig
5b680d7afc zloop: set RQF_QUIET when completing requests on deleted devices
Reduce the dmesg spam for tests that involve device deletion.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://patch.msgid.link/20260414081811.549755-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-04-15 13:58:36 -06:00
Christoph Hellwig
6466b211f7 zloop: improve the unaligned write pointer warning
Use the IS_ALIGNED helper and avoid extra conversions, and tell the
user what the unaligned size is.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://patch.msgid.link/20260414081811.549755-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-04-15 13:58:36 -06:00
Christoph Hellwig
14e0077911 zloop: use vfs_truncate
While vfs_truncate does various extra checks that we don't really need,
it is always better to use a VFS helper rather than open coding the
logic.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://patch.msgid.link/20260414081811.549755-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-04-15 13:58:36 -06:00
Christoph Hellwig
32be3c01c3 zloop: fix write pointer calculation in zloop_forget_cache
The write pointer is absolute and in sector units, so we need to
convert it to a relative byte address first.

Fixes: c505448748f7 ("zloop: forget write cache on force removal")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://patch.msgid.link/20260414081811.549755-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-04-15 13:58:36 -06:00
Linus Torvalds
ac633ba77c Miscellaneous x86 cleanups for v7.1:
- Consolidate AMD and Hygon cases in parse_topology() (Wei Wang)
  - asm constraints cleanups in __iowrite32_copy() (Uros Bizjak)
  - Drop AMD Extended Interrupt LVT macros (Naveen N Rao)
  - Don't use REALLY_SLOW_IO for delays (Juergen Gross)
  - paravirt cleanups (Juergen Gross)
  - FPU code cleanups (Borislav Petkov)
  - split-lock handling code cleanups (Borislav Petkov, Ronan Pigott)
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmncsF0RHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1gZ1A//TmQrq/spNIRx7KqSjT9u9166OaQaBeA/
 r535C9n1d/rZxw0l10vIoWeSOpDIXEPLpMguvs463pvgLVfdrNOXABn1Kw1RR6dv
 yTHV47KUE1j9FIuJ4Y2fQeHhdTC8cdddrC06fEFOezftTMiAMIR/GMaeVA5ExzQd
 9tcyocH10gjhtKCF+ILFGt7OdPn75YDIc8ysJAAPrsF6Dw222K5E7p4XedmEYL54
 W7WVknLK2jP/BdXp17wDVunQP/Hl7huiM9DMgNlv6eliWV6nyH/hONRm5NrgBUEG
 s2URPPEu30thveMHQ1qv31P6ZY6lVFi0VylubJ+OdPofUJDCdCINRk22Bc6kXurZ
 Y8ZV93UyuIgVfvlI9J5UoHSkpi3owMjvrQShquxH2hDbCzzBvwpI7/+KHwWjgVsH
 9+xdOkjR40UrlmwhyyzqTzmB10mg2SM1/YK5Ca2DcneibIkQRlfXdNXQqNikWqhN
 COAEX6U5ayKEu/TjbiNH4zNInJCEQMI65Jiz+oTmdnf+iCQ1L2sp+zSOB6SoyQtp
 rTyubHDDGu6pq9IEATx3hn5BYO7t6Ly4KJksWCAJ0G8lnP3HRESD9l6QvjqipMWB
 JToVwWsuqgL3zWqCpuvBOErpHslgzN6Usbym6blyrp8ERKVIb2elDt9lDAWyz5X3
 7hS8sNulqDw=
 =Ox7s
 -----END PGP SIGNATURE-----

Merge tag 'x86-cleanups-2026-04-13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 cleanups from Ingo Molnar:

 - Consolidate AMD and Hygon cases in parse_topology() (Wei Wang)

 - asm constraints cleanups in __iowrite32_copy() (Uros Bizjak)

 - Drop AMD Extended Interrupt LVT macros (Naveen N Rao)

 - Don't use REALLY_SLOW_IO for delays (Juergen Gross)

 - paravirt cleanups (Juergen Gross)

 - FPU code cleanups (Borislav Petkov)

 - split-lock handling code cleanups (Borislav Petkov, Ronan Pigott)

* tag 'x86-cleanups-2026-04-13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/fpu: Correct the comment explaining what xfeatures_in_use() does
  x86/split_lock: Don't warn about unknown split_lock_detect parameter
  x86/fpu: Correct misspelled xfeaures_to_write local var
  x86/apic: Drop AMD Extended Interrupt LVT macros
  x86/cpu/topology: Consolidate AMD and Hygon cases in parse_topology()
  block/floppy: Don't use REALLY_SLOW_IO for delays
  x86/paravirt: Replace io_delay() hook with a bool
  x86/irqflags: Preemptively move include paravirt.h directive where it belongs
  x86/split_lock: Restructure the unwieldy switch-case in sld_state_show()
  x86/local: Remove trailing semicolon from _ASM_XADD in local_add_return()
  x86/asm: Use inout "+" asm onstraint modifiers in __iowrite32_copy()
2026-04-14 14:03:27 -07:00
Linus Torvalds
7fe6ac157b for-7.1/block-20260411
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmna0tgQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgptEbD/0ZMEsz5pcN+/bpM9Qva5lVVkByRieua+JA
 T7L+JMcEigp1Hf2idAPlv1e9dbrtgOGhkjZNlbZenP2MHXBmbUTnzTWDKW5w0ZQ4
 UqnVC7fMmxzI57DPt7iG/1WQo8O6QPHWwBof5ZXn0b83qwByTB2oVkAb9ysT7CdM
 wGk5KnPRLIAWf5o+aZ4LoWE+196jQiszx1m6U58FTqnCgvJ/GyKyrgzx+uvGUgF+
 owZT/6TrN7cN9A68fOnmcjEZ7beZXygOQPTn32sF9rEOi8JsgK71EE2LofdVVSNU
 ES/tyKVJbSNDgUH2b0T84rErT4MtZcw5J29V3k7CVndC+DcT2uLSroPz3lYQjDg9
 TLeq7ZLjnyoBG+muboWdXcvBKn3aKLec3nfVSbz6J1xb/Z22gWYy5TZbrGnGH8fJ
 zBiyKkHMaZi55IdTDWQT3a48h36qFh0Y2wbvZ6uhyYOfXHyj4pA4ccJZgFfmf4ZG
 flVRFGEL9Tqc82lB8dfy9DBp0ZQSjeBUCd+gyDKjiuWVau5L5iTUeMMkt8yr7qbg
 PY+ATJcHk5S5zwM2xcZUt5EcHBBbCaKQ6DdRZKwzMMUvCjHlvnWvENVjUtRa9Dng
 1vUKpB/e5NGpqD05Iqgyai+OD9/tALc4sUEI2yQ7/dk9pKIXQ4RE9HR/pSkgbjeR
 LGokj08cgg==
 =ga3t
 -----END PGP SIGNATURE-----

Merge tag 'for-7.1/block-20260411' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux

Pull block updates from Jens Axboe:

 - Add shared memory zero-copy I/O support for ublk, bypassing per-I/O
   copies between kernel and userspace by matching registered buffer
   PFNs at I/O time. Includes selftests.

 - Refactor bio integrity to support filesystem initiated integrity
   operations and arbitrary buffer alignment.

 - Clean up bio allocation, splitting bio_alloc_bioset() into clear fast
   and slow paths. Add bio_await() and bio_submit_or_kill() helpers,
   unify synchronous bi_end_io callbacks.

 - Fix zone write plug refcount handling and plug removal races. Add
   support for serializing zone writes at QD=1 for rotational zoned
   devices, yielding significant throughput improvements.

 - Add SED-OPAL ioctls for Single User Mode management and a STACK_RESET
   command.

 - Add io_uring passthrough (uring_cmd) support to the BSG layer.

 - Replace pp_buf in partition scanning with struct seq_buf.

 - zloop improvements and cleanups.

 - drbd genl cleanup, switching to pre_doit/post_doit.

 - NVMe pull request via Keith:
      - Fabrics authentication updates
      - Enhanced block queue limits support
      - Workqueue usage updates
      - A new write zeroes device quirk
      - Tagset cleanup fix for loop device

 - MD pull requests via Yu Kuai:
      - Fix raid5 soft lockup in retry_aligned_read()
      - Fix raid10 deadlock with check operation and nowait requests
      - Fix raid1 overlapping writes on writemostly disks
      - Fix sysfs deadlock on array_state=clear
      - Proactive RAID-5 parity building with llbitmap, with
        write_zeroes_unmap optimization for initial sync
      - Fix llbitmap barrier ordering, rdev skipping, and bitmap_ops
        version mismatch fallback
      - Fix bcache use-after-free and uninitialized closure
      - Validate raid5 journal metadata payload size
      - Various cleanups

 - Various other fixes, improvements, and cleanups

* tag 'for-7.1/block-20260411' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux: (146 commits)
  ublk: fix tautological comparison warning in ublk_ctrl_reg_buf
  scsi: bsg: fix buffer overflow in scsi_bsg_uring_cmd()
  block: refactor blkdev_zone_mgmt_ioctl
  MAINTAINERS: update ublk driver maintainer email
  Documentation: ublk: address review comments for SHMEM_ZC docs
  ublk: allow buffer registration before device is started
  ublk: replace xarray with IDA for shmem buffer index allocation
  ublk: simplify PFN range loop in __ublk_ctrl_reg_buf
  ublk: verify all pages in multi-page bvec fall within registered range
  ublk: widen ublk_shmem_buf_reg.len to __u64 for 4GB buffer support
  xfs: use bio_await in xfs_zone_gc_reset_sync
  block: add a bio_submit_or_kill helper
  block: factor out a bio_await helper
  block: unify the synchronous bi_end_io callbacks
  xfs: fix number of GC bvecs
  selftests/ublk: add read-only buffer registration test
  selftests/ublk: add filesystem fio verify test for shmem_zc
  selftests/ublk: add hugetlbfs shmem_zc test for loop target
  selftests/ublk: add shared memory zero-copy test
  selftests/ublk: add UBLK_F_SHMEM_ZC support for loop target
  ...
2026-04-13 15:51:31 -07:00
Ming Lei
36446de0c3 ublk: fix tautological comparison warning in ublk_ctrl_reg_buf
On 32-bit architectures, 'unsigned long size' can never exceed
UBLK_SHMEM_BUF_SIZE_MAX (1ULL << 32), causing a tautological
comparison warning. Validate buf_reg.len (__u64) directly before
using it, and consolidate all input validation into a single check.

Also remove the unnecessary local variables 'addr' and 'size' since
buf_reg.addr and buf_reg.len can be used directly.

Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202604101952.3NOzqnu9-lkp@intel.com/
Fixes: 23b3b6f0b5 ("ublk: widen ublk_shmem_buf_reg.len to __u64 for 4GB buffer support")
Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Link: https://patch.msgid.link/20260410124136.3983429-1-tom.leiming@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-04-10 07:02:37 -06:00
Ming Lei
365ea7cc62 ublk: allow buffer registration before device is started
Before START_DEV, there is no disk, no queue, no I/O dispatch, so
the maple tree can be safely modified under ub->mutex alone without
freezing the queue.

Add ublk_lock_buf_tree()/ublk_unlock_buf_tree() helpers that take
ub->mutex first, then freeze the queue if device is started. This
ordering (mutex -> freeze) is safe because ublk_stop_dev_unlocked()
already holds ub->mutex when calling del_gendisk() which freezes
the queue.

Suggested-by: Caleb Sander Mateos <csander@purestorage.com>
Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Link: https://patch.msgid.link/20260409133020.3780098-6-tom.leiming@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-04-09 19:10:44 -06:00
Ming Lei
5e864438e2 ublk: replace xarray with IDA for shmem buffer index allocation
Remove struct ublk_buf which only contained nr_pages that was never
read after registration. Use IDA for pure index allocation instead
of xarray. Make __ublk_ctrl_unreg_buf() return int so the caller
can detect invalid index without a separate lookup.

Simplify ublk_buf_cleanup() to walk the maple tree directly and
unpin all pages in one pass, instead of iterating the xarray by
buffer index.

Suggested-by: Caleb Sander Mateos <csander@purestorage.com>
Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Link: https://patch.msgid.link/20260409133020.3780098-5-tom.leiming@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-04-09 19:08:35 -06:00
Ming Lei
8ea8566a9a ublk: simplify PFN range loop in __ublk_ctrl_reg_buf
Use the for-loop increment instead of a manual `i++` past the last
page, and fix the mtree_insert_range end key accordingly.

Suggested-by: Caleb Sander Mateos <csander@purestorage.com>
Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Link: https://patch.msgid.link/20260409133020.3780098-4-tom.leiming@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-04-09 19:08:35 -06:00
Ming Lei
211ff1602b ublk: verify all pages in multi-page bvec fall within registered range
rq_for_each_bvec() yields multi-page bvecs where bv_page is only the
first page. ublk_try_buf_match() only validated the start PFN against
the maple tree, but a bvec can span multiple pages past the end of a
registered range.

Use mas_walk() instead of mtree_load() to obtain the range boundaries
stored in the maple tree, and check that the bvec's end PFN does not
exceed the range. Also remove base_pfn from struct ublk_buf_range
since mas.index already provides the range start PFN.

Reported-by: Caleb Sander Mateos <csander@purestorage.com>
Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Link: https://patch.msgid.link/20260409133020.3780098-3-tom.leiming@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-04-09 19:08:35 -06:00
Ming Lei
23b3b6f0b5 ublk: widen ublk_shmem_buf_reg.len to __u64 for 4GB buffer support
The __u32 len field cannot represent a 4GB buffer (0x100000000
overflows to 0). Change it to __u64 so buffers up to 4GB can be
registered. Add a reserved field for alignment and validate it
is zero.

The kernel enforces a default max of 4GB (UBLK_SHMEM_BUF_SIZE_MAX)
which may be increased in future.

Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Link: https://patch.msgid.link/20260409133020.3780098-2-tom.leiming@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-04-09 19:08:35 -06:00
Ming Lei
8a34e88769 ublk: eliminate permanent pages[] array from struct ublk_buf
The pages[] array (kvmalloc'd, 8 bytes per page = 2MB for a 1GB buffer)
was stored permanently in struct ublk_buf but only needed during
pin_user_pages_fast() and maple tree construction. Since the maple tree
already stores PFN ranges via ublk_buf_range, struct page pointers can
be recovered via pfn_to_page() during unregistration.

Make pages[] a temporary allocation in ublk_ctrl_reg_buf(), freed
immediately after the maple tree is built. Rewrite __ublk_ctrl_unreg_buf()
to iterate the maple tree for matching buf_index entries, recovering
struct page pointers via pfn_to_page() and unpinning in batches of 32.
Simplify ublk_buf_erase_ranges() to iterate the maple tree by buf_index
instead of walking the now-removed pages[] array.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://patch.msgid.link/20260331153207.3635125-5-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-04-07 07:40:19 -06:00
Ming Lei
08677040a9 ublk: enable UBLK_F_SHMEM_ZC feature flag
Add UBLK_F_SHMEM_ZC (1ULL << 19) to the UAPI header and UBLK_F_ALL.
Switch ublk_support_shmem_zc() and ublk_dev_support_shmem_zc() from
returning false to checking the actual flag, enabling the shared
memory zero-copy feature for devices that request it.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://patch.msgid.link/20260331153207.3635125-4-ming.lei@redhat.com
[axboe: ublk_buf_reg -> ublk_shmem_buf_reg errors]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-04-07 07:39:24 -06:00
Ming Lei
4d4a512a1f ublk: add PFN-based buffer matching in I/O path
Add ublk_try_buf_match() which walks a request's bio_vecs, looks up
each page's PFN in the per-device maple tree, and verifies all pages
belong to the same registered buffer at contiguous offsets.

Add ublk_iod_is_shmem_zc() inline helper for checking whether a
request uses the shmem zero-copy path.

Integrate into the I/O path:
- ublk_setup_iod(): if pages match a registered buffer, set
  UBLK_IO_F_SHMEM_ZC and encode buffer index + offset in addr
- ublk_start_io(): skip ublk_map_io() for zero-copy requests
- __ublk_complete_rq(): skip ublk_unmap_io() for zero-copy requests

The feature remains disabled (ublk_support_shmem_zc() returns false)
until the UBLK_F_SHMEM_ZC flag is enabled in the next patch.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://patch.msgid.link/20260331153207.3635125-3-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-04-07 07:38:36 -06:00
Ming Lei
2fb0ded237 ublk: add UBLK_U_CMD_REG_BUF/UNREG_BUF control commands
Add control commands for registering and unregistering shared memory
buffers for zero-copy I/O:

- UBLK_U_CMD_REG_BUF (0x18): pins pages from userspace, inserts PFN
  ranges into a per-device maple tree for O(log n) lookup during I/O.
  Buffer pointers are tracked in a per-device xarray. Returns the
  assigned buffer index.

- UBLK_U_CMD_UNREG_BUF (0x19): removes PFN entries and unpins pages.

Queue freeze/unfreeze is handled internally so userspace need not
quiesce the device during registration.

Also adds:
- UBLK_IO_F_SHMEM_ZC flag and addr encoding helpers in UAPI header
  (16-bit buffer index supporting up to 65536 buffers)
- Data structures (ublk_buf, ublk_buf_range) and xarray/maple tree
- __ublk_ctrl_reg_buf() helper for PFN insertion with error unwinding
- __ublk_ctrl_unreg_buf() helper for cleanup reuse
- ublk_support_shmem_zc() / ublk_dev_support_shmem_zc() stubs
  (returning false — feature not enabled yet)

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://patch.msgid.link/20260331153207.3635125-2-ming.lei@redhat.com
[axboe: fixup ublk_buf_reg -> ublk_shmem_buf_reg errors, comments]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-04-07 07:38:26 -06:00
David Carlier
fa0cac9a51 drbd: use get_random_u64() where appropriate
Use the typed random integer helpers instead of
get_random_bytes() when filling a single integer variable.
The helpers return the value directly, require no pointer
or size argument, and better express intent.

Signed-off-by: David Carlier <devnexen@gmail.com>
Reviewed-by: Christoph Böhmwalder <christoph.boehmwalder@linbit.com>
Link: https://patch.msgid.link/20260405154704.4610-1-devnexen@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-04-07 06:27:39 -06:00
Christoph Böhmwalder
a9c4b1d376 drbd: remove DRBD_GENLA_F_MANDATORY flag handling
DRBD used a custom mechanism to mark netlink attributes as "mandatory":
bit 14 of nla_type was repurposed as DRBD_GENLA_F_MANDATORY. Attributes
sent from userspace that had this bit present and that were unknown
to the kernel would lead to an error.

Since commit ef6243acb4 ("genetlink: optionally validate strictly/dumps"),
the generic netlink layer rejects unknown top-level attributes when
strict validation is enabled. DRBD never opted out of strict
validation, so unknown top-level attributes are already rejected by
the netlink core.

The mandatory flag mechanism was required for nested attributes, because
these are parsed liberally, silently dropping attributes unknown to the
kernel.

This prepares for the move to a new YNL-based family, which will use the
now-default strict parsing.
The current family is not expected to gain any new attributes, which
makes this change safe.

Old userspace that still sets bit 14 is unaffected: nla_type()
strips it before __nla_validate_parse() performs attribute validation,
so the bit never reaches DRBD.

Remove all references to the mandatory flag in DRBD.

Cc: Johannes Berg <johannes.berg@intel.com>
Cc: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Christoph Böhmwalder <christoph.boehmwalder@linbit.com>
Link: https://patch.msgid.link/20260403132953.2248751-1-christoph.boehmwalder@linbit.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-04-06 20:21:37 -06:00
Uday Shankar
0842186d2c ublk: reset per-IO canceled flag on each fetch
If a ublk server starts recovering devices but dies before issuing fetch
commands for all IOs, cancellation of the fetch commands that were
successfully issued may never complete. This is because the per-IO
canceled flag can remain set even after the fetch for that IO has been
submitted - the per-IO canceled flags for all IOs in a queue are reset
together only once all IOs for that queue have been fetched. So if a
nonempty proper subset of the IOs for a queue are fetched when the ublk
server dies, the IOs in that subset will never successfully be canceled,
as their canceled flags remain set, and this prevents ublk_cancel_cmd
from actually calling io_uring_cmd_done on the commands, despite the
fact that they are outstanding.

Fix this by resetting the per-IO cancel flags immediately when each IO
is fetched instead of waiting for all IOs for the queue (which may never
happen).

Signed-off-by: Uday Shankar <ushankar@purestorage.com>
Fixes: 728cbac5fe ("ublk: move device reset into ublk_ch_release()")
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: zhang, the-essence-of-life <zhangweize9@gmail.com>
Link: https://patch.msgid.link/20260405-cancel-v2-1-02d711e643c2@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-04-06 08:37:48 -06:00
Sergey Senozhatsky
cba8299330 zram: change scan_slots to return void
scan_slots_for_writeback() and scan_slots_for_recompress() work in a "best
effort" fashion, if they cannot allocate memory for a new pp-slot
candidate they just return and post-processing selects slots that were
successfully scanned thus far.  scan_slots functions never return errors
and their callers never check the return status, so convert them to return
void.

Link: https://lkml.kernel.org/r/20260317032349.753645-1-senozhatsky@chromium.org
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Reviewed-by: SeongJae Park <sj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-05 13:53:30 -07:00
Sergey Senozhatsky
bf989ade27 zram: propagate read_from_bdev_async() errors
When read_from_bdev_async() fails to chain bio, for instance fails to
allocate request or bio, we need to propagate the error condition so that
upper layer is aware of it.  zram already does that by setting
BLK_STS_IOERR ->bi_status, but only for sync reads.  Change async read
path to return its error status so that async errors are also handled.

Link: https://lkml.kernel.org/r/20260316015354.114465-1-senozhatsky@chromium.org
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Suggested-by: Brian Geffon <bgeffon@google.com>
Acked-by: Brian Geffon <bgeffon@google.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Richard Chang <richardycc@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-05 13:53:30 -07:00
gao xu
f0f6f78714 zram: optimize LZ4 dictionary compression performance
Calling `LZ4_loadDict()` repeatedly in Zram causes significant overhead
due to its internal dictionary pre-processing.  This commit introduces a
template stream mechanism to pre-process the dictionary only once when the
dictionary is initially set or modified.  It then efficiently copies this
state for subsequent compressions.

Verification Test Items:
Test Platform: android16-6.12
1. Collect Anonymous Page Dataset
1) Apply the following patch:
static bool zram_meta_alloc(struct zram *zram, u64 disksize)
	if (!huge_class_size)
-		huge_class_size = zs_huge_class_size(zram->mem_pool);
+		huge_class_size = 0;

2)Install multiple apps and monkey testing until SwapFree is close to 0.

3)Execute the following command to export data:
dd if=/dev/block/zram0 of=/data/samples/zram_dump.img bs=4K

2. Train Dictionary
Since LZ4 does not have a dedicated dictionary training tool, the zstd
tool can be used for training[1]. The command is as follows:
zstd --train /data/samples/* --split=4096 --maxdict=64KB -o /vendor/etc/dict_data

3. Test Code
adb shell "dd if=/data/samples/zram_dump.img of=/dev/test_pattern bs=4096 count=131072 conv=fsync"
adb shell "swapoff /dev/block/zram0"
adb shell "echo 1 > /sys/block/zram0/reset"
adb shell "echo lz4 > /sys/block/zram0/comp_algorithm"
adb shell "echo dict=/vendor/etc/dict_data   >  /sys/block/zram0/algorithm_params"
adb shell "echo 6G > /sys/block/zram0/disksize"
echo "Start Compression"
adb shell "taskset 80 dd if=/dev/test_pattern of=/dev/block/zram0 bs=4096 count=131072 conv=fsync"
echo.
echo "Start Decompression"
adb shell "taskset 80 dd if=/dev/block/zram0 of=/dev/output_result bs=4096 count=131072 conv=fsync"
echo "mm_stat:"
adb shell "cat /sys/block/zram0/mm_stat"
echo.
Note: To ensure stable test results, it is best to lock the CPU frequency
before executing the test.

LZ4 supports dictionaries up to 64KB. Below are the test results for
compression rates at various dictionary sizes:
dict_size          base        patch
  4 KB          156M/s      219M/s
  8 KB          136M/s      217M/s
 16KB           98M/s       214M/s
 32KB           66M/s       225M/s
 64KB           38M/s       224M/s

When an LZ4 compression dictionary is enabled, compression speed is
negatively impacted by the dictionary's size; larger dictionaries result
in slower compression.  This patch eliminates the influence of dictionary
size on compression speed, ensuring consistent performance regardless of
dictionary scale.

Link: https://lkml.kernel.org/r/698181478c9c4b10aa21b4a847bdc706@honor.com
Link: https://github.com/lz4/lz4?tab=readme-ov-file [1]
Signed-off-by: gao xu <gaoxu2@honor.com>
Acked-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-05 13:53:30 -07:00
Sergey Senozhatsky
301f392200 zram: unify and harden algo/priority params handling
We have two functions that accept algo= and priority= params -
algorithm_params_store() and recompress_store().  This patch unifies and
hardens handling of those parameters.

There are 4 possible cases:

- only priority= provided [recommended]
  We need to verify that provided priority value is
  within permitted range for each particular function.

- both algo= and priority= provided
  We cannot prioritize one over another.  All we should
  do is to verify that zram is configured in the way
  that user-space expects it to be.  Namely that zram
  indeed has compressor algo= setup at given priority=.

- only algo= provided [not recommended]
  We should lookup priority in compressors list.

- none provided [not recommended]
  Just use function's defaults.

Link: https://lkml.kernel.org/r/20260311084312.1766036-7-senozhatsky@chromium.org
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Suggested-by: Minchan Kim <minchan@kernel.org>
Cc: Brian Geffon <bgeffon@google.com>
Cc: gao xu <gaoxu2@honor.com>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-05 13:53:25 -07:00
Sergey Senozhatsky
cedfa028b5 zram: remove chained recompression
Chained recompression has unpredictable behavior and is not useful in
practice.

First, systems usually configure just one alternative recompression
algorithm, which has slower compression/decompression but better
compression ratio.  A single alternative algorithm doesn't need chaining.

Second, even with multiple recompression algorithms, chained recompression
is suboptimal.  If a lower priority algorithm succeeds, the page is never
attempted with a higher priority algorithm, leading to worse memory
savings.  If a lower priority algorithm fails, the page is still attempted
with a higher priority algorithm, wasting resources on the failed lower
priority attempt.

In either case, the system would be better off targeting a specific
priority directly.

Chained recompression also significantly complicates the code.  Remove it.

Link: https://lkml.kernel.org/r/20260311084312.1766036-6-senozhatsky@chromium.org
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Brian Geffon <bgeffon@google.com>
Cc: gao xu <gaoxu2@honor.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-05 13:53:24 -07:00
Sergey Senozhatsky
5004a27edb zram: drop ->num_active_comps
It's not entirely correct to use ->num_active_comps for max-prio limit, as
->num_active_comps just tells the number of configured algorithms, not the
max configured priority.  For instance, in the following theoretical
example:

    [lz4] [nil] [nil] [deflate]

->num_active_comps is 2, while the actual max-prio is 3.

Drop ->num_active_comps and use ZRAM_MAX_COMPS instead.

Link: https://lkml.kernel.org/r/20260311084312.1766036-4-senozhatsky@chromium.org
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Suggested-by: Minchan Kim <minchan@kernel.org>
Cc: Brian Geffon <bgeffon@google.com>
Cc: gao xu <gaoxu2@honor.com>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-05 13:53:24 -07:00
Sergey Senozhatsky
ed19b9d550 zram: do not autocorrect bad recompression parameters
Do not silently autocorrect bad recompression priority parameter value and
just error out.

Link: https://lkml.kernel.org/r/20260311084312.1766036-3-senozhatsky@chromium.org
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Suggested-by: Minchan Kim <minchan@kernel.org>
Cc: Brian Geffon <bgeffon@google.com>
Cc: gao xu <gaoxu2@honor.com>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-05 13:53:24 -07:00
Sergey Senozhatsky
241f9005b1 zram: do not permit params change after init
Patch series "zram: recompression cleanups and tweaks", v2.

This series is a somewhat random mix of fixups, recompression cleanups and
improvements partly based on internal conversations.  A few patches in the
series remove unexpected or confusing behaviour, e.g.  auto correction of
bad priority= param for recompression, which should have always been just
an error.  Then it also removes "chain recompression" which has a tricky,
unexpected and confusing behaviour at times.  We also unify and harden the
handling of algo/priority params.  There is also an addition of missing
device lock in algorithm_params_store() which previously permitted
modification of algo params while the device is active.


This patch (of 6):

First, algorithm_params_store(), like any sysfs handler, should grab
device lock.

Second, like any write() sysfs handler, it should grab device lock in
exclusive mode.

Third, it should not permit change of algos' parameters after device init,
as this doesn't make sense - we cannot compress with one C/D dict and then
just change C/D dict to a different one, for example.

Another thing to notice is that algorithm_params_store() accesses device's
->comp_algs for algo priority lookup, which should be protected by device
lock in exclusive mode in general.

Link: https://lkml.kernel.org/r/20260311084312.1766036-1-senozhatsky@chromium.org
Link: https://lkml.kernel.org/r/20260311084312.1766036-2-senozhatsky@chromium.org
Fixes: 4eac932103 ("zram: introduce algorithm_params device attribute")
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Acked-by: Brian Geffon <bgeffon@google.com>
Cc: gao xu <gaoxu2@honor.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-05 13:53:24 -07:00
gao xu
c09fb53d29 zram: use statically allocated compression algorithm names
Currently, zram dynamically allocates memory for compressor algorithm
names when they are set by the user.  This requires careful memory
management, including explicit `kfree` calls and special handling to avoid
freeing statically defined default compressor names.

This patch refactors the way zram handles compression algorithm names. 
Instead of storing dynamically allocated copies, `zram->comp_algs` will
now store pointers directly to the static name strings defined within the
`zcomp_ops` backend structures, thereby removing the need for conditional
`kfree` calls.

Link: https://lkml.kernel.org/r/5bb2e9318d124dbcb2b743dcdce6a950@honor.com
Signed-off-by: gao xu <gaoxu2@honor.com>
Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-05 13:53:07 -07:00
Damien Le Moal
b2a78fec34 zloop: add max_open_zones option
Introduce the new max_open_zones option to allow specifying a limit on
the maximum number of open zones of a zloop device. This change allows
creating a zloop device that can more closely mimick the characteristics
of a physical SMR drive.

When set to a non zero value, only up to max_open_zones zones can be in
the implicit open (BLK_ZONE_COND_IMP_OPEN) and explicit open
(BLK_ZONE_COND_EXP_OPEN) conditions at any time. The transition to the
implicit open condition of a zone on a write operation can result in an
implicit close of an already implicitly open zone. This is handled in
the function zloop_do_open_zone(). This function also handles
transitions to the explicit open condition. Implicit close transitions
are handled using an LRU ordered list of open zones which is managed
using the helper functions zloop_lru_rotate_open_zone() and
zloop_lru_remove_open_zone().

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://patch.msgid.link/20260326203245.946830-1-dlemoal@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-03-31 08:33:28 -06:00
Bart Van Assche
2b31e86387 drbd: Balance RCU calls in drbd_adm_dump_devices()
Make drbd_adm_dump_devices() call rcu_read_lock() before
rcu_read_unlock() is called. This has been detected by the Clang
thread-safety analyzer.

Tested-by: Christoph Böhmwalder <christoph.boehmwalder@linbit.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Andreas Gruenbacher <agruenba@redhat.com>
Fixes: a55bbd375d ("drbd: Backport the "status" command")
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Link: https://patch.msgid.link/20260326214054.284593-1-bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-03-26 16:04:52 -06:00
Christoph Böhmwalder
630bbba45c drbd: use genl pre_doit/post_doit
Every doit handler followed the same pattern: stack-allocate an
adm_ctx, call drbd_adm_prepare() at the top, call drbd_adm_finish()
at the bottom. This duplicated boilerplate across 25 handlers and
made error paths inconsistent, since some handlers could miss sending
the reply skb on early-exit paths.

The generic netlink framework already provides pre_doit/post_doit
hooks for exactly this purpose. An old comment even noted "this
would be a good candidate for a pre_doit hook".

Use them:

- pre_doit heap-allocates adm_ctx, looks up per-command flags from a
  new drbd_genl_cmd_flags[] table, runs drbd_adm_prepare(), and
  stores the context in info->user_ptr[0].
- post_doit sends the reply, drops kref references for
  device/connection/resource, and frees the adm_ctx.
- Handlers just receive adm_ctx from info->user_ptr[0], set
  reply_dh->ret_code, and return. All teardown is in post_doit.
- drbd_adm_finish() is removed, superseded by post_doit.

Signed-off-by: Christoph Böhmwalder <christoph.boehmwalder@linbit.com>
Link: https://patch.msgid.link/20260324152907.2840984-1-christoph.boehmwalder@linbit.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-03-25 07:11:10 -06:00
Christoph Hellwig
829def1e35 zloop: forget write cache on force removal
Add a new options that causes zloop to truncate the zone files to the
write pointer value recorded at the last cache flush to simulate
unclean shutdowns.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://patch.msgid.link/20260323071156.2940772-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-03-25 06:49:01 -06:00