Add a real none bitmap backend that exposes the common bitmap sysfs
group and use it to keep bitmap/location available when an array has no
bitmap.
Then switch the bitmap/location sysfs path to move only between the none
backend and the classic bitmap backend, using the no-sysfs bitmap helpers
and merging or unmerging the internal bitmap sysfs group as needed.
This restores mdadm --grow bitmap addition through bitmap/location.
Fixes: fb8cc3b0d9 ("md/md-bitmap: delay registration of bitmap_ops until creating bitmap")
Reviewed-by: Su Yue <glass.su@suse.com>
Link: https://lore.kernel.org/r/20260425024615.1696892-4-yukuai@fnnas.com
Signed-off-by: Yu Kuai <yukuai@fnnas.com>
Split the classic bitmap sysfs files into a common bitmap group with
the location attribute and a separate internal bitmap group for the
remaining files.
At the same time, convert bitmap operations from a single sysfs group
to a sysfs group array so backends can share part of their sysfs
layout while adding backend-specific attributes separately.
Switch the bitmap sysfs helpers to use sysfs_update_groups() for the
add and update path, and remove groups in reverse order so shared named
groups are unmerged before the last group removes the directory.
Also make bitmap operation lookup depend only on the currently selected
bitmap id matching the installed backend. This prepares the lookup path
for a later registered none backend.
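As a rough sketch of the resulting shape (the group names and the helper
below are illustrative stand-ins, not the exact symbols in the patch), a
NULL-terminated group array lets backends share the common group:
```
/* hedged sketch; group names are placeholders */
static const struct attribute_group *md_bitmap_groups[] = {
	&md_bitmap_common_group,	/* shared files such as location */
	&md_bitmap_internal_group,	/* classic-backend-only files */
	NULL,
};

static int md_bitmap_sysfs_update(struct mddev *mddev)
{
	/* sysfs_update_groups() adds whatever groups/files are missing */
	return sysfs_update_groups(&mddev->kobj, md_bitmap_groups);
}
```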
Reviewed-by: Su Yue <glass.su@suse.com>
Link: https://lore.kernel.org/r/20260425024615.1696892-3-yukuai@fnnas.com
Signed-off-by: Yu Kuai <yukuai@fnnas.com>
Factor bitmap creation and destruction into helpers that do not touch
bitmap sysfs registration.
This prepares the bitmap sysfs rework so callers such as the sysfs
bitmap location path can create or destroy a bitmap backend without
coupling that to sysfs group lifetime management.
Reviewed-by: Su Yue <glass.su@suse.com>
Link: https://lore.kernel.org/r/20260425024615.1696892-2-yukuai@fnnas.com
Signed-off-by: Yu Kuai <yukuai@fnnas.com>
This keeps mddev locking consistent and ensures that any future changes
to locking behavior are done through the wrapper.
Signed-off-by: Abd-Alrhman Masalkhi <abd.masalkhi@gmail.com>
Link: https://lore.kernel.org/r/20260415140319.376578-3-abd.masalkhi@gmail.com
Signed-off-by: Yu Kuai <yukuai@fnnas.com>
setup_geo() extracts near_copies (nc) and far_copies (fc) from the
user-provided layout parameter without checking for zero. When fc=0
with the "improved" far set layout selected, 'geo->far_set_size =
disks / fc' triggers a divide-by-zero.
Validate nc and fc immediately after extraction, returning -1 if
either is zero.
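A minimal sketch of that check, assuming the extraction described above
(the real setup_geo() takes more parameters and fills a struct geom):
```
/* hedged sketch; standalone stand-in for the check in setup_geo() */
static int validate_layout_copies(int layout)
{
	int nc = layout & 255;		/* near_copies */
	int fc = (layout >> 8) & 255;	/* far_copies */

	if (nc == 0 || fc == 0)
		return -1;	/* rejects layouts where disks / fc
				 * would divide by zero */
	return 0;
}
```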
Fixes: 475901aff1 ("MD RAID10: Improve redundancy for 'far' and 'offset' algorithms (part 1)")
Cc: stable@vger.kernel.org
Signed-off-by: Junrui Luo <moonafterrain@outlook.com>
Link: https://lore.kernel.org/linux-raid/SYBPR01MB7881A5E2556806CC1D318582AF232@SYBPR01MB7881.ausprd01.prod.outlook.com
Signed-off-by: Yu Kuai <yukuai@fnnas.com>
BLK_STS_INVAL indicates the IO request itself was invalid, not that the
device has failed. When raid1 treats this as a device error, it retries
on alternate mirrors which fail the same way, eventually exceeding the
read error threshold and removing the device from the array.
This happens when stacking configurations bypass bio_split_to_limits()
in the IO path: dm-raid calls md_handle_request() directly without going
through md_submit_bio(), skipping the alignment validation that would
otherwise reject invalid bios early. The invalid bio reaches the
lower block layers, which fail the bio with BLK_STS_INVAL, and raid1
wrongly interprets this as a device failure.
Add BLK_STS_INVAL to raid1_should_handle_error() so that invalid IO
errors are propagated back to the caller rather than triggering device
removal. This is consistent with the previous kernel behavior when
alignment checks were done earlier in the direct-io path.
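A hedged sketch of the idea (the helper's real structure may differ; the
point is only that BLK_STS_INVAL joins the statuses that are not treated
as a device failure):
```
/* hedged sketch; switch layout is illustrative */
static bool raid1_should_handle_error(struct bio *bio)
{
	switch (bio->bi_status) {
	case BLK_STS_INVAL:	/* invalid IO, not a failing mirror:
				 * propagate instead of retrying */
		return false;
	default:
		return true;
	}
}
```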
Fixes: 5ff3f74e14 ("block: simplify direct io validity check")
Reported-by: Tomáš Trnka <trnka@scm.com>
Closes: https://lore.kernel.org/linux-block/2982107.4sosBPzcNG@electra/
Signed-off-by: Keith Busch <kbusch@kernel.org>
Tested-by: Tomáš Trnka <trnka@scm.com>
Link: https://lore.kernel.org/r/20260416140345.3872265-1-kbusch@meta.com
Signed-off-by: Yu Kuai <yukuai@fnnas.com>
If make_stripe_request() returns STRIPE_WAIT_RESHAPE,
raid5_make_request() will free the cloned bio. But raid5_make_request()
can call make_stripe_request() multiple times, writing to the various
stripes. If that bio got added to the toread or towrite lists of a
stripe disk in an earlier call to make_stripe_request(), then it's not
safe to just free the bio if a later part of it is found to cross the
reshape position. Doing so can lead to a UAF error, when bio_endio()
is called on the bio for the earlier stripes.
Instead, raid5_make_request() needs to wait until all parts of the bio
have called bio_endio(). To do this, bios that cross the reshape
position while the reshape can't make progress are flagged as needing to
wait for all parts to complete. When raid5_make_request() has a bio that
failed make_stripe_request() with STRIPE_WAIT_RESHAPE, it sets
bi->bi_private to a completion struct and waits for completion after
ending the bio. When bio_endio() is called for the last time on a
clone bio with bi->bi_private set, it wakes up the waiter. This
guarantees that raid5_make_request() doesn't return until the cloned bio
needing a retry for io across the reshape boundary is safely cleaned up.
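A heavily simplified sketch of that flow (the function name, parameters,
and the needs_wait condition are illustrative stand-ins):
```
/* hedged sketch of the retry-cleanup path in raid5_make_request() */
static void wait_for_cloned_bio(struct bio *bi, bool needs_wait)
{
	DECLARE_COMPLETION_ONSTACK(done);

	if (needs_wait) {
		bi->bi_private = &done;	/* final bio_endio() wakes us */
		bio_endio(bi);		/* drop this caller's reference */
		wait_for_completion(&done); /* all stripe parts ended */
	}
}
```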
There is a simple reproducer available at [1]. Compile the kernel with
KASAN for more useful reporting when the error is triggered (this is not
necessary to see the bug).
[1] https://gist.github.com/bmarzins/e48598824305cf2171289e47d7241fa5
Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
Link: https://lore.kernel.org/r/20260408043548.1695157-1-bmarzins@redhat.com
Signed-off-by: Yu Kuai <yukuai@fnnas.com>
There's a bug in dm-thin in the function rebalance_children. If the
internal btree node has one entry, the code tries to copy all btree
entries from the node's child to the node itself and then decrement the
child's reference count.
If the child node is shared (it has reference count > 1), we won't free
it, so there would be two pointers to each of the grandchildren nodes.
But the reference counts of the grandchildren are not increased, so each
grandchild's reference count no longer matches the number of pointers that
point to it. This results in "device mapper: space map common: unable
to decrement block" errors.
Fix this bug by incrementing reference counts on the grandchildren if the
btree node is shared.
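A hedged sketch of the increment (the helper below is a stand-in;
dm_tm_inc() and value64() are the persistent-data primitives one would
expect here, but the exact call site differs):
```
/* hedged sketch; invoked only when the child block is shared */
static void inc_grandchildren(struct dm_transaction_manager *tm,
			      struct btree_node *child,
			      unsigned int nr_entries)
{
	unsigned int i;

	/* the shared child keeps its pointers, so the copies now in
	 * the parent are extra references: bump each grandchild */
	for (i = 0; i < nr_entries; i++)
		dm_tm_inc(tm, value64(child, i));
}
```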
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Fixes: 3241b1d3e0 ("dm: add persistent data library")
Cc: stable@vger.kernel.org
Merge tag 'for-7.1/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm
Pull device mapper updates from Benjamin Marzinski:
"There are fixes for some corner case crashes in dm-cache and
dm-mirror, new setup functionality for dm-vdo, and miscellaneous minor
fixes and cleanups, especially to dm-verity.
dm-vdo:
- Make dm-vdo able to format the device itself, like other dm
targets, instead of needing a userspace formatting program
- Add some sanity checks and code cleanup
dm-cache:
- Fix crashes and hangs when operating in passthrough mode (which
have been around, unnoticed, since 4.12), as well as a late
arriving fix for an error path bug in the passthrough fix
- Fix a corner case memory leak
dm-verity:
- Another set of minor bugfixes and code cleanups to the forward
error correction code
dm-mirror
- Fix minor initialization bug
- Fix overflow crash on large devices with small region sizes
dm-crypt
- Reimplement elephant diffuser using AES library and minor cleanups
dm-core:
- Claude found a buffer overflow in /dev/mapper/control ioctl handling
- make dm_mod.wait_for correctly wait for partitions
- minor code fixes and cleanups"
* tag 'for-7.1/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: (62 commits)
dm cache: fix missing return in invalidate_committed's error path
dm: fix a buffer overflow in ioctl processing
dm-crypt: Make crypt_iv_operations::post return void
dm vdo: Fix spelling mistake "postive" -> "positive"
dm: provide helper to set stacked limits
dm-integrity: always set the io hints
dm-integrity: fix mismatched queue limits
dm-bufio: use kzalloc_flex
dm vdo: save the formatted metadata to disk
dm vdo: add formatting logic and initialization
dm vdo: add synchronous metadata I/O submission helper
dm vdo: add geometry block structure
dm vdo: add geometry block encoding
dm vdo: add upfront validation for logical size
dm vdo: add formatting parameters to table line
dm vdo: add super block initialization to encodings.c
dm vdo: add geometry block initialization to encodings.c
dm-crypt: Make crypt_iv_operations::wipe return void
dm-crypt: Reimplement elephant diffuser using AES library
dm-verity-fec: warn even when there were no errors
...
In passthrough mode, dm-cache defers write submission until after
metadata commit completes via the invalidate_committed() continuation.
On commit error, invalidate_committed() calls invalidate_complete() to
end the bio and free the migration struct, after which it should return
immediately.
The patch 4ca8b8bd95 ("dm cache: fix write hang in passthrough mode")
omitted this early return, causing execution to fall through into the
success path on error. This results in use-after-free on the migration
struct in the subsequent calls.
Fix by adding the missing return after the invalidate_complete() call.
Fixes: 4ca8b8bd95 ("dm cache: fix write hang in passthrough mode")
Reported-by: Dan Carpenter <error27@gmail.com>
Closes: https://lore.kernel.org/dm-devel/adjMq6T5RRjv_uxM@stanley.mountain/
Signed-off-by: Ming-Hung Tsai <mtsai@redhat.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Tony Asleson (using Claude) found a buffer overflow in dm-ioctl in the
function retrieve_status:
1. The code in retrieve_status checks that the output string fits into
the output buffer and writes the output string there
2. Then, the code aligns the "outptr" variable to the next 8-byte
boundary:
outptr = align_ptr(outptr);
3. The alignment doesn't check overflow, so outptr could point past the
buffer end
4. The "for" loop is iterated again, it executes:
remaining = len - (outptr - outbuf);
5. If "outptr" points past "outbuf + len", the arithmetics wraps around
and the variable "remaining" contains unusually high number
6. With "remaining" being high, the code writes more data past the end of
the buffer
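In code terms, roughly (a hedged sketch paraphrasing the steps above;
the fix shown is only one possible shape, not the exact patch):
```
outptr = align_ptr(outptr);		/* step 2: may pass outbuf + len */
remaining = len - (outptr - outbuf);	/* step 4: wraps past the end */
/* fix idea (illustrative): stop once alignment exhausts the buffer */
if (outptr > outbuf + len)
	break;
```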
Luckily, this bug has no security implications because:
1. Only root can issue device mapper ioctls
2. The commonly used libraries that communicate with device mapper
(libdevmapper and devicemapper-rs) use a buffer size that is aligned to
8 bytes - thus "outptr = align_ptr(outptr)" can't overshoot the
buffer and the bug can't happen accidentally
Reported-by: Tony Asleson <tasleson@redhat.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Reviewed-by: Bryn M. Reeves <bmr@redhat.com>
Cc: stable@vger.kernel.org
When retry_aligned_read() encounters an overlapped stripe, it releases
the stripe via raid5_release_stripe() which puts it on the lockless
released_stripes llist. In the next raid5d loop iteration,
release_stripe_list() drains the stripe onto handle_list (since
STRIPE_HANDLE is set by the original IO), but retry_aligned_read()
runs before handle_active_stripes() and removes the stripe from
handle_list via find_get_stripe() -> list_del_init(). This prevents
handle_stripe() from ever processing the stripe to resolve the
overlap, causing an infinite loop and soft lockup.
Fix this by using __release_stripe() with temp_inactive_list instead
of raid5_release_stripe() in the failure path, so the stripe does not
go through the released_stripes llist. This allows raid5d to break out
of its loop, and the overlap will be resolved when the stripe is
eventually processed by handle_stripe().
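A hedged sketch of the changed failure path (device_lock handling and
the temp list plumbing are simplified here):
```
/* in retry_aligned_read(), on overlap */
if (!add_stripe_bio(sh, raid_bio, dd_idx, 0, 0)) {
	/* release directly instead of via raid5_release_stripe() so
	 * the stripe never lands on the released_stripes llist */
	__release_stripe(conf, sh, conf->temp_inactive_list);
	return handled;	/* raid5d can now make forward progress */
}
```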
Fixes: 773ca82fa1 ("raid5: make release_stripe lockless")
Cc: stable@vger.kernel.org
Signed-off-by: FengWei Shih <dannyshih@synology.com>
Signed-off-by: Chia-Ming Chang <chiamingc@synology.com>
Link: https://lore.kernel.org/linux-raid/20260402061406.455755-1-chiamingc@synology.com/
Signed-off-by: Yu Kuai <yukuai@fnnas.com>
During raid456 reshape, direct IO across the reshape position can sleep
in raid5_make_request() waiting for reshape progress while still
holding an active_io reference. If userspace then freezes reshape and
writes md/suspend_lo or md/suspend_hi, mddev_suspend() kills active_io
and waits for all in-flight IO to drain.
This can deadlock: the IO needs reshape progress to continue, but the
reshape thread is already frozen, so the active_io reference is never
dropped and suspend never completes.
raid5_prepare_suspend() already wakes wait_for_reshape for dm-raid. Do
the same for normal md suspend when reshape is already interrupted, so
waiting raid456 IO can abort, drop its reference, and let suspend
finish.
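A hedged sketch (the interrupted-reshape condition is paraphrased; the
dm-raid path already does the equivalent wakeup):
```
static void raid5_prepare_suspend(struct mddev *mddev)
{
	struct r5conf *conf = mddev->private;

	/* let IO sleeping across the reshape position abort and drop
	 * its active_io reference so mddev_suspend() can finish */
	wake_up(&conf->wait_for_reshape);
}
```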
The mdadm test tests/25raid456-reshape-deadlock reproduces the hang.
Fixes: 714d20150e ("md: add new helpers to suspend/resume array")
Link: https://lore.kernel.org/linux-raid/20260327140729.2030564-1-yukuai@fnnas.com/
Signed-off-by: Yu Kuai <yukuai@fnnas.com>
Previously, using wait_event() would wake up all waiters simultaneously,
and they would compete for the tree lock. The bio which gets the lock
first will be handled, so the write sequence cannot be guaranteed.
For example:
bio1(100,200)
bio2(150,200)
bio3(150,300)
The write sequence to the fast device is bio1,bio2,bio3. But the write
sequence to the slow device could be bio1,bio3,bio2 due to lock competition.
This causes data corruption.
Replace the waitqueue with a fifo list to guarantee the write sequence. The
wakeup path also needs to iterate the whole list rather than removing a
single entry; otherwise it may miss the chance to wake up a waiting io.
For example:
bio1(1,3), bio2(2,4)
bio3(5,7), bio4(6,8)
These four bios are in the same bucket. bio1 and bio3 are inserted into
the rbtree. bio2 and bio4 are added to the waiting list and bio2 is the
first one. bio3 returns from slow disk and tries to wake up the waiting
bios. bio2 is removed from the list and will be handled. But bio1 hasn't
finished. So bio2 will be added into waiting list again. Then bio1 returns
from slow disk and wakes up waiting bios. bio4 is removed from the list
and will be handled. Now bio1, bio3 and bio4 all finish and bio2 is left
on the waiting list. So it needs to iterate the waiting list to wake up
the right bio.
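A hedged sketch of the wakeup scan (the types and helpers here are
illustrative placeholders, not the exact driver code):
```
struct bucket {
	struct list_head wait_list;	/* fifo of blocked bios */
	/* rbtree of in-flight ranges lives here too */
};

struct wait_entry {
	struct list_head node;	/* list position == arrival order */
	struct bio *bio;
};

static void wake_waiting_bios(struct bucket *b)
{
	struct wait_entry *we, *tmp;

	/* scan the whole list: a later entry (bio4 above) may be
	 * runnable while the head (bio2) still overlaps */
	list_for_each_entry_safe(we, tmp, &b->wait_list, node) {
		if (!overlaps_pending_range(b, we->bio)) {
			list_del(&we->node);
			resubmit_in_order(b, we->bio);
		}
	}
}
```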
Signed-off-by: Xiao Ni <xni@redhat.com>
Link: https://lore.kernel.org/linux-raid/20260324072501.59865-1-xni@redhat.com/
Signed-off-by: Yu Kuai <yukuai@fnnas.com>
For RAID-456 arrays with llbitmap, if all underlying disks support
write_zeroes with unmap, issue write_zeroes to zero all disk data
regions and initialize the bitmap to BitCleanUnwritten instead of
BitUnwritten.
This optimization skips the initial XOR parity building because:
1. write_zeroes with unmap guarantees zeroed reads after the operation
2. For RAID-456, when all data is zero, parity is automatically
consistent (0 XOR 0 XOR ... = 0)
3. BitCleanUnwritten indicates parity is valid but no user data
has been written
The implementation adds two helper functions:
- llbitmap_all_disks_support_wzeroes_unmap(): Checks if all active
disks support write_zeroes with unmap
- llbitmap_zero_all_disks(): Issues blkdev_issue_zeroout() to each
rdev's data region to zero all disks
The zeroing and bitmap state setting happens in llbitmap_init_state()
during bitmap initialization. If any disk fails to zero, we fall back
to BitUnwritten and normal lazy recovery.
This significantly reduces array initialization time for RAID-456
arrays built on modern NVMe SSDs or other devices that support
write_zeroes with unmap.
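A hedged sketch of the capability check (assuming a block-layer helper
along the lines of bdev_write_zeroes_unmap_sectors()):
```
static bool llbitmap_all_disks_support_wzeroes_unmap(struct mddev *mddev)
{
	struct md_rdev *rdev;

	rdev_for_each(rdev, mddev)
		if (!bdev_write_zeroes_unmap_sectors(rdev->bdev))
			return false;	/* one disk lacks unmap zeroing */

	return true;
}
```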
Reviewed-by: Xiao Ni <xni@redhat.com>
Link: https://lore.kernel.org/linux-raid/20260323054644.3351791-4-yukuai@fnnas.com/
Signed-off-by: Yu Kuai <yukuai@fnnas.com>
Add new states to the llbitmap state machine to support proactive XOR
parity building for RAID-5 arrays. This allows users to pre-build parity
data for unwritten regions before any user data is written.
New states added:
- BitNeedSyncUnwritten: Transitional state when proactive sync is triggered
via sysfs on Unwritten regions.
- BitSyncingUnwritten: Proactive sync in progress for unwritten region.
- BitCleanUnwritten: XOR parity has been pre-built, but no user data
written yet. When user writes to this region, it transitions to BitDirty.
New actions added:
- BitmapActionProactiveSync: Trigger for proactive XOR parity building.
- BitmapActionClearUnwritten: Convert CleanUnwritten/NeedSyncUnwritten/
SyncingUnwritten states back to Unwritten before recovery starts.
State flows:
- Current (lazy): Unwritten -> (write) -> NeedSync -> (sync) -> Dirty -> Clean
- New (proactive): Unwritten -> (sysfs) -> NeedSyncUnwritten -> (sync) -> CleanUnwritten
- On write to CleanUnwritten: CleanUnwritten -> (write) -> Dirty -> Clean
- On disk replacement: CleanUnwritten regions are converted to Unwritten
before recovery starts, so recovery only rebuilds regions with user data
A new sysfs interface is added at /sys/block/mdX/md/llbitmap/proactive_sync
(write-only) to trigger proactive sync. This only works for RAID-456 arrays.
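For reference, the state set then looks roughly like this (a hedged
sketch; the enum shape and comments are illustrative):
```
enum llbitmap_state {
	BitUnwritten,		/* no parity built, no user data */
	BitNeedSyncUnwritten,	/* proactive sync requested via sysfs */
	BitSyncingUnwritten,	/* parity pre-build in progress */
	BitCleanUnwritten,	/* parity valid, no user data yet */
	BitDirty,		/* user write in flight */
	BitClean,		/* parity and data consistent */
};
```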
Link: https://lore.kernel.org/linux-raid/20260323054644.3351791-3-yukuai@fnnas.com/
Signed-off-by: Yu Kuai <yukuai@fnnas.com>
r5c_recovery_analyze_meta_block() and
r5l_recovery_verify_data_checksum_for_mb() iterate over payloads in a
journal metadata block using on-disk payload size fields without
validating them against the remaining space in the metadata block.
A corrupted journal containing payload sizes that extend beyond the
PAGE_SIZE boundary can cause out-of-bounds reads when payload fields are
accessed or offsets are computed.
Add bounds validation for each payload type to ensure the full payload
fits within meta_size before processing.
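A hedged sketch of the per-payload check (the mb_offset/payload_size
bookkeeping is paraphrased from the recovery loop):
```
/* reject a payload that would run past the metadata block */
if (mb_offset + payload_size > le32_to_cpu(mb->meta_size))
	return -EINVAL;
```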
Fixes: b4c625c673 ("md/r5cache: r5cache recovery: part 1")
Cc: stable@vger.kernel.org
Signed-off-by: Junrui Luo <moonafterrain@outlook.com>
Link: https://lore.kernel.org/linux-raid/SYBPR01MB78815E78D829BB86CD7C8015AF5FA@SYBPR01MB7881.ausprd01.prod.outlook.com/
Signed-off-by: Yu Kuai <yukuai@fnnas.com>
The md_wq workqueue is defined as static and initialized in md_init(),
but it is not used anywhere within md.c.
All asynchronous and deferred work in this file is handled via
md_misc_wq or dedicated md threads.
Fixes: b75197e86e ("md: Remove flush handling")
Signed-off-by: Abd-Alrhman Masalkhi <abd.masalkhi@gmail.com>
Link: https://lore.kernel.org/linux-raid/20260328193522.3624-1-abd.masalkhi@gmail.com/
Signed-off-by: Yu Kuai <yukuai@fnnas.com>
syzbot reported a WARNING at mm/page_alloc.c:__alloc_frozen_pages_noprof()
triggered by create_strip_zones() in the RAID0 driver.
When raid_disks is large, the allocation size exceeds MAX_PAGE_ORDER (4MB
on x86), causing WARN_ON_ONCE_GFP(order > MAX_PAGE_ORDER).
Convert the strip_zone and devlist allocations from kzalloc/kzalloc_objs to
kvzalloc/kvzalloc_objs, which first attempts a contiguous allocation with
__GFP_NOWARN and then falls back to vmalloc for large sizes. Convert the
corresponding kfree calls to kvfree.
Both arrays are pure metadata lookup tables (arrays of pointers and zone
descriptors) accessed only via indexing, so they do not require physically
contiguous memory.
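A hedged sketch of the conversion (the devlist sizing mirrors the
existing array3_size() allocation in create_strip_zones()):
```
conf->devlist = kvzalloc(array3_size(sizeof(struct md_rdev *),
				     conf->nr_strip_zones,
				     mddev->raid_disks + 1),
			 GFP_KERNEL);
if (!conf->devlist)
	return -ENOMEM;
/* matching teardown becomes kvfree(conf->devlist) */
```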
Reported-by: syzbot+924649752adf0d3ac9dd@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/69adaba8.a00a0220.b130.0005.GAE@google.com/
Signed-off-by: Gregory Price <gourry@gourry.net>
Reviewed-by: Yu Kuai <yukuai@fnnas.com>
Reviewed-by: Li Nan <linan122@huawei.com>
Link: https://lore.kernel.org/linux-raid/20260308234202.3118119-1-gourry@gourry.net/
Signed-off-by: Yu Kuai <yukuai@fnnas.com>
When "clear" is written to array_state, md_attr_store() breaks sysfs
active protection so the array can delete itself from its own sysfs
store method.
However, md_attr_store() currently drops the mddev reference before
calling sysfs_unbreak_active_protection(). Once do_md_stop(..., 0)
has made the mddev eligible for delayed deletion, the temporary
kobject reference taken by sysfs_break_active_protection() can become
the last kobject reference protecting the md kobject.
That allows sysfs_unbreak_active_protection() to drop the last
kobject reference from the current sysfs writer context. kobject
teardown then recurses into kernfs removal while the current sysfs
node is still being unwound, and lockdep reports recursive locking on
kn->active with kernfs_drain() in the call chain.
Reproducer on an existing level:
1. Create an md0 linear array and activate it:
mknod /dev/md0 b 9 0
echo none > /sys/block/md0/md/metadata_version
echo linear > /sys/block/md0/md/level
echo 1 > /sys/block/md0/md/raid_disks
echo "$(cat /sys/class/block/sdb/dev)" > /sys/block/md0/md/new_dev
echo "$(($(cat /sys/class/block/sdb/size) / 2))" > \
/sys/block/md0/md/dev-sdb/size
echo 0 > /sys/block/md0/md/dev-sdb/slot
echo active > /sys/block/md0/md/array_state
2. Wait briefly for the array to settle, then clear it:
sleep 2
echo clear > /sys/block/md0/md/array_state
The warning looks like:
WARNING: possible recursive locking detected
bash/588 is trying to acquire lock:
(kn->active#65) at __kernfs_remove+0x157/0x1d0
but task is already holding lock:
(kn->active#65) at sysfs_unbreak_active_protection+0x1f/0x40
...
Call Trace:
kernfs_drain
__kernfs_remove
kernfs_remove_by_name_ns
sysfs_remove_group
sysfs_remove_groups
__kobject_del
kobject_put
md_attr_store
kernfs_fop_write_iter
vfs_write
ksys_write
Restore active protection before mddev_put() so the extra sysfs
kobject reference is dropped while the mddev is still held alive. The
actual md kobject deletion is then deferred until after the sysfs
write path has fully returned.
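A hedged sketch of the reordering at the end of md_attr_store():
```
sysfs_unbreak_active_protection(kn);	/* drop the extra sysfs kobject
					 * ref while mddev is still held */
mddev_put(mddev);			/* md kobject deletion is now
					 * deferred past the sysfs write */
```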
Fixes: 9e59d60976 ("md: call del_gendisk in control path")
Reviewed-by: Xiao Ni <xni@redhat.com>
Link: https://lore.kernel.org/linux-raid/20260330055213.3976052-1-yukuai@fnnas.com/
Signed-off-by: Yu Kuai <yukuai@fnnas.com>
In the previous patch ("bcache: fix cached_dev.sb_bio use-after-free and
crash"), we adopted a simple modification suggestion from AI to fix the
use-after-free.
But in actual testing, we found an extreme case where the device is
stopped before calling bch_write_bdev_super().
At this point, struct closure sb_write has not been initialized yet.
For this patch, we ensure that sb_bio has been completed via
sb_write_mutex.
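A hedged sketch of the wait in cached_dev_free() (assuming
bch_write_bdev_super() holds sb_write_mutex across the write):
```
/* taking and releasing the lock guarantees any in-flight
 * superblock write has finished with sb_bio */
down(&dc->sb_write_mutex);
up(&dc->sb_write_mutex);
```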
Signed-off-by: Mingzhe Zou <mingzhe.zou@easystack.cn>
Signed-off-by: Coly Li <colyli@fnnas.com>
Link: https://patch.msgid.link/20260403042135.2221247-1-colyli@fnnas.com
Fixes: fec114a98b ("bcache: fix cached_dev.sb_bio use-after-free and crash")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
In our production environment, we have received multiple crash reports
regarding libceph, which have caught our attention:
```
[6888366.280350] Call Trace:
[6888366.280452] blk_update_request+0x14e/0x370
[6888366.280561] blk_mq_end_request+0x1a/0x130
[6888366.280671] rbd_img_handle_request+0x1a0/0x1b0 [rbd]
[6888366.280792] rbd_obj_handle_request+0x32/0x40 [rbd]
[6888366.280903] __complete_request+0x22/0x70 [libceph]
[6888366.281032] osd_dispatch+0x15e/0xb40 [libceph]
[6888366.281164] ? inet_recvmsg+0x5b/0xd0
[6888366.281272] ? ceph_tcp_recvmsg+0x6f/0xa0 [libceph]
[6888366.281405] ceph_con_process_message+0x79/0x140 [libceph]
[6888366.281534] ceph_con_v1_try_read+0x5d7/0xf30 [libceph]
[6888366.281661] ceph_con_workfn+0x329/0x680 [libceph]
```
After analyzing the coredump file, we found that the address of
dc->sb_bio has been freed. We know that cached_dev is only freed when it
is stopped.
Since sb_bio is embedded in struct cached_dev rather than allocated per
write, stopping the device while the superblock write is in flight means
the freed address is accessed at endio.
This patch waits for sb_write to complete in cached_dev_free.
It should be noted that we analyzed the cause of the problem, then gave
all the details to QWEN and adopted the modifications it made.
Signed-off-by: Mingzhe Zou <mingzhe.zou@easystack.cn>
Fixes: cafe563591 ("bcache: A block layer cache")
Cc: stable@vger.kernel.org # 3.10+
Signed-off-by: Coly Li <colyli@fnnas.com>
Link: https://patch.msgid.link/20260322134102.480107-1-colyli@fnnas.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Since all implementations of crypt_iv_operations::post now return 0,
change the return type to void.
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
There is a spelling mistake in a vdo_log_error message. Fix it.
Signed-off-by: Colin Ian King <colin.i.king@gmail.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
There are multiple device mappers that set up their stacking limits
exactly the same for the logical, physical and minimum IO queue limits.
Provide a helper for it.
Signed-off-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Don't depend on the defaults being what is desired if the integrity
device was set up with a 512b sector size. Always set the queue limits to
be at least what the device mapper wants.
Signed-off-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
A user can integritysetup a device with a backing device using a 4k
logical block size, but request the dm device use 1k or 2k. This
mismatch creates an inconsistency such that the dm device would report
limits for IO that it can't actually execute. Fix this by using the
backing device's limits if they are larger.
Signed-off-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Avoid manual size calculations and use the proper helper.
Add __counted_by for extra runtime analysis.
Signed-off-by: Rosen Penev <rosenp@gmail.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Add vdo_save_super_block() and vdo_save_geometry_block() to perform
asynchronous writes of the super block and geometry block respectively.
Add vdo_clear_layout() to zero the UDS index's first block, the block
map partition, and the recovery journal partition.
These operations are driven by new phases in the pre-load state machine
(PRE_LOAD_PHASE_FORMAT_*), ensuring that disk writes happen during
pre-resume rather than during dmsetup create.
Signed-off-by: Bruce Johnston <bjohnsto@redhat.com>
Reviewed-by: Matthew Sakai <msakai@redhat.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Add the core formatting logic. The initialization path is updated to
read the geometry block (block 0 on the storage device). If the block
is entirely zeroed, the device is treated as unformatted and
vdo_format() is called. Otherwise, the existing geometry is parsed
and the VDO is loaded as before.
The vdo_format() function initializes the volume geometry and super
block, and marks the VDO as needing its layout saved to disk.
Signed-off-by: Bruce Johnston <bjohnsto@redhat.com>
Reviewed-by: Matthew Sakai <msakai@redhat.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Add vdo_submit_metadata_vio_wait(), a synchronous I/O submission
helper that blocks until completion. This is needed for I/O during
early initialization before work queues are available.
Refactor read_geometry_block() to use it.
Signed-off-by: Bruce Johnston <bjohnsto@redhat.com>
Reviewed-by: Matthew Sakai <msakai@redhat.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Introduce a vdo_geometry_block structure, containing a vio and buffer,
mirroring the existing vdo_super_block structure. Both are now
initialized at VDO startup and freed at shutdown, establishing the
infrastructure needed to read and write the geometry block using the
same mechanisms as the super block.
Refactor read_geometry_block() to use the new structure.
Signed-off-by: Bruce Johnston <bjohnsto@redhat.com>
Reviewed-by: Matthew Sakai <msakai@redhat.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Add vdo_encode_volume_geometry() to write the geometry block into a
buffer so that it can be written to disk. The corresponding decode
path already exists.
Signed-off-by: Bruce Johnston <bjohnsto@redhat.com>
Reviewed-by: Matthew Sakai <msakai@redhat.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Add a validation check that the logical size passed via the table line
does not exceed MAXIMUM_VDO_LOGICAL_BLOCKS.
Signed-off-by: Bruce Johnston <bjohnsto@redhat.com>
Reviewed-by: Matthew Sakai <msakai@redhat.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Extend the dm table line with three new optional parameters:
indexMemory (UDS index memory size), indexSparse (dense vs sparse
index), and slabSize (blocks per allocation slab). These values are
parsed, validated, and stored in the device configuration for use
during formatting.
Rework the slab size constants from the single MAX_VDO_SLAB_BITS into
explicit MIN_VDO_SLAB_BLOCKS, MAX_VDO_SLAB_BLOCKS, and
DEFAULT_VDO_SLAB_BLOCKS values.
Bump the target version from 9.1.0 to 9.2.0 to reflect this table
line change.
Signed-off-by: Bruce Johnston <bjohnsto@redhat.com>
Reviewed-by: Matthew Sakai <msakai@redhat.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Add vdo_initialize_component_states() to populate the super block,
computing the space required for the main VDO components on disk.
Those include the slab depot, block map, and recovery journal.
Signed-off-by: Bruce Johnston <bjohnsto@redhat.com>
Reviewed-by: Matthew Sakai <msakai@redhat.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Add vdo_initialize_volume_geometry() to populate the geometry block,
computing the space required for the two main regions on disk.
Add uds_compute_index_size() to calculate the space required for the
UDS indexer from the UDS configuration.
Signed-off-by: Bruce Johnston <bjohnsto@redhat.com>
Reviewed-by: Matthew Sakai <msakai@redhat.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Since all implementations of crypt_iv_operations::wipe now return 0,
change the return type to void.
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Simplify and optimize dm-crypt's implementation of Bitlocker's "elephant
diffuser" to use the AES library instead of an "ecb(aes)"
crypto_skcipher.
Note: struct aes_enckey is fixed-size, so it could be embedded directly
in struct iv_elephant_private. But I kept it as a separate allocation
so that the size of struct crypt_config doesn't increase. The elephant
diffuser is rarely used in dm-crypt.
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Currently FEC logs a warning message if at least one error was
corrected, or an error message if there were uncorrectable errors.
However, it doesn't log anything if there were no errors.
"No errors" is actually unexpected, though, considering that dm-verity
calls verity_fec_decode() only when a block's digest doesn't match.
If there were to ever be a bug where verity_fec_decode() is called on
blocks with the correct digest, then there would be no indication in the
log that FEC is running and degrading performance.
Therefore, let's log the warning message even when there were no errors.
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>