mirror of
https://github.com/torvalds/linux.git
synced 2026-06-02 19:43:40 +02:00
v6.4-rc4
12023 Commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
b158dd941b |
for-6.4-rc3-tag
-----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmRwqRsACgkQxWXV+ddt WDuCPQ//T8JVY6usnGF/Fw/3zbtDNvrdQLDfp3HovIg7gmLIBda0bT05w4Q46FUU l4BV0bHyTUNWPlXUmrrSmt8HipRe2z4Wjwc16azdLmSs5zf0FO1LbsCKDmM8Ncid LTi2jzyyb3E44ZzC/i7RCaBt+vYRb2ZmtZ/glh3K4H0GgTAYl1GxZoAoYgBnvmlG nvmlWWDaM2cRKaUREm75il37LKLIlW5jvdUFQrqwWNgUH72ay5/7SZxHywlk8x6b qwhhp+s6bMUNzi6CqE2SLnESjI9yl0l/0gLebhDXVulo0BiCrti+YLpueP4eQs1B yYXX3PvHOXhoN4tUQ4yDF9G57To4Gw1aiQOnWOOLcbyGG1ZgyekpoRRXh6r74LKt FDyWT+u/xd78by1km3VzqmvKtqHnRFNMYfP+MMDIhyhy5prKCWeVo7bC+2FP+89o kv9+0Z0w0lkLycFfLaewZkEv0/WY8GMuT7kptHQ2Ao6ulAvG+j97sgVBFGXJjeCr B1OAGdeTF79IV139bCxPA62cat87Zrh15mZN+y7U32Vs2JkOqbT0LTQGKoVs/TCI AyHCDb8oOfGiebibnEDrDNtubz7NFCq4ntZRmuv5FJ+l2d1wl6ZvsI+DoYP7Zide DLR7ZtPs1Yvm27xDjs+fVmMx4nuNGikEbPZPxJro1CjLVzCEt7k= =elHB -----END PGP SIGNATURE----- Merge tag 'for-6.4-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: - handle memory allocation error in checksumming helper (reported by syzbot) - fix lockdep splat when aborting a transaction, add NOFS protection around invalidate_inode_pages2 that could allocate with GFP_KERNEL - reduce chances to hit an ENOSPC during scrub with RAID56 profiles * tag 'for-6.4-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: use nofs when cleaning up aborted transactions btrfs: handle memory allocation failure in btrfs_csum_one_bio btrfs: scrub: try harder to mark RAID56 block groups read-only |
||
|
|
597441b343 |
btrfs: use nofs when cleaning up aborted transactions
Our CI system caught a lockdep splat: ====================================================== WARNING: possible circular locking dependency detected 6.3.0-rc7+ #1167 Not tainted ------------------------------------------------------ kswapd0/46 is trying to acquire lock: ffff8c6543abd650 (sb_internal#2){++++}-{0:0}, at: btrfs_commit_inode_delayed_inode+0x5f/0x120 but task is already holding lock: ffffffffabe61b40 (fs_reclaim){+.+.}-{0:0}, at: balance_pgdat+0x4aa/0x7a0 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #1 (fs_reclaim){+.+.}-{0:0}: fs_reclaim_acquire+0xa5/0xe0 kmem_cache_alloc+0x31/0x2c0 alloc_extent_state+0x1d/0xd0 __clear_extent_bit+0x2e0/0x4f0 try_release_extent_mapping+0x216/0x280 btrfs_release_folio+0x2e/0x90 invalidate_inode_pages2_range+0x397/0x470 btrfs_cleanup_dirty_bgs+0x9e/0x210 btrfs_cleanup_one_transaction+0x22/0x760 btrfs_commit_transaction+0x3b7/0x13a0 create_subvol+0x59b/0x970 btrfs_mksubvol+0x435/0x4f0 __btrfs_ioctl_snap_create+0x11e/0x1b0 btrfs_ioctl_snap_create_v2+0xbf/0x140 btrfs_ioctl+0xa45/0x28f0 __x64_sys_ioctl+0x88/0xc0 do_syscall_64+0x38/0x90 entry_SYSCALL_64_after_hwframe+0x72/0xdc -> #0 (sb_internal#2){++++}-{0:0}: __lock_acquire+0x1435/0x21a0 lock_acquire+0xc2/0x2b0 start_transaction+0x401/0x730 btrfs_commit_inode_delayed_inode+0x5f/0x120 btrfs_evict_inode+0x292/0x3d0 evict+0xcc/0x1d0 inode_lru_isolate+0x14d/0x1e0 __list_lru_walk_one+0xbe/0x1c0 list_lru_walk_one+0x58/0x80 prune_icache_sb+0x39/0x60 super_cache_scan+0x161/0x1f0 do_shrink_slab+0x163/0x340 shrink_slab+0x1d3/0x290 shrink_node+0x300/0x720 balance_pgdat+0x35c/0x7a0 kswapd+0x205/0x410 kthread+0xf0/0x120 ret_from_fork+0x29/0x50 other info that might help us debug this: Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(fs_reclaim); lock(sb_internal#2); lock(fs_reclaim); lock(sb_internal#2); *** DEADLOCK *** 3 locks held by kswapd0/46: #0: ffffffffabe61b40 (fs_reclaim){+.+.}-{0:0}, at: balance_pgdat+0x4aa/0x7a0 #1: ffffffffabe50270 (shrinker_rwsem){++++}-{3:3}, at: shrink_slab+0x113/0x290 #2: ffff8c6543abd0e0 (&type->s_umount_key#44){++++}-{3:3}, at: super_cache_scan+0x38/0x1f0 stack backtrace: CPU: 0 PID: 46 Comm: kswapd0 Not tainted 6.3.0-rc7+ #1167 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014 Call Trace: <TASK> dump_stack_lvl+0x58/0x90 check_noncircular+0xd6/0x100 ? save_trace+0x3f/0x310 ? add_lock_to_list+0x97/0x120 __lock_acquire+0x1435/0x21a0 lock_acquire+0xc2/0x2b0 ? btrfs_commit_inode_delayed_inode+0x5f/0x120 start_transaction+0x401/0x730 ? btrfs_commit_inode_delayed_inode+0x5f/0x120 btrfs_commit_inode_delayed_inode+0x5f/0x120 btrfs_evict_inode+0x292/0x3d0 ? lock_release+0x134/0x270 ? __pfx_wake_bit_function+0x10/0x10 evict+0xcc/0x1d0 inode_lru_isolate+0x14d/0x1e0 __list_lru_walk_one+0xbe/0x1c0 ? __pfx_inode_lru_isolate+0x10/0x10 ? __pfx_inode_lru_isolate+0x10/0x10 list_lru_walk_one+0x58/0x80 prune_icache_sb+0x39/0x60 super_cache_scan+0x161/0x1f0 do_shrink_slab+0x163/0x340 shrink_slab+0x1d3/0x290 shrink_node+0x300/0x720 balance_pgdat+0x35c/0x7a0 kswapd+0x205/0x410 ? __pfx_autoremove_wake_function+0x10/0x10 ? __pfx_kswapd+0x10/0x10 kthread+0xf0/0x120 ? __pfx_kthread+0x10/0x10 ret_from_fork+0x29/0x50 </TASK> This happens because when we abort the transaction in the transaction commit path we call invalidate_inode_pages2_range on our block group cache inodes (if we have space cache v1) and any delalloc inodes we may have. The plain invalidate_inode_pages2_range() call passes through GFP_KERNEL, which makes sense in most cases, but not here. Wrap these two invalidate callees with memalloc_nofs_save/memalloc_nofs_restore to make sure we don't end up with the fs reclaim dependency under the transaction dependency. CC: stable@vger.kernel.org # 4.14+ Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
|
806570c0bb |
btrfs: handle memory allocation failure in btrfs_csum_one_bio
Since
|
||
|
|
7561551e7b |
btrfs: scrub: try harder to mark RAID56 block groups read-only
Currently we allow a block group not to be marked read-only for scrub. But for RAID56 block groups if we require the block group to be read-only, then we're allowed to use cached content from scrub stripe to reduce unnecessary RAID56 reads. So this patch would: - Make btrfs_inc_block_group_ro() try harder During my tests, for cases like btrfs/061 and btrfs/064, we can hit ENOSPC from btrfs_inc_block_group_ro() calls during scrub. The reason is if we only have one single data chunk, and trying to scrub it, we won't have any space left for any newer data writes. But this check should be done by the caller, especially for scrub cases we only temporarily mark the chunk read-only. And newer data writes would always try to allocate a new data chunk when needed. - Return error for scrub if we failed to mark a RAID56 chunk read-only Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
|
76c7f8873a |
for-6.4-rc1-tag
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmRebDIACgkQxWXV+ddt
WDu3vA//RNyRGjEz0HgfhTc1119DXJLwK6j544waYLrzRcMtBK4xKByiaFkAA4tL
PQidGX+nAQPm+pZl0jcK30cBMObik5GXJwoSOZGl7/ectx4O7aFfXqiSfwPTyqZU
3fTavoqoJxbxJCVbifcXOPNhsUxMlEGYJmA3CVRsllLviXY+3HMpX2ZpWZ7vch+N
MLENNBfUo1HVdWaxOYfQif/qT5iR9G7D8dBjX9DUK0kVwrbwBB0rolJy4fPrY6z5
gBLED9Ks3FBgyU3mYq4qrfPmbfF8mPiaU0+1j+B46vw3PdPtIwjIForR+91GsZ1v
iHojbykf6VWTQV+gO78mgv4O4vRtn3C+UJaGxLL86OMOaiQQHFYdSETn9arPmoho
p1wCBidI82tvfIOGYXgrTGorLN27hhyPJinHe/2Bqo+1wUL8/J8mwCWunIox7a8z
rxO5QhDIDFX7gamsvYjkW3tBkYuGiGvBjx+Ic2cBHTkVp9wSPL9PCvqNNru2qexA
t0BpAL9DxvN+T1xO1thC3qsm2Ogx0QEmgdDfRglbEVASnRZKZZsJEMO90FzFbkFg
vLbs0KnT7yS7mTwq4NklDrgHZ0eiiJLZVCb8bR8xkzVW+ADrUmZuDM8WOcCgJAUp
fUoMmFsJZi5zsdAOygDWr1bBHorLV5szrY0bSB5L2eHwJjYZ6KE=
=uWUN
-----END PGP SIGNATURE-----
Merge tag 'for-6.4-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull more btrfs fixes from David Sterba:
- fix incorrect number of bitmap entries for space cache if loading is
interrupted by some error
- fix backref walking, this breaks a mode of LOGICAL_INO_V2 ioctl that
is used in deduplication tools
- zoned mode fixes:
- properly finish zone reserved for relocation
- correctly calculate super block zone end on ZNS
- properly initialize new extent buffer for redirty
- make mount option clear_cache work with block-group-tree, to rebuild
free-space-tree instead of temporarily disabling it that would lead
to a forced read-only mount
- fix alignment check for offset when printing extent item
* tag 'for-6.4-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: make clear_cache mount option to rebuild FST without disabling it
btrfs: zero the buffer before marking it dirty in btrfs_redirty_list_add
btrfs: zoned: fix full zone super block reading on ZNS
btrfs: zoned: zone finish data relocation BG with last IO
btrfs: fix backref walking not returning all inode refs
btrfs: fix space cache inconsistency after error loading it from disk
btrfs: print-tree: parent bytenr must be aligned to sector size
|
||
|
|
1d6a4fc857 |
btrfs: make clear_cache mount option to rebuild FST without disabling it
Previously clear_cache mount option would simply disable free-space-tree feature temporarily then re-enable it to rebuild the whole free space tree. But this is problematic for block-group-tree feature, as we have an artificial dependency on free-space-tree feature. If we go the existing method, after clearing the free-space-tree feature, we would flip the filesystem to read-only mode, as we detect a super block write with block-group-tree but no free-space-tree feature. This patch would change the behavior by properly rebuilding the free space tree without disabling this feature, thus allowing clear_cache mount option to work with block group tree. Now we can mount a filesystem with block-group-tree feature and clear_mount option: $ mkfs.btrfs -O block-group-tree /dev/test/scratch1 -f $ sudo mount /dev/test/scratch1 /mnt/btrfs -o clear_cache $ sudo dmesg -t | head -n 5 BTRFS info (device dm-1): force clearing of disk cache BTRFS info (device dm-1): using free space tree BTRFS info (device dm-1): auto enabling async discard BTRFS info (device dm-1): rebuilding free space tree BTRFS info (device dm-1): checking UUID tree CC: stable@vger.kernel.org # 6.1+ Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
|
c83b56d1dd |
btrfs: zero the buffer before marking it dirty in btrfs_redirty_list_add
btrfs_redirty_list_add zeroes the buffer data and sets the
EXTENT_BUFFER_NO_CHECK to make sure writeback is fine with a bogus
header. But it does that after already marking the buffer dirty, which
means that writeback could already be looking at the buffer.
Switch the order of operations around so that the buffer is only marked
dirty when we're ready to write it.
Fixes:
|
||
|
|
02ca9e6fb5 |
btrfs: zoned: fix full zone super block reading on ZNS
When both of the superblock zones are full, we need to check which
superblock is newer. The calculation of last superblock position is wrong
as it does not consider zone_capacity and uses the length.
Fixes:
|
||
|
|
f84353c7c2 |
btrfs: zoned: zone finish data relocation BG with last IO
For data block groups, we zone finish a zone (or, just deactivate it) when
seeing the last IO in btrfs_finish_ordered_io(). That is only called for
IOs using ZONE_APPEND, but we use a regular WRITE command for data
relocation IOs. Detect it and call btrfs_zone_finish_endio() properly.
Fixes:
|
||
|
|
0cad8f14d7 |
btrfs: fix backref walking not returning all inode refs
When using the logical to ino ioctl v2, if the flag to ignore offsets of file extent items (BTRFS_LOGICAL_INO_ARGS_IGNORE_OFFSET) is given, the backref walking code ends up not returning references for all file offsets of an inode that point to the given logical bytenr. This happens since kernel 6.2, commit |
||
|
|
0004ff15ea |
btrfs: fix space cache inconsistency after error loading it from disk
When loading a free space cache from disk, at __load_free_space_cache(),
if we fail to insert a bitmap entry, we still increment the number of
total bitmaps in the btrfs_free_space_ctl structure, which is incorrect
since we failed to add the bitmap entry. On error we then empty the
cache by calling __btrfs_remove_free_space_cache(), which will result
in getting the total bitmaps counter set to 1.
A failure to load a free space cache is not critical, so if a failure
happens we just rebuild the cache by scanning the extent tree, which
happens at block-group.c:caching_thread(). Yet the failure will result
in having the total bitmaps of the btrfs_free_space_ctl always bigger
by 1 then the number of bitmap entries we have. So fix this by having
the total bitmaps counter be incremented only if we successfully added
the bitmap entry.
Fixes:
|
||
|
|
c87f318e6f |
btrfs: print-tree: parent bytenr must be aligned to sector size
Check nodesize to sectorsize in alignment check in print_extent_item.
The comment states that and this is correct, similar check is done
elsewhere in the functions.
Found by Linux Verification Center (linuxtesting.org) with SVACE.
Fixes:
|
||
|
|
1dc3731daf |
for-6.4-rc1-tag
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmRaYhYACgkQxWXV+ddt
WDvCRQ/+MjRuInALh+N34mMneF8jjPOlUQBZbaC43XYJ0ss9drSzvE3STmPVrjdK
IHyzRYipKI6vdTtzYbyGwxJ9oazsuXTQXC3w/qMW1hO1EAQ0a9tbnTSIQ+BDbU63
BW7rJ3JuM6hKxKK+e9Dserhks0lOgQc+xKT1CUELvAHp3UykD4OrNczguaIT2lGR
YXL+9B3ex2SooCqrQStkqEtjD/kxbaYUkK7yWA2FssXWqU5SjZwUOsuY3ZPOWrm1
ULNI67gIxkMkSynV3aYka7nY3xc9oGIfk9WPeylWcOcH3+pWabeptjk617XbA0KI
4biz1zZ/qTRXWlCLDv3ukUa5EIVAWQ1kxVE/hAt3SzqJvoqB/ymML/2LeQNdyx2i
adMTZQ95JkhQNU9Lp9QOtpgfZonhhjxnL9KE7eMVo28zJFdYjge3egINjimY+mLz
qzrzUBI3bqCNYG0LRR1EvuN0feBd/9nNMFjLBi2mkDqsWtzvTxxzWvVlV5EEcoJe
xrozGh00Y5ioP6ZanKuZRib+u2ligbD66dYhKSU74D6B5kuZPic3Kkn9qICjRByM
uBGBze/7GT/3ouhPOwxVPtGZstiFhbAxE7mApROrIxAx8I9rZjBdHgFJQklolXNy
HSKNf3u98XZBVVcku/O1hyoeTLnApPfApxD4lv3qlRmgdEnAp6I=
=K25T
-----END PGP SIGNATURE-----
Merge tag 'for-6.4-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
- fix backward leaf iteration which could possibly return the same key
- fix assertion when device add and balance race for exclusive
operation
- fix regression when freeing device, state tree would leak after
device replace
- fix attempt to clear space cache v1 when block-group-tree is enabled
- fix potential i_size corruption when encoded write races with send v2
and enabled no-holes (the race is hard to hit though, the window is a
few instructions wide)
- fix wrong bitmap API use when checking empty zones, parameters were
swapped but not causing a bug due to other code
- prevent potential qgroup leak if subvolume create does not commit
transaction (which is pending in the development queue)
- error handling and reporting:
- abort transaction when sibling keys check fails for leaves
- print extent buffers when sibling keys check fails
* tag 'for-6.4-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: don't free qgroup space unless specified
btrfs: fix encoded write i_size corruption with no-holes
btrfs: zoned: fix wrong use of bitops API in btrfs_ensure_empty_zones
btrfs: properly reject clear_cache and v1 cache for block-group-tree
btrfs: print extent buffers when sibling keys check fails
btrfs: abort transaction when sibling keys check fails for leaves
btrfs: fix leak of source device allocation state after device replace
btrfs: fix assertion of exclop condition when starting balance
btrfs: fix btrfs_prev_leaf() to not return the same key twice
|
||
|
|
d246331b78 |
btrfs: don't free qgroup space unless specified
Boris noticed in his simple quotas testing that he was getting a leak
with Sweet Tea's change to subvol create that stopped doing a
transaction commit. This was just a side effect of that change.
In the delayed inode code we have an optimization that will free extra
reservations if we think we can pack a dir item into an already modified
leaf. Previously this wouldn't be triggered in the subvolume create
case because we'd commit the transaction, it was still possible but
much harder to trigger. It could actually be triggered if we did a
mkdir && subvol create with qgroups enabled.
This occurs because in btrfs_insert_delayed_dir_index(), which gets
called when we're adding the dir item, we do the following:
btrfs_block_rsv_release(fs_info, trans->block_rsv, bytes, NULL);
if we're able to skip reserving space.
The problem here is that trans->block_rsv points at the temporary block
rsv for the subvolume create, which has qgroup reservations in the block
rsv.
This is a problem because btrfs_block_rsv_release() will do the
following:
if (block_rsv->qgroup_rsv_reserved >= block_rsv->qgroup_rsv_size) {
qgroup_to_release = block_rsv->qgroup_rsv_reserved -
block_rsv->qgroup_rsv_size;
block_rsv->qgroup_rsv_reserved = block_rsv->qgroup_rsv_size;
}
The temporary block rsv just has ->qgroup_rsv_reserved set,
->qgroup_rsv_size == 0. The optimization in
btrfs_insert_delayed_dir_index() sets ->qgroup_rsv_reserved = 0. Then
later on when we call btrfs_subvolume_release_metadata() which has
btrfs_block_rsv_release(fs_info, rsv, (u64)-1, &qgroup_to_release);
btrfs_qgroup_convert_reserved_meta(root, qgroup_to_release);
qgroup_to_release is set to 0, and we do not convert the reserved
metadata space.
The problem here is that the block rsv code has been unconditionally
messing with ->qgroup_rsv_reserved, because the main place this is used
is delalloc, and any time we call btrfs_block_rsv_release() we do it
with qgroup_to_release set, and thus do the proper accounting.
The subvolume code is the only other code that uses the qgroup
reservation stuff, but it's intermingled with the above optimization,
and thus was getting its reservation freed out from underneath it and
thus leaking the reserved space.
The solution is to simply not mess with the qgroup reservations if we
don't have qgroup_to_release set. This works with the existing code as
anything that messes with the delalloc reservations always have
qgroup_to_release set. This fixes the leak that Boris was observing.
Reviewed-by: Qu Wenruo <wqu@suse.com>
CC: stable@vger.kernel.org # 5.4+
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
||
|
|
e7db9e5c6b |
btrfs: fix encoded write i_size corruption with no-holes
We have observed a btrfs filesystem corruption on workloads using
no-holes and encoded writes via send stream v2. The symptom is that a
file appears to be truncated to the end of its last aligned extent, even
though the final unaligned extent and even the file extent and otherwise
correctly updated inode item have been written.
So if we were writing out a 1MiB+X file via 8 128K extents and one
extent of length X, i_size would be set to 1MiB, but the ninth extent,
nbyte, etc. would all appear correct otherwise.
The source of the race is a narrow (one line of code) window in which a
no-holes fs has read in an updated i_size, but has not yet set a shared
disk_i_size variable to write. Therefore, if two ordered extents run in
parallel (par for the course for receive workloads), the following
sequence can play out: (following "threads" a bit loosely, since there
are callbacks involved for endio but extra threads aren't needed to
cause the issue)
ENC-WR1 (second to last) ENC-WR2 (last)
------- -------
btrfs_do_encoded_write
set i_size = 1M
submit bio B1 ending at 1M
endio B1
btrfs_inode_safe_disk_i_size_write
local i_size = 1M
falls off a cliff for some reason
btrfs_do_encoded_write
set i_size = 1M+X
submit bio B2 ending at 1M+X
endio B2
btrfs_inode_safe_disk_i_size_write
local i_size = 1M+X
disk_i_size = 1M+X
disk_i_size = 1M
btrfs_delayed_update_inode
btrfs_delayed_update_inode
And the delayed inode ends up filled with nbytes=1M+X and isize=1M, and
writes respect i_size and present a corrupted file missing its last
extents.
Fix this by holding the inode lock in the no-holes case so that a thread
can't sneak in a write to disk_i_size that gets overwritten with an out
of date i_size.
Fixes:
|
||
|
|
631003e233 |
btrfs: zoned: fix wrong use of bitops API in btrfs_ensure_empty_zones
find_next_bit and find_next_zero_bit take @size as the second parameter and
@offset as the third parameter. They are specified opposite in
btrfs_ensure_empty_zones(). Thanks to the later loop, it never failed to
detect the empty zones. Fix them and (maybe) return the result a bit
faster.
Note: the naming is a bit confusing, size has two meanings here, bitmap
and our range size.
Fixes:
|
||
|
|
64b5d5b285 |
btrfs: properly reject clear_cache and v1 cache for block-group-tree
[BUG] With block-group-tree feature enabled, mounting it with clear_cache would cause the following transaction abort at mount or remount: BTRFS info (device dm-4): force clearing of disk cache BTRFS info (device dm-4): using free space tree BTRFS info (device dm-4): auto enabling async discard BTRFS info (device dm-4): clearing free space tree BTRFS info (device dm-4): clearing compat-ro feature flag for FREE_SPACE_TREE (0x1) BTRFS info (device dm-4): clearing compat-ro feature flag for FREE_SPACE_TREE_VALID (0x2) BTRFS error (device dm-4): block-group-tree feature requires fres-space-tree and no-holes BTRFS error (device dm-4): super block corruption detected before writing it to disk BTRFS: error (device dm-4) in write_all_supers:4288: errno=-117 Filesystem corrupted (unexpected superblock corruption detected) BTRFS warning (device dm-4: state E): Skipping commit of aborted transaction. [CAUSE] For block-group-tree feature, we have an artificial dependency on free-space-tree. This means if we detect block-group-tree without v2 cache, we consider it a corruption and cause the problem. For clear_cache mount option, it would temporary disable v2 cache, then re-enable it. But unfortunately for that temporary v2 cache disabled status, we refuse to write a superblock with bg tree only flag, thus leads to the above transaction abortion. [FIX] For now, just reject clear_cache and v1 cache mount option for block group tree. So now we got a graceful rejection other than a transaction abort: BTRFS info (device dm-4): force clearing of disk cache BTRFS error (device dm-4): cannot disable free space tree with block-group-tree feature BTRFS error (device dm-4): open_ctree failed CC: stable@vger.kernel.org # 6.1+ Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
|
a2cea677db |
btrfs: print extent buffers when sibling keys check fails
When trying to move keys from one node/leaf to another sibling node/leaf, if the sibling keys check fails we just print an error message with the last key of the left sibling and the first key of the right sibling. However it's also useful to print all the keys of each sibling, as it may provide some clues to what went wrong, which code path may be inserting keys in an incorrect order. So just do that, print the siblings with btrfs_print_tree(), as it works for both leaves and nodes. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
|
9ae5afd02a |
btrfs: abort transaction when sibling keys check fails for leaves
If the sibling keys check fails before we move keys from one sibling leaf to another, we are not aborting the transaction - we leave that to some higher level caller of btrfs_search_slot() (or anything else that uses it to insert items into a b+tree). This means that the transaction abort will provide a stack trace that omits the b+tree modification call chain. So change this to immediately abort the transaction and therefore get a more useful stack trace that shows us the call chain in the bt+tree modification code. It's also important to immediately abort the transaction just in case some higher level caller is not doing it, as this indicates a very serious corruption and we should stop the possibility of doing further damage. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
|
611ccc58e1 |
btrfs: fix leak of source device allocation state after device replace
When a device replace finishes, the source device is freed by calling btrfs_free_device() at btrfs_rm_dev_replace_free_srcdev(), but the allocation state, tracked in the device's alloc_state io tree, is never freed. This is a regression recently introduced by commit |
||
|
|
ac868bc9d1 |
btrfs: fix assertion of exclop condition when starting balance
Balance as exclusive state is compatible with paused balance and device add, which makes some things more complicated. The assertion of valid states when starting from paused balance needs to take into account two more states, the combinations can be hit when there are several threads racing to start balance and device add. This won't typically happen when the commands are started from command line. Scenario 1: With exclusive_operation state == BTRFS_EXCLOP_NONE. Concurrently adding multiple devices to the same mount point and btrfs_exclop_finish executed finishes before assertion in btrfs_exclop_balance, exclusive_operation will changed to BTRFS_EXCLOP_NONE state which lead to assertion failed: fs_info->exclusive_operation == BTRFS_EXCLOP_BALANCE || fs_info->exclusive_operation == BTRFS_EXCLOP_DEV_ADD, in fs/btrfs/ioctl.c:456 Call Trace: <TASK> btrfs_exclop_balance+0x13c/0x310 ? memdup_user+0xab/0xc0 ? PTR_ERR+0x17/0x20 btrfs_ioctl_add_dev+0x2ee/0x320 btrfs_ioctl+0x9d5/0x10d0 ? btrfs_ioctl_encoded_write+0xb80/0xb80 __x64_sys_ioctl+0x197/0x210 do_syscall_64+0x3c/0xb0 entry_SYSCALL_64_after_hwframe+0x63/0xcd Scenario 2: With exclusive_operation state == BTRFS_EXCLOP_BALANCE_PAUSED. Concurrently adding multiple devices to the same mount point and btrfs_exclop_balance executed finish before the latter thread execute assertion in btrfs_exclop_balance, exclusive_operation will changed to BTRFS_EXCLOP_BALANCE_PAUSED state which lead to assertion failed: fs_info->exclusive_operation == BTRFS_EXCLOP_BALANCE || fs_info->exclusive_operation == BTRFS_EXCLOP_DEV_ADD || fs_info->exclusive_operation == BTRFS_EXCLOP_NONE, fs/btrfs/ioctl.c:458 Call Trace: <TASK> btrfs_exclop_balance+0x240/0x410 ? memdup_user+0xab/0xc0 ? PTR_ERR+0x17/0x20 btrfs_ioctl_add_dev+0x2ee/0x320 btrfs_ioctl+0x9d5/0x10d0 ? btrfs_ioctl_encoded_write+0xb80/0xb80 __x64_sys_ioctl+0x197/0x210 do_syscall_64+0x3c/0xb0 entry_SYSCALL_64_after_hwframe+0x63/0xcd An example of the failed assertion is below, which shows that the paused balance is also needed to be checked. root@syzkaller:/home/xsk# ./repro Failed to add device /dev/vda, errno 14 Failed to add device /dev/vda, errno 14 Failed to add device /dev/vda, errno 14 Failed to add device /dev/vda, errno 14 Failed to add device /dev/vda, errno 14 Failed to add device /dev/vda, errno 14 Failed to add device /dev/vda, errno 14 Failed to add device /dev/vda, errno 14 Failed to add device /dev/vda, errno 14 [ 416.611428][ T7970] BTRFS info (device loop0): fs_info exclusive_operation: 0 Failed to add device /dev/vda, errno 14 [ 416.613973][ T7971] BTRFS info (device loop0): fs_info exclusive_operation: 3 Failed to add device /dev/vda, errno 14 [ 416.615456][ T7972] BTRFS info (device loop0): fs_info exclusive_operation: 3 Failed to add device /dev/vda, errno 14 [ 416.617528][ T7973] BTRFS info (device loop0): fs_info exclusive_operation: 3 Failed to add device /dev/vda, errno 14 [ 416.618359][ T7974] BTRFS info (device loop0): fs_info exclusive_operation: 3 Failed to add device /dev/vda, errno 14 [ 416.622589][ T7975] BTRFS info (device loop0): fs_info exclusive_operation: 3 Failed to add device /dev/vda, errno 14 [ 416.624034][ T7976] BTRFS info (device loop0): fs_info exclusive_operation: 3 Failed to add device /dev/vda, errno 14 [ 416.626420][ T7977] BTRFS info (device loop0): fs_info exclusive_operation: 3 Failed to add device /dev/vda, errno 14 [ 416.627643][ T7978] BTRFS info (device loop0): fs_info exclusive_operation: 3 Failed to add device /dev/vda, errno 14 [ 416.629006][ T7979] BTRFS info (device loop0): fs_info exclusive_operation: 3 [ 416.630298][ T7980] BTRFS info (device loop0): fs_info exclusive_operation: 3 Failed to add device /dev/vda, errno 14 Failed to add device /dev/vda, errno 14 [ 416.632787][ T7981] BTRFS info (device loop0): fs_info exclusive_operation: 3 Failed to add device /dev/vda, errno 14 [ 416.634282][ T7982] BTRFS info (device loop0): fs_info exclusive_operation: 3 Failed to add device /dev/vda, errno 14 [ 416.636202][ T7983] BTRFS info (device loop0): fs_info exclusive_operation: 3 [ 416.637012][ T7984] BTRFS info (device loop0): fs_info exclusive_operation: 1 Failed to add device /dev/vda, errno 14 [ 416.637759][ T7984] assertion failed: fs_info->exclusive_operation == BTRFS_EXCLOP_BALANCE || fs_info->exclusive_operation == BTRFS_EXCLOP_DEV_ADD || fs_info->exclusive_operation == BTRFS_EXCLOP_NONE, in fs/btrfs/ioctl.c:458 [ 416.639845][ T7984] invalid opcode: 0000 [#1] PREEMPT SMP KASAN [ 416.640485][ T7984] CPU: 0 PID: 7984 Comm: repro Not tainted 6.2.0 #7 [ 416.641172][ T7984] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1ubuntu1.1 04/01/2014 [ 416.642090][ T7984] RIP: 0010:btrfs_assertfail+0x2c/0x2e [ 416.644423][ T7984] RSP: 0018:ffffc90003ea7e28 EFLAGS: 00010282 [ 416.645018][ T7984] RAX: 00000000000000cc RBX: 0000000000000000 RCX: 0000000000000000 [ 416.645763][ T7984] RDX: ffff88801d030000 RSI: ffffffff81637e7c RDI: fffff520007d4fb7 [ 416.646554][ T7984] RBP: ffffffff8a533de0 R08: 00000000000000cc R09: 0000000000000000 [ 416.647299][ T7984] R10: 0000000000000001 R11: 0000000000000001 R12: ffffffff8a533da0 [ 416.648041][ T7984] R13: 00000000000001ca R14: 000000005000940a R15: 0000000000000000 [ 416.648785][ T7984] FS: 00007fa2985d4640(0000) GS:ffff88802cc00000(0000) knlGS:0000000000000000 [ 416.649616][ T7984] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 416.650238][ T7984] CR2: 0000000000000000 CR3: 0000000018e5e000 CR4: 0000000000750ef0 [ 416.650980][ T7984] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 416.651725][ T7984] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 416.652502][ T7984] PKRU: 55555554 [ 416.652888][ T7984] Call Trace: [ 416.653241][ T7984] <TASK> [ 416.653527][ T7984] btrfs_exclop_balance+0x240/0x410 [ 416.654036][ T7984] ? memdup_user+0xab/0xc0 [ 416.654465][ T7984] ? PTR_ERR+0x17/0x20 [ 416.654874][ T7984] btrfs_ioctl_add_dev+0x2ee/0x320 [ 416.655380][ T7984] btrfs_ioctl+0x9d5/0x10d0 [ 416.655822][ T7984] ? btrfs_ioctl_encoded_write+0xb80/0xb80 [ 416.656400][ T7984] __x64_sys_ioctl+0x197/0x210 [ 416.656874][ T7984] do_syscall_64+0x3c/0xb0 [ 416.657346][ T7984] entry_SYSCALL_64_after_hwframe+0x63/0xcd [ 416.657922][ T7984] RIP: 0033:0x4546af [ 416.660170][ T7984] RSP: 002b:00007fa2985d4150 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 [ 416.660972][ T7984] RAX: ffffffffffffffda RBX: 00007fa2985d4640 RCX: 00000000004546af [ 416.661714][ T7984] RDX: 0000000000000000 RSI: 000000005000940a RDI: 0000000000000003 [ 416.662449][ T7984] RBP: 00007fa2985d41d0 R08: 0000000000000000 R09: 00007ffee37a4c4f [ 416.663195][ T7984] R10: 0000000000000000 R11: 0000000000000246 R12: 00007fa2985d4640 [ 416.663951][ T7984] R13: 0000000000000009 R14: 000000000041b320 R15: 00007fa297dd4000 [ 416.664703][ T7984] </TASK> [ 416.665040][ T7984] Modules linked in: [ 416.665590][ T7984] ---[ end trace 0000000000000000 ]--- [ 416.666176][ T7984] RIP: 0010:btrfs_assertfail+0x2c/0x2e [ 416.668775][ T7984] RSP: 0018:ffffc90003ea7e28 EFLAGS: 00010282 [ 416.669425][ T7984] RAX: 00000000000000cc RBX: 0000000000000000 RCX: 0000000000000000 [ 416.670235][ T7984] RDX: ffff88801d030000 RSI: ffffffff81637e7c RDI: fffff520007d4fb7 [ 416.671050][ T7984] RBP: ffffffff8a533de0 R08: 00000000000000cc R09: 0000000000000000 [ 416.671867][ T7984] R10: 0000000000000001 R11: 0000000000000001 R12: ffffffff8a533da0 [ 416.672685][ T7984] R13: 00000000000001ca R14: 000000005000940a R15: 0000000000000000 [ 416.673501][ T7984] FS: 00007fa2985d4640(0000) GS:ffff88802cc00000(0000) knlGS:0000000000000000 [ 416.674425][ T7984] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 416.675114][ T7984] CR2: 0000000000000000 CR3: 0000000018e5e000 CR4: 0000000000750ef0 [ 416.675933][ T7984] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 416.676760][ T7984] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Link: https://lore.kernel.org/linux-btrfs/20230324031611.98986-1-xiaoshoukui@gmail.com/ CC: stable@vger.kernel.org # 6.1+ Signed-off-by: xiaoshoukui <xiaoshoukui@ruijie.com.cn> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
|
6f932d4ef0 |
btrfs: fix btrfs_prev_leaf() to not return the same key twice
A call to btrfs_prev_leaf() may end up returning a path that points to the
same item (key) again. This happens if while btrfs_prev_leaf(), after we
release the path, a concurrent insertion happens, which moves items off
from a sibling into the front of the previous leaf, and an item with the
computed previous key does not exists.
For example, suppose we have the two following leaves:
Leaf A
-------------------------------------------------------------
| ... key (300 96 10) key (300 96 15) key (300 96 16) |
-------------------------------------------------------------
slot 20 slot 21 slot 22
Leaf B
-------------------------------------------------------------
| key (300 96 20) key (300 96 21) key (300 96 22) ... |
-------------------------------------------------------------
slot 0 slot 1 slot 2
If we call btrfs_prev_leaf(), from btrfs_previous_item() for example, with
a path pointing to leaf B and slot 0 and the following happens:
1) At btrfs_prev_leaf() we compute the previous key to search as:
(300 96 19), which is a key that does not exists in the tree;
2) Then we call btrfs_release_path() at btrfs_prev_leaf();
3) Some other task inserts a key at leaf A, that sorts before the key at
slot 20, for example it has an objectid of 299. In order to make room
for the new key, the key at slot 22 is moved to the front of leaf B.
This happens at push_leaf_right(), called from split_leaf().
After this leaf B now looks like:
--------------------------------------------------------------------------------
| key (300 96 16) key (300 96 20) key (300 96 21) key (300 96 22) ... |
--------------------------------------------------------------------------------
slot 0 slot 1 slot 2 slot 3
4) At btrfs_prev_leaf() we call btrfs_search_slot() for the computed
previous key: (300 96 19). Since the key does not exists,
btrfs_search_slot() returns 1 and with a path pointing to leaf B
and slot 1, the item with key (300 96 20);
5) This makes btrfs_prev_leaf() return a path that points to slot 1 of
leaf B, the same key as before it was called, since the key at slot 0
of leaf B (300 96 16) is less than the computed previous key, which is
(300 96 19);
6) As a consequence btrfs_previous_item() returns a path that points again
to the item with key (300 96 20).
For some users of btrfs_prev_leaf() or btrfs_previous_item() this may not
be functional a problem, despite not making sense to return a new path
pointing again to the same item/key. However for a caller such as
tree-log.c:log_dir_items(), this has a bad consequence, as it can result
in not logging some dir index deletions in case the directory is being
logged without holding the inode's VFS lock (logging triggered while
logging a child inode for example) - for the example scenario above, in
case the dir index keys 17, 18 and 19 were deleted in the current
transaction.
CC: stable@vger.kernel.org # 4.14+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
||
|
|
85d7ab2463 |
for-6.4-tag
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmRHC3gACgkQxWXV+ddt
WDvI/A//ZzREEE0wNexbuidoTacDVXVJ6LBb2K1eP+HUKfsmd6GYWQDJ9x/ExpKb
T1ehLibCYWLeYxEREFbjXI3x9G8mrvLzvzsqXs/MzJPkmEF1igPddFztidBwvLQH
ey/Bh+cra2bpVhRhkX0Cf09/q/YWp17/d14ZxxW60PMfyhx8RWXejXhHkulOPVv8
+3FL8E0kc2Zjx9ioUwOy/i18LR6YzsCNVXoHzUZuWyWM4A7NG2TZR6FhuLSjlWSZ
3RAnROwr+8i5nR0xchcyYaVMO2LMbqH6mBtHnXCtxCr+4pFrfrvKym+CQco/Xriz
v1y/xDc23XeYXLCVhb0beJ6uRcjaM9+gvDF1oVBSJEv6V7sQr/tEGo/8QRehfEfT
FTro7Lf89R1GOa1IBSkv/T5S25d9LlIID3/g7PbcUBtXNKvLAjDAGTH9bzL4HS5x
/MKwN80GvaGs1KyEfUndbVPIpAwNFDYZPHM7nw1x+JTkIBcHgfjRyAMAC9jrJd0D
730W04c+0nXZtQGtKKsxc3U8y4ewzSJAKx9t7Vgo7+1P6dSRnzvJee3x/5kXV9Yn
MhxxzYDfIN9EcWbASdSm11gY5WZdG3an609pO7nc1T2K4Tuo0SPs4xOR7c3xuZrY
MN5z3QFWyI2ustUuTG+nsd5J81j76DEmj5ymWQfG3SBplTneDM0=
=Jt7p
-----END PGP SIGNATURE-----
Merge tag 'for-6.4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs updates from David Sterba:
"Mostly core changes and cleanups, some notable fixes and two
performance improvements in directory logging.
The IO path cleanups are removing or refactoring old code, scrub main
loop has been completely rewritten also refactoring old code.
There are some changes to non-btrfs code, mostly trivial, the cgroup
punt bio logic is only moved from generic code.
Performance improvements:
- improve logging changes in a directory during one transaction,
avoid iterating over items and reduce lock contention (fsync time
4x lower)
- when logging directory entries during one transaction, reduce
locking of subvolume trees by checking tree-log instead
(improvement in throughput and latency for concurrent access to a
subvolume)
Notable fixes:
- dev-replace:
- properly honor read mode when requested to avoid reading from
source device
- target device won't be used for eventual read repair, this is
unreliable for NODATASUM files
- when there are unpaired (and unrepairable) metadata during
replace, exit early with error and don't try to finish whole
operation
- scrub ioctl properly rejects unknown flags
- fix global block reserve calculations
- fix partial direct io write when there's a page fault in the
middle, iomap will try to continue with partial request but the
btrfs part did not match that, this can lead to zeros written
instead of data
Core changes:
- io path:
- continued cleanups and refactoring around bio handling
- extent io submit path simplifications and cleanups
- flush write path simplifications and cleanups
- rework logic of passing sync mode of bio, with further cleanups
- rewrite scrub code flow, restructure how the stripes are enumerated
and verified in a more unified way
- allow to set lower threshold for block group reclaim in debug mode
to aid zoned mode testing
- remove obsolete time-based delayed ref throttling logic when
truncating items
- DREW locks are not using percpu variables anymore
- more warning fixes (-Wmaybe-uninitialized)
- u64 division simplifications
- error handling improvements
Non-btrfs code changes:
- push cgroup punt bio logic to btrfs code (there was no other user
of that), the functionality can be now selected separately by
BLK_CGROUP_PUNT_BIO
- crc32c_impl removed after removing last uses in btrfs code
- add btrfs_assertfail() to objtool table"
* tag 'for-6.4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (147 commits)
btrfs: mark btrfs_assertfail() __noreturn
btrfs: fix uninitialized variable warnings
btrfs: use log root when iterating over index keys when logging directory
btrfs: avoid iterating over all indexes when logging directory
btrfs: dev-replace: error out if we have unrepaired metadata error during
btrfs: remove pointless loop at btrfs_get_next_valid_item()
btrfs: scrub: reject unsupported scrub flags
btrfs: reinterpret async discard iops_limit=0 as no delay
btrfs: set default discard iops_limit to 1000
btrfs: remove unused raid56 functions which were dedicated for scrub
btrfs: scrub: remove scrub_bio structure
btrfs: scrub: remove scrub_block and scrub_sector structures
btrfs: scrub: remove the old scrub recheck code
btrfs: scrub: remove the old writeback infrastructure
btrfs: scrub: remove scrub_parity structure
btrfs: scrub: use scrub_stripe to implement RAID56 P/Q scrub
btrfs: scrub: switch scrub_simple_mirror() to scrub_stripe infrastructure
btrfs: scrub: introduce helper to queue a stripe for scrub
btrfs: scrub: introduce error reporting functionality for scrub_stripe
btrfs: scrub: introduce a writeback helper for scrub_stripe
...
|
||
|
|
7bcff5a396 |
v6.4/vfs.acl
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZEEhwgAKCRCRxhvAZXjc
otwgAQDXHnKiPm/d76lITXbxdUNCtvZz+ig26EbOrD+vEszzIQEA81dru0QbCNCt
ctoZdcsmtKbt2VaYQF1CDOhlnNg5VQM=
=pER1
-----END PGP SIGNATURE-----
Merge tag 'v6.4/vfs.acl' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull acl updates from Christian Brauner:
"After finishing the introduction of the new posix acl api last cycle
the generic POSIX ACL xattr handlers are still around in the
filesystems xattr handlers for two reasons:
(1) Because a few filesystems rely on the ->list() method of the
generic POSIX ACL xattr handlers in their ->listxattr() inode
operation.
(2) POSIX ACLs are only available if IOP_XATTR is raised. The
IOP_XATTR flag is raised in inode_init_always() based on whether
the sb->s_xattr pointer is non-NULL. IOW, the registered xattr
handlers of the filesystem are used to raise IOP_XATTR. Removing
the generic POSIX ACL xattr handlers from all filesystems would
risk regressing filesystems that only implement POSIX ACL support
and no other xattrs (nfs3 comes to mind).
This contains the work to decouple POSIX ACLs from the IOP_XATTR flag
as they don't depend on xattr handlers anymore. So it's now possible
to remove the generic POSIX ACL xattr handlers from the sb->s_xattr
list of all filesystems. This is a crucial step as the generic POSIX
ACL xattr handlers aren't used for POSIX ACLs anymore and POSIX ACLs
don't depend on the xattr infrastructure anymore.
Adressing problem (1) will require more long-term work. It would be
best to get rid of the ->list() method of xattr handlers completely at
some point.
For erofs, ext{2,4}, f2fs, jffs2, ocfs2, and reiserfs the nop POSIX
ACL xattr handler is kept around so they can continue to use
array-based xattr handler indexing.
This update does simplify the ->listxattr() implementation of all
these filesystems however"
* tag 'v6.4/vfs.acl' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
acl: don't depend on IOP_XATTR
ovl: check for ->listxattr() support
reiserfs: rework priv inode handling
fs: rename generic posix acl handlers
reiserfs: rework ->listxattr() implementation
fs: simplify ->listxattr() implementation
fs: drop unused posix acl handlers
xattr: remove unused argument
xattr: add listxattr helper
xattr: simplify listxattr helpers
|
||
|
|
b9dff2195f |
iter-ubuf.2-2023-04-21
-----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmRCvdsQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpg4oD/457EJ21Fm36NuyT/S0Cr8ok9Tdk7t9BeBh V/9CYThoXr5aqAox0Vq23FF+Rhzm81GzwYERN4493LBblliNeNOo2IaXF9/7qrUW 11v9Bkug2J3k3hRGtEa6Zl0EpMu+FRLsNpchjFS2KPuOq+iMDxrvwuy50kidWg7n r25e4UwpExVO9fIoUSmzgWVfRHOTuj9yiG/UsaH2+2BRXerIX0Q1tyElwmcGh25M Ad2hN+yDnuIbNA5gNUpnzY32Dp0zjAsquc//QOvq9mltcNTElokB8idGliismvyd 8qF0lkwQwewOBT/sSD5EY3K0Qd8IJu425bvT/yPUDScHz1chxHUoxo5eisIr2M9l 5AL5KHAf7Zzs8ZuV+IYPzZ5qM6a/vF3mHUisKRNKYVhF46Nmd4cBratfXwWb1MxV clQM2qr0TLOYli9mOeTXph3hg/rBVqKqf90boAZoN8b2tWBKlMykpqRadbepjrgx bmBSwwAF99NxIHEjU3U5DMdUloCSiMZIfMfDxQrPNDrfWAW4xJs5Ym0VeOjEotTt oFEs1fr6c3Mn7KEuPPfOtnDxvs51IP/B8+gDgMt/edf+wHiCU1Zm31u2gxt2dsKh g73Y92i5SHjIf36H5szBTeioyMy1E1VA9HF14xWz2eKdQ+wxQ9VNWoctcJ85k3F4 6AZDYRIrWA== =EaE9 -----END PGP SIGNATURE----- Merge tag 'iter-ubuf.2-2023-04-21' of git://git.kernel.dk/linux Pull ITER_UBUF updates from Jens Axboe: "This turns singe vector imports into ITER_UBUF, rather than ITER_IOVEC. The former is more trivial to iterate and advance, and hence a bit more efficient. From some very unscientific testing, ~60% of all iovec imports are single vector" * tag 'iter-ubuf.2-2023-04-21' of git://git.kernel.dk/linux: iov_iter: Mark copy_compat_iovec_from_user() noinline iov_iter: import single vector iovecs as ITER_UBUF iov_iter: convert import_single_range() to ITER_UBUF iov_iter: overlay struct iovec and ubuf/len iov_iter: set nr_segs = 1 for ITER_UBUF iov_iter: remove iov_iter_iovec() iov_iter: add iter_iov_addr() and iter_iov_len() helpers ALSA: pcm: check for user backed iterator, not specific iterator type IB/qib: check for user backed iterator, not specific iterator type IB/hfi1: check for user backed iterator, not specific iterator type iov_iter: add iter_iovec() helper block: ensure bio_alloc_map_data() deals with ITER_UBUF correctly |
||
|
|
c337b23f32 |
for-6.3-rc7-tag
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmRCmhgACgkQxWXV+ddt
WDsHrA/+KaCgixD0Z/8f2tMu2Kd5KQ6vQGMlydZzr0OvTYh3skAjTbAfTGAUHiXF
6qZOpYYEilE+xhdcTegB4fV1OPJQw8+rvRrPps9ugZEShQhHlUbIuuiSCtrILKmK
424wkllNc7NDbz5CHbbBpNNGdc6Xgyr3zy4nKZf/Sezmj+aK/nRL/JmazzUaEnxM
NC8hBq+Nrpz0ucyStiLp4jfdp5geo4hcfpXVEBuH2ZpzhBPV4usLBWwsEj6uBcTy
mpvMNHTFw/8H/k9w6GS+E/hrU5Rs5tWHTlEIz+xD1kK8DoPoE1arcgdLCzS0yC81
8MyjB2qgMp3XutVlQGwyWAalY04UfzKvQ4yUYwTKT24pToc0TmQq8YV2Sy7c7SeA
SDy+Ev1wgteeaPskhS9vMbJvnKVSzOMovt0oNR6VoPivXZ0OjVRDkC3fT2l497JL
jZB3H7JaUGxJ/du1kUQkhL2c6YnjkWsqbl1YoOUBilNXkY/Mbz8NCZZdLJia0Q41
P14w4aeD8HAYBNkOvSrDwfBQB5fR31GQq3QH/dGfJ4i41eJlNAposcOWQkV115Ib
eILV3kFxJNSCpUI7eaE2biacGxJLdiWPQDv5Oo5AETyqcoiFqjCDerZWCTgH54H2
YzzJiY/1BH8RgYbrCUyoPmyGOhoovYSVG9gLK3nXk1jqWltJgD0=
=mGL5
-----END PGP SIGNATURE-----
Merge tag 'for-6.3-rc7-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
"Two patches fixing the problem with aync discard.
The default settings had a low IOPS limit and processing a large batch
to discard would take a long time. On laptops this can cause increased
power consumption due to disk activity.
As async discard has been on by default since 6.2 this likely affects
a lot of users.
Summary:
- increase the default IOPS limit 10x which reportedly helped
- setting the sysfs IOPS value to 0 now does not throttle anymore
allowing the discards to be processed at full speed. Previously
there was an arbitrary 6 hour target for processing the pending
batch"
* tag 'for-6.3-rc7-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: reinterpret async discard iops_limit=0 as no delay
btrfs: set default discard iops_limit to 1000
|
||
|
|
ef9cddfe57 |
btrfs: reinterpret async discard iops_limit=0 as no delay
Currently, a limit of 0 results in a hard coded metering over 6 hours. Since the default is a set limit, I suspect no one truly depends on this rather arbitrary setting. Repurpose it for an arguably more useful "unlimited" mode, where the delay is 0. Note that if block groups are too new, or go fully empty, there is still a delay associated with those conditions. Those delays implement heuristics for not trimming a region we are relatively likely to fully overwrite soon. CC: stable@vger.kernel.org # 6.2+ Reviewed-by: Neal Gompa <neal@gompa.dev> Signed-off-by: Boris Burkov <boris@bur.io> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
|
e9f59429b8 |
btrfs: set default discard iops_limit to 1000
Previously, the default was a relatively conservative 10. This results in a 100ms delay, so with ~300 discards in a commit, it takes the full 30s till the next commit to finish the discards. On a workstation, this results in the disk never going idle, wasting power/battery, etc. Set the default to 1000, which results in using the smallest possible delay, currently, which is 1ms. This has shown to not pathologically keep the disk busy by the original reporter. Link: https://lore.kernel.org/linux-btrfs/Y%2F+n1wS%2F4XAH7X1p@nz/ Link: https://bugzilla.redhat.com/show_bug.cgi?id=2182228 CC: stable@vger.kernel.org # 6.2+ Reviewed-by: Neal Gompa <neal@gompa.dev Signed-off-by: Boris Burkov <boris@bur.io> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
|
f372463124 |
btrfs: mark btrfs_assertfail() __noreturn
Fixes a bunch of warnings including: vmlinux.o: warning: objtool: select_reloc_root+0x314: unreachable instruction vmlinux.o: warning: objtool: finish_inode_if_needed+0x15b1: unreachable instruction vmlinux.o: warning: objtool: get_bio_sector_nr+0x259: unreachable instruction vmlinux.o: warning: objtool: raid_wait_read_end_io+0xc26: unreachable instruction vmlinux.o: warning: objtool: raid56_parity_alloc_scrub_rbio+0x37b: unreachable instruction ... Reported-by: kernel test robot <lkp@intel.com> Link: https://lore.kernel.org/oe-kbuild-all/202302210709.IlXfgMpX-lkp@intel.com/ Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
|
8ba7d5f5ba |
btrfs: fix uninitialized variable warnings
There are some warnings on older compilers (gcc 10, 7) or non-x86_64
architectures (aarch64). As btrfs wants to enable -Wmaybe-uninitialized
by default, fix the warnings even though it's not necessary on recent
compilers (gcc 12+).
../fs/btrfs/volumes.c: In function ‘btrfs_init_new_device’:
../fs/btrfs/volumes.c:2703:3: error: ‘seed_devices’ may be used uninitialized in this function [-Werror=maybe-uninitialized]
2703 | btrfs_setup_sprout(fs_info, seed_devices);
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
../fs/btrfs/send.c: In function ‘get_cur_inode_state’:
../include/linux/compiler.h:70:32: error: ‘right_gen’ may be used uninitialized in this function [-Werror=maybe-uninitialized]
70 | (__if_trace.miss_hit[1]++,1) : \
| ^
../fs/btrfs/send.c:1878:6: note: ‘right_gen’ was declared here
1878 | u64 right_gen;
| ^~~~~~~~~
Reported-by: k2ci <kernel-bot@kylinos.cn>
Signed-off-by: Genjian Zhang <zhanggenjian@kylinos.cn>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
|
||
|
|
5d3e4f1d51 |
btrfs: use log root when iterating over index keys when logging directory
When logging dir dentries of a directory, we iterate over the subvolume
tree to find dir index keys on leaves modified in the current transaction.
This however is heavy on locking, since btrfs_search_forward() may often
keep locks on extent buffers for quite a while when walking the tree to
find a suitable leaf modified in the current transaction and with a key
not smaller than then the provided minimum key. That means it will block
other tasks trying to access the subvolume tree, which may be common fs
operations like creating, renaming, linking, unlinking, reflinking files,
etc.
A better solution is to iterate the log tree, since it's much smaller than
a subvolume tree and just use plain btrfs_search_slot() (or the wrapper
btrfs_for_each_slot()) and only contains dir index keys added in the
current transaction.
The following bonnie++ test on a non-debug kernel (with Debian's default
kernel config) on a 20G null block device, was used to measure the impact:
$ cat test.sh
#!/bin/bash
DEV=/dev/nullb0
MNT=/mnt/nullb0
NR_DIRECTORIES=20
NR_FILES=20480 # must be a multiple of 1024
DATASET_SIZE=$(( (8 * 1024 * 1024 * 1024) / 1048576 )) # 8 GiB as megabytes
DIRECTORY_SIZE=$(( DATASET_SIZE / NR_FILES ))
NR_FILES=$(( NR_FILES / 1024 ))
umount $DEV &> /dev/null
mkfs.btrfs -f $DEV
mount $DEV $MNT
bonnie++ -u root -d $MNT \
-n $NR_FILES:$DIRECTORY_SIZE:$DIRECTORY_SIZE:$NR_DIRECTORIES \
-r 0 -s $DATASET_SIZE -b
umount $MNT
Before patchset:
Version 2.00a ------Sequential Output------ --Sequential Input- --Random-
-Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Name:Size etc /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP
debian0 8G 376k 99 1.1g 98 939m 92 1527k 99 3.2g 99 9060 256
Latency 24920us 207us 680ms 5594us 171us 2891us
Version 2.00a ------Sequential Create------ --------Random Create--------
debian0 -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
files /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP
20/20 20480 96 +++++ +++ 20480 95 20480 99 +++++ +++ 20480 97
Latency 8708us 137us 5128us 6743us 60us 19712us
After patchset:
Version 2.00a ------Sequential Output------ --Sequential Input- --Random-
-Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Name:Size etc /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP
debian0 8G 384k 99 1.2g 99 971m 91 1533k 99 3.3g 99 9180 309
Latency 24930us 125us 661ms 5587us 46us 2020us
Version 2.00a ------Sequential Create------ --------Random Create--------
debian0 -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
files /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP
20/20 20480 90 +++++ +++ 20480 99 20480 99 +++++ +++ 20480 97
Latency 7030us 61us 1246us 4942us 56us 16855us
The patchset consists of this patch plus a previous one that has the
following subject:
"btrfs: avoid iterating over all indexes when logging directory"
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
||
|
|
fa4b8cb173 |
btrfs: avoid iterating over all indexes when logging directory
When logging a directory, after copying all directory index items from the
subvolume tree to the log tree, we iterate over the subvolume tree to find
all dir index items that are located in leaves COWed (or created) in the
current transaction. If we keep logging a directory several times during
the same transaction, we end up iterating over the same dir index items
everytime we log the directory, wasting time and adding extra lock
contention on the subvolume tree.
So just keep track of the last logged dir index offset in order to start
the search for that index (+1) the next time the directory is logged, as
dir index values (key offsets) come from a monotonically increasing
counter.
The following test measures the difference before and after this change:
$ cat test.sh
#!/bin/bash
DEV=/dev/nullb0
MNT=/mnt/nullb0
umount $DEV &> /dev/null
mkfs.btrfs -f $DEV
mount -o ssd $DEV $MNT
# Time values in milliseconds.
declare -a fsync_times
# Total number of files added to the test directory.
num_files=1000000
# Fsync directory after every N files are added.
fsync_period=100
mkdir $MNT/testdir
fsync_total_time=0
for ((i = 1; i <= $num_files; i++)); do
echo -n > $MNT/testdir/file_$i
if [ $((i % fsync_period)) -eq 0 ]; then
start=$(date +%s%N)
xfs_io -c "fsync" $MNT/testdir
end=$(date +%s%N)
fsync_total_time=$((fsync_total_time + (end - start)))
fsync_times[i]=$(( (end - start) / 1000000 ))
echo -n -e "Progress $i / $num_files\r"
fi
done
echo -e "\nHistogram of directory fsync duration in ms:\n"
printf '%s\n' "${fsync_times[@]}" | \
perl -MStatistics::Histogram -e '@d = <>; print get_histogram(\@d);'
fsync_total_time=$((fsync_total_time / 1000000))
echo -e "\nTotal time spent in fsync: $fsync_total_time ms\n"
echo
umount $MNT
The test was run on a non-debug kernel (Debian's default kernel config)
against a 15G null block device.
Result before this change:
Histogram of directory fsync duration in ms:
Count: 10000
Range: 3.000 - 362.000; Mean: 34.556; Median: 31.000; Stddev: 25.751
Percentiles: 90th: 71.000; 95th: 77.000; 99th: 81.000
3.000 - 5.278: 1423 #################################
5.278 - 8.854: 1173 ###########################
8.854 - 14.467: 591 ##############
14.467 - 23.277: 1025 #######################
23.277 - 37.105: 1422 #################################
37.105 - 58.809: 2036 ###############################################
58.809 - 92.876: 2316 #####################################################
92.876 - 146.346: 6 |
146.346 - 230.271: 6 |
230.271 - 362.000: 2 |
Total time spent in fsync: 350527 ms
Result after this change:
Histogram of directory fsync duration in ms:
Count: 10000
Range: 3.000 - 1088.000; Mean: 8.704; Median: 8.000; Stddev: 12.576
Percentiles: 90th: 12.000; 95th: 14.000; 99th: 17.000
3.000 - 6.007: 3222 #################################
6.007 - 11.276: 5197 #####################################################
11.276 - 20.506: 1551 ################
20.506 - 36.674: 24 |
36.674 - 201.552: 1 |
201.552 - 353.841: 4 |
353.841 - 1088.000: 1 |
Total time spent in fsync: 92114 ms
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
||
|
|
8eb3dd17ea |
btrfs: dev-replace: error out if we have unrepaired metadata error during
[BUG] Even before the scrub rework, if we have some corrupted metadata failed to be repaired during replace, we still continue replacing and let it finish just as there is nothing wrong: BTRFS info (device dm-4): dev_replace from /dev/mapper/test-scratch1 (devid 1) to /dev/mapper/test-scratch2 started BTRFS warning (device dm-4): tree block 5578752 mirror 1 has bad csum, has 0x00000000 want 0xade80ca1 BTRFS warning (device dm-4): tree block 5578752 mirror 0 has bad csum, has 0x00000000 want 0xade80ca1 BTRFS warning (device dm-4): checksum error at logical 5578752 on dev /dev/mapper/test-scratch1, physical 5578752: metadata leaf (level 0) in tree 5 BTRFS warning (device dm-4): checksum error at logical 5578752 on dev /dev/mapper/test-scratch1, physical 5578752: metadata leaf (level 0) in tree 5 BTRFS error (device dm-4): bdev /dev/mapper/test-scratch1 errs: wr 0, rd 0, flush 0, corrupt 1, gen 0 BTRFS warning (device dm-4): tree block 5578752 mirror 1 has bad bytenr, has 0 want 5578752 BTRFS error (device dm-4): unable to fixup (regular) error at logical 5578752 on dev /dev/mapper/test-scratch1 BTRFS info (device dm-4): dev_replace from /dev/mapper/test-scratch1 (devid 1) to /dev/mapper/test-scratch2 finished This can lead to unexpected problems for the resulting filesystem. [CAUSE] Btrfs reuses scrub code path for dev-replace to iterate all dev extents. But unlike scrub, dev-replace doesn't really bother to check the scrub progress, which records all the errors found during replace. And even if we check the progress, we cannot really determine which errors are minor, which are critical just by the plain numbers. (remember we don't treat metadata/data checksum error differently). This behavior is there from the very beginning. [FIX] Instead of continuing the replace, just error out if we hit an unrepaired metadata sector. Now the dev-replace would be rejected with -EIO, to let the user know. Although it also means, the filesystem has some metadata error which cannot be repaired, the user would be upset anyway. The new dmesg would look like this: BTRFS info (device dm-4): dev_replace from /dev/mapper/test-scratch1 (devid 1) to /dev/mapper/test-scratch2 started BTRFS warning (device dm-4): tree block 5578752 mirror 1 has bad csum, has 0x00000000 want 0xade80ca1 BTRFS warning (device dm-4): tree block 5578752 mirror 1 has bad csum, has 0x00000000 want 0xade80ca1 BTRFS error (device dm-4): unable to fixup (regular) error at logical 5570560 on dev /dev/mapper/test-scratch1 physical 5570560 BTRFS warning (device dm-4): header error at logical 5570560 on dev /dev/mapper/test-scratch1, physical 5570560: metadata leaf (level 0) in tree 5 BTRFS warning (device dm-4): header error at logical 5570560 on dev /dev/mapper/test-scratch1, physical 5570560: metadata leaf (level 0) in tree 5 BTRFS error (device dm-4): stripe 5570560 has unrepaired metadata sector at 5578752 BTRFS error (device dm-4): btrfs_scrub_dev(/dev/mapper/test-scratch1, 1, /dev/mapper/test-scratch2) failed -5 Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
|
524f14bb11 |
btrfs: remove pointless loop at btrfs_get_next_valid_item()
It's pointless to have a while loop at btrfs_get_next_valid_item(), as if the slot on the current leaf is beyond the last item, we call btrfs_next_leaf(), which leaves us at a valid slot of the next leaf (or a valid slot in the current leaf if after releasing the path an item gets pushed from the next leaf to the current leaf). So just call btrfs_next_leaf() if the current slot on the current leaf is beyond the last item. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
|
604e6681e1 |
btrfs: scrub: reject unsupported scrub flags
Since the introduction of scrub interface, the only flag that we support is BTRFS_SCRUB_READONLY. Thus there is no sanity checks, if there are some undefined flags passed in, we just ignore them. This is problematic if we want to introduce new scrub flags, as we have no way to determine if such flags are supported. Address the problem by introducing a check for the flags, and if unsupported flags are set, return -EOPNOTSUPP to inform the user space. This check should be backported for all supported kernels before any new scrub flags are introduced. CC: stable@vger.kernel.org # 4.14+ Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
|
f263a7c3a5 |
btrfs: reinterpret async discard iops_limit=0 as no delay
Currently, a limit of 0 results in a hard coded metering over 6 hours. Since the default is a set limit, I suspect no one truly depends on this rather arbitrary setting. Repurpose it for an arguably more useful "unlimited" mode, where the delay is 0. Note that if block groups are too new, or go fully empty, there is still a delay associated with those conditions. Those delays implement heuristics for not trimming a region we are relatively likely to fully overwrite soon. CC: stable@vger.kernel.org # 6.2+ Reviewed-by: Neal Gompa <neal@gompa.dev> Signed-off-by: Boris Burkov <boris@bur.io> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
|
cfe3445a58 |
btrfs: set default discard iops_limit to 1000
Previously, the default was a relatively conservative 10. This results in a 100ms delay, so with ~300 discards in a commit, it takes the full 30s till the next commit to finish the discards. On a workstation, this results in the disk never going idle, wasting power/battery, etc. Set the default to 1000, which results in using the smallest possible delay, currently, which is 1ms. This has shown to not pathologically keep the disk busy by the original reporter. Link: https://lore.kernel.org/linux-btrfs/Y%2F+n1wS%2F4XAH7X1p@nz/ Link: https://bugzilla.redhat.com/show_bug.cgi?id=2182228 CC: stable@vger.kernel.org # 6.2+ Reviewed-by: Neal Gompa <neal@gompa.dev Signed-off-by: Boris Burkov <boris@bur.io> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
|
aca43fe839 |
btrfs: remove unused raid56 functions which were dedicated for scrub
Since the scrub rework, the following RAID56 functions are no longer called: - raid56_add_scrub_pages() - raid56_alloc_missing_rbio() - raid56_submit_missing_rbio() Those functions are all utilized by scrub to handle missing device cases for RAID56. However the new scrub code handle them in a completely different way: - If it's data stripe, go recovery path through btrfs_submit_bio() - If it's P/Q stripe, it would be handled through raid56_parity_submit_scrub_rbio() And that function would handle dev-replace and repair properly. Thus we can safely remove those functions. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
|
13a62fd997 |
btrfs: scrub: remove scrub_bio structure
Since scrub path has been fully moved to scrub_stripe based facilities, no more scrub_bio would be submitted. Thus we can remove it completely, this involves: - SCRUB_SECTORS_PER_BIO macro - SCRUB_BIOS_PER_SCTX macro - SCRUB_MAX_PAGES macro - BTRFS_MAX_MIRRORS macro - scrub_bio structure - scrub_ctx::bios member - scrub_ctx::curr member - scrub_ctx::bios_in_flight member - scrub_ctx::workers_pending member - scrub_ctx::list_lock member - scrub_ctx::list_wait member - function scrub_bio_end_io_worker() - function scrub_pending_bio_inc() - function scrub_pending_bio_dec() - function scrub_throttle() - function scrub_submit() - function scrub_find_csum() - function drop_csum_range() - Some unnecessary flush and scrub pauses Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
|
001e3fc263 |
btrfs: scrub: remove scrub_block and scrub_sector structures
Those two structures are used to represent a bunch of sectors for scrub, but now they are fully replaced by scrub_stripe in one go, so we can remove them. This involves: - structure scrub_block - structure scrub_sector - structure scrub_page_private - function attach_scrub_page_private() - function detach_scrub_page_private() Now we no longer need to use page::private to handle subpage. - function alloc_scrub_block() - function alloc_scrub_sector() - function scrub_sector_get_page() - function scrub_sector_get_page_offset() - function scrub_sector_get_kaddr() - function bio_add_scrub_sector() - function scrub_checksum_data() - function scrub_checksum_tree_block() - function scrub_checksum_super() - function scrub_check_fsid() - function scrub_block_get() - function scrub_block_put() - function scrub_sector_get() - function scrub_sector_put() - function scrub_bio_end_io() - function scrub_block_complete() - function scrub_add_sector_to_rd_bio() Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
|
e9255d6c40 |
btrfs: scrub: remove the old scrub recheck code
The old scrub code has different entrance to verify the content, and since we have removed the writeback path, now we can start removing the re-check part, including: - scrub_recover structure - scrub_sector::recover member - function scrub_setup_recheck_block() - function scrub_recheck_block() - function scrub_recheck_block_checksum() - function scrub_repair_block_group_good_copy() - function scrub_repair_sector_from_good_copy() - function scrub_is_page_on_raid56() - function full_stripe_lock() - function search_full_stripe_lock() - function get_full_stripe_logical() - function insert_full_stripe_lock() - function lock_full_stripe() - function unlock_full_stripe() - btrfs_block_group::full_stripe_locks_root member - btrfs_full_stripe_locks_tree structure This infrastructure is to ensure RAID56 scrub is properly handling recovery and P/Q scrub correctly. This is no longer needed, before P/Q scrub we will wait for all the involved data stripes to be scrubbed first, and RAID56 code has internal lock to ensure no race in the same full stripe. - function scrub_print_warning() - function scrub_get_recover() - function scrub_put_recover() - function scrub_handle_errored_block() - function scrub_setup_recheck_block() - function scrub_bio_wait_endio() - function scrub_submit_raid56_bio_wait() - function scrub_recheck_block_on_raid56() - function scrub_recheck_block() - function scrub_recheck_block_checksum() - function scrub_repair_block_from_good_copy() - function scrub_repair_sector_from_good_copy() And two more functions exported temporarily for later cleanup: - alloc_scrub_sector() - alloc_scrub_block() Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
|
16f9399349 |
btrfs: scrub: remove the old writeback infrastructure
Since the whole scrub path has been switched to scrub_stripe based solution, the old writeback path can be removed completely, which involves: - scrub_ctx::wr_curr_bio member - scrub_ctx::flush_all_writes member - function scrub_write_block_to_dev_replace() - function scrub_write_sector_to_dev_replace() - function scrub_add_sector_to_wr_bio() - function scrub_wr_submit() - function scrub_wr_bio_end_io() - function scrub_wr_bio_end_io_worker() And one more function needs to be exported temporarily: - scrub_sector_get() Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
|
5dc96f8d5d |
btrfs: scrub: remove scrub_parity structure
The structure scrub_parity is used to indicate that some extents are scrubbed for the purpose of RAID56 P/Q scrubbing. Since the whole RAID56 P/Q scrubbing path has been replaced with new scrub_stripe infrastructure, and we no longer need to use scrub_parity to modify the behavior of data stripes, we can remove it completely. This removal involves: - scrub_parity_workers Now only one worker would be utilized, scrub_workers, to do the read and repair. All writeback would happen at the main scrub thread. - scrub_block::sparity member - scrub_parity structure - function scrub_parity_get() - function scrub_parity_put() - function scrub_free_parity() - function __scrub_mark_bitmap() - function scrub_parity_mark_sectors_error() - function scrub_parity_mark_sectors_data() These helpers are no longer needed, scrub_stripe has its bitmaps and we can use bitmap helpers to get the error/data status. - scrub_parity_bio_endio() - scrub_parity_check_and_repair() - function scrub_sectors_for_parity() - function scrub_extent_for_parity() - function scrub_raid56_data_stripe_for_parity() - function scrub_raid56_parity() The new code would reuse the scrub read-repair and writeback path. Just skip the dev-replace phase. And scrub_stripe infrastructure allows us to submit and wait for those data stripes before scrubbing P/Q, without extra infrastructure. The following two functions are temporarily exported for later cleanup: - scrub_find_csum() - scrub_add_sector_to_rd_bio() Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
|
1009254bf2 |
btrfs: scrub: use scrub_stripe to implement RAID56 P/Q scrub
Implement the only missing part for scrub: RAID56 P/Q stripe scrub. The workflow is pretty straightforward for the new function, scrub_raid56_parity_stripe(): - Go through the regular scrub path for each data stripe - Wait for the verification and repair to finish - Writeback the repaired sectors to data stripes - Make sure all stripes are properly repaired If we have sectors unrepaired, we cannot continue, or we could further corrupt the P/Q stripe. - Submit the rbio for P/Q stripe The dev-replace would be handled inside raid56_parity_submit_scrub_rbio() path. - Wait for the above bio to finish Although the old code is no longer used, we still keep the declaration, as the cleanup can be several times larger than this patch itself. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
|
e02ee89baa |
btrfs: scrub: switch scrub_simple_mirror() to scrub_stripe infrastructure
Switch scrub_simple_mirror() to the new scrub_stripe infrastructure. Since scrub_simple_mirror() is the core part of scrub (only RAID56 P/Q stripes don't utilize it), we can get rid of a big chunk of code, mostly scrub_extent(), scrub_sectors() and directly called functions. There is a functionality change: - Scrub speed throttle now only affects read on the scrubbing device Writes (for repair and replace), and reads from other mirrors won't be limited by the set limits. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
|
54765392a1 |
btrfs: scrub: introduce helper to queue a stripe for scrub
The new helper, queue_scrub_stripe(), would try to queue a stripe for scrub. If all stripes are already in use, we will submit all the existing ones and wait for them to finish. Currently we would queue up to 8 stripes, to enlarge the blocksize to 512KiB to improve the performance. Sectors repaired on zoned need to be relocated instead of in-place fix. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
|
0096580713 |
btrfs: scrub: introduce error reporting functionality for scrub_stripe
The new helper, scrub_stripe_report_errors(), will report the result of the scrub to system log. The main reporting is done by introducing a new helper, scrub_print_common_warning(), which is mostly the same content from scrub_print_wanring(), but without the need for a scrub_block. Since we're reporting the errors, it's the perfect time to update the scrub stats too. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
|
058e09e6fe |
btrfs: scrub: introduce a writeback helper for scrub_stripe
Add a new helper, scrub_write_sectors(), to submit write bios for specified sectors to the target disk. There are several differences compared to read path: - Utilize btrfs_submit_scrub_write() Now we still rely on the @mirror_num based writeback, but the requirement is also a little different than regular writeback or read, thus we have to call btrfs_submit_scrub_write(). - We cannot write the full stripe back We can only write the sectors we have. There will be two call sites later, one for repaired sectors, one for all utilized sectors of dev-replace. Thus the callers should specify their own write_bitmap. This function only submit the bios, will not wait for them unless for zoned case. Caller must explicitly wait for the IO to finish. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
|
9ecb5ef543 |
btrfs: scrub: introduce the main read repair worker for scrub_stripe
The new helper, scrub_stripe_read_repair_worker(), would handle the read-repair part: - Wait for the previous submitted read IO to finish - Verify the contents of the stripe - Go through the remaining mirrors, using as large blocksize as possible At this stage, we just read out all the failed sectors from each mirror and re-verify. If no more failed sector, we can exit. - Go through all mirrors again, sector-by-sector This time, we read sector by sector, this is to address cases where one bad sector mismatches the drive's internal checksum, and cause the whole read range to fail. We put this recovery method as the last resort, as sector-by-sector reading is slow, and reading from other mirrors may have already fixed the errors. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
|
97cf8f3754 |
btrfs: scrub: introduce a helper to verify one scrub_stripe
The new helper, scrub_verify_stripe(), shares the same main workflow of the old scrub code. The major differences are: - How pages/page_offset is grabbed Everything can be grabbed from scrub_stripe easily. - When error report happens Currently the helper only verifies the sectors, not really doing any error reporting. The error reporting would be done after we have done the repair. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |