Commit Graph

1544 Commits

Author SHA1 Message Date
Linus Torvalds
c92b4d3dd5 for-7.1-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmnYSG0ACgkQxWXV+ddt
 WDumDQ/9E8ms1vZcfMwZUf48o7Z2fHnZMUy6dXKHnH72NiRrqSP2jZnhluT6qGqb
 MmmnqvmKFNfJ0J5QLZTgFz/MWzY7PQEIG8WkQ3JvT6iKO5Csa2vFzCXv1oaGWo+m
 TIw++3IS+GliKYQedgVXMYRKFc24OP95RO+Grsh8pMOXWcpSO60oSrTPyzbkdfid
 +Gv4CpSRTCCl/qQ8ZX2PRQ9tLJtR2IAnJBWkwE/MPWxFfkt0oBiauy/BoiddGwrl
 ocDn5fH2CnORwONLGPbVg0ScVNMaRFJfYVrI18N8pfT+4ZVeJFiWGiRnrqSmk8PG
 a8BT51VPZZunyGoVFZmpqOhsy8PtqpjX0ljpebY7K69fH+1ewrWVE9ovs/nZ6Hq+
 DgB9pXu2OxKdyByHfr8Pl/0A2naWOrQ0JHOGnVsEg2qDi67vy5EBUIYQbiS9uo4s
 IFdd5bA04DS0Khzp2Y8Crrc2tWootsRCcUs6oiwKgKVBoqtNbFvVHKJqfi8XZB6i
 W4/rL+F0gBVzR127TZF+tejd1jq9u6WOBRKlwkHK5DoWXiv84oLv/zdwtqinTWLs
 N7LOFfDgYwH1YNPx12tEm9DW3Ef76RlHPZiTAmG4NUphmgwkKaYOosqsX7WvrMqR
 kkeKfbsRm4M/lQDLwd8IBUloMhl2+uspxJrkNUy/31pxWxByGvk=
 =mVDJ
 -----END PGP SIGNATURE-----

Merge tag 'for-7.1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs updates from David Sterba:
 "User visible changes:

   - move shutdown ioctl support out of experimental features, a forced
     stop of filesystem operation until the next unmount; additionally
     there's a super block operation to forcibly remove a device from
     under the filesystem that could lead to a shutdown or not if the
     redundancy allows that

   - report filesystem shutdown using fserror mechanism

   - tree-checker updates:
      - verify free space info, extent and bitmap items
      - verify remap-tree items and related data in block group items

  Performance improvements:

   - speed up clearing first extent in the tracked range (+10%
     throughput on sample workload)

   - reduce COW rewrites of extent buffers during the same transaction

   - avoid taking big device lock to update device stats during
     transaction commit

   - fix unnecessary flush on close when truncating empty files
     (observed in practice on a backup application)

   - prevent direct reclaim during compressed readahead to avoid stalls
     under memory pressure

  Notable fixes:

   - fix chunk allocation strategy on RAID1-like block groups with
     disproportionate device sizes, this could lead to ENOSPC due to
     skewed reservation estimates

   - adjust metadata reservation overcommit ratio to be less aggressive
     and also try to flush if possible, this avoids ENOSPC and potential
     transaction aborts in some edge cases (that are otherwise hard to
     reproduce)

   - fix silent IO error in encoded writes and ordered extent split in
     zoned mode, the error was not correctly propagated to the address
     space and could lead to zeroed ranges

   - don't mark inline files NOCOMPRESS unexpectedly, the intent was to
     do that for single block writes of regular files

   - fix deadlock between reflink and transaction commit when using
     flushoncommit

   - fix overly strict item check of a running dev-replace operation

  Core:

   - zoned mode space reservation fixes:
      - cap delayed refs metadata reservation to avoid overcommit
      - update logic to reclaim partially unusable zones
      - add another state to flush and reclaim partially used zone
      - limit number of zones reclaimed in one go to avoid blocking
        other operations

   - don't let log trees consume global reserve on overcommit and fall
     back to transaction commit

   - revalidate extent buffer when checking its up-to-date status

   - add self tests for zoned mode block group specifics

   - reduce atomic allocations in some qgroup paths

   - avoid unnecessary root node COW during snapshotting

   - start new transaction in block group relocation conditionally

   - faster check of NOCOW files on currently snapshotted root

   - change how compressed bio size is tracked from bio and reduce the
     structure size

   - new tracepoint for search slot restart tracking

   - checksum list manipulation improvements

   - type, parameter cleanups, refactoring

   - error handling improvements, transaction abort call adjustments"

* tag 'for-7.1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (116 commits)
  btrfs: btrfs_log_dev_io_error() on all bio errors
  btrfs: fix silent IO error loss in encoded writes and zoned split
  btrfs: skip clearing EXTENT_DEFRAG for NOCOW ordered extents
  btrfs: use BTRFS_FS_UPDATE_UUID_TREE_GEN flag for UUID tree rescan check
  btrfs: remove duplicate journal_info reset on failure to commit transaction
  btrfs: tag as unlikely if statements that check for fs in error state
  btrfs: fix double free in create_space_info() error path
  btrfs: fix double free in create_space_info_sub_group() error path
  btrfs: do not reject a valid running dev-replace
  btrfs: only invalidate btree inode pages after all ebs are released
  btrfs: prevent direct reclaim during compressed readahead
  btrfs: replace BUG_ON() with error return in cache_save_setup()
  btrfs: zstd: don't cache sectorsize in a local variable
  btrfs: zlib: don't cache sectorsize in a local variable
  btrfs: zlib: drop redundant folio address variable
  btrfs: lzo: inline read/write length helpers
  btrfs: use common eb range validation in read_extent_buffer_to_user_nofault()
  btrfs: read eb folio index right before loops
  btrfs: rename local variable for offset in folio
  btrfs: unify types for binary search variables
  ...
2026-04-13 16:35:32 -07:00
Linus Torvalds
7fe6ac157b for-7.1/block-20260411
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmna0tgQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgptEbD/0ZMEsz5pcN+/bpM9Qva5lVVkByRieua+JA
 T7L+JMcEigp1Hf2idAPlv1e9dbrtgOGhkjZNlbZenP2MHXBmbUTnzTWDKW5w0ZQ4
 UqnVC7fMmxzI57DPt7iG/1WQo8O6QPHWwBof5ZXn0b83qwByTB2oVkAb9ysT7CdM
 wGk5KnPRLIAWf5o+aZ4LoWE+196jQiszx1m6U58FTqnCgvJ/GyKyrgzx+uvGUgF+
 owZT/6TrN7cN9A68fOnmcjEZ7beZXygOQPTn32sF9rEOi8JsgK71EE2LofdVVSNU
 ES/tyKVJbSNDgUH2b0T84rErT4MtZcw5J29V3k7CVndC+DcT2uLSroPz3lYQjDg9
 TLeq7ZLjnyoBG+muboWdXcvBKn3aKLec3nfVSbz6J1xb/Z22gWYy5TZbrGnGH8fJ
 zBiyKkHMaZi55IdTDWQT3a48h36qFh0Y2wbvZ6uhyYOfXHyj4pA4ccJZgFfmf4ZG
 flVRFGEL9Tqc82lB8dfy9DBp0ZQSjeBUCd+gyDKjiuWVau5L5iTUeMMkt8yr7qbg
 PY+ATJcHk5S5zwM2xcZUt5EcHBBbCaKQ6DdRZKwzMMUvCjHlvnWvENVjUtRa9Dng
 1vUKpB/e5NGpqD05Iqgyai+OD9/tALc4sUEI2yQ7/dk9pKIXQ4RE9HR/pSkgbjeR
 LGokj08cgg==
 =ga3t
 -----END PGP SIGNATURE-----

Merge tag 'for-7.1/block-20260411' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux

Pull block updates from Jens Axboe:

 - Add shared memory zero-copy I/O support for ublk, bypassing per-I/O
   copies between kernel and userspace by matching registered buffer
   PFNs at I/O time. Includes selftests.

 - Refactor bio integrity to support filesystem initiated integrity
   operations and arbitrary buffer alignment.

 - Clean up bio allocation, splitting bio_alloc_bioset() into clear fast
   and slow paths. Add bio_await() and bio_submit_or_kill() helpers,
   unify synchronous bi_end_io callbacks.

 - Fix zone write plug refcount handling and plug removal races. Add
   support for serializing zone writes at QD=1 for rotational zoned
   devices, yielding significant throughput improvements.

 - Add SED-OPAL ioctls for Single User Mode management and a STACK_RESET
   command.

 - Add io_uring passthrough (uring_cmd) support to the BSG layer.

 - Replace pp_buf in partition scanning with struct seq_buf.

 - zloop improvements and cleanups.

 - drbd genl cleanup, switching to pre_doit/post_doit.

 - NVMe pull request via Keith:
      - Fabrics authentication updates
      - Enhanced block queue limits support
      - Workqueue usage updates
      - A new write zeroes device quirk
      - Tagset cleanup fix for loop device

 - MD pull requests via Yu Kuai:
      - Fix raid5 soft lockup in retry_aligned_read()
      - Fix raid10 deadlock with check operation and nowait requests
      - Fix raid1 overlapping writes on writemostly disks
      - Fix sysfs deadlock on array_state=clear
      - Proactive RAID-5 parity building with llbitmap, with
        write_zeroes_unmap optimization for initial sync
      - Fix llbitmap barrier ordering, rdev skipping, and bitmap_ops
        version mismatch fallback
      - Fix bcache use-after-free and uninitialized closure
      - Validate raid5 journal metadata payload size
      - Various cleanups

 - Various other fixes, improvements, and cleanups

* tag 'for-7.1/block-20260411' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux: (146 commits)
  ublk: fix tautological comparison warning in ublk_ctrl_reg_buf
  scsi: bsg: fix buffer overflow in scsi_bsg_uring_cmd()
  block: refactor blkdev_zone_mgmt_ioctl
  MAINTAINERS: update ublk driver maintainer email
  Documentation: ublk: address review comments for SHMEM_ZC docs
  ublk: allow buffer registration before device is started
  ublk: replace xarray with IDA for shmem buffer index allocation
  ublk: simplify PFN range loop in __ublk_ctrl_reg_buf
  ublk: verify all pages in multi-page bvec fall within registered range
  ublk: widen ublk_shmem_buf_reg.len to __u64 for 4GB buffer support
  xfs: use bio_await in xfs_zone_gc_reset_sync
  block: add a bio_submit_or_kill helper
  block: factor out a bio_await helper
  block: unify the synchronous bi_end_io callbacks
  xfs: fix number of GC bvecs
  selftests/ublk: add read-only buffer registration test
  selftests/ublk: add filesystem fio verify test for shmem_zc
  selftests/ublk: add hugetlbfs shmem_zc test for loop target
  selftests/ublk: add shared memory zero-copy test
  selftests/ublk: add UBLK_F_SHMEM_ZC support for loop target
  ...
2026-04-13 15:51:31 -07:00
Filipe Manana
7801f3ea95 btrfs: tag as unlikely if statements that check for fs in error state
Having the filesystem in an error state, meaning we had a transaction
abort, is unexpected. Mark every check for the error state with the
unlikely annotation to convey that and to allow the compiler to generate
better code.

On x86_64, using gcc 14.2.0-19 from Debian, resulted in a slightly
reduced object size and better code.

Before:

  $ size fs/btrfs/btrfs.ko
     text	   data	    bss	    dec	    hex	filename
  2008598	 175912	  15592	2200102	 219226	fs/btrfs/btrfs.ko

After:

  $ size fs/btrfs/btrfs.ko
     text	   data	    bss	    dec	    hex	filename
  2008450	 175912	  15592	2199954	 219192	fs/btrfs/btrfs.ko

Reviewed-by: Anand Jain <asj@kernel.org>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2026-04-07 19:41:42 +02:00
Qu Wenruo
3c0c45a4df btrfs: do not reject a valid running dev-replace
[BUG]
There is a bug report that a btrfs with running dev-replace got rejected
with the following messages:

  BTRFS error (device sdk1): devid 0 path /dev/sdk1 is registered but not found in chunk tree
  BTRFS error (device sdk1): remove the above devices or use 'btrfs device scan --forget <dev>' to unregister them before mount
  BTRFS error (device sdk1): open_ctree failed: -117

[CAUSE]
The tree and super block dumps show the fs is completely sane, except
one thing, there is no dev item for devid 0 in chunk tree.

However this is not a bug, as we do not insert dev item for devid 0 in
the first place.
Since the devid 0 is only there temporarily we do not really need to
insert a dev item for it and then later remove it again.

It is the commit 3430818739 ("btrfs: add extra device item checks at
mount") adding a overly strict check that triggers a false alert and
rejected the valid filesystem.

[FIX]
Add a special handling for devid 0, and doesn't require devid 0 to
have a device item in chunk tree.

Reported-by: Jaron Viëtor <jaron@vietors.com>
Link: https://lore.kernel.org/linux-btrfs/CAF1bhLVYLZvD=j2XyuxXDKD-NWNJAwDnpVN+UYeQW-HbzNRn1A@mail.gmail.com/
Fixes: 3430818739 ("btrfs: add extra device item checks at mount")
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2026-04-07 18:56:08 +02:00
Filipe Manana
cee4cfd6cc btrfs: avoid taking the device_list_mutex in btrfs_run_dev_stats()
btrfs_run_dev_stats() is called during the critical section of a
transaction commit and it takes the device_list_mutex, which is also
acquired by fitrim, which does discard operations while holding that
mutex. Most of the time, if we are on a healthy filesystem, we don't have
new stat updates to persist in the device tree, so blocking on the
device_list_mutex is just wasting time and making any tasks that need to
start a new transaction wait longer that necessary.

Since the device list is RCU safe/protected, make btrfs_run_dev_stats()
do an initial check for device stat updates using RCU and quit without
taking the device_list_mutex in case there are no new device stats that
need to be persisted in the device tree.

Also note that adding/removing devices also requires starting a
transaction, and since btrfs_run_dev_stats() is called from the critical
section of a transaction commit, no one can be concurrently adding or
removing a device while btrfs_run_dev_stats() is called.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2026-04-07 18:56:05 +02:00
Miquel Sabaté Solà
2d2b5507e5 btrfs: replace kcalloc() calls to kzalloc_objs()
Commit 2932ba8d9c ("slab: Introduce kmalloc_obj() and family")
introduced, among many others, the kzalloc_objs() helper, which has some
benefits over kcalloc(). Namely, internal introspection of the allocated
type now becomes possible, allowing for future alignment-aware choices
to be made by the allocator and future hardening work that can be type
sensitive. Dropping 'sizeof' comes also as a nice side-effect.

Moreover, this also allows us to be in line with the recent tree-wide
migration to the kmalloc_obj() and family of helpers. See
commit 69050f8d6d ("treewide: Replace kmalloc with kmalloc_obj for
non-scalar types").

Reviewed-by: Kees Cook <kees@kernel.org>
Signed-off-by: Miquel Sabaté Solà <mssola@mssola.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2026-04-07 18:55:59 +02:00
Qu Wenruo
c84053d9f7 btrfs: update per-profile available estimation
This involves the following timing:

- Chunk allocation
- Chunk removal
- After Mount
- New device
- Device removal
- Device shrink
- Device enlarge

And since the function btrfs_update_per_profile_avail() will not return
an error, this won't cause new error handling path.

Although when btrfs_update_per_profile_avail() failed (only ENOSPC
possible) it will mark the per-profile available estimation as
unreliable, so later btrfs_get_per_profile_avail() will return false and
require the caller to have a fallback solution.

The function btrfs_update_per_profile_avail() will be executed with
chunk_mutex hold, thus it will slightly slow down those involved
functions, but not a lot.

As all the core workload is just various u64 calculations inside a loop,
without any tree search, the overhead should be acceptable even for all
supported 9 profiles.

For 4 disks (which exercises all 9 profiles), the execution time of that
function will still be less than 10 us.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2026-04-07 18:55:53 +02:00
Qu Wenruo
52fead5eb8 btrfs: introduce the device layout aware per-profile available space
[BUG]
There is a long known bug that if metadata is using RAID1 on two disks
with unbalanced sizes, there is a very high chance to hit ENOSPC related
transaction abort.

[CAUSE]
The root cause is in the available space estimation code:

- Factor based calculation
  Just use all unallocated space, divide by the profile factor
  One obvious user is can_overcommit().

This can not handle the following example:

  devid 1 unallocated:	1GiB
  devid 2 unallocated:	50GiB
  metadata type:	RAID1

If using factor based estimation, we can use (1GiB + 50GiB) / 2 = 25.5GiB
free space for metadata.
Thus we can continue allocating metadata (over-commit) way beyond the
1GiB limit.

But this estimation is completely wrong, in reality we can only allocate
one single 1GiB RAID1 block group, thus if we continue over-commit, at
one time we will hit ENOSPC at some critical path and flips the fs
read-only.

[SOLUTION]
This patch will introduce per-profile available space estimation,
which can provide chunk-allocator like behavior to give a (mostly)
accurate result, with under-estimate corner cases.

There are some differences between the estimation and real chunk
allocator:

- No consideration on hole size
  It's fine for most cases, as all data/metadata strips are in 1GiB size
  thus there should not be any hole wasting much space.

  And chunk allocator is able to use smaller stripes when there is
  really no other choice.

  Although in theory this means it can lead to some over-estimation, it
  should not cause too much hassle in the real world.

  The other benefit of such behavior is, we avoid dev-extent tree search
  completely, thus the overhead is very small.

- No true balance for certain cases
  If we have 3 disks RAID1, and each device has 2GiB unallocated space,
  we can load balance the chunk allocation so that we can allocate 3GiB
  RAID1 chunks, and that's what chunk allocator will do.

  But this current estimation code is using the largest available space
  to do a single allocation. Meaning the estimation will be 2GiB, thus
  under estimate.

  Such under estimation is fine and after the first chunk allocation, the
  estimation will be updated and still give a correct 2GiB
  estimation.
  So this only means the estimation will be a little conservative, which
  is safer for call sites like metadata over-commit check.

With that facility, for above 1GiB + 50GiB case, it will give a RAID1
estimation of 1GiB, instead of the incorrect 25.5GiB.

Or for a more complex example:
  devid 1 unallocated:	1T
  devid 2 unallocated:  1T
  devid 3 unallocated:	10T

We will get an array of:
  RAID10:	2T
  RAID1:	2T
  RAID1C3:	1T
  RAID1C4:	0  (not enough devices)
  DUP:		6T
  RAID0:	3T
  SINGLE:	12T
  RAID5:	2T
  RAID6:	1T

[IMPLEMENTATION]
And for the each profile , we go chunk allocator level calculation:
The pseudo code looks like:

  clear_virtual_used_space_of_all_rw_devices();
  do {
  	/*
  	 * The same as chunk allocator, despite used space,
  	 * we also take virtual used space into consideration.
  	 */
  	sort_device_with_virtual_free_space();

  	/*
  	 * Unlike chunk allocator, we don't need to bother hole/stripe
  	 * size, so we use the smallest device to make sure we can
  	 * allocated as many stripes as regular chunk allocator
  	 */
  	stripe_size = device_with_smallest_free->avail_space;
	stripe_size = min(stripe_size, to_alloc / ndevs);

  	/*
  	 * Allocate a virtual chunk, allocated virtual chunk will
  	 * increase virtual used space, allow next iteration to
  	 * properly emulate chunk allocator behavior.
  	 */
  	ret = alloc_virtual_chunk(stripe_size, &allocated_size);
  	if (ret == 0)
  		avail += allocated_size;
  } while (ret == 0)

This minimal available space based calculation is not perfect, but the
important part is, the estimation is never exceeding the real available
space.

This patch just introduces the infrastructure, no hooks are executed
yet.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2026-04-07 18:55:53 +02:00
Linus Torvalds
b51ad67773 for-7.0-rc5-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmnIO1oACgkQxWXV+ddt
 WDuLDQ//eWEAvqyA7ch+aLNm0QVHqsdH3NdVFjqw/3NsZfC9A3txTpQmmWpfqIUl
 Yqb3AVIdRqwdytinn3xX17SxkJxzCWExcQ8LckoiJ2ODhLoi/vWGnaK8sUBzUDsm
 2etvLGbDmhEEpkTMtIFkbY8fhprqthGCQgRHNrn/cZ+AzwohaboinvACysQJSLlH
 Nqv0YxnAxsaGIaGD7wigjZ7a2BDoSZ/Q82+RvacrOkyqEThSOqzLmfl++fmHfxQo
 UDzVOt9RSUnZ5LU4MYjF3H5otLczqKqRuoQC66ulIkvCml07jaKxfUVJyjkQ3dzZ
 //UiylTXnC6XK/x1ayRrM73dKkXoN9mBmvH4b2FnOQJ3D3oyiCuytEF8s4nCrE+P
 Zml+OTvIvWMj7Xy2U7cTMWuQGeggZ1m4nxCRScZBEnububUzt13izODwE7C8Cx1y
 5FvfFIkiyeBrI03bWUO48H38lwWsfIKj6KnEk1SUeZIARClnZfM4NGZ7MbxjcntG
 SunXzuQAywOZCkpskUDWMASLSH6CEGG4zgHPmbzw1NSCE5ZCHFZYXG7znv4w3CTW
 TTeowG9r9cgdVBnMUTIToNWuD0jSMLmUfEPWvcwZF9ptfXdoABsPEQZTrKTLeiq7
 TVdunKWzU12ZLr2sl4iku8U4H4aNQFoAwV3XqKnG8VXmzN41+Zw=
 =I6JH
 -----END PGP SIGNATURE-----

Merge tag 'for-7.0-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:
 "A few more fixes. There's one that stands out in size as it fixes an
  edge case in fsync.

   - fix issue on fsync where file with zero size appears as a non-zero
     after log replay

   - in zlib compression, handle a crash when data alignment causes
     folio reference issues

   - fix possible crash with enabled tracepoints on a overlayfs mount

   - handle device stats update error

   - on zoned filesystems, fix kobject leak on sub-block groups

   - fix super block offset in an error message in validation"

* tag 'for-7.0-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: fix lost error when running device stats on multiple devices fs
  btrfs: tracepoints: get correct superblock from dentry in event btrfs_sync_file()
  btrfs: zlib: handle page aligned compressed size correctly
  btrfs: fix leak of kobject name for sub-group space_info
  btrfs: fix zero size inode with non-zero size after log replay
  btrfs: fix super block offset in error message in btrfs_validate_super()
2026-03-28 15:23:03 -07:00
Filipe Manana
1c37d896b1 btrfs: fix lost error when running device stats on multiple devices fs
Whenever we get an error updating the device stats item for a device in
btrfs_run_dev_stats() we allow the loop to go to the next device, and if
updating the stats item for the next device succeeds, we end up losing
the error we had from the previous device.

Fix this by breaking out of the loop once we get an error and make sure
it's returned to the caller. Since we are in the transaction commit path
(and in the critical section actually), returning the error will result
in a transaction abort.

Fixes: 733f4fbbc1 ("Btrfs: read device stats on mount, write modified ones during commit")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2026-03-23 20:12:12 +01:00
Linus Torvalds
8991448e56 for-7.0-rc4-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmm+kPQACgkQxWXV+ddt
 WDt5lw/+P36nlsFO1XoEuMCtE4nxibGpejg1h8OA9Huv3GtC2x7G2yjkaHqXEw32
 A98BoEYI1nvieOHfhcD3685384xlH/dcItdxDIJJmbDWFM2n3H1ayXLDcsYQRg5Q
 3oExe87r+MfYzYCzpV1xePz/0OAcwdn+KGav6ASs/PPhVfdN9kjgZwVsCfQAIGuk
 cASPAx9SDUXjmD0f9OtBZqtOQt5eEF4Xvv3qvd/7/N5SpFyUMe3AeYE+2ttjrTUt
 sw0KE0XTLMJuUVZY1dUyUSpOIADcdoHBcpkPCCwh9JnK7OIcx+vM0VAbxjFsgKFi
 0kBRS1YdOeww6pQ88SCSLHk5xKbCsW1zGfF9lMKT7kUrLIFG01ddefmRXf4qQAni
 w1cowxp2LrcdXgL8AMsAQ4DJVMy1wJ1IFThaL8ZBBdX2CphViYt0gChZwTpchMhz
 GAtqcKSURGOADCvdAgkwERuaSSesdvjfsJ3IBF3ZjLSGTN8Wasj8FV7zOyjbOoEe
 4SZa36X+MEkIQ4Nn3MoEHvK2wuox/rDxTWp96NADSUVvRBCqijVPRZFapw6H9rlO
 zQorO1CAMIqxMC4dYxUDxMl8j2P/VwoaIsib6pVFidRzubI6vxk1dcZn82HKkh8a
 ghIm8XlfsQNTZDS+ominlbxPCsyPIInS5tfgC69FcV4ktrnZz3k=
 =MYi5
 -----END PGP SIGNATURE-----

Merge tag 'for-7.0-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:
 "Another batch of fixes for problems that have been identified by tools
  analyzing code or by fuzzing. Most of them are short, two patches fix
  the same thing in many places so the diffs are bigger.

   - handle potential NULL pointer errors after attempting to read
     extent and checksum trees

   - prevent ENOSPC when creating many qgroups by ioctls in the same
     transaction

   - encoded write ioctl fixes (with 64K page and 4K block size):
       - fix unexpected bio length
       - do not let compressed bios and pages interfere with page cache

   - compression fixes on setups with 64K page and 4K block size: fix
     folio length assertions (zstd and lzo)

   - remap tree fixes:
       - make sure to hold block group reference while moving it
       - handle early exit when moving block group to unused list

   - handle deleted subvolumes with inconsistent state of deletion
     progress"

* tag 'for-7.0-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: reject root items with drop_progress and zero drop_level
  btrfs: check block group before marking it unused in balance_remap_chunks()
  btrfs: hold block group reference during entire move_existing_remap()
  btrfs: fix an incorrect ASSERT() condition inside lzo_decompress_bio()
  btrfs: fix an incorrect ASSERT() condition inside zstd_decompress_bio()
  btrfs: do not touch page cache for encoded writes
  btrfs: fix a bug that makes encoded write bio larger than expected
  btrfs: reserve enough transaction items for qgroup ioctls
  btrfs: check for NULL root after calls to btrfs_csum_root()
  btrfs: check for NULL root after calls to btrfs_extent_root()
2026-03-21 08:42:17 -07:00
Mark Harmstone
adbb0ebacc btrfs: check block group before marking it unused in balance_remap_chunks()
Fix a potential segfault in balance_remap_chunks(): if we quit early
because btrfs_inc_block_group_ro() fails, all the remaining items in the
chunks list will still have their bg value set to NULL. It's thus not
safe to dereference this pointer without checking first.

Reported-by: Chris Mason <clm@fb.com>
Link: https://lore.kernel.org/linux-btrfs/20260125120717.1578828-1-clm@meta.com/
Fixes: 81e5a4551c ("btrfs: allow balancing remap tree")
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Mark Harmstone <mark@harmstone.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2026-03-17 11:43:08 +01:00
Linus Torvalds
2d1373e424 for-7.0-rc4-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmm3/NwACgkQxWXV+ddt
 WDt0gg//YB++Kd9nqDZAA3LqfctQPPEwhHgElg2zpg2BPiDxkZ/uOKECABdEuOd0
 gh3qxLytlfeTD6jM0jdU/g0NLfba7ZG8ABmiN922kqXy3rvYbgj4rdWGlZAHaZpz
 xAbFeQoNEIu5jHYwhdmtVILsfF/DuHGFGw/5h/sT3pPNH9Jw9tCxImM7fldHTP/u
 Nh/4gmpQ2vvH3ZKzhpKX+xSYucpdDkC3y8iIKIbfnfemctW/x7KD/9+cWtQZO/KB
 rM9PWZZEtPpM4ryfu0TsOFPXt69/RXZ/9RnGaY8K0KHrBfQqpT/9j3J5Ulx/x4Zq
 w7nw5pMAU1Gfsb9H9KAKzl2Bxy4+RlKKlQ5hp7VHbcluz82FTSGhDKXZr3WjnTfO
 QFwCHs8K0YPjCslTiTK+v8WDGyb31k0KcE2FDFhGdmNet0QVmCEitiiLQMZOf5z9
 onS0VC8o6Nh0c3AJA7chg4qzbA6ehBamz7t80JGGIPUtt9z83poQEXvq1XOqbXtF
 MLeCHuXJbo6CV+xZ4iyBKCV3WZIUZzvq7rp3WqoRh/hOu5dn5+0bjhpappBA7dHE
 PZIsOZyoTLo4L4YEtORb+8MFhLx5jGhkRcxYMOGhIG2LxQ3bjjQCeGDEXrJKc3OR
 qKYiMCyBD30JW6+e7YSui61ftpMbTUrR10jzOa4UUIY5nIP0z4c=
 =Z2JX
 -----END PGP SIGNATURE-----

Merge tag 'for-7.0-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:

 - fix logging of new dentries when logging parent directory and there
   are conflicting inodes (e.g. deleted directory)

 - avoid taking big device lock for zone setup, this is not necessary
   during mount

 - tune message verbosity when auto-reclaiming zones when low on space

 - fix slightly misleading message of root item check

* tag 'for-7.0-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: tree-checker: fix misleading root drop_level error message
  btrfs: log new dentries when logging parent dir of a conflicting inode
  btrfs: don't take device_list_mutex when querying zone info
  btrfs: pass 'verbose' parameter to btrfs_relocate_block_group
2026-03-16 08:53:06 -07:00
Johannes Thumshirn
4f6abe9c74 btrfs: pass 'verbose' parameter to btrfs_relocate_block_group
Function `btrfs_relocate_chunk()` always passes verbose=true to
`btrfs_relocate_block_group()` instead of the `verbose` parameter passed
into it by it's callers.

While user initiated rebalancing should be logged in the Kernel's log
buffer. This causes excessive log spamming from automatic rebalancing,
e.g. on zoned filesystems running low on usable space.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2026-03-13 12:47:35 +01:00
Damien Le Moal
ecd92cfec5 block: remove bdev_nonrot()
bdev_nonrot() is simply the negative return value of bdev_rot().
So replace all call sites of bdev_nonrot() with calls to bdev_rot()
and remove bdev_nonrot().

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Paul Menzel <pmenzel@molgen.mpg.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-03-09 14:30:00 -06:00
Linus Torvalds
c44db6c820 for-7.0-rc2-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmmm7WwACgkQxWXV+ddt
 WDs7PQ/+Mc0CHhMhRH3DyEnZTPO5YcNLGl2ytqu19X2VdGu3Ra86Au4V+0tWJ+zf
 g4jI8UdgJWdR7aIoIgtMkl2BbK0tyY0WBEJ76EJNDsatByNmTXc0iXwGROe6tL9p
 n4qrEnaTMh4SmYEsFEQX9lO5ISbDbk+kfN8qapCl03c9JyKO6D3PSGrM7wzIkXX4
 oIyfDWpYpAxbyWKjn+uJlpPzdsdfRceJ0fyCbq9sJITVW/FhicqTr6xvqqeoPSXp
 oJiL/Bbsilh7AtCLHguqpczt0X+Fus9enpjT9QqATN/JgUsaXt6O6Mk6NHcnEwjS
 vW6ZdeiFdELz2yLnJyb15ROf6Uorm3Mt2kAnkatLpyHxG9Z7rkxs3+cX4nm7MxSG
 GfLBkFB+HGw155z7cK0dPHMAhQ0KCF66I99VKTgLChjmUs8ipjPAYR8f/Tsq82RD
 mrYf3mEgWYnw6alx2ak454hsNjiXuYmc9bNy8Q+TXD73gQGqwUcZR6alIV+eoWVB
 xbX/0BQPemMITlhX6IuNn5EkCZSoB7eLcDMmYRSOpJOd8oo+gXmzQ5WvQIpwYhwz
 IZIH+KTdErw2FKJ8x9tStydnrmzN63QTEMMtuBy8pRsP5qJMrncPfAOMNBlqhqMq
 3W1GJuurHt2dBmUOQXWrUcMQlDLPyOxUHV6TdpCL83xNzdZK8G4=
 =REHq
 -----END PGP SIGNATURE-----

Merge tag 'for-7.0-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:
 "One-liner or short fixes for minor/moderate problems reported recently:

   - fixes or level adjustments of error messages

   - fix leaked transaction handles after aborted transactions, when
     using the remap tree feature

   - fix a few leaked chunk maps after errors

   - fix leaked page array in io_uring encoded read if an error occurs
     and the 'finished' is not called

   - fix double release of reserved extents when doing a range COW

   - don't commit super block when the filesystem is in shutdown state

   - fix squota accounting condition when checking members vs parent
     usage

   - other error handling fixes"

* tag 'for-7.0-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: check block group lookup in remove_range_from_remap_tree()
  btrfs: fix transaction handle leaks in btrfs_last_identity_remap_gone()
  btrfs: fix chunk map leak in btrfs_map_block() after btrfs_translate_remap()
  btrfs: fix chunk map leak in btrfs_map_block() after btrfs_chunk_map_num_copies()
  btrfs: fix compat mask in error messages in btrfs_check_features()
  btrfs: print correct subvol num if active swapfile prevents deletion
  btrfs: fix warning in scrub_verify_one_metadata()
  btrfs: fix objectid value in error message in check_extent_data_ref()
  btrfs: fix incorrect key offset in error message in check_dev_extent_item()
  btrfs: fix error message order of parameters in btrfs_delete_delayed_dir_index()
  btrfs: don't commit the super block when unmounting a shutdown filesystem
  btrfs: free pages on error in btrfs_uring_read_extent()
  btrfs: fix referenced/exclusive check in squota_check_parent_usage()
  btrfs: remove pointless WARN_ON() in cache_save_setup()
  btrfs: convert log messages to error level in btrfs_replay_log()
  btrfs: remove btrfs_handle_fs_error() after failure to recover log trees
  btrfs: remove redundant warning message in btrfs_check_uuid_tree()
  btrfs: change warning messages to error level in open_ctree()
  btrfs: fix a double release on reserved extents in cow_one_range()
  btrfs: handle discard errors in in btrfs_finish_extent_commit()
2026-03-03 09:08:00 -08:00
Mark Harmstone
54b9395b18 btrfs: fix chunk map leak in btrfs_map_block() after btrfs_translate_remap()
If the call to btrfs_translate_remap() in btrfs_map_block() returns an
error code, we were leaking the chunk map. Fix it by jumping to out
rather than returning directly.

Reported-by: Chris Mason <clm@fb.com>
Link: https://lore.kernel.org/linux-btrfs/20260125125830.2352988-1-clm@meta.com/
Fixes: 18ba649928 ("btrfs: redirect I/O for remapped block groups")
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Mark Harmstone <mark@harmstone.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2026-02-26 15:03:29 +01:00
Mark Harmstone
f15fb3d415 btrfs: fix chunk map leak in btrfs_map_block() after btrfs_chunk_map_num_copies()
Fix a chunk map leak in btrfs_map_block(): if we return early with -EINVAL,
we're not freeing the chunk map that we've just looked up.

Fixes: 0ae653fbec ("btrfs: reduce chunk_map lookups in btrfs_map_block()")
CC: stable@vger.kernel.org # 6.12+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Mark Harmstone <mark@harmstone.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2026-02-26 15:03:29 +01:00
Linus Torvalds
bf4afc53b7 Convert 'alloc_obj' family to use the new default GFP_KERNEL argument
This was done entirely with mindless brute force, using

    git grep -l '\<k[vmz]*alloc_objs*(.*, GFP_KERNEL)' |
        xargs sed -i 's/\(alloc_objs*(.*\), GFP_KERNEL)/\1)/'

to convert the new alloc_obj() users that had a simple GFP_KERNEL
argument to just drop that argument.

Note that due to the extreme simplicity of the scripting, any slightly
more complex cases spread over multiple lines would not be triggered:
they definitely exist, but this covers the vast bulk of the cases, and
the resulting diff is also then easier to check automatically.

For the same reason the 'flex' versions will be done as a separate
conversion.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2026-02-21 17:09:51 -08:00
Kees Cook
69050f8d6d treewide: Replace kmalloc with kmalloc_obj for non-scalar types
This is the result of running the Coccinelle script from
scripts/coccinelle/api/kmalloc_objs.cocci. The script is designed to
avoid scalar types (which need careful case-by-case checking), and
instead replace kmalloc-family calls that allocate struct or union
object instances:

Single allocations:	kmalloc(sizeof(TYPE), ...)
are replaced with:	kmalloc_obj(TYPE, ...)

Array allocations:	kmalloc_array(COUNT, sizeof(TYPE), ...)
are replaced with:	kmalloc_objs(TYPE, COUNT, ...)

Flex array allocations:	kmalloc(struct_size(PTR, FAM, COUNT), ...)
are replaced with:	kmalloc_flex(*PTR, FAM, COUNT, ...)

(where TYPE may also be *VAR)

The resulting allocations no longer return "void *", instead returning
"TYPE *".

Signed-off-by: Kees Cook <kees@kernel.org>
2026-02-21 01:02:28 -08:00
Adarsh Das
be6324a809 btrfs: replace BUG() with error handling in __btrfs_balance()
We search with offset (u64)-1 which should never match exactly.
Previously this was handled with BUG(). Now logs an error
and return -EUCLEAN.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Adarsh Das <adarshdas950@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2026-02-18 15:25:53 +01:00
Boris Burkov
5341c98450 btrfs: tests: add unit tests for pending extent walking functions
I ran into another sort of trivial bug in v1 of the patch and concluded
that these functions really ought to be unit tested.

These two functions form the core of searching the chunk allocation
pending extent bitmap and have relatively easily definable semantics, so
unit testing them can help ensure the correctness of chunk allocation.

I also made a minor unrelated fix in volumes.h to properly forward
declare btrfs_space_info. Because of the order of the includes in the
new test, this was actually hitting a latent build warning.

Note:
This is an early example for me of a commit authored in part by an AI
agent, so I wanted to more clear about what I did. I defined a
trivial test and explained the set of tests I wanted to the agent and it
produced the large set of test cases seen here. I then checked each test
case to make sure it matched the description and simplified the
constants and numbers until they looked reasonable to me. I then checked
the looping logic to make sure it made sense to the original spirit of
the trivial test. Finally, carefully combed over all the lines it wrote
to loop over the tests it generated to make sure they followed our code
style guide.

Assisted-by: Claude:claude-opus-4-5
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
2026-02-03 07:56:25 +01:00
Boris Burkov
b14c5e04bd btrfs: fix EEXIST abort due to non-consecutive gaps in chunk allocation
I have been observing a number of systems aborting at
insert_dev_extents() in btrfs_create_pending_block_groups(). The
following is a sample stack trace of such an abort coming from forced
chunk allocation (typically behind CONFIG_BTRFS_EXPERIMENTAL) but this
can theoretically happen to any DUP chunk allocation.

  [81.801] ------------[ cut here ]------------
  [81.801] BTRFS: Transaction aborted (error -17)
  [81.801] WARNING: fs/btrfs/block-group.c:2876 at btrfs_create_pending_block_groups+0x721/0x770 [btrfs], CPU#1: bash/319
  [81.802] Modules linked in: virtio_net btrfs xor zstd_compress raid6_pq null_blk
  [81.803] CPU: 1 UID: 0 PID: 319 Comm: bash Kdump: loaded Not tainted 6.19.0-rc6+ #319 NONE
  [81.803] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Arch Linux 1.17.0-2-2 04/01/2014
  [81.804] RIP: 0010:btrfs_create_pending_block_groups+0x723/0x770 [btrfs]
  [81.806] RSP: 0018:ffffa36241a6bce8 EFLAGS: 00010282
  [81.806] RAX: 000000000000000d RBX: ffff8e699921e400 RCX: 0000000000000000
  [81.807] RDX: 0000000002040001 RSI: 00000000ffffffef RDI: ffffffffc0608bf0
  [81.807] RBP: 00000000ffffffef R08: ffff8e69830f6000 R09: 0000000000000007
  [81.808] R10: ffff8e699921e5e8 R11: 0000000000000000 R12: ffff8e6999228000
  [81.808] R13: ffff8e6984d82000 R14: ffff8e69966a69c0 R15: ffff8e69aa47b000
  [81.809] FS:  00007fec6bdd9740(0000) GS:ffff8e6b1b379000(0000) knlGS:0000000000000000
  [81.809] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  [81.810] CR2: 00005604833670f0 CR3: 0000000116679000 CR4: 00000000000006f0
  [81.810] Call Trace:
  [81.810]  <TASK>
  [81.810]  __btrfs_end_transaction+0x3e/0x2b0 [btrfs]
  [81.811]  btrfs_force_chunk_alloc_store+0xcd/0x140 [btrfs]
  [81.811]  kernfs_fop_write_iter+0x15f/0x240
  [81.812]  vfs_write+0x264/0x500
  [81.812]  ksys_write+0x6c/0xe0
  [81.812]  do_syscall_64+0x66/0x770
  [81.812]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
  [81.813] RIP: 0033:0x7fec6be66197
  [81.814] RSP: 002b:00007fffb159dd30 EFLAGS: 00000202 ORIG_RAX: 0000000000000001
  [81.815] RAX: ffffffffffffffda RBX: 00007fec6bdd9740 RCX: 00007fec6be66197
  [81.815] RDX: 0000000000000002 RSI: 0000560483374f80 RDI: 0000000000000001
  [81.816] RBP: 0000560483374f80 R08: 0000000000000000 R09: 0000000000000000
  [81.816] R10: 0000000000000000 R11: 0000000000000202 R12: 0000000000000002
  [81.817] R13: 00007fec6bfb85c0 R14: 00007fec6bfb5ee0 R15: 00005604833729c0
  [81.817]  </TASK>
  [81.817] irq event stamp: 20039
  [81.818] hardirqs last  enabled at (20047): [<ffffffff99a68302>] __up_console_sem+0x52/0x60
  [81.818] hardirqs last disabled at (20056): [<ffffffff99a682e7>] __up_console_sem+0x37/0x60
  [81.819] softirqs last  enabled at (19470): [<ffffffff999d2b46>] __irq_exit_rcu+0x96/0xc0
  [81.819] softirqs last disabled at (19463): [<ffffffff999d2b46>] __irq_exit_rcu+0x96/0xc0
  [81.820] ---[ end trace 0000000000000000 ]---
  [81.820] BTRFS: error (device dm-7 state A) in btrfs_create_pending_block_groups:2876: errno=-17 Object already exists

Inspecting these aborts with drgn, I observed a pattern of overlapping
chunk_maps. Note how stripe 1 of the first chunk overlaps in physical
address with stripe 0 of the second chunk.

Physical Start     Physical End       Length       Logical            Type                 Stripe
----------------------------------------------------------------------------------------------------
0x0000000102500000 0x0000000142500000 1.0G         0x0000000641d00000 META|DUP             0/2
0x0000000142500000 0x0000000182500000 1.0G         0x0000000641d00000 META|DUP             1/2
0x0000000142500000 0x0000000182500000 1.0G         0x0000000601d00000 META|DUP             0/2
0x0000000182500000 0x00000001c2500000 1.0G         0x0000000601d00000 META|DUP             1/2

Now how could this possibly happen? All chunk allocation is protected by
the chunk_mutex so racing allocations should see a consistent view of
the CHUNK_ALLOCATED bit in the chunk allocation extent-io-tree
(device->alloc_state as set by chunk_map_device_set_bits()) The tree
itself is protected by a spin lock, and clearing/setting the bits is
always protected by fs_info->mapping_tree_lock, so no race is apparent.

It turns out that there is a subtle bug in the logic regarding chunk
allocations that have happened in the current transaction, known as
"pending extents". The chunk allocation as defined in
find_free_dev_extent() is a loop which searches the commit root of the
dev_root and looks for gaps between DEV_EXTENT items. For those gaps, it
then checks alloc_state bitmap for any pending extents and adjusts the
hole that it finds accordingly. However, the logic in that adjustment
assumes that the first pending extent is the only one in that range.

e.g., given a layout with two non-consecutive pending extents in a hole
passed to dev_extent_hole_check() via *hole_start and *hole_size:

  |----pending A----|    real hole     |----pending B----|
           |           candidate hole        |
      *hole_start                         *hole_start + *hole_size

the code incorrectly returns a "hole" from the end of pending extent A
until the passed in hole end, failing to account for pending B.

However, it is not entirely obvious that it is actually possible to
produce such a layout. I was able to reproduce it, but with some
contortions: I continued to use the force chunk allocation sysfs file
and I introduced a long delay (10 seconds) into the start of the cleaner
thread. I also prevented the unused bgs cleaning logic from ever
deleting metadata bgs. These help make it easier to deterministically
produce the condition but shouldn't really matter if you imagine the
conditions happening by race/luck. Allocations/frees can happen
concurrently with the cleaner thread preparing to process an unused
extent and both create some used chunks with an unused chunk
interleaved, all during one transaction. Then btrfs_delete_unused_bgs()
sees the unused one and clears it, leaving a range with several pending
chunk allocations and a gap in the middle.

The basic idea is that the unused_bgs cleanup work happens on a worker
so if we allocate 3 block groups in one transaction, then the cleaner
work kicked off by the previous transaction comes through and deletes
the middle one of the 3, then the commit root shows no dev extents and
we have the bad pattern in the extent-io-tree. One final consideration
is that the code happens to loop to the next hole if there are no more
extents at all, so we need one more dev extent way past the area we are
working in. Something like the following demonstrates the technique:

  # push the BG frontier out to 20G
  fallocate -l 20G $mnt/foo
  # allocate one more that will prevent the "no more dev extents" luck
  fallocate -l 1G $mnt/sticky
  # sync
  sync
  # clear out the allocation area
  rm $mnt/foo
  sync
  _cleaner
  # let everything quiesce
  sleep 20
  sync

  # dev tree should have one bg 20G out and the rest at the beginning..
  # sort of like an empty FS but with a random sticky chunk.

  # kick off the cleaner in the background, remember it will sleep 10s
  # before doing interesting work
  _cleaner &

  sleep 3

  # create 3 trivial block groups, all empty, all immediately marked as unused.
  echo 1 > "$(_btrfs_sysfs_space_info $dev metadata)/force_chunk_alloc"
  echo 1 > "$(_btrfs_sysfs_space_info $dev data)/force_chunk_alloc"
  echo 1 > "$(_btrfs_sysfs_space_info $dev metadata)/force_chunk_alloc"

  # let the cleaner thread definitely finish, it will remove the data bg
  sleep 10

  # this allocation sees the non-consecutive pending metadata chunks with
  # data chunk gap of 1G and allocates a 2G extent in that hole. ENOSPC!
  echo 1 > "$(_btrfs_sysfs_space_info $dev metadata)/force_chunk_alloc"

As for the fix, it is not that obvious. I could not see a trivial way to
do it even by adding backup loops into find_free_dev_extent(), so I
opted to change the semantics of dev_extent_hole_check() to not stop
looping until it finds a sufficiently big hole. For clarity, this also
required changing the helper function contains_pending_extent() into two
new helpers which find the first pending extent and the first suitable
hole in a range.

I attempted to clean up the documentation and range calculations to be
as consistent and clear as possible for the future.

I also looked at the zoned case and concluded that the loop there is
different and not to be unified with this one. As far as I can tell, the
zoned check will only further constrain the hole so looping back to find
more holes is acceptable. Though given that zoned really only appends, I
find it highly unlikely that it is susceptible to this bug.

Fixes: 1b98450816 ("Btrfs: fix find_free_dev_extent() malfunction in case device tree has hole")
Reported-by: Dimitrios Apostolou <jimis@gmx.net>
Closes: https://lore.kernel.org/linux-btrfs/q7760374-q1p4-029o-5149-26p28421s468@tzk.arg/
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
2026-02-03 07:56:25 +01:00
Filipe Manana
6f926597f9 btrfs: abort transaction on error in btrfs_remove_block_group()
When btrfs_remove_block_group() fails we abort the transaction in its
single caller (btrfs_remove_chunk()). This makes it harder to find out
where exactly the failure happened, as several steps inside
btrfs_remove_block_group() can fail.

So make btrfs_remove_block_group() abort the transaction whenever an
error happens, instead of aborting in its caller.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2026-02-03 07:56:24 +01:00
Filipe Manana
cefef3cc12 btrfs: remove out label in btrfs_check_rw_degradable()
There is no point in having the label since all it does is return the
value in the 'ret' variable. Instead make every goto return directly
and remove the label.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2026-02-03 07:56:23 +01:00
Mark Harmstone
81e5a4551c btrfs: allow balancing remap tree
Balancing the METADATA_REMAP chunk, i.e. the chunk in which the remap tree
lives, is a special case.

We can't use the remap tree itself for this, as then we'd have no way to
boostrap it on mount. And we can't use the pre-remap tree code for this
as it relies on walking the extent tree, and we're not creating backrefs
for METADATA_REMAP chunks.

So instead, if a balance would relocate any METADATA_REMAP block groups, mark
those block groups as readonly and COW every leaf of the remap tree.

There's more sophisticated ways of doing this, such as only COWing nodes
within a block group that's to be relocated, but they're fiddly and with
lots of edge cases. Plus it's not anticipated that

a) the number of METADATA_REMAP chunks is going to be particularly large, or

b) that users will want to only relocate some of these chunks - the main
   use case here is to unbreak RAID conversion and device removal.

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Mark Harmstone <mark@harmstone.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2026-02-03 07:54:36 +01:00
Mark Harmstone
a645372e7e btrfs: add do_remap parameter to btrfs_discard_extent()
btrfs_discard_extent() can be called either when an extent is removed
or from walking the free-space tree. With a remapped block group these
two things are no longer equivalent: the extent's addresses are
remapped, while the free-space tree exclusively uses underlying
addresses.

Add a do_remap parameter to btrfs_discard_extent() and
btrfs_map_discard(), saying whether or not the address needs to be run
through the remap tree first.

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Mark Harmstone <mark@harmstone.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2026-02-03 07:54:35 +01:00
Mark Harmstone
b56f35560b btrfs: handle setting up relocation of block group with remap-tree
Handle the preliminary work for relocating a block group in a filesystem
with the remap-tree flag set.

If the block group is SYSTEM btrfs_relocate_block_group() proceeds as it
does already, as bootstrapping issues mean that these block groups have
to be processed the existing way. Similarly with METADATA_REMAP blocks, which
are dealt with in a later patch.

Otherwise we walk the free-space tree for the block group in question,
recording any holes. These get converted into identity remaps and placed
in the remap tree, and the block group's REMAPPED flag is set. From now
on no new allocations are possible within this block group, and any I/O
to it will be funnelled through btrfs_translate_remap(). We store the
number of identity remaps in `identity_remap_count`, so that we know
when we've removed the last one and the block group is fully remapped.

The change in btrfs_read_roots() is because data relocations no longer
rely on the data reloc tree as a hidden subvolume in which to do
snapshots.

(Thanks to Sun YangKai for his suggestions.)

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Mark Harmstone <mark@harmstone.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2026-02-03 07:54:35 +01:00
Mark Harmstone
979e1dc3d6 btrfs: handle deletions from remapped block group
Handle the case where we free an extent from a block group that has the
REMAPPED flag set. Because the remap tree is orthogonal to the extent
tree, for data this may be within any number of identity remaps or
actual remaps. If we're freeing a metadata node, this will be wholly
inside one or the other.

btrfs_remove_extent_from_remap_tree() searches the remap tree for the
remaps that cover the range in question, then calls
remove_range_from_remap_tree() for each one, to punch a hole in the
remap and adjust the free-space tree.

For an identity remap, remove_range_from_remap_tree() will adjust the
block group's `identity_remap_count` if this changes. If it reaches
zero we mark the block group as fully remapped.

For an identity remap, remove_range_from_remap_tree() will adjust the
block group's `identity_remap_count` if this changes. If it reaches
zero we mark the block group as fully remapped.

Fully remapped block groups have their chunk stripes removed and their
device extents freed, which makes the disk space available again to the
chunk allocator. This happens asynchronously: in the cleaner thread for
sync discard and nodiscard, and (in a later patch) in the discard worker
for async discard.

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Mark Harmstone <mark@harmstone.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2026-02-03 07:54:35 +01:00
Mark Harmstone
18ba649928 btrfs: redirect I/O for remapped block groups
Change btrfs_map_block() so that if the block group has the REMAPPED
flag set, we call btrfs_translate_remap() to obtain a new address.

btrfs_translate_remap() searches the remap tree for a range
corresponding to the logical address passed to btrfs_map_block(). If it
is within an identity remap, this part of the block group hasn't yet
been relocated, and so we use the existing address.

If it is within an actual remap, we subtract the start of the remap
range and add the address of its destination, contained in the item's
payload.

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Mark Harmstone <mark@harmstone.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2026-02-03 07:54:35 +01:00
Mark Harmstone
efcab3176e btrfs: don't add metadata items for the remap tree to the extent tree
There is the following potential problem with the remap tree and delayed refs:

* Remapped extent freed in a delayed ref, which removes an entry from the
  remap tree
* Remap tree now small enough to fit in a single leaf
* Corruption as we now have a level-0 block with a level-1 metadata item
  in the extent tree

One solution to this would be to rework the remap tree code so that it operates
via delayed refs. But as we're hoping to remove cow-only metadata items in the
future anyway, change things so that the remap tree doesn't have any entries in
the extent tree. This also has the benefit of reducing write amplification.

We also make it so that the clear_cache mount option is a no-op, as with the
extent tree v2, as the free-space tree can no longer be recreated from the
extent tree.

Finally disable relocating the remap tree itself, which is added back in
a later patch. As it is we would get corruption as the traditional
relocation method walks the extent tree, and we're removing its metadata
items.

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Mark Harmstone <mark@harmstone.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2026-02-03 07:54:34 +01:00
Mark Harmstone
c3d6dda60c btrfs: allow remapped chunks to have zero stripes
When a chunk has been fully remapped, we are going to set its
num_stripes to 0, as it will no longer represent a physical location on
disk.

Change tree-checker to allow for this, and fix read_one_chunk() to avoid
a divide by zero.

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Mark Harmstone <mark@harmstone.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2026-02-03 07:54:34 +01:00
Mark Harmstone
0b4d29fa98 btrfs: add METADATA_REMAP chunk type
Add a new METADATA_REMAP chunk type, which is a metadata chunk that holds the
remap tree.

This is needed for bootstrapping purposes: the remap tree can't itself
be remapped, and must be relocated the existing way, by COWing every
leaf. The remap tree can't go in the SYSTEM chunk as space there is
limited, because a copy of the chunk item gets placed in the superblock.

The changes in fs/btrfs/volumes.h are because we're adding a new block
group type bit after the profile bits, and so can no longer rely on the
const_ilog2 trick.

The sizing to 32MB per chunk, matching the SYSTEM chunk, is an estimate
here, we can adjust it later if it proves to be too big or too small.
This works out to be ~500,000 remap items, which for a 4KB block size
covers ~2GB of remapped data in the worst case and ~500TB in the best case.

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Mark Harmstone <mark@harmstone.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2026-02-03 07:54:27 +01:00
Mark Harmstone
ef6a31d035 btrfs: add definitions and constants for remap-tree
Add an incompat flag for the new remap-tree feature, and the constants
and definitions needed to support it.

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Mark Harmstone <mark@harmstone.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2026-02-03 07:54:02 +01:00
jinbaohong
23d4f616cb btrfs: use READA_FORWARD_ALWAYS for device extent verification
btrfs_verify_dev_extents() iterates through the entire device tree
during mount to verify dev extents against chunks. Since this function
scans the whole tree, READA_FORWARD_ALWAYS is more appropriate than
READA_FORWARD.

While the device tree is typically small (a few hundred KB even for
multi-TB filesystems), using the correct readahead mode for full-tree
iteration is more consistent with the intended usage.

Signed-off-by: robbieko <robbieko@synology.com>
Signed-off-by: jinbaohong <jinbaohong@synology.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2026-02-03 07:51:43 +01:00
Qu Wenruo
4681dbcfdc btrfs: shrink the size of btrfs_device
There are two main causes of holes inside btrfs_device:

- The single bytes member of last_flush_error
  Not only it's a single byte member, but we never really care about the
  exact error number.

- The @devt member
  Which is placed between two u64 members.

Shrink the size of btrfs_device by:

- Use a single bit flag for flush error
  Use BTRFS_DEV_STATE_FLUSH_FAILED so that we no longer need that
  dedicated member.

- Move @devt to the hole after dev_stat_values[]

This reduces the size of btrfs_device from 528 to exact 512 bytes for
x86_64.

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2026-02-03 07:51:43 +01:00
Filipe Manana
8d206b0c21 btrfs: avoid transaction commit on error in insert_balance_item()
There's no point in committing the transaction if we failed to insert the
balance item, since we haven't done anything else after we started/joined
the transaction. Also stop using two variables for tracking the return
value and use only 'ret'.

Reviewed-by: Daniel Vacek <neelx@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2026-02-03 07:50:23 +01:00
Filipe Manana
cb73493cae btrfs: avoid transaction commit on error in del_balance_item()
There's no point in committing the transaction if we failed to delete the
item, since we haven't done anything before. Also stop using two variables
for tracking the return value and use only 'ret'.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2026-02-03 07:49:10 +01:00
David Sterba
4b117be65f btrfs: merge setting ret and return ret
In many places we have pattern:

	ret = ...;
	return ret;

This can be simplified to a direct return, removing 'ret' if not
otherwise needed. The places in self tests are not converted so we can
add more test cases without changing surrounding code
(extent-map-tests.c:test_case_4()).

Signed-off-by: David Sterba <dsterba@suse.com>
2026-02-03 06:38:33 +01:00
Linus Torvalds
07eebd934c for-6.19-rc6-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmlwVncACgkQxWXV+ddt
 WDskUw//ShxtgrpXPMHjFiJVAddaga5OLOb5LqoWqfkxvYjjO2Ye+Ldw85dvsWNR
 ooYJsOxOl6GedUh5bRXV4BNeCESaoTGHLx0tOiJNyplhV5had5On8slutTmAxkZQ
 KANph0Oj9y6A+R2JsTtNMdE8R7hMl+kYHprxlUP4LbQsOdku2GXXBEaxJU+7CZz3
 f1bi4TCuF1ry2jgoAyJ56b1qQAy4FteAsr7dYFjk+Pz2GSNG4Ix/ALbs4vdsjDTf
 t8C63eAW+Q2T8nyCbWhnrWTgq6HAXkJznPc4uTMpkVP/UeNlwjieJUwgijMRomc6
 MD3hLQxSTndwZispx4XWmBjEd/ubAA7OE9YhomvStiaibFs6wwvg2Ikz5SNfUvfD
 fOg9oHX16zrprmwYbgQujW8G0bNcY0nXu1YRzXtNBvyW/vSQFpgw1aqyH1ei99aN
 sKqT0dzKsKAACqIp3I7Nj7IcqgW8BJ+Y9sqlJvrU7kEqBAEFDhAmdm0Tn8/s+i2N
 3YlP5KTg8rcR4Fn0XCKvZIGx6Y319JfpynSo75M3rl3mn/NurmLHoReIdTXiSvxu
 QWt5JE91+1EqrNAaUrw4+sQXd0W7UpV/om8tZXZOKC2TsImQv3K9S8B6kMj947Rs
 QcFwdgwrIoGuT1kku+sAs+knxReq8IXvJWt5wT4fqCxn4i+3ppM=
 =n4z3
 -----END PGP SIGNATURE-----

Merge tag 'for-6.19-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:

 - protect reading super block vs setting block size externally (found
   by syzbot)

 - make sure no transaction is started in read-only mode even with some
   rescue mount option combinations

 - fix checksum calculation of backup super blocks when block-group-tree
   is enabled

 - more extensive mount-time checks of device items that could be left
   after device replace and attempting degraded mount

 - fix build warning with -Wmaybe-uninitialized on loongarch64-gcc 12

* tag 'for-6.19-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: add extra device item checks at mount
  btrfs: fix missing fields in superblock backup with BLOCK_GROUP_TREE
  btrfs: reject new transactions if the fs is fully read-only
  btrfs: sync read disk super and set block size
  btrfs: fix Wmaybe-uninitialized warning in replay_one_buffer()
2026-01-21 08:42:34 -08:00
Qu Wenruo
3430818739 btrfs: add extra device item checks at mount
[BUG]
There is a bug report where after a dev-replace, the replace source
device with devid 4 is properly erased (dump tree shows it's the old
devid 4), but the target device is still using devid 0.

When the user tries to mount the fs degraded, the mount failed with the
following errors:

  BTRFS: device fsid 84a1ed4a-365c-45c3-a9ee-a7df525dc3c9 devid 5 transid 1394395 /dev/sda (8:0) scanned by btrfs (261)
  BTRFS: device fsid 84a1ed4a-365c-45c3-a9ee-a7df525dc3c9 devid 6 transid 1394395 /dev/sde (8:64) scanned by btrfs (261)
  BTRFS: device fsid 84a1ed4a-365c-45c3-a9ee-a7df525dc3c9 devid 0 transid 1394395 /dev/sdd (8:48) scanned by btrfs (261)
  BTRFS: device fsid 84a1ed4a-365c-45c3-a9ee-a7df525dc3c9 devid 3 transid 1394395 /dev/sdf (8:80) scanned by btrfs (261)
  BTRFS info (device sdd): first mount of filesystem 84a1ed4a-365c-45c3-a9ee-a7df525dc3c9
  BTRFS info (device sdd): using crc32c (crc32c-intel) checksum algorithm
  BTRFS warning (device sdd): devid 4 uuid 01e2081c-9c2a-4071-b9f4-e1b27e571ff5 is missing
  BTRFS info (device sdd): bdev <missing disk> errs: wr 84994544, rd 15567, flush 65872, corrupt 0, gen 0
  BTRFS info (device sdd): bdev /dev/sdd errs: wr 71489901, rd 0, flush 30001, corrupt 0, gen 0
  BTRFS error (device sdd): replace without active item, run 'device scan --forget' on the target device
  BTRFS error (device sdd): failed to init dev_replace: -117
  BTRFS error (device sdd): open_ctree failed: -117

[CAUSE]
The devid 0 didn't get its devid updated is its own problem, here I'm
only focusing on the mount failure itself.

The mount is not caused by the missing device, as the fs has RAID1C3 for
metadata and RAID10 for data, thus is completely able to tolerate one
missing device.

The device tree shows the dev-replace has properly finished:

        item 7 key (0 DEV_REPLACE 0) itemoff 15931 itemsize 72
                src devid -1 cursor left 11091821199360 cursor right 11091821199360 mode ALWAYS
                state FINISHED write errors 0 uncorrectable read errors 0
		      ^^^^^^^^

And the chunk tree shows there is no devid 0:

  leaf 37980736602112 items 23 free space 12548 generation 1394388 owner CHUNK_TREE
  leaf 37980736602112 flags 0x1(WRITTEN) backref revision 1
  fs uuid 84a1ed4a-365c-45c3-a9ee-a7df525dc3c9
  chunk uuid d074c661-6311-4570-b59f-a5c83fd37f8e
         item 0 key (DEV_ITEMS DEV_ITEM 3) itemoff 16185 itemsize 98
                 devid 3 total_bytes 20000588955648 bytes_used 8282877984768
                 io_align 4096 io_width 4096 sector_size 4096 type 0
                 generation 0 start_offset 0 dev_group 0
                 seek_speed 0 bandwidth 0
                 uuid 0d596b69-fb0d-4031-b4af-a301d0868b8b
                 fsid 84a1ed4a-365c-45c3-a9ee-a7df525dc3c9
         ...

Which shows the first device is devid 3.

But there is indeed /dev/sdd with devid 0:

  superblock: bytenr=65536, device=/dev/sdd
  ---------------------------------------------------------
  csum_type               0 (crc32c)
  csum_size               4
  csum                    0xd4bed87e [match]
  bytenr                  65536
  flags                   0x1
                          ( WRITTEN )
  magic                   _BHRfS_M [match]
  fsid                    84a1ed4a-365c-45c3-a9ee-a7df525dc3c9
  ...
  uuid_tree_generation    1394388
  dev_item.uuid           ee6532ad-5442-45f7-87fb-7703e29ed934
  dev_item.fsid           84a1ed4a-365c-45c3-a9ee-a7df525dc3c9 [match]
  dev_item.type           0
  dev_item.total_bytes    20000588955648
  dev_item.bytes_used     8292541661184
  dev_item.io_align       0
  dev_item.io_width       0
  dev_item.sector_size    0
  dev_item.devid          0 <<<

So this means device scan will register sdd as devid 0 into the fs, then
during btrfs_init_dev_replace(), we located the replace progress item,
found the previous replace is finished, but we still need to check if
the dev-replace target device (devid 0) exists.

If that device exists, we error out showing that error message.

But to be honest the end user may not really remember which device is
the replace target device, thus not sure what to do in the next step.

[ENHANCEMENT]
To make the error more obvious, and tell the end user which devices
should be unregistered:

- Introduce BTRFS_DEV_STATE_ITEM_FOUND flag
  During device item read from the chunk tree, set the flag for each
  found device item.

- Verify there is no device without the above flag during mount
  Even missing device should have that flag set.
  If we found a device without that flag set, it means it's an
  unexpected one and should be rejected.

- More detailed error message on what to do next
  This will show all unexpected devices and tell the end user to use
  'btrfs dev scan --forget' to forget them or remove them before mount.

There is an example dmesg where a device of a valid filesystem is modified to
have devid 0, then try degraded mount:

  BTRFS info (device dm-6): first mount of filesystem 7c873869-844c-4b39-bd75-a96148bf4656
  BTRFS info (device dm-6): using crc32c checksum algorithm
  BTRFS warning (device dm-6): devid 3 uuid b4a9f35b-db42-4ac4-b55a-cbf81d3b9683 is missing
  BTRFS error (device dm-6): devid 0 path /dev/mapper/test-scratch3 is registered but not found in chunk tree
  BTRFS error (device dm-6): please remove above devices or use 'btrfs device scan --forget <dev>' to unregister them before mount
  BTRFS error (device dm-6): open_ctree failed: -117

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2026-01-20 17:18:48 +01:00
Edward Adam Davis
3f29d661e5 btrfs: sync read disk super and set block size
When the user performs a btrfs mount, the block device is not set
correctly. The user sets the block size of the block device to 0x4000
by executing the BLKBSZSET command.
Since the block size change also changes the mapping->flags value, this
further affects the result of the mapping_min_folio_order() calculation.

Let's analyze the following two scenarios:

Scenario 1: Without executing the BLKBSZSET command, the block size is
0x1000, and mapping_min_folio_order() returns 0;

Scenario 2: After executing the BLKBSZSET command, the block size is
0x4000, and mapping_min_folio_order() returns 2.

do_read_cache_folio() allocates a folio before the BLKBSZSET command
is executed. This results in the allocated folio having an order value
of 0. Later, after BLKBSZSET is executed, the block size increases to
0x4000, and the mapping_min_folio_order() calculation result becomes 2.

This leads to two undesirable consequences:

1. filemap_add_folio() triggers a VM_BUG_ON_FOLIO(folio_order(folio) <
mapping_min_folio_order(mapping)) assertion.

2. The syzbot report [1] shows a null pointer dereference in
create_empty_buffers() due to a buffer head allocation failure.

Synchronization should be established based on the inode between the
BLKBSZSET command and read cache page to prevent inconsistencies in
block size or mapping flags before and after folio allocation.

[1]
KASAN: null-ptr-deref in range [0x0000000000000000-0x0000000000000007]
RIP: 0010:create_empty_buffers+0x4d/0x480 fs/buffer.c:1694
Call Trace:
 folio_create_buffers+0x109/0x150 fs/buffer.c:1802
 block_read_full_folio+0x14c/0x850 fs/buffer.c:2403
 filemap_read_folio+0xc8/0x2a0 mm/filemap.c:2496
 do_read_cache_folio+0x266/0x5c0 mm/filemap.c:4096
 do_read_cache_page mm/filemap.c:4162 [inline]
 read_cache_page_gfp+0x29/0x120 mm/filemap.c:4195
 btrfs_read_disk_super+0x192/0x500 fs/btrfs/volumes.c:1367

Reported-by: syzbot+b4a2af3000eaa84d95d5@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=b4a2af3000eaa84d95d5
Signed-off-by: Edward Adam Davis <eadavis@qq.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2026-01-20 17:18:47 +01:00
Linus Torvalds
115fada16b for-6.19-rc1-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmlATDcACgkQxWXV+ddt
 WDvWIA/7BP1o+7bGSY/HIwnVIwNyEd1YCpUrLZAd61C7t2OEZKCO13HtM9cZSIRb
 z++k8iCkNl5Z3uZH/cQfUcCuA2sip9HSGaVrfeQ3qVnB/s3DMGb/mu1yBCKUo9/f
 nCxpsIC7BD6cTaiAzCJ6JgvCjOSieG111s0/LGjDcjlM7YQoA8p/Jan1nlvfc8It
 sTQ+ZCsuO/xHViQnfmT/XIyy5bSuSABb5LR78wp8wumngTL3ooBXGwWifrNT/egi
 E06Hhnqopg1PjBDtQtmInJ1gh1E0capQ5j1Z6TDeMYCPeUOuPpRqLVrRP3bIM4jN
 vDu5dZpM9r542Wpj/vZvs/UqmhczUmbQfjLfWdr+KORrl6RA9pkHXyFTFIsTKhGi
 vtAsmMnu5FwKSlnZU1i/EuvcF89KEPx4jKRGQWiKUPwAuBUAkVa4xhsI/mAUmwv5
 +Z+hQxPuIAdmcbLblsI0mnhCGjMTx+qUQQdhY2r7U2bOKhEds+XekABb9KBrjOdj
 k8UEQZJwwWkcPSunYsOpBYBI1SIV8UeHtp8d2xrat90+Ome7feL1VFEjV/rOKc6w
 f7hUeYZPVNQMcXdfNRkoXHK/zqKxpMF5lz9Tq3mzfF6XoseC+gSW44dWQnHPnI8X
 kCj0bpg7o2WgGuc3UWIXrVXZmSEWhn30Go6UfsGqHT7xiULQvSc=
 =lAhb
 -----END PGP SIGNATURE-----

Merge tag 'for-6.19-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:

 - fix missing btrfs_path release after printing a relocation error
   message

 - fix extent changeset leak on mmap write after failure to reserve
   metadata

 - fix fs devices list structure freeing, it could be potentially leaked
   under some circumstances

 - tree log fixes:
     - fix incremental directory logging where inodes for new dentries
       were incorrectly skipped
     - don't log conflicting inode if it's a directory moved in the
       current transaction

 - regression fixes:
     - fix incorrect btrfs_path freeing when it's auto-cleaned
     - revert commit simplifying preallocation of temporary structures
       in qgroup functions, some cases were not handled properly

* tag 'for-6.19-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: fix changeset leak on mmap write after failure to reserve metadata
  btrfs: fix memory leak of fs_devices in degraded seed device path
  btrfs: fix a potential path leak in print_data_reloc_error()
  Revert "btrfs: add ASSERTs on prealloc in qgroup functions"
  btrfs: do not skip logging new dentries when logging a new name
  btrfs: don't log conflicting inode if it's a dir moved in the current transaction
  btrfs: tests: fix double btrfs_path free in remove_extent_ref()
2025-12-16 19:28:20 +12:00
Deepanshu Kartikey
b57f2ddd28 btrfs: fix memory leak of fs_devices in degraded seed device path
In open_seed_devices(), when find_fsid() fails and we're in DEGRADED
mode, a new fs_devices is allocated via alloc_fs_devices() but is never
added to the seed_list before returning. This contrasts with the normal
path where fs_devices is properly added via list_add().

If any error occurs later in read_one_dev() or btrfs_read_chunk_tree(),
the cleanup code iterates seed_list to free seed devices, but this
orphaned fs_devices is never found and never freed, causing a memory
leak. Any devices allocated via add_missing_dev() and attached to this
fs_devices are also leaked.

Fix this by adding the newly allocated fs_devices to seed_list in the
degraded path, consistent with the normal path.

Fixes: 5f37583569 ("Btrfs: move the missing device to its own fs device list")
Reported-by: syzbot+eadd98df8bceb15d7fed@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=eadd98df8bceb15d7fed
Tested-by: syzbot+eadd98df8bceb15d7fed@syzkaller.appspotmail.com
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Deepanshu Kartikey <kartikey406@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-12-12 16:33:12 +01:00
Linus Torvalds
7696286034 for-6.19-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmko/DUACgkQxWXV+ddt
 WDtyCw//UaFOTX/k72HgA1n2MWfegWbWyD+OGNbGosoZljKOrAe/mRnjXTyF9lyW
 8GDzGvJzF4Tkl5lyuGyiequlrO2F3veTpwHo94xnBTOYCeiTpMqTN/e/5SkasBpN
 4YlWq7OGYR4hwghRvZpaW7nsmVCKDLIlZVkH77x9Bmvx0NLO24EJlEZusQT4zYew
 ntC/i9x3DW0ZxYyfRhFIFvk9JUUdgXfxJ6dNexz0zi3dKUSUIR9hI0J9Nwl++1cF
 SgjAzbtO064htWoCvsKykgA6YGbJCZjw8XO8D2eJonkN24VbqSMaY44TPXmCMLVs
 ZXw871jV2E/urfWhRNdxv/kJdCFudPk0qXG5ZtfHO4UUwS/nZ+qAig+LHawgAOCJ
 9CgWy4zrfiYCqULRuqF1wzWu/z22++zIlZC552VAZd1RQ+JjqJY/aje4xhY5nUF4
 n1uVBReZaI9sH3jJOsMWpwLMptbhpH9RZp3QPgqZlUHo6GtPJJmNKfw8KgMAhZ7L
 wf7iy6v9yo+7VZ2ACwu2qJ+lZRxbZ0yvCnFatN3O5G1O0kkIrZFUM3MwdKtufZ0u
 LHWkGfoaq7zR6E6DhIaxIhiTTXMlOfLTikNKgBUO3NEdrRZwrDhr7K07S25jFxSx
 ZCNV6OdSCeziShPqT0ntcwecnJ41/kOcm13732NHF+QgzMK5LrI=
 =rO4x
 -----END PGP SIGNATURE-----

Merge tag 'for-6.19-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs updates from David Sterba:
 "Features:

   - shutdown ioctl support (needs CONFIG_BTRFS_EXPERIMENTAL for now):
      - set filesystem state as being shut down (also named going down
        in other filesystems), where all active operations return EIO
        and this cannot be changed until unmount
      - pending operations are attempted to be finished but error
        messages may still show up depending on where exactly the
        shutdown happened

   - scrub (and device replace) vs suspend/hibernate:
      - a running scrub will prevent suspend, which can be annoying as
        suspend is an immediate request and scrub is not critical
      - filesystem freezing before suspend was not sufficient as the
        problem was in process freezing
      - behaviour change: on suspend scrub and device replace are
        cancelled, where scrub can record the last state and continue
        from there; the device replace has to be restarted from the
        beginning

   - zone stats exported in sysfs, from the perspective of the
     filesystem this includes active, reclaimable, relocation etc zones

  Performance:

   - improvements when processing space reservation tickets by
     optimizing locking and shrinking critical sections, cumulative
     improvements in lockstat numbers show +15%

  Notable fixes:

   - use vmalloc fallback when allocating bios as high order allocations
     can happen with wide checksums (like sha256)

   - scrub will always track the last position of progress so it's not
     starting from zero after an error

  Core:

   - under experimental config, checksum calculations are offloaded to
     process context, simplifies locking and allows to remove
     compression write worker kthread(s):
      - speed improvement in direct IO throughput with buffered IO
        fallback is +15% when not offloaded but this is more related to
        internal crypto subsystem improvements
      - this will be probably default in the future removing the sysfs
        tunable

   - (experimental) block size > page size updates:
      - support more operations when not using large folios (encoded
        read/write and send)
      - raid56

   - more preparations for fscrypt support

  Other:

   - more conversions to auto-cleaned variables

   - parameter cleanups and removals

   - extended warning fixes

   - improved printing of structured values like keys

   - lots of other cleanups and refactoring"

* tag 'for-6.19-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (147 commits)
  btrfs: remove unnecessary inode key in btrfs_log_all_parents()
  btrfs: remove redundant zero/NULL initializations in btrfs_alloc_root()
  btrfs: remaining BTRFS_PATH_AUTO_FREE conversions
  btrfs: send: do not allocate memory for xattr data when checking it exists
  btrfs: send: add unlikely to all unexpected overflow checks
  btrfs: reduce arguments to btrfs_del_inode_ref_in_log()
  btrfs: remove root argument from btrfs_del_dir_entries_in_log()
  btrfs: use test_and_set_bit() in btrfs_delayed_delete_inode_ref()
  btrfs: don't search back for dir inode item in INO_LOOKUP_USER
  btrfs: don't rewrite ret from inode_permission
  btrfs: add orig_logical to btrfs_bio for encryption
  btrfs: disable verity on encrypted inodes
  btrfs: disable various operations on encrypted inodes
  btrfs: remove redundant level reset in btrfs_del_items()
  btrfs: simplify leaf traversal after path release in btrfs_next_old_leaf()
  btrfs: optimize balance_level() path reference handling
  btrfs: factor out root promotion logic into promote_child_to_root()
  btrfs: raid56: remove the "_step" infix
  btrfs: raid56: enable bs > ps support
  btrfs: raid56: prepare finish_parity_scrub() to support bs > ps cases
  ...
2025-12-03 20:03:46 -08:00
Linus Torvalds
978d337c2e vfs-6.19-rc1.guards
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCaSmOZgAKCRCRxhvAZXjc
 opxBAQCjNjr0yTSoaGRM0CJXg79Of3DLIlBdB7TygibTN16WhwEA+VKWoHL5eRjg
 PZlwZD4Ei2ymeQYxi+6owTF8G806tAs=
 =m/Bt
 -----END PGP SIGNATURE-----

Merge tag 'vfs-6.19-rc1.guards' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull superblock lock guard updates from Christian Brauner:
 "This starts the work of introducing guards for superblock related
  locks.

  Introduce super_write_guard for scoped superblock write protection.

  This provides a guard-based alternative to the manual sb_start_write()
  and sb_end_write() pattern, allowing the compiler to automatically
  handle the cleanup"

* tag 'vfs-6.19-rc1.guards' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  xfs: use super write guard in xfs_file_ioctl()
  open: use super write guard in do_ftruncate()
  btrfs: use super write guard in relocating_repair_kthread()
  ext4: use super write guard in write_mmp_block()
  btrfs: use super write guard in sb_start_write()
  btrfs: use super write guard btrfs_run_defrag_inode()
  btrfs: use super write guard in btrfs_reclaim_bgs_work()
  fs: add super_write_guard
2025-12-01 14:39:03 -08:00
Christoph Hellwig
ded9958704
btrfs: use vfs_utimes to update file timestamps
Btrfs updates the device node timestamps for block device special files
when it stop using the device.

Commit 8f96a5bfa1 ("btrfs: update the bdev time directly when closing")
switch that update from the correct layering to directly call the
low-level helper on the bdev inode.  This is wrong and got fixed in
commit 54fde91f52 ("btrfs: update device path inode time instead of
bd_inode") by updating the file system inode instead of the bdev inode,
but this kept the incorrect bypassing of the VFS interfaces and file
system ->update_times method.  Fix this by using the propet vfs_utimes
interface.

Fixes: 8f96a5bfa1 ("btrfs: update the bdev time directly when closing")
Fixes: 54fde91f52 ("btrfs: update device path inode time instead of bd_inode")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://patch.msgid.link/20251120064859.2911749-5-hch@lst.de
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-26 14:50:10 +01:00
David Sterba
10934c131f btrfs: remaining BTRFS_PATH_AUTO_FREE conversions
Do the remaining btrfs_path conversion to the auto cleaning, this seems
to be the last one. Most of the conversions are trivial, only adding the
declaration and removing the freeing, or changing the goto patterns to
return.

There are some functions with many changes, like __btrfs_free_extent(),
btrfs_remove_from_free_space_tree() or btrfs_add_to_free_space_tree()
but it still follows the same pattern.

Signed-off-by: David Sterba <dsterba@suse.com>
2025-11-25 01:53:33 +01:00
Filipe Manana
d7fe41044b btrfs: use bool type for btrfs_path members used as booleans
Many fields of struct btrfs_path are used as booleans but their type is
an unsigned int (of one 1 bit width to save space). Change the type to
bool keeping the :1 suffix so that they combine with the previous u8
fields in order to save space. This makes the code more clear by using
explicit true/false and more in line with the preferred style, preserving
the size of the structure.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-11-24 22:42:25 +01:00
Miquel Sabaté Solà
7ab5d01d58 btrfs: apply the AUTO_K(V)FREE macros throughout the code
Apply the AUTO_KFREE and AUTO_KVFREE macros wherever it makes
sense. Since this macro is expected to improve code readability, it has
been avoided in places where the lifetime of objects wasn't easy to
follow and a cleanup attribute would've made things worse; or when the
cleanup section of a function involved many other things and thus there
was no readability impact anyways. This change has also not been applied
in extremely short functions where readability was clearly not an issue.

Signed-off-by: Miquel Sabaté Solà <mssola@mssola.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-11-24 22:34:51 +01:00