Commit Graph

1934 Commits

Author SHA1 Message Date
Johannes Thumshirn
390aa432f3 btrfs: decrease indentation of find_free_extent_update_loop
Decrease the indentation of find_free_extent_update_loop(), by inverting
the check if the loop state is smaller than LOOP_NO_EMPTY_SIZE.

This also allows for an early return from find_free_extent_update_loop(),
in case LOOP_NO_EMPTY_SIZE is already set at this point.

While at it change a

    if () {
    }
    else if
    else

pattern to all using curly braces and be consistent with the rest of btrfs
code.

Also change 'int exists' to 'bool have_trans' giving it a more meaningful
name and type.

No functional changes intended.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2026-04-07 18:56:05 +02:00
Qu Wenruo
908ab56347 btrfs: remove atomic parameter from btrfs_buffer_uptodate()
That parameter was introduced by commit b9fab919b7 ("Btrfs: avoid
sleeping in verify_parent_transid while atomic").
At that time we needed to lock the extent buffer range inside the io tree
to avoid content changes, thus it could sleep.

But that behavior is no longer there, as later commit 9e2aff90fc
("btrfs: stop using lock_extent in btrfs_buffer_uptodate") dropped the
io tree lock.

We can remove the @atomic parameter safely now.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2026-04-07 18:56:02 +02:00
ZhengYuan Huang
f04c6475c2 btrfs: revalidate cached tree blocks on the uptodate path
read_extent_buffer_pages_nowait() returns immediately when an extent
buffer is already marked uptodate. On that cache-hit path,
the caller supplied btrfs_tree_parent_check is not re-run.

This can let read_tree_root_path() accept a cached tree block whose
actual header level/owner does not match the expected value derived from
the parent.

E.g. a corrupted root item that points to a tree block which doesn't
even belong to that root, and has mismatching level/owner.

But that tree block is already read and cached, later the corrupted tree
root got read from disk and hit the cached tree block.

Fix this by re-validating cached extent buffers against the supplied
btrfs_tree_parent_check on the uptodate path, and make
read_tree_root_path() pass its check to btrfs_buffer_uptodate().

This makes cache hits and fresh reads follow the same tree-parent
verification rules, and turns the corruption into a read failure instead
of constructing an inconsistent root object.

Signed-off-by: ZhengYuan Huang <gality369@gmail.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
[ Resolve the conflict with extent_buffer_uptodate() helper, handle
  transid mismatch case ]
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2026-04-07 18:56:02 +02:00
Jiasheng Jiang
08ef56661f btrfs: zoned: remove redundant space_info lock and variable in do_allocation_zoned()
In do_allocation_zoned(), the code acquires space_info->lock before
block_group->lock. However, the critical section does not access or
modify any members of the space_info structure. Thus, the lock is
redundant as it provides no necessary synchronization here.

This change simplifies the locking logic and aligns the function with
other zoned paths, such as __btrfs_add_free_space_zoned(), which only
rely on block_group->lock. Since the 'space_info' local variable is
no longer used after removing the lock calls, it is also removed.

Removing this unnecessary lock reduces contention on the global
space_info lock, improving concurrency in the zoned allocation path.

Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: Jiasheng Jiang <jiashengjiangcool@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2026-04-07 18:55:52 +02:00
robbieko
316fb1b316 btrfs: fix incorrect return value after changing leaf in lookup_extent_data_ref()
After commit 1618aa3c2e ("btrfs: simplify return variables in
lookup_extent_data_ref()"), the err and ret variables were merged into
a single ret variable. However, when btrfs_next_leaf() returns 0
(success), ret is overwritten from -ENOENT to 0. If the first key in
the next leaf does not match (different objectid or type), the function
returns 0 instead of -ENOENT, making the caller believe the lookup
succeeded when it did not. This can lead to operations on the wrong
extent tree item, potentially causing extent tree corruption.

Fix this by returning -ENOENT directly when the key does not match,
instead of relying on the ret variable.

Fixes: 1618aa3c2e ("btrfs: simplify return variables in lookup_extent_data_ref()")
CC: stable@vger.kernel.org # 6.12+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: robbieko <robbieko@synology.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2026-03-31 06:00:45 +02:00
Filipe Manana
2b4cb4e58f btrfs: check for NULL root after calls to btrfs_csum_root()
btrfs_csum_root() can return a NULL pointer in case the root we are
looking for is not in the rb tree that tracks roots. So add checks to
every caller that is missing such check to log a message and return
an error.

Reported-by: Chris Mason <clm@meta.com>
Link: https://lore.kernel.org/linux-btrfs/20260208161657.3972997-1-clm@meta.com/
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2026-03-17 11:29:32 +01:00
Filipe Manana
5024282870 btrfs: check for NULL root after calls to btrfs_extent_root()
btrfs_extent_root() can return a NULL pointer in case the root we are
looking for is not in the rb tree that tracks roots. So add checks to
every caller that is missing such check to log a message and return
an error. The same applies to callers of btrfs_block_group_root(),
since it calls btrfs_extent_root().

Reported-by: Chris Mason <clm@meta.com>
Link: https://lore.kernel.org/linux-btrfs/20260208161657.3972997-1-clm@meta.com/
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2026-03-17 11:22:49 +01:00
Jingkai Tan
2970525f78 btrfs: handle discard errors in in btrfs_finish_extent_commit()
Coverity (ID: 1226842) reported that the return value of
btrfs_discard_extent() is assigned to ret but is immediately
overwritten by unpin_extent_range() without being checked.

Use the same error handling that is done later in the same function.

Signed-off-by: Jingkai Tan <contact@jingk.ai>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2026-02-26 15:03:26 +01:00
jinbaohong
b291ad4458 btrfs: fix transaction commit blocking during trim of unallocated space
When trimming unallocated space, btrfs_trim_fs() holds the device_list_mutex
for the entire duration while iterating through all devices. On large
filesystems with significant unallocated space, this operation can take
minutes to hours on large storage systems.

This causes a problem because btrfs_run_dev_stats(), which is called
during transaction commit, also requires device_list_mutex:

  btrfs_trim_fs()
    mutex_lock(&fs_devices->device_list_mutex)
    list_for_each_entry(device, ...)
      btrfs_trim_free_extents(device)
    mutex_unlock(&fs_devices->device_list_mutex)

  commit_transaction()
    btrfs_run_dev_stats()
      mutex_lock(&fs_devices->device_list_mutex)  // blocked!
      ...

While trim is running, all transaction commits are blocked waiting for
the mutex.

Fix this by refactoring btrfs_trim_free_extents() to process devices in
bounded chunks (up to 2GB per iteration) and release device_list_mutex
between chunks.

Signed-off-by: robbieko <robbieko@synology.com>
Signed-off-by: jinbaohong <jinbaohong@synology.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2026-02-03 07:56:25 +01:00
jinbaohong
bfb670b918 btrfs: handle user interrupt properly in btrfs_trim_fs()
When a fatal signal is pending or the process is freezing,
btrfs_trim_block_group() and btrfs_trim_free_extents() return -ERESTARTSYS.
Currently this is treated as a regular error: the loops continue to the
next iteration and count it as a block group or device failure.

Instead, break out of the loops immediately and return -ERESTARTSYS to
userspace without counting it as a failure. Also skip the device loop
entirely if the block group loop was interrupted.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Robbie Ko <robbieko@synology.com>
Signed-off-by: jinbaohong <jinbaohong@synology.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2026-02-03 07:56:25 +01:00
jinbaohong
1cc4ada418 btrfs: preserve first error in btrfs_trim_fs()
When multiple block groups or devices fail during trim, preserve the
first error encountered rather than the last one. The first error is
typically more useful for debugging as it represents the original
failure, while subsequent errors may be cascading effects.

Signed-off-by: Robbie Ko <robbieko@synology.com>
Signed-off-by: jinbaohong <jinbaohong@synology.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2026-02-03 07:56:24 +01:00
jinbaohong
912d1c6680 btrfs: continue trimming remaining devices on failure
Commit 93bba24d4b ("btrfs: Enhance btrfs_trim_fs function to handle
error better") intended to make device trimming continue even if one
device fails, tracking failures and reporting them at the end. However,
it used 'break' instead of 'continue', causing the loop to exit on the
first device failure.

Fix this by replacing 'break' with 'continue'.

Fixes: 93bba24d4b ("btrfs: Enhance btrfs_trim_fs function to handle error better")
CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Robbie Ko <robbieko@synology.com>
Signed-off-by: jinbaohong <jinbaohong@synology.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2026-02-03 07:56:24 +01:00
Filipe Manana
ea8f921005 btrfs: remove pointless out labels from extent-tree.c
Some functions (lookup_extent_data_ref(), __btrfs_mod_ref() and
btrfs_free_tree_block()) have an 'out' label that does nothing but
return, making it pointless. Simplify this by removing the label and
returning instead of gotos plus setting the 'ret' variable.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2026-02-03 07:56:21 +01:00
Filipe Manana
4ac81c3811 btrfs: use the btrfs_block_group_end() helper everywhere
We have a helper to calculate a block group's exclusive end offset, but
we only use it in some places. Update every site that open codes the
calculation to use the helper.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2026-02-03 07:54:36 +01:00
Mark Harmstone
7cddbb4339 btrfs: handle discarding fully-remapped block groups
Discard normally works by iterating over the free-space entries of a
block group. This doesn't work for fully-remapped block groups, as we
removed their free-space entries when we started relocation.

For sync discard, call btrfs_discard_extent() when we commit the
transaction in which the last identity remap was removed.

For async discard, add a new function btrfs_trim_fully_remapped_block_group()
to be called by the discard worker, which iterates over the block
group's range using the normal async discard rules. Once we reach the
end, remove the chunk's stripes and device extents to get back its free
space.

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Mark Harmstone <mark@harmstone.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2026-02-03 07:54:36 +01:00
Mark Harmstone
a645372e7e btrfs: add do_remap parameter to btrfs_discard_extent()
btrfs_discard_extent() can be called either when an extent is removed
or from walking the free-space tree. With a remapped block group these
two things are no longer equivalent: the extent's addresses are
remapped, while the free-space tree exclusively uses underlying
addresses.

Add a do_remap parameter to btrfs_discard_extent() and
btrfs_map_discard(), saying whether or not the address needs to be run
through the remap tree first.

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Mark Harmstone <mark@harmstone.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2026-02-03 07:54:35 +01:00
Mark Harmstone
bbea42dfb9 btrfs: move existing remaps before relocating block group
If when relocating a block group we find that `remap_bytes` > 0 in its
block group item, that means that it has been the destination block
group for another that has been remapped.

We need to search the remap tree for any remap backrefs within this
range, and move the data to a third block group. This is because
otherwise btrfs_translate_remap() could end up following an unbounded
chain of remaps, which would only get worse over time.

We only relocate one block group at a time, so `remap_bytes` will only
ever go down while we are doing this. Once we're finished we set the
REMAPPED flag on the block group, which will permanently prevent any
other data from being moved to within it.

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Mark Harmstone <mark@harmstone.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2026-02-03 07:54:35 +01:00
Mark Harmstone
979e1dc3d6 btrfs: handle deletions from remapped block group
Handle the case where we free an extent from a block group that has the
REMAPPED flag set. Because the remap tree is orthogonal to the extent
tree, for data this may be within any number of identity remaps or
actual remaps. If we're freeing a metadata node, this will be wholly
inside one or the other.

btrfs_remove_extent_from_remap_tree() searches the remap tree for the
remaps that cover the range in question, then calls
remove_range_from_remap_tree() for each one, to punch a hole in the
remap and adjust the free-space tree.

For an identity remap, remove_range_from_remap_tree() will adjust the
block group's `identity_remap_count` if this changes. If it reaches
zero we mark the block group as fully remapped.

For an identity remap, remove_range_from_remap_tree() will adjust the
block group's `identity_remap_count` if this changes. If it reaches
zero we mark the block group as fully remapped.

Fully remapped block groups have their chunk stripes removed and their
device extents freed, which makes the disk space available again to the
chunk allocator. This happens asynchronously: in the cleaner thread for
sync discard and nodiscard, and (in a later patch) in the discard worker
for async discard.

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Mark Harmstone <mark@harmstone.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2026-02-03 07:54:35 +01:00
Mark Harmstone
8620da16fb btrfs: allow mounting filesystems with remap-tree incompat flag
If we encounter a filesystem with the remap-tree incompat flag set,
validate its compatibility with the other flags, and load the remap tree
using the values that have been added to the superblock.

The remap-tree feature depends on the free-space-tree, but no-holes and
block-group-tree have been made dependencies to reduce the testing
matrix. Similarly I'm not aware of any reason why mixed-bg and zoned would be
incompatible with remap-tree, but this is blocked for the time being
until it can be fully tested.

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Mark Harmstone <mark@harmstone.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2026-02-03 07:54:35 +01:00
Mark Harmstone
efcab3176e btrfs: don't add metadata items for the remap tree to the extent tree
There is the following potential problem with the remap tree and delayed refs:

* Remapped extent freed in a delayed ref, which removes an entry from the
  remap tree
* Remap tree now small enough to fit in a single leaf
* Corruption as we now have a level-0 block with a level-1 metadata item
  in the extent tree

One solution to this would be to rework the remap tree code so that it operates
via delayed refs. But as we're hoping to remove cow-only metadata items in the
future anyway, change things so that the remap tree doesn't have any entries in
the extent tree. This also has the benefit of reducing write amplification.

We also make it so that the clear_cache mount option is a no-op, as with the
extent tree v2, as the free-space tree can no longer be recreated from the
extent tree.

Finally disable relocating the remap tree itself, which is added back in
a later patch. As it is we would get corruption as the traditional
relocation method walks the extent tree, and we're removing its metadata
items.

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Mark Harmstone <mark@harmstone.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2026-02-03 07:54:34 +01:00
Filipe Manana
b322fa5ff1 btrfs: tag as unlikely error handling in run_one_delayed_ref()
We don't expect to get errors unless we have a corrupted fs, bad RAM or a
bug. So tag the error handling as unlikely.

This slightly reduces the module's text size on x86_64 using gcc 14.2.0-19
from Debian.

Before this change:

  $ size fs/btrfs/btrfs.ko
     text	   data	    bss	    dec	    hex	filename
  1939458	 172512	  15592	2127562	 2076ca	fs/btrfs/btrfs.ko

After this change:

  $ size fs/btrfs/btrfs.ko
     text	   data	    bss	    dec	    hex	filename
  1939398	 172512	  15592	2127502	 20768e	fs/btrfs/btrfs.ko

Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2026-02-03 07:51:43 +01:00
Filipe Manana
271cbe7635 btrfs: remove unnecessary else branch in run_one_delayed_ref()
There is no need for an else branch to deal with an unexpected delayed ref
type. We can just change the previous branch to deal with this by checking
if the ref type is not BTRFS_EXTENT_OWNER_REF_KEY, since that branch is
useless as it only sets 'ret' to zero when it's already zero. So merge the
two branches.

Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2026-02-03 07:51:43 +01:00
Filipe Manana
c7d1d4ff56 btrfs: don't BUG() on unexpected delayed ref type in run_one_delayed_ref()
There is no need to BUG(), we can just return an error and log an error
message.

Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2026-02-03 07:51:43 +01:00
Johannes Thumshirn
4273db18a8 btrfs: zoned: re-flow prepare_allocation_zoned()
Re-flow prepare allocation zoned to make it a bit more readable by
returning early and removing unnecessary indentations.

This patch does not change any functionality.

Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2026-02-03 07:49:10 +01:00
David Sterba
4b117be65f btrfs: merge setting ret and return ret
In many places we have pattern:

	ret = ...;
	return ret;

This can be simplified to a direct return, removing 'ret' if not
otherwise needed. The places in self tests are not converted so we can
add more test cases without changing surrounding code
(extent-map-tests.c:test_case_4()).

Signed-off-by: David Sterba <dsterba@suse.com>
2026-02-03 06:38:33 +01:00
Sun YangKai
a5eb902436 btrfs: simplify boolean argument for btrfs_inc_ref()/btrfs_dec_ref()
Replace open-coded if/else blocks with the boolean directly and introduce
local const bool variables, making the code shorter and easier to read.

Signed-off-by: Sun YangKai <sunk67188@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2026-02-03 06:38:31 +01:00
Sun YangKai
8bfee251b7 btrfs: use true/false for boolean parameters in btrfs_inc_ref()/btrfs_dec_ref()
Replace integer literals 0/1 with true/false when calling
btrfs_inc_ref() and btrfs_dec_ref() to make the code self-documenting
and avoid mixing bool/integer types.

Signed-off-by: Sun YangKai <sunk67188@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2026-02-03 06:38:31 +01:00
Sun YangKai
53e8303149 btrfs: update comment for visit_node_for_delete()
Drop the obsolete @refs parameter from the comment so the argument list
matches the current function signature after commit f8c4d59de2
("btrfs: drop unused parameter refs from visit_node_for_delete()").

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Sun YangKai <sunk67188@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2026-02-03 06:38:31 +01:00
David Sterba
10934c131f btrfs: remaining BTRFS_PATH_AUTO_FREE conversions
Do the remaining btrfs_path conversion to the auto cleaning, this seems
to be the last one. Most of the conversions are trivial, only adding the
declaration and removing the freeing, or changing the goto patterns to
return.

There are some functions with many changes, like __btrfs_free_extent(),
btrfs_remove_from_free_space_tree() or btrfs_add_to_free_space_tree()
but it still follows the same pattern.

Signed-off-by: David Sterba <dsterba@suse.com>
2025-11-25 01:53:33 +01:00
Filipe Manana
e21756fc4a btrfs: use booleans for delalloc arguments and struct find_free_extent_ctl
The struct find_free_extent_ctl uses an int for the 'delalloc' field but
it's always used as a boolean, and its value is used to be passed to
several functions to signal if we are dealing with delalloc. The same goes
for the 'is_data' argument from btrfs_reserve_extent(). So change the type
from int to bool and move the field definition in the find_free_extent_ctl
structure so that it's close to other bool fields and reduces the size of
the structure from 144 down to 136 bytes (at the moment it's only declared
in the stack of btrfs_reserve_extent(), never allocated otherwise).

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-11-24 22:42:26 +01:00
Filipe Manana
d7fe41044b btrfs: use bool type for btrfs_path members used as booleans
Many fields of struct btrfs_path are used as booleans but their type is
an unsigned int (of one 1 bit width to save space). Change the type to
bool keeping the :1 suffix so that they combine with the previous u8
fields in order to save space. This makes the code more clear by using
explicit true/false and more in line with the preferred style, preserving
the size of the structure.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-11-24 22:42:25 +01:00
Qu Wenruo
c5667f9c8e btrfs: headers cleanup to remove unnecessary local includes
[BUG]
When I tried to remove btrfs_bio::fs_info and use btrfs_bio::inode to
grab the fs_info, the header "btrfs_inode.h" is needed to access the
full btrfs_inode structure.

Then btrfs will fail to compile.

[CAUSE]
There is a recursive including chain:

  "bio.h" -> "btrfs_inode.h" -> "extent_map.h" -> "compression.h" ->
  "bio.h"

That recursive including is causing problems for btrfs.

[ENHANCEMENT]
To reduce the risk of recursive including:

- Remove unnecessary local includes from btrfs headers
  Either the included header is pulled in by other headers, or is
  completely unnecessary.

- Remove btrfs local includes if the header only requires a pointer
  In that case let the implementing C file to pull the required header.

  This is especially important for headers like "btrfs_inode.h" which
  pulls in a lot of other btrfs headers, thus it's a mine field of
  recursive including.

- Remove unnecessary temporary structure definition
  Either if we have included the header defining the structure, or
  completely unused.

Now including "btrfs_inode.h" inside "bio.h" is completely fine,
although "btrfs_inode.h" still includes "extent_map.h", but that header
only includes "fs.h", no more chain back to "bio.h".

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-11-24 22:34:52 +01:00
Miquel Sabaté Solà
7ab5d01d58 btrfs: apply the AUTO_K(V)FREE macros throughout the code
Apply the AUTO_KFREE and AUTO_KVFREE macros wherever it makes
sense. Since this macro is expected to improve code readability, it has
been avoided in places where the lifetime of objects wasn't easy to
follow and a cleanup attribute would've made things worse; or when the
cleanup section of a function involved many other things and thus there
was no readability impact anyways. This change has also not been applied
in extremely short functions where readability was clearly not an issue.

Signed-off-by: Miquel Sabaté Solà <mssola@mssola.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-11-24 22:34:51 +01:00
Filipe Manana
8b6e1f5dce btrfs: remove pointless label and goto from unpin_extent_range()
There's no need to have an 'out' label and jump there in case we can
not find a block group. We can simply return directly since there are no
resources to release, removing the need for the label and the 'ret'
variable.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-11-24 22:16:23 +01:00
Filipe Manana
36574363b7 btrfs: reduce block group critical section in unpin_extent_range()
There's no need to update the bytes_pinned, bytes_readonly and
max_extent_size fields of the space_info while inside the critical section
delimited by the block group's lock. So move that out of the block group's
critical section, but sill inside the space_info's critical section.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-11-24 22:16:20 +01:00
Filipe Manana
4cb0abc1cf btrfs: change 'reserved' argument from pin_down_extent() to bool
It's used as a boolean, so convert it from int type to bool type.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-11-24 22:16:14 +01:00
Filipe Manana
8dcb8e4b11 btrfs: remove 'reserved' argument from btrfs_pin_extent()
All callers pass a value of 1 (true) to it, so remove it.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-11-24 22:16:10 +01:00
Filipe Manana
ec8022cd26 btrfs: use local variable for space_info in pin_down_extent()
Instead of dereferencing the block group multiple times to access its
space_info, use a local variable to shorten the code horizontal wise and
make it easier to read. Also, while at it, also rename the block group
argument from 'cache' to 'bg', as the cache name is confusing and it's
from the old days where the block group structure was named as
'btrfs_block_group_cache'.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-11-24 22:15:58 +01:00
Filipe Manana
585416766d btrfs: reduce block group critical section in pin_down_extent()
There's no need to update the bytes_reserved and bytes_may_use fields of
the space_info while holding the block group's spinlock. We are only
making the critical section longer than necessary. So move the space_info
updates outside of the block group's critical section.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-11-24 22:15:13 +01:00
Filipe Manana
af1e800c02 btrfs: use the key format macros when printing keys
Change all locations that print a key to use the new macros to print
them in order to ensure a consistent style and avoid repetitive code.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-11-24 22:03:02 +01:00
Filipe Manana
e96059c9d7 btrfs: remove fs_info argument from btrfs_dump_space_info()
We don't need it since we can grab fs_info from the given space_info.
So remove the fs_info argument.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <asj@kernel.org>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-11-24 21:59:10 +01:00
David Sterba
a929904cf7 btrfs: add unlikely annotations to branches leading to transaction abort
The unlikely() annotation is a static prediction hint that compiler may
use to reorder code out of hot path. We use it elsewhere (namely
tree-checker.c) for error branches that almost never happen.

Transaction abort is one such error, the btrfs_abort_transaction()
inlines code to check the state and print a warning, this ought to be
out of the hot path.

The most common pattern is when transaction abort is called after
checking a return value and the control flow leads to a quick return.
In other cases it may not be necessary to add unlikely() e.g. when the
function returns anyway or the control flow is not changed noticeably.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-09-23 08:49:26 +02:00
David Sterba
cc53bd2085 btrfs: add unlikely annotations to branches leading to EIO
The unlikely() annotation is a static prediction hint that compiler may
use to reorder code out of hot path. We use it elsewhere (namely
tree-checker.c) for error branches that almost never happen, where
EIO is one of them.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-09-23 08:49:26 +02:00
David Sterba
9264d004a6 btrfs: add unlikely annotations to branches leading to EUCLEAN
The unlikely() annotation is a static prediction hint that compiler may
use to reorder code out of hot path. We use it elsewhere (namely
tree-checker.c) for error branches that almost never happen, where
EUCLEAN (a corruption) is one of them.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-09-23 08:49:26 +02:00
David Sterba
17dc82dc1e btrfs: fix typos in comments and strings
Annual typo fixing pass. Strangely codespell found only about 30% of
what is in this patch, the rest was done manually using text
spellchecker with a custom dictionary of acceptable terms.

Reviewed-by: Neal Gompa <neal@gompa.dev>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-09-23 08:49:16 +02:00
David Sterba
67e78f983e btrfs: convert several int parameters to bool
We're almost done cleaning misused int/bool parameters. Convert a bunch
of them, found by manual grepping.  Note that btrfs_sync_fs() needs an
int as it's mandated by the struct super_operations prototype.

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-09-22 10:54:32 +02:00
Naohiro Aota
0d703963d2 btrfs: zoned: refine extent allocator hint selection
The hint block group selection in the extent allocator is wrong in the
first place, as it can select the dedicated data relocation block group for
the normal data allocation.

Since we separated the normal data space_info and the data relocation
space_info, we can easily identify a block group is for data relocation or
not. Do not choose it for the normal data allocation.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-09-22 10:54:31 +02:00
Boris Burkov
807d9023e7 btrfs: fix ssd_spread overallocation
If the ssd_spread mount option is enabled, then we run the so called
clustered allocator for data block groups. In practice, this results in
creating a btrfs_free_cluster which caches a block_group and borrows its
free extents for allocation.

Since the introduction of allocation size classes in 6.1, there has been
a bug in the interaction between that feature and ssd_spread.
find_free_extent() has a number of nested loops. The loop going over the
allocation stages, stored in ffe_ctl->loop and managed by
find_free_extent_update_loop(), the loop over the raid levels, and the
loop over all the block_groups in a space_info. The size class feature
relies on the block_group loop to ensure it gets a chance to see a
block_group of a given size class.  However, the clustered allocator
uses the cached cluster block_group and breaks that loop. Each call to
do_allocation() will really just go back to the same cached block_group.
Normally, this is OK, as the allocation either succeeds and we don't
want to loop any more or it fails, and we clear the cluster and return
its space to the block_group.

But with size classes, the allocation can succeed, then later fail,
outside of do_allocation() due to size class mismatch. That latter
failure is not properly handled due to the highly complex multi loop
logic. The result is a painful loop where we continue to allocate the
same num_bytes from the cluster in a tight loop until it fails and
releases the cluster and lets us try a new block_group. But by then, we
have skipped great swaths of the available block_groups and are likely
to fail to allocate, looping the outer loop. In pathological cases like
the reproducer below, the cached block_group is often the very last one,
in which case we don't perform this tight bg loop but instead rip
through the ffe stages to LOOP_CHUNK_ALLOC and allocate a chunk, which
is now the last one, and we enter the tight inner loop until an
allocation failure. Then allocation succeeds on the final block_group
and if the next allocation is a size mismatch, the exact same thing
happens again.

Triggering this is as easy as mounting with -o ssd_spread and then
running:

  mount -o ssd_spread $dev $mnt
  dd if=/dev/zero of=$mnt/big bs=16M count=1 &>/dev/null
  dd if=/dev/zero of=$mnt/med bs=4M count=1 &>/dev/null
  sync

if you do the two writes + sync in a loop, you can force btrfs to spin
an excessive amount on semi-successful clustered allocations, before
ultimately failing and advancing to the stage where we force a chunk
allocation. This results in 2G of data allocated per iteration, despite
only using ~20M of data. By using a small size classed extent, the inner
loop takes longer and we can spin for longer.

The simplest, shortest term fix to unbreak this is to make the clustered
allocator size_class aware in the dumbest way, where it fails on size
class mismatch. This may hinder the operation of the clustered
allocator, but better hindered than completely broken and terribly
overallocating.

Further re-design improvements are also in the works.

Fixes: 52bb7a2166 ("btrfs: introduce size class to block group allocator")
CC: stable@vger.kernel.org # 6.1+
Reported-by: David Sterba <dsterba@suse.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-22 01:12:52 +02:00
Filipe Manana
fd00922abc btrfs: add btrfs prefix to is_fstree() and make it return bool
This is an exported function and therefore it should have a 'btrfs_'
prefix, to make it clear it's btrfs specific, avoid future name collisions
with code outside btrfs, and make its naming consistent with most other
btrfs exported functions.

So add a 'btrfs_' prefix to it and make it return bool instead of int,
since all we need is to return true or false.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-21 23:58:04 +02:00
Filipe Manana
6fc5ef7829 btrfs: add btrfs prefix to free space tree exported functions
A few of the free space tree exported functions have a 'btrfs_' prefix in
their name, but most don't. Not only is this inconsistent, the preferred
style is to have such a prefix, to avoid potential collisions in the
future with other kernel code and offer a somewhat better readibility by
making it obvious in calls sites that we are calling btrfs specific code.

So add the 'btrfs_' prefix to all free space tree functions that are
missing it.

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-21 23:58:02 +02:00