Commit Graph

451 Commits

Author SHA1 Message Date
Qu Wenruo
a86a283430 btrfs: apply first key check for readahead when possible
Currently for tree block readahead we never pass a
btrfs_tree_parent_check with @has_first_key set.

Without @has_first_key set, btrfs will skip the following extra
checks:

- Header generation check
  This is a minor one.

- Empty leaf/node checks
  This is more serious, for certain trees like the csum tree, they are
  allowed to be empty, thus an empty leaf can pass the tree checker.
  But if there is a parent node for such an empty leaf, it indicates
  corruption.

  Without @has_first_key set, we can no longer detect such a problem.

  In fact there is already a fuzzed image report that a corrupted csum
  leaf which has zero nritems but still has a parent node can trigger
  a BUG_ON() during csum deletion.

However there are only two call sites of btrfs_readahead_tree_block():

- Inside relocate_tree_blocks()
  At this call site we are trying to grab the first key of the tree
  block, thus we are not able to pass a @first_key parameter.

- Inside btrfs_readahead_node_child()
  This is the more common call site, where we have the parent node and
  want to readahead the child tree blocks.

  In this case we can easily grab the node key and pass it for checks.

Add a new parameter @first_key to btrfs_readahead_tree_block() and pass
the node key to it inside btrfs_readahead_node_child().

This should plug the gap in empty leaf detection during readahead.

Link: https://lore.kernel.org/linux-btrfs/20260409071255.3358044-1-gality369@gmail.com/
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2026-04-21 04:01:24 +02:00
Leo Martins
e0a85137a8 btrfs: avoid GFP_ATOMIC allocations in qgroup free paths
When qgroups are enabled, __btrfs_qgroup_release_data() and
qgroup_free_reserved_data() pass an extent_changeset to
btrfs_clear_record_extent_bits() to track how many bytes had their
EXTENT_QGROUP_RESERVED bits cleared. Inside the extent IO tree spinlock,
add_extent_changeset() calls ulist_add() with GFP_ATOMIC to record each
changed range. If this allocation fails, it hits a BUG_ON and panics the
kernel.

However, both of these callers only read changeset.bytes_changed
afterwards — the range_changed ulist is populated and immediately freed
without ever being iterated. The GFP_ATOMIC allocation is entirely
unnecessary for these paths.

Introduce extent_changeset_init_bytes_only() which uses a sentinel value
(EXTENT_CHANGESET_BYTES_ONLY) on the ulist's prealloc field to signal
that only bytes_changed should be tracked. add_extent_changeset() checks
for this sentinel and returns early after updating bytes_changed,
skipping the ulist_add() call entirely. This eliminates the GFP_ATOMIC
allocation and makes the BUG_ON unreachable for these paths.

Callers that need range tracking (qgroup_reserve_data,
qgroup_unreserve_range, btrfs_qgroup_check_reserved_leak) continue to
use extent_changeset_init() and are unaffected.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Leo Martins <loemra.dev@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2026-04-07 18:56:05 +02:00
Leo Martins
f9a48549a1 btrfs: inhibit extent buffer writeback to prevent COW amplification
Inhibit writeback on COW'd extent buffers for the lifetime of the
transaction handle, preventing background writeback from setting
BTRFS_HEADER_FLAG_WRITTEN and causing unnecessary re-COW.

COW amplification occurs when background writeback flushes an extent
buffer that a transaction handle is still actively modifying. When
lock_extent_buffer_for_io() transitions a buffer from dirty to
writeback, it sets BTRFS_HEADER_FLAG_WRITTEN, marking the block as
having been persisted to disk at its current bytenr. Once WRITTEN is
set, should_cow_block() must either COW the block again or overwrite
it in place, both of which are unnecessary overhead when the buffer
is still being modified by the same handle that allocated it. By
inhibiting background writeback on actively-used buffers, WRITTEN is
never set while a transaction handle holds a reference to the buffer,
avoiding this overhead entirely.

Add an atomic_t writeback_inhibitors counter to struct extent_buffer,
which fits in an existing 6-byte hole without increasing struct size.
When a buffer is COW'd in btrfs_force_cow_block(), call
btrfs_inhibit_eb_writeback() to store the eb in the transaction
handle's writeback_inhibited_ebs xarray (keyed by eb->start), take a
reference, and increment writeback_inhibitors. The function handles
dedup (same eb inhibited twice by the same handle) and replacement
(different eb at the same logical address). Allocation failure is
graceful: the buffer simply falls back to the pre-existing behavior
where it may be written back and re-COW'd.

Also inhibit writeback in should_cow_block() when COW is skipped,
so that every transaction handle that reuses an already-COW'd buffer
also inhibits its writeback. Without this, if handle A COWs a block
and inhibits it, and handle B later reuses the same block without
inhibiting, handle A's uninhibit on end_transaction leaves the buffer
unprotected while handle B is still using it. This ensures all handles
that access a COW'd buffer contribute to the inhibitor count, and the
buffer remains protected until the last handle releases it.

In lock_extent_buffer_for_io(), when writeback_inhibitors is non-zero
and the writeback mode is WB_SYNC_NONE, skip the buffer. WB_SYNC_NONE
is used by the VM flusher threads for background and periodic
writeback, which are the only paths that cause COW amplification by
opportunistically writing out dirty extent buffers mid-transaction.
Skipping these is safe because the buffers remain dirty in the page
cache and will be written out at transaction commit time.

WB_SYNC_ALL must always proceed regardless of writeback_inhibitors.
This is required for correctness in the fsync path: btrfs_sync_log()
writes log tree blocks via filemap_fdatawrite_range() (WB_SYNC_ALL)
while the transaction handle that inhibited those same blocks is still
active. Without the WB_SYNC_ALL bypass, those inhibited log tree
blocks would be silently skipped, resulting in an incomplete log on
disk and corruption on replay. btrfs_write_and_wait_transaction()
also uses WB_SYNC_ALL via filemap_fdatawrite_range(); for that path,
inhibitors are already cleared beforehand, but the bypass ensures
correctness regardless.

Uninhibit in __btrfs_end_transaction() before atomic_dec(num_writers)
to prevent a race where the committer proceeds while buffers are still
inhibited. Also uninhibit in btrfs_commit_transaction() before writing
and in cleanup_transaction() for the error path.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Sun YangKai <sunk67188@gmail.com>
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Leo Martins <loemra.dev@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2026-04-07 18:56:00 +02:00
Filipe Manana
6ee5c986b0 btrfs: use the helper extent_buffer_uptodate() everywhere
Instead of open coding testing the uptodate bit on the extent buffer's
flags, use the existing helper extent_buffer_uptodate() (which is even
shorter to type). Also change the helper's return value from int to bool,
since we always use it in a boolean context.

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2026-04-07 18:55:55 +02:00
Linus Torvalds
bf4afc53b7 Convert 'alloc_obj' family to use the new default GFP_KERNEL argument
This was done entirely with mindless brute force, using

    git grep -l '\<k[vmz]*alloc_objs*(.*, GFP_KERNEL)' |
        xargs sed -i 's/\(alloc_objs*(.*\), GFP_KERNEL)/\1)/'

to convert the new alloc_obj() users that had a simple GFP_KERNEL
argument to just drop that argument.

Note that due to the extreme simplicity of the scripting, any slightly
more complex cases spread over multiple lines would not be triggered:
they definitely exist, but this covers the vast bulk of the cases, and
the resulting diff is also then easier to check automatically.

For the same reason the 'flex' versions will be done as a separate
conversion.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2026-02-21 17:09:51 -08:00
Kees Cook
69050f8d6d treewide: Replace kmalloc with kmalloc_obj for non-scalar types
This is the result of running the Coccinelle script from
scripts/coccinelle/api/kmalloc_objs.cocci. The script is designed to
avoid scalar types (which need careful case-by-case checking), and
instead replace kmalloc-family calls that allocate struct or union
object instances:

Single allocations:	kmalloc(sizeof(TYPE), ...)
are replaced with:	kmalloc_obj(TYPE, ...)

Array allocations:	kmalloc_array(COUNT, sizeof(TYPE), ...)
are replaced with:	kmalloc_objs(TYPE, COUNT, ...)

Flex array allocations:	kmalloc(struct_size(PTR, FAM, COUNT), ...)
are replaced with:	kmalloc_flex(*PTR, FAM, COUNT, ...)

(where TYPE may also be *VAR)

The resulting allocations no longer return "void *", instead returning
"TYPE *".

Signed-off-by: Kees Cook <kees@kernel.org>
2026-02-21 01:02:28 -08:00
Qu Wenruo
4e159150a9 btrfs: do not strictly require dirty metadata threshold for metadata writepages
[BUG]
There is an internal report that over 1000 processes are
waiting at the io_schedule_timeout() of balance_dirty_pages(), causing
a system hang and trigger a kernel coredump.

The kernel is v6.4 kernel based, but the root problem still applies to
any upstream kernel before v6.18.

[CAUSE]
From Jan Kara for his wisdom on the dirty page balance behavior first.

  This cgroup dirty limit was what was actually playing the role here
  because the cgroup had only a small amount of memory and so the dirty
  limit for it was something like 16MB.

  Dirty throttling is responsible for enforcing that nobody can dirty
  (significantly) more dirty memory than there's dirty limit. Thus when
  a task is dirtying pages it periodically enters into balance_dirty_pages()
  and we let it sleep there to slow down the dirtying.

  When the system is over dirty limit already (either globally or within
  a cgroup of the running task), we will not let the task exit from
  balance_dirty_pages() until the number of dirty pages drops below the
  limit.

  So in this particular case, as I already mentioned, there was a cgroup
  with relatively small amount of memory and as a result with dirty limit
  set at 16MB. A task from that cgroup has dirtied about 28MB worth of
  pages in btrfs btree inode and these were practically the only dirty
  pages in that cgroup.

So that means the only way to reduce the dirty pages of that cgroup is
to writeback the dirty pages of btrfs btree inode, and only after that
those processes can exit balance_dirty_pages().

Now back to the btrfs part, btree_writepages() is responsible for
writing back dirty btree inode pages.

The problem here is, there is a btrfs internal threshold that if the
btree inode's dirty bytes are below the 32M threshold, it will not
do any writeback.

This behavior is to batch as much metadata as possible so we won't write
back those tree blocks and then later re-COW them again for another
modification.

This internal 32MiB is higher than the existing dirty page size (28MiB),
meaning no writeback will happen, causing a deadlock between btrfs and
cgroup:

- Btrfs doesn't want to write back btree inode until more dirty pages

- Cgroup/MM doesn't want more dirty pages for btrfs btree inode
  Thus any process touching that btree inode is put into sleep until
  the number of dirty pages is reduced.

Thanks Jan Kara a lot for the analysis of the root cause.

[ENHANCEMENT]
Since kernel commit b55102826d ("btrfs: set AS_KERNEL_FILE on the
btree_inode"), btrfs btree inode pages will only be charged to the root
cgroup which should have a much larger limit than btrfs' 32MiB
threshold.
So it should not affect newer kernels.

But for all current LTS kernels, they are all affected by this problem,
and backporting the whole AS_KERNEL_FILE may not be a good idea.

Even for newer kernels I still think it's a good idea to get
rid of the internal threshold at btree_writepages(), since for most cases
cgroup/MM has a better view of full system memory usage than btrfs' fixed
threshold.

For internal callers using btrfs_btree_balance_dirty() since that
function is already doing internal threshold check, we don't need to
bother them.

But for external callers of btree_writepages(), just respect their
requests and write back whatever they want, ignoring the internal
btrfs threshold to avoid such deadlock on btree inode dirty page
balancing.

CC: stable@vger.kernel.org
CC: Jan Kara <jack@suse.cz>
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2026-01-21 19:35:00 +01:00
Qu Wenruo
c5667f9c8e btrfs: headers cleanup to remove unnecessary local includes
[BUG]
When I tried to remove btrfs_bio::fs_info and use btrfs_bio::inode to
grab the fs_info, the header "btrfs_inode.h" is needed to access the
full btrfs_inode structure.

Then btrfs will fail to compile.

[CAUSE]
There is a recursive including chain:

  "bio.h" -> "btrfs_inode.h" -> "extent_map.h" -> "compression.h" ->
  "bio.h"

That recursive including is causing problems for btrfs.

[ENHANCEMENT]
To reduce the risk of recursive including:

- Remove unnecessary local includes from btrfs headers
  Either the included header is pulled in by other headers, or is
  completely unnecessary.

- Remove btrfs local includes if the header only requires a pointer
  In that case let the implementing C file to pull the required header.

  This is especially important for headers like "btrfs_inode.h" which
  pulls in a lot of other btrfs headers, thus it's a mine field of
  recursive including.

- Remove unnecessary temporary structure definition
  Either if we have included the header defining the structure, or
  completely unused.

Now including "btrfs_inode.h" inside "bio.h" is completely fine,
although "btrfs_inode.h" still includes "extent_map.h", but that header
only includes "fs.h", no more chain back to "bio.h".

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-11-24 22:34:52 +01:00
Qu Wenruo
c2ffb1ec1a btrfs: prepare compression folio alloc/free for bs > ps cases
This includes the following preparation for bs > ps cases:

- Always alloc/free the folio directly if bs > ps
  This adds a new @fs_info parameter for btrfs_alloc_compr_folio(), thus
  affecting all compression algorithms.

  For btrfs_free_compr_folio() it needs no parameter for now, as we can
  use the folio size to skip the caching part.

  For now the change is just to passing a @fs_info into the function,
  all the folio size assumption is still based on page size.

- Properly zero the last folio in compress_file_range()
  Since the compressed folios can be larger than a page, we need to
  properly zero the whole folio.

- Use correct folio size for btrfs_add_compressed_bio_folios()
  Instead of page size, use the correct folio size.

- Use correct folio size/shift for btrfs_compress_filemap_get_folio()
  As we are not only using simple page sized folios anymore.

- Use correct folio size for btrfs_decompress()
  There is an ASSERT() making sure the decompressed range is no larger
  than a page, which will be triggered for bs > ps cases.

- Skip readahead for compressed pages
  Similar to subpage cases.

- Make btrfs_alloc_folio_array() to accept a new @order parameter

- Add a helper to calculate the minimal folio size

All those changes should not affect the existing bs <= ps handling.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-09-23 08:49:24 +02:00
Filipe Manana
790b88c4dd btrfs: make extent_buffer_test_bit() return a boolean instead
All the callers want is to determine if a bit is set and all of them call
the function and do a double negation (!!) on its result to get a boolean.
So change it to return a boolean and simplify callers.

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-21 23:58:02 +02:00
Filipe Manana
b769777d92 btrfs: use refcount_t type for the extent buffer reference counter
Instead of using a bare atomic, use the refcount_t type, which despite
being a structure that contains only an atomic, has an API that checks
for underflows and other hazards. This doesn't change the size of the
extent_buffer structure.

This removes the need to do things like this:

    WARN_ON(atomic_read(&eb->refs) == 0);
    if (atomic_dec_and_test(&eb->refs)) {
        (...)
    }

And do just:

    if (refcount_dec_and_test(&eb->refs)) {
        (...)
    }

Since refcount_dec_and_test() already triggers a warning when we decrement
a ref count that has a value of 0 (or below zero).

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-21 23:53:30 +02:00
Josef Bacik
5e121ae687 btrfs: use buffer xarray for extent buffer writeback operations
Currently we have this ugly back and forth with the btree writeback
where we find the folio, find the eb associated with that folio, and
then attempt to writeback.  This results in two different paths for
subpage ebs and >= page size ebs.

Clean this up by adding our own infrastructure around looking up tagged
ebs and writing the ebs out directly.  This allows us to unify the
subpage and >= pagesize IO paths, resulting in a much cleaner writeback
path for extent buffers.

I ran this through fsperf on a VM with 8 CPUs and 16GiB of RAM.  I used
smallfiles100k, but reduced the files to 1k to make it run faster, the
results are as follows, with the statistically significant improvements
marked with *, there were no regressions.  fsperf was run with -n 10 for
both runs, so the baseline is the average 10 runs and the test is the
average of 10 runs.

smallfiles100k results
      metric           baseline       current        stdev            diff
================================================================================
avg_commit_ms               68.58         58.44          3.35   -14.79% *
commits                    270.60        254.70         16.24    -5.88%
dev_read_iops                  48            48             0     0.00%
dev_read_kbytes              1044          1044             0     0.00%
dev_write_iops          866117.90     850028.10      14292.20    -1.86%
dev_write_kbytes      10939976.40   10605701.20     351330.32    -3.06%
elapsed                     49.30            33          1.64   -33.06% *
end_state_mount_ns    41251498.80   35773220.70    2531205.32   -13.28% *
end_state_umount_ns      1.90e+09      1.50e+09   14186226.85   -21.38% *
max_commit_ms                 139        111.60          9.72   -19.71% *
sys_cpu                      4.90          3.86          0.88   -21.29%
write_bw_bytes        42935768.20   64318451.10    1609415.05    49.80% *
write_clat_ns_mean      366431.69     243202.60      14161.98   -33.63% *
write_clat_ns_p50        49203.20         20992        264.40   -57.34% *
write_clat_ns_p99          827392     653721.60      65904.74   -20.99% *
write_io_kbytes           2035940       2035940             0     0.00%
write_iops               10482.37      15702.75        392.92    49.80% *
write_lat_ns_max         1.01e+08      90516129    3910102.06   -10.29% *
write_lat_ns_mean       366556.19     243308.48      14154.51   -33.62% *

As you can see we get about a 33% decrease runtime, with a 50%
throughput increase, which is pretty significant.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15 14:30:50 +02:00
David Sterba
05a6ec865d btrfs: use unsigned types for constants defined as bit shifts
The unsigned type is a recommended practice (CWE-190, CWE-194) for bit
shifts to avoid problems with potential unwanted sign extensions.
Although there are no such cases in btrfs codebase, follow the
recommendation.

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15 14:30:48 +02:00
Daniel Vacek
062f3d02a2 btrfs: remove unused flag EXTENT_BUFFER_IN_TREE
This flag is set after inserting the eb to the buffer tree and cleared
on it's removal.  It was added in commit 34b41acec1 ("Btrfs: use a
bit to track if we're in the radix tree") and wanted to make use of it,
faa2dbf004 ("Btrfs: add sanity tests for new qgroup accounting
code"). Both are 10+ years old, we can remove the flag.

Signed-off-by: Daniel Vacek <neelx@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15 14:30:39 +02:00
Daniel Vacek
c61660ec34 btrfs: remove unused flag EXTENT_BUFFER_CORRUPT
This flag is no longer being used.  It was added by commit a826d6dcb3
("Btrfs: check items for correctness as we search") but it's no longer
being used after commit f26c923860 ("btrfs: remove reada
infrastructure").

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Daniel Vacek <neelx@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15 14:30:39 +02:00
Daniel Vacek
350362e95f btrfs: remove unused flag EXTENT_BUFFER_READAHEAD
This flag is no longer being used.  It was added by commit ab0fff0305
("btrfs: add READAHEAD extent buffer flag") and used in commits:

79fb65a1f6 ("Btrfs: don't call readahead hook until we have read the entire eb")
78e62c02ab ("btrfs: Remove extent_io_ops::readpage_io_failed_hook")
371cdc0700 ("btrfs: introduce subpage metadata validation check")

Finally all the code using it was removed by commit f26c923860 ("btrfs: remove
reada infrastructure").

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Daniel Vacek <neelx@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15 14:30:39 +02:00
Daniel Vacek
40f47f6d72 btrfs: remove unused flag EXTENT_BUFFER_READ_ERR
This flag was added by commit 656f30dba7 ("Btrfs: be aware of btree
inode write errors to avoid fs corruption") but it stopped being used
after commit 046b562b20 ("btrfs: use a separate end_io handler for
read_extent_buffer").

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Daniel Vacek <neelx@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15 14:30:39 +02:00
Boris Burkov
d6fe0c69b3 btrfs: handle empty eb->folios in num_extent_folios()
num_extent_folios() unconditionally calls folio_order() on
eb->folios[0]. If that is NULL this will be a segfault. It is reasonable
for it to return 0 as the number of folios in the eb when the first
entry is NULL, so do that instead.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-02 13:20:08 +02:00
David Sterba
6e7c283863 btrfs: add __pure attribute to eb page and folio counters
The functions qualify for the pure attribute as they always return the
same value for the same argument (in the given scope). This allows to
optimize the calls and cache the value.

Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18 20:35:47 +01:00
David Sterba
db3a1bac3f btrfs: merge alloc_dummy_extent_buffer() helpers
After previous patch removing nodesize from parameters,
__alloc_dummy_extent_buffer() and alloc_dummy_extent_buffer() are
identical so we can drop one.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18 20:35:47 +01:00
David Sterba
551561c346 btrfs: don't pass nodesize to __alloc_extent_buffer()
All callers pass a valid fs_info so we can read the nodesize from that
instead of passing it as parameter.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18 20:35:47 +01:00
David Sterba
248c4ff393 btrfs: split waiting from read_extent_buffer_pages(), drop parameter wait
There are only 2 WAIT_* values left for wait parameter, we can encode
this to the function name if the waiting functionality is split.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13 14:53:23 +01:00
David Sterba
db9eef2ea8 btrfs: remove unused define WAIT_PAGE_LOCK for extent io
Last use was in the readahead code that got removed by f26c923860
("btrfs: remove reada infrastructure").

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13 14:53:23 +01:00
David Sterba
075adeeb92 btrfs: make wait_on_extent_buffer_writeback() static inline
The simple helper can be inlined, no need for the separate function.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13 14:53:22 +01:00
David Sterba
3a1c46dbc9 btrfs: open code set_page_extent_mapped()
The function set_page_extent_mapped() is now a simple wrapper so use the
folio helper.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13 14:53:22 +01:00
Li Zetao
046c0d6596 btrfs: convert try_release_extent_mapping() to take a folio
The old page API is being gradually replaced and converted to use folio
to improve code readability and avoid repeated conversion between page
and folio. And page_to_inode() can be replaced with folio_to_inode() now.

Signed-off-by: Li Zetao <lizetao1@huawei.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-10 16:51:21 +02:00
Li Zetao
b8ae2bfa68 btrfs: convert try_release_extent_buffer() to take a folio
The old page API is being gradually replaced and converted to use folio
to improve code readability and avoid repeated conversion between page
and folio.

Signed-off-by: Li Zetao <lizetao1@huawei.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-10 16:51:20 +02:00
Li Zetao
266a9361a4 btrfs: convert clear_page_extent_mapped() to take a folio
The old page API is being gradually replaced and converted to use folio
to improve code readability and avoid repeated conversion between page
and folio. Now clear_page_extent_mapped() can deal with a folio
directly, so change its name to clear_folio_extent_mapped().

Signed-off-by: Li Zetao <lizetao1@huawei.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-10 16:51:20 +02:00
Josef Bacik
01e11841f0 btrfs: convert extent_write_locked_range() to take a folio
This mostly uses folios, convert it to take a folio instead and update
the callers to pass in the folio.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-10 16:51:15 +02:00
Josef Bacik
a67f540582 btrfs: convert extent_clear_unlock_delalloc() to take a folio
Instead of taking the locked page, take the locked folio so we can pass
that into __process_folios_contig.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-10 16:51:14 +02:00
Josef Bacik
c987f1e6d4 btrfs: convert find_lock_delalloc_range() to use a folio
Instead of passing in a page for locked_page, pass in the folio instead.
We only use the folio itself to validate some range assumptions, and
then pass it into other functions.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-10 16:51:14 +02:00
Qu Wenruo
af61081fb5 btrfs: move extent_range_clear_dirty_for_io() into inode.c
The function is only used inside inode.c by compress_file_range(),
so move it to inode.c and unexport it.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11 15:52:25 +02:00
Qu Wenruo
0fbf6cbd72 btrfs: rename the extra_gfp parameter of btrfs_alloc_page_array()
There is only one caller utilizing the @extra_gfp parameter,
alloc_eb_folio_array().  And in that case the extra_gfp is only assigned
to __GFP_NOFAIL.

Rename the @extra_gfp parameter to @nofail to indicate that.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11 15:33:30 +02:00
Qu Wenruo
fea91134c2 btrfs: remove the extra_gfp parameter from btrfs_alloc_folio_array()
The function btrfs_alloc_folio_array() is only utilized in
btrfs_submit_compressed_read() and no other location, and the only
caller is not utilizing the @extra_gfp parameter.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11 15:33:30 +02:00
Boris Burkov
33336c1805 btrfs: preallocate ulist memory for qgroup rsv
When qgroups are enabled, during data reservation, we allocate the
ulist_nodes that track the exact reserved extents with GFP_ATOMIC
unconditionally. This is unnecessary, and we can follow the model
already employed by the struct extent_state we preallocate in the non
qgroups case, which should reduce the risk of allocation failures with
GFP_ATOMIC.

Add a prealloc node to struct ulist which ulist_add will grab when it is
present, and try to allocate it before taking the tree lock while we can
still take advantage of a less strict gfp mask. The lifetime of that
node belongs to the new prealloc field, until it is used, at which point
it belongs to the ulist linked list.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11 15:33:26 +02:00
David Sterba
2917f74102 btrfs: constify pointer parameters where applicable
We can add const to many parameters, this is for clarity and minor
addition to safety. There are some minor effects, in the assembly
code and .ko measured on release config. This patch does not cover all
possible conversions.

Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11 15:33:22 +02:00
Filipe Manana
8996f61ab9 btrfs: move fiemap code into its own file
Currently the core of the fiemap code lives in extent_io.c, which does
not make any sense because it's not related to extent IO at all (and it
was not as well before the big rewrite of fiemap I did some time ago).
The entry point for fiemap, btrfs_fiemap(), lives in inode.c since it's
an inode operation.

Since there's a significant amount of fiemap code, move all of it into a
dedicated file, including its entry point inode.c:btrfs_fiemap().

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11 15:33:20 +02:00
Josef Bacik
6b0a63a4fa btrfs: add a cached state to extent_clear_unlock_delalloc
Now that we have the lock_extent tightly coupled with
extent_clear_unlock_delalloc we can add a cached state to
extent_clear_unlock_delalloc and benefit from skipping the extra lookup
when we're doing cow.

Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07 21:31:10 +02:00
Filipe Manana
de6f14e83e btrfs: make try_release_extent_mapping() return a bool
Currently try_release_extent_mapping() as an int return type, but we
use it as a boolean. Its only caller, the release folio callback, also
returns a boolean which corresponds to try_release_extent_mapping()'s
return value. So change its return value type to bool as well as its
helper try_release_extent_state().

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07 21:31:07 +02:00
Qu Wenruo
11e03f2f4b btrfs: introduce btrfs_alloc_folio_array()
The new helper will do the same thing as btrfs_alloc_page_array(), but
with folios.

One extra difference is, there is no extra helper for bulk allocation,
thus it may not be as efficient as the page version.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07 21:31:02 +02:00
Filipe Manana
c66f2afc71 btrfs: remove pointless writepages callback wrapper
There's no point in having a static writepages callback in inode.c that
does nothing besides calling extent_writepages from extent_io.c.
So just remove the callback at inode.c and rename extent_writepages()
to btrfs_writepages().

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07 21:31:00 +02:00
Filipe Manana
7938d38b94 btrfs: remove pointless readahead callback wrapper
There's no point in having a static readahead callback in inode.c that
does nothing besides calling extent_readahead() from extent_io.c.
So just remove the callback at inode.c and rename extent_readahead()
to btrfs_readahead().

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07 21:31:00 +02:00
David Sterba
602035d7fe btrfs: add forward declarations and headers, part 2
Do a cleanup in more headers:

- add forward declarations for types referenced by pointers
- add includes when types need them

This fixes potential compilation problems if the headers are reordered
or the missing includes are not provided indirectly.

Signed-off-by: David Sterba <dsterba@suse.com>
2024-03-04 16:24:49 +01:00
Matthew Wilcox (Oracle)
dfba9f4773 btrfs: add set_folio_extent_mapped() helper
Turn set_page_extent_mapped() into a wrapper around this version.
Saves a call to compound_head() for callers who already have a folio
and removes a couple of users of page->mapping.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-03-04 16:24:45 +01:00
Filipe Manana
592a0ce9e2 btrfs: remove extent_map_tree forward declaration at extent_io.h
There's no need to do a forward declaration of struct extent_map_tree at
extent_io.h, as there are no function prototypes, inline functions or data
structures that refer to struct extent_map_tree.

So remove that forward declaration, which is not needed since commit
477a30ba5f ("btrfs: Sink extent_tree arguments in
try_release_extent_mapping").

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-03-04 16:24:45 +01:00
Qu Wenruo
84cda1a608 btrfs: cache folio size and shift in extent_buffer
After the conversion to folio interfaces (but without the patch to
enable larger folio allocation), there is an LTP report about observable
performance drop on metadata heavy operations.

https://lore.kernel.org/linux-btrfs/202312221750.571925bd-oliver.sang@intel.com/

This drop is caused by the extra code of calculating the
folio_size()/folio_shift(), instead of the old hard coded
PAGE_SIZE/PAGE_SHIFT.

To slightly reduce the overhead, just cache both folio_size and
folio_shift in extent_buffer.

The two new members (u32 folio_size and u8 folio_shift) are stored
inside the holes of extent_buffer. folio_size is shared with len, which
is reduced to u32. The size of eb does not change.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-03-04 16:24:45 +01:00
Qu Wenruo
8d99361835 btrfs: migrate get_eb_page_index() and get_eb_offset_in_page() to folios
These two functions are still using the old page based code, which is
not going to handle larger folios at all.

The migration itself is going to involve the following changes:

- PAGE_SIZE -> folio_size()
- PAGE_SHIFT -> folio_shift()
- get_eb_page_index() -> get_eb_folio_index()
- get_eb_offset_in_page() -> get_eb_offset_in_folio()

And since we're going to support larger folios, although above straight
conversion is good enough, this patch would add extra comments in the
involved functions to explain why the same single line code can now
cover 3 cases:

- folio_size == PAGE_SIZE, sectorsize == PAGE_SIZE, nodesize >= PAGE_SIZE
  The common, non-subpage case with per-page folio.

- folio_size > PAGE_SIZE, sectorsize == PAGE_SIZE, nodesize >= PAGE_SIZE
  The incoming larger folio, non-subpage case.

- folio_size == PAGE_SIZE, sectorsize < PAGE_SIZE, nodesize < PAGE_SIZE
  The existing subpage case, we won't larger folio anyway.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-15 23:03:58 +01:00
Qu Wenruo
13df3775ef btrfs: cleanup metadata page pointer usage
Although we have migrated extent_buffer::pages[] to folios[], we're
still mostly using the folio_page() help to grab the page.

This patch would do the following cleanups for metadata:

- Introduce num_extent_folios() helper
  This is to replace most num_extent_pages() callers.

- Use num_extent_folios() to iterate future large folios
  This allows us to use things like
  bio_add_folio()/bio_add_folio_nofail(), and only set the needed flags
  for the folio (aka the leading/tailing page), which reduces the loop
  iteration to 1 for large folios.

- Change metadata related functions to use folio pointers
  Including their function name, involving:
  * attach_extent_buffer_page()
  * detach_extent_buffer_page()
  * page_range_has_eb()
  * btrfs_release_extent_buffer_pages()
  * btree_clear_page_dirty()
  * btrfs_page_inc_eb_refs()
  * btrfs_page_dec_eb_refs()

- Change btrfs_is_subpage() to accept an address_space pointer
  This is to allow both page->mapping and folio->mapping to be utilized.
  As data is still using the old per-page code, and may keep so for a
  while.

- Special corner case place holder for future order mismatches between
  extent buffer and inode filemap
  For now it's  just a block of comments and a dead ASSERT(), no real
  handling yet.

The subpage code would still go page, just because subpage and large
folio are conflicting conditions, thus we don't need to bother subpage
with higher order folios at all. Just folio_page(folio, 0) would be
enough.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ minor styling tweaks ]
Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-15 23:01:04 +01:00
Qu Wenruo
082d5bb9b3 btrfs: migrate extent_buffer::pages[] to folio
For now extent_buffer::pages[] are still only accepting single page
pointer, thus we can migrate to folios pretty easily.

As for single page, page and folio are 1:1 mapped, including their page
flags.

This patch would just do the conversion from struct page to struct
folio, providing the first step to higher order folio in the future.

This conversion is pretty simple:

- extent_buffer::pages[] -> extent_buffer::folios[]

- page_address(eb->pages[i]) -> folio_address(eb->pages[i])

- eb->pages[i] -> folio_page(eb->folios[i], 0)

There would be more specific cleanups preparing for the incoming higher
order folio support.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-15 23:01:04 +01:00
Qu Wenruo
09e6cef19c btrfs: refactor alloc_extent_buffer() to allocate-then-attach method
Currently alloc_extent_buffer() utilizes find_or_create_page() to
allocate one page a time for an extent buffer.

This method has the following disadvantages:

- find_or_create_page() is the legacy way of allocating new pages
  With the new folio infrastructure, find_or_create_page() is just
  redirected to filemap_get_folio().

- Lacks the way to support higher order (order >= 1) folios
  As we can not yet let filemap give us a higher order folio.

This patch would change the workflow by the following way:

		Old		   |		new
-----------------------------------+-------------------------------------
                                   | ret = btrfs_alloc_page_array();
for (i = 0; i < num_pages; i++) {  | for (i = 0; i < num_pages; i++) {
    p = find_or_create_page();     |     ret = filemap_add_folio();
    /* Attach page private */      |     /* Reuse page cache if needed */
    /* Reused eb if needed */      |
				   |     /* Attach page private and
				   |        reuse eb if needed */
				   | }

By this we split the page allocation and private attaching into two
parts, allowing future updates to each part more easily, and migrate to
folio interfaces (especially for possible higher order folios).

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-15 23:01:04 +01:00