Commit Graph

113 Commits

Author SHA1 Message Date
Boris Burkov
fc3d532881 btrfs: btrfs_log_dev_io_error() on all bio errors
As far as I can tell, we never intentionally constrained ourselves to
these status codes, and it is misleading and surprising to lack the
bdev error logging when we get a different error code from the block
layer. This can lead to jumping to a wrong conclusion like "this
system didn't see any bio failures but aborted with EIO".

For example on nvme devices, I observe many failures coming back as
BLK_STS_MEDIUM. It is apparent that the nvme driver returns a variety of
BLK_STS_* status values in nvme_error_status().

So handle the known expected errors and make some noise on the rest
which we expect won't really happen.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Anand Jain <asj@kernel.org>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
2026-04-07 20:00:29 +02:00
Kees Cook
69050f8d6d treewide: Replace kmalloc with kmalloc_obj for non-scalar types
This is the result of running the Coccinelle script from
scripts/coccinelle/api/kmalloc_objs.cocci. The script is designed to
avoid scalar types (which need careful case-by-case checking), and
instead replace kmalloc-family calls that allocate struct or union
object instances:

Single allocations:	kmalloc(sizeof(TYPE), ...)
are replaced with:	kmalloc_obj(TYPE, ...)

Array allocations:	kmalloc_array(COUNT, sizeof(TYPE), ...)
are replaced with:	kmalloc_objs(TYPE, COUNT, ...)

Flex array allocations:	kmalloc(struct_size(PTR, FAM, COUNT), ...)
are replaced with:	kmalloc_flex(*PTR, FAM, COUNT, ...)

(where TYPE may also be *VAR)

The resulting allocations no longer return "void *", instead returning
"TYPE *".

Signed-off-by: Kees Cook <kees@kernel.org>
2026-02-21 01:02:28 -08:00
Qu Wenruo
8ceaad6cd6 btrfs: do not ASSERT() when the fs flips RO inside btrfs_repair_io_failure()
[BUG]
There is a bug report that when btrfs hits ENOSPC error in a critical
path, btrfs flips RO (this part is expected, although the ENOSPC bug
still needs to be addressed).

The problem is after the RO flip, if there is a read repair pending, we
can hit the ASSERT() inside btrfs_repair_io_failure() like the following:

  BTRFS info (device vdc): relocating block group 30408704 flags metadata|raid1
  ------------[ cut here ]------------
  BTRFS: Transaction aborted (error -28)
  WARNING: fs/btrfs/extent-tree.c:3235 at __btrfs_free_extent.isra.0+0x453/0xfd0, CPU#1: btrfs/383844
  Modules linked in: kvm_intel kvm irqbypass
  [...]
  ---[ end trace 0000000000000000 ]---
  BTRFS info (device vdc state EA): 2 enospc errors during balance
  BTRFS info (device vdc state EA): balance: ended with status: -30
  BTRFS error (device vdc state EA): parent transid verify failed on logical 30556160 mirror 2 wanted 8 found 6
  BTRFS error (device vdc state EA): bdev /dev/nvme0n1 errs: wr 0, rd 0, flush 0, corrupt 10, gen 0
  [...]
  assertion failed: !(fs_info->sb->s_flags & SB_RDONLY) :: 0, in fs/btrfs/bio.c:938
  ------------[ cut here ]------------
  assertion failed: !(fs_info->sb->s_flags & SB_RDONLY) :: 0, in fs/btrfs/bio.c:938
  kernel BUG at fs/btrfs/bio.c:938!
  Oops: invalid opcode: 0000 [#1] SMP NOPTI
  CPU: 0 UID: 0 PID: 868 Comm: kworker/u8:13 Tainted: G        W        N  6.19.0-rc6+ #4788 PREEMPT(full)
  Tainted: [W]=WARN, [N]=TEST
  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.17.0-0-gb52ca86e094d-prebuilt.qemu.org 04/01/2014
  Workqueue: btrfs-endio simple_end_io_work
  RIP: 0010:btrfs_repair_io_failure.cold+0xb2/0x120
  RSP: 0000:ffffc90001d2bcf0 EFLAGS: 00010246
  RAX: 0000000000000051 RBX: 0000000000001000 RCX: 0000000000000000
  RDX: 0000000000000000 RSI: ffffffff8305cf42 RDI: 00000000ffffffff
  RBP: 0000000000000002 R08: 00000000fffeffff R09: ffffffff837fa988
  R10: ffffffff8327a9e0 R11: 6f69747265737361 R12: ffff88813018d310
  R13: ffff888168b8a000 R14: ffffc90001d2bd90 R15: ffff88810a169000
  FS:  0000000000000000(0000) GS:ffff8885e752c000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  ------------[ cut here ]------------

[CAUSE]
The cause of -ENOSPC error during the test case btrfs/124 is still
unknown, although it's known that we still have cases where metadata can
be over-committed but can not be fulfilled correctly, thus if we hit
such ENOSPC error inside a critical path, we have no choice but abort
the current transaction.

This will mark the fs read-only.

The problem is inside the btrfs_repair_io_failure() path that we require
the fs not to be mount read-only. This is normally fine, but if we are
doing a read-repair meanwhile the fs flips RO due to a critical error,
we can enter btrfs_repair_io_failure() with super block set to
read-only, thus triggering the above crash.

[FIX]
Just replace the ASSERT() with a proper return if the fs is already
read-only.

Reported-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/linux-btrfs/20260126045555.GB31641@lst.de/
Tested-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2026-02-18 15:25:53 +01:00
Johannes Thumshirn
c757edbef9 btrfs: fix copying the flags of btrfs_bio after split
When a btrfs_bio gets split, only 'bbio->csum_search_commit_root' gets
copied to the new btrfs_bio, all the other flags don't.

When a bio is split in btrfs_submit_chunk(), btrfs_split_bio() creates
the new split bio via btrfs_bio_init() which zeroes the struct with
memset. Looking at btrfs_split_bio(), it copies csum_search_commit_root
from the original but does not copy can_use_append.

After the split, the code does:

    bbio = split;
    bio = &bbio->bio;

This means the split bio (with can_use_append = false) gets submitted,
not the original. In btrfs_submit_dev_bio(), the condition:

    if (btrfs_bio(bio)->can_use_append && btrfs_dev_is_sequential(...))

Will be false for the split bio even when writing to a sequential zone.
Does the split bio need to inherit can_use_append from the original? The
old code used a local variable use_append which persisted across the
split.

Copy the rest of the flags as well.

Link: https://lore.kernel.org/linux-btrfs/20260125132120.2525146-1-clm@meta.com/
Reported-by: Chris Mason <clm@meta.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2026-02-03 07:56:24 +01:00
Mark Harmstone
bbea42dfb9 btrfs: move existing remaps before relocating block group
If when relocating a block group we find that `remap_bytes` > 0 in its
block group item, that means that it has been the destination block
group for another that has been remapped.

We need to search the remap tree for any remap backrefs within this
range, and move the data to a third block group. This is because
otherwise btrfs_translate_remap() could end up following an unbounded
chain of remaps, which would only get worse over time.

We only relocate one block group at a time, so `remap_bytes` will only
ever go down while we are doing this. Once we're finished we set the
REMAPPED flag on the block group, which will permanently prevent any
other data from being moved to within it.

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Mark Harmstone <mark@harmstone.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2026-02-03 07:54:35 +01:00
Qu Wenruo
ae23fee41b btrfs: remove experimental offload csum mode
The offload csum mode was introduced to allow developers to compare the
performance of generating checksum for data writes at different timings:

- During btrfs_submit_chunk()
  This is the most common one, if any of the following condition is met
  we go this path:

  * The csum is fast
    For now it's CRC32C and xxhash.

  * It's a synchronous write

  * Zoned

- Delay the checksum generation to a workqueue

However since commit dd57c78aec ("btrfs: introduce
btrfs_bio::async_csum") we no longer need to bother any of them.

As if it's an experimental build, async checksum generation at the
background will be faster anyway.

And if not an experimental build, we won't even have the offload csum
mode support.

Considering the async csum will be the new default, let's remove the
offload csum mode code.

There will be no impact to end users, and offload csum mode is still
under experimental features.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2026-02-03 07:51:43 +01:00
Zhen Ni
fdb945f665 btrfs: simplify check for zoned NODATASUM writes in btrfs_submit_chunk()
This function already dereferences 'inode' multiple times earlier,
making the additional NULL check at line 840 redundant since the
function would have crashed already if inode were NULL.

After commit 81cea6cd70 ("btrfs: remove btrfs_bio::fs_info by
extracting it from btrfs_bio::inode"), the btrfs_bio::inode field is
mandatory for all btrfs_bio allocations and is guaranteed to be
non-NULL.

Simplify the condition for allocating dummy checksums for zoned
NODATASUM data by removing the unnecessary 'inode &&' check.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Zhen Ni <zhen.ni@easystack.cn>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2026-02-03 07:50:26 +01:00
Johannes Thumshirn
b39b26e017 btrfs: zoned: don't zone append to conventional zone
In case of a zoned RAID, it can happen that a data write is targeting a
sequential write required zone and a conventional zone. In this case the
bio will be marked as REQ_OP_ZONE_APPEND but for the conventional zone,
this needs to be REQ_OP_WRITE.

The setting of REQ_OP_ZONE_APPEND is deferred to the last possible time in
btrfs_submit_dev_bio(), but the decision if we can use zone append is
cached in btrfs_bio.

CC: Naohiro Aota <naohiro.aota@wdc.com>
Fixes: e9b9b911e0 ("btrfs: add raid stripe tree to features enabled with debug config")
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2026-02-03 06:38:31 +01:00
Josef Bacik
bd45e9e3f6 btrfs: add orig_logical to btrfs_bio for encryption
When checksumming the encrypted bio on writes we need to know which
logical address this checksum is for.  At the point where we get the
encrypted bio the bi_sector is the physical location on the target disk,
so we need to save the original logical offset in the btrfs_bio.  Then
we can use this when checksumming the bio instead of the
bio->iter.bi_sector.

Note: The patch was taken from v5 of fscrypt patchset
(https://lore.kernel.org/linux-btrfs/cover.1706116485.git.josef@toxicpanda.com/)
which was handled over time by various people: Omar Sandoval, Sweet Tea
Dorminy, Josef Bacik.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Daniel Vacek <neelx@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add note ]
Signed-off-by: David Sterba <dsterba@suse.com>
2025-11-25 01:52:23 +01:00
Qu Wenruo
ec20799064 btrfs: enable encoded read/write/send for bs > ps cases
Since the read verification and read repair are all supporting bs > ps
without large folios now, we can enable encoded read/write/send.

Now we can relax the alignment in assert_bbio_alignment() to
min(blocksize, PAGE_SIZE).
But also add the extra blocksize based alignment check for the logical
and length of the bbio.

There is a pitfall in btrfs_add_compress_bio_folios(), which relies on
the folios passed in to meet the minimal folio order.
But now we can pass regular page sized folios in, update it to check
each folio's size instead of using the minimal folio size.

This allows btrfs_add_compress_bio_folios() to even handle folios array
with different sizes, thankfully we don't yet need to handle such crazy
situation.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-11-24 22:42:24 +01:00
Qu Wenruo
052fd7a5ca btrfs: make read verification handle bs > ps cases without large folios
The current read verification is also relying on large folios to support
bs > ps cases, but that introduced quite some limits.

To enhance read-repair to support bs > ps without large folios:

- Make btrfs_data_csum_ok() to accept an array of paddrs
  Which can pass the paddrs[] direct into
  btrfs_calculate_block_csum_pages().

- Make repair_one_sector() to accept an array of paddrs
  So that it can submit a repair bio backed by regular pages, not only
  large folios.
  This requires us to allocate more slots at bio allocation time though.

  Also since the caller may have only partially advanced the saved_iter
  for bs > ps cases, we can not directly trust the logical bytenr from
  saved_iter (can be unaligned), thus a manual round down is necessary
  for the logical bytenr.

- Make btrfs_check_read_bio() to build an array of paddrs
  The tricky part is that we can only call btrfs_data_csum_ok() after
  all involved pages are assembled.

  This means at the call time of btrfs_check_read_bio(), our offset
  inside the bio is already at the end of the fs block.
  Thus we must re-calculate @bio_offset for btrfs_data_csum_ok() and
  repair_one_sector().

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-11-24 22:42:24 +01:00
Qu Wenruo
2574e90110 btrfs: make btrfs_repair_io_failure() handle bs > ps cases without large folios
Currently btrfs_repair_io_failure() only accept a single @paddr
parameter, and for bs > ps cases it's required that @paddr is backed by
a large folio.

That assumption has quite some limitations, preventing us from utilizing
true zero-copy direct-io and encoded read/writes.

To address the problem, enhance btrfs_repair_io_failure() by:

- Accept an array of paddrs, up to 64K / PAGE_SIZE entries
  This kind of acts like a bio_vec, but with very limited entries, as the
  function is only utilized to repair one fs data block, or a tree block.

  Both have an upper size limit (BTRFS_MAX_BLOCK_SIZE, i.e. 64K), so we
  don't need the full bio_vec thing to handle it.

- Allocate a bio with multiple slots
  Previously even for bs > ps cases, we only passed in a contiguous
  physical address range, thus a single slot will be enough.

  But not anymore, so we have to allocate a bio structure, other than
  using the on-stack one.

- Use on-stack memory to allocate @paddrs array
  It's at most 16 pages (4K page size, 64K block size), will take up at
  most 128 bytes.
  I think the on-stack cost is still acceptable.

- Add one extra check to make sure the repair bio is exactly one block

- Utilize btrfs_repair_io_failure() to submit a single bio for metadata
  This should improve the read-repair performance for metadata, as now
  we submit a node sized bio then wait, other than submit each block of
  the metadata and wait for each submitted block.

- Add one extra parameter indicating the step
  This is due to the fact that metadata step can be as large as
  nodesize, instead of sectorsize.
  So we need a way to distinguish metadata and data repair.

- Reduce the width of @length parameter of btrfs_repair_io_failure()
  Since we only call btrfs_repair_io_failure() on a single data or
  metadata block, u64 is overkilled.
  Use u32 instead and add one extra ASSERT()s to make sure the length
  never exceed BTRFS_MAX_BLOCK_SIZE.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-11-24 22:42:23 +01:00
Qu Wenruo
cfc7fe2b0f btrfs: use kvcalloc for btrfs_bio::csum allocation
[BUG]
There is a report that memory allocation failed for btrfs_bio::csum
during a large read:

  b2sum: page allocation failure: order:4, mode:0x40c40(GFP_NOFS|__GFP_COMP), nodemask=(null),cpuset=/,mems_allowed=0
  CPU: 0 UID: 0 PID: 416120 Comm: b2sum Tainted: G        W           6.17.0 #1 NONE
  Tainted: [W]=WARN
  Hardware name: Raspberry Pi 4 Model B Rev 1.5 (DT)
  Call trace:
   show_stack+0x18/0x30 (C)
   dump_stack_lvl+0x5c/0x7c
   dump_stack+0x18/0x24
   warn_alloc+0xec/0x184
   __alloc_pages_slowpath.constprop.0+0x21c/0x730
   __alloc_frozen_pages_noprof+0x230/0x260
   ___kmalloc_large_node+0xd4/0xf0
   __kmalloc_noprof+0x1c8/0x260
   btrfs_lookup_bio_sums+0x214/0x278
   btrfs_submit_chunk+0xf0/0x3c0
   btrfs_submit_bbio+0x2c/0x4c
   submit_one_bio+0x50/0xac
   submit_extent_folio+0x13c/0x340
   btrfs_do_readpage+0x4b0/0x7a0
   btrfs_readahead+0x184/0x254
   read_pages+0x58/0x260
   page_cache_ra_unbounded+0x170/0x24c
   page_cache_ra_order+0x360/0x3bc
   page_cache_async_ra+0x1a4/0x1d4
   filemap_readahead.isra.0+0x44/0x74
   filemap_get_pages+0x2b4/0x3b4
   filemap_read+0xc4/0x3bc
   btrfs_file_read_iter+0x70/0x7c
   vfs_read+0x1ec/0x2c0
   ksys_read+0x4c/0xe0
   __arm64_sys_read+0x18/0x24
   el0_svc_common.constprop.0+0x5c/0x130
   do_el0_svc+0x1c/0x30
   el0_svc+0x30/0xa0
   el0t_64_sync_handler+0xa0/0xe4
   el0t_64_sync+0x198/0x19c

[CAUSE]
Btrfs needs to allocate memory for btrfs_bio::csum for large reads, so
that we can later verify the contents of the read.

However nowadays a read bio can easily go beyond BIO_MAX_VECS *
PAGE_SIZE (which is 1M for 4K page sizes), due to the multi-page bvec
that one bvec can have more than one pages, as long as the pages are
physically adjacent.

This will become more common when the large folio support is moved out
of experimental features.

In the above case, a read larger than 4MiB with SHA256 checksum (32
bytes for each 4K block) will be able to trigger a order 4 allocation.

The order 4 is larger than PAGE_ALLOC_COSTLY_ORDER (3), thus without
extra flags such allocation will not retry.

And if the system has very small amount of memory (e.g. RPI4 with low
memory spec) or VMs with small vRAM, or the memory is heavily
fragmented, such allocation will fail and cause the above warning.

[FIX]
Although btrfs is handling the memory allocation failure correctly, we
do not really need the physically contiguous memory just to restore
our checksum.

In fact btrfs_csum_one_bio() is already using kvzalloc() to reduce the
memory pressure.

So follow the step to use kvcalloc() for btrfs_bio::csum.

Reported-by: Calvin Owens <calvin@wbinvd.org>
Link: https://lore.kernel.org/linux-btrfs/20251105180054.511528-1-calvin@wbinvd.org/
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-11-24 22:42:22 +01:00
Qu Wenruo
dd57c78aec btrfs: introduce btrfs_bio::async_csum
[ENHANCEMENT]
Btrfs currently calculates data checksums then submits the bio.

But after commit 968f19c5b1 ("btrfs: always fallback to buffered write
if the inode requires checksum"), any writes with data checksum will
fallback to buffered IO, meaning the content will not change during
writeback.

This means we're safe to calculate the data checksum and submit the bio
in parallel, and only need the following new behavior:

- Wait the csum generation to finish before calling btrfs_bio::end_io()
  Or this can lead to use-after-free for the csum generation worker.

- Save the current bi_iter for csum_one_bio()
  As the submission part can advance btrfs_bio::bio.bi_iter, if not
  saved csum_one_bio() may got an empty bi_iter and do not generate any
  checksum.

  Unfortunately this means we have to increase the size of btrfs_bio for
  16 bytes, but this is still acceptable.

As usual, such new feature is hidden behind the experimental flag.

[THEORETIC ANALYZE]
Consider the following theoretic hardware performance, which should be
more or less close to modern mainstream hardware:

	Memory bandwidth:	50GiB/s
	CRC32C bandwidth:	45GiB/s
	SSD bandwidth:		8GiB/s

Then write bandwidth with data checksum before the patch is:

	1 / ( 1 / 50 + 1 / 45 + 1 / 8) = 5.98 GiB/s

After the patch, the bandwidth is:

	1 / ( 1 / 50 + max( 1 / 45 + 1 / 8)) = 6.90 GiB/s

The difference is 15.32% improvement.

[REAL WORLD BENCHMARK]
I'm using a Zen5 (HX 370) as the host, the VM has 4GiB memory, 10 vCPUs, the
storage is backed by a PCIe gen3 x4 NVMe.

The test is a direct IO write, with 1MiB block size, write 7GiB data
into a btrfs mount with data checksum. Thus the direct write will
fallback to buffered one:

Vanilla Datasum:	1619.97 GiB/s
Patched Datasum:	1792.26 GiB/s
Diff			+10.6 %

In my case, the bottleneck is the storage, thus the improvement is not
reaching the theoretic one, but still some observable improvement.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-11-24 22:42:21 +01:00
Qu Wenruo
4591c3ef75 btrfs: make sure all btrfs_bio::end_io are called in task context
[BACKGROUND]
Btrfs has a lot of different bi_end_io functions, to handle different
raid profiles. But they introduced a lot of different contexts for
btrfs_bio::end_io() calls:

- Simple read bios
  Run in task context, backed by either endio_meta_workers or
  endio_workers.

- Simple write bios
  Run in IRQ context.

- RAID56 write or rebuild bios
  Run in task context, backed by rmw_workers.

- Mirrored write bios
  Run in irq context.

This is inconsistent, and contributes to the number of workqueues used
in btrfs.

[ENHANCEMENT]
Make all the above bios call their btrfs_bio::end_io() in task context,
backed by either endio_meta_workers for metadata, or endio_workers for
data.

For simple write bios, merge the handling into simple_end_io_work().

For mirrored write bios, it will be a little more complex, since both
the original or the cloned bios can run the final btrfs_bio::end_io().

Here we make sure the cloned bios are using btrfs_bioset, to reuse the
end_io_work, and run both original and cloned work inside the workqueue.

Add extra ASSERT()s to make sure btrfs_bio_end_io() is running in task
context.

This not only unifies the context for btrfs_bio::end_io() functions, but
also opens a new door for further btrfs_bio::end_io() related cleanups.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-11-24 22:40:21 +01:00
Qu Wenruo
81cea6cd70 btrfs: remove btrfs_bio::fs_info by extracting it from btrfs_bio::inode
Currently there is only one caller which doesn't populate
btrfs_bio::inode, and that's scrub.

The idea is scrub doesn't want any automatic csum verification nor
read-repair, as everything will be handled by scrub itself.

However that behavior is really no different than metadata inode, thus
we can reuse btree_inode as btrfs_bio::inode for scrub.

The only exception is in btrfs_submit_chunk() where if a bbio is from
scrub or data reloc inode, we set rst_search_commit_root to true.
This means we still need a way to distinguish scrub from metadata, but
that can be done by a new flag inside btrfs_bio.

Now btrfs_bio::inode is a mandatory parameter, we can extract fs_info
from that inode thus can remove btrfs_bio::fs_info to save 8 bytes from
btrfs_bio structure.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-11-24 22:40:16 +01:00
David Sterba
cc53bd2085 btrfs: add unlikely annotations to branches leading to EIO
The unlikely() annotation is a static prediction hint that compiler may
use to reorder code out of hot path. We use it elsewhere (namely
tree-checker.c) for error branches that almost never happen, where
EIO is one of them.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-09-23 08:49:26 +02:00
Qu Wenruo
e9bed72e88 btrfs: add extra ASSERT()s to catch unaligned bios
Btrfs uses btrfs_bio to handle read/write of logical address, for the
incoming bs > ps support, btrfs has extra requirements:

- One folio must contain at least one fs block
- No fs block can cross folio boundaries

This requirement is not hard to maintain, thanks to the address space's
minimal folio order.

But not all btrfs bios are generated through address space, e.g.
compression and scrub.

To catch possible unaligned bios, introduce a helper,
assert_bbio_alginment(), for each btrfs_bio in btrfs_submit_bbio().

This will check the following things:

- bv_offset is aligned to block size
- bv_len is aligned to block size

With a btrfs bio passing above checks, unless it's empty it will ensure
the requirements for bs > ps support.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-09-23 08:49:25 +02:00
Qu Wenruo
9afc617265 btrfs: introduce btrfs_bio_for_each_block() helper
Currently if we want to iterate a bio in block unit, we do something
like this:

	while (iter->bi_size) {
		struct bio_vec bv = bio_iter_iovec();

		/* Do something with using the bv */

		bio_advance_iter_single(&bbio->bio, iter, sectorsize);
	}

That's fine for now, but it will not handle future bs > ps, as
bio_iter_iovec() returns a single-page bvec, meaning the bv_len will not
exceed page size.

This means the code using that bv can only handle a block if bs <= ps.

To address this problem and handle future bs > ps cases better:

- Introduce a helper btrfs_bio_for_each_block()
  Instead of bio_vec, which has single and multiple page version and
  multiple page version has quite some limits, use my favorite way to
  represent a block, phys_addr_t.

  For bs <= ps cases, nothing is changed, except we will do a very
  small overhead to convert phys_addr_t to a folio, then use the proper
  folio helpers to handle the possible highmem cases.

  For bs > ps cases, all blocks will be backed by large folios, meaning
  every folio will cover at least one block. And still use proper folio
  helpers to handle highmem cases.

  With phys_addr_t, we will handle both large folio and highmem
  properly. So there is no better single variable to present a btrfs
  block than phys_addr_t.

- Extract the data block csum calculation into a helper
  The new helper, btrfs_calculate_block_csum() will be utilized by
  btrfs_csum_one_bio().

- Use btrfs_bio_for_each_block() to replace existing call sites
  Including:

  * index_one_bio() from raid56.c
    Very straight-forward.

  * btrfs_check_read_bio()
    Also update repair_one_sector() to grab the folio using phys_addr_t,
    and do extra checks to make sure the folio covers at least one
    block.
    We do not need to bother bv_len at all now.

  * btrfs_csum_one_bio()
    Now we can move the highmem handling into a dedicated helper,
    calculate_block_csum(), and use btrfs_bio_for_each_block() helper.

There is one exception in btrfs_decompress_buf2page(), which is copying
decompressed data into the original bio, which is not iterating using
block size thus we don't need to bother.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-09-23 08:49:17 +02:00
Qu Wenruo
35aff706dc btrfs: concentrate highmem handling for data verification
Currently for btrfs checksum verification, we do it in the following
pattern:

	kaddr = kmap_local_*();
	ret = btrfs_check_csum_csum(kaddr);
	kunmap_local(kaddr);

It's OK for now, but it's still not following the patterns of helpers
inside linux/highmem.h, which never requires a virt memory address.

In those highmem helpers, they mostly accept a folio, some offset/length
inside the folio, and in the implementation they check if the folio
needs partial kmap, and do the handling.

Inspired by those formal highmem helpers, enhance the highmem handling
of data checksum verification by:

- Rename btrfs_check_sector_csum() to btrfs_check_block_csum()
  To follow the more common term "block" used in all other major
  filesystems.

- Pass a physical address into btrfs_check_block_csum() and
  btrfs_data_csum_ok()
  The physical address is always available even for a highmem page.
  Since it's page frame number << PAGE_SHIFT + offset in page.

  And with that physical address, we can grab the folio covering the
  page, and do extra checks to ensure it covers at least one block.

  This also allows us to do the kmap inside btrfs_check_block_csum().
  This means all the extra HIGHMEM handling will be concentrated into
  btrfs_check_block_csum(), and no callers will need to bother highmem
  by themselves.

- Properly zero out the block if csum mismatch
  Since btrfs_data_csum_ok() only got a paddr, we can not and should not
  use memzero_bvec(), which only accepts single page bvec.
  Instead use paddr to grab the folio and call folio_zero_range()

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-09-23 08:49:16 +02:00
Boris Burkov
f07b855c56 btrfs: try to search for data csums in commit root
If you run a workload with:

- a cgroup that does tons of parallel data reading, with a working set
  much larger than its memory limit
- a second cgroup that writes relatively fewer files, with overwrites,
  with no memory limit
(see full code listing at the bottom for a reproducer)

Then what quickly occurs is:

- we have a large number of threads trying to read the csum tree
- we have a decent number of threads deleting csums running delayed refs
- we have a large number of threads in direct reclaim and thus high
  memory pressure

The result of this is that we writeback the csum tree repeatedly mid
transaction, to get back the extent_buffer folios for reclaim. As a
result, we repeatedly COW the csum tree for the delayed refs that are
deleting csums. This means repeatedly write locking the higher levels of
the tree.

As a result of this, we achieve an unpleasant priority inversion. We
have:

- a high degree of contention on the csum root node (and other upper
  nodes) eb rwsem
- a memory starved cgroup doing tons of reclaim on CPU.
- many reader threads in the memory starved cgroup "holding" the sem
  as readers, but not scheduling promptly. i.e., task __state == 0, but
  not running on a cpu.
- btrfs_commit_transaction stuck trying to acquire the sem as a writer.
  (running delayed_refs, deleting csums for unreferenced data extents)

This results in arbitrarily long transactions. This then results in
seriously degraded performance for any cgroup using the filesystem (the
victim cgroup in the script).

It isn't an academic problem, as we see this exact problem in production
at Meta with one cgroup over its memory limit ruining btrfs performance
for the whole system, stalling critical system services that depend on
btrfs syncs.

The underlying scheduling "problem" with global rwsems is sort of thorny
and apparently well known and was discussed at LPC 2024, for example.

As a result, our main lever in the short term is just trying to reduce
contention on our various rwsems with an eye to reducing the frequency
of write locking, to avoid disabling the read lock fast acquisition path.

Luckily, it seems likely that many reads are for old extents written
many transactions ago, and that for those we *can* in fact search the
commit root. The commit_root_sem only gets taken write once, near the
end of transaction commit, no matter how much memory pressure there is,
so we have much less contention between readers and writers.

This change detects when we are trying to read an old extent (according
to extent map generation) and then wires that through bio_ctrl to the
btrfs_bio, which unfortunately isn't allocated yet when we have this
information. When we go to lookup the csums in lookup_bio_sums we can
check this condition on the btrfs_bio and do the commit root lookup
accordingly.

Note that a single bio_ctrl might collect a few extent_maps into a single
bio, so it is important to track a maximum generation across all the
extent_maps used for each bio to make an accurate decision on whether it
is valid to look in the commit root. If any extent_map is updated in the
current generation, we can't use the commit root.

To test and reproduce this issue, I used the following script and
accompanying C program (to avoid bottlenecks in constantly forking
thousands of dd processes):

====== big-read.c ======
  #include <fcntl.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <sys/mman.h>
  #include <sys/stat.h>
  #include <unistd.h>
  #include <errno.h>

  #define BUF_SZ (128 * (1 << 10UL))

  int read_once(int fd, size_t sz) {
  	char buf[BUF_SZ];
  	size_t rd = 0;
  	int ret = 0;

  	while (rd < sz) {
  		ret = read(fd, buf, BUF_SZ);
  		if (ret < 0) {
  			if (errno == EINTR)
  				continue;
  			fprintf(stderr, "read failed: %d\n", errno);
  			return -errno;
  		} else if (ret == 0) {
  			break;
  		} else {
  			rd += ret;
  		}
  	}
  	return rd;
  }

  int read_loop(char *fname) {
  	int fd;
  	struct stat st;
  	size_t sz = 0;
  	int ret;

  	while (1) {
  		fd = open(fname, O_RDONLY);
  		if (fd == -1) {
  			perror("open");
  			return 1;
  		}
  		if (!sz) {
  			if (!fstat(fd, &st)) {
  				sz = st.st_size;
  			} else {
  				perror("stat");
  				return 1;
  			}
  		}

                  ret = read_once(fd, sz);
  		close(fd);
  	}
  }

  int main(int argc, char *argv[]) {
  	int fd;
  	struct stat st;
  	off_t sz;
  	char *buf;
  	int ret;

  	if (argc != 2) {
  		fprintf(stderr, "Usage: %s <filename>\n", argv[0]);
  		return 1;
  	}

  	return read_loop(argv[1]);
  }

====== repro.sh ======
  #!/usr/bin/env bash

  SCRIPT=$(readlink -f "$0")
  DIR=$(dirname "$SCRIPT")

  dev=$1
  mnt=$2
  shift
  shift

  CG_ROOT=/sys/fs/cgroup
  BAD_CG=$CG_ROOT/bad-nbr
  GOOD_CG=$CG_ROOT/good-nbr
  NR_BIGGOS=1
  NR_LITTLE=10
  NR_VICTIMS=32
  NR_VILLAINS=512

  START_SEC=$(date +%s)

  _elapsed() {
  	echo "elapsed: $(($(date +%s) - $START_SEC))"
  }

  _stats() {
  	local sysfs=/sys/fs/btrfs/$(findmnt -no UUID $dev)

  	echo "================"
  	date
  	_elapsed
  	cat $sysfs/commit_stats
  	cat $BAD_CG/memory.pressure
  }

  _setup_cgs() {
  	echo "+memory +cpuset" > $CG_ROOT/cgroup.subtree_control
  	mkdir -p $GOOD_CG
  	mkdir -p $BAD_CG
  	echo max > $BAD_CG/memory.max
  	# memory.high much less than the working set will cause heavy reclaim
  	echo $((1 << 30)) > $BAD_CG/memory.high

  	# victims get a subset of villain CPUs
  	echo 0 > $GOOD_CG/cpuset.cpus
  	echo 0,1,2,3 > $BAD_CG/cpuset.cpus
  }

  _kill_cg() {
  	local cg=$1
  	local attempts=0
  	echo "kill cgroup $cg"
  	[ -f $cg/cgroup.procs ] || return
  	while true; do
  		attempts=$((attempts + 1))
  		echo 1 > $cg/cgroup.kill
  		sleep 1
  		procs=$(wc -l $cg/cgroup.procs | cut -d' ' -f1)
  		[ $procs -eq 0 ] && break
  	done
  	rmdir $cg
  	echo "killed cgroup $cg in $attempts attempts"
  }

  _biggo_vol() {
  	echo $mnt/biggo_vol.$1
  }

  _biggo_file() {
  	echo $(_biggo_vol $1)/biggo
  }

  _subvoled_biggos() {
  	total_sz=$((10 << 30))
  	per_sz=$((total_sz / $NR_VILLAINS))
  	dd_count=$((per_sz >> 20))
  	echo "create $NR_VILLAINS subvols with a file of size $per_sz bytes for a total of $total_sz bytes."
  	for i in $(seq $NR_VILLAINS)
  	do
  		btrfs subvol create $(_biggo_vol $i) &>/dev/null
  		dd if=/dev/zero of=$(_biggo_file $i) bs=1M count=$dd_count &>/dev/null
  	done
  	echo "done creating subvols."
  }

  _setup() {
  	[ -f .done ] && rm .done
  	findmnt -n $dev && exit 1
        if [ -f .re-mkfs ]; then
		mkfs.btrfs -f -m single -d single $dev >/dev/null || exit 2
	else
		echo "touch .re-mkfs to populate the test fs"
	fi

  	mount -o noatime $dev $mnt || exit 3
  	[ -f .re-mkfs ] && _subvoled_biggos
  	_setup_cgs
  }

  _my_cleanup() {
  	echo "CLEANUP!"
  	_kill_cg $BAD_CG
  	_kill_cg $GOOD_CG
  	sleep 1
  	umount $mnt
  }

  _bad_exit() {
  	_err "Unexpected Exit! $?"
  	_stats
  	exit $?
  }

  trap _my_cleanup EXIT
  trap _bad_exit INT TERM

  _setup

  # Use a lot of page cache reading the big file
  _villain() {
  	local i=$1
  	echo $BASHPID > $BAD_CG/cgroup.procs
  	$DIR/big-read $(_biggo_file $i)
  }

  # Hit del_csum a lot by overwriting lots of small new files
  _victim() {
  	echo $BASHPID > $GOOD_CG/cgroup.procs
  	i=0;
  	while (true)
  	do
  		local tmp=$mnt/tmp.$i

  		dd if=/dev/zero of=$tmp bs=4k count=2 >/dev/null 2>&1
  		i=$((i+1))
  		[ $i -eq $NR_LITTLE ] && i=0
  	done
  }

  _one_sync() {
  	echo "sync..."
  	before=$(date +%s)
  	sync
  	after=$(date +%s)
  	echo "sync done in $((after - before))s"
  	_stats
  }

  # sync in a loop
  _sync() {
  	echo "start sync loop"
  	syncs=0
  	echo $BASHPID > $GOOD_CG/cgroup.procs
  	while true
  	do
  		[ -f .done ] && break
  		_one_sync
  		syncs=$((syncs + 1))
  		[ -f .done ] && break
  		sleep 10
  	done
  	if [ $syncs -eq 0 ]; then
  		echo "do at least one sync!"
  		_one_sync
  	fi
  	echo "sync loop done."
  }

  _sleep() {
  	local time=${1-60}
  	local now=$(date +%s)
  	local end=$((now + time))
  	while [ $now -lt $end ];
  	do
  		echo "SLEEP: $((end - now))s left. Sleep 10."
  		sleep 10
  		now=$(date +%s)
  	done
  }

  echo "start $NR_VILLAINS villains"
  for i in $(seq $NR_VILLAINS)
  do
  	_villain $i &
  	disown # get rid of annoying log on kill (done via cgroup anyway)
  done

  echo "start $NR_VICTIMS victims"
  for i in $(seq $NR_VICTIMS)
  do
  	_victim &
  	disown
  done

  _sync &
  SYNC_PID=$!

  _sleep $1
  _elapsed
  touch .done
  wait $SYNC_PID

  echo "OK"
  exit 0

Without this patch, that reproducer:

- Ran for 6+ minutes instead of 60s
- Hung hundreds of threads in D state on the csum reader lock
- Got a commit stuck for 3 minutes

sync done in 388s
================
Wed Jul  9 09:52:31 PM UTC 2025
elapsed: 420
commits 2
cur_commit_ms 0
last_commit_ms 159446
max_commit_ms 159446
total_commit_ms 160058
some avg10=99.03 avg60=98.97 avg300=75.43 total=418033386
full avg10=82.79 avg60=80.52 avg300=59.45 total=324995274

419 hits state R, D comms big-read
                 btrfs_tree_read_lock_nested
                 btrfs_read_lock_root_node
                 btrfs_search_slot
                 btrfs_lookup_csum
                 btrfs_lookup_bio_sums
                 btrfs_submit_bbio

1 hits state D comms btrfs-transacti
                 btrfs_tree_lock_nested
                 btrfs_lock_root_node
                 btrfs_search_slot
                 btrfs_del_csums
                 __btrfs_run_delayed_refs
                 btrfs_run_delayed_refs

With the patch, the reproducer exits naturally, in 65s, completing a
pretty decent 4 commits, despite heavy memory pressure. Occasionally you
can still trigger a rather long commit (couple seconds) but never one
that is minutes long.

sync done in 3s
================
elapsed: 65
commits 4
cur_commit_ms 0
last_commit_ms 485
max_commit_ms 689
total_commit_ms 2453
some avg10=98.28 avg60=64.54 avg300=19.39 total=64849893
full avg10=74.43 avg60=48.50 avg300=14.53 total=48665168

some random rwalker samples showed the most common stack in reclaim,
rather than the csum tree:
145 hits state R comms bash, sleep, dd, shuf
                 shrink_folio_list
                 shrink_lruvec
                 shrink_node
                 do_try_to_free_pages
                 try_to_free_mem_cgroup_pages
                 reclaim_high

Link: https://lpc.events/event/18/contributions/1883/
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-09-22 10:54:31 +02:00
David Sterba
80f4fab544 btrfs: switch RCU helper versions to btrfs_debug()
The RCU protection is now done in the plain helpers, we can remove the
"_in_rcu" and "_rl_in_rcu".

Reviewed-by: Daniel Vacek <neelx@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-21 23:56:38 +02:00
David Sterba
2eac2ae8b2 btrfs: switch RCU helper versions to btrfs_info()
The RCU protection is now done in the plain helpers, we can remove the
"_in_rcu" and "_rl_in_rcu".

Reviewed-by: Daniel Vacek <neelx@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-21 23:56:38 +02:00
Qu Wenruo
cc38d178ff btrfs: enable large data folio support under CONFIG_BTRFS_EXPERIMENTAL
With all the preparation patches already merged, it's pretty easy to
enable large data folios:

- Remove the ASSERT() on folio size in btrfs_end_repair_bio()

- Add a helper to properly set the max folio order
  Currently due to several call sites that are fetching the bitmap
  content directly into an unsigned long, we can only support
  BITS_PER_LONG blocks for each bitmap.

- Call the helper when reading/creating an inode

The support has the following limitations:

- No large folios for data reloc inode
  The relocation code still requires page sized folio.
  But it's not that hot nor common compared to regular buffered ios.

  Will be improved in the future.

- Requires CONFIG_BTRFS_EXPERIMENTAL

- Will require all folio related operations to check if it needs the
  extra btrfs_subpage structure
  Now any folio larger than block size will need btrfs_subpage structure
  handling.

Unfortunately I do not have a physical machine for performance test, but
if everything goes like XFS/EXT4, it should mostly bring single digits
percentage performance improvement in the real world.

Although I believe there are still quite some optimizations to be done,
let's focus on testing the current large data folio support first.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-21 23:53:30 +02:00
David Sterba
ccb42a6eed btrfs: constify more pointer parameters
Another batch of pointer parameter constifications. This is for clarity
and minor addition to type safety. There are no observable effects in the
assembly code and .ko measured on release config.

Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-21 23:53:26 +02:00
David Sterba
853b5727c9 btrfs: change return type of btrfs_alloc_dummy_sum() to int
The type blk_status_t is from block layer and not related to checksums
in our context. Use int internally and do the conversions to blk_status_t
as needed in btrfs_submit_chunk().

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15 14:30:49 +02:00
David Sterba
9c0b0807ec btrfs: rename error to ret in btrfs_submit_chunk()
We can now rename 'error' to 'ret' and use it for generic errors.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15 14:30:48 +02:00
David Sterba
beaa7cdb6a btrfs: rename ret to status in btrfs_submit_chunk()
We're using 'status' for the blk_status_t variables, rename 'ret' so we
can use it for proper return type.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15 14:30:48 +02:00
David Sterba
64c13195dd btrfs: change return type of btrfs_bio_csum() to int
The type blk_status_t is from block layer and not related to checksums
in our context. Use int internally and do the conversions to blk_status_t
as needed.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15 14:30:48 +02:00
David Sterba
a24d185c36 btrfs: change return type of btree_csum_one_bio() to int
The type blk_status_t is from block layer and not related to checksums
in our context. Use int internally and do the conversions to blk_status_t
as needed in btrfs_bio_csum().

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15 14:30:48 +02:00
David Sterba
9b20d242af btrfs: change return type of btrfs_csum_one_bio() to int
The type blk_status_t is from block layer and not related to checksums
in our context. Use int internally and do the conversions to blk_status_t
as needed in btrfs_bio_csum().

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15 14:30:48 +02:00
David Sterba
6f6e7e98b0 btrfs: change return type of btrfs_lookup_bio_sums() to int
The type blk_status_t is from block layer and not related to checksums
in our context. Use int internally and do the conversions to blk_status_t
as needed in btrfs_submit_chunk().

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15 14:30:48 +02:00
Christoph Hellwig
3240b2c97b btrfs: pass a physical address to btrfs_repair_io_failure()
Using physical address has the following advantages:

- All involved callers only need a single pointer
  Instead of the old @folio + @offset pair.

- No complex poking into the bio_vec structure
  As a bio_vec can be single or multiple paged, grabbing the real page
  can be quite complex if the bio_vec is a multi-page one.

  Instead bvec_phys() will always give a single physical address, and it
  cab be easily converted to a page.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15 14:30:46 +02:00
Yangtao Li
c900f415be btrfs: reuse exit helper for cleanup in btrfs_bioset_init()
Do not duplicate the cleanup after failed initialization
in btrfs_bioset_init() and reuse the exit function btrfs_bioset_exit().

Signed-off-by: Yangtao Li <frank.li@vivo.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15 14:30:46 +02:00
Filipe Manana
9024b744e7 btrfs: avoid unnecessary bio dereference at run_one_async_done()
We have dereferenced the async_submit_bio structure and extracted the bio
pointer into a local variable, so there's no need to dereference it again
when calling btrfs_bio_end_io(). Just use "bio->bi_status" instead of the
longer expression "async->bbio->bio.bi_status".

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18 20:35:50 +01:00
Filipe Manana
477a7a9c1f btrfs: move btrfs_cleanup_bio() code into its single caller
The btrfs_cleanup_bio() helper is trivial and has a single caller, there's
no point in having a dedicated helper function. So get rid of it and move
its code into the caller (btrfs_bio_end_io()).

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18 20:35:49 +01:00
Filipe Manana
530b601b91 btrfs: move __btrfs_bio_end_io() code into its single caller
The __btrfs_bio_end_io() helper is trivial and has a single caller, so
there's no point in having a dedicated helper function. Further the double
underscore prefix in the name is discouraged. So get rid of it and move
its code into the caller (btrfs_bio_end_io()).

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18 20:35:49 +01:00
Anand Jain
22fb0d99c9 btrfs: add tracking of read blocks for read policy
Track number of read blocks in the whole filesystem. The counter is
initialized when devices are opened. The counter is increased at
btrfs_submit_dev_bio() if the stats tracking is enabled (depends on the
read policy).  Stats tracking is disabled by default and is enabled
through fs_devices::collect_fs_stats when required.

The code is not under the EXPERIMENTAL define, as stats can be expanded
to include write counts and other performance counters, with the user
interface independent of its internal use.

This is an in-memory-only feature, not related to the dev error stats.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13 14:53:21 +01:00
Johannes Thumshirn
9c48bcec47 btrfs: cache RAID stripe tree decision in btrfs_io_context
Cache the decision if a particular I/O needs to update RAID stripe tree
entries in struct btrfs_io_context.

Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13 14:53:16 +01:00
Linus Torvalds
eabcdba3ad for-6.13-rc3-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmdhyQAACgkQxWXV+ddt
 WDuveg//bJSuXHrA7jkijst8rdoAFrceiUXuQPZ6bqb9QrSqlDZlP5/XQpdXZ3yU
 qJh/aE13cy0zWTQ2+fMcc770WSvU1cRW/f5BZ+fdXgvO8lS516suXGYd2Q06Cl9/
 DriAKGKtRfJn1BrEEv8+fjKS/chxZg6IR/W4kN6AinW31myY9jE5mEDAn+vyTDgQ
 8USZ/ar/3KuWo+wO5h5JzrvGnhzK0W0HRs/A0NZ3gG8J5T4yj+8zG0VJR4Gf93AL
 iBlsnAR8VzAYJOZCi36SD3j3/eDxJio5GhDYsdt28tk1bL8FqSuI4Yxt+LuiZ2Fg
 Cq/31lELEkyEH8AoVFm9pX3HNyRmV6JhpvDXiyofHaOUZ3VeivVE59gOShLUUMkn
 f9Pl/uh5/t/ioWWHBnCMyRpI9GZUGCvW24k7HjT7QZhsDGFLTm07diCiRgZ7eaOu
 LZRKMOL5jifAnfxNSvIJV19H4lQLTZfbdjmJyb6Il39tIU/1U9pXicgih3iyidW2
 N5n4pHf3OQFwG8kNw1mR1g1CPBALP62ja8kMv//IgH4YXXnm1Mo7B3CcJogAAmo4
 HB9f/gFqZ8kWaiuIUJKfPZkkLFt5x0TNZQyyOhVUd7V4mFdtEzVtZRWo3juYuLGk
 7Shp/MTlYokwnEropiWHU5ab3Bb9vLxlh8daGK/OmwBz01DaApI=
 =AAmb
 -----END PGP SIGNATURE-----

Merge tag 'for-6.13-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:

 - tree-checker catches invalid number of inline extent references

 - zoned mode fixes:
    - enhance zone append IO command so it also detects emulated writes
    - handle bio splitting at sectorsize boundary

 - when deleting a snapshot, fix a condition for visiting nodes in reloc
   trees

* tag 'for-6.13-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: tree-checker: reject inline extent items with 0 ref count
  btrfs: split bios to the fs sector size boundary
  btrfs: use bio_is_zone_append() in the completion handler
  btrfs: fix improper generation check in snapshot delete
2024-12-18 14:17:21 -08:00
Christoph Hellwig
be691b5e59 btrfs: split bios to the fs sector size boundary
Btrfs like other file systems can't really deal with I/O not aligned to
it's internal block size (which strangely is called sector size in
btrfs, for historical reasons), but the block layer split helper doesn't
even know about that.

Round down the split boundary so that all I/Os are aligned.

Fixes: d5e4377d50 ("btrfs: split zone append bios in btrfs_submit_bio")
CC: stable@vger.kernel.org # 6.12
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-12-17 19:54:32 +01:00
Christoph Hellwig
6c3864e055 btrfs: use bio_is_zone_append() in the completion handler
Otherwise it won't catch bios turned into regular writes by the block
level zone write plugging. The additional test it adds is for emulated
zone append.

Fixes: 9b1ce7f0c6 ("block: Implement zone append emulation")
CC: stable@vger.kernel.org # 6.12
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-12-17 19:54:32 +01:00
Johannes Thumshirn
c7c97ceff9 btrfs: handle bio_split() errors
Commit e546fe1da9 ("block: Rework bio_split() return value") changed
bio_split() so that it can return errors.

Add error handling for it in btrfs_split_bio() and ultimately
btrfs_submit_chunk(). As the bio is not submitted, the bio counter must
be decremented to pair btrfs_bio_counter_inc_blocked().

Reviewed-by: John Garry <john.g.garry@oracle.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-12-06 15:04:13 +01:00
Qu Wenruo
67cd3f2217 btrfs: split out CONFIG_BTRFS_EXPERIMENTAL from CONFIG_BTRFS_DEBUG
Currently CONFIG_BTRFS_EXPERIMENTAL is not only for the extra debugging
output, but also for experimental features.

This is not ideal to distinguish planned but not yet stable features
from those purely designed for debugging.

This patch splits the following features into CONFIG_BTRFS_EXPERIMENTAL:

- Extent map shrinker
  This seems to be the first one to exit experimental.

- Extent tree v2
  This seems to be the last one to graduate from experimental.

- Raid stripe tree
- Csum offload mode
- Send protocol v3

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-11-11 14:34:12 +01:00
Linus Torvalds
6b4926494e for-6.12-rc5-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmck8eQACgkQxWXV+ddt
 WDu05g/6AwrnvPkivC4iVOv4Wkzrpk4gm76smx91Y9B8tSDLI1pHaS27CvJz9iWl
 vBKXPN3PQVQHwo6SPn+NjsFOSMkXlbBOVKpPU+MlZwH9Tuw66qcC+EnUCK2wEuAy
 3TN7cUGIA4r/j+SkhgIz+Irlr5pjdb1KkPIMBEVGcVFqDIuvDaTEGBqTn2i/V5aa
 dMn+gK+9rfngTOJ68t/pEFaX7SEWCvgMIcBpBB4/vs1gHm3ve2bcc1sBAdMxb1Se
 SrxgZfq+Rc5tkMn540JaWGwkb0rLzwXlurK6ygTKDKCpH0IMX+pBvDkexh9Zj0ux
 jejlRxiuDzTx3z2a7FjHDyp2sdZWMpq3sPsowpJ1Dsgi5EtSxTy4irmQuSAZY1Uj
 /uo6YwV9aTGeiNDwZeKqKc/wOuAttaMZLr14s37pro9KxndFJ/XZBxeyB+euUCOw
 B8AvAQVVIJAYQLyWINWruNKppqlgiO2RaN15RvvT2pX01d0TOx1KX1XFQku7YFxb
 M/8ZNXzJ96XtkeyHL3euo3zj7N5jWtnCvPINugUG1ADQa+bc8aX336gld1neD6fs
 QqIFIgzZG0l4N95viJilACrI6tW9zFnBqMyNFRhucKiX9aP9glOvhSfxfjcpDuQ/
 i/LIyxVLwp8M3hPNvv8tC345+1C2ug9AD0OyhWjjIYPuiOxtTWs=
 =alpB
 -----END PGP SIGNATURE-----

Merge tag 'for-6.12-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:
 "A few more stability fixes. There's one patch adding export of MIPS
  cmpxchg helper, used in the error propagation fix.

   - fix error propagation from split bios to the original btrfs bio

   - fix merging of adjacent extents (normal operation, defragmentation)

   - fix potential use after free after freeing btrfs device structures"

* tag 'for-6.12-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: fix defrag not merging contiguous extents due to merged extent maps
  btrfs: fix extent map merging not happening for adjacent extents
  btrfs: fix use-after-free of block device file in __btrfs_free_extra_devids()
  btrfs: fix error propagation of split bios
  MIPS: export __cmpxchg_small()
2024-11-01 07:31:47 -10:00
Naohiro Aota
d48e1dea39 btrfs: fix error propagation of split bios
The purpose of btrfs_bbio_propagate_error() shall be propagating an error
of split bio to its original btrfs_bio, and tell the error to the upper
layer. However, it's not working well on some cases.

* Case 1. Immediate (or quick) end_bio with an error

When btrfs sends btrfs_bio to mirrored devices, btrfs calls
btrfs_bio_end_io() when all the mirroring bios are completed. If that
btrfs_bio was split, it is from btrfs_clone_bioset and its end_io function
is btrfs_orig_write_end_io. For this case, btrfs_bbio_propagate_error()
accesses the orig_bbio's bio context to increase the error count.

That works well in most cases. However, if the end_io is called enough
fast, orig_bbio's (remaining part after split) bio context may not be
properly set at that time. Since the bio context is set when the orig_bbio
(the last btrfs_bio) is sent to devices, that might be too late for earlier
split btrfs_bio's completion.  That will result in NULL pointer
dereference.

That bug is easily reproducible by running btrfs/146 on zoned devices [1]
and it shows the following trace.

[1] You need raid-stripe-tree feature as it create "-d raid0 -m raid1" FS.

  BUG: kernel NULL pointer dereference, address: 0000000000000020
  #PF: supervisor read access in kernel mode
  #PF: error_code(0x0000) - not-present page
  PGD 0 P4D 0
  Oops: Oops: 0000 [#1] PREEMPT SMP PTI
  CPU: 1 UID: 0 PID: 13 Comm: kworker/u32:1 Not tainted 6.11.0-rc7-BTRFS-ZNS+ #474
  Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
  Workqueue: writeback wb_workfn (flush-btrfs-5)
  RIP: 0010:btrfs_bio_end_io+0xae/0xc0 [btrfs]
  BTRFS error (device dm-0): bdev /dev/mapper/error-test errs: wr 2, rd 0, flush 0, corrupt 0, gen 0
  RSP: 0018:ffffc9000006f248 EFLAGS: 00010246
  RAX: 0000000000000000 RBX: ffff888005a7f080 RCX: ffffc9000006f1dc
  RDX: 0000000000000000 RSI: 000000000000000a RDI: ffff888005a7f080
  RBP: ffff888011dfc540 R08: 0000000000000000 R09: 0000000000000001
  R10: ffffffff82e508e0 R11: 0000000000000005 R12: ffff88800ddfbe58
  R13: ffff888005a7f080 R14: ffff888005a7f158 R15: ffff888005a7f158
  FS:  0000000000000000(0000) GS:ffff88803ea80000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 0000000000000020 CR3: 0000000002e22006 CR4: 0000000000370ef0
  DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
  DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
  Call Trace:
   <TASK>
   ? __die_body.cold+0x19/0x26
   ? page_fault_oops+0x13e/0x2b0
   ? _printk+0x58/0x73
   ? do_user_addr_fault+0x5f/0x750
   ? exc_page_fault+0x76/0x240
   ? asm_exc_page_fault+0x22/0x30
   ? btrfs_bio_end_io+0xae/0xc0 [btrfs]
   ? btrfs_log_dev_io_error+0x7f/0x90 [btrfs]
   btrfs_orig_write_end_io+0x51/0x90 [btrfs]
   dm_submit_bio+0x5c2/0xa50 [dm_mod]
   ? find_held_lock+0x2b/0x80
   ? blk_try_enter_queue+0x90/0x1e0
   __submit_bio+0xe0/0x130
   ? ktime_get+0x10a/0x160
   ? lockdep_hardirqs_on+0x74/0x100
   submit_bio_noacct_nocheck+0x199/0x410
   btrfs_submit_bio+0x7d/0x150 [btrfs]
   btrfs_submit_chunk+0x1a1/0x6d0 [btrfs]
   ? lockdep_hardirqs_on+0x74/0x100
   ? __folio_start_writeback+0x10/0x2c0
   btrfs_submit_bbio+0x1c/0x40 [btrfs]
   submit_one_bio+0x44/0x60 [btrfs]
   submit_extent_folio+0x13f/0x330 [btrfs]
   ? btrfs_set_range_writeback+0xa3/0xd0 [btrfs]
   extent_writepage_io+0x18b/0x360 [btrfs]
   extent_write_locked_range+0x17c/0x340 [btrfs]
   ? __pfx_end_bbio_data_write+0x10/0x10 [btrfs]
   run_delalloc_cow+0x71/0xd0 [btrfs]
   btrfs_run_delalloc_range+0x176/0x500 [btrfs]
   ? find_lock_delalloc_range+0x119/0x260 [btrfs]
   writepage_delalloc+0x2ab/0x480 [btrfs]
   extent_write_cache_pages+0x236/0x7d0 [btrfs]
   btrfs_writepages+0x72/0x130 [btrfs]
   do_writepages+0xd4/0x240
   ? find_held_lock+0x2b/0x80
   ? wbc_attach_and_unlock_inode+0x12c/0x290
   ? wbc_attach_and_unlock_inode+0x12c/0x290
   __writeback_single_inode+0x5c/0x4c0
   ? do_raw_spin_unlock+0x49/0xb0
   writeback_sb_inodes+0x22c/0x560
   __writeback_inodes_wb+0x4c/0xe0
   wb_writeback+0x1d6/0x3f0
   wb_workfn+0x334/0x520
   process_one_work+0x1ee/0x570
   ? lock_is_held_type+0xc6/0x130
   worker_thread+0x1d1/0x3b0
   ? __pfx_worker_thread+0x10/0x10
   kthread+0xee/0x120
   ? __pfx_kthread+0x10/0x10
   ret_from_fork+0x30/0x50
   ? __pfx_kthread+0x10/0x10
   ret_from_fork_asm+0x1a/0x30
   </TASK>
  Modules linked in: dm_mod btrfs blake2b_generic xor raid6_pq rapl
  CR2: 0000000000000020

* Case 2. Earlier completion of orig_bbio for mirrored btrfs_bios

btrfs_bbio_propagate_error() assumes the end_io function for orig_bbio is
called last among split bios. In that case, btrfs_orig_write_end_io() sets
the bio->bi_status to BLK_STS_IOERR by seeing the bioc->error [2].
Otherwise, the increased orig_bio's bioc->error is not checked by anyone
and return BLK_STS_OK to the upper layer.

[2] Actually, this is not true. Because we only increases orig_bioc->errors
by max_errors, the condition "atomic_read(&bioc->error) > bioc->max_errors"
is still not met if only one split btrfs_bio fails.

* Case 3. Later completion of orig_bbio for un-mirrored btrfs_bios

In contrast to the above case, btrfs_bbio_propagate_error() is not working
well if un-mirrored orig_bbio is completed last. It sets
orig_bbio->bio.bi_status to the btrfs_bio's error. But, that is easily
over-written by orig_bbio's completion status. If the status is BLK_STS_OK,
the upper layer would not know the failure.

* Solution

Considering the above cases, we can only save the error status in the
orig_bbio (remaining part after split) itself as it is always
available. Also, the saved error status should be propagated when all the
split btrfs_bios are finished (i.e, bbio->pending_ios == 0).

This commit introduces "status" to btrfs_bbio and saves the first error of
split bios to original btrfs_bio's "status" variable. When all the split
bios are finished, the saved status is loaded into original btrfs_bio's
status.

With this commit, btrfs/146 on zoned devices does not hit the NULL pointer
dereference anymore.

Fixes: 852eee62d3 ("btrfs: allow btrfs_submit_bio to split bios")
CC: stable@vger.kernel.org # 6.6+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-10-23 18:17:43 +02:00
Linus Torvalds
26bb0d3f38 for-6.12/block-20240913
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmbkZhQQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpjOKD/0fzd4yOcqxSI9W3OLGd04VrOTJIQa4CRbV
 GmoTq39pOeIDVGug5ekkTpqqHHnuGk+nQhCzD9vsN/eTmC7yZOIr847O2aWzvYEn
 PzFRgmJpoo2E9sr/IsTR5LnJjbaIZhQVkqLH6ZOj9tpKlVwN2SK0nIRVNrAi5zgT
 MaDrto/2OUld+vmA99Rgb23jxM6UBdCPIjuiVa+11Vg9Z3D1tWbBmrsG7OMysyIf
 FbASBeKHqFSO61/ipFCZv6VV1X8zoWEVyT8n4A1yUbbN5rLzPgoQJVbfSqQRXIdr
 cdrKeCbKxl+joSgKS6LKpvnfwRgGF+hgAfpZg4c0vrbZGTQcRhhLFECyh/aVI08F
 p5TOMArhVaX59664gHgSPq4KnGTXOO29dot9N3Jya/ZQnxinjY9r+GVOfLuduPPy
 1B04vab8oAsk4zK7fZbkDxgYUyifwzK/vQ6OqYq2mYdpdIS/AE7T2ou61Bz5mI7I
 /BuucNV0Z96OKlyLEXwXXZjZgNu1TFcq6ARIBJ8L08PY64Fesj5BXabRyXkeNH26
 0exyz9heeJs6OwRGfngXmS24tDSS0k74CeZX3KoePNj69u6KCn346KiU1qgntwwD
 E5F7AEHqCl5FjUEIWB4M1EPlfA8U0MzOL+tkx2xKJAjsU60wAy7jRSyOIcqodpMs
 6UlPcJzgYg==
 =uuLl
 -----END PGP SIGNATURE-----

Merge tag 'for-6.12/block-20240913' of git://git.kernel.dk/linux

Pull block updates from Jens Axboe:

 - MD changes via Song:
      - md-bitmap refactoring (Yu Kuai)
      - raid5 performance optimization (Artur Paszkiewicz)
      - Other small fixes (Yu Kuai, Chen Ni)
      - Add a sysfs entry 'new_level' (Xiao Ni)
      - Improve information reported in /proc/mdstat (Mateusz Kusiak)

 - NVMe changes via Keith:
      - Asynchronous namespace scanning (Stuart)
      - TCP TLS updates (Hannes)
      - RDMA queue controller validation (Niklas)
      - Align field names to the spec (Anuj)
      - Metadata support validation (Puranjay)
      - A syntax cleanup (Shen)
      - Fix a Kconfig linking error (Arnd)
      - New queue-depth quirk (Keith)

 - Add missing unplug trace event (Keith)

 - blk-iocost fixes (Colin, Konstantin)

 - t10-pi modular removal and fixes (Alexey)

 - Fix for potential BLKSECDISCARD overflow (Alexey)

 - bio splitting cleanups and fixes (Christoph)

 - Deal with folios rather than rather than pages, speeding up how the
   block layer handles bigger IOs (Kundan)

 - Use spinlocks rather than bit spinlocks in zram (Sebastian, Mike)

 - Reduce zoned device overhead in ublk (Ming)

 - Add and use sendpages_ok() for drbd and nvme-tcp (Ofir)

 - Fix regression in partition error pointer checking (Riyan)

 - Add support for write zeroes and rotational status in nbd (Wouter)

 - Add Yu Kuai as new BFQ maintainer. The scheduler has been
   unmaintained for quite a while.

 - Various sets of fixes for BFQ (Yu Kuai)

 - Misc fixes and cleanups (Alvaro, Christophe, Li, Md Haris, Mikhail,
   Yang)

* tag 'for-6.12/block-20240913' of git://git.kernel.dk/linux: (120 commits)
  nvme-pci: qdepth 1 quirk
  block: fix potential invalid pointer dereference in blk_add_partition
  blk_iocost: make read-only static array vrate_adj_pct const
  block: unpin user pages belonging to a folio at once
  mm: release number of pages of a folio
  block: introduce folio awareness and add a bigger size from folio
  block: Added folio-ized version of bio_add_hw_page()
  block, bfq: factor out a helper to split bfqq in bfq_init_rq()
  block, bfq: remove local variable 'bfqq_already_existing' in bfq_init_rq()
  block, bfq: remove local variable 'split' in bfq_init_rq()
  block, bfq: remove bfq_log_bfqg()
  block, bfq: merge bfq_release_process_ref() into bfq_put_cooperator()
  block, bfq: fix procress reference leakage for bfqq in merge chain
  block, bfq: fix uaf for accessing waker_bfqq after splitting
  blk-throttle: support prioritized processing of metadata
  blk-throttle: remove last_low_overflow_time
  drbd: Add NULL check for net_conf to prevent dereference in state validation
  nvme-tcp: fix link failure for TCP auth
  blk-mq: add missing unplug trace event
  mtip32xx: Remove redundant null pointer checks in mtip_hw_debugfs_init()
  ...
2024-09-16 13:33:06 +02:00
Qu Wenruo
9ca0e58cb7 btrfs: merge btrfs_orig_bbio_end_io() into btrfs_bio_end_io()
There are only two differences between the two functions:

- btrfs_orig_bbio_end_io() does extra error propagation
  This is mostly to allow tolerance for write errors.

- btrfs_orig_bbio_end_io() does extra pending_ios check
  This check can handle both the original bio, or the cloned one.
  (All accounting happens in the original one).

This makes btrfs_orig_bbio_end_io() a much safer call.
In fact we already had a double freeing error due to usage of
btrfs_bio_end_io() in the error path of btrfs_submit_chunk().

So just move the whole content of btrfs_orig_bbio_end_io() into
btrfs_bio_end_io().

For normal paths this brings no change, because they are already calling
btrfs_orig_bbio_end_io() in the first place.

For error paths (not only inside bio.c but also external callers), this
change will introduce extra checks, especially for external callers, as
they will error out without submitting the btrfs bio.

But considering it's already in the error path, such slower but much
safer checks are still an overall win.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-10 16:51:20 +02:00
David Sterba
22b4ef50dc btrfs: rename __btrfs_submit_bio() and drop double underscores
Previous patch freed the function name btrfs_submit_bio() so we can use
it for a helper that submits struct bio.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-10 16:51:19 +02:00
David Sterba
792e86ef31 btrfs: rename btrfs_submit_bio() to btrfs_submit_bbio()
The function name is a bit misleading as it submits the btrfs_bio
(bbio), rename it so we can use btrfs_submit_bio() when an actual bio is
submitted.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-10 16:51:19 +02:00