linux/fs/btrfs
Filipe Manana a1ce40f8ae btrfs: make send work with concurrent block group relocation
commit d96b34248c upstream.

We don't allow send and balance/relocation to run in parallel in order
to prevent send from failing or silently producing a bad stream. This is
because while send is using an extent (especially metadata), or is about
to read a metadata extent expecting it to belong to a specific parent
node, relocation can run and the transaction used for the relocation can
be committed, so the extent gets reallocated while send is still using
it and ends up with different content than expected. This can result in
failing to read a metadata extent due to failure of the validation
checks (parent transid, level, etc), failure to find a backreference for
a data extent, and other unexpected failures. Besides reallocation,
there's also a similar problem of an extent getting discarded when it's
unpinned after the transaction used for block group relocation is
committed.

The restriction between balance and send was added in commit 9e967495e0
("Btrfs: prevent send failures and crashes due to concurrent relocation"),
kernel 5.3, while the more general restriction between send and relocation
was added in commit 1cea5cf0e6 ("btrfs: ensure relocation never runs
while we have send operations running"), kernel 5.14.

Both send and relocation can be very long running operations: relocation
because it has to do a lot of IO and expensive backreference lookups in
case there are many snapshots, and send due to read IO when operating on
very large trees. This makes it inconvenient for users and tools to deal
with scheduling both operations.

For zoned filesystems we also have automatic block group relocation, so
send can fail with -EAGAIN when users least expect it, or send can end up
delaying the block group relocation for too long. In the future we might
also get automatic block group relocation for non-zoned filesystems.

This change makes it possible for send and relocation to run in parallel.
This is achieved in the following way:

1) For all tree searches, send acquires a read lock on the commit root
   semaphore;

2) After each tree search, and before releasing the commit root semaphore,
   the leaf is cloned and placed in the search path (struct btrfs_path);

3) After releasing the commit root semaphore, the changed_cb() callback
   is invoked, which operates on the leaf and writes commands to the pipe
   (or file in case send/receive is not used with a pipe). It's important
   here to not hold a lock on the commit root semaphore, because if we did
   we could deadlock when sending and receiving to the same filesystem
   using a pipe - the send task blocks on the pipe because it's full, the
   receive task, which is the only consumer of the pipe, triggers a
   transaction commit when attempting to create a subvolume or reserve
   space for a write operation for example, but the transaction commit
   blocks trying to write lock the commit root semaphore, resulting in a
   deadlock;

4) Before moving to the next key, or advancing to the next change in case
   of an incremental send, check if a transaction used for relocation was
   committed (or is about to finish its commit). If so, release the search
   path(s) and restart the search from where we were before, so that we
   don't operate on stale extent buffers. The search restarts are always
   possible because both the send and parent roots are RO, and no one can
   add, remove or update keys (change their offset) in RO trees - the
   only exception is deduplication, but that is still not allowed to run
   in parallel with send;

5) Periodically check if there is contention on the commit root semaphore,
   which means there is a transaction commit trying to write lock it, and
   release the semaphore and reschedule if there is contention, so as to
   avoid causing any significant delays to transaction commits.
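
The steps above can be sketched as a small single-threaded user-space
model. This is only an illustration under stated assumptions, not the
kernel code: struct toy_rwsem, struct fs_state, struct send_state and all
function names below are hypothetical stand-ins, the "leaf" is a plain
array of keys, and the contention check from step 5 is not modeled.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define NKEYS 4

/* Hypothetical toy lock: only counts holders, it never blocks. */
struct toy_rwsem { int readers; int writer; };

static void down_read_toy(struct toy_rwsem *s) { assert(!s->writer); s->readers++; }
static void up_read_toy(struct toy_rwsem *s) { assert(s->readers > 0); s->readers--; }
static void down_write_toy(struct toy_rwsem *s) { assert(!s->readers && !s->writer); s->writer = 1; }
static void up_write_toy(struct toy_rwsem *s) { s->writer = 0; }

/* Hypothetical stand-in for the state shared with relocation. */
struct fs_state {
	struct toy_rwsem commit_root_sem; /* models the commit root semaphore */
	uint64_t last_reloc_trans;        /* bumped by each relocation commit */
	int leaf[NKEYS];                  /* models a leaf of a read-only root */
};

/* Hypothetical stand-in for the send context. */
struct send_state {
	uint64_t seen_reloc_trans; /* relocation counter seen at last search */
	int restarts;              /* number of search restarts (step 4) */
	int emitted[NKEYS];        /* models commands written to the pipe */
	int nemitted;
};

/* Steps 1 and 2: search under the read lock, clone the leaf, then unlock. */
static void search_and_clone(struct fs_state *fs, struct send_state *s, int *clone)
{
	down_read_toy(&fs->commit_root_sem);
	memcpy(clone, fs->leaf, sizeof(fs->leaf));
	s->seen_reloc_trans = fs->last_reloc_trans;
	up_read_toy(&fs->commit_root_sem);
}

/* Step 4: did a relocation transaction commit since the last search? */
static int need_restart(struct fs_state *fs, const struct send_state *s)
{
	int stale;

	down_read_toy(&fs->commit_root_sem);
	stale = fs->last_reloc_trans != s->seen_reloc_trans;
	up_read_toy(&fs->commit_root_sem);
	return stale;
}

/* Models the commit of a transaction used for relocation. */
static void commit_relocation(struct fs_state *fs)
{
	down_write_toy(&fs->commit_root_sem);
	fs->last_reloc_trans++;
	up_write_toy(&fs->commit_root_sem);
}

/* Sends every key; a relocation commit is simulated after reloc_after_slot. */
void send_all(struct fs_state *fs, struct send_state *s, int reloc_after_slot)
{
	int clone[NKEYS];
	int slot = 0;

	search_and_clone(fs, s, clone);
	while (slot < NKEYS) {
		/* Step 3: process the cloned leaf with the semaphore released. */
		assert(fs->commit_root_sem.readers == 0);
		s->emitted[s->nemitted++] = clone[slot];

		if (slot == reloc_after_slot)
			commit_relocation(fs); /* cannot deadlock: no lock is held */

		slot++;
		/* Step 4: keys of a RO root never move, so resume at the same slot. */
		if (slot < NKEYS && need_restart(fs, s)) {
			search_and_clone(fs, s, clone);
			s->restarts++;
		}
	}
}
```

In this model, calling send_all() on a zero-initialized struct send_state
with a relocation commit simulated after the second key still emits every
key exactly once, with a single search restart in between.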

This leaves some room for optimizations for send to have fewer path
releases and less re-searching of the trees when there's relocation
running, but for now it's kept simple as it performs quite well (on very
large trees with resulting send streams in the order of a few hundred
gigabytes).

Test case btrfs/187, from fstests, stresses relocation, send and
deduplication attempting to run in parallel, but without verifying if send
succeeds and if it produces correct streams. A new test case will be added
that exercises relocation happening in parallel with send and then checks
that send succeeds and the resulting streams are correct.

A final note is that for now this still leaves the mutual exclusion
between send operations and deduplication on files belonging to a root
used by send operations. A solution for that will be slightly more complex
but it will eventually be built on top of this change.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-03-16 14:23:46 +01:00
tests btrfs: remove ignore_offset argument from btrfs_find_all_roots() 2021-08-23 13:19:01 +02:00
acl.c overlayfs update for 5.15 2021-09-02 09:21:27 -07:00
async-thread.c btrfs: fix memory ordering between normal and ordered work functions 2021-11-25 09:48:46 +01:00
async-thread.h
backref.c btrfs: remove BUG_ON(!eie) in find_parent_nodes 2022-01-27 11:04:52 +01:00
backref.h btrfs: remove ignore_offset argument from btrfs_find_all_roots() 2021-08-23 13:19:01 +02:00
block-group.c btrfs: make send work with concurrent block group relocation 2022-03-16 14:23:46 +01:00
block-group.h btrfs: rework chunk allocation to avoid exhaustion of the system chunk array 2021-07-07 17:42:41 +02:00
block-rsv.c
block-rsv.h
btrfs_inode.h btrfs: initial fsverity support 2021-08-23 13:19:09 +02:00
check-integrity.c btrfs: check-integrity: drop kmap/kunmap for block pages 2021-08-23 13:19:00 +02:00
check-integrity.h
compression.c Revert "btrfs: compression: drop kmap/kunmap from generic helpers" 2021-10-27 10:39:03 +02:00
compression.h btrfs: rework btrfs_decompress_buf2page() 2021-08-23 13:19:04 +02:00
ctree.c btrfs: make send work with concurrent block group relocation 2022-03-16 14:23:46 +01:00
ctree.h btrfs: make send work with concurrent block group relocation 2022-03-16 14:23:46 +01:00
delalloc-space.c btrfs: free exchange changeset on failures 2021-12-14 10:57:13 +01:00
delalloc-space.h
delayed-inode.c btrfs: add ro compat flags to inodes 2021-08-23 13:19:09 +02:00
delayed-inode.h
delayed-ref.c btrfs: fix lock inversion problem when doing qgroup extent tracing 2021-07-22 15:50:07 +02:00
delayed-ref.h
dev-replace.c btrfs: zoned: cache reported zone during mount 2022-02-23 12:03:02 +01:00
dev-replace.h
dir-item.c btrfs: unify lookup return value when dir entry is missing 2021-10-07 22:06:32 +02:00
discard.c btrfs: fix typos in comments 2021-06-22 14:11:57 +02:00
discard.h
disk-io.c btrfs: make send work with concurrent block group relocation 2022-03-16 14:23:46 +01:00
disk-io.h
export.c
export.h
extent_io.c btrfs: do not WARN_ON() if we have PageError set 2022-03-08 19:12:54 +01:00
extent_io.h btrfs: fix argument type of btrfs_bio_clone_partial() 2021-08-23 13:19:08 +02:00
extent_map.c
extent_map.h
extent-io-tree.h
extent-tree.c btrfs: do not start relocation until in progress drops are done 2022-03-08 19:12:54 +01:00
file-item.c btrfs: replace BUG_ON() in btrfs_csum_one_bio() with proper error handling 2021-09-17 19:29:38 +02:00
file.c btrfs: fix abort logic in btrfs_replace_file_extents 2021-10-07 22:08:06 +02:00
free-space-cache.c btrfs: zoned: fix block group alloc_offset calculation 2021-08-23 13:19:11 +02:00
free-space-cache.h
free-space-tree.c
free-space-tree.h
inode-item.c
inode.c btrfs: fix ENOSPC failure when attempting direct IO write into NOCOW range 2022-03-08 19:12:46 +01:00
ioctl.c btrfs: fix use-after-free after failure to create a snapshot 2022-02-08 18:34:04 +01:00
Kconfig btrfs: disable build on platforms having page size 256K 2021-06-22 14:11:57 +02:00
locking.c btrfs: fix typos in comments 2021-06-22 14:11:57 +02:00
locking.h
lzo.c btrfs: prevent copying too big compressed lzo segment 2022-03-02 11:48:07 +01:00
Makefile btrfs: initial fsverity support 2021-08-23 13:19:09 +02:00
misc.h btrfs: use correct header for div_u64 in misc.h 2021-09-07 14:29:50 +02:00
ordered-data.c btrfs: zoned: fix double counting of split ordered extent 2021-09-07 14:30:41 +02:00
ordered-data.h btrfs: remove uptodate parameter from btrfs_dec_test_first_ordered_pending 2021-08-23 13:19:02 +02:00
orphan.c
print-tree.c
print-tree.h
props.c btrfs: props: change how empty value is interpreted 2021-06-22 14:11:58 +02:00
props.h
qgroup.c btrfs: qgroup: fix deadlock between rescan worker and remove qgroup 2022-03-08 19:12:54 +01:00
qgroup.h btrfs: fix lock inversion problem when doing qgroup extent tracing 2021-07-22 15:50:07 +02:00
raid56.c btrfs: constify and cleanup variables in comparators 2021-08-23 13:19:03 +02:00
raid56.h
rcu-string.h
reada.c
ref-verify.c btrfs: stop doing GFP_KERNEL memory allocations in the ref verify tool 2021-08-23 13:19:00 +02:00
ref-verify.h
reflink.c btrfs: reflink: initialize return value to 0 in btrfs_extent_same() 2021-11-18 19:16:20 +01:00
reflink.h
relocation.c btrfs: make send work with concurrent block group relocation 2022-03-16 14:23:46 +01:00
root-tree.c btrfs: do not start relocation until in progress drops are done 2022-03-08 19:12:54 +01:00
scrub.c btrfs: make 1-bit bit-fields of scrub_page unsigned int 2021-11-25 09:48:37 +01:00
send.c btrfs: make send work with concurrent block group relocation 2022-03-16 14:23:46 +01:00
send.h
space-info.c btrfs: prevent __btrfs_dump_space_info() to underflow its free space 2021-09-17 19:29:54 +02:00
space-info.h btrfs: rip out btrfs_space_info::total_bytes_pinned 2021-06-22 14:55:25 +02:00
struct-funcs.c btrfs: add special case to setget helpers for 64k pages 2021-08-23 13:18:58 +02:00
subpage.c btrfs: subpage: fix a potential use-after-free in writeback helper 2021-08-23 13:19:05 +02:00
subpage.h btrfs: subpage: fix writeback which does not have ordered extent 2021-08-23 13:19:04 +02:00
super.c btrfs: use latest_dev in btrfs_show_devname 2021-12-22 09:32:37 +01:00
sysfs.c btrfs: sysfs: document structures and their associated files 2021-08-23 13:19:12 +02:00
sysfs.h
transaction.c btrfs: make send work with concurrent block group relocation 2022-03-16 14:23:46 +01:00
transaction.h btrfs: do not start relocation until in progress drops are done 2022-03-08 19:12:54 +01:00
tree-checker.c btrfs: tree-checker: check item_size for dev_item 2022-03-02 11:47:48 +01:00
tree-checker.h
tree-defrag.c
tree-log.c btrfs: add missing run of delayed items after unlink during log replay 2022-03-08 19:12:54 +01:00
tree-log.h
tree-mod-log.c btrfs: fix race when picking most recent mod log operation for an old root 2021-04-20 19:27:17 +02:00
tree-mod-log.h
ulist.c
ulist.h
uuid-tree.c
verity.c btrfs: fix transaction handle leak after verity rollback failure 2021-09-17 19:29:41 +02:00
volumes.c btrfs: zoned: cache reported zone during mount 2022-02-23 12:03:02 +01:00
volumes.h btrfs: convert latest_bdev type to btrfs_device and rename 2021-12-22 09:32:37 +01:00
xattr.c
xattr.h
zlib.c Revert "btrfs: compression: drop kmap/kunmap from zlib" 2021-10-29 13:03:05 +02:00
zoned.c btrfs: zoned: cache reported zone during mount 2022-02-23 12:03:02 +01:00
zoned.h btrfs: zoned: cache reported zone during mount 2022-02-23 12:03:02 +01:00
zstd.c Revert "btrfs: compression: drop kmap/kunmap from zstd" 2021-10-29 13:02:50 +02:00