From 68a135013bf73dfd6a277f76fc4e088b0f3dfa79 Mon Sep 17 00:00:00 2001
From: Mark Harmstone <mark@harmstone.com>
Date: Mon, 23 Mar 2026 12:59:47 +0000
Subject: [PATCH 01/15] btrfs: fix bytes_may_use leak in move_existing_remap()

If the call to btrfs_reserve_extent() in move_existing_remap() returns a
smaller extent than we asked for, currently we're not undoing the
bytes_may_use change that we made. Fix this by calling
btrfs_space_info_update_bytes_may_use() again for the difference.

Fixes: bbea42dfb91f ("btrfs: move existing remaps before relocating block group")
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Mark Harmstone <mark@harmstone.com>
Signed-off-by: David Sterba <dsterba@suse.com>
---
 fs/btrfs/relocation.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
index 1c42c5180bdd..3d1756b61162 100644
--- a/fs/btrfs/relocation.c
+++ b/fs/btrfs/relocation.c
@@ -4174,6 +4174,12 @@ static int move_existing_remap(struct btrfs_fs_info *fs_info,
 		return ret;
 	}
 
+	if (ins.offset < length) {
+		spin_lock(&sinfo->lock);
+		btrfs_space_info_update_bytes_may_use(sinfo, ins.offset - length);
+		spin_unlock(&sinfo->lock);
+	}
+
 	dest_addr = ins.objectid;
 	dest_length = ins.offset;
 

From 9b8824533d75fb199a3fb0f6147ffcca64b5caf8 Mon Sep 17 00:00:00 2001
From: Mark Harmstone <mark@harmstone.com>
Date: Mon, 23 Mar 2026 12:59:57 +0000
Subject: [PATCH 02/15] btrfs: fix bytes_may_use leak in do_remap_reloc_trans()

If the call to btrfs_reserve_extent() in do_remap_reloc_trans() returns
a smaller extent than we asked for, currently we're not undoing the
bytes_may_use change that we made. Fix this by calling
btrfs_space_info_update_bytes_may_use() again for the difference.

Fixes: fd6594b1446c ("btrfs: replace identity remaps with actual remaps when doing relocations")
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Mark Harmstone <mark@harmstone.com>
Signed-off-by: David Sterba <dsterba@suse.com>
---
 fs/btrfs/relocation.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
index 3d1756b61162..ad433b7ca919 100644
--- a/fs/btrfs/relocation.c
+++ b/fs/btrfs/relocation.c
@@ -5006,6 +5006,12 @@ static int do_remap_reloc_trans(struct btrfs_fs_info *fs_info,
 		return ret;
 	}
 
+	if (ins.offset < remap_length) {
+		spin_lock(&sinfo->lock);
+		btrfs_space_info_update_bytes_may_use(sinfo, ins.offset - remap_length);
+		spin_unlock(&sinfo->lock);
+	}
+
 	made_reservation = true;
 
 	new_addr = ins.objectid;

From 73db0fad673af844772de964eebecae60eda0496 Mon Sep 17 00:00:00 2001
From: Mark Harmstone <mark@harmstone.com>
Date: Mon, 23 Mar 2026 17:16:43 +0000
Subject: [PATCH 03/15] btrfs: abort transaction in do_remap_reloc_trans() on
 failure

If one of the calls made by do_remap_reloc_trans() fails, we can leave
the remap tree in an inconsistent state. Abort the transaction if this
happens, to prevent the corrupt state from reaching the disk.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Mark Harmstone <mark@harmstone.com>
Signed-off-by: David Sterba <dsterba@suse.com>
---
 fs/btrfs/relocation.c | 14 ++++++++++----
 1 file changed, 10 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
index ad433b7ca919..3e48d1a59fb3 100644
--- a/fs/btrfs/relocation.c
+++ b/fs/btrfs/relocation.c
@@ -5035,21 +5035,27 @@ static int do_remap_reloc_trans(struct btrfs_fs_info *fs_info,
 
 	if (bg_needs_free_space) {
 		ret = btrfs_add_block_group_free_space(trans, dest_bg);
-		if (ret)
+		if (ret) {
+			btrfs_abort_transaction(trans, ret);
 			goto fail;
+		}
 	}
 
 	ret = copy_remapped_data(fs_info, start, new_addr, length);
-	if (ret)
+	if (ret) {
+		btrfs_abort_transaction(trans, ret);
 		goto fail;
+	}
 
 	ret = btrfs_remove_from_free_space_tree(trans, new_addr, length);
-	if (ret)
+	if (ret) {
+		btrfs_abort_transaction(trans, ret);
 		goto fail;
+	}
 
 	ret = add_remap_entry(trans, path, src_bg, start, new_addr, length);
 	if (ret) {
-		btrfs_add_to_free_space_tree(trans, new_addr, length);
+		btrfs_abort_transaction(trans, ret);
 		goto fail;
 	}
 

From a86a283430e1a44907b142c4f53e1f3ad24e87ae Mon Sep 17 00:00:00 2001
From: Qu Wenruo <wqu@suse.com>
Date: Sun, 12 Apr 2026 20:31:05 +0930
Subject: [PATCH 04/15] btrfs: apply first key check for readahead when
 possible

Currently for tree block readahead we never pass a
btrfs_tree_parent_check with @has_first_key set.

Without @has_first_key set, btrfs will skip the following extra
checks:

- Header generation check
  This is a minor one.

- Empty leaf/node checks
  This is more serious, for certain trees like the csum tree, they are
  allowed to be empty, thus an empty leaf can pass the tree checker.
  But if there is a parent node for such an empty leaf, it indicates
  corruption.

  Without @has_first_key set, we can no longer detect such a problem.

  In fact there is already a fuzzed image report that a corrupted csum
  leaf which has zero nritems but still has a parent node can trigger
  a BUG_ON() during csum deletion.

However there are only two call sites of btrfs_readahead_tree_block():

- Inside relocate_tree_blocks()
  At this call site we are trying to grab the first key of the tree
  block, thus we are not able to pass a @first_key parameter.

- Inside btrfs_readahead_node_child()
  This is the more common call site, where we have the parent node and
  want to readahead the child tree blocks.

  In this case we can easily grab the node key and pass it for checks.

Add a new parameter @first_key to btrfs_readahead_tree_block() and pass
the node key to it inside btrfs_readahead_node_child().

This should plug the gap in empty leaf detection during readahead.

Link: https://lore.kernel.org/linux-btrfs/20260409071255.3358044-1-gality369@gmail.com/
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
---
 fs/btrfs/extent_io.c  | 14 ++++++++++++--
 fs/btrfs/extent_io.h  |  3 ++-
 fs/btrfs/relocation.c |  2 +-
 3 files changed, 15 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 1ba8a7d3587b..45d56421ac50 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -4641,7 +4641,8 @@ int try_release_extent_buffer(struct folio *folio)
  * to read the block we will not block on anything.
  */
 void btrfs_readahead_tree_block(struct btrfs_fs_info *fs_info,
-				u64 bytenr, u64 owner_root, u64 gen, int level)
+				u64 bytenr, u64 owner_root, u64 gen, int level,
+				const struct btrfs_key *first_key)
 {
 	struct btrfs_tree_parent_check check = {
 		.level = level,
@@ -4650,6 +4651,11 @@ void btrfs_readahead_tree_block(struct btrfs_fs_info *fs_info,
 	struct extent_buffer *eb;
 	int ret;
 
+	if (first_key) {
+		memcpy(&check.first_key, first_key, sizeof(struct btrfs_key));
+		check.has_first_key = true;
+	}
+
 	eb = btrfs_find_create_tree_block(fs_info, bytenr, owner_root, level);
 	if (IS_ERR(eb))
 		return;
@@ -4677,9 +4683,13 @@ void btrfs_readahead_tree_block(struct btrfs_fs_info *fs_info,
  */
 void btrfs_readahead_node_child(struct extent_buffer *node, int slot)
 {
+	struct btrfs_key node_key;
+
+	btrfs_node_key_to_cpu(node, &node_key, slot);
 	btrfs_readahead_tree_block(node->fs_info,
 				   btrfs_node_blockptr(node, slot),
 				   btrfs_header_owner(node),
 				   btrfs_node_ptr_generation(node, slot),
-				   btrfs_header_level(node) - 1);
+				   btrfs_header_level(node) - 1,
+				   &node_key);
 }
diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
index fd209233317f..b310a5145cf6 100644
--- a/fs/btrfs/extent_io.h
+++ b/fs/btrfs/extent_io.h
@@ -287,7 +287,8 @@ static inline void wait_on_extent_buffer_writeback(struct extent_buffer *eb)
 }
 
 void btrfs_readahead_tree_block(struct btrfs_fs_info *fs_info,
-				u64 bytenr, u64 owner_root, u64 gen, int level);
+				u64 bytenr, u64 owner_root, u64 gen, int level,
+				const struct btrfs_key *first_key);
 void btrfs_readahead_node_child(struct extent_buffer *node, int slot);
 
 /* Note: this can be used in for loops without caching the value in a variable. */
diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
index 3e48d1a59fb3..d21742750b69 100644
--- a/fs/btrfs/relocation.c
+++ b/fs/btrfs/relocation.c
@@ -2607,7 +2607,7 @@ int relocate_tree_blocks(struct btrfs_trans_handle *trans,
 		if (!block->key_ready)
 			btrfs_readahead_tree_block(fs_info, block->bytenr,
 						   block->owner, 0,
-						   block->level);
+						   block->level, NULL);
 	}
 
 	/* Get first keys */

From 41e706c07ef9f752a08f0b9567176ac79441895f Mon Sep 17 00:00:00 2001
From: Qu Wenruo <wqu@suse.com>
Date: Tue, 14 Apr 2026 11:51:15 +0930
Subject: [PATCH 05/15] btrfs: enable shutdown ioctl for non-experimental
 builds

Although commit 304076527c38 ("btrfs: move shutdown and remove_bdev
callbacks out of experimental features") tries to move both shutdown and
remove_bdev out of experimental features, that commit has only addressed
the super block operation callback, the ioctl one is left untouched.

Fix that missing aspect by also moving shutdown ioctl out of
experimental features.

Since we're here, also add unknown flag detection to reject any
unsupported shutdown flags.

Fixes: 304076527c38 ("btrfs: move shutdown and remove_bdev callbacks out of experimental features")
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
---
 fs/btrfs/ioctl.c | 7 +++----
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index b2e447f5005c..a39460bf68a7 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -5102,7 +5102,6 @@ static int btrfs_ioctl_subvol_sync(struct btrfs_fs_info *fs_info, void __user *a
 	return 0;
 }
 
-#ifdef CONFIG_BTRFS_EXPERIMENTAL
 static int btrfs_ioctl_shutdown(struct btrfs_fs_info *fs_info, unsigned long arg)
 {
 	int ret = 0;
@@ -5134,10 +5133,12 @@ static int btrfs_ioctl_shutdown(struct btrfs_fs_info *fs_info, unsigned long arg
 	case BTRFS_SHUTDOWN_FLAGS_NOLOGFLUSH:
 		btrfs_force_shutdown(fs_info);
 		break;
+	default:
+		ret = -EINVAL;
+		break;
 	}
 	return ret;
 }
-#endif
 
 long btrfs_ioctl(struct file *file, unsigned int
 		cmd, unsigned long arg)
@@ -5294,10 +5295,8 @@ long btrfs_ioctl(struct file *file, unsigned int
 #endif
 	case BTRFS_IOC_SUBVOL_SYNC_WAIT:
 		return btrfs_ioctl_subvol_sync(fs_info, argp);
-#ifdef CONFIG_BTRFS_EXPERIMENTAL
 	case BTRFS_IOC_SHUTDOWN:
 		return btrfs_ioctl_shutdown(fs_info, arg);
-#endif
 	}
 
 	return -ENOTTY;

From 44366af74061793ee5ceef455a4f0e465892d0de Mon Sep 17 00:00:00 2001
From: Mark Harmstone <maharmstone@fb.com>
Date: Mon, 23 Mar 2026 17:17:01 +0000
Subject: [PATCH 06/15] btrfs: don't clobber errors in add_remap_tree_entries()

In add_remap_tree_entries(), we only process a certain number of entries
at a time, meaning we may need to loop.

But because we weren't checking the return value of btrfs_insert_empty_items()
within the loop, this meant that if the last iteration of the loop
succeeded but a previous iteration failed, we were erroneously returning
0.

Fix this by breaking the loop early if btrfs_insert_empty_items() fails.

Fixes: b56f35560b82 ("btrfs: handle setting up relocation of block group with remap-tree")
Signed-off-by: Mark Harmstone <mark@harmstone.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
---
 fs/btrfs/relocation.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
index d21742750b69..3ebaf5880125 100644
--- a/fs/btrfs/relocation.c
+++ b/fs/btrfs/relocation.c
@@ -3876,7 +3876,7 @@ static int add_remap_tree_entries(struct btrfs_trans_handle *trans, struct btrfs
 		ret = btrfs_insert_empty_items(trans, fs_info->remap_root, path, &batch);
 		btrfs_release_path(path);
 
-		if (num_entries <= max_items)
+		if (ret || num_entries <= max_items)
 			break;
 
 		num_entries -= max_items;

From 999757231c49376cd1a37308d2c8c4c9932571e1 Mon Sep 17 00:00:00 2001
From: Filipe Manana <fdmanana@suse.com>
Date: Thu, 9 Apr 2026 15:46:51 +0100
Subject: [PATCH 07/15] btrfs: fix missing last_unlink_trans update when
 removing a directory

When removing a directory we are not updating its last_unlink_trans field,
which can result in incorrect fsync behaviour in case some one fsyncs the
directory after it was removed because it's holding a file descriptor on
it.

Example scenario:

   mkdir /mnt/dir1
   mkdir /mnt/dir1/dir2
   mkdir /mnt/dir3

   sync -f /mnt

   # Do some change to the directory and fsync it.
   chmod 700 /mnt/dir1
   xfs_io -c fsync /mnt/dir1

   # Move dir2 out of dir1 so that dir1 becomes empty.
   mv /mnt/dir1/dir2 /mnt/dir3/

   open fd on /mnt/dir1
   call rmdir(2) on path "/mnt/dir1"
   fsync fd

   <trigger power failure>

When attempting to mount the filesystem, the log replay will fail with
an -EIO error and dmesg/syslog has the following:

   [445771.626482] BTRFS info (device dm-0): first mount of filesystem 0368bbea-6c5e-44b5-b409-09abe496e650
   [445771.626486] BTRFS info (device dm-0): using crc32c checksum algorithm
   [445771.627912] BTRFS info (device dm-0): start tree-log replay
   [445771.628335] page: refcount:2 mapcount:0 mapping:0000000061443ddc index:0x1d00 pfn:0x7072a5
   [445771.629453] memcg:ffff89f400351b00
   [445771.629892] aops:btree_aops [btrfs] ino:1
   [445771.630737] flags: 0x17fffc00000402a(uptodate|lru|private|writeback|node=0|zone=2|lastcpupid=0x1ffff)
   [445771.632359] raw: 017fffc00000402a fffff47284d950c8 fffff472907b7c08 ffff89f458e412b8
   [445771.633713] raw: 0000000000001d00 ffff89f6c51d1a90 00000002ffffffff ffff89f400351b00
   [445771.635029] page dumped because: eb page dump
   [445771.635825] BTRFS critical (device dm-0): corrupt leaf: root=5 block=30408704 slot=10 ino=258, invalid nlink: has 2 expect no more than 1 for dir
   [445771.638088] BTRFS info (device dm-0): leaf 30408704 gen 10 total ptrs 17 free space 14878 owner 5
   [445771.638091] BTRFS info (device dm-0): refs 4 lock_owner 0 current 3581087
   [445771.638094] 	item 0 key (256 INODE_ITEM 0) itemoff 16123 itemsize 160
   [445771.638097] 		inode generation 3 transid 9 size 16 nbytes 16384
   [445771.638098] 		block group 0 mode 40755 links 1 uid 0 gid 0
   [445771.638100] 		rdev 0 sequence 2 flags 0x0
   [445771.638102] 		atime 1775744884.0
   [445771.660056] 		ctime 1775744885.645502983
   [445771.660058] 		mtime 1775744885.645502983
   [445771.660060] 		otime 1775744884.0
   [445771.660062] 	item 1 key (256 INODE_REF 256) itemoff 16111 itemsize 12
   [445771.660064] 		index 0 name_len 2
   [445771.660066] 	item 2 key (256 DIR_ITEM 1843588421) itemoff 16077 itemsize 34
   [445771.660068] 		location key (259 1 0) type 2
   [445771.660070] 		transid 9 data_len 0 name_len 4
   [445771.660075] 	item 3 key (256 DIR_ITEM 2363071922) itemoff 16043 itemsize 34
   [445771.660076] 		location key (257 1 0) type 2
   [445771.660077] 		transid 9 data_len 0 name_len 4
   [445771.660078] 	item 4 key (256 DIR_INDEX 2) itemoff 16009 itemsize 34
   [445771.660079] 		location key (257 1 0) type 2
   [445771.660080] 		transid 9 data_len 0 name_len 4
   [445771.660081] 	item 5 key (256 DIR_INDEX 3) itemoff 15975 itemsize 34
   [445771.660082] 		location key (259 1 0) type 2
   [445771.660083] 		transid 9 data_len 0 name_len 4
   [445771.660084] 	item 6 key (257 INODE_ITEM 0) itemoff 15815 itemsize 160
   [445771.660086] 		inode generation 9 transid 9 size 8 nbytes 0
   [445771.660087] 		block group 0 mode 40777 links 1 uid 0 gid 0
   [445771.660088] 		rdev 0 sequence 2 flags 0x0
   [445771.660089] 		atime 1775744885.641174097
   [445771.660090] 		ctime 1775744885.645502983
   [445771.660091] 		mtime 1775744885.645502983
   [445771.660105] 		otime 1775744885.641174097
   [445771.660106] 	item 7 key (257 INODE_REF 256) itemoff 15801 itemsize 14
   [445771.660107] 		index 2 name_len 4
   [445771.660108] 	item 8 key (257 DIR_ITEM 2676584006) itemoff 15767 itemsize 34
   [445771.660109] 		location key (258 1 0) type 2
   [445771.660110] 		transid 9 data_len 0 name_len 4
   [445771.660111] 	item 9 key (257 DIR_INDEX 2) itemoff 15733 itemsize 34
   [445771.660112] 		location key (258 1 0) type 2
   [445771.660113] 		transid 9 data_len 0 name_len 4
   [445771.660114] 	item 10 key (258 INODE_ITEM 0) itemoff 15573 itemsize 160
   [445771.660115] 		inode generation 9 transid 10 size 0 nbytes 0
   [445771.660116] 		block group 0 mode 40755 links 2 uid 0 gid 0
   [445771.660117] 		rdev 0 sequence 0 flags 0x0
   [445771.660118] 		atime 1775744885.645502983
   [445771.660119] 		ctime 1775744885.645502983
   [445771.660120] 		mtime 1775744885.645502983
   [445771.660121] 		otime 1775744885.645502983
   [445771.660122] 	item 11 key (258 INODE_REF 257) itemoff 15559 itemsize 14
   [445771.660123] 		index 2 name_len 4
   [445771.660124] 	item 12 key (258 INODE_REF 259) itemoff 15545 itemsize 14
   [445771.660125] 		index 2 name_len 4
   [445771.660126] 	item 13 key (259 INODE_ITEM 0) itemoff 15385 itemsize 160
   [445771.660127] 		inode generation 9 transid 10 size 8 nbytes 0
   [445771.660128] 		block group 0 mode 40755 links 1 uid 0 gid 0
   [445771.660129] 		rdev 0 sequence 1 flags 0x0
   [445771.660130] 		atime 1775744885.645502983
   [445771.660130] 		ctime 1775744885.645502983
   [445771.660131] 		mtime 1775744885.645502983
   [445771.660132] 		otime 1775744885.645502983
   [445771.660133] 	item 14 key (259 INODE_REF 256) itemoff 15371 itemsize 14
   [445771.660134] 		index 3 name_len 4
   [445771.660135] 	item 15 key (259 DIR_ITEM 2676584006) itemoff 15337 itemsize 34
   [445771.660136] 		location key (258 1 0) type 2
   [445771.660137] 		transid 10 data_len 0 name_len 4
   [445771.660138] 	item 16 key (259 DIR_INDEX 2) itemoff 15303 itemsize 34
   [445771.660139] 		location key (258 1 0) type 2
   [445771.660140] 		transid 10 data_len 0 name_len 4
   [445771.660144] BTRFS error (device dm-0): block=30408704 write time tree block corruption detected
   [445771.661650] ------------[ cut here ]------------
   [445771.662358] WARNING: fs/btrfs/disk-io.c:326 at btree_csum_one_bio+0x217/0x230 [btrfs], CPU#8: mount/3581087
   [445771.663588] Modules linked in: btrfs f2fs xfs (...)
   [445771.671229] CPU: 8 UID: 0 PID: 3581087 Comm: mount Tainted: G        W           7.0.0-rc6-btrfs-next-230+ #2 PREEMPT(full)
   [445771.672575] Tainted: [W]=WARN
   [445771.672987] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.2-0-gea1b7a073390-prebuilt.qemu.org 04/01/2014
   [445771.674460] RIP: 0010:btree_csum_one_bio+0x217/0x230 [btrfs]
   [445771.675222] Code: 89 44 24 (...)
   [445771.677364] RSP: 0018:ffffd23882247660 EFLAGS: 00010246
   [445771.678029] RAX: 0000000000000000 RBX: ffff89f6c51d1a90 RCX: 0000000000000000
   [445771.678975] RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff89f406020000
   [445771.679983] RBP: ffff89f821204000 R08: 0000000000000000 R09: 00000000ffefffff
   [445771.680905] R10: ffffd23882247448 R11: 0000000000000003 R12: ffffd23882247668
   [445771.681978] R13: ffff89f458e40fc0 R14: ffff89f737f4f500 R15: ffff89f737f4f500
   [445771.682912] FS:  00007f0447a98840(0000) GS:ffff89fb9771d000(0000) knlGS:0000000000000000
   [445771.684393] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
   [445771.685230] CR2: 00007f0447bf1330 CR3: 000000017cb02002 CR4: 0000000000370ef0
   [445771.686273] Call Trace:
   [445771.686646]  <TASK>
   [445771.686969]  btrfs_submit_bbio+0x83f/0x860 [btrfs]
   [445771.687750]  ? write_one_eb+0x28f/0x340 [btrfs]
   [445771.688428]  btree_writepages+0x2e3/0x550 [btrfs]
   [445771.689180]  ? kmem_cache_alloc_noprof+0x12a/0x490
   [445771.689963]  ? alloc_extent_state+0x19/0x120 [btrfs]
   [445771.690801]  ? kmem_cache_free+0x135/0x380
   [445771.691328]  ? preempt_count_add+0x69/0xa0
   [445771.691831]  ? set_extent_bit+0x252/0x8e0 [btrfs]
   [445771.692468]  ? xas_load+0x9/0xc0
   [445771.692873]  ? xas_find+0x14d/0x1a0
   [445771.693304]  do_writepages+0xc6/0x160
   [445771.693756]  filemap_writeback+0xb8/0xe0
   [445771.694274]  btrfs_write_marked_extents+0x61/0x170 [btrfs]
   [445771.694999]  btrfs_write_and_wait_transaction+0x4e/0xc0 [btrfs]
   [445771.695818]  btrfs_commit_transaction+0x5c8/0xd10 [btrfs]
   [445771.696530]  ? kmem_cache_free+0x135/0x380
   [445771.697120]  ? release_extent_buffer+0x34/0x160 [btrfs]
   [445771.697786]  btrfs_recover_log_trees+0x7be/0x7e0 [btrfs]
   [445771.698525]  ? __pfx_replay_one_buffer+0x10/0x10 [btrfs]
   [445771.699206]  open_ctree+0x11e5/0x1810 [btrfs]
   [445771.699776]  btrfs_get_tree.cold+0xb/0x162 [btrfs]
   [445771.700463]  ? fscontext_read+0x165/0x180
   [445771.701146]  ? rw_verify_area+0x50/0x180
   [445771.701866]  vfs_get_tree+0x25/0xd0
   [445771.702491]  vfs_cmd_create+0x59/0xe0
   [445771.703125]  __do_sys_fsconfig+0x303/0x610
   [445771.703603]  do_syscall_64+0xe9/0xf20
   [445771.703974]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
   [445771.704700] RIP: 0033:0x7f0447cbd4aa
   [445771.705108] Code: 73 01 c3 (...)
   [445771.707263] RSP: 002b:00007ffc4e528318 EFLAGS: 00000246 ORIG_RAX: 00000000000001af
   [445771.708107] RAX: ffffffffffffffda RBX: 00005561585d8c20 RCX: 00007f0447cbd4aa
   [445771.708931] RDX: 0000000000000000 RSI: 0000000000000006 RDI: 0000000000000003
   [445771.709744] RBP: 00005561585d9120 R08: 0000000000000000 R09: 0000000000000000
   [445771.710674] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
   [445771.711477] R13: 00007f0447e4f580 R14: 00007f0447e5126c R15: 00007f0447e36a23
   [445771.712277]  </TASK>
   [445771.712541] ---[ end trace 0000000000000000 ]---
   [445771.713382] BTRFS error (device dm-0): error while writing out transaction: -5
   [445771.714679] BTRFS warning (device dm-0): Skipping commit of aborted transaction.
   [445771.715562] BTRFS error (device dm-0 state A): Transaction aborted (error -5)
   [445771.716459] BTRFS: error (device dm-0 state A) in cleanup_transaction:2068: errno=-5 IO failure
   [445771.717936] BTRFS error (device dm-0 state EA): failed to recover log trees with error: -5
   [445771.719681] BTRFS error (device dm-0 state EA): open_ctree failed: -5

The problem is that such a fsync should have result in a fallback to a
transaction commit, but that did not happen because through the
btrfs_rmdir() we never update the directory's last_unlink_trans field.
Any inode that had a link removed must have its last_unlink_trans updated
to the ID of transaction used for the operation, otherwise fsync and log
replay will not work correctly.

btrfs_rmdir() calls btrfs_unlink_inode() and through that call chain we
never call btrfs_record_unlink_dir() in order to update last_unlink_trans.
However btrfs_unlink(), which is used for unlinking regular files, calls
btrfs_record_unlink_dir() and then calls btrfs_unlink_inode(). So fix
this by moving the call to btrfs_record_unlink_dir() from btrfs_unlink()
to btrfs_unlink_inode().

A test case for fstests will follow soon.

Reported-by: Slava0135 <slava.kovalevskiy.2014@gmail.com>
Link: https://lore.kernel.org/linux-btrfs/CAAJYhww5ov62Hm+n+tmhcL-e_4cBobg+OWogKjOJxVUXivC=MQ@mail.gmail.com/
CC: stable@vger.kernel.org
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
---
 fs/btrfs/inode.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 40474014c03f..55133a364305 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -4959,6 +4959,8 @@ static int btrfs_rmdir(struct inode *vfs_dir, struct dentry *dentry)
 	if (ret)
 		goto out;
 
+	btrfs_record_unlink_dir(trans, dir, inode, false);
+
 	/* now the directory is empty */
 	ret = btrfs_unlink_inode(trans, dir, inode, &fname.disk_name);
 	if (!ret)

From 4d95b9efd783adca472e957b2f576983e789b839 Mon Sep 17 00:00:00 2001
From: David Sterba <dsterba@suse.com>
Date: Tue, 14 Apr 2026 17:30:31 +0200
Subject: [PATCH 08/15] btrfs: handle unexpected free-space-tree key types

Replace the conditional assertions with proper error handling and
transaction abort if we find an unexpected key type in the free space
tree.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
---
 fs/btrfs/free-space-tree.c | 18 +++++++++++++++---
 1 file changed, 15 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/free-space-tree.c b/fs/btrfs/free-space-tree.c
index 9efd1ec90f03..472b3060e5ac 100644
--- a/fs/btrfs/free-space-tree.c
+++ b/fs/btrfs/free-space-tree.c
@@ -259,7 +259,11 @@ int btrfs_convert_free_space_to_bitmaps(struct btrfs_trans_handle *trans,
 				nr++;
 				path->slots[0]--;
 			} else {
-				ASSERT(0);
+				btrfs_err(fs_info, "unexpected free space tree key type %u",
+					  found_key.type);
+				ret = -EUCLEAN;
+				btrfs_abort_transaction(trans, ret);
+				goto out;
 			}
 		}
 
@@ -405,7 +409,11 @@ int btrfs_convert_free_space_to_extents(struct btrfs_trans_handle *trans,
 
 				nr++;
 			} else {
-				ASSERT(0);
+				btrfs_err(fs_info, "unexpected free space tree key type %u",
+					  found_key.type);
+				ret = -EUCLEAN;
+				btrfs_abort_transaction(trans, ret);
+				goto out;
 			}
 		}
 
@@ -1518,7 +1526,11 @@ int btrfs_remove_block_group_free_space(struct btrfs_trans_handle *trans,
 				nr++;
 				path->slots[0]--;
 			} else {
-				ASSERT(0);
+				btrfs_err(trans->fs_info, "unexpected free space tree key type %u",
+					  found_key.type);
+				ret = -EUCLEAN;
+				btrfs_abort_transaction(trans, ret);
+				return ret;
 			}
 		}
 

From 513f8a52eed880ea525dbb139b2127bd9bb793f1 Mon Sep 17 00:00:00 2001
From: robbieko <robbieko@synology.com>
Date: Mon, 13 Apr 2026 14:52:32 +0800
Subject: [PATCH 09/15] btrfs: copy devid in
 btrfs_partially_delete_raid_extent()

When btrfs_partially_delete_raid_extent() rebuilds a truncated/shifted
stripe extent into newitem, the loop copies the physical address for
each stride but forgets to copy the devid. The resulting item written
back to the stripe tree has zeroed-out devids, corrupting the stripe
mapping.

Fix this by reading the devid with btrfs_raid_stride_devid() and
writing it into the new item with btrfs_set_stack_raid_stride_devid()
before copying the physical address.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: robbieko <robbieko@synology.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
---
 fs/btrfs/raid-stripe-tree.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/fs/btrfs/raid-stripe-tree.c b/fs/btrfs/raid-stripe-tree.c
index 638c4ad572c9..ac8cec3ce6d3 100644
--- a/fs/btrfs/raid-stripe-tree.c
+++ b/fs/btrfs/raid-stripe-tree.c
@@ -45,8 +45,11 @@ static int btrfs_partially_delete_raid_extent(struct btrfs_trans_handle *trans,
 
 	for (int i = 0; i < btrfs_num_raid_stripes(item_size); i++) {
 		struct btrfs_raid_stride *stride = &extent->strides[i];
+		u64 devid;
 		u64 phys;
 
+		devid = btrfs_raid_stride_devid(leaf, stride);
+		btrfs_set_stack_raid_stride_devid(&newitem->strides[i], devid);
 		phys = btrfs_raid_stride_physical(leaf, stride) + frontpad;
 		btrfs_set_stack_raid_stride_physical(&newitem->strides[i], phys);
 	}

From 2aef5cb1dcf9b3e1be3895a6477dc065e618aab8 Mon Sep 17 00:00:00 2001
From: robbieko <robbieko@synology.com>
Date: Mon, 13 Apr 2026 14:52:33 +0800
Subject: [PATCH 10/15] btrfs: fix raid stripe search missing entries at leaf
 boundaries

In btrfs_delete_raid_extent(), the search key uses offset=0. When the
target stripe entry is the first item on a leaf, btrfs_search_slot()
may land on the previous leaf and decrementing the slot from nritems
still points to the wrong entry, causing the stripe extent to be
silently missed.

Fix this by searching with offset=(u64)-1 instead. Since no real stripe
entry has this offset, btrfs_search_slot() always returns 1 with the
slot pointing past the last matching objectid entry. Then unconditionally
decrement the slot with a proper slots[0]==0 early-exit check to handle
the case where no matching entry exists.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: robbieko <robbieko@synology.com>
Signed-off-by: David Sterba <dsterba@suse.com>
---
 fs/btrfs/raid-stripe-tree.c | 18 +++++++++++++++---
 1 file changed, 15 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/raid-stripe-tree.c b/fs/btrfs/raid-stripe-tree.c
index ac8cec3ce6d3..4937b08da9de 100644
--- a/fs/btrfs/raid-stripe-tree.c
+++ b/fs/btrfs/raid-stripe-tree.c
@@ -98,14 +98,26 @@ int btrfs_delete_raid_extent(struct btrfs_trans_handle *trans, u64 start, u64 le
 	while (1) {
 		key.objectid = start;
 		key.type = BTRFS_RAID_STRIPE_KEY;
-		key.offset = 0;
+		key.offset = (u64)-1;
 
 		ret = btrfs_search_slot(trans, stripe_root, &key, path, -1, 1);
 		if (ret < 0)
 			break;
 
-		if (path->slots[0] == btrfs_header_nritems(path->nodes[0]))
-			path->slots[0]--;
+		/*
+		 * Search with offset=(u64)-1 ensures we land on the correct
+		 * leaf even when the target entry is the first item on a leaf.
+		 * Since no real entry has offset=(u64)-1, ret is always 1 and
+		 * slot points past the last entry with objectid==start (or
+		 * past the end of the leaf if that entry is the last item).
+		 * Back up one slot to find the actual entry.
+		 */
+		if (path->slots[0] == 0) {
+			/* No entry with objectid <= start exists. */
+			ret = 0;
+			break;
+		}
+		path->slots[0]--;
 
 		leaf = path->nodes[0];
 		slot = path->slots[0];

From 1871ae78ffa5ce7c0458e9ba5867958c1753e425 Mon Sep 17 00:00:00 2001
From: robbieko <robbieko@synology.com>
Date: Mon, 13 Apr 2026 14:52:34 +0800
Subject: [PATCH 11/15] btrfs: fix wrong min_objectid in btrfs_previous_item()
 call

When found_start > start and slot == 0, btrfs_previous_item() is called
with min_objectid=start to find the previous stripe extent. However, the
previous stripe extent we are looking for has objectid < start (it starts
before our deletion range), so passing start as min_objectid prevents
finding it.

Fix by passing 0 as min_objectid to allow finding any preceding stripe
extent regardless of its objectid.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: robbieko <robbieko@synology.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
---
 fs/btrfs/raid-stripe-tree.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/btrfs/raid-stripe-tree.c b/fs/btrfs/raid-stripe-tree.c
index 4937b08da9de..1a0ea2107688 100644
--- a/fs/btrfs/raid-stripe-tree.c
+++ b/fs/btrfs/raid-stripe-tree.c
@@ -138,7 +138,7 @@ int btrfs_delete_raid_extent(struct btrfs_trans_handle *trans, u64 start, u64 le
 		 */
 		if (found_start > start) {
 			if (slot == 0) {
-				ret = btrfs_previous_item(stripe_root, path, start,
+				ret = btrfs_previous_item(stripe_root, path, 0,
 							  BTRFS_RAID_STRIPE_KEY);
 				if (ret) {
 					if (ret > 0)

From 653361585d251fbca0e19ac58b04ba95dd01e378 Mon Sep 17 00:00:00 2001
From: robbieko <robbieko@synology.com>
Date: Mon, 13 Apr 2026 14:52:35 +0800
Subject: [PATCH 12/15] btrfs: replace ASSERT with proper error handling in
 stripe lookup fallback

After falling back to the previous item in btrfs_delete_raid_extent(),
the code uses ASSERT(found_start <= start) to verify the found extent
actually precedes our target range. If the B-tree state is unexpected
(e.g. no overlapping extent exists), this triggers a kernel BUG/panic
in debug builds, or silently continues with wrong data otherwise.

Replace the ASSERT with a proper bounds check that returns -ENOENT if
the found extent does not actually overlap with the start position.

Signed-off-by: robbieko <robbieko@synology.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
---
 fs/btrfs/raid-stripe-tree.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/raid-stripe-tree.c b/fs/btrfs/raid-stripe-tree.c
index 1a0ea2107688..d454894b9e66 100644
--- a/fs/btrfs/raid-stripe-tree.c
+++ b/fs/btrfs/raid-stripe-tree.c
@@ -154,7 +154,10 @@ int btrfs_delete_raid_extent(struct btrfs_trans_handle *trans, u64 start, u64 le
 			btrfs_item_key_to_cpu(leaf, &key, slot);
 			found_start = key.objectid;
 			found_end = found_start + key.offset;
-			ASSERT(found_start <= start);
+			if (found_start > start || found_end <= start) {
+				ret = -ENOENT;
+				break;
+			}
 		}
 
 		if (key.type != BTRFS_RAID_STRIPE_KEY)

From fe0cdfd7118d8b40a21bfac221bb4982c5e10e10 Mon Sep 17 00:00:00 2001
From: robbieko <robbieko@synology.com>
Date: Mon, 13 Apr 2026 14:52:36 +0800
Subject: [PATCH 13/15] btrfs: handle -EAGAIN from btrfs_duplicate_item and
 refresh stale leaf pointer

In the 'punch a hole' case of btrfs_delete_raid_extent(),
btrfs_duplicate_item() can return -EAGAIN when the leaf needs to be
split and the path becomes invalid. The old code treats any error as
fatal and breaks out of the loop.

Additionally, btrfs_duplicate_item() may trigger setup_leaf_for_split()
which can reallocate the leaf node. The code continues using the old
leaf pointer, leading to use-after-free or stale data access.

Fix both issues by:

- Handling -EAGAIN specifically: release the path and retry the loop.
- Refreshing leaf = path->nodes[0] after successful duplication.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: robbieko <robbieko@synology.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
---
 fs/btrfs/raid-stripe-tree.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/fs/btrfs/raid-stripe-tree.c b/fs/btrfs/raid-stripe-tree.c
index d454894b9e66..2e0d2f83c651 100644
--- a/fs/btrfs/raid-stripe-tree.c
+++ b/fs/btrfs/raid-stripe-tree.c
@@ -194,9 +194,19 @@ int btrfs_delete_raid_extent(struct btrfs_trans_handle *trans, u64 start, u64 le
 
 			/* The "right" item. */
 			ret = btrfs_duplicate_item(trans, stripe_root, path, &newkey);
+			if (ret == -EAGAIN) {
+				btrfs_release_path(path);
+				continue;
+			}
 			if (ret)
 				break;
 
+			/*
+			 * btrfs_duplicate_item() may have triggered a leaf
+			 * split via setup_leaf_for_split(), so we must refresh
+			 * our leaf pointer from the path.
+			 */
+			leaf = path->nodes[0];
 			item_size = btrfs_item_size(leaf, path->slots[0]);
 			extent = btrfs_item_ptr(leaf, path->slots[0],
 						struct btrfs_stripe_extent);

From a8d58a7c0200904ff24ca7f0d7c147017e25aa99 Mon Sep 17 00:00:00 2001
From: robbieko <robbieko@synology.com>
Date: Mon, 13 Apr 2026 14:52:37 +0800
Subject: [PATCH 14/15] btrfs: check return value of
 btrfs_partially_delete_raid_extent()

btrfs_partially_delete_raid_extent() returns an error code (e.g.
-ENOMEM from kzalloc(), or errors from btrfs_del_item/btrfs_insert_item()),
but all three call sites in btrfs_delete_raid_extent() discard the
return value, silently losing errors and potentially leaving the stripe
tree in an inconsistent state.

Fix by capturing the return value into ret at all three call sites and
breaking out of the loop on error where appropriate.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: robbieko <robbieko@synology.com>
Signed-off-by: David Sterba <dsterba@suse.com>
---
 fs/btrfs/raid-stripe-tree.c | 19 ++++++++++++-------
 1 file changed, 12 insertions(+), 7 deletions(-)

diff --git a/fs/btrfs/raid-stripe-tree.c b/fs/btrfs/raid-stripe-tree.c
index 2e0d2f83c651..4b0186c83ad1 100644
--- a/fs/btrfs/raid-stripe-tree.c
+++ b/fs/btrfs/raid-stripe-tree.c
@@ -223,8 +223,9 @@ int btrfs_delete_raid_extent(struct btrfs_trans_handle *trans, u64 start, u64 le
 			/* The "left" item. */
 			path->slots[0]--;
 			btrfs_item_key_to_cpu(leaf, &key, path->slots[0]);
-			btrfs_partially_delete_raid_extent(trans, path, &key,
-							   diff_start, 0);
+			ret = btrfs_partially_delete_raid_extent(trans, path,
+								 &key,
+								 diff_start, 0);
 			break;
 		}
 
@@ -240,8 +241,11 @@ int btrfs_delete_raid_extent(struct btrfs_trans_handle *trans, u64 start, u64 le
 		if (found_start < start) {
 			u64 diff_start = start - found_start;
 
-			btrfs_partially_delete_raid_extent(trans, path, &key,
-							   diff_start, 0);
+			ret = btrfs_partially_delete_raid_extent(trans, path,
+								 &key,
+								 diff_start, 0);
+			if (ret)
+				break;
 
 			start += (key.offset - diff_start);
 			length -= (key.offset - diff_start);
@@ -264,9 +268,10 @@ int btrfs_delete_raid_extent(struct btrfs_trans_handle *trans, u64 start, u64 le
 		if (found_end > end) {
 			u64 diff_end = found_end - end;
 
-			btrfs_partially_delete_raid_extent(trans, path, &key,
-							   key.offset - length,
-							   length);
+			ret = btrfs_partially_delete_raid_extent(trans, path,
+								 &key,
+								 key.offset - length,
+								 length);
 			ASSERT(key.offset - diff_end == length);
 			break;
 		}

From 82323b1a7088b7a5c3e528a5d634bff447fa286f Mon Sep 17 00:00:00 2001
From: Mark Harmstone <mark@harmstone.com>
Date: Thu, 16 Apr 2026 18:15:23 +0100
Subject: [PATCH 15/15] btrfs: fix double-decrement of bytes_may_use in
 submit_one_async_extent()

submit_one_async_extent() calls btrfs_reserve_extent(), which decrements
bytes_may_use. If the call btrfs_create_io_em() fails, we jump to
out_free_reserve, which calls extent_clear_unlock_delalloc().

Because we're specifying EXTENT_DO_ACCOUNTING, i.e.
EXTENT_CLEAR_META_RESV | EXTENT_CLEAR_DATA_RESV, this decreases
bytes_may_use again. This can lead to problems later on, as an initial
write can fail only for the writeback to silently ENOSPC.

Fix this by replacing EXTENT_DO_ACCOUNTING with EXTENT_CLEAR_META_RESV.
This parallels a4fe134fc1d8eb ("btrfs: fix a double release on reserved
extents in cow_one_range()"), which is the same fix in cow_one_range().

Fixes: 151a41bc46df ("Btrfs: fix what bits we clear when erroring out from delalloc")
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Mark Harmstone <mark@harmstone.com>
Signed-off-by: David Sterba <dsterba@suse.com>
---
 fs/btrfs/inode.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 55133a364305..906d5c21ebc4 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -1153,7 +1153,7 @@ static void submit_one_async_extent(struct async_chunk *async_chunk,
 				     NULL, &cached,
 				     EXTENT_LOCKED | EXTENT_DELALLOC |
 				     EXTENT_DELALLOC_NEW |
-				     EXTENT_DEFRAG | EXTENT_DO_ACCOUNTING,
+				     EXTENT_DEFRAG | EXTENT_CLEAR_META_RESV,
 				     PAGE_UNLOCK | PAGE_START_WRITEBACK |
 				     PAGE_END_WRITEBACK);
 	if (async_extent->cb)