linux

mirror of https://github.com/torvalds/linux.git synced 2026-06-05 04:56:13 +02:00

Author	SHA1	Message	Date
Qu Wenruo	4066c55e10	btrfs: only release the dirty pages io tree after successful writes [WARNING] With extra warning on dirty extent buffers at umount (aka, the next patch in the series), test case generic/388 can trigger the following warning about dirty extent buffers at unmount time: BTRFS critical (device dm-2 state E): emergency shutdown BTRFS error (device dm-2 state E): error while writing out transaction: -30 BTRFS warning (device dm-2 state E): Skipping commit of aborted transaction. BTRFS error (device dm-2 state EA): Transaction 9 aborted (error -30) BTRFS: error (device dm-2 state EA) in cleanup_transaction:2068: errno=-30 Readonly filesystem BTRFS info (device dm-2 state EA): forced readonly BTRFS info (device dm-2 state EA): last unmount of filesystem 4fbf2e15-f941-49a0-bc7c-716315d2777c ------------[ cut here ]------------ WARNING: disk-io.c:3311 at invalidate_and_check_btree_folios+0xfd/0x1ca [btrfs], CPU#8: umount/914368 CPU: 8 UID: 0 PID: 914368 Comm: umount Tainted: G OE 7.1.0-rc1-custom+ #372 PREEMPT(full) 2de38db8d1deae71fde295430a0ff3ab98ccf596 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS unknown 02/02/2022 RIP: 0010:invalidate_and_check_btree_folios+0xfd/0x1ca [btrfs] Call Trace: <TASK> close_ctree+0x52e/0x574 [btrfs d2f0b1cd330d1287e7a9919d112eadfc0e914efd] generic_shutdown_super+0x89/0x1a0 kill_anon_super+0x16/0x40 btrfs_kill_super+0x16/0x20 [btrfs d2f0b1cd330d1287e7a9919d112eadfc0e914efd] deactivate_locked_super+0x2d/0xb0 cleanup_mnt+0xdc/0x140 task_work_run+0x5a/0xa0 exit_to_user_mode_loop+0x123/0x4b0 do_syscall_64+0x243/0x7c0 entry_SYSCALL_64_after_hwframe+0x4b/0x53 </TASK> ---[ end trace 0000000000000000 ]--- BTRFS warning (device dm-2 state EA): unable to release extent buffer 30539776 owner 9 gen 9 refs 2 flags 0x7 BTRFS warning (device dm-2 state EA): unable to release extent buffer 30621696 owner 257 gen 9 refs 2 flags 0x7 BTRFS warning (device dm-2 state EA): unable to release extent buffer 30638080 owner 258 gen 9 
refs 2 flags 0x7 BTRFS warning (device dm-2 state EA): unable to release extent buffer 30654464 owner 7 gen 9 refs 2 flags 0x7 BTRFS warning (device dm-2 state EA): unable to release extent buffer 30703616 owner 2 gen 9 refs 2 flags 0x7 BTRFS warning (device dm-2 state EA): unable to release extent buffer 30720000 owner 10 gen 9 refs 2 flags 0x7 BTRFS warning (device dm-2 state EA): unable to release extent buffer 30736384 owner 4 gen 9 refs 2 flags 0x7 BTRFS warning (device dm-2 state EA): unable to release extent buffer 30752768 owner 11 gen 9 refs 2 flags 0x7 I'm using a stripped-down version, which seems to trigger the warning more reliably: _fsstress_pid="" workload() { dmesg -C mkfs.btrfs -f -K $dev > /dev/null echo 1 > /sys/kernel/debug/clear_warn_once mount $dev $mnt $fsstress -w -n 1024 -p 4 -d $mnt & _fsstress_pid=$! sleep 0 $godown $mnt pkill --echo -PIPE fsstress > /dev/null wait $_fsstress_pid unset _fsstress_pid umount $mnt if dmesg | grep -q "WARNING"; then fail fi } for (( i = 0; i < $runtime; i++ )); do echo "=== $i/$runtime ===" workload done [CAUSE] Inside btrfs_write_and_wait_transaction(), we first try to write all dirty ebs, then wait for them to finish. After that we call btrfs_extent_io_tree_release() to free all extent states from dirty_pages io tree. However if we hit an error from btrfs_write_marked_extent(), then we still call btrfs_extent_io_tree_release() to clear that dirty_pages io tree, which may contain dirty records that we haven't yet submitted. Furthermore, the later transaction cleanup path will utilize that dirty_pages io tree to properly clean up those dirty ebs, but since it's already empty, no dirty ebs are properly cleaned up, which later triggers the warnings inside invalidate_btree_folios(). 
[FIX] Normally such dirty ebs won't cause problems: when iput() is called on the btree inode, the dirty ebs are forcibly written back, and since the fs is already in an error state, that writeback does not reach disk and finishes immediately. But it's still better to get rid of such dirty ebs. If we end up with dirty ebs while the fs is not in an error state, writeback at iput() time is too late: all workers are already stopped, but writeback relies on workers, which leads to NULL pointer dereferences. Instead of unconditionally calling btrfs_extent_io_tree_release(), only call it if btrfs_write_and_wait_transaction() finished successfully, so that @dirty_pages extent io tree is kept untouched for transaction cleanup. CC: stable@vger.kernel.org # 6.1+ Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2026-05-08 00:31:47 +02:00
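The release-only-on-success rule described in the commit above can be sketched in userspace C. This is a simplified model, not the kernel code: `demo_trans`, `demo_write_and_wait()` and `demo_cleanup()` are hypothetical names, and a plain counter stands in for the dirty_pages io tree.

```c
#include <errno.h>

/* Hypothetical stand-in for a transaction; the counter models the
 * records tracked in the dirty_pages io tree. */
struct demo_trans {
	int dirty_records;
};

/* write_ret models the result of writing and waiting on dirty ebs.
 * The records are released only when the write succeeded, so a failed
 * write leaves them in place for the cleanup path. */
static int demo_write_and_wait(struct demo_trans *t, int write_ret)
{
	if (write_ret == 0)
		t->dirty_records = 0;
	return write_ret;
}

/* Models the transaction cleanup path: it can only clean up the
 * records that are still tracked. Returns how many it cleaned. */
static int demo_cleanup(struct demo_trans *t)
{
	int cleaned = t->dirty_records;

	t->dirty_records = 0;
	return cleaned;
}
```

In this model, a write that fails with -EROFS (the -30 seen in the log) leaves all records available to `demo_cleanup()`, whereas an unconditional release would leave it nothing to clean.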
Filipe Manana	c73370c677	btrfs: tracepoints: fix sleep while in atomic context in btrfs_sync_file() The trace event btrfs_sync_file() is called in an atomic context (all trace events are) and its call to dput(), which is needed due to the call to dget_parent(), can sleep, triggering a kernel splat. This can be reproduced by enabling the trace event and running btrfs/056 from fstests for example. The splat shown in dmesg is the following: [53.919] BUG: sleeping function called from invalid context at fs/dcache.c:970 [53.947] in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 32773, name: xfs_io [53.988] preempt_count: 2, expected: 0 [53.967] RCU nest depth: 0, expected: 0 [53.943] Preemption disabled at: [53.944] [<0000000000000000>] 0x0 [54.078] CPU: 0 UID: 0 PID: 32773 Comm: xfs_io Tainted: G W 7.1.0-rc1-btrfs-next-232+ #1 PREEMPT(full) [54.070] Tainted: [W]=WARN [54.071] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.2-0-gea1b7a073390-prebuilt.qemu.org 04/01/2014 [54.072] Call Trace: [54.074] <TASK> [54.076] dump_stack_lvl+0x56/0x80 [54.079] __might_resched.cold+0xd6/0x10f [54.072] dput.part.0+0x24/0x110 [54.078] trace_event_raw_event_btrfs_sync_file+0x75/0x140 [btrfs] [54.089] btrfs_sync_file+0x1ed/0x530 [btrfs] [54.087] ? __handle_mm_fault+0x8ae/0xed0 [54.089] btrfs_do_write_iter+0x172/0x210 [btrfs] [54.091] vfs_write+0x21f/0x450 [54.094] __x64_sys_pwrite64+0x8d/0xc0 [54.096] ? do_user_addr_fault+0x20c/0x670 [54.099] do_syscall_64+0x60/0xf20 [54.092] ? clear_bhb_loop+0x60/0xb0 [54.094] entry_SYSCALL_64_after_hwframe+0x76/0x7e So stop using dget_parent() and dput() and access the parent dentry directly as dentry->d_parent. This is also what ext4 is doing in its equivalent trace event ext4_sync_file_enter(). 
Fixes: `a85b46db14` ("btrfs: tracepoints: get correct superblock from dentry in event btrfs_sync_file()") Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2026-05-08 00:31:37 +02:00
Calvin Owens	4822703b15	btrfs: always pass __GFP_NOWARN from add_ra_bio_pages() A build workload newly prints order-0 allocation failures on 7.1-rc1: sh: page allocation failure: order:0 mode:0x14084a(__GFP_HIGHMEM|__GFP_MOVABLE|__GFP_IO|__GFP_KSWAPD_RECLAIM|__GFP_COMP|__GFP_HARDWALL) CPU: 27 UID: 1000 PID: 855540 Comm: sh Not tainted 7.1.0-rc1-llvm-00058-gdca922e019dd #1 PREEMPTLAZY Call Trace: <TASK> dump_stack_lvl+0x50/0x70 warn_alloc+0xeb/0x100 __alloc_pages_slowpath+0x567/0x5a0 ? filemap_get_entry+0x11a/0x140 __alloc_frozen_pages_noprof+0x249/0x2d0 alloc_pages_mpol+0xe4/0x180 folio_alloc_noprof+0x80/0xa0 add_ra_bio_pages+0x13c/0x4b0 btrfs_submit_compressed_read+0x229/0x300 submit_one_bio+0x9e/0xe0 btrfs_readahead+0x185/0x1a0 [...] (lldb) source list -a add_ra_bio_pages+0x13c .../vmlinux.unstripped add_ra_bio_pages + 316 at .../fs/btrfs/compression.c:454:8 451 452 folio = filemap_alloc_folio(mapping_gfp_constraint(mapping, constraint_gfp), 453 0, NULL); -> 454 if (!folio) 455 break; I can reproduce this consistently by running a memory hog concurrently with a buffered writer on a machine with a very large amount of swap. Commit `7ae37b2c94` ("btrfs: prevent direct reclaim during compressed readahead") clearly intended to suppress these warnings. But because the mask set in the address_space with mapping_set_gfp_mask() doesn't include __GFP_NOWARN, mapping_gfp_constraint() removes it from constraint_gfp before it is passed to filemap_alloc_folio(). Fix by refactoring the code to add __GFP_NOWARN after the call to mapping_gfp_constraint(). Fixes: `7ae37b2c94` ("btrfs: prevent direct reclaim during compressed readahead") Signed-off-by: Calvin Owens <calvin@wbinvd.org> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2026-05-08 00:30:53 +02:00
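The masking effect described in the commit above comes down to the order of an AND and an OR. A minimal sketch follows, assuming made-up flag values (`DEMO_GFP_*` are hypothetical and do not match the kernel's real bits). The `demo_mapping_gfp_constraint()` helper mirrors the behaviour the commit message attributes to mapping_gfp_constraint(): it keeps only the bits that are also set in the mapping's mask.

```c
#include <stdint.h>

typedef uint32_t demo_gfp_t;

/* Hypothetical flag bits, for illustration only. */
#define DEMO_GFP_IO     0x01u
#define DEMO_GFP_FS     0x02u
#define DEMO_GFP_NOWARN 0x10u

/* Keep only the requested bits that the mapping's mask also allows. */
static demo_gfp_t demo_mapping_gfp_constraint(demo_gfp_t mapping_mask,
					      demo_gfp_t gfp)
{
	return mapping_mask & gfp;
}

/* Before the fix: NOWARN is part of the constraint, so it is masked
 * away whenever the mapping's mask does not contain it. */
static demo_gfp_t demo_gfp_before(demo_gfp_t mapping_mask)
{
	demo_gfp_t constraint = DEMO_GFP_IO | DEMO_GFP_NOWARN;

	return demo_mapping_gfp_constraint(mapping_mask, constraint);
}

/* After the fix: NOWARN is OR-ed in once the constraint is applied,
 * so it always reaches the allocator. */
static demo_gfp_t demo_gfp_after(demo_gfp_t mapping_mask)
{
	demo_gfp_t constraint = DEMO_GFP_IO;

	return demo_mapping_gfp_constraint(mapping_mask, constraint) |
	       DEMO_GFP_NOWARN;
}
```

With a mapping mask of `DEMO_GFP_IO | DEMO_GFP_FS`, the first variant loses the no-warn bit while the second keeps it.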
ZhengYuan Huang	fc51cba3eb	btrfs: fix check_chunk_block_group_mappings() to iterate all chunk maps [BUG] A corrupted image with a chunk present in the chunk tree but whose corresponding block group item is missing from the extent tree can be mounted successfully, even though check_chunk_block_group_mappings() is supposed to catch exactly this corruption at mount time. Once mounted, running btrfs balance with a usage filter (-dusage=N or -dusage=min..max) triggers a null-ptr-deref: KASAN: null-ptr-deref in range [0x0000000000000070-0x0000000000000077] RIP: 0010:chunk_usage_filter fs/btrfs/volumes.c:3874 [inline] RIP: 0010:should_balance_chunk fs/btrfs/volumes.c:4018 [inline] RIP: 0010:__btrfs_balance fs/btrfs/volumes.c:4172 [inline] RIP: 0010:btrfs_balance+0x2024/0x42b0 fs/btrfs/volumes.c:4604 [CAUSE] The crash occurs because __btrfs_balance() iterates the on-disk chunk tree, finds the orphaned chunk, calls chunk_usage_filter() (or chunk_usage_range_filter()), which queries the in-memory block group cache via btrfs_lookup_block_group(). Since no block group was ever inserted for this chunk, the lookup returns NULL, and the subsequent dereference of cache->used crashes. check_chunk_block_group_mappings() uses btrfs_find_chunk_map() to iterate the in-memory chunk map (fs_info->mapping_tree): map = btrfs_find_chunk_map(fs_info, start, 1); With @start = 0 and @length = 1, btrfs_find_chunk_map() looks for a chunk map that contains the logical address 0. If no chunk contains logical address 0, btrfs_find_chunk_map(fs_info, 0, 1) returns NULL immediately and the loop breaks after the very first iteration, having checked zero chunks. The entire verification function is therefore a no-op, and the corrupted image passes the mount-time check undetected. [FIX] Replace the btrfs_find_chunk_map() based loop with a direct in-order walk of fs_info->mapping_tree using rb_first_cached() + rb_next(). 
This guarantees that every chunk map in the tree is visited regardless of the logical addresses involved. No lock is taken around the traversal. This function is called during mount from btrfs_read_block_groups(), which is invoked from open_ctree() before any background threads (cleaner, transaction kthread, etc.) are started. There are therefore no concurrent writers that could modify mapping_tree at this point. An analogous lockless direct traversal of mapping_tree already exists in fill_dummy_bgs() in the same file. Since we walk the rb-tree directly via rb_entry() without going through btrfs_find_chunk_map(), no reference is taken on each map entry, so the btrfs_free_chunk_map() calls are also removed. Signed-off-by: ZhengYuan Huang <gality369@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2026-05-08 00:29:07 +02:00
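To illustrate why the lookup-based loop in the commit above can verify nothing, here is a small userspace model. It assumes a sorted array in place of the rb-tree, and `demo_chunk`, `demo_find_chunk()` and the two checker functions are hypothetical names. The lookup-based loop only approximates the old control flow as the commit message describes it.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical chunk map entry, sorted by start in the array. */
struct demo_chunk {
	uint64_t start;
	uint64_t len;
	bool has_block_group;
};

/* Point lookup: return the chunk containing @logical, or NULL. */
static const struct demo_chunk *demo_find_chunk(const struct demo_chunk *c,
						size_t n, uint64_t logical)
{
	for (size_t i = 0; i < n; i++)
		if (logical >= c[i].start && logical < c[i].start + c[i].len)
			return &c[i];
	return NULL;
}

/* Old pattern: start at logical 0 and hop from chunk end to chunk end.
 * If nothing contains address 0, the loop ends before checking anything.
 * Returns the number of chunks found without a block group. */
static size_t demo_check_by_lookup(const struct demo_chunk *c, size_t n)
{
	uint64_t start = 0;
	size_t missing = 0;

	for (;;) {
		const struct demo_chunk *m = demo_find_chunk(c, n, start);

		if (!m)
			break;
		if (!m->has_block_group)
			missing++;
		start = m->start + m->len;
	}
	return missing;
}

/* New pattern: visit every entry in order, whatever its address. */
static size_t demo_check_by_walk(const struct demo_chunk *c, size_t n)
{
	size_t missing = 0;

	for (size_t i = 0; i < n; i++)
		if (!c[i].has_block_group)
			missing++;
	return missing;
}
```

With two chunks starting at 1 MiB, the second lacking a block group, the lookup-based check reports no problem while the in-order walk reports one.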
Mark Harmstone	82323b1a70	btrfs: fix double-decrement of bytes_may_use in submit_one_async_extent() submit_one_async_extent() calls btrfs_reserve_extent(), which decrements bytes_may_use. If the call btrfs_create_io_em() fails, we jump to out_free_reserve, which calls extent_clear_unlock_delalloc(). Because we're specifying EXTENT_DO_ACCOUNTING, i.e. EXTENT_CLEAR_META_RESV | EXTENT_CLEAR_DATA_RESV, this decreases bytes_may_use again. This can lead to problems later on, as an initial write can fail only for the writeback to silently ENOSPC. Fix this by replacing EXTENT_DO_ACCOUNTING with EXTENT_CLEAR_META_RESV. This parallels `a4fe134fc1` ("btrfs: fix a double release on reserved extents in cow_one_range()"), which is the same fix in cow_one_range(). Fixes: `151a41bc46` ("Btrfs: fix what bits we clear when erroring out from delalloc") Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Mark Harmstone <mark@harmstone.com> Signed-off-by: David Sterba <dsterba@suse.com>	2026-04-21 04:03:08 +02:00
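The double decrement described in the commit above can be modelled with a single counter. This is only a sketch under stated assumptions: `demo_space_info`, the `DEMO_CLEAR_*` bits and both helpers are hypothetical, and the real accounting involves more counters than bytes_may_use.

```c
#include <stdint.h>

/* Hypothetical flag bits mirroring the names used in the commit. */
#define DEMO_CLEAR_META_RESV 0x1u
#define DEMO_CLEAR_DATA_RESV 0x2u
#define DEMO_DO_ACCOUNTING   (DEMO_CLEAR_META_RESV | DEMO_CLEAR_DATA_RESV)

struct demo_space_info {
	int64_t bytes_may_use;
};

/* Models a successful extent reservation: the bytes leave bytes_may_use. */
static void demo_reserve_extent(struct demo_space_info *s, int64_t len)
{
	s->bytes_may_use -= len;
}

/* Models the error path: clearing the data reservation decrements
 * bytes_may_use, which is wrong if the reservation already did so. */
static void demo_clear_on_error(struct demo_space_info *s, int64_t len,
				unsigned int bits)
{
	if (bits & DEMO_CLEAR_DATA_RESV)
		s->bytes_may_use -= len;
}
```

Starting from 4096 bytes, reserving and then clearing with `DEMO_DO_ACCOUNTING` drives the counter to -4096, while clearing with only `DEMO_CLEAR_META_RESV` leaves it at 0.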
robbieko	a8d58a7c02	btrfs: check return value of btrfs_partially_delete_raid_extent() btrfs_partially_delete_raid_extent() returns an error code (e.g. -ENOMEM from kzalloc(), or errors from btrfs_del_item/btrfs_insert_item()), but all three call sites in btrfs_delete_raid_extent() discard the return value, silently losing errors and potentially leaving the stripe tree in an inconsistent state. Fix by capturing the return value into ret at all three call sites and breaking out of the loop on error where appropriate. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: robbieko <robbieko@synology.com> Signed-off-by: David Sterba <dsterba@suse.com>	2026-04-21 04:02:39 +02:00
robbieko	fe0cdfd711	btrfs: handle -EAGAIN from btrfs_duplicate_item and refresh stale leaf pointer In the 'punch a hole' case of btrfs_delete_raid_extent(), btrfs_duplicate_item() can return -EAGAIN when the leaf needs to be split and the path becomes invalid. The old code treats any error as fatal and breaks out of the loop. Additionally, btrfs_duplicate_item() may trigger setup_leaf_for_split() which can reallocate the leaf node. The code continues using the old leaf pointer, leading to use-after-free or stale data access. Fix both issues by: - Handling -EAGAIN specifically: release the path and retry the loop. - Refreshing leaf = path->nodes[0] after successful duplication. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: robbieko <robbieko@synology.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2026-04-21 04:02:34 +02:00
robbieko	653361585d	btrfs: replace ASSERT with proper error handling in stripe lookup fallback After falling back to the previous item in btrfs_delete_raid_extent(), the code uses ASSERT(found_start <= start) to verify the found extent actually precedes our target range. If the B-tree state is unexpected (e.g. no overlapping extent exists), this triggers a kernel BUG/panic in debug builds, or silently continues with wrong data otherwise. Replace the ASSERT with a proper bounds check that returns -ENOENT if the found extent does not actually overlap with the start position. Signed-off-by: robbieko <robbieko@synology.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2026-04-21 04:02:30 +02:00
robbieko	1871ae78ff	btrfs: fix wrong min_objectid in btrfs_previous_item() call When found_start > start and slot == 0, btrfs_previous_item() is called with min_objectid=start to find the previous stripe extent. However, the previous stripe extent we are looking for has objectid < start (it starts before our deletion range), so passing start as min_objectid prevents finding it. Fix by passing 0 as min_objectid to allow finding any preceding stripe extent regardless of its objectid. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: robbieko <robbieko@synology.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2026-04-21 04:02:26 +02:00
robbieko	2aef5cb1dc	btrfs: fix raid stripe search missing entries at leaf boundaries In btrfs_delete_raid_extent(), the search key uses offset=0. When the target stripe entry is the first item on a leaf, btrfs_search_slot() may land on the previous leaf and decrementing the slot from nritems still points to the wrong entry, causing the stripe extent to be silently missed. Fix this by searching with offset=(u64)-1 instead. Since no real stripe entry has this offset, btrfs_search_slot() always returns 1 with the slot pointing past the last matching objectid entry. Then unconditionally decrement the slot with a proper slots[0]==0 early-exit check to handle the case where no matching entry exists. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: robbieko <robbieko@synology.com> Signed-off-by: David Sterba <dsterba@suse.com>	2026-04-21 04:02:21 +02:00
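The search-past-then-step-back idea in the commit above can be sketched on a flat sorted array. This model deliberately ignores leaves, and with them the boundary problem that motivated the fix; it only shows why a key with offset (u64)-1 never matches exactly and why one decrement then lands on the wanted entry. `demo_key`, `demo_lower_bound()` and `demo_find_stripe()` are hypothetical names.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical two-field key, ordered by objectid and then offset. */
struct demo_key {
	uint64_t objectid;
	uint64_t offset;
};

static int demo_key_cmp(struct demo_key a, struct demo_key b)
{
	if (a.objectid != b.objectid)
		return a.objectid < b.objectid ? -1 : 1;
	if (a.offset != b.offset)
		return a.offset < b.offset ? -1 : 1;
	return 0;
}

/* Binary search: index of the first key that is >= target. */
static size_t demo_lower_bound(const struct demo_key *keys, size_t n,
			       struct demo_key target)
{
	size_t lo = 0, hi = n;

	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;

		if (demo_key_cmp(keys[mid], target) < 0)
			lo = mid + 1;
		else
			hi = mid;
	}
	return lo;
}

/* Search with offset = UINT64_MAX so the slot lands just past the last
 * entry for @objectid, then step back one. Returns -1 when the slot is
 * already 0, meaning no candidate entry exists. */
static long demo_find_stripe(const struct demo_key *keys, size_t n,
			     uint64_t objectid)
{
	struct demo_key target = { objectid, UINT64_MAX };
	size_t slot = demo_lower_bound(keys, n, target);

	if (slot == 0)
		return -1;
	return (long)(slot - 1);
}
```

The caller still has to check that the returned entry actually overlaps the range of interest, since stepping back can land on an entry with a smaller objectid.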
robbieko	513f8a52ee	btrfs: copy devid in btrfs_partially_delete_raid_extent() When btrfs_partially_delete_raid_extent() rebuilds a truncated/shifted stripe extent into newitem, the loop copies the physical address for each stride but forgets to copy the devid. The resulting item written back to the stripe tree has zeroed-out devids, corrupting the stripe mapping. Fix this by reading the devid with btrfs_raid_stride_devid() and writing it into the new item with btrfs_set_stack_raid_stride_devid() before copying the physical address. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: robbieko <robbieko@synology.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2026-04-21 04:02:18 +02:00
David Sterba	4d95b9efd7	btrfs: handle unexpected free-space-tree key types Replace the conditional assertions with proper error handling and transaction abort if we find an unexpected key type in the free space tree. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>	2026-04-21 04:02:02 +02:00
Filipe Manana	999757231c	btrfs: fix missing last_unlink_trans update when removing a directory When removing a directory we are not updating its last_unlink_trans field, which can result in incorrect fsync behaviour if someone fsyncs the directory after it was removed while still holding a file descriptor on it. Example scenario: mkdir /mnt/dir1 mkdir /mnt/dir1/dir2 mkdir /mnt/dir3 sync -f /mnt # Do some change to the directory and fsync it. chmod 700 /mnt/dir1 xfs_io -c fsync /mnt/dir1 # Move dir2 out of dir1 so that dir1 becomes empty. mv /mnt/dir1/dir2 /mnt/dir3/ open fd on /mnt/dir1 call rmdir(2) on path "/mnt/dir1" fsync fd <trigger power failure> When attempting to mount the filesystem, the log replay will fail with an -EIO error and dmesg/syslog has the following: [445771.626482] BTRFS info (device dm-0): first mount of filesystem 0368bbea-6c5e-44b5-b409-09abe496e650 [445771.626486] BTRFS info (device dm-0): using crc32c checksum algorithm [445771.627912] BTRFS info (device dm-0): start tree-log replay [445771.628335] page: refcount:2 mapcount:0 mapping:0000000061443ddc index:0x1d00 pfn:0x7072a5 [445771.629453] memcg:ffff89f400351b00 [445771.629892] aops:btree_aops [btrfs] ino:1 [445771.630737] flags: 0x17fffc00000402a(uptodate|lru|private|writeback|node=0|zone=2|lastcpupid=0x1ffff) [445771.632359] raw: 017fffc00000402a fffff47284d950c8 fffff472907b7c08 ffff89f458e412b8 [445771.633713] raw: 0000000000001d00 ffff89f6c51d1a90 00000002ffffffff ffff89f400351b00 [445771.635029] page dumped because: eb page dump [445771.635825] BTRFS critical (device dm-0): corrupt leaf: root=5 block=30408704 slot=10 ino=258, invalid nlink: has 2 expect no more than 1 for dir [445771.638088] BTRFS info (device dm-0): leaf 30408704 gen 10 total ptrs 17 free space 14878 owner 5 [445771.638091] BTRFS info (device dm-0): refs 4 lock_owner 0 current 3581087 [445771.638094] item 0 key (256 INODE_ITEM 0) itemoff 16123 itemsize 160 [445771.638097] inode generation 3 transid 9 
size 16 nbytes 16384 [445771.638098] block group 0 mode 40755 links 1 uid 0 gid 0 [445771.638100] rdev 0 sequence 2 flags 0x0 [445771.638102] atime 1775744884.0 [445771.660056] ctime 1775744885.645502983 [445771.660058] mtime 1775744885.645502983 [445771.660060] otime 1775744884.0 [445771.660062] item 1 key (256 INODE_REF 256) itemoff 16111 itemsize 12 [445771.660064] index 0 name_len 2 [445771.660066] item 2 key (256 DIR_ITEM 1843588421) itemoff 16077 itemsize 34 [445771.660068] location key (259 1 0) type 2 [445771.660070] transid 9 data_len 0 name_len 4 [445771.660075] item 3 key (256 DIR_ITEM 2363071922) itemoff 16043 itemsize 34 [445771.660076] location key (257 1 0) type 2 [445771.660077] transid 9 data_len 0 name_len 4 [445771.660078] item 4 key (256 DIR_INDEX 2) itemoff 16009 itemsize 34 [445771.660079] location key (257 1 0) type 2 [445771.660080] transid 9 data_len 0 name_len 4 [445771.660081] item 5 key (256 DIR_INDEX 3) itemoff 15975 itemsize 34 [445771.660082] location key (259 1 0) type 2 [445771.660083] transid 9 data_len 0 name_len 4 [445771.660084] item 6 key (257 INODE_ITEM 0) itemoff 15815 itemsize 160 [445771.660086] inode generation 9 transid 9 size 8 nbytes 0 [445771.660087] block group 0 mode 40777 links 1 uid 0 gid 0 [445771.660088] rdev 0 sequence 2 flags 0x0 [445771.660089] atime 1775744885.641174097 [445771.660090] ctime 1775744885.645502983 [445771.660091] mtime 1775744885.645502983 [445771.660105] otime 1775744885.641174097 [445771.660106] item 7 key (257 INODE_REF 256) itemoff 15801 itemsize 14 [445771.660107] index 2 name_len 4 [445771.660108] item 8 key (257 DIR_ITEM 2676584006) itemoff 15767 itemsize 34 [445771.660109] location key (258 1 0) type 2 [445771.660110] transid 9 data_len 0 name_len 4 [445771.660111] item 9 key (257 DIR_INDEX 2) itemoff 15733 itemsize 34 [445771.660112] location key (258 1 0) type 2 [445771.660113] transid 9 data_len 0 name_len 4 [445771.660114] item 10 key (258 INODE_ITEM 0) itemoff 15573 itemsize 160 
[445771.660115] inode generation 9 transid 10 size 0 nbytes 0 [445771.660116] block group 0 mode 40755 links 2 uid 0 gid 0 [445771.660117] rdev 0 sequence 0 flags 0x0 [445771.660118] atime 1775744885.645502983 [445771.660119] ctime 1775744885.645502983 [445771.660120] mtime 1775744885.645502983 [445771.660121] otime 1775744885.645502983 [445771.660122] item 11 key (258 INODE_REF 257) itemoff 15559 itemsize 14 [445771.660123] index 2 name_len 4 [445771.660124] item 12 key (258 INODE_REF 259) itemoff 15545 itemsize 14 [445771.660125] index 2 name_len 4 [445771.660126] item 13 key (259 INODE_ITEM 0) itemoff 15385 itemsize 160 [445771.660127] inode generation 9 transid 10 size 8 nbytes 0 [445771.660128] block group 0 mode 40755 links 1 uid 0 gid 0 [445771.660129] rdev 0 sequence 1 flags 0x0 [445771.660130] atime 1775744885.645502983 [445771.660130] ctime 1775744885.645502983 [445771.660131] mtime 1775744885.645502983 [445771.660132] otime 1775744885.645502983 [445771.660133] item 14 key (259 INODE_REF 256) itemoff 15371 itemsize 14 [445771.660134] index 3 name_len 4 [445771.660135] item 15 key (259 DIR_ITEM 2676584006) itemoff 15337 itemsize 34 [445771.660136] location key (258 1 0) type 2 [445771.660137] transid 10 data_len 0 name_len 4 [445771.660138] item 16 key (259 DIR_INDEX 2) itemoff 15303 itemsize 34 [445771.660139] location key (258 1 0) type 2 [445771.660140] transid 10 data_len 0 name_len 4 [445771.660144] BTRFS error (device dm-0): block=30408704 write time tree block corruption detected [445771.661650] ------------[ cut here ]------------ [445771.662358] WARNING: fs/btrfs/disk-io.c:326 at btree_csum_one_bio+0x217/0x230 [btrfs], CPU#8: mount/3581087 [445771.663588] Modules linked in: btrfs f2fs xfs (...) 
[445771.671229] CPU: 8 UID: 0 PID: 3581087 Comm: mount Tainted: G W 7.0.0-rc6-btrfs-next-230+ #2 PREEMPT(full) [445771.672575] Tainted: [W]=WARN [445771.672987] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.2-0-gea1b7a073390-prebuilt.qemu.org 04/01/2014 [445771.674460] RIP: 0010:btree_csum_one_bio+0x217/0x230 [btrfs] [445771.675222] Code: 89 44 24 (...) [445771.677364] RSP: 0018:ffffd23882247660 EFLAGS: 00010246 [445771.678029] RAX: 0000000000000000 RBX: ffff89f6c51d1a90 RCX: 0000000000000000 [445771.678975] RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff89f406020000 [445771.679983] RBP: ffff89f821204000 R08: 0000000000000000 R09: 00000000ffefffff [445771.680905] R10: ffffd23882247448 R11: 0000000000000003 R12: ffffd23882247668 [445771.681978] R13: ffff89f458e40fc0 R14: ffff89f737f4f500 R15: ffff89f737f4f500 [445771.682912] FS: 00007f0447a98840(0000) GS:ffff89fb9771d000(0000) knlGS:0000000000000000 [445771.684393] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [445771.685230] CR2: 00007f0447bf1330 CR3: 000000017cb02002 CR4: 0000000000370ef0 [445771.686273] Call Trace: [445771.686646] <TASK> [445771.686969] btrfs_submit_bbio+0x83f/0x860 [btrfs] [445771.687750] ? write_one_eb+0x28f/0x340 [btrfs] [445771.688428] btree_writepages+0x2e3/0x550 [btrfs] [445771.689180] ? kmem_cache_alloc_noprof+0x12a/0x490 [445771.689963] ? alloc_extent_state+0x19/0x120 [btrfs] [445771.690801] ? kmem_cache_free+0x135/0x380 [445771.691328] ? preempt_count_add+0x69/0xa0 [445771.691831] ? set_extent_bit+0x252/0x8e0 [btrfs] [445771.692468] ? xas_load+0x9/0xc0 [445771.692873] ? xas_find+0x14d/0x1a0 [445771.693304] do_writepages+0xc6/0x160 [445771.693756] filemap_writeback+0xb8/0xe0 [445771.694274] btrfs_write_marked_extents+0x61/0x170 [btrfs] [445771.694999] btrfs_write_and_wait_transaction+0x4e/0xc0 [btrfs] [445771.695818] btrfs_commit_transaction+0x5c8/0xd10 [btrfs] [445771.696530] ? kmem_cache_free+0x135/0x380 [445771.697120] ? 
release_extent_buffer+0x34/0x160 [btrfs] [445771.697786] btrfs_recover_log_trees+0x7be/0x7e0 [btrfs] [445771.698525] ? __pfx_replay_one_buffer+0x10/0x10 [btrfs] [445771.699206] open_ctree+0x11e5/0x1810 [btrfs] [445771.699776] btrfs_get_tree.cold+0xb/0x162 [btrfs] [445771.700463] ? fscontext_read+0x165/0x180 [445771.701146] ? rw_verify_area+0x50/0x180 [445771.701866] vfs_get_tree+0x25/0xd0 [445771.702491] vfs_cmd_create+0x59/0xe0 [445771.703125] __do_sys_fsconfig+0x303/0x610 [445771.703603] do_syscall_64+0xe9/0xf20 [445771.703974] entry_SYSCALL_64_after_hwframe+0x76/0x7e [445771.704700] RIP: 0033:0x7f0447cbd4aa [445771.705108] Code: 73 01 c3 (...) [445771.707263] RSP: 002b:00007ffc4e528318 EFLAGS: 00000246 ORIG_RAX: 00000000000001af [445771.708107] RAX: ffffffffffffffda RBX: 00005561585d8c20 RCX: 00007f0447cbd4aa [445771.708931] RDX: 0000000000000000 RSI: 0000000000000006 RDI: 0000000000000003 [445771.709744] RBP: 00005561585d9120 R08: 0000000000000000 R09: 0000000000000000 [445771.710674] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000 [445771.711477] R13: 00007f0447e4f580 R14: 00007f0447e5126c R15: 00007f0447e36a23 [445771.712277] </TASK> [445771.712541] ---[ end trace 0000000000000000 ]--- [445771.713382] BTRFS error (device dm-0): error while writing out transaction: -5 [445771.714679] BTRFS warning (device dm-0): Skipping commit of aborted transaction. [445771.715562] BTRFS error (device dm-0 state A): Transaction aborted (error -5) [445771.716459] BTRFS: error (device dm-0 state A) in cleanup_transaction:2068: errno=-5 IO failure [445771.717936] BTRFS error (device dm-0 state EA): failed to recover log trees with error: -5 [445771.719681] BTRFS error (device dm-0 state EA): open_ctree failed: -5 The problem is that such an fsync should have resulted in a fallback to a transaction commit, but that did not happen because in btrfs_rmdir() we never update the directory's last_unlink_trans field. 
Any inode that had a link removed must have its last_unlink_trans updated to the ID of the transaction used for the operation, otherwise fsync and log replay will not work correctly. btrfs_rmdir() calls btrfs_unlink_inode() and through that call chain we never call btrfs_record_unlink_dir() in order to update last_unlink_trans. However btrfs_unlink(), which is used for unlinking regular files, calls btrfs_record_unlink_dir() and then calls btrfs_unlink_inode(). So fix this by moving the call to btrfs_record_unlink_dir() from btrfs_unlink() to btrfs_unlink_inode(). A test case for fstests will follow soon. Reported-by: Slava0135 <slava.kovalevskiy.2014@gmail.com> Link: https://lore.kernel.org/linux-btrfs/CAAJYhww5ov62Hm+n+tmhcL-e_4cBobg+OWogKjOJxVUXivC=MQ@mail.gmail.com/ CC: stable@vger.kernel.org Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2026-04-21 04:01:48 +02:00
Mark Harmstone	44366af740	btrfs: don't clobber errors in add_remap_tree_entries() In add_remap_tree_entries(), we only process a certain number of entries at a time, meaning we may need to loop. But because we weren't checking the return value of btrfs_insert_empty_items() within the loop, this meant that if the last iteration of the loop succeeded but a previous iteration failed, we were erroneously returning 0. Fix this by breaking the loop early if btrfs_insert_empty_items() fails. Fixes: `b56f35560b` ("btrfs: handle setting up relocation of block group with remap-tree") Signed-off-by: Mark Harmstone <mark@harmstone.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2026-04-21 04:01:43 +02:00
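The error-clobbering pattern described in the commit above is easy to reproduce in isolation. In this sketch `demo_insert_batch()` is a hypothetical stand-in for the per-batch insertion, and `fail_at` is an illustrative parameter that selects which batch fails.

```c
#include <errno.h>

/* Hypothetical per-batch insert: fails with -ENOSPC for batch @fail_at. */
static int demo_insert_batch(int idx, int fail_at)
{
	return idx == fail_at ? -ENOSPC : 0;
}

/* Buggy shape: ret is overwritten on every iteration, so an early
 * failure is hidden when a later batch succeeds. */
static int demo_add_entries_buggy(int batches, int fail_at)
{
	int ret = 0;

	for (int i = 0; i < batches; i++)
		ret = demo_insert_batch(i, fail_at);
	return ret;
}

/* Fixed shape: stop at the first failure and return it. */
static int demo_add_entries_fixed(int batches, int fail_at)
{
	int ret = 0;

	for (int i = 0; i < batches; i++) {
		ret = demo_insert_batch(i, fail_at);
		if (ret)
			break;
	}
	return ret;
}
```

With three batches and a failure in the first, the buggy variant returns 0 and the fixed variant returns -ENOSPC.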
Qu Wenruo	41e706c07e	btrfs: enable shutdown ioctl for non-experimental builds Although commit `304076527c` ("btrfs: move shutdown and remove_bdev callbacks out of experimental features") tries to move both shutdown and remove_bdev out of experimental features, that commit has only addressed the super block operation callback, the ioctl one is left untouched. Fix that missing aspect by also moving shutdown ioctl out of experimental features. Since we're here, also add unknown flag detection to reject any unsupported shutdown flags. Fixes: `304076527c` ("btrfs: move shutdown and remove_bdev callbacks out of experimental features") Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2026-04-21 04:01:31 +02:00
Qu Wenruo	a86a283430	btrfs: apply first key check for readahead when possible Currently for tree block readahead we never pass a btrfs_tree_parent_check with @has_first_key set. Without @has_first_key set, btrfs will skip the following extra checks: - Header generation check This is a minor one. - Empty leaf/node checks This is more serious, for certain trees like the csum tree, they are allowed to be empty, thus an empty leaf can pass the tree checker. But if there is a parent node for such an empty leaf, it indicates corruption. Without @has_first_key set, we can no longer detect such a problem. In fact there is already a fuzzed image report that a corrupted csum leaf which has zero nritems but still has a parent node can trigger a BUG_ON() during csum deletion. However there are only two call sites of btrfs_readahead_tree_block(): - Inside relocate_tree_blocks() At this call site we are trying to grab the first key of the tree block, thus we are not able to pass a @first_key parameter. - Inside btrfs_readahead_node_child() This is the more common call site, where we have the parent node and want to readahead the child tree blocks. In this case we can easily grab the node key and pass it for checks. Add a new parameter @first_key to btrfs_readahead_tree_block() and pass the node key to it inside btrfs_readahead_node_child(). This should plug the gap in empty leaf detection during readahead. Link: https://lore.kernel.org/linux-btrfs/20260409071255.3358044-1-gality369@gmail.com/ Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2026-04-21 04:01:24 +02:00
Mark Harmstone	73db0fad67	btrfs: abort transaction in do_remap_reloc_trans() on failure If one of the calls made by do_remap_reloc_trans() fails, we can leave the remap tree in an inconsistent state. Abort the transaction if this happens, to prevent the corrupt state from reaching the disk. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Mark Harmstone <mark@harmstone.com> Signed-off-by: David Sterba <dsterba@suse.com>	2026-04-21 04:00:52 +02:00
Mark Harmstone	9b8824533d	btrfs: fix bytes_may_use leak in do_remap_reloc_trans() If the call to btrfs_reserve_extent() in do_remap_reloc_trans() returns a smaller extent than we asked for, currently we're not undoing the bytes_may_use change that we made. Fix this by calling btrfs_space_info_update_bytes_may_use() again for the difference. Fixes: `fd6594b144` ("btrfs: replace identity remaps with actual remaps when doing relocations") Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Mark Harmstone <mark@harmstone.com> Signed-off-by: David Sterba <dsterba@suse.com>	2026-04-21 04:00:39 +02:00
Mark Harmstone	68a135013b	btrfs: fix bytes_may_use leak in move_existing_remap() If the call to btrfs_reserve_extent() in move_existing_remap() returns a smaller extent than we asked for, currently we're not undoing the bytes_may_use change that we made. Fix this by calling btrfs_space_info_update_bytes_may_use() again for the difference. Fixes: `bbea42dfb9` ("btrfs: move existing remaps before relocating block group") Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Mark Harmstone <mark@harmstone.com> Signed-off-by: David Sterba <dsterba@suse.com>	2026-04-21 04:00:32 +02:00
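The accounting fix described in the two bytes_may_use commits above can be illustrated with a small user-space model. This is only a sketch under stated assumptions: `struct space_info_model`, `reserve_extent_model()` and `remap_reserve()` are hypothetical names, not btrfs functions, and the allocator is faked so that it can hand back less than was requested.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the space accounting; not the btrfs struct. */
struct space_info_model {
	uint64_t bytes_may_use;  /* bytes promised to pending allocations */
	uint64_t bytes_reserved; /* bytes backed by an allocated extent */
};

/*
 * Fake allocator: grants at most max_avail bytes and converts that much
 * of bytes_may_use into bytes_reserved.
 */
static uint64_t reserve_extent_model(struct space_info_model *si,
				     uint64_t wanted, uint64_t max_avail)
{
	uint64_t got = wanted < max_avail ? wanted : max_avail;

	assert(si->bytes_may_use >= got);
	si->bytes_may_use -= got;
	si->bytes_reserved += got;
	return got;
}

/* Reserve space for a remap and undo the unused part of the promise. */
static uint64_t remap_reserve(struct space_info_model *si,
			      uint64_t wanted, uint64_t max_avail)
{
	uint64_t got;

	si->bytes_may_use += wanted;
	got = reserve_extent_model(si, wanted, max_avail);

	/* The fix: return the difference when the extent is smaller. */
	if (got < wanted)
		si->bytes_may_use -= wanted - got;
	return got;
}
```

Without the final subtraction, `bytes_may_use` in this model would stay inflated by `wanted - got` after every short allocation, which is the kind of leak the commit messages describe.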
Boris Burkov	fc3d532881	btrfs: btrfs_log_dev_io_error() on all bio errors As far as I can tell, we never intentionally constrained ourselves to these status codes, and it is misleading and surprising to lack the bdev error logging when we get a different error code from the block layer. This can lead to jumping to a wrong conclusion like "this system didn't see any bio failures but aborted with EIO". For example on nvme devices, I observe many failures coming back as BLK_STS_MEDIUM. It is apparent that the nvme driver returns a variety of BLK_STS_* status values in nvme_error_status(). So handle the known expected errors and make some noise on the rest which we expect won't really happen. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Anand Jain <asj@kernel.org> Signed-off-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>	2026-04-07 20:00:29 +02:00
Michal Grzedzicki	3cd181cc46	btrfs: fix silent IO error loss in encoded writes and zoned split can_finish_ordered_extent() and btrfs_finish_ordered_zoned() set BTRFS_ORDERED_IOERR via bare set_bit(). Later, btrfs_mark_ordered_extent_error() in btrfs_finish_one_ordered() uses test_and_set_bit(), finds it already set, and skips mapping_set_error(). The error is never recorded on the inode's address_space, making it invisible to fsync. For encoded writes this causes btrfs receive to silently produce files with zero-filled holes. Fix: replace bare set_bit(BTRFS_ORDERED_IOERR) with btrfs_mark_ordered_extent_error() which pairs test_and_set_bit() with mapping_set_error(), guaranteeing the error is recorded exactly once. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Mark Harmstone <mark@harmstone.com> Signed-off-by: Michal Grzedzicki <mge@meta.com> Signed-off-by: David Sterba <dsterba@suse.com>	2026-04-07 19:43:50 +02:00
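The interaction between a bare set_bit() and a later test_and_set_bit() can be shown with a simplified, non-atomic user-space model. All names here (`struct ordered_extent_model`, `mark_ordered_error()`, `set_ioerr_bare()`, the flag value) are hypothetical stand-ins for illustration and do not reproduce the btrfs code.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

#define ORDERED_IOERR_BIT 0x1u /* hypothetical flag value */

struct mapping_model {
	int err; /* what fsync would later report */
};

struct ordered_extent_model {
	unsigned int flags;
	struct mapping_model *mapping;
};

/* Non-atomic stand-in for test_and_set_bit(): returns the old state. */
static bool test_and_set_flag(unsigned int *flags, unsigned int bit)
{
	bool was_set = (*flags & bit) != 0;

	*flags |= bit;
	return was_set;
}

/* Fixed pattern: the first caller to set the bit also records the error. */
static void mark_ordered_error(struct ordered_extent_model *oe)
{
	if (!test_and_set_flag(&oe->flags, ORDERED_IOERR_BIT))
		oe->mapping->err = -EIO;
}

/* Buggy pattern: sets the bit but never records the error. */
static void set_ioerr_bare(struct ordered_extent_model *oe)
{
	oe->flags |= ORDERED_IOERR_BIT;
}
```

In this model, calling `set_ioerr_bare()` first makes a later `mark_ordered_error()` a no-op, so the mapping error stays at 0. Using `mark_ordered_error()` at every site records the error on the first call and skips the repeats.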
Dave Chen	e0dfaebb8f	btrfs: skip clearing EXTENT_DEFRAG for NOCOW ordered extents In btrfs_finish_one_ordered(), clear_bits is unconditionally initialized with EXTENT_DEFRAG. For NOCOW ordered extents this is always a no-op because should_nocow() already forces the COW path when EXTENT_DEFRAG is set, so a NOCOW ordered extent can never have EXTENT_DEFRAG on its range. Although harmless, the unconditional btrfs_clear_extent_bit() call still performs a cold rbtree lookup under the io tree spinlock on every NOCOW write completion. Avoid this by only adding EXTENT_DEFRAG to clear_bits for non-NOCOW ordered extents, and skip the call entirely when there are no bits to clear. Signed-off-by: Dave Chen <davechen@synology.com> Signed-off-by: Robbie Ko <robbieko@synology.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2026-04-07 19:43:22 +02:00
Dave Chen	e70e3f858e	btrfs: use BTRFS_FS_UPDATE_UUID_TREE_GEN flag for UUID tree rescan check The UUID tree rescan check in open_ctree() compares fs_info->generation with the superblock's uuid_tree_generation. This comparison is not reliable because fs_info->generation is bumped at transaction start time in join_transaction(), while uuid_tree_generation is only updated at commit time via update_super_roots(). Between the early BTRFS_FS_UPDATE_UUID_TREE_GEN flag check and the late rescan decision, mount operations such as file orphan cleanup from an unclean shutdown start transactions without committing them. This advances fs_info->generation past uuid_tree_generation and produces a false-positive mismatch. Use the BTRFS_FS_UPDATE_UUID_TREE_GEN flag directly instead. The flag was already set earlier in open_ctree() when the generations were known to match, and accurately represents "UUID tree is up to date" without being affected by subsequent transaction starts. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Dave Chen <davechen@synology.com> Signed-off-by: Robbie Ko <robbieko@synology.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2026-04-07 19:43:06 +02:00
Filipe Manana	e1194226bf	btrfs: remove duplicate journal_info reset on failure to commit transaction If we get an error during the transaction commit path, we are resetting current->journal_info to NULL twice - once in btrfs_commit_transaction() right before calling cleanup_transaction() and then once again inside cleanup_transaction(). Remove the instance in btrfs_commit_transaction(). Reviewed-by: Anand Jain <asj@kernel.org> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2026-04-07 19:42:24 +02:00
Filipe Manana	7801f3ea95	btrfs: tag as unlikely if statements that check for fs in error state Having the filesystem in an error state, meaning we had a transaction abort, is unexpected. Mark every check for the error state with the unlikely annotation to convey that and to allow the compiler to generate better code. On x86_64, using gcc 14.2.0-19 from Debian, resulted in a slightly reduced object size and better code. Before: $ size fs/btrfs/btrfs.ko text data bss dec hex filename 2008598 175912 15592 2200102 219226 fs/btrfs/btrfs.ko After: $ size fs/btrfs/btrfs.ko text data bss dec hex filename 2008450 175912 15592 2199954 219192 fs/btrfs/btrfs.ko Reviewed-by: Anand Jain <asj@kernel.org> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2026-04-07 19:41:42 +02:00
Guangshuo Li	3f487be812	btrfs: fix double free in create_space_info() error path When kobject_init_and_add() fails, the call chain is: create_space_info() -> btrfs_sysfs_add_space_info_type() -> kobject_init_and_add() -> failure -> kobject_put(&space_info->kobj) -> space_info_release() -> kfree(space_info) Then control returns to create_space_info(): btrfs_sysfs_add_space_info_type() returns error -> goto out_free -> kfree(space_info) This causes a double free. Keep the direct kfree(space_info) for the earlier failure path, but after btrfs_sysfs_add_space_info_type() has called kobject_put(), let the kobject release callback handle the cleanup. Fixes: `a11224a016` ("btrfs: fix memory leaks in create_space_info() error paths") CC: stable@vger.kernel.org # 6.19+ Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Guangshuo Li <lgs201920130244@gmail.com> Signed-off-by: David Sterba <dsterba@suse.com>	2026-04-07 18:56:09 +02:00
Guangshuo Li	a7449edf96	btrfs: fix double free in create_space_info_sub_group() error path When kobject_init_and_add() fails, the call chain is: create_space_info_sub_group() -> btrfs_sysfs_add_space_info_type() -> kobject_init_and_add() -> failure -> kobject_put(&sub_group->kobj) -> space_info_release() -> kfree(sub_group) Then control returns to create_space_info_sub_group(), where: btrfs_sysfs_add_space_info_type() returns error -> kfree(sub_group) Thus, sub_group is freed twice. Keep parent->sub_group[index] = NULL for the failure path, but after btrfs_sysfs_add_space_info_type() has called kobject_put(), let the kobject release callback handle the cleanup. Fixes: `f92ee31e03` ("btrfs: introduce btrfs_space_info sub-group") CC: stable@vger.kernel.org # 6.18+ Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Guangshuo Li <lgs201920130244@gmail.com> Signed-off-by: David Sterba <dsterba@suse.com>	2026-04-07 18:56:09 +02:00
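The ownership rule behind the two double-free fixes above can be sketched in user space with a toy reference count. This is an assumption-laden model, not the kobject API: `struct obj_model`, `obj_put()`, `register_model()` and `create_model()` are hypothetical, and the `frees` counter exists only so the behavior can be observed.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

static int frees; /* counts release calls, for demonstration only */

struct obj_model {
	int refs;
	void (*release)(struct obj_model *);
};

static void obj_release(struct obj_model *o)
{
	frees++;
	free(o);
}

static void obj_put(struct obj_model *o)
{
	if (--o->refs == 0)
		o->release(o);
}

/*
 * Stand-in for the registration helper: on failure it drops the
 * reference itself, which frees the object through the release callback.
 */
static int register_model(struct obj_model *o, int fail)
{
	o->refs = 1;
	o->release = obj_release;
	if (fail) {
		obj_put(o);
		return -ENOMEM;
	}
	return 0;
}

static int create_model(int fail_early, int fail_register,
			struct obj_model **out)
{
	struct obj_model *o = calloc(1, sizeof(*o));
	int ret;

	*out = NULL;
	if (!o)
		return -ENOMEM;
	if (fail_early) {
		free(o); /* not registered yet: a direct free is correct */
		return -ENOMEM;
	}
	ret = register_model(o, fail_register);
	if (ret)
		return ret; /* no free here: the release callback already ran */
	*out = o;
	return 0;
}
```

The key design point is that once the object has been handed to the reference-counting layer, only the release callback frees it. A direct free is kept solely for failures that happen before that handoff.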
Qu Wenruo	3c0c45a4df	btrfs: do not reject a valid running dev-replace [BUG] There is a bug report that a btrfs filesystem with a running dev-replace was rejected with the following messages: BTRFS error (device sdk1): devid 0 path /dev/sdk1 is registered but not found in chunk tree BTRFS error (device sdk1): remove the above devices or use 'btrfs device scan --forget <dev>' to unregister them before mount BTRFS error (device sdk1): open_ctree failed: -117 [CAUSE] The tree and super block dumps show the fs is completely sane, except for one thing: there is no dev item for devid 0 in the chunk tree. However this is not a bug, as we do not insert a dev item for devid 0 in the first place. Since devid 0 is only there temporarily, we do not really need to insert a dev item for it and then later remove it again. Commit `3430818739` ("btrfs: add extra device item checks at mount") added an overly strict check that triggers a false alert and rejects the valid filesystem. [FIX] Add special handling for devid 0 and do not require it to have a device item in the chunk tree. Reported-by: Jaron Viëtor <jaron@vietors.com> Link: https://lore.kernel.org/linux-btrfs/CAF1bhLVYLZvD=j2XyuxXDKD-NWNJAwDnpVN+UYeQW-HbzNRn1A@mail.gmail.com/ Fixes: `3430818739` ("btrfs: add extra device item checks at mount") Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2026-04-07 18:56:08 +02:00
Qu Wenruo	48aa5c0e2b	btrfs: only invalidate btree inode pages after all ebs are released In close_ctree(), we call invalidate_inode_pages2() to invalidate all pages from btree inode. But the problem is, it never returns 0, but always -EBUSY. The problem is that we are still holding all the essential tree root nodes, thus pages holding those tree blocks can not be invalidated thus invalidate_inode_pages2() always returns -EBUSY. This is also against the error cleanup path of open_ctree(), which properly frees all root pointers before calling invalidate_inode_pages(). So fix the order by delaying invalidate_inode_pages2() until we have freed all root pointers. Reviewed-by: Anand Jain <asj@kernel.org> Reviewed-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2026-04-07 18:56:08 +02:00
JP Kobryn (Meta)	7ae37b2c94	btrfs: prevent direct reclaim during compressed readahead Under memory pressure, direct reclaim can kick in during compressed readahead. This puts the associated task into D-state. Then shrink_lruvec() disables interrupts when acquiring the LRU lock. Under heavy pressure, we've observed reclaim can run long enough that the CPU becomes prone to CSD lock stalls since it cannot service incoming IPIs. Although the CSD lock stalls are the worst case scenario, we have found many more subtle occurrences of this latency on the order of seconds, over a minute in some cases. Prevent direct reclaim during compressed readahead. This is achieved by using different GFP flags at key points when the bio is marked for readahead. There are two functions that allocate during compressed readahead: btrfs_alloc_compr_folio() and add_ra_bio_pages(). Both currently use GFP_NOFS which includes __GFP_DIRECT_RECLAIM. For the internal API call btrfs_alloc_compr_folio(), the signature changes to accept an additional gfp_t parameter. At the readahead call site, it gets flags similar to GFP_NOFS but stripped of __GFP_DIRECT_RECLAIM. __GFP_NOWARN is added since these allocations are allowed to fail. Demand reads still use full GFP_NOFS and will enter reclaim if needed. All other existing call sites of btrfs_alloc_compr_folio() now explicitly pass GFP_NOFS to retain their current behavior. add_ra_bio_pages() gains a bool parameter which allows callers to specify if they want to allow direct reclaim or not. In either case, the __GFP_NOWARN flag was added unconditionally since the allocations are speculative. There has been some previous work done on calling add_ra_bio_pages() [0]. This patch is complementary: where that patch reduces call frequency, this patch reduces the latency associated with those calls. 
[0] https://lore.kernel.org/linux-btrfs/656838ec1232314a2657716e59f4f15a8eadba64.1751492111.git.boris@bur.io/ Reviewed-by: Mark Harmstone <mark@harmstone.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: JP Kobryn (Meta) <jp.kobryn@linux.dev> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2026-04-07 18:56:08 +02:00
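The flag selection described in the compressed readahead commit above can be sketched as a small mask computation. The bit values and the `M_`-prefixed names below are illustrative assumptions and do not match the kernel's real gfp_t layout; `compr_read_gfp()` is a hypothetical helper, not a function from the patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative bit values only; the real gfp_t bits differ. */
typedef unsigned int gfp_model_t;

#define M_GFP_DIRECT_RECLAIM 0x1u
#define M_GFP_KSWAPD_RECLAIM 0x2u
#define M_GFP_IO             0x4u
#define M_GFP_NOWARN         0x8u
/* Modeled after GFP_NOFS: reclaim allowed, I/O allowed, no FS recursion. */
#define M_GFP_NOFS (M_GFP_DIRECT_RECLAIM | M_GFP_KSWAPD_RECLAIM | M_GFP_IO)

/*
 * Readahead allocations are speculative: drop direct reclaim so the task
 * never blocks in reclaim, and silence failure warnings. Demand reads
 * keep the full mask and may enter reclaim.
 */
static gfp_model_t compr_read_gfp(bool readahead)
{
	if (readahead)
		return (M_GFP_NOFS & ~M_GFP_DIRECT_RECLAIM) | M_GFP_NOWARN;
	return M_GFP_NOFS;
}
```

Background reclaim stays enabled for readahead in this sketch, so memory pressure is still signaled; only the synchronous reclaim path is avoided.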
Teng Liu	30d537f723	btrfs: replace BUG_ON() with error return in cache_save_setup() In cache_save_setup(), if create_free_space_inode() succeeds but the subsequent lookup_free_space_inode() still fails on retry, the BUG_ON(retries) will crash the kernel. This can happen due to I/O errors or transient failures, not just programming bugs. Replace the BUG_ON with proper error handling that returns the original error code through the existing cleanup path. The callers already handle this gracefully: disk_cache_state defaults to BTRFS_DC_ERROR, so the space cache simply won't be written for that block group. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Teng Liu <27rabbitlt@gmail.com> Signed-off-by: David Sterba <dsterba@suse.com>	2026-04-07 18:56:08 +02:00
David Sterba	f0d3b4c7b8	btrfs: zstd: don't cache sectorsize in a local variable The sectorsize is used once or at most twice in the callbacks, no need to cache it on stack. Minor effect on zstd_compress_folios() where it saves 8 bytes of stack. Signed-off-by: David Sterba <dsterba@suse.com>	2026-04-07 18:56:08 +02:00
David Sterba	efcf0898a6	btrfs: zlib: don't cache sectorsize in a local variable The sectorsize is used once or at most twice in the callbacks, no need to cache it on stack. Signed-off-by: David Sterba <dsterba@suse.com>	2026-04-07 18:56:08 +02:00
David Sterba	4d083672b4	btrfs: zlib: drop redundant folio address variable We're caching the current output folio address but it's not really necessary as we store it in the variable and then pass it to the stream context. We can read the folio address directly. Signed-off-by: David Sterba <dsterba@suse.com>	2026-04-07 18:56:07 +02:00
David Sterba	5b93f24168	btrfs: lzo: inline read/write length helpers The LZO_LEN read/write helpers are supposed to be trivial and we're duplicating the put/get unaligned helpers so use them directly. Signed-off-by: David Sterba <dsterba@suse.com>	2026-04-07 18:56:07 +02:00
David Sterba	463626a2ec	btrfs: use common eb range validation in read_extent_buffer_to_user_nofault() The extent buffer access is checked in other helpers by check_eb_range(), which validates the requested start, length against the extent buffer. While this almost never fails we should still handle it as an error and not just warn. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>	2026-04-07 18:56:07 +02:00
David Sterba	b8aa337121	btrfs: read eb folio index right before loops There are generic helpers to access extent buffer folio data of any length, potentially iterating over a few of them. This is a slow path, either we use the type based accessors or the eb folio allocation is contiguous and we can use the memcpy/memcmp helpers. The initialization of 'i' is done at the beginning though it may not be needed. Move it right before the folio loop, this has minor effect on generated code in __write_extent_buffer(). Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>	2026-04-07 18:56:07 +02:00
David Sterba	aae9042194	btrfs: rename local variable for offset in folio Use proper abbreviation of the 'offset in folio' in the variable name, same as we have in accessors.c. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>	2026-04-07 18:56:07 +02:00
David Sterba	a5b6b23c45	btrfs: unify types for binary search variables The variables calculating where to jump next are using mixed in types which requires some conversions on the instruction level. Using 'u32' removes one call to 'movslq', making the main loop shorter. This complements type conversion done in `a724f313f8` ("btrfs: do unsigned integer division in the extent buffer binary search loop") Signed-off-by: David Sterba <dsterba@suse.com>	2026-04-07 18:56:07 +02:00
David Sterba	7e1e45a9e4	btrfs: remove duplicate calculation of eb offset in btrfs_bin_search() In the main search loop the variable 'oil' (offset in folio) is set twice, one duplicated when the key fits completely to the contiguous range. We can remove it and while it's just a simple calculation, the binary search loop is executed many times so micro optimizations add up. The code size is reduced by 64 bytes on release config, the loop is reorganized a bit and a few instructions shorter. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>	2026-04-07 18:56:07 +02:00
Mark Harmstone	b753612be0	btrfs: tree-checker: add remap-tree checks to check_block_group_item() Add some write-time checks for block group items relating to the remap tree. Here we're checking: * That the REMAPPED or METADATA_REMAP flags aren't set unless the REMAP_TREE incompat flag is also set * That `remap_bytes` isn't more than the size of the block group * That `identity_remap_count` isn't more than the number of sectors in the block group Signed-off-by: Mark Harmstone <mark@harmstone.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2026-04-07 18:56:06 +02:00
Filipe Manana	e3799e65c1	btrfs: make btrfs_free_log() and btrfs_free_log_root_tree() return void These functions never fail, always return success (0) and none of the callers care about their return values. Change their return type from int to void. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2026-04-07 18:56:06 +02:00
Filipe Manana	b48c980b6a	btrfs: fix deadlock between reflink and transaction commit when using flushoncommit When using the flushoncommit mount option, we can have a deadlock between a transaction commit and a reflink operation that copied an inline extent to an offset beyond the current i_size of the destination inode. The deadlock happens like this: 1) Task A clones an inline extent from inode X to an offset of inode Y that is beyond Y's current i_size. This means we copied the inline extent's data to a folio of inode Y that is beyond its EOF, using a call to copy_inline_to_page(); 2) Task B starts a transaction commit and calls btrfs_start_delalloc_flush() to flush delalloc; 3) The delalloc flushing sees the new dirty folio of inode Y and when it attempts to flush it, it ends up at extent_writepage() and sees that the offset of the folio is beyond the i_size of inode Y, so it attempts to invalidate the folio by calling folio_invalidate(), which ends up at btrfs' folio invalidate callback - btrfs_invalidate_folio(). There it tries to lock the folio's range in inode Y's extent io tree, but it blocks since it's currently locked by task A - during a reflink we lock the inodes and the source and destination ranges after flushing all delalloc and waiting for ordered extent completion - after that we don't expect to have dirty folios in the ranges, the exception is if we have to copy an inline extent's data (because the destination offset is not zero); 4) Task A then attempts to start a transaction to update the inode item, and then it's blocked since the current transaction is in the TRANS_STATE_COMMIT_START state. Therefore task A has to wait for the current transaction to become unblocked (its state >= TRANS_STATE_UNBLOCKED). So task A is waiting for the transaction commit done by task B, and task B is waiting on the extent lock of inode Y that is currently held by task A. 
Syzbot recently reported this with the following stack traces: INFO: task kworker/u8:7:1053 blocked for more than 143 seconds. Not tainted syzkaller #0 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. task:kworker/u8:7 state:D stack:23520 pid:1053 tgid:1053 ppid:2 task_flags:0x4208060 flags:0x00080000 Workqueue: writeback wb_workfn (flush-btrfs-46) Call Trace: <TASK> context_switch kernel/sched/core.c:5298 [inline] __schedule+0x1553/0x5240 kernel/sched/core.c:6911 __schedule_loop kernel/sched/core.c:6993 [inline] schedule+0x164/0x360 kernel/sched/core.c:7008 wait_extent_bit fs/btrfs/extent-io-tree.c:811 [inline] btrfs_lock_extent_bits+0x59c/0x700 fs/btrfs/extent-io-tree.c:1914 btrfs_lock_extent fs/btrfs/extent-io-tree.h:152 [inline] btrfs_invalidate_folio+0x43d/0xc40 fs/btrfs/inode.c:7704 extent_writepage fs/btrfs/extent_io.c:1852 [inline] extent_write_cache_pages fs/btrfs/extent_io.c:2580 [inline] btrfs_writepages+0x12ff/0x2440 fs/btrfs/extent_io.c:2713 do_writepages+0x32e/0x550 mm/page-writeback.c:2554 __writeback_single_inode+0x133/0x11a0 fs/fs-writeback.c:1750 writeback_sb_inodes+0x995/0x19d0 fs/fs-writeback.c:2042 wb_writeback+0x456/0xb70 fs/fs-writeback.c:2227 wb_do_writeback fs/fs-writeback.c:2374 [inline] wb_workfn+0x41a/0xf60 fs/fs-writeback.c:2414 process_one_work kernel/workqueue.c:3276 [inline] process_scheduled_works+0xb6e/0x18c0 kernel/workqueue.c:3359 worker_thread+0xa53/0xfc0 kernel/workqueue.c:3440 kthread+0x388/0x470 kernel/kthread.c:436 ret_from_fork+0x51e/0xb90 arch/x86/kernel/process.c:158 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245 </TASK> INFO: task syz.4.64:6910 blocked for more than 143 seconds. Not tainted syzkaller #0 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. 
task:syz.4.64 state:D stack:22752 pid:6910 tgid:6905 ppid:5944 task_flags:0x400140 flags:0x00080002 Call Trace: <TASK> context_switch kernel/sched/core.c:5298 [inline] __schedule+0x1553/0x5240 kernel/sched/core.c:6911 __schedule_loop kernel/sched/core.c:6993 [inline] schedule+0x164/0x360 kernel/sched/core.c:7008 wait_current_trans+0x39f/0x590 fs/btrfs/transaction.c:535 start_transaction+0x6a7/0x1650 fs/btrfs/transaction.c:705 clone_copy_inline_extent fs/btrfs/reflink.c:299 [inline] btrfs_clone+0x128a/0x24d0 fs/btrfs/reflink.c:529 btrfs_clone_files+0x271/0x3f0 fs/btrfs/reflink.c:750 btrfs_remap_file_range+0x76b/0x1320 fs/btrfs/reflink.c:903 vfs_copy_file_range+0xda7/0x1390 fs/read_write.c:1600 __do_sys_copy_file_range fs/read_write.c:1683 [inline] __se_sys_copy_file_range+0x2fb/0x480 fs/read_write.c:1650 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline] do_syscall_64+0x14d/0xf80 arch/x86/entry/syscall_64.c:94 entry_SYSCALL_64_after_hwframe+0x77/0x7f RIP: 0033:0x7f5f73afc799 RSP: 002b:00007f5f7315e028 EFLAGS: 00000246 ORIG_RAX: 0000000000000146 RAX: ffffffffffffffda RBX: 00007f5f73d75fa0 RCX: 00007f5f73afc799 RDX: 0000000000000005 RSI: 0000000000000000 RDI: 0000000000000005 RBP: 00007f5f73b92c99 R08: 0000000000000863 R09: 0000000000000000 R10: 00002000000000c0 R11: 0000000000000246 R12: 0000000000000000 R13: 00007f5f73d76038 R14: 00007f5f73d75fa0 R15: 00007fff138a5068 </TASK> INFO: task syz.4.64:6975 blocked for more than 143 seconds. Not tainted syzkaller #0 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. 
task:syz.4.64 state:D stack:24736 pid:6975 tgid:6905 ppid:5944 task_flags:0x400040 flags:0x00080002 Call Trace: <TASK> context_switch kernel/sched/core.c:5298 [inline] __schedule+0x1553/0x5240 kernel/sched/core.c:6911 __schedule_loop kernel/sched/core.c:6993 [inline] schedule+0x164/0x360 kernel/sched/core.c:7008 wb_wait_for_completion+0x3e8/0x790 fs/fs-writeback.c:227 __writeback_inodes_sb_nr+0x24c/0x2d0 fs/fs-writeback.c:2838 try_to_writeback_inodes_sb+0x9a/0xc0 fs/fs-writeback.c:2886 btrfs_start_delalloc_flush fs/btrfs/transaction.c:2175 [inline] btrfs_commit_transaction+0x82e/0x31a0 fs/btrfs/transaction.c:2364 btrfs_ioctl+0xca7/0xd00 fs/btrfs/ioctl.c:5206 vfs_ioctl fs/ioctl.c:51 [inline] __do_sys_ioctl fs/ioctl.c:597 [inline] __se_sys_ioctl+0xff/0x170 fs/ioctl.c:583 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline] do_syscall_64+0x14d/0xf80 arch/x86/entry/syscall_64.c:94 entry_SYSCALL_64_after_hwframe+0x77/0x7f RIP: 0033:0x7f5f73afc799 RSP: 002b:00007f5f7313d028 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 RAX: ffffffffffffffda RBX: 00007f5f73d76090 RCX: 00007f5f73afc799 RDX: 0000000000000000 RSI: 0000000000009408 RDI: 0000000000000004 RBP: 00007f5f73b92c99 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000 R13: 00007f5f73d76128 R14: 00007f5f73d76090 R15: 00007fff138a5068 </TASK> Fix this by updating the i_size of the destination inode of a reflink operation after we copy an inline extent's data to an offset beyond the i_size and before attempting to start a transaction to update the inode's item. Reported-by: syzbot+63056bf627663701bbbf@syzkaller.appspotmail.com Link: https://lore.kernel.org/linux-btrfs/69bba3fe.050a0220.227207.002f.GAE@google.com/ Fixes: `05a5a7621c` ("Btrfs: implement full reflink support for inline extents") Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2026-04-07 18:56:06 +02:00
Mark Harmstone	18addf9ec8	btrfs: tree-checker: check remap-tree flags in btrfs_check_chunk_valid() Add a check to btrfs_check_chunk_valid() that the METADATA_REMAP and REMAPPED flags are only set if the REMAP_TREE incompat flag is also set. Signed-off-by: Mark Harmstone <mark@harmstone.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2026-04-07 18:56:06 +02:00
Mark Harmstone	da08c02bc7	btrfs: tree-checker: add checker for items in remap tree Add write-time checking of items in the remap tree, to catch errors before they are written to disk. We're checking: * That remap items, remap backrefs, and identity remaps aren't written unless the REMAP_TREE incompat flag is set * That identity remaps have a size of 0 * That remap items and remap backrefs have a size of sizeof(struct btrfs_remap_item) * That the objectid for these items is aligned to the sector size * That the offset for these items (i.e. the size of the remapping) isn't 0 and is aligned to the sector size * That objectid + offset doesn't overflow Signed-off-by: Mark Harmstone <mark@harmstone.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2026-04-07 18:56:06 +02:00
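Three of the key checks listed in the remap tree checker commit above (objectid alignment, non-zero aligned offset, no overflow) can be sketched as a standalone validator. `struct key_model` and `check_remap_key()` are hypothetical names, the sector size is assumed to be a power of two, and the incompat flag and item size checks are left out because they depend on structures not shown in the commit message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical reduced key: objectid is the start, offset is the length. */
struct key_model {
	uint64_t objectid;
	uint64_t offset;
};

/* sectorsize is assumed to be a power of two. */
static bool is_aligned(uint64_t v, uint32_t sectorsize)
{
	return (v & ((uint64_t)sectorsize - 1)) == 0;
}

/* Returns 0 if the key passes the modeled checks, -1 otherwise. */
static int check_remap_key(const struct key_model *key, uint32_t sectorsize)
{
	/* The start of the remapped range must be sector aligned. */
	if (!is_aligned(key->objectid, sectorsize))
		return -1;

	/* The size of the remapping must be non-zero and sector aligned. */
	if (key->offset == 0 || !is_aligned(key->offset, sectorsize))
		return -1;

	/* objectid + offset must not wrap around. */
	if (key->objectid + key->offset < key->objectid)
		return -1;

	return 0;
}
```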
Dave Chen	0e6a169c64	btrfs: fix unnecessary flush on close when truncating zero-sized files In btrfs_setsize(), when a file is truncated to size 0, the BTRFS_INODE_FLUSH_ON_CLOSE flag is unconditionally set to ensure pending writes get flushed on close. This flag was designed to protect the "truncate-then-rewrite" pattern, where an application truncates a file with existing data down to zero and writes new content, ensuring the new data reaches disk on close. However, when a file already has a size of 0 (e.g. a newly created file opened with O_CREAT | O_TRUNC), oldsize and newsize are both 0. In this case, setting BTRFS_INODE_FLUSH_ON_CLOSE is unnecessary because no "good data" was truncated away. The subsequent filemap_flush() in btrfs_release_file() then triggers avoidable writeback that disrupts the normal delayed writeback batching, adding I/O overhead. This comes from a real workload. A backup service creates temporary files via mkstemp(), closes them, and later reopens them with O_TRUNC for writing. The O_TRUNC is defensive. The file creation and usage are done by different components, so removing the unneeded truncation is not straightforward. This pattern repeats for a large number of files, and each close() triggers an unnecessary filemap_flush(). Signed-off-by: Dave Chen <davechen@synology.com> Signed-off-by: Robbie Ko <robbieko@synology.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2026-04-07 18:56:06 +02:00
Qu Wenruo	304076527c	btrfs: move shutdown and remove_bdev callbacks out of experimental features These two new callbacks were introduced in v6.19, and it has now been two releases as of v7.1. During that time no bugs related to these two features have been exposed, so it is time to expose them to end users. It is especially important to expose the remove_bdev callback to end users. That new callback makes btrfs automatically shut down or go degraded when a device goes missing (depending on whether the fs can maintain RW), which affects end users. We want some feedback from early adopters. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2026-04-07 18:56:05 +02:00
Yochai Eisenrich	973e57c726	btrfs: fix btrfs_ioctl_space_info() slot_count TOCTOU which can lead to info-leak btrfs_ioctl_space_info() has a TOCTOU race between two passes over the block group RAID type lists. The first pass counts entries to determine the allocation size, then the second pass fills the buffer. The groups_sem rwlock is released between passes, allowing concurrent block group removal to reduce the entry count. When the second pass fills fewer entries than the first pass counted, copy_to_user() copies the full alloc_size bytes including trailing uninitialized kmalloc bytes to userspace. Fix by copying only total_spaces entries (the actually-filled count from the second pass) instead of alloc_size bytes, and switch to kzalloc so any future copy size mismatch cannot leak heap data. Fixes: `7fde62bffb` ("Btrfs: buffer results in the space_info ioctl") CC: stable@vger.kernel.org # 3.0 Signed-off-by: Yochai Eisenrich <echelonh@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2026-04-07 18:56:05 +02:00
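The two-pass race and its fix, as described in the commit above, can be modeled in user space. This is a sketch under assumptions: `struct space_entry` and `space_info_copy()` are hypothetical, memcpy() stands in for copy_to_user(), and the shrinking list is simulated by passing a `live_count` smaller than `counted_slots`.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical reduced record returned to the caller. */
struct space_entry {
	unsigned long long flags, total, used;
};

/*
 * counted_slots: result of the first (counting) pass.
 * live/live_count: what the second pass actually sees; this may be
 * smaller if entries were removed between the passes.
 * Returns the number of bytes copied out.
 */
static size_t space_info_copy(const struct space_entry *live,
			      size_t live_count, size_t counted_slots,
			      void *user_buf)
{
	struct space_entry *buf;
	size_t filled = 0;
	size_t copy_len;

	if (counted_slots == 0)
		return 0;

	/* Zeroed allocation: a size mismatch can never expose stale heap data. */
	buf = calloc(counted_slots, sizeof(*buf));
	if (!buf)
		return 0;

	while (filled < live_count && filled < counted_slots) {
		buf[filled] = live[filled];
		filled++;
	}

	/* Copy only what the second pass filled, not the allocation size. */
	copy_len = filled * sizeof(*buf);
	memcpy(user_buf, buf, copy_len);
	free(buf);
	return copy_len;
}
```

Either measure alone would close the leak in this model. The commit applies both, so a future mismatch in the copy size still cannot expose uninitialized memory.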
Filipe Manana	cee4cfd6cc	btrfs: avoid taking the device_list_mutex in btrfs_run_dev_stats() btrfs_run_dev_stats() is called during the critical section of a transaction commit and it takes the device_list_mutex, which is also acquired by fitrim, which does discard operations while holding that mutex. Most of the time, if we are on a healthy filesystem, we don't have new stat updates to persist in the device tree, so blocking on the device_list_mutex is just wasting time and making any tasks that need to start a new transaction wait longer than necessary. Since the device list is RCU safe/protected, make btrfs_run_dev_stats() do an initial check for device stat updates using RCU and quit without taking the device_list_mutex in case there are no new device stats that need to be persisted in the device tree. Also note that adding/removing devices requires starting a transaction, and since btrfs_run_dev_stats() is called from the critical section of a transaction commit, no one can be concurrently adding or removing a device while btrfs_run_dev_stats() is called. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2026-04-07 18:56:05 +02:00
Leo Martins	e0a85137a8	btrfs: avoid GFP_ATOMIC allocations in qgroup free paths When qgroups are enabled, __btrfs_qgroup_release_data() and qgroup_free_reserved_data() pass an extent_changeset to btrfs_clear_record_extent_bits() to track how many bytes had their EXTENT_QGROUP_RESERVED bits cleared. Inside the extent IO tree spinlock, add_extent_changeset() calls ulist_add() with GFP_ATOMIC to record each changed range. If this allocation fails, it hits a BUG_ON and panics the kernel. However, both of these callers only read changeset.bytes_changed afterwards — the range_changed ulist is populated and immediately freed without ever being iterated. The GFP_ATOMIC allocation is entirely unnecessary for these paths. Introduce extent_changeset_init_bytes_only() which uses a sentinel value (EXTENT_CHANGESET_BYTES_ONLY) on the ulist's prealloc field to signal that only bytes_changed should be tracked. add_extent_changeset() checks for this sentinel and returns early after updating bytes_changed, skipping the ulist_add() call entirely. This eliminates the GFP_ATOMIC allocation and makes the BUG_ON unreachable for these paths. Callers that need range tracking (qgroup_reserve_data, qgroup_unreserve_range, btrfs_qgroup_check_reserved_leak) continue to use extent_changeset_init() and are unaffected. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Leo Martins <loemra.dev@gmail.com> Signed-off-by: David Sterba <dsterba@suse.com>	2026-04-07 18:56:05 +02:00
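The sentinel approach in the qgroup commit above can be sketched with a small user-space model. Everything here is an assumption made for illustration: `struct changeset_model`, the sentinel value, and the fixed-size range array (standing in for the ulist and its GFP_ATOMIC allocation) are hypothetical and do not mirror the real btrfs structures.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define CHANGESET_BYTES_ONLY ((void *)(uintptr_t)1) /* hypothetical sentinel */
#define MAX_RANGES 8 /* stand-in for an allocation that can fail */

struct changeset_model {
	uint64_t bytes_changed;
	void *prealloc; /* carries the sentinel in this model */
	size_t nr_ranges;
	uint64_t range_start[MAX_RANGES];
};

static void changeset_init(struct changeset_model *c)
{
	c->bytes_changed = 0;
	c->prealloc = NULL;
	c->nr_ranges = 0;
}

static void changeset_init_bytes_only(struct changeset_model *c)
{
	changeset_init(c);
	c->prealloc = CHANGESET_BYTES_ONLY;
}

static int add_changeset(struct changeset_model *c, uint64_t start,
			 uint64_t len)
{
	c->bytes_changed += len;

	/* Bytes-only users return before anything needs to be allocated. */
	if (c->prealloc == CHANGESET_BYTES_ONLY)
		return 0;

	if (c->nr_ranges == MAX_RANGES)
		return -ENOMEM; /* models the allocation failing */
	c->range_start[c->nr_ranges++] = start;
	return 0;
}
```

Callers that only read `bytes_changed` pick `changeset_init_bytes_only()` and can never hit the failure branch. Callers that iterate the ranges keep using `changeset_init()` unchanged.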


1429872 Commits