Commit Graph

2991 Commits

Author SHA1 Message Date
Boris Burkov
975e63c7a8 btrfs: always drop root->inodes lock before cond_resched()
find_first_inode() and find_first_inode_to_shrink() lock root->inodes,
then loop over them, occasionally skipping some inodes. When they skip
an inode, they attempt to share the cpu/lock with cond_resched_lock().

However, that has a subtle problem associated with it.
cond_resched_lock() only drops the lock if it needs to actually call
schedule(). With CONFIG_PREEMPT_NONE, this means the full timeslice as
detected at ticks. With 8+ cpus and default tunables, this is 2.8ms. So
regardless of HZ, we will run for at least 2.8ms in this loop without
dropping the lock, assuming it finds no suitable inodes. If HZ is
small enough, it might be even worse as the tick granularity becomes
bigger than the timeslice.

The knock-on effect of this is that callers to
btrfs_del_inode_from_root() like kswapd trying to shrink the inode slab
or userspace threads calling evict() will spin on xa_lock(&root->inodes)
for 2.8ms, so the extent map shrinker dominates the lock even though
ostensibly it is intending to share it. This produces memory pressure as
there is only one kswapd and it runs sequentially so it can get stuck in
the inode slab shrinking.

To fix it, simply replace cond_resched_lock() with an open coded variant
which unconditionally does unlock/lock around cond_resched. Sharing the
lock is decoupled from sharing the CPU, and all the users of the lock
now share it fairly.

I was able to reproduce this on test systems by producing a lot of empty
files (to make a big root->inodes xarray), then producing memory
pressure by reading large files larger than ram, triggering kswapd and
the extent_map shrinker. The lock contention is visible with perf or
lockstat. This patch also relieved a user-apparent bottleneck on a
production system from the original report.

Tested-by: Rik van Riel <riel@surriel.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
2026-05-16 03:06:56 +02:00
Robbie Ko
c562ba61fc btrfs: fix incorrect i_size after remount caused by KEEP_SIZE prealloc gap
When fallocate() with FALLOC_FL_KEEP_SIZE preallocates an extent past the
current i_size, the file_extent_tree of the inode is updated to cover
that range. However, on the next mount, btrfs_read_locked_inode() only
re-populates file_extent_tree with [0, round_up(i_size, sectorsize)),
losing the marks that belonged to the KEEP_SIZE prealloc extent beyond
i_size.

Later, when a non-KEEP_SIZE fallocate() extends i_size into / past that
old prealloc extent, the reservation loop in btrfs_fallocate() skips
already-prealloc segments and does not call into the path that marks the
file_extent_tree, so a gap remains inside the file_extent_tree across
[old_aligned_i_size, start_of_new_alloc). Then __btrfs_prealloc_file_range()
calls btrfs_inode_safe_disk_i_size_write(), which uses
find_contiguous_extent_bit() starting at offset 0 to derive disk_i_size.
The walk stops at the gap, so disk_i_size ends up smaller than i_size and
gets persisted. After the next mount, the file shows the wrong (smaller)
size.

The following reproducer triggers the problem:

  $ cat test.sh
  MNT=/mnt/sdi
  DEV=/dev/sdi

  mkdir -p $MNT
  mkfs.btrfs -f -O ^no-holes $DEV
  mount $DEV $MNT

  touch $MNT/file1
  # KEEP_SIZE prealloc beyond i_size (i_size stays 0)
  fallocate -n -o 4M -l 4M $MNT/file1
  umount $MNT
  mount $DEV $MNT

  # non-KEEP_SIZE fallocate that overlaps the previous prealloc tail
  # and extends past it
  fallocate -o 7M -l 2M $MNT/file1
  ls -lh $MNT/file1
  umount $MNT
  mount $DEV $MNT
  ls -lh $MNT/file1
  umount $MNT

Running the reproducer gives the following result:

  $ ./test.sh
  (...)
  -rw-rw-r-- 1 root root 9.0M May  4 16:35 /mnt/sdi/file1
  -rw-rw-r-- 1 root root 7.0M May  4 16:35 /mnt/sdi/file1

The size before the second mount is correct (9M), but after the
remount it drops to 7M, i.e. the start of the gap inside file_extent_tree.

Fix this in __btrfs_prealloc_file_range() by marking the entire range
[round_down(old_i_size, sectorsize), round_up(new_i_size, sectorsize))
in file_extent_tree before updating i_size and calling
btrfs_inode_safe_disk_i_size_write(). This ensures the contiguous bit
search starting from 0 is not truncated by a stale gap left behind by a
previous KEEP_SIZE prealloc that was not restored on inode load.

The fix has no effect when the NO_HOLES feature is enabled because
btrfs_inode_safe_disk_i_size_write() and
btrfs_inode_set_file_extent_range()
both take the fast path that directly tracks disk_i_size without
consulting file_extent_tree.

Fixes: 9ddc959e80 ("btrfs: use the file extent tree infrastructure")
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Robbie Ko <robbieko@synology.com>
[ Minor updates to the change log ]
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2026-05-08 00:32:08 +02:00
Mark Harmstone
82323b1a70 btrfs: fix double-decrement of bytes_may_use in submit_one_async_extent()
submit_one_async_extent() calls btrfs_reserve_extent(), which decrements
bytes_may_use. If the call btrfs_create_io_em() fails, we jump to
out_free_reserve, which calls extent_clear_unlock_delalloc().

Because we're specifying EXTENT_DO_ACCOUNTING, i.e.
EXTENT_CLEAR_META_RESV | EXTENT_CLEAR_DATA_RESV, this decreases
bytes_may_use again. This can lead to problems later on, as an initial
write can fail only for the writeback to silently ENOSPC.

Fix this by replacing EXTENT_DO_ACCOUNTING with EXTENT_CLEAR_META_RESV.
This parallels a4fe134fc1 ("btrfs: fix a double release on reserved
extents in cow_one_range()"), which is the same fix in cow_one_range().

Fixes: 151a41bc46 ("Btrfs: fix what bits we clear when erroring out from delalloc")
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Mark Harmstone <mark@harmstone.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2026-04-21 04:03:08 +02:00
Filipe Manana
999757231c btrfs: fix missing last_unlink_trans update when removing a directory
When removing a directory we are not updating its last_unlink_trans field,
which can result in incorrect fsync behaviour in case some one fsyncs the
directory after it was removed because it's holding a file descriptor on
it.

Example scenario:

   mkdir /mnt/dir1
   mkdir /mnt/dir1/dir2
   mkdir /mnt/dir3

   sync -f /mnt

   # Do some change to the directory and fsync it.
   chmod 700 /mnt/dir1
   xfs_io -c fsync /mnt/dir1

   # Move dir2 out of dir1 so that dir1 becomes empty.
   mv /mnt/dir1/dir2 /mnt/dir3/

   open fd on /mnt/dir1
   call rmdir(2) on path "/mnt/dir1"
   fsync fd

   <trigger power failure>

When attempting to mount the filesystem, the log replay will fail with
an -EIO error and dmesg/syslog has the following:

   [445771.626482] BTRFS info (device dm-0): first mount of filesystem 0368bbea-6c5e-44b5-b409-09abe496e650
   [445771.626486] BTRFS info (device dm-0): using crc32c checksum algorithm
   [445771.627912] BTRFS info (device dm-0): start tree-log replay
   [445771.628335] page: refcount:2 mapcount:0 mapping:0000000061443ddc index:0x1d00 pfn:0x7072a5
   [445771.629453] memcg:ffff89f400351b00
   [445771.629892] aops:btree_aops [btrfs] ino:1
   [445771.630737] flags: 0x17fffc00000402a(uptodate|lru|private|writeback|node=0|zone=2|lastcpupid=0x1ffff)
   [445771.632359] raw: 017fffc00000402a fffff47284d950c8 fffff472907b7c08 ffff89f458e412b8
   [445771.633713] raw: 0000000000001d00 ffff89f6c51d1a90 00000002ffffffff ffff89f400351b00
   [445771.635029] page dumped because: eb page dump
   [445771.635825] BTRFS critical (device dm-0): corrupt leaf: root=5 block=30408704 slot=10 ino=258, invalid nlink: has 2 expect no more than 1 for dir
   [445771.638088] BTRFS info (device dm-0): leaf 30408704 gen 10 total ptrs 17 free space 14878 owner 5
   [445771.638091] BTRFS info (device dm-0): refs 4 lock_owner 0 current 3581087
   [445771.638094] 	item 0 key (256 INODE_ITEM 0) itemoff 16123 itemsize 160
   [445771.638097] 		inode generation 3 transid 9 size 16 nbytes 16384
   [445771.638098] 		block group 0 mode 40755 links 1 uid 0 gid 0
   [445771.638100] 		rdev 0 sequence 2 flags 0x0
   [445771.638102] 		atime 1775744884.0
   [445771.660056] 		ctime 1775744885.645502983
   [445771.660058] 		mtime 1775744885.645502983
   [445771.660060] 		otime 1775744884.0
   [445771.660062] 	item 1 key (256 INODE_REF 256) itemoff 16111 itemsize 12
   [445771.660064] 		index 0 name_len 2
   [445771.660066] 	item 2 key (256 DIR_ITEM 1843588421) itemoff 16077 itemsize 34
   [445771.660068] 		location key (259 1 0) type 2
   [445771.660070] 		transid 9 data_len 0 name_len 4
   [445771.660075] 	item 3 key (256 DIR_ITEM 2363071922) itemoff 16043 itemsize 34
   [445771.660076] 		location key (257 1 0) type 2
   [445771.660077] 		transid 9 data_len 0 name_len 4
   [445771.660078] 	item 4 key (256 DIR_INDEX 2) itemoff 16009 itemsize 34
   [445771.660079] 		location key (257 1 0) type 2
   [445771.660080] 		transid 9 data_len 0 name_len 4
   [445771.660081] 	item 5 key (256 DIR_INDEX 3) itemoff 15975 itemsize 34
   [445771.660082] 		location key (259 1 0) type 2
   [445771.660083] 		transid 9 data_len 0 name_len 4
   [445771.660084] 	item 6 key (257 INODE_ITEM 0) itemoff 15815 itemsize 160
   [445771.660086] 		inode generation 9 transid 9 size 8 nbytes 0
   [445771.660087] 		block group 0 mode 40777 links 1 uid 0 gid 0
   [445771.660088] 		rdev 0 sequence 2 flags 0x0
   [445771.660089] 		atime 1775744885.641174097
   [445771.660090] 		ctime 1775744885.645502983
   [445771.660091] 		mtime 1775744885.645502983
   [445771.660105] 		otime 1775744885.641174097
   [445771.660106] 	item 7 key (257 INODE_REF 256) itemoff 15801 itemsize 14
   [445771.660107] 		index 2 name_len 4
   [445771.660108] 	item 8 key (257 DIR_ITEM 2676584006) itemoff 15767 itemsize 34
   [445771.660109] 		location key (258 1 0) type 2
   [445771.660110] 		transid 9 data_len 0 name_len 4
   [445771.660111] 	item 9 key (257 DIR_INDEX 2) itemoff 15733 itemsize 34
   [445771.660112] 		location key (258 1 0) type 2
   [445771.660113] 		transid 9 data_len 0 name_len 4
   [445771.660114] 	item 10 key (258 INODE_ITEM 0) itemoff 15573 itemsize 160
   [445771.660115] 		inode generation 9 transid 10 size 0 nbytes 0
   [445771.660116] 		block group 0 mode 40755 links 2 uid 0 gid 0
   [445771.660117] 		rdev 0 sequence 0 flags 0x0
   [445771.660118] 		atime 1775744885.645502983
   [445771.660119] 		ctime 1775744885.645502983
   [445771.660120] 		mtime 1775744885.645502983
   [445771.660121] 		otime 1775744885.645502983
   [445771.660122] 	item 11 key (258 INODE_REF 257) itemoff 15559 itemsize 14
   [445771.660123] 		index 2 name_len 4
   [445771.660124] 	item 12 key (258 INODE_REF 259) itemoff 15545 itemsize 14
   [445771.660125] 		index 2 name_len 4
   [445771.660126] 	item 13 key (259 INODE_ITEM 0) itemoff 15385 itemsize 160
   [445771.660127] 		inode generation 9 transid 10 size 8 nbytes 0
   [445771.660128] 		block group 0 mode 40755 links 1 uid 0 gid 0
   [445771.660129] 		rdev 0 sequence 1 flags 0x0
   [445771.660130] 		atime 1775744885.645502983
   [445771.660130] 		ctime 1775744885.645502983
   [445771.660131] 		mtime 1775744885.645502983
   [445771.660132] 		otime 1775744885.645502983
   [445771.660133] 	item 14 key (259 INODE_REF 256) itemoff 15371 itemsize 14
   [445771.660134] 		index 3 name_len 4
   [445771.660135] 	item 15 key (259 DIR_ITEM 2676584006) itemoff 15337 itemsize 34
   [445771.660136] 		location key (258 1 0) type 2
   [445771.660137] 		transid 10 data_len 0 name_len 4
   [445771.660138] 	item 16 key (259 DIR_INDEX 2) itemoff 15303 itemsize 34
   [445771.660139] 		location key (258 1 0) type 2
   [445771.660140] 		transid 10 data_len 0 name_len 4
   [445771.660144] BTRFS error (device dm-0): block=30408704 write time tree block corruption detected
   [445771.661650] ------------[ cut here ]------------
   [445771.662358] WARNING: fs/btrfs/disk-io.c:326 at btree_csum_one_bio+0x217/0x230 [btrfs], CPU#8: mount/3581087
   [445771.663588] Modules linked in: btrfs f2fs xfs (...)
   [445771.671229] CPU: 8 UID: 0 PID: 3581087 Comm: mount Tainted: G        W           7.0.0-rc6-btrfs-next-230+ #2 PREEMPT(full)
   [445771.672575] Tainted: [W]=WARN
   [445771.672987] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.2-0-gea1b7a073390-prebuilt.qemu.org 04/01/2014
   [445771.674460] RIP: 0010:btree_csum_one_bio+0x217/0x230 [btrfs]
   [445771.675222] Code: 89 44 24 (...)
   [445771.677364] RSP: 0018:ffffd23882247660 EFLAGS: 00010246
   [445771.678029] RAX: 0000000000000000 RBX: ffff89f6c51d1a90 RCX: 0000000000000000
   [445771.678975] RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff89f406020000
   [445771.679983] RBP: ffff89f821204000 R08: 0000000000000000 R09: 00000000ffefffff
   [445771.680905] R10: ffffd23882247448 R11: 0000000000000003 R12: ffffd23882247668
   [445771.681978] R13: ffff89f458e40fc0 R14: ffff89f737f4f500 R15: ffff89f737f4f500
   [445771.682912] FS:  00007f0447a98840(0000) GS:ffff89fb9771d000(0000) knlGS:0000000000000000
   [445771.684393] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
   [445771.685230] CR2: 00007f0447bf1330 CR3: 000000017cb02002 CR4: 0000000000370ef0
   [445771.686273] Call Trace:
   [445771.686646]  <TASK>
   [445771.686969]  btrfs_submit_bbio+0x83f/0x860 [btrfs]
   [445771.687750]  ? write_one_eb+0x28f/0x340 [btrfs]
   [445771.688428]  btree_writepages+0x2e3/0x550 [btrfs]
   [445771.689180]  ? kmem_cache_alloc_noprof+0x12a/0x490
   [445771.689963]  ? alloc_extent_state+0x19/0x120 [btrfs]
   [445771.690801]  ? kmem_cache_free+0x135/0x380
   [445771.691328]  ? preempt_count_add+0x69/0xa0
   [445771.691831]  ? set_extent_bit+0x252/0x8e0 [btrfs]
   [445771.692468]  ? xas_load+0x9/0xc0
   [445771.692873]  ? xas_find+0x14d/0x1a0
   [445771.693304]  do_writepages+0xc6/0x160
   [445771.693756]  filemap_writeback+0xb8/0xe0
   [445771.694274]  btrfs_write_marked_extents+0x61/0x170 [btrfs]
   [445771.694999]  btrfs_write_and_wait_transaction+0x4e/0xc0 [btrfs]
   [445771.695818]  btrfs_commit_transaction+0x5c8/0xd10 [btrfs]
   [445771.696530]  ? kmem_cache_free+0x135/0x380
   [445771.697120]  ? release_extent_buffer+0x34/0x160 [btrfs]
   [445771.697786]  btrfs_recover_log_trees+0x7be/0x7e0 [btrfs]
   [445771.698525]  ? __pfx_replay_one_buffer+0x10/0x10 [btrfs]
   [445771.699206]  open_ctree+0x11e5/0x1810 [btrfs]
   [445771.699776]  btrfs_get_tree.cold+0xb/0x162 [btrfs]
   [445771.700463]  ? fscontext_read+0x165/0x180
   [445771.701146]  ? rw_verify_area+0x50/0x180
   [445771.701866]  vfs_get_tree+0x25/0xd0
   [445771.702491]  vfs_cmd_create+0x59/0xe0
   [445771.703125]  __do_sys_fsconfig+0x303/0x610
   [445771.703603]  do_syscall_64+0xe9/0xf20
   [445771.703974]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
   [445771.704700] RIP: 0033:0x7f0447cbd4aa
   [445771.705108] Code: 73 01 c3 (...)
   [445771.707263] RSP: 002b:00007ffc4e528318 EFLAGS: 00000246 ORIG_RAX: 00000000000001af
   [445771.708107] RAX: ffffffffffffffda RBX: 00005561585d8c20 RCX: 00007f0447cbd4aa
   [445771.708931] RDX: 0000000000000000 RSI: 0000000000000006 RDI: 0000000000000003
   [445771.709744] RBP: 00005561585d9120 R08: 0000000000000000 R09: 0000000000000000
   [445771.710674] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
   [445771.711477] R13: 00007f0447e4f580 R14: 00007f0447e5126c R15: 00007f0447e36a23
   [445771.712277]  </TASK>
   [445771.712541] ---[ end trace 0000000000000000 ]---
   [445771.713382] BTRFS error (device dm-0): error while writing out transaction: -5
   [445771.714679] BTRFS warning (device dm-0): Skipping commit of aborted transaction.
   [445771.715562] BTRFS error (device dm-0 state A): Transaction aborted (error -5)
   [445771.716459] BTRFS: error (device dm-0 state A) in cleanup_transaction:2068: errno=-5 IO failure
   [445771.717936] BTRFS error (device dm-0 state EA): failed to recover log trees with error: -5
   [445771.719681] BTRFS error (device dm-0 state EA): open_ctree failed: -5

The problem is that such a fsync should have result in a fallback to a
transaction commit, but that did not happen because through the
btrfs_rmdir() we never update the directory's last_unlink_trans field.
Any inode that had a link removed must have its last_unlink_trans updated
to the ID of transaction used for the operation, otherwise fsync and log
replay will not work correctly.

btrfs_rmdir() calls btrfs_unlink_inode() and through that call chain we
never call btrfs_record_unlink_dir() in order to update last_unlink_trans.
However btrfs_unlink(), which is used for unlinking regular files, calls
btrfs_record_unlink_dir() and then calls btrfs_unlink_inode(). So fix
this by moving the call to btrfs_record_unlink_dir() from btrfs_unlink()
to btrfs_unlink_inode().

A test case for fstests will follow soon.

Reported-by: Slava0135 <slava.kovalevskiy.2014@gmail.com>
Link: https://lore.kernel.org/linux-btrfs/CAAJYhww5ov62Hm+n+tmhcL-e_4cBobg+OWogKjOJxVUXivC=MQ@mail.gmail.com/
CC: stable@vger.kernel.org
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2026-04-21 04:01:48 +02:00
Dave Chen
e0dfaebb8f btrfs: skip clearing EXTENT_DEFRAG for NOCOW ordered extents
In btrfs_finish_one_ordered(), clear_bits is unconditionally initialized
with EXTENT_DEFRAG.  For NOCOW ordered extents this is always a no-op
because should_nocow() already forces the COW path when EXTENT_DEFRAG is
set, so a NOCOW ordered extent can never have EXTENT_DEFRAG on its range.

Although harmless, the unconditional btrfs_clear_extent_bit() call still
performs a cold rbtree lookup under the io tree spinlock on every NOCOW
write completion.  Avoid this by only adding EXTENT_DEFRAG to clear_bits
for non-NOCOW ordered extents, and skip the call entirely when there are
no bits to clear.

Signed-off-by: Dave Chen <davechen@synology.com>
Signed-off-by: Robbie Ko <robbieko@synology.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2026-04-07 19:43:22 +02:00
Filipe Manana
7801f3ea95 btrfs: tag as unlikely if statements that check for fs in error state
Having the filesystem in an error state, meaning we had a transaction
abort, is unexpected. Mark every check for the error state with the
unlikely annotation to convey that and to allow the compiler to generate
better code.

On x86_64, using gcc 14.2.0-19 from Debian, resulted in a slightly
reduced object size and better code.

Before:

  $ size fs/btrfs/btrfs.ko
     text	   data	    bss	    dec	    hex	filename
  2008598	 175912	  15592	2200102	 219226	fs/btrfs/btrfs.ko

After:

  $ size fs/btrfs/btrfs.ko
     text	   data	    bss	    dec	    hex	filename
  2008450	 175912	  15592	2199954	 219192	fs/btrfs/btrfs.ko

Reviewed-by: Anand Jain <asj@kernel.org>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2026-04-07 19:41:42 +02:00
JP Kobryn (Meta)
7ae37b2c94 btrfs: prevent direct reclaim during compressed readahead
Under memory pressure, direct reclaim can kick in during compressed
readahead. This puts the associated task into D-state. Then shrink_lruvec()
disables interrupts when acquiring the LRU lock. Under heavy pressure,
we've observed reclaim can run long enough that the CPU becomes prone to
CSD lock stalls since it cannot service incoming IPIs. Although the CSD
lock stalls are the worst case scenario, we have found many more subtle
occurrences of this latency on the order of seconds, over a minute in some
cases.

Prevent direct reclaim during compressed readahead. This is achieved by
using different GFP flags at key points when the bio is marked for
readahead.

There are two functions that allocate during compressed readahead:
btrfs_alloc_compr_folio() and add_ra_bio_pages(). Both currently use
GFP_NOFS which includes __GFP_DIRECT_RECLAIM.

For the internal API call btrfs_alloc_compr_folio(), the signature changes
to accept an additional gfp_t parameter. At the readahead call site, it
gets flags similar to GFP_NOFS but stripped of __GFP_DIRECT_RECLAIM.
__GFP_NOWARN is added since these allocations are allowed to fail. Demand
reads still use full GFP_NOFS and will enter reclaim if needed. All other
existing call sites of btrfs_alloc_compr_folio() now explicitly pass
GFP_NOFS to retain their current behavior.

add_ra_bio_pages() gains a bool parameter which allows callers to specify
if they want to allow direct reclaim or not. In either case, the
__GFP_NOWARN flag was added unconditionally since the allocations are
speculative.

There has been some previous work done on calling add_ra_bio_pages() [0].
This patch is complementary: where that patch reduces call frequency, this
patch reduces the latency associated with those calls.

[0] https://lore.kernel.org/linux-btrfs/656838ec1232314a2657716e59f4f15a8eadba64.1751492111.git.boris@bur.io/

Reviewed-by: Mark Harmstone <mark@harmstone.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: JP Kobryn (Meta) <jp.kobryn@linux.dev>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2026-04-07 18:56:08 +02:00
Dave Chen
0e6a169c64 btrfs: fix unnecessary flush on close when truncating zero-sized files
In btrfs_setsize(), when a file is truncated to size 0, the
BTRFS_INODE_FLUSH_ON_CLOSE flag is unconditionally set to ensure
pending writes get flushed on close. This flag was designed to protect
the "truncate-then-rewrite" pattern, where an application truncates a
file with existing data down to zero and writes new content, ensuring
the new data reach disk on close.

However, when a file already has a size of 0 (e.g. a newly created
file opened with O_CREAT | O_TRUNC), oldsize and newsize are both 0.
In this case, setting BTRFS_INODE_FLUSH_ON_CLOSE is unnecessary because
no "good data" was truncated away. The subsequent filemap_flush() in
btrfs_release_file() then triggers avoidable writeback that disrupts
the normal delayed writeback batching, adding I/O overhead.

This comes from a real workload. A backup service creates temporary
files via mkstemp(), closes them, and later reopens them with O_TRUNC
for writing. The O_TRUNC is defensive.  The file creation and usage is
done by a different component, so removing the unneeded truncation is
not straightforward.  This pattern repeats for a large number of files
each close() triggers an unnecessary filemap_flush().

Signed-off-by: Dave Chen <davechen@synology.com>
Signed-off-by: Robbie Ko <robbieko@synology.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2026-04-07 18:56:06 +02:00
Philipp Hahn
e526779648 btrfs: prefer IS_ERR_OR_NULL() over manual NULL check
Prefer using IS_ERR_OR_NULL() over using IS_ERR() and a manual NULL
check.

IS_ERR_OR_NULL() already uses likely(!ptr) internally. checkpatch does
not like nesting it:
> WARNING: nested (un)?likely() calls, IS_ERR_OR_NULL already uses
> unlikely() internally
Remove the explicit use of likely().

Change generated with coccinelle.

Signed-off-by: Philipp Hahn <phahn-oss@avm.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2026-04-07 18:56:02 +02:00
Qu Wenruo
3eaf5f082c btrfs: extract inlined creation into a dedicated delalloc helper
Currently we call cow_file_range_inline() in different situations, from
regular cow_file_range() to compress_file_range().

This is because inline extent creation has different conditions based on
whether it's a compressed one or not.

But on the other hand, inline extent creation shouldn't be so
distributed, we can just have a dedicated branch in
btrfs_run_delalloc_range().

It will become more obvious for compressed inline cases, it makes no
sense to go through all the complex async extent mechanism just to
inline a single block.

So here we introduce a dedicated run_delalloc_inline() helper, and
remove all inline related handling from cow_file_range() and
compress_file_range().

There is a special update to inode_need_compress(), that a new
@check_inline parameter is introduced.
This is to allow inline specific checks to be done inside
run_delalloc_inline(), which allows single block compression, but
other call sites should always reject single block compression.

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2026-04-07 18:56:01 +02:00
Qu Wenruo
cab4c8b594 btrfs: extract the max compression chunk size into a macro
We have two locations using open-coded 512K size, as the async chunk
size.

For compression we have not only the max size a compressed extent can
represent (128K), but also how large an async chunk can be (512K).

Although we have a macro for the maximum compressed extent size, we do
not have any macro for the async chunk size.

Add such a macro and replace the two open-coded SZ_512K.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2026-04-07 18:56:00 +02:00
Qu Wenruo
6603a98598 btrfs: do compressed bio size roundup and zeroing in one go
Currently we zero out all the remaining bytes of the last folio of
the compressed bio, then round the bio size to fs block boundary.

But that is done in two different functions, zero_last_folio() to zero
the remaining bytes of the last folio, and round_up_last_block() to
round up the bio to fs block boundary.

There are some minor problems:

- zero_last_folio() is zeroing ranges we won't submit
  This is mostly affecting block size < page size cases, where we can
  have a large folio (e.g. 64K), but the fs block size is only 4K.

  In that case, we may only want to submit the first 4K of the folio,
  the remaining range won't matter, but we still zero them all.

  This causes unnecessary CPU usage just to zero out some bytes we won't
  utilized.

- compressed_bio_last_folio() is called twice in two different functions
  Which in theory we only need to call it once.

Enhance the situation by:

- Only zero out bytes up to the fs block boundary
  Thus this will reduce some overhead for bs < ps cases.

- Move the folio_zero_range() call into round_up_last_block()
  So that we can reuse the same folio returned by
  compressed_bio_last_folio().

Reviewed-by: Anand Jain <asj@kernel.org>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2026-04-07 18:55:59 +02:00
Chen Guan Jie
13816fd5aa btrfs: check snapshot_force_cow earlier in can_nocow_file_extent()
When a snapshot is being created, the atomic counter snapshot_force_cow
is incremented to force incoming writes to fallback to COW. This is a
critical mechanism to protect the consistency of the snapshot being taken.

Currently, can_nocow_file_extent() checks this counter only after
performing several checks, most notably the expensive cross-reference
check via btrfs_cross_ref_exist(). btrfs_cross_ref_exist() releases the
path and performs a search in the extent tree or backref cache, which
involves btree traversals and locking overhead.

Moves the snapshot_force_cow check to the very beginning of
can_nocow_file_extent().

This reordering is safe and beneficial because:

1. args->writeback_path is invariant for the duration of the call (set
   by caller run_delalloc_nocow).

2. is_freespace_inode is a static property of the inode.

3. The state of snapshot_force_cow is driven by the btrfs_mksnapshot()
   process. Checking it earlier does not change the outcome of the NOCOW
   decision, but effectively prunes the expensive code path when a
   fallback to COW is inevitable.

By failing fast when a snapshot is pending, we avoid the unnecessary
overhead of btrfs_cross_ref_exist() and other extent item checks in the
scenario where NOCOW is already known to be impossible.

Signed-off-by: Chen Guan Jie <jk.chen1095@gmail.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2026-04-07 18:55:58 +02:00
Qu Wenruo
2e0e3716c7 btrfs: do not mark inode incompressible after inline attempt fails
[BUG]
The following sequence will set the file with nocompress flag:

  # mkfs.btrfs -f $dev
  # mount $dev $mnt -o max_inline=4,compress
  # xfs_io -f -c "pwrite 0 2k" -c sync $mnt/foobar

The inode will have NOCOMPRESS flag, even if the content itself (all 0xcd)
can still be compressed very well:

	item 4 key (257 INODE_ITEM 0) itemoff 15879 itemsize 160
		generation 9 transid 10 size 2097152 nbytes 1052672
		block group 0 mode 100600 links 1 uid 0 gid 0 rdev 0
		sequence 257 flags 0x8(NOCOMPRESS)

Please note that, this behavior is there even before commit 59615e2c1f
("btrfs: reject single block sized compression early").

[CAUSE]
At compress_file_range(), after btrfs_compress_folios() call, we try
making an inlined extent by calling cow_file_range_inline().

But cow_file_range_inline() calls can_cow_file_range_inline() which has
more accurate checks on if the range can be inlined.

One of the user configurable conditions is the "max_inline=" mount
option. If that value is set low (like the example, 4 bytes, which
cannot store any header), or the compressed content is just slightly
larger than 2K (the default value, meaning a 50% compression ratio),
cow_file_range_inline() will return 1 immediately.

And since we're here only to try inline the compressed data, the range
is no larger than a single fs block.

Thus compression is never going to make it a win, we fall back to
marking the inode incompressible unavoidably.

[FIX]
Just add an extra check after inline attempt, so that if the inline
attempt failed, do not set the nocompress flag.

As there is no way to remove that flag, and the default 50% compression
ratio is way too strict for the whole inode.

CC: stable@vger.kernel.org # 6.12+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2026-04-07 18:55:57 +02:00
Qu Wenruo
a11d6912fd btrfs: remove folio parameter from ordered io related functions
Both functions btrfs_finish_ordered_extent() and
btrfs_mark_ordered_io_finished() are accepting an optional folio
parameter.

That @folio is passed into can_finish_ordered_extent(), which later will
test and clear the ordered flag for the involved range.

However I do not think there is any other call site that can clear
ordered flags of an page cache folio and can affect
can_finish_ordered_extent().

There are limited *_clear_ordered() callers out of
can_finish_ordered_extent() function:

- btrfs_migrate_folio()
  This is completely unrelated, it's just migrating the ordered flag to
  the new folio.

- btrfs_cleanup_ordered_extents()
  We manually clean the ordered flags of all involved folios, then call
  btrfs_mark_ordered_io_finished() without a @folio parameter.
  So it doesn't need and didn't pass a @folio parameter in the first
  place.

- btrfs_writepage_fixup_worker()
  This function is going to be removed soon, and we should not hit that
  function anymore.

- btrfs_invalidate_folio()
  This is the real call site we need to bother with.

  If we already have a bio running, btrfs_finish_ordered_extent() in
  end_bbio_data_write() will be executed first, as
  btrfs_invalidate_folio() will wait for the writeback to finish.

  Thus if there is a running bio, it will not see the range has
  ordered flags, and just skip to the next range.

  If there is no bio running, meaning the ordered extent is created but
  the folio is not yet submitted.

  In that case btrfs_invalidate_folio() will manually clear the folio
  ordered range, but then manually finish the ordered extent with
  btrfs_dec_test_ordered_pending() without bothering the folio ordered
  flags.

  Meaning if the OE range with folio ordered flags will be finished
  manually without the need to call can_finish_ordered_extent().

This means all can_finish_ordered_extent() call sites should get a range
that has folio ordered flag set, thus the old "return false" branch
should never be triggered.

Now we can:

- Remove the @folio parameter from involved functions
  * btrfs_mark_ordered_io_finished()
  * btrfs_finish_ordered_extent()

  For call sites passing a @folio into those functions, let them
  manually clear the ordered flag of involved folios.

- Move btrfs_finish_ordered_extent() out of the loop in
  end_bbio_data_write()

  We only need to call btrfs_finish_ordered_extent() once per bbio,
  not per folio.

- Add an ASSERT() to make sure all folio ranges have ordered flags
  It's only for end_bbio_data_write().

  And we already have enough safe nets to catch over-accounting of ordered
  extents.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2026-04-07 18:55:57 +02:00
Qu Wenruo
afe60cdb3c btrfs: remove the btrfs_inode parameter from btrfs_remove_ordered_extent()
We already have btrfs_ordered_extent::inode, thus there is no need to
pass a btrfs_inode parameter to btrfs_remove_ordered_extent().

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2026-04-07 18:55:57 +02:00
Filipe Manana
6fa9729568 btrfs: pass literal booleans to functions that take boolean arguments
We have several functions with parameters defined as booleans but then we
have callers passing integers, 0 or 1, instead of false and true. While
this isn't a bug since 0 and 1 are converted to false and true, it is odd
and less readable. Change the callers to pass true and false literals
instead.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2026-04-07 18:55:56 +02:00
Qu Wenruo
b943097758 btrfs: rename btrfs_csum_file_blocks() to btrfs_insert_data_csums()
The function btrfs_csum_file_blocks() is a little confusing, unlike
btrfs_csum_one_bio(), it is not calculating the checksum of some file
blocks.

Instead it's just inserting the already calculated checksums into a given
root (can be a csum root or a log tree).

So rename it to btrfs_insert_data_csums() to reflect its behavior better.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2026-04-07 18:55:56 +02:00
Qu Wenruo
00c865b60b btrfs: make add_pending_csums() to take an ordered extent as parameter
The structure btrfs_ordered_extent has a lot of list heads for different
purposes, passing a random list_head pointer is never a good idea as if
the wrong list is passed in, the type casting along with the fs will be
screwed up.

Instead pass the btrfs_ordered_extent pointer, and grab the csum_list
inside add_pending_csums() to make it a little safer.

Since we're here, also update the comments to follow the current style.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2026-04-07 18:55:56 +02:00
Qu Wenruo
09664971b3 btrfs: rename btrfs_ordered_extent::list to csum_list
That list head records all pending checksums for that ordered
extent. And unlike other lists, we just use the name "list", which can
be very confusing for readers.

Rename it to "csum_list" which follows the remaining lists, showing the
purpose of the list.

And since we're here, remove a comment inside
btrfs_finish_ordered_zoned() where we have
"ASSERT(!list_empty(&ordered->csum_list))" to make sure the OE has
pending csums.

That comment is only here to make sure we do not call list_first_entry()
before checking BTRFS_ORDERED_PREALLOC.
But since we already have that bit checked and even have a dedicated
ASSERT(), there is no need for that comment anymore.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2026-04-07 18:55:56 +02:00
Qu Wenruo
883adb6dcf btrfs: fix the inline compressed extent check in inode_need_compress()
[BUG]
Since commit 59615e2c1f ("btrfs: reject single block sized compression
early"), the following script will result the inode to have NOCOMPRESS
flag, meanwhile old kernels don't:

  # mkfs.btrfs -f $dev
  # mount $dev $mnt -o max_inline=2k,compress=zstd
  # truncate -s 8k $mnt/foobar
  # xfs_io -f -c "pwrite 0 2k" $mnt/foobar
  # sync

Before that commit, the inode will not have NOCOMPRESS flag:

	item 4 key (257 INODE_ITEM 0) itemoff 15879 itemsize 160
		generation 9 transid 9 size 8192 nbytes 4096
		block group 0 mode 100644 links 1 uid 0 gid 0 rdev 0
		sequence 3 flags 0x0(none)

But after that commit, the inode will have NOCOMPRESS flag:

	item 4 key (257 INODE_ITEM 0) itemoff 15879 itemsize 160
		generation 9 transid 10 size 8192 nbytes 4096
		block group 0 mode 100644 links 1 uid 0 gid 0 rdev 0
		sequence 3 flags 0x8(NOCOMPRESS)

This will make a lot of files no longer to be compressed.

[CAUSE]
The old compressed inline check looks like this:

	if (total_compressed <= blocksize &&
	   (start > 0 || end + 1 < inode->disk_i_size))
		goto cleanup_and_bail_uncompressed;

That inline part check is equal to "!(start == 0 && end + 1 >=
inode->disk_i_size)", but the new check no longer has that disk_i_size
check.

Thus it means any single block sized write at file offset 0 will pass
the inline check, which is wrong.

Furthermore, since we have merged the old check into
inode_need_compress(), there is no disk_i_size based inline check
anymore, we will always try compressing that single block at file offset
0, then later find out it's not a net win and go to the
mark_incompressible tag.

This results the inode to have NOCOMPRESS flag.

[FIX]
Add back the missing disk_i_size based check into inode_need_compress().

Now the same script will no longer cause NOCOMPRESS flag.

Fixes: 59615e2c1f ("btrfs: reject single block sized compression early")
Reported-by: Chris Mason <clm@meta.com>
Link: https://lore.kernel.org/linux-btrfs/20260208183840.975975-1-clm@meta.com/
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2026-04-07 18:55:54 +02:00
Linus Torvalds
8991448e56 for-7.0-rc4-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmm+kPQACgkQxWXV+ddt
 WDt5lw/+P36nlsFO1XoEuMCtE4nxibGpejg1h8OA9Huv3GtC2x7G2yjkaHqXEw32
 A98BoEYI1nvieOHfhcD3685384xlH/dcItdxDIJJmbDWFM2n3H1ayXLDcsYQRg5Q
 3oExe87r+MfYzYCzpV1xePz/0OAcwdn+KGav6ASs/PPhVfdN9kjgZwVsCfQAIGuk
 cASPAx9SDUXjmD0f9OtBZqtOQt5eEF4Xvv3qvd/7/N5SpFyUMe3AeYE+2ttjrTUt
 sw0KE0XTLMJuUVZY1dUyUSpOIADcdoHBcpkPCCwh9JnK7OIcx+vM0VAbxjFsgKFi
 0kBRS1YdOeww6pQ88SCSLHk5xKbCsW1zGfF9lMKT7kUrLIFG01ddefmRXf4qQAni
 w1cowxp2LrcdXgL8AMsAQ4DJVMy1wJ1IFThaL8ZBBdX2CphViYt0gChZwTpchMhz
 GAtqcKSURGOADCvdAgkwERuaSSesdvjfsJ3IBF3ZjLSGTN8Wasj8FV7zOyjbOoEe
 4SZa36X+MEkIQ4Nn3MoEHvK2wuox/rDxTWp96NADSUVvRBCqijVPRZFapw6H9rlO
 zQorO1CAMIqxMC4dYxUDxMl8j2P/VwoaIsib6pVFidRzubI6vxk1dcZn82HKkh8a
 ghIm8XlfsQNTZDS+ominlbxPCsyPIInS5tfgC69FcV4ktrnZz3k=
 =MYi5
 -----END PGP SIGNATURE-----

Merge tag 'for-7.0-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:
 "Another batch of fixes for problems that have been identified by tools
  analyzing code or by fuzzing. Most of them are short, two patches fix
  the same thing in many places so the diffs are bigger.

   - handle potential NULL pointer errors after attempting to read
     extent and checksum trees

   - prevent ENOSPC when creating many qgroups by ioctls in the same
     transaction

   - encoded write ioctl fixes (with 64K page and 4K block size):
       - fix unexpected bio length
       - do not let compressed bios and pages interfere with page cache

   - compression fixes on setups with 64K page and 4K block size: fix
     folio length assertions (zstd and lzo)

   - remap tree fixes:
       - make sure to hold block group reference while moving it
       - handle early exit when moving block group to unused list

   - handle deleted subvolumes with inconsistent state of deletion
     progress"

* tag 'for-7.0-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: reject root items with drop_progress and zero drop_level
  btrfs: check block group before marking it unused in balance_remap_chunks()
  btrfs: hold block group reference during entire move_existing_remap()
  btrfs: fix an incorrect ASSERT() condition inside lzo_decompress_bio()
  btrfs: fix an incorrect ASSERT() condition inside zstd_decompress_bio()
  btrfs: do not touch page cache for encoded writes
  btrfs: fix a bug that makes encoded write bio larger than expected
  btrfs: reserve enough transaction items for qgroup ioctls
  btrfs: check for NULL root after calls to btrfs_csum_root()
  btrfs: check for NULL root after calls to btrfs_extent_root()
2026-03-21 08:42:17 -07:00
Qu Wenruo
65ee606138 btrfs: fix a bug that makes encoded write bio larger than expected
[BUG]
When running btrfs/284 with 64K page size and 4K fs block size, the
following ASSERT() can be triggered:

  assertion failed: cb->bbio.bio.bi_iter.bi_size == disk_num_bytes :: 0, in inode.c:9991
  ------------[ cut here ]------------
  kernel BUG at inode.c:9991!
  Internal error: Oops - BUG: 00000000f2000800 [#1]  SMP
  CPU: 5 UID: 0 PID: 6787 Comm: btrfs Tainted: G           OE       6.19.0-rc8-custom+ #1 PREEMPT(voluntary)
  Hardware name: QEMU KVM Virtual Machine, BIOS unknown 2/2/2022
  pc : btrfs_do_encoded_write+0x9b0/0x9c0 [btrfs]
  lr : btrfs_do_encoded_write+0x9b0/0x9c0 [btrfs]
  Call trace:
   btrfs_do_encoded_write+0x9b0/0x9c0 [btrfs] (P)
   btrfs_do_write_iter+0x1d8/0x208 [btrfs]
   btrfs_ioctl_encoded_write+0x3c8/0x6d0 [btrfs]
   btrfs_ioctl+0xeb0/0x2b60 [btrfs]
   __arm64_sys_ioctl+0xac/0x110
   invoke_syscall.constprop.0+0x64/0xe8
   el0_svc_common.constprop.0+0x40/0xe8
   do_el0_svc+0x24/0x38
   el0_svc+0x3c/0x1b8
   el0t_64_sync_handler+0xa0/0xe8
   el0t_64_sync+0x1a4/0x1a8
  Code: 91180021 90001080 9111a000 94039d54 (d4210000)
  ---[ end trace 0000000000000000 ]---

[CAUSE]
After commit e1bc83f8b1 ("btrfs: get rid of compressed_folios[] usage
for encoded writes"), the encoded write is changed to copy the content
from the iov into a folio, and queue the folio into the compressed bio.

However we always queue the full folio into the compressed bio, which
can make the compressed bio larger than the on-disk extent, if the folio
size is larger than the fs block size.

Although we have an ASSERT() to catch such problem, for kernels without
CONFIG_BTRFS_ASSERT, such larger than expected bio will just be
submitted, possibly overwrite the next data extent, causing data
corruption.

[FIX]
Instead of blindly queuing the full folio into the compressed bio, only
queue the rounded up range, which is the old behavior before that
offending commit.
This also means we no longer need to zero the tailing range until the
folio end (but still to the block boundary), as such range will not be
submitted anyway.

And since we're here, add a final ASSERT() into
btrfs_submit_compressed_write() as the last safety net for kernels with
btrfs assertions enabled

Fixes: e1bc83f8b1 ("btrfs: get rid of compressed_folios[] usage for encoded writes")
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2026-03-17 11:43:07 +01:00
Filipe Manana
2b4cb4e58f btrfs: check for NULL root after calls to btrfs_csum_root()
btrfs_csum_root() can return a NULL pointer in case the root we are
looking for is not in the rb tree that tracks roots. So add checks to
every caller that is missing such check to log a message and return
an error.

Reported-by: Chris Mason <clm@meta.com>
Link: https://lore.kernel.org/linux-btrfs/20260208161657.3972997-1-clm@meta.com/
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2026-03-17 11:29:32 +01:00
Linus Torvalds
e0b38d286e for-7.0-rc3-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmmytygACgkQxWXV+ddt
 WDuiKw/+IyoS2IE95Jd4G2hezcsaqJNAvv5q90vVyLepT/7jzRbYrHytk7z/0v56
 +Pjc1JHgp+TidYKZa52E/R09eEvYCyuZvEq4bnClOnO3CAJDCixqTKWV70CcYiHX
 HoSCuPML2CJMLZY3u6MxATgk46y1ey3UkbmQ4oufoUHrjAE+D3pDQsYQ9BWR/P6J
 4LbyZ214uIUPvbp0wPJ14cVUMpNxs116JmWvK9dxWQNZSzFcq2IVuLwQUcZBPAdb
 dursd0kt6HXXmNCITmUD8O+ipG4U1akn8FuCjaLqo4LF1AH2f6OzNjucsisfa1tE
 4MD+ATzmNsew5q6dtyfcvv8Btl+olbTP4KGibLl9PCCM9vlkzl3EN5GPUllGi6S9
 T8jqe2pILXZHEx1DIQjHaXJsnuHG9aEkUbkSHUxIa15QDj66omrZZkZG3EF5Buy9
 TogJJgESYU19dt/y110Q/vD/LPOgJhB3NBXwIx1FDYA1OwaAjt6hUcVRVRI5PtpL
 moAIG4nNrPz5Pa875NvguFmrXFIudpTqyANHCPVio4eqDR7LeSg9vVHj9zeOfeRD
 gmn3WGuGNb6L+86TNet/nFQhjxhkKqsnSbzO2zoPQKhG6FS+HSLnUwvJYL4/YnkW
 QEeveeI/6Duiei4DHuw7FXmZVNQxeBEWW6MTFDHA1UNV3HMKewA=
 =TgSD
 -----END PGP SIGNATURE-----

Merge tag 'for-7.0-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:

 - detect possible file name hash collision earlier so it does not lead
   to transaction abort

 - handle b-tree leaf overflows when snapshotting a subvolume with set
   received UUID, leading to transaction abort

 - in zoned mode, reorder relocation block group initialization after
   the transaction kthread start

 - fix orphan cleanup state tracking of subvolume, this could lead to
   invalid dentries under some conditions

 - add locking around updates of dynamic reclain state update

 - in subpage mode, add missing RCU unlock when trying to releae extent
   buffer

 - remap tree fixes:
     - add missing description strings for the newly added remap tree
     - properly update search key when iterating backrefs

* tag 'for-7.0-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: remove duplicated definition of btrfs_printk_in_rcu()
  btrfs: remove unnecessary transaction abort in the received subvol ioctl
  btrfs: abort transaction on failure to update root in the received subvol ioctl
  btrfs: fix transaction abort on set received ioctl due to item overflow
  btrfs: fix transaction abort when snapshotting received subvolumes
  btrfs: fix transaction abort on file creation due to name hash collision
  btrfs: read key again after incrementing slot in move_existing_remaps()
  btrfs: add missing RCU unlock in error path in try_release_subpage_extent_buffer()
  btrfs: set BTRFS_ROOT_ORPHAN_CLEANUP during subvol create
  btrfs: zoned: move btrfs_zoned_reserve_data_reloc_bg() after kthread start
  btrfs: hold space_info->lock when clearing periodic reclaim ready
  btrfs: print-tree: add remap tree definitions
2026-03-12 12:15:27 -07:00
Linus Torvalds
c44db6c820 for-7.0-rc2-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmmm7WwACgkQxWXV+ddt
 WDs7PQ/+Mc0CHhMhRH3DyEnZTPO5YcNLGl2ytqu19X2VdGu3Ra86Au4V+0tWJ+zf
 g4jI8UdgJWdR7aIoIgtMkl2BbK0tyY0WBEJ76EJNDsatByNmTXc0iXwGROe6tL9p
 n4qrEnaTMh4SmYEsFEQX9lO5ISbDbk+kfN8qapCl03c9JyKO6D3PSGrM7wzIkXX4
 oIyfDWpYpAxbyWKjn+uJlpPzdsdfRceJ0fyCbq9sJITVW/FhicqTr6xvqqeoPSXp
 oJiL/Bbsilh7AtCLHguqpczt0X+Fus9enpjT9QqATN/JgUsaXt6O6Mk6NHcnEwjS
 vW6ZdeiFdELz2yLnJyb15ROf6Uorm3Mt2kAnkatLpyHxG9Z7rkxs3+cX4nm7MxSG
 GfLBkFB+HGw155z7cK0dPHMAhQ0KCF66I99VKTgLChjmUs8ipjPAYR8f/Tsq82RD
 mrYf3mEgWYnw6alx2ak454hsNjiXuYmc9bNy8Q+TXD73gQGqwUcZR6alIV+eoWVB
 xbX/0BQPemMITlhX6IuNn5EkCZSoB7eLcDMmYRSOpJOd8oo+gXmzQ5WvQIpwYhwz
 IZIH+KTdErw2FKJ8x9tStydnrmzN63QTEMMtuBy8pRsP5qJMrncPfAOMNBlqhqMq
 3W1GJuurHt2dBmUOQXWrUcMQlDLPyOxUHV6TdpCL83xNzdZK8G4=
 =REHq
 -----END PGP SIGNATURE-----

Merge tag 'for-7.0-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:
 "One-liner or short fixes for minor/moderate problems reported recently:

   - fixes or level adjustments of error messages

   - fix leaked transaction handles after aborted transactions, when
     using the remap tree feature

   - fix a few leaked chunk maps after errors

   - fix leaked page array in io_uring encoded read if an error occurs
     and the 'finished' is not called

   - fix double release of reserved extents when doing a range COW

   - don't commit super block when the filesystem is in shutdown state

   - fix squota accounting condition when checking members vs parent
     usage

   - other error handling fixes"

* tag 'for-7.0-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: check block group lookup in remove_range_from_remap_tree()
  btrfs: fix transaction handle leaks in btrfs_last_identity_remap_gone()
  btrfs: fix chunk map leak in btrfs_map_block() after btrfs_translate_remap()
  btrfs: fix chunk map leak in btrfs_map_block() after btrfs_chunk_map_num_copies()
  btrfs: fix compat mask in error messages in btrfs_check_features()
  btrfs: print correct subvol num if active swapfile prevents deletion
  btrfs: fix warning in scrub_verify_one_metadata()
  btrfs: fix objectid value in error message in check_extent_data_ref()
  btrfs: fix incorrect key offset in error message in check_dev_extent_item()
  btrfs: fix error message order of parameters in btrfs_delete_delayed_dir_index()
  btrfs: don't commit the super block when unmounting a shutdown filesystem
  btrfs: free pages on error in btrfs_uring_read_extent()
  btrfs: fix referenced/exclusive check in squota_check_parent_usage()
  btrfs: remove pointless WARN_ON() in cache_save_setup()
  btrfs: convert log messages to error level in btrfs_replay_log()
  btrfs: remove btrfs_handle_fs_error() after failure to recover log trees
  btrfs: remove redundant warning message in btrfs_check_uuid_tree()
  btrfs: change warning messages to error level in open_ctree()
  btrfs: fix a double release on reserved extents in cow_one_range()
  btrfs: handle discard errors in in btrfs_finish_extent_commit()
2026-03-03 09:08:00 -08:00
Filipe Manana
2d1ababded btrfs: fix transaction abort on file creation due to name hash collision
If we attempt to create several files with names that result in the same
hash, we have to pack them in same dir item and that has a limit inherent
to the leaf size. However if we reach that limit, we trigger a transaction
abort and turns the filesystem into RO mode. This allows for a malicious
user to disrupt a system, without the need to have administration
privileges/capabilities.

Reproducer:

  $ cat exploit-hash-collisions.sh
  #!/bin/bash

  DEV=/dev/sdi
  MNT=/mnt/sdi

  # Use smallest node size to make the test faster and require fewer file
  # names that result in hash collision.
  mkfs.btrfs -f --nodesize 4K $DEV
  mount $DEV $MNT

  # List of names that result in the same crc32c hash for btrfs.
  declare -a names=(
   'foobar'
   '%a8tYkxfGMLWRGr55QSeQc4PBNH9PCLIvR6jZnkDtUUru1t@RouaUe_L:@xGkbO3nCwvLNYeK9vhE628gss:T$yZjZ5l-Nbd6CbC$M=hqE-ujhJICXyIxBvYrIU9-TDC'
   'AQci3EUB%shMsg-N%frgU:02ByLs=IPJU0OpgiWit5nexSyxZDncY6WB:=zKZuk5Zy0DD$Ua78%MelgBuMqaHGyKsJUFf9s=UW80PcJmKctb46KveLSiUtNmqrMiL9-Y0I_l5Fnam04CGIg=8@U:Z'
   'CvVqJpJzueKcuA$wqwePfyu7VxuWNN3ho$p0zi2H8QFYK$7YlEqOhhb%:hHgjhIjW5vnqWHKNP4'
   'ET:vk@rFU4tsvMB0$C_p=xQHaYZjvoF%-BTc%wkFW8yaDAPcCYoR%x$FH5O:'
   'HwTon%v7SGSP4FE08jBwwiu5aot2CFKXHTeEAa@38fUcNGOWvE@Mz6WBeDH_VooaZ6AgsXPkVGwy9l@@ZbNXabUU9csiWrrOp0MWUdfi$EZ3w9GkIqtz7I_eOsByOkBOO'
   'Ij%2VlFGXSuPvxJGf5UWy6O@1svxGha%b@=%wjkq:CIgE6u7eJOjmQY5qTtxE2Rjbis9@us'
   'KBkjG5%9R8K9sOG8UTnAYjxLNAvBmvV5vz3IiZaPmKuLYO03-6asI9lJ_j4@6Xo$KZicaLWJ3Pv8XEwVeUPMwbHYWwbx0pYvNlGMO9F:ZhHAwyctnGy%_eujl%WPd4U2BI7qooOSr85J-C2V$LfY'
   'NcRfDfuUQ2=zP8K3CCF5dFcpfiOm6mwenShsAb_F%n6GAGC7fT2JFFn:c35X-3aYwoq7jNX5$ZJ6hI3wnZs$7KgGi7wjulffhHNUxAT0fRRLF39vJ@NvaEMxsMO'
   'Oj42AQAEzRoTxa5OuSKIr=A_lwGMy132v4g3Pdq1GvUG9874YseIFQ6QU'
   'Ono7avN5GjC:_6dBJ_'
   'WHmN2gnmaN-9dVDy4aWo:yNGFzz8qsJyJhWEWcud7$QzN2D9R0efIWWEdu5kwWr73NZm4=@CoCDxrrZnRITr-kGtU_cfW2:%2_am'
   'WiFnuTEhAG9FEC6zopQmj-A-$LDQ0T3WULz%ox3UZAPybSV6v1Z$b4L_XBi4M4BMBtJZpz93r9xafpB77r:lbwvitWRyo$odnAUYlYMmU4RvgnNd--e=I5hiEjGLETTtaScWlQp8mYsBovZwM2k'
   'XKyH=OsOAF3p%uziGF_ZVr$ivrvhVgD@1u%5RtrV-gl_vqAwHkK@x7YwlxX3qT6WKKQ%PR56NrUBU2dOAOAdzr2=5nJuKPM-T-$ZpQfCL7phxQbUcb:BZOTPaFExc-qK-gDRCDW2'
   'd3uUR6OFEwZr%ns1XH_@tbxA@cCPmbBRLdyh7p6V45H$P2$F%w0RqrD3M0g8aGvWpoTFMiBdOTJXjD:JF7=h9a_43xBywYAP%r$SPZi%zDg%ql-KvkdUCtF9OLaQlxmd'
   'ePTpbnit%hyNm@WELlpKzNZYOzOTf8EQ$sEfkMy1VOfIUu3coyvIr13-Y7Sv5v-Ivax2Go_GQRFMU1b3362nktT9WOJf3SpT%z8sZmM3gvYQBDgmKI%%RM-G7hyrhgYflOw%z::ZRcv5O:lDCFm'
   'evqk743Y@dvZAiG5J05L_ROFV@$2%rVWJ2%3nxV72-W7$e$-SK3tuSHA2mBt$qloC5jwNx33GmQUjD%akhBPu=VJ5g$xhlZiaFtTrjeeM5x7dt4cHpX0cZkmfImndYzGmvwQG:$euFYmXn$_2rA9mKZ'
   'gkgUtnihWXsZQTEkrMAWIxir09k3t7jk_IK25t1:cy1XWN0GGqC%FrySdcmU7M8MuPO_ppkLw3=Dfr0UuBAL4%GFk2$Ma10V1jDRGJje%Xx9EV2ERaWKtjpwiZwh0gCSJsj5UL7CR8RtW5opCVFKGGy8Cky'
   'hNgsG_8lNRik3PvphqPm0yEH3P%%fYG:kQLY=6O-61Wa6nrV_WVGR6TLB09vHOv%g4VQRP8Gzx7VXUY1qvZyS'
   'isA7JVzN12xCxVPJZ_qoLm-pTBuhjjHMvV7o=F:EaClfYNyFGlsfw-Kf%uxdqW-kwk1sPl2vhbjyHU1A6$hz'
   'kiJ_fgcdZFDiOptjgH5PN9-PSyLO4fbk_:u5_2tz35lV_iXiJ6cx7pwjTtKy-XGaQ5IefmpJ4N_ZqGsqCsKuqOOBgf9LkUdffHet@Wu'
   'lvwtxyhE9:%Q3UxeHiViUyNzJsy:fm38pg_b6s25JvdhOAT=1s0$pG25x=LZ2rlHTszj=gN6M4zHZYr_qrB49i=pA--@WqWLIuX7o1S_SfS@2FSiUZN'
   'rC24cw3UBDZ=5qJBUMs9e$=S4Y94ni%Z8639vnrGp=0Hv4z3dNFL0fBLmQ40=EYIY:Z=SLc@QLMSt2zsss2ZXrP7j4='
   'uwGl2s-fFrf@GqS=DQqq2I0LJSsOmM%xzTjS:lzXguE3wChdMoHYtLRKPvfaPOZF2fER@j53evbKa7R%A7r4%YEkD=kicJe@SFiGtXHbKe4gCgPAYbnVn'
   'UG37U6KKua2bgc:IHzRs7BnB6FD:2Mt5Cc5NdlsW%$1tyvnfz7S27FvNkroXwAW:mBZLA1@qa9WnDbHCDmQmfPMC9z-Eq6QT0jhhPpqyymaD:R02ghwYo%yx7SAaaq-:x33LYpei$5g8DMl3C'
   'y2vjek0FE1PDJC0qpfnN:x8k2wCFZ9xiUF2ege=JnP98R%wxjKkdfEiLWvQzmnW'
   '8-HCSgH5B%K7P8_jaVtQhBXpBk:pE-$P7ts58U0J@iR9YZntMPl7j$s62yAJO@_9eanFPS54b=UTw$94C-t=HLxT8n6o9P=QnIxq-f1=Ne2dvhe6WbjEQtc'
   'YPPh:IFt2mtR6XWSmjHptXL_hbSYu8bMw-JP8@PNyaFkdNFsk$M=xfL6LDKCDM-mSyGA_2MBwZ8Dr4=R1D%7-mCaaKGxb990jzaagRktDTyp'
   '9hD2ApKa_t_7x-a@GCG28kY:7$M@5udI1myQ$x5udtggvagmCQcq9QXWRC5hoB0o-_zHQUqZI5rMcz_kbMgvN5jr63LeYA4Cj-c6F5Ugmx6DgVf@2Jqm%MafecpgooqreJ53P-QTS'
  )

  # Now create files with all those names in the same parent directory.
  # It should not fail since a 4K leaf has enough space for them.
  for name in "${names[@]}"; do
       touch $MNT/$name
  done

  # Now add one more file name that causes a crc32c hash collision.
  # This should fail, but it should not turn the filesystem into RO mode
  # (which could be exploited by malicious users) due to a transaction
  # abort.
  touch $MNT/'W6tIm-VK2@BGC@IBfcgg6j_p:pxp_QUqtWpGD5Ok_GmijKOJJt'

  # Check that we are able to create another file, with a name that does not cause
  # a crc32c hash collision.
  echo -n "hello world" > $MNT/baz

  # Unmount and mount again, verify file baz exists and with the right content.
  umount $MNT
  mount $DEV $MNT
  echo "File baz content: $(cat $MNT/baz)"

  umount $MNT

When running the reproducer:

  $ ./exploit-hash-collisions.sh
  (...)
  touch: cannot touch '/mnt/sdi/W6tIm-VK2@BGC@IBfcgg6j_p:pxp_QUqtWpGD5Ok_GmijKOJJt': Value too large for defined data type
  ./exploit-hash-collisions.sh: line 57: /mnt/sdi/baz: Read-only file system
  cat: /mnt/sdi/baz: No such file or directory
  File baz content:

And the transaction abort stack trace in dmesg/syslog:

  $ dmesg
  (...)
  [758240.509761] ------------[ cut here ]------------
  [758240.510668] BTRFS: Transaction aborted (error -75)
  [758240.511577] WARNING: fs/btrfs/inode.c:6854 at btrfs_create_new_inode+0x805/0xb50 [btrfs], CPU#6: touch/888644
  [758240.513513] Modules linked in: btrfs dm_zero (...)
  [758240.523221] CPU: 6 UID: 0 PID: 888644 Comm: touch Tainted: G        W           6.19.0-rc8-btrfs-next-225+ #1 PREEMPT(full)
  [758240.524621] Tainted: [W]=WARN
  [758240.525037] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.2-0-gea1b7a073390-prebuilt.qemu.org 04/01/2014
  [758240.526331] RIP: 0010:btrfs_create_new_inode+0x80b/0xb50 [btrfs]
  [758240.527093] Code: 0f 82 cf (...)
  [758240.529211] RSP: 0018:ffffce64418fbb48 EFLAGS: 00010292
  [758240.529935] RAX: 00000000ffffffd3 RBX: 0000000000000000 RCX: 00000000ffffffb5
  [758240.531040] RDX: 0000000d04f33e06 RSI: 00000000ffffffb5 RDI: ffffffffc0919dd0
  [758240.531920] RBP: ffffce64418fbc10 R08: 0000000000000000 R09: 00000000ffffffb5
  [758240.532928] R10: 0000000000000000 R11: ffff8e52c0000000 R12: ffff8e53eee7d0f0
  [758240.533818] R13: ffff8e57f70932a0 R14: ffff8e5417629568 R15: 0000000000000000
  [758240.534664] FS:  00007f1959a2a740(0000) GS:ffff8e5b27cae000(0000) knlGS:0000000000000000
  [758240.535821] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  [758240.536644] CR2: 00007f1959b10ce0 CR3: 000000012a2cc005 CR4: 0000000000370ef0
  [758240.537517] Call Trace:
  [758240.537828]  <TASK>
  [758240.538099]  btrfs_create_common+0xbf/0x140 [btrfs]
  [758240.538760]  path_openat+0x111a/0x15b0
  [758240.539252]  do_filp_open+0xc2/0x170
  [758240.539699]  ? preempt_count_add+0x47/0xa0
  [758240.540200]  ? __virt_addr_valid+0xe4/0x1a0
  [758240.540800]  ? __check_object_size+0x1b3/0x230
  [758240.541661]  ? alloc_fd+0x118/0x180
  [758240.542315]  do_sys_openat2+0x70/0xd0
  [758240.543012]  __x64_sys_openat+0x50/0xa0
  [758240.543723]  do_syscall_64+0x50/0xf20
  [758240.544462]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
  [758240.545397] RIP: 0033:0x7f1959abc687
  [758240.546019] Code: 48 89 fa (...)
  [758240.548522] RSP: 002b:00007ffe16ff8690 EFLAGS: 00000202 ORIG_RAX: 0000000000000101
  [758240.566278] RAX: ffffffffffffffda RBX: 00007f1959a2a740 RCX: 00007f1959abc687
  [758240.567068] RDX: 0000000000000941 RSI: 00007ffe16ffa333 RDI: ffffffffffffff9c
  [758240.567860] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
  [758240.568707] R10: 00000000000001b6 R11: 0000000000000202 R12: 0000561eec7c4b90
  [758240.569712] R13: 0000561eec7c311f R14: 00007ffe16ffa333 R15: 0000000000000000
  [758240.570758]  </TASK>
  [758240.571040] ---[ end trace 0000000000000000 ]---
  [758240.571681] BTRFS: error (device sdi state A) in btrfs_create_new_inode:6854: errno=-75 unknown
  [758240.572899] BTRFS info (device sdi state EA): forced readonly

Fix this by checking for hash collision, and if the adding a new name is
possible, early in btrfs_create_new_inode() before we do any tree updates,
so that we don't need to abort the transaction if we cannot add the new
name due to the leaf size limit.

A test case for fstests will be sent soon.

Fixes: caae78e032 ("btrfs: move common inode creation code into btrfs_create_new_inode()")
CC: stable@vger.kernel.org # 6.1+
Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2026-03-03 17:03:59 +01:00
Mark Harmstone
1c7e9111f4 btrfs: print correct subvol num if active swapfile prevents deletion
Fix the error message in btrfs_delete_subvolume() if we can't delete a
subvolume because it has an active swapfile: we were printing the number
of the parent rather than the target.

Fixes: 60021bd754 ("btrfs: prevent subvol with swapfile from being deleted")
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Mark Harmstone <mark@harmstone.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2026-02-26 15:03:28 +01:00
Qu Wenruo
a4fe134fc1 btrfs: fix a double release on reserved extents in cow_one_range()
[BUG]
Commit c28214bde6 ("btrfs: refactor the main loop of
cow_file_range()") refactored the handling of COWing one range.

However it changed the error handling of the reserved extent.

The old cleanup looks like this:

out_drop_extent_cache:
	btrfs_drop_extent_map_range(inode, start, start + cur_alloc_size - 1, false);
out_reserve:
	btrfs_dec_block_group_reservations(fs_info, ins.objectid);
	btrfs_free_reserved_extent(fs_info, ins.objectid, ins.offset, true);
	[...]
	clear_bits = EXTENT_LOCKED | EXTENT_DELALLOC | EXTENT_DELALLOC_NEW |
		     EXTENT_DEFRAG | EXTENT_CLEAR_META_RESV;
	page_ops = PAGE_UNLOCK | PAGE_START_WRITEBACK | PAGE_END_WRITEBACK;
	/*
	 * For the range (2). If we reserved an extent for our delalloc range
	 * (or a subrange) and failed to create the respective ordered extent,
	 * then it means that when we reserved the extent we decremented the
	 * extent's size from the data space_info's bytes_may_use counter and
	 * incremented the space_info's bytes_reserved counter by the same
	 * amount. We must make sure extent_clear_unlock_delalloc() does not try
	 * to decrement again the data space_info's bytes_may_use counter,
	 * therefore we do not pass it the flag EXTENT_CLEAR_DATA_RESV.
	 */
	if (cur_alloc_size) {
	        extent_clear_unlock_delalloc(inode, start,
	                                     start + cur_alloc_size - 1,
	                                     locked_folio, &cached, clear_bits,
	                                     page_ops);
	        btrfs_qgroup_free_data(inode, NULL, start, cur_alloc_size, NULL);
	}

Which only calls EXTENT_CLEAR_META_RESV.
As the reserved extent is properly handled by
btrfs_free_reserved_extent().

However the new cleanup is:

	extent_clear_unlock_delalloc(inode, file_offset, cur_end, locked_folio, cached,
				     EXTENT_LOCKED | EXTENT_DELALLOC |
				     EXTENT_DELALLOC_NEW |
				     EXTENT_DEFRAG | EXTENT_DO_ACCOUNTING,
				     PAGE_UNLOCK | PAGE_START_WRITEBACK |
				     PAGE_END_WRITEBACK);
	btrfs_qgroup_free_data(inode, NULL, file_offset, cur_len, NULL);
	btrfs_dec_block_group_reservations(fs_info, ins->objectid);
	btrfs_free_reserved_extent(fs_info, ins->objectid, ins->offset, true);

The flag EXTENT_DO_ACCOUNTING implies both EXTENT_CLEAR_META_RESV and
EXTENT_CLEAR_DATA_RESV, which will release the bytes_may_use, which
later btrfs_free_reserved_extent() will do again, causing incorrect
double release (and may underflow bytes_may_use).

[FIX]
Use EXTENT_CLEAR_META_RESV to replace EXTENT_DO_ACCOUNTING, and add back
the comments on why we only use EXTENT_CLEAR_META_RESV.

Fixes: c28214bde6 ("btrfs: refactor the main loop of cow_file_range()")
Reported-by: Chris Mason <clm@meta.com>
Link: https://lore.kernel.org/linux-btrfs/20260208184920.1102719-1-clm@meta.com/
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2026-02-26 15:03:26 +01:00
Linus Torvalds
323bbfcf1e Convert 'alloc_flex' family to use the new default GFP_KERNEL argument
This is the exact same thing as the 'alloc_obj()' version, only much
smaller because there are a lot fewer users of the *alloc_flex()
interface.

As with alloc_obj() version, this was done entirely with mindless brute
force, using the same script, except using 'flex' in the pattern rather
than 'objs*'.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2026-02-21 17:09:51 -08:00
Linus Torvalds
bf4afc53b7 Convert 'alloc_obj' family to use the new default GFP_KERNEL argument
This was done entirely with mindless brute force, using

    git grep -l '\<k[vmz]*alloc_objs*(.*, GFP_KERNEL)' |
        xargs sed -i 's/\(alloc_objs*(.*\), GFP_KERNEL)/\1)/'

to convert the new alloc_obj() users that had a simple GFP_KERNEL
argument to just drop that argument.

Note that due to the extreme simplicity of the scripting, any slightly
more complex cases spread over multiple lines would not be triggered:
they definitely exist, but this covers the vast bulk of the cases, and
the resulting diff is also then easier to check automatically.

For the same reason the 'flex' versions will be done as a separate
conversion.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2026-02-21 17:09:51 -08:00
Kees Cook
69050f8d6d treewide: Replace kmalloc with kmalloc_obj for non-scalar types
This is the result of running the Coccinelle script from
scripts/coccinelle/api/kmalloc_objs.cocci. The script is designed to
avoid scalar types (which need careful case-by-case checking), and
instead replace kmalloc-family calls that allocate struct or union
object instances:

Single allocations:	kmalloc(sizeof(TYPE), ...)
are replaced with:	kmalloc_obj(TYPE, ...)

Array allocations:	kmalloc_array(COUNT, sizeof(TYPE), ...)
are replaced with:	kmalloc_objs(TYPE, COUNT, ...)

Flex array allocations:	kmalloc(struct_size(PTR, FAM, COUNT), ...)
are replaced with:	kmalloc_flex(*PTR, FAM, COUNT, ...)

(where TYPE may also be *VAR)

The resulting allocations no longer return "void *", instead returning
"TYPE *".

Signed-off-by: Kees Cook <kees@kernel.org>
2026-02-21 01:02:28 -08:00
Linus Torvalds
b3f1da2a4d for-7.0-rc1-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmmYpGMACgkQxWXV+ddt
 WDvKNQ//cy7gb2nItc9hbYXUM/88Ks3Usu94u/4gREL9I97u2vHer6sT8cRfWA2k
 OSOZF2L7yPWIcxcj39YC66ubs7uebSwt/bZAL1TKAyA7wUvR/kdhD7DUTWX4ySf3
 2+1BANv1Bng8C7vGnWDhYPHcb1u8LvKxKcn+9h8SzBGpW5dyx3k4xUrneaMYq+jf
 D9sPjkkM6fxsKn+S3OJP/zFUIQ2DQiv7nF+Jv4Ke2h9c9nCVfn8fRK0AuTlYXFY/
 mWkKWo1ATGVd0fBg/otRp/ZlZczoKs3/1YBUMYTxZZngyweIms4Q6I4/GIGHO+RD
 QFFoIQ7OQd0aqBGhuKTDlYMlc6OS2jwoTgVYr6vSIxSRUsCK/grHdPL+s+9dLc3h
 p7+/URH9Gpfad46wFypb5w7zmmc8jCRkR1Ff+jf6Pi8GgffqocCro3C3HlGRKwcf
 CAj6gI3ypNPNFfYidcKbS+ehXhjmMVb9xhNa8YwCC1CdgM54ZMmEs/ksAN+uBc/u
 EfcAbB3T15LQgzUJs2WKvCI3E/0XUYEi54ng8UwCJ6P01p3egfvQo8t6jZal9Vx8
 ba/LUG50W1xRRjxgG1AU5s42GmGkO8WNyIixmLlT+Pwog0I2auPVDQBudbXZK4ps
 +FOtNnN9hYLmuZyRSTT03MHHf0Rqtckdjvq3413KMFILVh+ZM+Q=
 =CQsu
 -----END PGP SIGNATURE-----

Merge tag 'for-7.0-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:

 - multiple error handling fixes of unexpected conditions

 - reset block group size class once it becomes empty so that
   its class can be changed

 - error message level adjustments

 - fixes of returned error values

 - use correct block reserve for delayed refs

* tag 'for-7.0-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: fix invalid leaf access in btrfs_quota_enable() if ref key not found
  btrfs: fix lost error return in btrfs_find_orphan_roots()
  btrfs: fix lost return value on error in finish_verity()
  btrfs: change unaligned root messages to error level in btrfs_validate_super()
  btrfs: use the correct type to initialize block reserve for delayed refs
  btrfs: do not ASSERT() when the fs flips RO inside btrfs_repair_io_failure()
  btrfs: reset block group size class when it becomes empty
  btrfs: replace BUG() with error handling in __btrfs_balance()
  btrfs: handle unexpected exact match in btrfs_set_inode_index_count()
2026-02-20 14:57:09 -08:00
Adarsh Das
1c88823a19 btrfs: handle unexpected exact match in btrfs_set_inode_index_count()
We search with offset (u64)-1 which should never match exactly.
Previously the code silently returned success without setting the index
count. Now logs an error and return -EUCLEAN instead.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Adarsh Das <adarshdas950@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>,
Signed-off-by: David Sterba <dsterba@suse.com>
2026-02-18 15:25:53 +01:00
Linus Torvalds
997f9640c9 fsverity updates for 7.0
fsverity cleanups, speedup, and memory usage optimization from
 Christoph Hellwig:
 
 - Move some logic into common code
 
 - Fix btrfs to reject truncates of fsverity files
 
 - Improve the readahead implementation
 
 - Store each inode's fsverity_info in a hash table instead of using a
   pointer in the filesystem-specific part of the inode.
 
   This optimizes for memory usage in the usual case where most files
   don't have fsverity enabled.
 
 - Look up the fsverity_info fewer times during verification, to
   amortize the hash table overhead
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQSacvsUNc7UX4ntmEPzXCl4vpKOKwUCaY0nZhQcZWJpZ2dlcnNA
 a2VybmVsLm9yZwAKCRDzXCl4vpKOK/AVAP9wSLEYsG3dqnNIHjIvLeK+9NC3Ni4d
 m+fvT1JfuideOwEA9r2EfztusLU5iyqWJlHyxekibXItUDgYGltaYb7eXAU=
 =a+To
 -----END PGP SIGNATURE-----

Merge tag 'fsverity-for-linus' of git://git.kernel.org/pub/scm/fs/fsverity/linux

Pull fsverity updates from Eric Biggers:
 "fsverity cleanups, speedup, and memory usage optimization from
  Christoph Hellwig:

   - Move some logic into common code

   - Fix btrfs to reject truncates of fsverity files

   - Improve the readahead implementation

   - Store each inode's fsverity_info in a hash table instead of using a
     pointer in the filesystem-specific part of the inode.

     This optimizes for memory usage in the usual case where most files
     don't have fsverity enabled.

   - Look up the fsverity_info fewer times during verification, to
     amortize the hash table overhead"

* tag 'fsverity-for-linus' of git://git.kernel.org/pub/scm/fs/fsverity/linux:
  fsverity: remove inode from fsverity_verification_ctx
  fsverity: use a hashtable to find the fsverity_info
  btrfs: consolidate fsverity_info lookup
  f2fs: consolidate fsverity_info lookup
  ext4: consolidate fsverity_info lookup
  fs: consolidate fsverity_info lookup in buffer.c
  fsverity: push out fsverity_info lookup
  fsverity: deconstify the inode pointer in struct fsverity_info
  fsverity: kick off hash readahead at data I/O submission time
  ext4: move ->read_folio and ->readahead to readpage.c
  readahead: push invalidate_lock out of page_cache_ra_unbounded
  fsverity: don't issue readahead for non-ENOENT errors from __filemap_get_folio
  fsverity: start consolidating pagecache code
  fsverity: pass struct file to ->write_merkle_tree_block
  f2fs: don't build the fsverity work handler for !CONFIG_FS_VERITY
  ext4: don't build the fsverity work handler for !CONFIG_FS_VERITY
  fs,fsverity: clear out fsverity_info from common code
  fs,fsverity: reject size changes on fsverity files in setattr_prepare
2026-02-12 10:41:34 -08:00
Linus Torvalds
8912c2fd58 for-6.20-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmmDT6sACgkQxWXV+ddt
 WDteIBAAnQBKtHZOrefnA/SjbT4N+IV20x8sxVc3XI2MXw6RpjEN6k+0oGMLdvMy
 5NBryJ43q5CwCV6iNkWQE4mT86gcPa6Bqv1nFOC5Q2BDkvbVBpOfOq7kC2+fQ7ay
 HF2Mr0PUHc0Y0MhkRSljO+T2QD4tDpWaxbEeVY+TxiAsepD1paK4fHV6Lwu2sk25
 17RJQvm/2XRY32g9Sa6NZIc7mGuyIasMCBcTpDKDJW10hP61NNtK4wHgPLtMRtzx
 qzCAPSMS6QkeJZHcDa/Atg+iqpR5U8pdKAUSYJii3Kgcmjr5n1U1ZTp5WRLlXSS2
 tHiR62a983ya022wKR1ApsdjN7ncE8iIeT/GrezZVcPtm9jTxaSzgd7dDNfSmr29
 my4crJWvlEuD9Qt+/oz//eLAjkgEe2Q5RtaAworCAG00MzaGOEwNiXXP7DDMQApI
 VTxx9dvY0s/W3UF/IuJWTTN9q95KjvlmZ9ELAPxwwtyq+sAD41CvlYhJqCaLLec5
 6xMotP5cy3Ur+yp+J7RCDprQ7x6YcU98PYIXQxf1/77f3Lz/7QA2TWafPzJ5V2Bk
 UtprVCrlqwCmSFrSISN6HzNf0UYY/ZI36WRoUj/ZJkGNfkQwvs9aBjb+lVYRb8T8
 OcMlJrJvoUwIY//ef5K97ma8HOecodxszdEIafOgmnJtE9H3foI=
 =Ie8n
 -----END PGP SIGNATURE-----

Merge tag 'for-6.20-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs updates from David Sterba:
 "User visible changes, feature updates:

   - when using block size > page size, enable direct IO

   - fallback to buffered IO if the data profile has duplication,
     workaround to avoid checksum mismatches on block group profiles
     with redundancy, real direct IO is possible on single or RAID0

   - redo export of zoned statistics, moved from sysfs to
     /proc/pid/mountstats due to size limitations of the former

  Experimental features:

   - remove offload checksum tunable, intended to find best way to do it
     but since we've switched to offload to thread for everything we
     don't need it anymore

   - initial support for remap-tree feature, a translation layer of
     logical block addresses that allow changes without moving/rewriting
     blocks to do eg. relocation, or other changes that require COW

  Notable fixes:

   - automatic removal of accidentally leftover chunks when
     free-space-tree is enabled since mkfs.btrfs v6.16.1

   - zoned mode:
      - do not try to append to conventional zones when RAID is mixing
        zoned and conventional drives
      - fixup write pointers when mixing zoned and conventional on
        DUP/RAID* profiles

   - when using squota, relax deletion rules for qgroups with 0 members
     to allow easier recovery from accounting bugs, also add more checks
     to detect bad accounting

   - fix periodic reclaim scanning, properly check boundary conditions
     not to trigger it unexpectedly or miss the time to run it

   - trim:
      - continue after first error
      - change reporting to the first detected error
      - add more cancellation points
      - reduce contention of big device lock that can block other
        operations when there's lots of trimmed space

   - when chunk allocation is forced (needs experimental build) fix
     transaction abort when unexpected space layout is detected

  Core:

   - switch to crypto library API for checksumming, removed module
     dependencies, pointer indirections, etc.

   - error handling improvements

   - adjust how and where transaction commit or abort are done and are
     maybe not necessary

   - minor compression optimization to skip single block ranges

   - improve how compression folios are handled

   - new and updated selftests

   - cleanups, refactoring:
      - auto-freeing and other automatic variable cleanup conversion
      - structure size optimizations
      - condition annotations"

* tag 'for-6.20-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (137 commits)
  btrfs: get rid of compressed_bio::compressed_folios[]
  btrfs: get rid of compressed_folios[] usage for encoded writes
  btrfs: get rid of compressed_folios[] usage for compressed read
  btrfs: remove the old btrfs_compress_folios() infrastructure
  btrfs: switch to btrfs_compress_bio() interface for compressed writes
  btrfs: introduce btrfs_compress_bio() helper
  btrfs: zlib: introduce zlib_compress_bio() helper
  btrfs: zstd: introduce zstd_compress_bio() helper
  btrfs: lzo: introduce lzo_compress_bio() helper
  btrfs: zoned: factor out the zone loading part into a testable function
  btrfs: add cleanup function for btrfs_free_chunk_map
  btrfs: tests: add cleanup functions for test specific functions
  btrfs: raid56: fix memory leak of btrfs_raid_bio::stripe_uptodate_bitmap
  btrfs: tests: add unit tests for pending extent walking functions
  btrfs: fix EEXIST abort due to non-consecutive gaps in chunk allocation
  btrfs: fix transaction commit blocking during trim of unallocated space
  btrfs: handle user interrupt properly in btrfs_trim_fs()
  btrfs: preserve first error in btrfs_trim_fs()
  btrfs: continue trimming remaining devices on failure
  btrfs: do not BUG_ON() in btrfs_remove_block_group()
  ...
2026-02-09 15:45:21 -08:00
Linus Torvalds
aa2a0fcd4c vfs-7.0-rc1.leases
Please consider pulling these changes from the signed vfs-7.0-rc1.leases tag.
 
 Thanks!
 Christian
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCaYX49gAKCRCRxhvAZXjc
 olR/AP40iNOTRn7LosXbRWqGGZqzy9v64QYoLzk3QdsWuGmbRAD/egNQzof8mkAf
 IscefWTOjY7xyDzmEBEBnfHftgMiEwM=
 =zre0
 -----END PGP SIGNATURE-----

Merge tag 'vfs-7.0-rc1.leases' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull vfs lease updates from Christian Brauner:
 "This contains updates for lease support to require filesystems to
  explicitly opt-in to lease support

  Currently kernel_setlease() falls through to generic_setlease() when a
  a filesystem does not define ->setlease(), silently granting lease
  support to every filesystem regardless of whether it is prepared for
  it.

  This is a poor default: most filesystems never intended to support
  leases, and the silent fallthrough makes it impossible to distinguish
  "supports leases" from "never thought about it".

  This inverts the default. It adds explicit

	.setlease = generic_setlease;

  assignments to every in-tree filesystem that should retain lease
  support, then changes kernel_setlease() to return -EINVAL when
  ->setlease is NULL.

  With the new default in place, simple_nosetlease() is redundant and
  is removed along with all references to it"

* tag 'vfs-7.0-rc1.leases' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (25 commits)
  fuse: add setlease file operation
  fs: remove simple_nosetlease()
  filelock: default to returning -EINVAL when ->setlease operation is NULL
  xfs: add setlease file operation
  ufs: add setlease file operation
  udf: add setlease file operation
  tmpfs: add setlease file operation
  squashfs: add setlease file operation
  overlayfs: add setlease file operation
  orangefs: add setlease file operation
  ocfs2: add setlease file operation
  ntfs3: add setlease file operation
  nilfs2: add setlease file operation
  jfs: add setlease file operation
  jffs2: add setlease file operation
  gfs2: add a setlease file operation
  fat: add setlease file operation
  f2fs: add setlease file operation
  exfat: add setlease file operation
  ext4: add setlease file operation
  ...
2026-02-09 11:59:07 -08:00
Linus Torvalds
74554251df vfs-7.0-rc1.nonblocking_timestamps
Please consider pulling these changes from the signed vfs-7.0-rc1.nonblocking_timestamps tag.
 
 Thanks!
 Christian
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCaYX49gAKCRCRxhvAZXjc
 oqNMAQCjHw9iwYDu63n96QAipWopJb8onqc0rTEvi0OOl1zDNwEAufN3EqTzV3uQ
 JbNgSwBWD/+ICd2aUOuAX0GgU6teyAQ=
 =lJlI
 -----END PGP SIGNATURE-----

Merge tag 'vfs-7.0-rc1.nonblocking_timestamps' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull vfs timestamp updates from Christian Brauner:
 "This contains the changes to support non-blocking timestamp updates.

  Since commit 66fa3cedf1 ("fs: Add async write file modification
  handling") file_update_time_flags() unconditionally returns -EAGAIN
  when any timestamp needs updating and IOCB_NOWAIT is set. This makes
  non-blocking direct writes impossible on file systems with granular
  enough timestamps, which in practice means all of them.

  This reworks the timestamp update path to propagate IOCB_NOWAIT
  through ->update_time so that file systems which can update timestamps
  without blocking are no longer penalized.

  With that groundwork in place, the core change passes IOCB_NOWAIT into
  ->update_time and returns -EAGAIN only when the file system indicates
  it would block.

  XFS implements non-blocking timestamp updates by using the new
  ->sync_lazytime and open-coding generic_update_time without the
  S_NOWAIT check, since the lazytime path through the generic helpers
  can never block in XFS"

* tag 'vfs-7.0-rc1.nonblocking_timestamps' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  xfs: enable non-blocking timestamp updates
  xfs: implement ->sync_lazytime
  fs: refactor file_update_time_flags
  fs: add support for non-blocking timestamp updates
  fs: add a ->sync_lazytime method
  fs: factor out a sync_lazytime helper
  fs: refactor ->update_time handling
  fat: cleanup the flags for fat_truncate_time
  nfs: split nfs_update_timestamps
  fs: allow error returns from generic_update_time
  fs: remove inode_update_time
2026-02-09 11:25:01 -08:00
Christoph Hellwig
f77f281b61 fsverity: use a hashtable to find the fsverity_info
Use the kernel's resizable hash table (rhashtable) to find the
fsverity_info.  This way file systems that want to support fsverity don't
have to bloat every inode in the system with an extra pointer.  The
trade-off is that looking up the fsverity_info is a bit more expensive
now, but the main operations are still dominated by I/O and hashing
overhead.

The rhashtable implementations requires no external synchronization, and
the _fast versions of the APIs provide the RCU critical sections required
by the implementation.  Because struct fsverity_info is only removed on
inode eviction and does not contain a reference count, there is no need
for an extended critical section to grab a reference or validate the
object state.  The file open path uses rhashtable_lookup_get_insert_fast,
which can either find an existing object for the hash key or insert a
new one in a single atomic operation, so that concurrent opens never
instantiate duplicate fsverity_info structure.  FS_IOC_ENABLE_VERITY must
already be synchronized by a combination of i_rwsem and file system flags
and uses rhashtable_lookup_insert_fast, which errors out on an existing
object for the hash key as an additional safety check.

Because insertion into the hash table now happens before S_VERITY is set,
fsverity just becomes a barrier and a flag check and doesn't have to look
up the fsverity_info at all, so there is only a single lookup per
->read_folio or ->readahead invocation.  For btrfs there is an additional
one for each bio completion, while for ext4 and f2fs the fsverity_info
is stored in the per-I/O context and reused for the completion workqueue.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Link: https://lore.kernel.org/r/20260202060754.270269-12-hch@lst.de
[EB: folded in fix for missing fsverity_free_info()]
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
2026-02-04 11:31:54 -08:00
Qu Wenruo
e1bc83f8b1 btrfs: get rid of compressed_folios[] usage for encoded writes
Currently only encoded writes utilized btrfs_submit_compressed_write(),
which utilized compressed_bio::compressed_folios[] array.

Change the only call site to call the new helper,
btrfs_alloc_compressed_write(), to allocate a compressed bio, then queue
needed folios into that bio, and finally call
btrfs_submit_compressed_write() to submit the compressed bio.

This change has one hidden benefit, previously we used
btrfs_alloc_folio_array() for the folios of
btrfs_submit_compressed_read(), which doesn't utilize the compression
page pool for bs == ps cases.

Now we call btrfs_alloc_compr_folio() which will benefit from the page pool.

The other obvious benefit is that we no longer need to allocate an array
to hold all those folios, thus one less error path.

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2026-02-03 07:59:07 +01:00
Qu Wenruo
26902be0cd btrfs: remove the old btrfs_compress_folios() infrastructure
Since it's been replaced by btrfs_compress_bio(), remove all involved
functions.

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2026-02-03 07:59:07 +01:00
Qu Wenruo
6f706f34fc btrfs: switch to btrfs_compress_bio() interface for compressed writes
This switch has the following benefits:

- A single structure to handle all compression

  No more extra members like compressed_folios[] nor compress_type, all
  those members.

  This means the structure of async_extent is much smaller.

- Simpler error handling

  A single cleanup_compressed_bio() will handle everything, no extra
  compressed_folios[] array to bother.

Some extra notes:

- Compressed folios releasing

  Now we go bio_for_each_folio_all() loop to release the folios of the
  bio. This will work for both the old compressed_folios[] array and the
  new pure bio method.

  For old compressed_folios[], all folios of that array is queued into
  the bio, thus releasing the folios from the bio is the same as
  releasing each folio of that array. We just need to be sure no double
  releasing from the array and bio.

  For the new pure bio method, that array is NULL, just usual folio
  releasing of the bio.

  The only extra note is for end_bbio_compressed_read(), as the folios
  are allocated using btrfs_alloc_folio_array(), thus the folios should
  only be released by regular folio_put(), not btrfs_free_compr_folio().

- Rounding up the bio to block size

  We cannot simply increase bi_size, as that will not increase the
  length of the last bvec.

  Thus we have to properly add the last part into the bio.
  This will be done by the helper, round_up_last_block().

  The reason we do not round those bios up at compression time is to get
  the unaligned compressed size, so that they can be utilized for
  inline extents.
  If we round the bios up at *_compress_bio(), then every compressed bio
  will be larger than or equal to one fs block, resulting no inline
  compressed extent.

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2026-02-03 07:59:07 +01:00
Filipe Manana
47c9dbc791 btrfs: remove pointless out labels from inode.c
Some functions (insert_inline_extent() and insert_reserved_file_extent())
have an 'out' label that does nothing but return, making it pointless.
Simplify this by removing the label and returning instead of gotos plus
setting the 'ret' variable.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2026-02-03 07:56:21 +01:00
Filipe Manana
571e75f4c0 btrfs: unfold transaction aborts in btrfs_finish_one_ordered()
We have a single transaction abort that can be caused either by a failure
from a call to btrfs_mark_extent_written(), if we are dealing with a
write to a prealloc extent, or otherwise from a call to
insert_ordered_extent_file_extent(). So when the transaction abort happens
we can not know for sure which case failed. Unfold the aborts so that it's
clear in case of a failure.

Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2026-02-03 07:56:19 +01:00
Filipe Manana
ea7ab405c5 btrfs: use the btrfs_extent_map_end() helper everywhere
We have a helper to calculate an extent map's exclusive end offset, but
we only use it in some places. Update every site that open codes the
calculation to use the helper.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2026-02-03 07:54:36 +01:00
Mark Harmstone
a645372e7e btrfs: add do_remap parameter to btrfs_discard_extent()
btrfs_discard_extent() can be called either when an extent is removed
or from walking the free-space tree. With a remapped block group these
two things are no longer equivalent: the extent's addresses are
remapped, while the free-space tree exclusively uses underlying
addresses.

Add a do_remap parameter to btrfs_discard_extent() and
btrfs_map_discard(), saying whether or not the address needs to be run
through the remap tree first.

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Mark Harmstone <mark@harmstone.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2026-02-03 07:54:35 +01:00
Qu Wenruo
59615e2c1f btrfs: reject single block sized compression early
Currently for an inode that needs compression, even if there is a delalloc
range that is single fs block sized and can not be inlined, we will
still go through the compression path.

Then inside compress_file_range(), we have one extra check to reject
single block sized range, and fall back to regular uncompressed write.

This rejection is in fact a little too late, we have already allocated
memory to async_chunk, delayed the submission, just to fallback to the
same uncompressed write.

Change the behavior to reject such cases earlier at
inode_need_compress(), so for such single block sized range we won't
even bother trying to go through compress path.

And since the inline small block check is inside inode_need_compress()
and compress_file_range() also calls that function, we no longer need a
dedicate check inside compress_file_range().

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2026-02-03 07:51:42 +01:00
Filipe Manana
7d7608cc9a btrfs: move unlikely checks around btrfs_is_shutdown() into the helper
Instead of surrounding every caller of btrfs_is_shutdown() with unlikely,
move the unlikely into the helper itself, like we do in other places in
btrfs and is common in the kernel outside btrfs too. Also make the fs_info
argument of btrfs_is_shutdown() const.

On a x86_84 box using gcc 14.2.0-19 from Debian, this resulted in a slight
reduction of the module's text size.

Before:

  $ size fs/btrfs/btrfs.ko
     text	   data	    bss	    dec	    hex	filename
  1939044	 172568	  15592	2127204	 207564	fs/btrfs/btrfs.ko

After:

  $ size fs/btrfs/btrfs.ko
     text	   data	    bss	    dec	    hex	filename
  1938876	 172568	  15592	2127036	 2074bc	fs/btrfs/btrfs.ko

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2026-02-03 07:49:12 +01:00
Qu Wenruo
c28214bde6 btrfs: refactor the main loop of cow_file_range()
Currently inside the main loop of cow_file_range(), we do the following
sequence:

- Reserve an extent
- Lock the IO tree range
- Create an IO extent map
- Create an ordered extent

Every step will need extra steps to do cleanup in the following order:

- Drop the newly created extent map
- Unlock extent range and cleanup the involved folios
- Free the reserved extent

However currently the error handling is done inconsistently:

- Extent map drop is handled in a dedicated tag
  Out of the main loop, make it much harder to track.

- The extent unlock and folios cleanup is done separately
  The extent is unlocked through btrfs_unlock_extent(), then
  extent_clear_unlock_delalloc() again in a dedicated tag.
  Meanwhile all other callsites (compression/encoded/nocow) all just
  call extent_clear_unlock_delalloc() to handle unlock and folio clean
  up in one go.

- Reserved extent freeing is handled in a dedicated tag
  Out of the main loop, make it much harder to track.

- Error handling of btrfs_reloc_clone_csums() is relying out-of-loop
  tags
  This is due to the special requirement to finish ordered extents to
  handle the metadata reserved space.

Enhance the error handling and align the behavior by:

- Introduce a dedicated cow_one_range() helper
  Which do the reserve/lock/allocation in the helper.

  And also handle the errors inside the helper.
  No more dedicated tags out of the main loop.

- Use a single extent_clear_unlock_delalloc() to unlock and cleanup
  folios

- Move the btrfs_reloc_clone_csums() error handling into the new helper
  Thankfully it's not that complex compared to other cases.

And since we're here, also reduce the width of the following local
variables to u32:

- cur_alloc_size
- min_alloc_size
  Each allocation won't go beyond 128M, thus u32 is more than enough.

- blocksize
  The maximum is 64K, no need for u64.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2026-02-03 07:49:12 +01:00
Filipe Manana
6d0f25cdd8 btrfs: update stale comment in __cow_file_range_inline()
We mention that the reserved data space is page size aligned but that's
not true anymore, as it's sector size aligned instead.
In commit 0bb067ca64 ("btrfs: fix the qgroup data free range for inline
data extents") we updated the amount passed to btrfs_qgroup_free_data()
from page size to sector size, but forgot to update the comment.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2026-02-03 07:49:10 +01:00