linux

mirror of https://github.com/torvalds/linux.git synced 2026-05-30 10:04:04 +02:00

Author	SHA1	Message	Date
Guangshuo Li	3f487be812	btrfs: fix double free in create_space_info() error path When kobject_init_and_add() fails, the call chain is: create_space_info() -> btrfs_sysfs_add_space_info_type() -> kobject_init_and_add() -> failure -> kobject_put(&space_info->kobj) -> space_info_release() -> kfree(space_info) Then control returns to create_space_info(): btrfs_sysfs_add_space_info_type() returns error -> goto out_free -> kfree(space_info) This causes a double free. Keep the direct kfree(space_info) for the earlier failure path, but after btrfs_sysfs_add_space_info_type() has called kobject_put(), let the kobject release callback handle the cleanup. Fixes: `a11224a016` ("btrfs: fix memory leaks in create_space_info() error paths") CC: stable@vger.kernel.org # 6.19+ Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Guangshuo Li <lgs201920130244@gmail.com> Signed-off-by: David Sterba <dsterba@suse.com>	2026-04-07 18:56:09 +02:00
Guangshuo Li	a7449edf96	btrfs: fix double free in create_space_info_sub_group() error path When kobject_init_and_add() fails, the call chain is: create_space_info_sub_group() -> btrfs_sysfs_add_space_info_type() -> kobject_init_and_add() -> failure -> kobject_put(&sub_group->kobj) -> space_info_release() -> kfree(sub_group) Then control returns to create_space_info_sub_group(), where: btrfs_sysfs_add_space_info_type() returns error -> kfree(sub_group) Thus, sub_group is freed twice. Keep parent->sub_group[index] = NULL for the failure path, but after btrfs_sysfs_add_space_info_type() has called kobject_put(), let the kobject release callback handle the cleanup. Fixes: `f92ee31e03` ("btrfs: introduce btrfs_space_info sub-group") CC: stable@vger.kernel.org # 6.18+ Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Guangshuo Li <lgs201920130244@gmail.com> Signed-off-by: David Sterba <dsterba@suse.com>	2026-04-07 18:56:09 +02:00
Johannes Thumshirn	ad0c23c97b	btrfs: zoned: limit number of zones reclaimed in flush_space() Limit the number of zones reclaimed in flush_space()'s RECLAIM_ZONES state. This prevents possibly long running reclaim sweeps to block other tasks in the system, while the system is under pressure anyways, causing the tasks to hang. An example of this can be seen here, triggered by fstests generic/551: generic/551 [ 27.042349] run fstests generic/551 at 2026-02-27 11:05:30 BTRFS: device fsid 78c16e29-20d9-4c8e-bc04-7ba431be38ff devid 1 transid 8 /dev/vdb (254:16) scanned by mount (806) BTRFS info (device vdb): first mount of filesystem 78c16e29-20d9-4c8e-bc04-7ba431be38ff BTRFS info (device vdb): using crc32c checksum algorithm BTRFS info (device vdb): host-managed zoned block device /dev/vdb, 64 zones of 268435456 bytes BTRFS info (device vdb): zoned mode enabled with zone size 268435456 BTRFS info (device vdb): checking UUID tree BTRFS info (device vdb): enabling free space tree INFO: task kworker/u38:1:90 blocked for more than 120 seconds. Not tainted 7.0.0-rc1+ #345 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. task:kworker/u38:1 state:D stack:0 pid:90 tgid:90 ppid:2 task_flags:0x4208060 flags:0x00080000 Workqueue: events_unbound btrfs_async_reclaim_data_space Call Trace: <TASK> __schedule+0x34f/0xe70 schedule+0x41/0x140 schedule_timeout+0xa3/0x110 ? mark_held_locks+0x40/0x70 ? lockdep_hardirqs_on_prepare+0xd8/0x1c0 ? trace_hardirqs_on+0x18/0x100 ? lockdep_hardirqs_on+0x84/0x130 ? _raw_spin_unlock_irq+0x33/0x50 wait_for_completion+0xa4/0x150 ? __flush_work+0x24c/0x550 __flush_work+0x339/0x550 ? __pfx_wq_barrier_func+0x10/0x10 ? wait_for_completion+0x39/0x150 flush_space+0x243/0x660 ? find_held_lock+0x2b/0x80 ? kvm_sched_clock_read+0x11/0x20 ? local_clock_noinstr+0x17/0x110 ? local_clock+0x15/0x30 ? lock_release+0x1b7/0x4b0 do_async_reclaim_data_space+0xe8/0x160 btrfs_async_reclaim_data_space+0x19/0x30 process_one_work+0x20a/0x5f0 ? lock_is_held_type+0xcd/0x130 worker_thread+0x1e2/0x3c0 ? __pfx_worker_thread+0x10/0x10 kthread+0x103/0x150 ? __pfx_kthread+0x10/0x10 ret_from_fork+0x20d/0x320 ? __pfx_kthread+0x10/0x10 ret_from_fork_asm+0x1a/0x30 </TASK> Showing all locks held in the system: 1 lock held by khungtaskd/67: #0: ffffffff824d58e0 (rcu_read_lock){....}-{1:3}, at: debug_show_all_locks+0x3d/0x194 2 locks held by kworker/u38:1/90: #0: ffff8881000aa158 ((wq_completion)events_unbound){+.+.}-{0:0}, at: process_one_work+0x3c4/0x5f0 #1: ffffc90000c17e58 ((work_completion)(&fs_info->async_data_reclaim_work)){+.+.}-{0:0}, at: process_one_work+0x1c0/0x5f0 5 locks held by kworker/u39:1/191: #0: ffff8881000aa158 ((wq_completion)events_unbound){+.+.}-{0:0}, at: process_one_work+0x3c4/0x5f0 #1: ffffc90000dfbe58 ((work_completion)(&fs_info->reclaim_bgs_work)){+.+.}-{0:0}, at: process_one_work+0x1c0/0x5f0 #2: ffff888101da0420 (sb_writers#9){.+.+}-{0:0}, at: process_one_work+0x20a/0x5f0 #3: ffff88811040a648 (&fs_info->reclaim_bgs_lock){+.+.}-{4:4}, at: btrfs_reclaim_bgs_work+0x1de/0x770 #4: ffff888110408a18 (&fs_info->cleaner_mutex){+.+.}-{4:4}, at: btrfs_relocate_block_group+0x95a/0x20f0 1 lock held by aio-dio-write-v/980: #0: ffff888110093008 (&sb->s_type->i_mutex_key#15){++++}-{4:4}, at: btrfs_inode_lock+0x51/0xb0 ============================================= To prevent these long running reclaims from blocking the system, only reclaim 5 block_groups in the RECLAIM_ZONES state of flush_space(). Also as these reclaims are now constrained, it opens up the use for a synchronous call to brtfs_reclaim_block_groups(), eliminating the need to place the reclaim task on a workqueue and then flushing the workqueue again. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>	2026-04-07 18:56:01 +02:00
Filipe Manana	1899d9b09d	btrfs: constify arguments of some functions There are several functions that take pointer arguments but don't need to modify the objects they point to, so add the const qualifiers. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2026-04-07 18:55:58 +02:00
Johannes Thumshirn	e2a7fd2237	btrfs: zoned: add zone reclaim flush state for DATA space_info On zoned block devices, DATA block groups can accumulate large amounts of zone_unusable space (space between the write pointer and zone end). When zone_unusable reaches high levels (e.g., 98% of total space), new allocations fail with ENOSPC even though space could be reclaimed by relocating data and resetting zones. The existing flush states don't handle this scenario effectively - they either try to free cached space (which doesn't exist for zone_unusable) or reset empty zones (which doesn't help when zones contain valid data mixed with zone_unusable space). Add a new RECLAIM_ZONES flush state that triggers the block group reclaim machinery. This state: - Calls btrfs_reclaim_sweep() to identify reclaimable block groups - Calls btrfs_reclaim_bgs() to queue reclaim work - Waits for reclaim_bgs_work to complete via flush_work() - Commits the transaction to finalize changes The reclaim work (btrfs_reclaim_bgs_work) safely relocates valid data from fragmented block groups to other locations before resetting zones, converting zone_unusable space back into usable space. Insert RECLAIM_ZONES before RESET_ZONES in data_flush_states so that we attempt to reclaim partially-used block groups before falling back to resetting completely empty ones. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>	2026-04-07 18:55:55 +02:00
Filipe Manana	574d93fc62	btrfs: be less aggressive with metadata overcommit when we can do full flushing Over the years we often get reports of some -ENOSPC failure while updating metadata that leads to a transaction abort. I have seen this happen for filesystems of all sizes and with workloads that are very user/customer specific and unable to reproduce, but Aleksandar recently reported a simple way to reproduce this with a 1G filesystem and using the bonnie++ benchmark tool. The following test script reproduces the failure: $ cat test.sh #!/bin/bash # Create and use a 1G null block device, memory backed, otherwise # the test takes a very long time. modprobe null_blk nr_devices="0" null_dev="/sys/kernel/config/nullb/nullb0" mkdir "$null_dev" size=$((1 * 1024)) # in MB echo 2 > "$null_dev/submit_queues" echo "$size" > "$null_dev/size" echo 1 > "$null_dev/memory_backed" echo 1 > "$null_dev/discard" echo 1 > "$null_dev/power" DEV=/dev/nullb0 MNT=/mnt/nullb0 mkfs.btrfs -f $DEV mount $DEV $MNT mkdir $MNT/test/ bonnie++ -d $MNT/test/ -m BTRFS -u 0 -s 256M -r 128M -b umount $MNT echo 0 > "$null_dev/power" rmdir "$null_dev" When running this bonnie++ fails in the phase where it deletes test directories and files: $ ./test.sh (...) Using uid:0, gid:0. Writing a byte at a time...done Writing intelligently...done Rewriting...done Reading a byte at a time...done Reading intelligently...done start 'em...done...done...done...done...done... Create files in sequential order...done. Stat files in sequential order...done. Delete files in sequential order...done. Create files in random order...done. Stat files in random order...done. Delete files in random order...Can't sync directory, turning off dir-sync. Can't delete file 9Bq7sr0000000338 Cleaning up test directory after error. Bonnie: drastic I/O error (rmdir): Read-only file system And in the syslog/dmesg we can see the following transaction abort trace: [161915.501506] BTRFS warning (device nullb0): Skipping commit of aborted transaction. [161915.502983] ------------[ cut here ]------------ [161915.503832] BTRFS: Transaction aborted (error -28) [161915.504748] WARNING: fs/btrfs/transaction.c:2045 at btrfs_commit_transaction+0xa21/0xd30 [btrfs], CPU#11: bonnie++/3377975 [161915.506786] Modules linked in: btrfs dm_zero dm_snapshot (...) [161915.518759] CPU: 11 UID: 0 PID: 3377975 Comm: bonnie++ Tainted: G W 6.19.0-rc7-btrfs-next-224+ #4 PREEMPT(full) [161915.520857] Tainted: [W]=WARN [161915.521405] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.2-0-gea1b7a073390-prebuilt.qemu.org 04/01/2014 [161915.523414] RIP: 0010:btrfs_commit_transaction+0xa24/0xd30 [btrfs] [161915.524630] Code: 48 8b 7c 24 (...) [161915.526982] RSP: 0018:ffffd3fe8206fda8 EFLAGS: 00010292 [161915.527707] RAX: 0000000000000002 RBX: ffff8f4886d3c000 RCX: 0000000000000000 [161915.528723] RDX: 0000000002040001 RSI: 00000000ffffffe4 RDI: ffffffffc088f780 [161915.529691] RBP: ffff8f4f5adae7e0 R08: 0000000000000000 R09: ffffd3fe8206fb90 [161915.530842] R10: ffff8f4f9c1fffa8 R11: 0000000000000003 R12: 00000000ffffffe4 [161915.532027] R13: ffff8f4ef2cf2400 R14: ffff8f4f5adae708 R15: ffff8f4f62d18000 [161915.533229] FS: 00007ff93112a780(0000) GS:ffff8f4ff63ee000(0000) knlGS:0000000000000000 [161915.534611] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [161915.535575] CR2: 00005571b3072000 CR3: 0000000176080005 CR4: 0000000000370ef0 [161915.536758] Call Trace: [161915.537185] <TASK> [161915.537575] btrfs_sync_file+0x431/0x530 [btrfs] [161915.538473] do_fsync+0x39/0x80 [161915.539042] __x64_sys_fsync+0xf/0x20 [161915.539750] do_syscall_64+0x50/0xf20 [161915.540396] entry_SYSCALL_64_after_hwframe+0x76/0x7e [161915.541301] RIP: 0033:0x7ff930ca49ee [161915.541904] Code: 08 0f 85 f5 (...) [161915.544830] RSP: 002b:00007ffd94291f38 EFLAGS: 00000246 ORIG_RAX: 000000000000004a [161915.546152] RAX: ffffffffffffffda RBX: 00007ff93112a780 RCX: 00007ff930ca49ee [161915.547263] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000003 [161915.548383] RBP: 0000000000000dab R08: 0000000000000000 R09: 0000000000000000 [161915.549853] R10: 0000000000000000 R11: 0000000000000246 R12: 00007ffd94291fb0 [161915.551196] R13: 00007ffd94292350 R14: 0000000000000001 R15: 00007ffd94292340 [161915.552161] </TASK> [161915.552457] ---[ end trace 0000000000000000 ]--- [161915.553232] BTRFS info (device nullb0 state A): dumping space info: [161915.553236] BTRFS info (device nullb0 state A): space_info DATA (sub-group id 0) has 12582912 free, is not full [161915.553239] BTRFS info (device nullb0 state A): space_info total=12582912, used=0, pinned=0, reserved=0, may_use=0, readonly=0 zone_unusable=0 [161915.553243] BTRFS info (device nullb0 state A): space_info METADATA (sub-group id 0) has -5767168 free, is full [161915.553245] BTRFS info (device nullb0 state A): space_info total=53673984, used=6635520, pinned=46956544, reserved=16384, may_use=5767168, readonly=65536 zone_unusable=0 [161915.553251] BTRFS info (device nullb0 state A): space_info SYSTEM (sub-group id 0) has 8355840 free, is not full [161915.553254] BTRFS info (device nullb0 state A): space_info total=8388608, used=16384, pinned=16384, reserved=0, may_use=0, readonly=0 zone_unusable=0 [161915.553257] BTRFS info (device nullb0 state A): global_block_rsv: size 5767168 reserved 5767168 [161915.553261] BTRFS info (device nullb0 state A): trans_block_rsv: size 0 reserved 0 [161915.553263] BTRFS info (device nullb0 state A): chunk_block_rsv: size 0 reserved 0 [161915.553265] BTRFS info (device nullb0 state A): remap_block_rsv: size 0 reserved 0 [161915.553268] BTRFS info (device nullb0 state A): delayed_block_rsv: size 0 reserved 0 [161915.553270] BTRFS info (device nullb0 state A): delayed_refs_rsv: size 0 reserved 0 [161915.553272] BTRFS: error (device nullb0 state A) in cleanup_transaction:2045: errno=-28 No space left [161915.554463] BTRFS info (device nullb0 state EA): forced readonly The problem is that we allow for a very aggressive metadata overcommit, about 1/8th of the currently available space, even when the task attempting the reservation allows for full flushing. Over time this allows more and more tasks to overcommit without getting a transaction commit to release pinned extents, joining the same transaction and eventually lead to the transaction abort when attempting some tree update, as the extent allocator is not able to find any available metadata extent and it's not able to allocate a new metadata block group either (not enough unallocated space for that). Fix this by allowing the overcommit to be up to 1/64th of the available (unallocated) space instead and for that limit to apply to both types of full flushing, BTRFS_RESERVE_FLUSH_ALL and BTRFS_RESERVE_FLUSH_ALL_STEAL. This way we get more frequent transaction commits to release pinned extents in case our caller is in a context where full flushing is allowed. Note that the space infos dump in the dmesg/syslog right after the transaction abort give the wrong idea that we have plenty of unallocated space when the abort happened. During the bonnie++ workload we had a metadata chunk allocation attempt and it failed with -ENOSPC because at that time we had a bunch of data block groups allocated, which then became empty and got deleted by the cleaner kthread after the metadata chunk allocation failed with -ENOSPC and before the transaction abort happened and dumped the space infos. The custom tracing (some trace_printk() calls spread in strategic places) used to check that: mount-1793735 [011] ...1. 28877.261096: btrfs_add_bg_to_space_info: added bg offset 13631488 length 8388608 flags 1 to space_info->flags 1 total_bytes 8388608 bytes_used 0 bytes_may_use 0 mount-1793735 [011] ...1. 28877.261098: btrfs_add_bg_to_space_info: added bg offset 22020096 length 8388608 flags 34 to space_info->flags 2 total_bytes 8388608 bytes_used 16384 bytes_may_use 0 mount-1793735 [011] ...1. 28877.261100: btrfs_add_bg_to_space_info: added bg offset 30408704 length 53673984 flags 36 to space_info->flags 4 total_bytes 53673984 bytes_used 131072 bytes_may_use 0 These are from loading the block groups created by mkfs during mount. Then when bonnie++ starts doing its thing: kworker/u48:5-1792004 [011] ..... 28886.122050: btrfs_create_chunk: gather_device_info 1 ctl->dev_extent_min = 65536 dev_extent_want 1073741824 kworker/u48:5-1792004 [011] ..... 28886.122053: btrfs_create_chunk: gather_device_info 2 ctl->dev_extent_min = 65536 dev_extent_want 1073741824 max_avail 927596544 kworker/u48:5-1792004 [011] ..... 28886.122055: btrfs_make_block_group: make bg offset 84082688 size 117440512 type 1 kworker/u48:5-1792004 [011] ...1. 28886.122064: btrfs_add_bg_to_space_info: added bg offset 84082688 length 117440512 flags 1 to space_info->flags 1 total_bytes 125829120 bytes_used 0 bytes_may_use 5251072 First allocation of a data block group of 112M. kworker/u48:5-1792004 [011] ..... 28886.192408: btrfs_create_chunk: gather_device_info 1 ctl->dev_extent_min = 65536 dev_extent_want 1073741824 kworker/u48:5-1792004 [011] ..... 28886.192413: btrfs_create_chunk: gather_device_info 2 ctl->dev_extent_min = 65536 dev_extent_want 1073741824 max_avail 810156032 kworker/u48:5-1792004 [011] ..... 28886.192415: btrfs_make_block_group: make bg offset 201523200 size 117440512 type 1 kworker/u48:5-1792004 [011] ...1. 28886.192425: btrfs_add_bg_to_space_info: added bg offset 201523200 length 117440512 flags 1 to space_info->flags 1 total_bytes 243269632 bytes_used 0 bytes_may_use 122691584 Another 112M data block group allocated. kworker/u48:5-1792004 [011] ..... 28886.260935: btrfs_create_chunk: gather_device_info 1 ctl->dev_extent_min = 65536 dev_extent_want 1073741824 kworker/u48:5-1792004 [011] ..... 28886.260941: btrfs_create_chunk: gather_device_info 2 ctl->dev_extent_min = 65536 dev_extent_want 1073741824 max_avail 692715520 kworker/u48:5-1792004 [011] ..... 28886.260943: btrfs_make_block_group: make bg offset 318963712 size 117440512 type 1 kworker/u48:5-1792004 [011] ...1. 28886.260954: btrfs_add_bg_to_space_info: added bg offset 318963712 length 117440512 flags 1 to space_info->flags 1 total_bytes 360710144 bytes_used 0 bytes_may_use 240132096 Yet another one. bonnie++-1793755 [010] ..... 28886.280407: btrfs_create_chunk: gather_device_info 1 ctl->dev_extent_min = 65536 dev_extent_want 1073741824 bonnie++-1793755 [010] ..... 28886.280412: btrfs_create_chunk: gather_device_info 2 ctl->dev_extent_min = 65536 dev_extent_want 1073741824 max_avail 575275008 bonnie++-1793755 [010] ..... 28886.280414: btrfs_make_block_group: make bg offset 436404224 size 117440512 type 1 bonnie++-1793755 [010] ...1. 28886.280419: btrfs_add_bg_to_space_info: added bg offset 436404224 length 117440512 flags 1 to space_info->flags 1 total_bytes 478150656 bytes_used 0 bytes_may_use 268435456 One more. kworker/u48:5-1792004 [011] ..... 28886.566233: btrfs_create_chunk: gather_device_info 1 ctl->dev_extent_min = 65536 dev_extent_want 1073741824 kworker/u48:5-1792004 [011] ..... 28886.566238: btrfs_create_chunk: gather_device_info 2 ctl->dev_extent_min = 65536 dev_extent_want 1073741824 max_avail 457834496 kworker/u48:5-1792004 [011] ..... 28886.566241: btrfs_make_block_group: make bg offset 553844736 size 117440512 type 1 kworker/u48:5-1792004 [011] ...1. 28886.566250: btrfs_add_bg_to_space_info: added bg offset 553844736 length 117440512 flags 1 to space_info->flags 1 total_bytes 595591168 bytes_used 268435456 bytes_may_use 209723392 Another one. bonnie++-1793755 [009] ..... 28886.613446: btrfs_create_chunk: gather_device_info 1 ctl->dev_extent_min = 65536 dev_extent_want 1073741824 bonnie++-1793755 [009] ..... 28886.613451: btrfs_create_chunk: gather_device_info 2 ctl->dev_extent_min = 65536 dev_extent_want 1073741824 max_avail 340393984 bonnie++-1793755 [009] ..... 28886.613453: btrfs_make_block_group: make bg offset 671285248 size 117440512 type 1 bonnie++-1793755 [009] ...1. 28886.613458: btrfs_add_bg_to_space_info: added bg offset 671285248 length 117440512 flags 1 to space_info->flags 1 total_bytes 713031680 bytes_used 268435456 bytes_may_use 2 68435456 Another one. bonnie++-1793755 [009] ..... 28886.674953: btrfs_create_chunk: gather_device_info 1 ctl->dev_extent_min = 65536 dev_extent_want 1073741824 bonnie++-1793755 [009] ..... 28886.674957: btrfs_create_chunk: gather_device_info 2 ctl->dev_extent_min = 65536 dev_extent_want 1073741824 max_avail 222953472 bonnie++-1793755 [009] ..... 28886.674959: btrfs_make_block_group: make bg offset 788725760 size 117440512 type 1 bonnie++-1793755 [009] ...1. 28886.674963: btrfs_add_bg_to_space_info: added bg offset 788725760 length 117440512 flags 1 to space_info->flags 1 total_bytes 830472192 bytes_used 268435456 bytes_may_use 1 34217728 Another one. bonnie++-1793755 [009] ..... 28886.674981: btrfs_create_chunk: gather_device_info 1 ctl->dev_extent_min = 65536 dev_extent_want 1073741824 bonnie++-1793755 [009] ..... 28886.674982: btrfs_create_chunk: gather_device_info 2 ctl->dev_extent_min = 65536 dev_extent_want 1073741824 max_avail 105512960 bonnie++-1793755 [009] ..... 28886.674983: btrfs_make_block_group: make bg offset 906166272 size 105512960 type 1 bonnie++-1793755 [009] ...1. 28886.674984: btrfs_add_bg_to_space_info: added bg offset 906166272 length 105512960 flags 1 to space_info->flags 1 total_bytes 935985152 bytes_used 268435456 bytes_may_use 67108864 Another one, but a bit smaller (~100.6M) since we now have less space. bonnie++-1793758 [009] ..... 28891.962096: btrfs_create_chunk: gather_device_info 1 ctl->dev_extent_min = 65536 dev_extent_want 1073741824 bonnie++-1793758 [009] ..... 28891.962103: btrfs_create_chunk: gather_device_info 2 ctl->dev_extent_min = 65536 dev_extent_want 1073741824 max_avail 12582912 bonnie++-1793758 [009] ..... 28891.962105: btrfs_make_block_group: make bg offset 1011679232 size 12582912 type 1 bonnie++-1793758 [009] ...1. 28891.962114: btrfs_add_bg_to_space_info: added bg offset 1011679232 length 12582912 flags 1 to space_info->flags 1 total_bytes 948568064 bytes_used 268435456 bytes_may_use 8192 Another one, this one even smaller (12M). kworker/u48:5-1792004 [011] ..... 28892.112802: btrfs_chunk_alloc: enter first metadata chunk alloc attempt kworker/u48:5-1792004 [011] ..... 28892.112805: btrfs_create_chunk: gather_device_info 1 ctl->dev_extent_min = 131072 dev_extent_want 536870912 kworker/u48:5-1792004 [011] ..... 28892.112806: btrfs_create_chunk: gather_device_info 2 ctl->dev_extent_min = 131072 dev_extent_want 536870912 max_avail 0 536870912 is 512M, the standard 256M metadata chunk size times 2 because of the DUP profile for metadata. 'max_avail' is what find_free_dev_extent() returns to us in gather_device_info(). As a result, gather_device_info() sets ctl->ndevs to 0, making decide_stripe_size() fail with -ENOSPC, and therefore metadata chunk allocation fails while we are attempting to run delayed items during the transaction commit. kworker/u48:5-1792004 [011] ..... 28892.112807: btrfs_create_chunk: decide_stripe_size fail -ENOSPC In the syslog/dmesg pasted above, which happened after the transaction was aborted, the space info dumps did not account for all these data block groups that were allocated during bonnie++'s workload. And that is because after the metadata chunk allocation failed with -ENOSPC and before the transaction abort happened, most of the data block groups had become empty and got deleted by by the cleaner kthread - when the abort happened, we had bonnie++ in the middle of deleting the files it created. But dumping the space infos right after the metadata chunk allocation fails by adding a call to btrfs_dump_space_info_for_trans_abort() in decide_stripe_size() when it returns -ENOSPC, we get: [29972.409295] BTRFS info (device nullb0): dumping space info: [29972.409300] BTRFS info (device nullb0): space_info DATA (sub-group id 0) has 673341440 free, is not full [29972.409303] BTRFS info (device nullb0): space_info total=948568064, used=0, pinned=275226624, reserved=0, may_use=0, readonly=0 zone_unusable=0 [29972.409305] BTRFS info (device nullb0): space_info METADATA (sub-group id 0) has 3915776 free, is not full [29972.409306] BTRFS info (device nullb0): space_info total=53673984, used=163840, pinned=42827776, reserved=147456, may_use=6553600, readonly=65536 zone_unusable=0 [29972.409308] BTRFS info (device nullb0): space_info SYSTEM (sub-group id 0) has 7979008 free, is not full [29972.409310] BTRFS info (device nullb0): space_info total=8388608, used=16384, pinned=0, reserved=0, may_use=393216, readonly=0 zone_unusable=0 [29972.409311] BTRFS info (device nullb0): global_block_rsv: size 5767168 reserved 5767168 [29972.409313] BTRFS info (device nullb0): trans_block_rsv: size 0 reserved 0 [29972.409314] BTRFS info (device nullb0): chunk_block_rsv: size 393216 reserved 393216 [29972.409315] BTRFS info (device nullb0): remap_block_rsv: size 0 reserved 0 [29972.409316] BTRFS info (device nullb0): delayed_block_rsv: size 0 reserved 0 So here we see there's ~904.6M of data space, ~51.2M of metadata space and 8M of system space, making a total of 963.8M. Reported-by: Aleksandar Gerasimovski <Aleksandar.Gerasimovski@belden.com> Link: https://lore.kernel.org/linux-btrfs/SA1PR18MB56922F690C5EC2D85371408B998FA@SA1PR18MB5692.namprd18.prod.outlook.com/ Link: https://lore.kernel.org/linux-btrfs/CAL3q7H61vZ3_+eqJ1A9po2WcgNJJjUu9MJQoYB2oDSAAecHaug@mail.gmail.com/ Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2026-04-07 18:55:53 +02:00
Qu Wenruo	2672a26a75	btrfs: use per-profile available space in calc_available_free_space() For the following disk layout, can_overcommit() can cause false confidence in available space: devid 1 unallocated: 1GiB devid 2 unallocated: 50GiB metadata type: RAID1 As can_overcommit() simply uses unallocated space with factor to calculate the allocatable metadata chunk size, resulting 25.5GiB available space. But in reality we can only allocate one 1GiB RAID1 chunk, the remaining 49GiB on devid 2 will never be utilized to fulfill a RAID1 chunk. This leads to various ENOSPC related transaction abort and flips the fs read-only. Now use per-profile available space in calc_available_free_space(), and only when that failed we fall back to the old factor based estimation. And for zoned devices or for the very low chance of temporary memory allocation failure, we will still fallback to factor based estimation. But I hope in reality it's very rare. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2026-04-07 18:55:53 +02:00
Linus Torvalds	e0b38d286e	for-7.0-rc3-tag -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmmytygACgkQxWXV+ddt WDuiKw/+IyoS2IE95Jd4G2hezcsaqJNAvv5q90vVyLepT/7jzRbYrHytk7z/0v56 +Pjc1JHgp+TidYKZa52E/R09eEvYCyuZvEq4bnClOnO3CAJDCixqTKWV70CcYiHX HoSCuPML2CJMLZY3u6MxATgk46y1ey3UkbmQ4oufoUHrjAE+D3pDQsYQ9BWR/P6J 4LbyZ214uIUPvbp0wPJ14cVUMpNxs116JmWvK9dxWQNZSzFcq2IVuLwQUcZBPAdb dursd0kt6HXXmNCITmUD8O+ipG4U1akn8FuCjaLqo4LF1AH2f6OzNjucsisfa1tE 4MD+ATzmNsew5q6dtyfcvv8Btl+olbTP4KGibLl9PCCM9vlkzl3EN5GPUllGi6S9 T8jqe2pILXZHEx1DIQjHaXJsnuHG9aEkUbkSHUxIa15QDj66omrZZkZG3EF5Buy9 TogJJgESYU19dt/y110Q/vD/LPOgJhB3NBXwIx1FDYA1OwaAjt6hUcVRVRI5PtpL moAIG4nNrPz5Pa875NvguFmrXFIudpTqyANHCPVio4eqDR7LeSg9vVHj9zeOfeRD gmn3WGuGNb6L+86TNet/nFQhjxhkKqsnSbzO2zoPQKhG6FS+HSLnUwvJYL4/YnkW QEeveeI/6Duiei4DHuw7FXmZVNQxeBEWW6MTFDHA1UNV3HMKewA= =TgSD -----END PGP SIGNATURE----- Merge tag 'for-7.0-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: - detect possible file name hash collision earlier so it does not lead to transaction abort - handle b-tree leaf overflows when snapshotting a subvolume with set received UUID, leading to transaction abort - in zoned mode, reorder relocation block group initialization after the transaction kthread start - fix orphan cleanup state tracking of subvolume, this could lead to invalid dentries under some conditions - add locking around updates of dynamic reclain state update - in subpage mode, add missing RCU unlock when trying to releae extent buffer - remap tree fixes: - add missing description strings for the newly added remap tree - properly update search key when iterating backrefs * tag 'for-7.0-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: remove duplicated definition of btrfs_printk_in_rcu() btrfs: remove unnecessary transaction abort in the received subvol ioctl btrfs: abort transaction on failure to update root in the received subvol ioctl btrfs: fix transaction abort on set received ioctl due to item overflow btrfs: fix transaction abort when snapshotting received subvolumes btrfs: fix transaction abort on file creation due to name hash collision btrfs: read key again after incrementing slot in move_existing_remaps() btrfs: add missing RCU unlock in error path in try_release_subpage_extent_buffer() btrfs: set BTRFS_ROOT_ORPHAN_CLEANUP during subvol create btrfs: zoned: move btrfs_zoned_reserve_data_reloc_bg() after kthread start btrfs: hold space_info->lock when clearing periodic reclaim ready btrfs: print-tree: add remap tree definitions	2026-03-12 12:15:27 -07:00
Sun YangKai	b8883b61f2	btrfs: hold space_info->lock when clearing periodic reclaim ready btrfs_set_periodic_reclaim_ready() requires space_info->lock to be held, as enforced by lockdep_assert_held(). However, btrfs_reclaim_sweep() was calling it after do_reclaim_sweep() returns, at which point space_info->lock is no longer held. Fix this by explicitly acquiring space_info->lock before clearing the periodic reclaim ready flag in btrfs_reclaim_sweep(). Reported-by: Chris Mason <clm@meta.com> Link: https://lore.kernel.org/linux-btrfs/20260208182556.891815-1-clm@meta.com/ Fixes: `19eff93dc7` ("btrfs: fix periodic reclaim condition") Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Sun YangKai <sunk67188@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2026-03-03 15:54:00 +01:00
Kees Cook	69050f8d6d	treewide: Replace kmalloc with kmalloc_obj for non-scalar types This is the result of running the Coccinelle script from scripts/coccinelle/api/kmalloc_objs.cocci. The script is designed to avoid scalar types (which need careful case-by-case checking), and instead replace kmalloc-family calls that allocate struct or union object instances: Single allocations: kmalloc(sizeof(TYPE), ...) are replaced with: kmalloc_obj(TYPE, ...) Array allocations: kmalloc_array(COUNT, sizeof(TYPE), ...) are replaced with: kmalloc_objs(TYPE, COUNT, ...) Flex array allocations: kmalloc(struct_size(PTR, FAM, COUNT), ...) are replaced with: kmalloc_flex(PTR, FAM, COUNT, ...) (where TYPE may also be VAR) The resulting allocations no longer return "void ", instead returning "TYPE ". Signed-off-by: Kees Cook <kees@kernel.org>	2026-02-21 01:02:28 -08:00
Filipe Manana	5eb01bf4a9	btrfs: remove out label in btrfs_init_space_info() There is no point in having the label since all it does is return the value in the 'ret' variable. Instead make every goto return directly and remove the label. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2026-02-03 07:56:23 +01:00
Sun YangKai	19eff93dc7	btrfs: fix periodic reclaim condition Problems with current implementation: 1. reclaimable_bytes is signed while chunk_sz is unsigned, causing negative reclaimable_bytes to trigger reclaim unexpectedly 2. The "space must be freed between scans" assumption breaks the two-scan requirement: first scan marks block groups, second scan reclaims them. Without the second scan, no reclamation occurs. Instead, track actual reclaim progress: pause reclaim when block groups will be reclaimed, and resume only when progress is made. This ensures reclaim continues until no further progress can be made. And resume periodic reclaim when there's enough free space. And we take care if reclaim is making any progress now, so it's unnecessary to set periodic_reclaim_ready to false when failed to reclaim a block group. Fixes: `813d4c6422` ("btrfs: prevent pathological periodic reclaim loops") CC: stable@vger.kernel.org # 6.12+ Suggested-by: Boris Burkov <boris@bur.io> Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Sun YangKai <sunk67188@gmail.com> Signed-off-by: David Sterba <dsterba@suse.com>	2026-02-03 07:54:37 +01:00
Mark Harmstone	b56f35560b	btrfs: handle setting up relocation of block group with remap-tree Handle the preliminary work for relocating a block group in a filesystem with the remap-tree flag set. If the block group is SYSTEM btrfs_relocate_block_group() proceeds as it does already, as bootstrapping issues mean that these block groups have to be processed the existing way. Similarly with METADATA_REMAP blocks, which are dealt with in a later patch. Otherwise we walk the free-space tree for the block group in question, recording any holes. These get converted into identity remaps and placed in the remap tree, and the block group's REMAPPED flag is set. From now on no new allocations are possible within this block group, and any I/O to it will be funnelled through btrfs_translate_remap(). We store the number of identity remaps in `identity_remap_count`, so that we know when we've removed the last one and the block group is fully remapped. The change in btrfs_read_roots() is because data relocations no longer rely on the data reloc tree as a hidden subvolume in which to do snapshots. (Thanks to Sun YangKai for his suggestions.) Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Mark Harmstone <mark@harmstone.com> Signed-off-by: David Sterba <dsterba@suse.com>	2026-02-03 07:54:35 +01:00
Mark Harmstone	0b4d29fa98	btrfs: add METADATA_REMAP chunk type Add a new METADATA_REMAP chunk type, which is a metadata chunk that holds the remap tree. This is needed for bootstrapping purposes: the remap tree can't itself be remapped, and must be relocated the existing way, by COWing every leaf. The remap tree can't go in the SYSTEM chunk as space there is limited, because a copy of the chunk item gets placed in the superblock. The changes in fs/btrfs/volumes.h are because we're adding a new block group type bit after the profile bits, and so can no longer rely on the const_ilog2 trick. The sizing to 32MB per chunk, matching the SYSTEM chunk, is an estimate here, we can adjust it later if it proves to be too big or too small. This works out to be ~500,000 remap items, which for a 4KB block size covers ~2GB of remapped data in the worst case and ~500TB in the best case. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Mark Harmstone <mark@harmstone.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2026-02-03 07:54:27 +01:00
Filipe Manana	c208aa0ef6	btrfs: add and use helper to compute the available space for a block group We have currently three places that compute how much available space a block group has. Add a helper function for this and use it in those places. Reviewed-by: Boris Burkov <boris@bur.io> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2026-02-03 07:51:44 +01:00
Johannes Thumshirn	2ef2e97fe7	btrfs: move space_info_flag_to_str() to space-info.h Move space_info_flag_to_str() to space-info.h and as it now isn't static to space-info.c any more prefix it with 'btrfs_'. This way it can be re-used in other places. Reviewed-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2026-02-03 07:49:12 +01:00
Jiasheng Jiang	a11224a016	btrfs: fix memory leaks in create_space_info() error paths In create_space_info(), the 'space_info' object is allocated at the beginning of the function. However, there are two error paths where the function returns an error code without freeing the allocated memory: 1. When create_space_info_sub_group() fails in zoned mode. 2. When btrfs_sysfs_add_space_info_type() fails. In both cases, 'space_info' has not yet been added to the fs_info->space_info list, resulting in a memory leak. Fix this by adding an error handling label to kfree(space_info) before returning. Fixes: `2be12ef79f` ("btrfs: Separate space_info create/update") Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Jiasheng Jiang <jiashengjiangcool@gmail.com> Signed-off-by: David Sterba <dsterba@suse.com>	2026-01-12 16:21:55 +01:00
David Sterba	1c094e6cce	btrfs: make a few more ASSERTs verbose We have support for optional string to be printed in ASSERT() (added in `19468a623a` ("btrfs: enhance ASSERT() to take optional format string")), it's not yet everywhere it could be so add a few more files. Signed-off-by: David Sterba <dsterba@suse.com>	2025-11-24 22:42:24 +01:00
Filipe Manana	fe1e50031f	btrfs: move struct reserve_ticket definition to space-info.c It's not used anywhere outside space-info.c so move it from space-info.h into space-info.c. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-11-24 22:42:23 +01:00
Qu Wenruo	c5667f9c8e	btrfs: headers cleanup to remove unnecessary local includes [BUG] When I tried to remove btrfs_bio::fs_info and use btrfs_bio::inode to grab the fs_info, the header "btrfs_inode.h" is needed to access the full btrfs_inode structure. Then btrfs will fail to compile. [CAUSE] There is a recursive including chain: "bio.h" -> "btrfs_inode.h" -> "extent_map.h" -> "compression.h" -> "bio.h" That recursive including is causing problems for btrfs. [ENHANCEMENT] To reduce the risk of recursive including: - Remove unnecessary local includes from btrfs headers Either the included header is pulled in by other headers, or is completely unnecessary. - Remove btrfs local includes if the header only requires a pointer In that case let the implementing C file to pull the required header. This is especially important for headers like "btrfs_inode.h" which pulls in a lot of other btrfs headers, thus it's a mine field of recursive including. - Remove unnecessary temporary structure definition Either if we have included the header defining the structure, or completely unused. Now including "btrfs_inode.h" inside "bio.h" is completely fine, although "btrfs_inode.h" still includes "extent_map.h", but that header only includes "fs.h", no more chain back to "bio.h". Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-11-24 22:34:52 +01:00
Filipe Manana	38e03b820e	btrfs: annotate as unlikely fs aborted checks in space flushing code It's not expected to have the fs in an aborted state, so surround the abortion checks with unlikely to make it clear it's unexpected and to hint the compiler to generate better code. Also at maybe_fail_all_tickets() untangle all repeated checks for the abortion into a single if-then-else. This makes things more readable and makes the compiler generate less code. On x86_64 with gcc 14.2.0-19 from Debian I got the following object size differences. Before this change: $ size fs/btrfs/btrfs.ko text data bss dec hex filename 2021606 179704 25088 2226398 21f8de fs/btrfs/btrfs.ko After this change: $ size fs/btrfs/btrfs.ko text data bss dec hex filename 2021458 179704 25088 2226250 21f84a fs/btrfs/btrfs.ko Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-11-24 22:20:29 +01:00
Filipe Manana	f912f0af13	btrfs: avoid space_info locking when checking if tickets are served When checking if a ticket was served, we take the space_info's spinlock. If the ticket was served (its ->bytes is 0) or had an error (its ->error it not 0) then we just unlock the space_info and return. This however causes contention on the space_info's spinlock, which is heavily used (space reservation, space flushing, allocating and deallocating an extent from a block group (btrfs_update_block_group()), etc). Instead of using the space_info's spinlock to check if a ticket was served, use a per ticket spinlock which isn't used by anyone other than the task that created the ticket (stack allocated) and the task that serves the ticket (a reclaim task or any task deallocating space that ends up at btrfs_try_granting_tickets()). After applying this patch and all previous patches from the same patchset (many attempt to reduce space_info critical sections), lockstat showed some improvements for a fs_mark test regarding the space_info's spinlock 'lock'. The lockstat results: Before patchset: con-bounces: 13733858 contentions: 15902322 waittime-total: 264902529.72 acq-bounces: 28161791 acquisitions: 38679282 After patchset: con-bounces: 12032220 contentions: 13598034 waittime-total: 221806127.28 acq-bounces: 24717947 acquisitions: 34103281 Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-11-24 22:18:21 +01:00
Filipe Manana	50a51b5378	btrfs: move ticket wakeup and finalization to remove_ticket() Instead of repeating the wakeup and setup of the ->bytes or ->error field, move those steps to remove_ticket() to avoid duplication. This is also needed for the next patch in the series, so that we avoid duplicating more logic. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-11-24 22:18:17 +01:00
Filipe Manana	cdf8a566ee	btrfs: add data_race() in btrfs_account_ro_block_groups_free_space() Surround the intentional empty list check with the data_race() annotation so that tools like KCSAN don't report a data race. The race is intentional as it's harmless and we want to avoid lock contention of the space_info since its lock is heavily used (space reservation, space flushing, extent allocation and deallocation, etc). Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-11-24 22:16:27 +01:00
Filipe Manana	b70c32f10a	btrfs: remove double underscore prefix from __reserve_bytes() The use of a double underscore prefix is discouraged and we have no justification at all for it all since there's no reserved_bytes() counter part. So remove the prefix. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-11-24 22:13:00 +01:00
Filipe Manana	189db25105	btrfs: process ticket outside global reserve critical section In steal_from_global_rsv() there's no need to process the ticket inside the critical section of the global reserve. Move the ticket processing to happen after the critical section. This helps reduce contention on the global reserve's spinlock. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-11-24 22:12:37 +01:00
Filipe Manana	5ca7725ddf	btrfs: assign booleans to global reserve's full field We have a couple places that are assigning 0 and 1 to the full field of the global reserve. This is harmless since 0 is converted to false and 1 converted to true, but for better readability, replace these with true and false since the field is of type bool. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-11-24 22:12:07 +01:00
Filipe Manana	f18a203a1b	btrfs: assert space_info is locked in steal_from_global_rsv() The caller is supposed to have locked the space_info, so assert that. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-11-24 22:11:40 +01:00
Filipe Manana	afbc047ab0	btrfs: avoid unnecessary reclaim calculation in priority_reclaim_metadata_space() If the given ticket was already served (its ->bytes is 0), then we wasted time calculating the metadata reclaim size. So calculate it only after we checked the ticket was not yet served. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-11-24 22:10:18 +01:00
Filipe Manana	4ddb077378	btrfs: shorten critical section in btrfs_preempt_reclaim_metadata_space() We are doing a lot of small calculations and assignments while holding the space_info's spinlock, which is a heavily used lock for space reservation and flushing. There's no point in holding the lock for so long when all we want is to call need_preemptive_reclaim() and get a consistent value for a couple of counters from the space_info. Instead, grab the counters into local variables, release the lock and then use the local variables. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-11-24 22:09:46 +01:00
Filipe Manana	8ab2b8bdbe	btrfs: increment loop count outside critical section during metadata reclaim In btrfs_preempt_reclaim_metadata_space() there's no need to increment the local variable that tracks the number of iterations of the while loop while inside the critical section delimited by the space_info's spinlock. That spinlock is heavily used by space reservation and flushing code, so it's desirable to have its critical sections as short as possible. So move the loop count incremented outside the critical section. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-11-24 22:09:14 +01:00
Filipe Manana	49f204be22	btrfs: bail out earlier from need_preemptive_reclaim() if we have tickets Instead of doing some calculations and then return false if it turns out we have queued tickets, check first if we have tickets and return false immediately if we have tickets, without wasting time on doing those computations. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-11-24 22:08:46 +01:00
Filipe Manana	6f4779faa0	btrfs: inline btrfs_space_info_used() The function is simple enough to be inlined and in fact doing it even reduces the object code. In x86_64 with gcc 14.2.0-19 from Debian the results were the following: Before this change $ size fs/btrfs/btrfs.ko text data bss dec hex filename 1919410 161703 15592 2096705 1ffe41 fs/btrfs/btrfs.ko After this change $ size fs/btrfs/btrfs.ko text data bss dec hex filename 1918991 161675 15592 2096258 1ffc82 fs/btrfs/btrfs.ko Also remove the ASSERT() that checks the space_info argument is not NULL, as it's odd to be there since it can never be NULL and in case that ever happens during development, a stack trace from a NULL pointer dereference will be obvious. It was originally added when btrfs_space_info_used() was introduced in commit `4136135b08` ("Btrfs: use helper to get used bytes of space_info"). Also add a lockdep assertion to check the space_info's lock is being held by the calling task. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-11-24 22:08:14 +01:00
Filipe Manana	0ce6300fec	btrfs: avoid used space computation when reserving space In __reserve_bytes() we have 3 repeated calls to btrfs_space_info_used(), one early on as soon as take the space_info's spinlock, another one when we call btrfs_can_overcommit(), which calls btrfs_space_info_used() again, and a final one when we are reserving for a flush emergency. During all these calls we are holding the space_info's spinlock, which is heavily used by the space reservation and flushing code, so it's desirable to make the critical sections as short as possible. So make this more efficient by: 1) Instead of calling btrfs_can_overcommit() call the new variant can_overcommit() which takes the space_info's used space as an argument and pass the value we already computed and have in the 'used' variable; 2) Instead of calling btrfs_space_info_used() with its second argument as false when we are doing a flush emergency, decrement the space_info's bytes_may_use counter from the 'used' variable, as the difference between passing true or false as the second argument to btrfs_space_info_used() is whether or not to include the space_info's bytes_may_use counter in the computation. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-11-24 22:07:36 +01:00
Filipe Manana	a5f8f64aa3	btrfs: avoid used space computation when trying to grant tickets In btrfs_try_granting_tickets(), we call btrfs_can_overcommit() and that calls btrfs_space_info_used(). But we already keep track, in the 'used' local variable, of the used space in the space_info, so we are just repeating the same computation and doing an extra function call while we are holding the space_info's spinlock, which is heavily used by the space reservation and flushing code. So add a local variant of btrfs_can_overcommit() that takes in the used space as an argument and therefore does not call btrfs_space_info_used(), and use it in btrfs_try_granting_tickets(). Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-11-24 22:07:26 +01:00
Filipe Manana	563ef2befb	btrfs: make btrfs_can_overcommit() return bool instead of int It's a boolean function, so switch its return type to bool. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-11-24 22:06:26 +01:00
Filipe Manana	60532c2136	btrfs: avoid recomputing used space in btrfs_try_granting_tickets() In every iteration of the loop we call btrfs_space_info_used() which sums a bunch of fields from a space_info object. This implies doing a function call besides the sum, and we are holding the space_info's spinlock while we do this, so we want to keep the critical section as short as possible since that spinlock is used in all the code for space reservation and flushing (therefore it's heavily used). So call btrfs_try_granting_tickets() only once, before entering the loop, and then update it as we remove tickets. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-11-24 22:05:06 +01:00
Filipe Manana	063171a4f0	btrfs: return real error when failing tickets in maybe_fail_all_tickets() In case we had a transaction abort we set a ticket's error to -EIO, but we have the real error that caused the transaction to be aborted returned by the macro BTRFS_FS_ERROR(). So use that real error instead of -EIO. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-11-24 22:04:30 +01:00
Filipe Manana	771af6ff72	btrfs: remove fs_info argument from btrfs_sysfs_add_space_info_type() We don't need it since we can grab fs_info from the given space_info. So remove the fs_info argument. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-11-24 22:02:30 +01:00
Filipe Manana	a1359d06d7	btrfs: remove fs_info argument from btrfs_reserve_metadata_bytes() We don't need it since we can grab fs_info from the given space_info. So remove the fs_info argument. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <asj@kernel.org> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-11-24 21:59:11 +01:00
Filipe Manana	30b87a2319	btrfs: remove fs_info argument from __reserve_bytes() We don't need it since we can grab fs_info from the given space_info. So remove the fs_info argument. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <asj@kernel.org> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-11-24 21:59:11 +01:00
Filipe Manana	09d0f28531	btrfs: fix parameter documentation for btrfs_reserve_data_bytes() We don't have a fs_info argument anymore since commit `5d39fda880` ("btrfs: pass btrfs_space_info to btrfs_reserve_data_bytes()"), it was replaced by a space_info argument. So update the documentation. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <asj@kernel.org> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-11-24 21:59:11 +01:00
Filipe Manana	5495cbe920	btrfs: remove fs_info argument from maybe_clamp_preempt() We don't need it since we can grab fs_info from the given space_info. So remove the fs_info argument. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <asj@kernel.org> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-11-24 21:59:11 +01:00
Filipe Manana	e182eca6ed	btrfs: remove fs_info argument from handle_reserve_ticket() We don't need it since we can grab fs_info from the given space_info. So remove the fs_info argument. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <asj@kernel.org> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-11-24 21:59:11 +01:00
Filipe Manana	ddeac2a12b	btrfs: remove fs_info argument from steal_from_global_rsv() We don't need it since we can grab fs_info from the given space_info. So remove the fs_info argument. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <asj@kernel.org> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-11-24 21:59:10 +01:00
Filipe Manana	d77b22de56	btrfs: remove fs_info argument from need_preemptive_reclaim() We don't need it since we can grab fs_info from the given space_info. So remove the fs_info argument. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <asj@kernel.org> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-11-24 21:59:10 +01:00
Filipe Manana	4199eb2761	btrfs: remove fs_info argument from btrfs_calc_reclaim_metadata_size() We don't need it since we can grab fs_info from the given space_info. So remove the fs_info argument. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <asj@kernel.org> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-11-24 21:59:10 +01:00
Filipe Manana	3ee1246536	btrfs: remove fs_info argument from shrink_delalloc() and flush_space() We don't need it since we can grab fs_info from the given space_info. So remove the fs_info argument. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <asj@kernel.org> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-11-24 21:59:10 +01:00
Filipe Manana	e96059c9d7	btrfs: remove fs_info argument from btrfs_dump_space_info() We don't need it since we can grab fs_info from the given space_info. So remove the fs_info argument. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <asj@kernel.org> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-11-24 21:59:10 +01:00
Filipe Manana	78a77f4da4	btrfs: remove fs_info argument from btrfs_can_overcommit() We don't need it since we can grab fs_info from the given space_info. So remove the fs_info argument. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <asj@kernel.org> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-11-24 21:59:10 +01:00

1 2 3 4 5

250 Commits