linux

mirror of https://github.com/torvalds/linux.git synced 2026-06-04 12:35:52 +02:00

Author	SHA1	Message	Date
Filipe Manana	b48c980b6a	btrfs: fix deadlock between reflink and transaction commit when using flushoncommit When using the flushoncommit mount option, we can have a deadlock between a transaction commit and a reflink operation that copied an inline extent to an offset beyond the current i_size of the destination inode. The deadlock happens like this: 1) Task A clones an inline extent from inode X to an offset of inode Y that is beyond Y's current i_size. This means we copied the inline extent's data to a folio of inode Y that is beyond its EOF, using a call to copy_inline_to_page(); 2) Task B starts a transaction commit and calls btrfs_start_delalloc_flush() to flush delalloc; 3) The delalloc flushing sees the new dirty folio of inode Y and when it attempts to flush it, it ends up at extent_writepage() and sees that the offset of the folio is beyond the i_size of inode Y, so it attempts to invalidate the folio by calling folio_invalidate(), which ends up at btrfs' folio invalidate callback - btrfs_invalidate_folio(). There it tries to lock the folio's range in inode Y's extent io tree, but it blocks since it's currently locked by task A - during a reflink we lock the inodes and the source and destination ranges after flushing all delalloc and waiting for ordered extent completion - after that we don't expect to have dirty folios in the ranges, the exception is if we have to copy an inline extent's data (because the destination offset is not zero); 4) Task A then attempts to start a transaction to update the inode item, and then it's blocked since the current transaction is in the TRANS_STATE_COMMIT_START state. Therefore task A has to wait for the current transaction to become unblocked (its state >= TRANS_STATE_UNBLOCKED). So task A is waiting for the transaction commit done by task B, and the latter is waiting on the extent lock of inode Y that is currently held by task A. 
Syzbot recently reported this with the following stack traces: INFO: task kworker/u8:7:1053 blocked for more than 143 seconds. Not tainted syzkaller #0 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. task:kworker/u8:7 state:D stack:23520 pid:1053 tgid:1053 ppid:2 task_flags:0x4208060 flags:0x00080000 Workqueue: writeback wb_workfn (flush-btrfs-46) Call Trace: <TASK> context_switch kernel/sched/core.c:5298 [inline] __schedule+0x1553/0x5240 kernel/sched/core.c:6911 __schedule_loop kernel/sched/core.c:6993 [inline] schedule+0x164/0x360 kernel/sched/core.c:7008 wait_extent_bit fs/btrfs/extent-io-tree.c:811 [inline] btrfs_lock_extent_bits+0x59c/0x700 fs/btrfs/extent-io-tree.c:1914 btrfs_lock_extent fs/btrfs/extent-io-tree.h:152 [inline] btrfs_invalidate_folio+0x43d/0xc40 fs/btrfs/inode.c:7704 extent_writepage fs/btrfs/extent_io.c:1852 [inline] extent_write_cache_pages fs/btrfs/extent_io.c:2580 [inline] btrfs_writepages+0x12ff/0x2440 fs/btrfs/extent_io.c:2713 do_writepages+0x32e/0x550 mm/page-writeback.c:2554 __writeback_single_inode+0x133/0x11a0 fs/fs-writeback.c:1750 writeback_sb_inodes+0x995/0x19d0 fs/fs-writeback.c:2042 wb_writeback+0x456/0xb70 fs/fs-writeback.c:2227 wb_do_writeback fs/fs-writeback.c:2374 [inline] wb_workfn+0x41a/0xf60 fs/fs-writeback.c:2414 process_one_work kernel/workqueue.c:3276 [inline] process_scheduled_works+0xb6e/0x18c0 kernel/workqueue.c:3359 worker_thread+0xa53/0xfc0 kernel/workqueue.c:3440 kthread+0x388/0x470 kernel/kthread.c:436 ret_from_fork+0x51e/0xb90 arch/x86/kernel/process.c:158 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245 </TASK> INFO: task syz.4.64:6910 blocked for more than 143 seconds. Not tainted syzkaller #0 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. 
task:syz.4.64 state:D stack:22752 pid:6910 tgid:6905 ppid:5944 task_flags:0x400140 flags:0x00080002 Call Trace: <TASK> context_switch kernel/sched/core.c:5298 [inline] __schedule+0x1553/0x5240 kernel/sched/core.c:6911 __schedule_loop kernel/sched/core.c:6993 [inline] schedule+0x164/0x360 kernel/sched/core.c:7008 wait_current_trans+0x39f/0x590 fs/btrfs/transaction.c:535 start_transaction+0x6a7/0x1650 fs/btrfs/transaction.c:705 clone_copy_inline_extent fs/btrfs/reflink.c:299 [inline] btrfs_clone+0x128a/0x24d0 fs/btrfs/reflink.c:529 btrfs_clone_files+0x271/0x3f0 fs/btrfs/reflink.c:750 btrfs_remap_file_range+0x76b/0x1320 fs/btrfs/reflink.c:903 vfs_copy_file_range+0xda7/0x1390 fs/read_write.c:1600 __do_sys_copy_file_range fs/read_write.c:1683 [inline] __se_sys_copy_file_range+0x2fb/0x480 fs/read_write.c:1650 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline] do_syscall_64+0x14d/0xf80 arch/x86/entry/syscall_64.c:94 entry_SYSCALL_64_after_hwframe+0x77/0x7f RIP: 0033:0x7f5f73afc799 RSP: 002b:00007f5f7315e028 EFLAGS: 00000246 ORIG_RAX: 0000000000000146 RAX: ffffffffffffffda RBX: 00007f5f73d75fa0 RCX: 00007f5f73afc799 RDX: 0000000000000005 RSI: 0000000000000000 RDI: 0000000000000005 RBP: 00007f5f73b92c99 R08: 0000000000000863 R09: 0000000000000000 R10: 00002000000000c0 R11: 0000000000000246 R12: 0000000000000000 R13: 00007f5f73d76038 R14: 00007f5f73d75fa0 R15: 00007fff138a5068 </TASK> INFO: task syz.4.64:6975 blocked for more than 143 seconds. Not tainted syzkaller #0 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. 
task:syz.4.64 state:D stack:24736 pid:6975 tgid:6905 ppid:5944 task_flags:0x400040 flags:0x00080002 Call Trace: <TASK> context_switch kernel/sched/core.c:5298 [inline] __schedule+0x1553/0x5240 kernel/sched/core.c:6911 __schedule_loop kernel/sched/core.c:6993 [inline] schedule+0x164/0x360 kernel/sched/core.c:7008 wb_wait_for_completion+0x3e8/0x790 fs/fs-writeback.c:227 __writeback_inodes_sb_nr+0x24c/0x2d0 fs/fs-writeback.c:2838 try_to_writeback_inodes_sb+0x9a/0xc0 fs/fs-writeback.c:2886 btrfs_start_delalloc_flush fs/btrfs/transaction.c:2175 [inline] btrfs_commit_transaction+0x82e/0x31a0 fs/btrfs/transaction.c:2364 btrfs_ioctl+0xca7/0xd00 fs/btrfs/ioctl.c:5206 vfs_ioctl fs/ioctl.c:51 [inline] __do_sys_ioctl fs/ioctl.c:597 [inline] __se_sys_ioctl+0xff/0x170 fs/ioctl.c:583 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline] do_syscall_64+0x14d/0xf80 arch/x86/entry/syscall_64.c:94 entry_SYSCALL_64_after_hwframe+0x77/0x7f RIP: 0033:0x7f5f73afc799 RSP: 002b:00007f5f7313d028 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 RAX: ffffffffffffffda RBX: 00007f5f73d76090 RCX: 00007f5f73afc799 RDX: 0000000000000000 RSI: 0000000000009408 RDI: 0000000000000004 RBP: 00007f5f73b92c99 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000 R13: 00007f5f73d76128 R14: 00007f5f73d76090 R15: 00007fff138a5068 </TASK> Fix this by updating the i_size of the destination inode of a reflink operation after we copy an inline extent's data to an offset beyond the i_size and before attempting to start a transaction to update the inode's item. Reported-by: syzbot+63056bf627663701bbbf@syzkaller.appspotmail.com Link: https://lore.kernel.org/linux-btrfs/69bba3fe.050a0220.227207.002f.GAE@google.com/ Fixes: `05a5a7621c` ("Btrfs: implement full reflink support for inline extents") Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2026-04-07 18:56:06 +02:00
Filipe Manana	6fa9729568	btrfs: pass literal booleans to functions that take boolean arguments We have several functions with parameters defined as booleans but then we have callers passing integers, 0 or 1, instead of false and true. While this isn't a bug since 0 and 1 are converted to false and true, it is odd and less readable. Change the callers to pass true and false literals instead. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2026-04-07 18:55:56 +02:00
Filipe Manana	8ecf596ed8	btrfs: update comment for delalloc flush and oe wait in btrfs_clone_files() Make the comment more detailed about why we need to flush delalloc and wait for ordered extent completion before attempting to invalidate the page cache. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2026-02-03 07:51:43 +01:00
Filipe Manana	7d7608cc9a	btrfs: move unlikely checks around btrfs_is_shutdown() into the helper Instead of surrounding every caller of btrfs_is_shutdown() with unlikely, move the unlikely into the helper itself, like we do in other places in btrfs and is common in the kernel outside btrfs too. Also make the fs_info argument of btrfs_is_shutdown() const. On a x86_64 box using gcc 14.2.0-19 from Debian, this resulted in a slight reduction of the module's text size. Before: $ size fs/btrfs/btrfs.ko text data bss dec hex filename 1939044 172568 15592 2127204 207564 fs/btrfs/btrfs.ko After: $ size fs/btrfs/btrfs.ko text data bss dec hex filename 1938876 172568 15592 2127036 2074bc fs/btrfs/btrfs.ko Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2026-02-03 07:49:12 +01:00
Filipe Manana	8826807749	btrfs: invalidate pages instead of truncate after reflinking Qu reported that generic/164 often fails because the read operations get zeroes when it expects to either get all bytes with a value of 0x61 or 0x62. The issue stems from truncating the pages from the page cache instead of invalidating, as truncating can zero page contents. This zeroing is not just in case the range is not page sized (as it's commented in truncate_inode_pages_range()) but also in case we are using large folios, they need to be split and the splitting fails. Stealing Qu's comment in the thread linked below: "We can have the following case: 0 4K 8K 12K 16K | | | | | |<---- Extent A ----->|<----- Extent B ------>| The page size is still 4K, but the folio we got is 16K. Then if we remap the range for [8K, 16K), then truncate_inode_pages_range() will get the large folio 0 sized 16K, then call truncate_inode_partial_folio(). Which later calls folio_zero_range() for the [8K, 16K) range first, then tries to split the folio into smaller ones to properly drop them from the cache. But if splitting failed (e.g. racing with other operations holding the filemap lock), the partially zeroed large folio will be kept, resulting the range [8K, 16K) being zeroed meanwhile the folio is still a 16K sized large one." So instead of truncating, invalidate the page cache range with a call to filemap_invalidate_inode(), which besides not doing any zeroing also ensures that while it's invalidating folios, no new folios are added. This helps ensure that buffered reads that happen while a reflink operation is in progress always get either the whole old data (the one before the reflink) or the whole new data, which is what generic/164 expects. 
Link: https://lore.kernel.org/linux-btrfs/7fb9b44f-9680-4c22-a47f-6648cb109ddf@suse.com/ Reported-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2026-01-12 16:21:55 +01:00
Omar Sandoval	f968340053	btrfs: disable various operations on encrypted inodes Initially, only normal data extents will be encrypted. This change forbids various other bits: - allows reflinking only if both inodes have the same encryption status - disable inline data on encrypted inodes Note: The patch was taken from v5 of fscrypt patchset (https://lore.kernel.org/linux-btrfs/cover.1706116485.git.josef@toxicpanda.com/) which was handled over time by various people: Omar Sandoval, Sweet Tea Dorminy, Josef Bacik. Signed-off-by: Omar Sandoval <osandov@osandov.com> Signed-off-by: Daniel Vacek <neelx@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> [ add note ] Signed-off-by: David Sterba <dsterba@suse.com>	2025-11-25 01:50:56 +01:00
Miquel Sabaté Solà	7ab5d01d58	btrfs: apply the AUTO_K(V)FREE macros throughout the code Apply the AUTO_KFREE and AUTO_KVFREE macros wherever it makes sense. Since this macro is expected to improve code readability, it has been avoided in places where the lifetime of objects wasn't easy to follow and a cleanup attribute would've made things worse; or when the cleanup section of a function involved many other things and thus there was no readability impact anyways. This change has also not been applied in extremely short functions where readability was clearly not an issue. Signed-off-by: Miquel Sabaté Solà <mssola@mssola.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-11-24 22:34:51 +01:00
Qu Wenruo	9b2839451d	btrfs: introduce a new shutdown state A new fs state EMERGENCY_SHUTDOWN is introduced, which is btrfs' equivalent of XFS_IOC_GOINGDOWN or EXT4_IOC_SHUTDOWN. After entering the emergency shutdown state, all operations will return errors (-EIO), and the filesystem cannot be brought back to a normal state until unmount. The new state will reject the following file operations: - read_iter() - write_iter() - mmap() - open() - remap_file_range() - uring_cmd() - splice_read() This requires a small wrapper to do the extra shutdown check, then call the regular filemap_splice_read() function This should reject most of the file operations on a shutdown btrfs. And for the existing dirty folios, extra shutdown checks are introduced to the following functions: - run_delalloc_nocow() - run_delalloc_compressed() - cow_file_range() So that dirty ranges will still be properly cleaned without being submitted. Finally the shutdown state will also set the fs error, so that no new transaction will be committed, protecting the metadata from any possible further corruption. And when the fs entered shutdown mode for the first time, a critical level kernel message will show up to indicate the incident. That message will be important for end users as rejected delalloc ranges will output error messages, hopefully that shutdown message and the fact that all fs operations are returning error will prevent end users from getting too confused about the delalloc error messages. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <asj@kernel.org> Tested-by: Anand Jain <asj@kernel.org> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-11-24 21:45:03 +01:00
David Sterba	a929904cf7	btrfs: add unlikely annotations to branches leading to transaction abort The unlikely() annotation is a static prediction hint that compiler may use to reorder code out of hot path. We use it elsewhere (namely tree-checker.c) for error branches that almost never happen. Transaction abort is one such error, the btrfs_abort_transaction() inlines code to check the state and print a warning, this ought to be out of the hot path. The most common pattern is when transaction abort is called after checking a return value and the control flow leads to a quick return. In other cases it may not be necessary to add unlikely() e.g. when the function returns anyway or the control flow is not changed noticeably. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-09-23 08:49:26 +02:00
Sun YangKai	4ca6f24a52	btrfs: more trivial BTRFS_PATH_AUTO_FREE conversions Trivial pattern for the auto freeing with goto -> return conversions if possible. The following cases are considered trivial in this patch: 1. Cases where there are no operations between btrfs_free_path() and the function returns. 2. Cases where only simple cleanup operations (such as kfree(), kvfree(), clear_bit(), and fs_path_free()) are present between btrfs_free_path() and the function return. Signed-off-by: Sun YangKai <sunk67188@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-09-23 08:49:26 +02:00
David Sterba	67e78f983e	btrfs: convert several int parameters to bool We're almost done cleaning misused int/bool parameters. Convert a bunch of them, found by manual grepping. Note that btrfs_sync_fs() needs an int as it's mandated by the struct super_operations prototype. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>	2025-09-22 10:54:32 +02:00
Filipe Manana	d6be378de0	btrfs: remove btrfs_clear_extent_bits() It's just a simple wrapper around btrfs_clear_extent_bit() that passes a NULL for its last argument (a cached extent state record), plus there is no counterpart - we have a btrfs_set_extent_bit() but we do not have a btrfs_set_extent_bits() (plural version). So just remove it and make all callers use btrfs_clear_extent_bit() directly. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-22 00:09:22 +02:00
Filipe Manana	f2de2b9ffd	btrfs: unfold transaction abort at clone_copy_inline_extent() We have a common error path where we abort the transaction, but like this in case we get a transaction abort stack trace we don't know exactly which previous function call failed. Instead abort the transaction after any function call that returns an error, so that we can easily identify which function failed. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 15:31:06 +02:00
Filipe Manana	5ff6050fcd	btrfs: remove pointless 'out' label from clone_finish_inode_update() The label is only used once and we can instead return directly where it's used, besides the fact that all we do under the label is to return the value of 'ret'. So get rid of the label and return directly. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 15:31:06 +02:00
Filipe Manana	443e4d0e1c	btrfs: return real error from __filemap_get_folio() calls We have a few places that always assume a -ENOMEM error happened in case a call to __filemap_get_folio() returns an error, which is just too much of an assumption and even if it would be the case at some point in time, it's not future proof and there's nothing in the documentation that guarantees that only ERR_PTR(-ENOMEM) can be returned with the flags we are passing to it. So use the exact error returned by __filemap_get_folio() instead. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:57 +02:00
Filipe Manana	9d222562b4	btrfs: rename the functions to clear bits for an extent range These functions are exported so they should have a 'btrfs_' prefix by convention, to make it clear they are btrfs specific and to avoid collisions with functions from elsewhere in the kernel. One of them has a double underscore prefix which is also discouraged. So remove double underscore prefix where applicable and add a 'btrfs_' prefix to their name to make it clear they are from btrfs. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:43 +02:00
Filipe Manana	242570e80b	btrfs: add btrfs prefix to main lock, try lock and unlock extent functions These functions are exported so they should have a 'btrfs_' prefix by convention, to make it clear they are btrfs specific and to avoid collisions with functions from elsewhere in the kernel. So add a prefix to their name. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:43 +02:00
Filipe Manana	c4e33a8431	btrfs: use clear_extent_bits() instead of clear_extent_bit() where possible Several places are using clear_extent_bit() and passing a NULL value for the 'cached' argument, which is pointless as they can use instead clear_extent_bits(). Reviewed-by: Boris Burkov <boris@bur.io> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:41 +02:00
David Sterba	2d4323ced5	btrfs: use struct btrfs_inode inside btrfs_remap_file_range_prep() Use a struct btrfs_inode in btrfs_remap_file_range_prep() as it's an internal helper, allowing to remove some use of BTRFS_I. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-03-18 20:35:44 +01:00
David Sterba	8b044e17e5	btrfs: use struct btrfs_inode inside btrfs_remap_file_range() Use a struct btrfs_inode to btrfs_remap_file_range() as it's an internal helper, allowing to remove some use of BTRFS_I. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-03-18 20:35:44 +01:00
David Sterba	61dbdeb870	btrfs: pass struct btrfs_inode to btrfs_extent_same_range() Pass a struct btrfs_inode to btrfs_extent_same_range() as it's an internal interface, allowing to remove some use of BTRFS_I. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-03-18 20:35:44 +01:00
David Sterba	651cef4611	btrfs: pass struct btrfs_inode to btrfs_double_mmap_unlock() Pass a struct btrfs_inode to btrfs_double_mmap_unlock() as it's an internal interface, allowing to remove some use of BTRFS_I. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-03-18 20:35:44 +01:00
David Sterba	0061ba125b	btrfs: pass struct btrfs_inode to btrfs_double_mmap_lock() Pass a struct btrfs_inode to btrfs_double_mmap_lock() as it's an internal interface, allowing to remove some use of BTRFS_I. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-03-18 20:35:44 +01:00
David Sterba	65a66afd1e	btrfs: pass struct btrfs_inode to clone_copy_inline_extent() Pass a struct btrfs_inode to clone_copy_inline_extent() as it's an internal interface, allowing to remove some use of BTRFS_I. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-03-18 20:35:44 +01:00
Li Zetao	faad57ae20	btrfs: convert copy_inline_to_page() to use folio The old page API is being gradually replaced and converted to use folio to improve code readability and avoid repeated conversion between page and folio. Moreover find_or_create_page() is a compatible API, and it can be replaced with __filemap_get_folio(). Some interfaces have been converted to use folio before, so the conversion operation from page can be eliminated here. Signed-off-by: Li Zetao <lizetao1@huawei.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-09-10 16:51:21 +02:00
Li Zetao	aeb6d88148	btrfs: convert btrfs_decompress() to take a folio The old page API is being gradually replaced and converted to use folio to improve code readability and avoid repeated conversion between page and folio. Based on the previous patch, the compression path can be directly used in folio without converting to page. Signed-off-by: Li Zetao <lizetao1@huawei.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-09-10 16:51:21 +02:00
Filipe Manana	e641e323ab	btrfs: pass a btrfs_inode to btrfs_wait_ordered_range() Instead of passing a (VFS) inode pointer argument, pass a btrfs_inode instead, as this is generally what we do for internal APIs, making it more consistent with most of the code base. This will later allow to help to remove a lot of BTRFS_I() calls in btrfs_sync_file(). Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-07-11 15:33:18 +02:00
Josef Bacik	e094f48040	btrfs: change root->root_key.objectid to btrfs_root_id() A comment from Filipe on one of my previous cleanups brought my attention to a new helper we have for getting the root id of a root, which makes it easier to read in the code. The changes were made with the following Coccinelle semantic patch: // <smpl> @@ expression E,E1; @@ ( E->root_key.objectid = E1 | - E->root_key.objectid + btrfs_root_id(E) ) // </smpl> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> [ minor style fixups ] Signed-off-by: David Sterba <dsterba@suse.com>	2024-05-07 21:31:06 +02:00
Filipe Manana	5d6f0e9890	btrfs: stop locking the source extent range during reflink Nowadays before starting a reflink operation we do this: 1) Take the VFS lock of the inodes in exclusive mode (a rw semaphore); 2) Take the mmap lock of the inodes (struct btrfs_inode::i_mmap_lock); 3) Flush all delalloc in the source and target ranges; 4) Wait for all ordered extents in the source and target ranges to complete; 5) Lock the source and destination ranges in the inodes' io trees. In step 5 we lock the source range because: 1) We needed to serialize against mmap writes, but that is not needed anymore because nowadays we do that through the inode's i_mmap_lock (step 2). This happens since commit `8c99516a8c` ("btrfs: exclude mmaps while doing remap"); 2) To serialize against a concurrent relocation and avoid generating a delayed ref for an extent that was just dropped by relocation, see commit `d8b5524242` ("Btrfs: fix race between reflink/dedupe and relocation"). Locking the source range however blocks any concurrent reads for that range and makes test case generic/733 fail. So instead of locking the source range during reflinks, make relocation read lock the inode's i_mmap_lock, so that it serializes with a concurrent reflink while still able to run concurrently with mmap writes and allow concurrent reads too. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-05-07 21:31:02 +02:00
David Sterba	41044b41ad	btrfs: add helper to get fs_info from struct inode pointer Add a convenience helper to get a fs_info from a VFS inode pointer instead of open coding the chain or using btrfs_sb() that in some cases does one more pointer hop. This is implemented as a macro (still with type checking) so we don't need full definitions of struct btrfs_inode, btrfs_root or btrfs_fs_info. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-03-04 16:24:49 +01:00
David Sterba	4e00422ee6	btrfs: replace sb::s_blocksize by fs_info::sectorsize The block size stored in the super block is used by subsystems outside of btrfs and it's a copy of fs_info::sectorsize. Unify that to always use our sectorsize, with the exception of mount where we first need to use fixed values (4K) until we read the super block and can set the sectorsize. Replace all uses, in most cases it's fewer pointer indirections. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-03-04 16:24:46 +01:00
Qu Wenruo	55151ea9ec	btrfs: migrate subpage code to folio interfaces Although subpage itself is conflicting with higher folio, since subpage (sectorsize < PAGE_SIZE and nodesize < PAGE_SIZE) means we will never need higher order folio, there is a hidden pitfall: - btrfs_page_*() helpers Those helpers are an abstraction to handle both subpage and non-subpage cases, which means we're going to pass pages pointers to those helpers. And since those helpers are shared between data and metadata paths, it's unavoidable to let them to handle folios, including higher order folios). Meanwhile for true subpage case, we should only have a single page backed folios anyway, thus add a new ASSERT() for btrfs_subpage_assert() to ensure that. Also since those helpers are shared between both data and metadata, add some extra ASSERT()s for data path to make sure we only get single page backed folio for now. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-12-15 23:03:58 +01:00
Linus Torvalds	d5acbc60fa	for-6.7-tag -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmU/xAEACgkQxWXV+ddt WDvYKg//SjTimA5Nins9mb4jdz8n+dDeZnQhKzy3FqInU41EzDRc4WwnEODmDlTa AyU9rGB3k0JNSUc075jZFCyLqq/ARiOqRi4x33Gk0ckIlc4X5OgBoqP2XkPh0VlP txskLCrmhc3pwyR4ErlFDX2jebIUXfkv39bJuE40grGvUatRe+WNq0ERIrgO8RAr Rc3hBotMH8AIqfD1L6j1ZiZIAyrOkT1BJMuqeoq27/gJZn/MRhM9TCrMTzfWGaoW SxPrQiCDEN3KECsOY/caroMn3AekDijg/ley1Nf7Z0N6oEV+n4VWWPBFE9HhRz83 9fIdvSbGjSJF6ekzTjcVXPAbcuKZFzeqOdBRMIW3TIUo7mZQyJTVkMsc1y/NL2Z3 9DhlRLIzvWJJjt1CEK0u18n5IU+dGngdktbhWWIuIlo8r+G/iKR/7zqU92VfWLHL Z7/eh6HgH5zr2bm+yKORbrUjkv4IVhGVarW8D4aM+MCG0lFN2GaPcJCCUrp4n7rZ PzpQbxXa38ANBk6hsp4ndS8TJSBL9moY8tumzLcKg97nzNMV6KpBdV/G6/QfRLCN 3kM6UbwTAkMwGcQS86Mqx6s04ORLnQeD6f7N6X4Ppx0Mi/zkjI2HkRuvQGp12B0v iZjCCZAYY2Iu+/TU0GrCXSss/grzIAUPzM9msyV3XGO/VBpwdec= =9TVx -----END PGP SIGNATURE----- Merge tag 'for-6.7-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs updates from David Sterba: "New features: - raid-stripe-tree New tree for logical file extent mapping where the physical mapping may not match on multiple devices. This is now used in zoned mode to implement RAID0/RAID1* profiles, but can be used in non-zoned mode as well. The support for RAID56 is in development and will eventually fix the problems with the current implementation. This is a backward incompatible feature and has to be enabled at mkfs time. - simple quota accounting (squota) A simplified mode of qgroup that accounts all space on the initial extent owners (a subvolume), the snapshots are then cheap to create and delete. The deletion of snapshots in fully accounting qgroups is a known CPU/IO performance bottleneck. The squota is not suitable for the general use case but works well for containers where the original subvolume exists for the whole time. This is a backward incompatible feature as it needs extending some structures, but can be enabled on an existing filesystem. 
- temporary filesystem fsid (temp_fsid) The fsid identifies a filesystem and is hard coded in the structures, which disallows mounting the same fsid found on different devices. For a single device filesystem this is not strictly necessary, a new temporary fsid can be generated on mount e.g. after a device is cloned. This will be used by Steam Deck for root partition A/B testing, or can be used for VM root images. Other user visible changes: - filesystems with partially finished metadata_uuid conversion cannot be mounted anymore and the uuid fixup has to be done by btrfs-progs (btrfstune). Performance improvements: - reduce reservations for checksum deletions (with enabled free space tree by factor of 4), on a sample workload on file with many extents the deletion time decreased by 12% - make extent state merges more efficient during insertions, reduce rb-tree iterations (run time of critical functions reduced by 5%) Core changes: - the integrity check functionality has been removed, this was a debugging feature and removal does not affect other integrity checks like checksums or tree-checker - space reservation changes: - more efficient delayed ref reservations, this avoids building up too much work or overusing or exhausting the global block reserve in some situations - move delayed refs reservation to the transaction start time, this prevents some ENOSPC corner cases related to exhaustion of global reserve - improvements in reducing excessive reservations for block group items - adjust overcommit logic in near full situations, account for one more chunk to eventually allocate metadata chunk, this is mostly relevant for small filesystems (<10GiB) - single device filesystems are scanned but not registered (except seed devices), this allows temp_fsid to work - qgroup iterations do not need GFP_ATOMIC allocations anymore - cleanups, refactoring, reduced data structure size, function parameter simplifications, error handling fixes" * tag 'for-6.7-tag' of 
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (156 commits) btrfs: open code timespec64 in struct btrfs_inode btrfs: remove redundant log root tree index assignment during log sync btrfs: remove redundant initialization of variable dirty in btrfs_update_time() btrfs: sysfs: show temp_fsid feature btrfs: disable the device add feature for temp-fsid btrfs: disable the seed feature for temp-fsid btrfs: update comment for temp-fsid, fsid, and metadata_uuid btrfs: remove pointless empty log context list check when syncing log btrfs: update comment for struct btrfs_inode::lock btrfs: remove pointless barrier from btrfs_sync_file() btrfs: add and use helpers for reading and writing last_trans_committed btrfs: add and use helpers for reading and writing fs_info->generation btrfs: add and use helpers for reading and writing log_transid btrfs: add and use helpers for reading and writing last_log_commit btrfs: support cloned-device mount capability btrfs: add helper function find_fsid_by_disk btrfs: stop reserving excessive space for block group item insertions btrfs: stop reserving excessive space for block group item updates btrfs: reorder btrfs_inode to fill gaps btrfs: open code btrfs_ordered_inode_tree in btrfs_inode ...	2023-10-30 10:42:06 -10:00
Jeff Layton	b1c38a1338	btrfs: convert to new timestamp accessors Convert to using the new inode timestamp accessor functions. Signed-off-by: Jeff Layton <jlayton@kernel.org> Link: https://lore.kernel.org/r/20231004185347.80880-21-jlayton@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>	2023-10-18 13:26:19 +02:00
Filipe Manana	8b9d032225	btrfs: remove redundant root argument from btrfs_update_inode() The root argument for btrfs_update_inode() always matches the root of the given inode, so remove the root argument and get it from the inode argument. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-10-12 16:44:12 +02:00
Jeff Layton	2a9462de43	btrfs: convert to ctime accessor functions In later patches, we're going to change how the inode's ctime field is used. Switch to using accessor functions instead of raw accesses of inode->i_ctime. Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Jan Kara <jack@suse.cz> Message-Id: <20230705190309.579783-27-jlayton@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>	2023-07-13 10:28:04 +02:00
David Sterba	e5d4d75bd3	btrfs: pass btrfs_inode to btrfs_inode_unlock The function is for internal interfaces so we should use the btrfs_inode. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-12-05 18:00:53 +01:00
David Sterba	29b6352b14	btrfs: pass btrfs_inode to btrfs_inode_lock The function is for internal interfaces so we should use the btrfs_inode. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-12-05 18:00:53 +01:00
Josef Bacik	7f0add250f	btrfs: move super_block specific helpers into super.h This will make syncing fs.h to user space a little easier if we can pull the super block specific helpers out of fs.h and put them in super.h. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-12-05 18:00:47 +01:00
Josef Bacik	af142b6f44	btrfs: move file prototypes to file.h Move these out of ctree.h into file.h to cut down on code in ctree.h. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-12-05 18:00:46 +01:00
Josef Bacik	7c8ede1628	btrfs: move file-item prototypes into their own header Move these prototypes out of ctree.h and into file-item.h. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-12-05 18:00:46 +01:00
David Sterba	43dd529abe	btrfs: update function comments Update, reformat or reword function comments. This also removes the kdoc marker so we don't get reports when the function name is missing. Changes made: - remove kdoc markers - reformat the brief description to be a proper sentence - reword to imperative voice - align parameter list - fix typos Signed-off-by: David Sterba <dsterba@suse.com>	2022-12-05 18:00:45 +01:00
Josef Bacik	07e81dc944	btrfs: move accessor helpers into accessors.h This is a large patch, but because they're all macros it's impossible to split up. Simply copy all of the item accessors in ctree.h and paste them in accessors.h, and then update any files to include the header so everything compiles. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> [ reformat comments, style fixups ] Signed-off-by: David Sterba <dsterba@suse.com>	2022-12-05 18:00:42 +01:00
Josef Bacik	ec8eb376e2	btrfs: move BTRFS_FS_STATE* definitions and helpers to fs.h We're going to use fs.h to hold fs wide related helpers and definitions, move the FS_STATE enum and related helpers to fs.h, and then update all files that need these definitions to include fs.h. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-12-05 18:00:42 +01:00
Josef Bacik	9b569ea0be	btrfs: move the printk helpers out of ctree.h We have a bunch of printk helpers that are in ctree.h. These have nothing to do with ctree.c, so move them into their own header. Subsequent patches will cleanup the printk helpers. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-12-05 18:00:41 +01:00
Josef Bacik	bd015294af	btrfs: replace delete argument with EXTENT_CLEAR_ALL_BITS Instead of taking up a whole argument to indicate we're clearing everything in a range, simply add another EXTENT bit to control this, and then update all the callers to drop this argument from the clear_extent_bit variants. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-26 12:28:05 +02:00
Josef Bacik	570eb97bac	btrfs: unify the lock/unlock extent variants We have two variants of lock/unlock extent, one set that takes a cached state, another that does not. This is slightly annoying, and generally speaking there are only a few places where we don't have a cached state. Simplify this by making lock_extent/unlock_extent the only variant and make it take a cached state, then convert all the callers appropriately. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-26 12:28:05 +02:00
Josef Bacik	dbbf49928f	btrfs: remove the wake argument from clear_extent_bits This is only used in the case that we are clearing EXTENT_LOCKED, so infer this value from the bits passed in instead of taking it as an argument. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-26 12:28:04 +02:00
David Sterba	c1867eb33e	btrfs: clean up chained assignments The chained assignments may be convenient to write, but make readability a bit worse as it's too easy to overlook that there are several values set on the same line while this is rather an exception. Making it consistent everywhere avoids surprises. The pattern where inode times are initialized reuses the first value and the order is mtime, ctime. In other blocks the assignments are expanded so the order of variables is similar to the neighboring code. Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:45:39 +02:00
Filipe Manana	6fe81a3a3a	btrfs: balance btree dirty pages and delayed items after clone and dedupe When reflinking extents (clone and deduplication), we need to touch the btree of the destination inode's subvolume, as well as potentially create a delayed inode for the destination inode (if it was not created before). However we are neither balancing the btree dirty pages nor the delayed items after such operations, so if we have a task that is doing a long series of clone or deduplication operations, it can result in accumulation of too many btree dirty pages and delayed items. So just call btrfs_btree_balance_dirty() after clone and deduplication, just like we do for every other system call that results in modifying a btree and adding delayed items. Reviewed-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:44:35 +02:00
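
Two of the commits listed above (bd015294af and dbbf49928f) describe the same refactoring pattern: dropping a dedicated function argument and deriving its value from the bit mask the caller already passes. The following is a toy sketch of that pattern only, not the btrfs code. Every name prefixed with `toy_` or `TOY_`, the bit values, and the choice to treat "clear all" as also covering the lock bit are assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical bit values; the kernel's real definitions differ. */
enum {
	TOY_EXTENT_DIRTY          = 1 << 0,
	TOY_EXTENT_LOCKED         = 1 << 1,
	/* Control bit: requests clearing everything, never stored in a state. */
	TOY_EXTENT_CLEAR_ALL_BITS = 1 << 2,
};

#define TOY_STATE_BITS (TOY_EXTENT_DIRTY | TOY_EXTENT_LOCKED)

struct toy_extent_state {
	unsigned int bits;
	bool woken; /* set when waiters on the range would be woken */
	bool freed; /* set when the state has no bits left and would be released */
};

/*
 * Before the refactoring, a function like this would take separate
 * "wake" and "delete" arguments. Here both are inferred from @bits.
 */
void toy_clear_extent_bit(struct toy_extent_state *state, unsigned int bits)
{
	bool wake;

	assert(state != NULL);

	/* "Clear all" expands to every state bit (an assumption of this sketch). */
	if (bits & TOY_EXTENT_CLEAR_ALL_BITS)
		bits = TOY_STATE_BITS;

	/* Waiters only need waking when the lock bit is being cleared. */
	wake = (bits & TOY_EXTENT_LOCKED) != 0;

	state->bits &= ~bits;
	state->woken = wake;
	state->freed = (state->bits == 0);
}
```

A caller that previously passed `wake = 1` when unlocking a range would, in this sketch, simply include `TOY_EXTENT_LOCKED` in the mask, and a caller that previously passed `delete = 1` would pass `TOY_EXTENT_CLEAR_ALL_BITS` instead.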

100 Commits