linux

mirror of https://github.com/torvalds/linux.git synced 2026-05-30 10:04:04 +02:00

Author	SHA1	Message	Date
Linus Torvalds	73082fbdb1	for-7.1-rc1-tag -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmnvdO8ACgkQxWXV+ddt WDuLqA//fcHDOnClWHRRUaIWhhkMYm7gNZkXf2d+qyYLMtAP2Cv2sZ+aV+OkHp5D /Gq1W1mUXZLabu0EV0xKICn01nwzWtbZwDO8Bo3+QEdLoAi2gITODsYyY8yeW9KO GfSBPsom+d7ktVrjaYE7Ppcm6YifBjWNDDcC+MX7Kpy+OUqhyOtsJIaEeTwii9+P eiyAAC1zqrHZtaQfLsY3WvM0baNaqlm1xURMjJPyRCAtjGpjZy1hK/iFsGcHRlfc SR//WT/MRnUAFn8zlIBG0wNrk1IEIgPPiA7hAXMRGXFKo0C6ICYLl5MJQh/o/MUs tFBdkBhtcX/Kynvwb059SyalXZzVhQvzaRN89ZGuDyalNiejRzb8F2oVCfKAVKIU MdkKOjnR5b7BUzCcZ1cJf1LgX4SngYKTnXrNGHpW0fuUzX6moJEd4wbrgmHjk9ke +TVdl2vcpAduvBU9idkpDAcUW998tcYmX/LyQhGYpR6k/4n2UdFZJPINqco3pOAO RIFbIgEAq9rUi+GMSJdEDMO6xLmUYoI6vaw7uZSU6E04zJPiVIcixfRDCBKGPV5Q Yl9PC3ViLSlgKWaG7UVl8PVaSkCQ7esbfPAnNI/+RBCUeehhSFygePcY+kH1k4LA 0qMne1abDysUVwolb/1de/fqkznLlB3SlA447HwdvwMI0mCSb7w= =ajKs -----END PGP SIGNATURE----- Merge tag 'for-7.1-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: - space reservation fixes: - correctly undo 'may_use' accounting for remap tree - avoid double decrement of 'may_use' when submitting async io - actually enable the shutdown ioctl callback (not just the superblock ops) - raid stripe tree fixes when deleting extents - add missing error handling - fix various incorrect values set - fix transaction state when removing a directory, possibly leading to EIO during log replay - additional b-tree node key checks during metadata readahead - error handling and transaction abort updates * tag 'for-7.1-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: fix double-decrement of bytes_may_use in submit_one_async_extent() btrfs: check return value of btrfs_partially_delete_raid_extent() btrfs: handle -EAGAIN from btrfs_duplicate_item and refresh stale leaf pointer btrfs: replace ASSERT with proper error handling in stripe lookup fallback btrfs: fix wrong min_objectid in btrfs_previous_item() call btrfs: fix raid stripe search missing entries at leaf boundaries btrfs: copy devid in btrfs_partially_delete_raid_extent() btrfs: handle unexpected free-space-tree key types btrfs: fix missing last_unlink_trans update when removing a directory btrfs: don't clobber errors in add_remap_tree_entries() btrfs: enable shutdown ioctl for non-experimental builds btrfs: apply first key check for readahead when possible btrfs: abort transaction in do_remap_reloc_trans() on failure btrfs: fix bytes_may_use leak in do_remap_reloc_trans() btrfs: fix bytes_may_use leak in move_existing_remap()	2026-04-27 16:35:44 -07:00
Qu Wenruo	a86a283430	btrfs: apply first key check for readahead when possible Currently for tree block readahead we never pass a btrfs_tree_parent_check with @has_first_key set. Without @has_first_key set, btrfs will skip the following extra checks: - Header generation check This is a minor one. - Empty leaf/node checks This is more serious, for certain trees like the csum tree, they are allowed to be empty, thus an empty leaf can pass the tree checker. But if there is a parent node for such an empty leaf, it indicates corruption. Without @has_first_key set, we can no longer detect such a problem. In fact there is already a fuzzed image report that a corrupted csum leaf which has zero nritems but still has a parent node can trigger a BUG_ON() during csum deletion. However there are only two call sites of btrfs_readahead_tree_block(): - Inside relocate_tree_blocks() At this call site we are trying to grab the first key of the tree block, thus we are not able to pass a @first_key parameter. - Inside btrfs_readahead_node_child() This is the more common call site, where we have the parent node and want to readahead the child tree blocks. In this case we can easily grab the node key and pass it for checks. Add a new parameter @first_key to btrfs_readahead_tree_block() and pass the node key to it inside btrfs_readahead_node_child(). This should plug the gap in empty leaf detection during readahead. Link: https://lore.kernel.org/linux-btrfs/20260409071255.3358044-1-gality369@gmail.com/ Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2026-04-21 04:01:24 +02:00
Linus Torvalds	334fbe734e	mm.git review status for linus..mm-stable Everything: Total patches: 368 Reviews/patch: 1.56 Reviewed rate: 74% Excluding DAMON: Total patches: 316 Reviews/patch: 1.77 Reviewed rate: 81% Excluding DAMON and zram: Total patches: 306 Reviews/patch: 1.81 Reviewed rate: 82% Excluding DAMON, zram and maple_tree: Total patches: 276 Reviews/patch: 2.01 Reviewed rate: 91% Significant patch series in this merge: - The 30 patch series "maple_tree: Replace big node with maple copy" from Liam Howlett is mainly prepararatory work for ongoing development but it does reduce stack usage and is an improvement. - The 12 patch series "mm, swap: swap table phase III: remove swap_map" from Kairui Song offers memory savings by removing the static swap_map. It also yields some CPU savings and implements several cleanups. - The 2 patch series "mm: memfd_luo: preserve file seals" from Pratyush Yadav adds file seal preservation to LUO's memfd code. - The 2 patch series "mm: zswap: add per-memcg stat for incompressible pages" from Jiayuan Chen adds additional userspace stats reportng to zswap. - The 4 patch series "arch, mm: consolidate empty_zero_page" from Mike Rapoport implements some cleanups for our handling of ZERO_PAGE() and zero_pfn. - The 2 patch series "mm/kmemleak: Improve scan_should_stop() implementation" from Zhongqiu Han provides an robustness improvement and some cleanups in the kmemleak code. - The 4 patch series "Improve khugepaged scan logic" from Vernon Yang "improves the khugepaged scan logic and reduces CPU consumption by prioritizing scanning tasks that access memory frequently". - The 2 patch series "Make KHO Stateless" from Jason Miu simplifies Kexec Handover by "transitioning KHO from an xarray-based metadata tracking system with serialization to a radix tree data structure that can be passed directly to the next kernel" - The 3 patch series "mm: vmscan: add PID and cgroup ID to vmscan tracepoints" from Thomas Ballasi and Steven Rostedt enhances vmscan's tracepointing. - The 5 patch series "mm: arch/shstk: Common shadow stack mapping helper and VM_NOHUGEPAGE" from Catalin Marinas is a cleanup for the shadow stack code: remove per-arch code in favour of a generic implementation. - The 2 patch series "Fix KASAN support for KHO restored vmalloc regions" from Pasha Tatashin fixes a WARN() which can be emitted the KHO restores a vmalloc area. - The 4 patch series "mm: Remove stray references to pagevec" from Tal Zussman provides several cleanups, mainly udpating references to "struct pagevec", which became folio_batch three years ago. - The 17 patch series "mm: Eliminate fake head pages from vmemmap optimization" from Kiryl Shutsemau simplifies the HugeTLB vmemmap optimization (HVO) by changing how tail pages encode their relationship to the head page. - The 2 patch series "mm/damon/core: improve DAMOS quota efficiency for core layer filters" from SeongJae Park improves two problematic behaviors of DAMOS that makes it less efficient when core layer filters are used. - The 3 patch series "mm/damon: strictly respect min_nr_regions" from SeongJae Park improves DAMON usability by extending the treatment of the min_nr_regions user-settable parameter. - The 3 patch series "mm/page_alloc: pcp locking cleanup" from Vlastimil Babka is a proper fix for a previously hotfixed SMP=n issue. Code simplifications and cleanups ennsed. - The 16 patch series "mm: cleanups around unmapping / zapping" from David Hildenbrand implements "a bunch of cleanups around unmapping and zapping. Mostly simplifications, code movements, documentation and renaming of zapping functions". - The 6 patch series "support batched checking of the young flag for MGLRU" from Baolin Wang supports batched checking of the young flag for MGLRU. It's part cleanups; one benchmark shows large performance benefits for arm64. - The 5 patch series "memcg: obj stock and slab stat caching cleanups" from Johannes Weiner provides memcg cleanup and robustness improvements. - The 5 patch series "Allow order zero pages in page reporting" from Yuvraj Sakshith enhances page_reporting's free page reporting - it is presently and undesirably order-0 pages when reporting free memory. - The 6 patch series "mm: vma flag tweaks" from Lorenzo Stoakes is cleanup work following from the recent conversion of the VMA flags to a bitmap. - The 10 patch series "mm/damon: add optional debugging-purpose sanity checks" from SeongJae Park adds some more developer-facing debug checks into DAMON core. - The 2 patch series "mm/damon: test and document power-of-2 min_region_sz requirement" from SeongJae Park adds an additional DAMON kunit test and makes some adjustments to the addr_unit parameter handling. - The 3 patch series "mm/damon/core: make passed_sample_intervals comparisons overflow-safe" from SeongJae Park fixes a hard-to-hit time overflow issue in DAMON core. - The 7 patch series "mm/damon: improve/fixup/update ratio calculation, test and documentation" from SeongJae Park is a "batch of misc/minor improvements and fixups" for DAMON. - The 4 patch series "mm: move vma_(kernel\|mmu)_pagesize() out of hugetlb.c" from David Hildenbrand fixes a possible issue with dax-device when CONFIG_HUGETLB=n. Some code movement was required. - The 6 patch series "zram: recompression cleanups and tweaks" from Sergey Senozhatsky provides "a somewhat random mix of fixups, recompression cleanups and improvements" in the zram code. - The 11 patch series "mm/damon: support multiple goal-based quota tuning algorithms" from SeongJae Park extend DAMOS quotas goal auto-tuning to support multiple tuning algorithms that users can select. - The 4 patch series "mm: thp: reduce unnecessary start_stop_khugepaged()" from Breno Leitao fixes the khugpaged sysfs handling so we no longer spam the logs with reams of junk when starting/stopping khugepaged. - The 3 patch series "mm: improve map count checks" from Lorenzo Stoakes provides some cleanups and slight fixes in the mremap, mmap and vma code. - The 5 patch series "mm/damon: support addr_unit on default monitoring targets for modules" from SeongJae Park extends the use of DAMON core's addr_unit tunable. - The 5 patch series "mm: khugepaged cleanups and mTHP prerequisites" from Nico Pache provides cleanups in the khugepaged and is a base for Nico's planned khugepaged mTHP support. - The 15 patch series "mm: memory hot(un)plug and SPARSEMEM cleanups" from David Hildenbrand implements code movement and cleanups in the memhotplug and sparsemem code. - The 2 patch series "mm: remove CONFIG_ARCH_ENABLE_MEMORY_HOTREMOVE and cleanup CONFIG_MIGRATION" from David Hildenbrand rationalizes some memhotplug Kconfig support. - The 6 patch series "change young flag check functions to return bool" from Baolin Wang is "a cleanup patchset to change all young flag check functions to return bool". - The 3 patch series "mm/damon/sysfs: fix memory leak and NULL dereference issues" from Josh Law and SeongJae Park fixes a few potential DAMON bugs. - The 25 patch series "mm/vma: convert vm_flags_t to vma_flags_t in vma code" from "converts a lot of the existing use of the legacy vm_flags_t data type to the new vma_flags_t type which replaces it". Mainly in the vma code. - The 21 patch series "mm: expand mmap_prepare functionality and usage" from Lorenzo Stoakes "expands the mmap_prepare functionality, which is intended to replace the deprecated f_op->mmap hook which has been the source of bugs and security issues for some time". Cleanups, documentation, extension of mmap_prepare into filesystem drivers. - The 13 patch series "mm/huge_memory: refactor zap_huge_pmd()" from Lorenzo Stoakes simplifies and cleans up zap_huge_pmd(). Additional cleanups around vm_normal_folio_pmd() and the softleaf functionality are performed. -----BEGIN PGP SIGNATURE----- iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCad3HDQAKCRDdBJ7gKXxA jrUQAPwNhPk5nPSxnyxjAeQtOBHqgCdnICeEismLajPKd9aYRgEA0s2XAu3tSUYi GrBnWImHG3s4ePQxVcPCegWTsOUrXgQ= =1Q7o -----END PGP SIGNATURE----- Merge tag 'mm-stable-2026-04-13-21-45' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: - "maple_tree: Replace big node with maple copy" (Liam Howlett) Mainly prepararatory work for ongoing development but it does reduce stack usage and is an improvement. - "mm, swap: swap table phase III: remove swap_map" (Kairui Song) Offers memory savings by removing the static swap_map. It also yields some CPU savings and implements several cleanups. - "mm: memfd_luo: preserve file seals" (Pratyush Yadav) File seal preservation to LUO's memfd code - "mm: zswap: add per-memcg stat for incompressible pages" (Jiayuan Chen) Additional userspace stats reportng to zswap - "arch, mm: consolidate empty_zero_page" (Mike Rapoport) Some cleanups for our handling of ZERO_PAGE() and zero_pfn - "mm/kmemleak: Improve scan_should_stop() implementation" (Zhongqiu Han) A robustness improvement and some cleanups in the kmemleak code - "Improve khugepaged scan logic" (Vernon Yang) Improve khugepaged scan logic and reduce CPU consumption by prioritizing scanning tasks that access memory frequently - "Make KHO Stateless" (Jason Miu) Simplify Kexec Handover by transitioning KHO from an xarray-based metadata tracking system with serialization to a radix tree data structure that can be passed directly to the next kernel - "mm: vmscan: add PID and cgroup ID to vmscan tracepoints" (Thomas Ballasi and Steven Rostedt) Enhance vmscan's tracepointing - "mm: arch/shstk: Common shadow stack mapping helper and VM_NOHUGEPAGE" (Catalin Marinas) Cleanup for the shadow stack code: remove per-arch code in favour of a generic implementation - "Fix KASAN support for KHO restored vmalloc regions" (Pasha Tatashin) Fix a WARN() which can be emitted the KHO restores a vmalloc area - "mm: Remove stray references to pagevec" (Tal Zussman) Several cleanups, mainly udpating references to "struct pagevec", which became folio_batch three years ago - "mm: Eliminate fake head pages from vmemmap optimization" (Kiryl Shutsemau) Simplify the HugeTLB vmemmap optimization (HVO) by changing how tail pages encode their relationship to the head page - "mm/damon/core: improve DAMOS quota efficiency for core layer filters" (SeongJae Park) Improve two problematic behaviors of DAMOS that makes it less efficient when core layer filters are used - "mm/damon: strictly respect min_nr_regions" (SeongJae Park) Improve DAMON usability by extending the treatment of the min_nr_regions user-settable parameter - "mm/page_alloc: pcp locking cleanup" (Vlastimil Babka) The proper fix for a previously hotfixed SMP=n issue. Code simplifications and cleanups ensued - "mm: cleanups around unmapping / zapping" (David Hildenbrand) A bunch of cleanups around unmapping and zapping. Mostly simplifications, code movements, documentation and renaming of zapping functions - "support batched checking of the young flag for MGLRU" (Baolin Wang) Batched checking of the young flag for MGLRU. It's part cleanups; one benchmark shows large performance benefits for arm64 - "memcg: obj stock and slab stat caching cleanups" (Johannes Weiner) memcg cleanup and robustness improvements - "Allow order zero pages in page reporting" (Yuvraj Sakshith) Enhance free page reporting - it is presently and undesirably order-0 pages when reporting free memory. - "mm: vma flag tweaks" (Lorenzo Stoakes) Cleanup work following from the recent conversion of the VMA flags to a bitmap - "mm/damon: add optional debugging-purpose sanity checks" (SeongJae Park) Add some more developer-facing debug checks into DAMON core - "mm/damon: test and document power-of-2 min_region_sz requirement" (SeongJae Park) An additional DAMON kunit test and makes some adjustments to the addr_unit parameter handling - "mm/damon/core: make passed_sample_intervals comparisons overflow-safe" (SeongJae Park) Fix a hard-to-hit time overflow issue in DAMON core - "mm/damon: improve/fixup/update ratio calculation, test and documentation" (SeongJae Park) A batch of misc/minor improvements and fixups for DAMON - "mm: move vma_(kernel\|mmu)_pagesize() out of hugetlb.c" (David Hildenbrand) Fix a possible issue with dax-device when CONFIG_HUGETLB=n. Some code movement was required. - "zram: recompression cleanups and tweaks" (Sergey Senozhatsky) A somewhat random mix of fixups, recompression cleanups and improvements in the zram code - "mm/damon: support multiple goal-based quota tuning algorithms" (SeongJae Park) Extend DAMOS quotas goal auto-tuning to support multiple tuning algorithms that users can select - "mm: thp: reduce unnecessary start_stop_khugepaged()" (Breno Leitao) Fix the khugpaged sysfs handling so we no longer spam the logs with reams of junk when starting/stopping khugepaged - "mm: improve map count checks" (Lorenzo Stoakes) Provide some cleanups and slight fixes in the mremap, mmap and vma code - "mm/damon: support addr_unit on default monitoring targets for modules" (SeongJae Park) Extend the use of DAMON core's addr_unit tunable - "mm: khugepaged cleanups and mTHP prerequisites" (Nico Pache) Cleanups to khugepaged and is a base for Nico's planned khugepaged mTHP support - "mm: memory hot(un)plug and SPARSEMEM cleanups" (David Hildenbrand) Code movement and cleanups in the memhotplug and sparsemem code - "mm: remove CONFIG_ARCH_ENABLE_MEMORY_HOTREMOVE and cleanup CONFIG_MIGRATION" (David Hildenbrand) Rationalize some memhotplug Kconfig support - "change young flag check functions to return bool" (Baolin Wang) Cleanups to change all young flag check functions to return bool - "mm/damon/sysfs: fix memory leak and NULL dereference issues" (Josh Law and SeongJae Park) Fix a few potential DAMON bugs - "mm/vma: convert vm_flags_t to vma_flags_t in vma code" (Lorenzo Stoakes) Convert a lot of the existing use of the legacy vm_flags_t data type to the new vma_flags_t type which replaces it. Mainly in the vma code. - "mm: expand mmap_prepare functionality and usage" (Lorenzo Stoakes) Expand the mmap_prepare functionality, which is intended to replace the deprecated f_op->mmap hook which has been the source of bugs and security issues for some time. Cleanups, documentation, extension of mmap_prepare into filesystem drivers - "mm/huge_memory: refactor zap_huge_pmd()" (Lorenzo Stoakes) Simplify and clean up zap_huge_pmd(). Additional cleanups around vm_normal_folio_pmd() and the softleaf functionality are performed. * tag 'mm-stable-2026-04-13-21-45' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (369 commits) mm: fix deferred split queue races during migration mm/khugepaged: fix issue with tracking lock mm/huge_memory: add and use has_deposited_pgtable() mm/huge_memory: add and use normal_or_softleaf_folio_pmd() mm: add softleaf_is_valid_pmd_entry(), pmd_to_softleaf_folio() mm/huge_memory: separate out the folio part of zap_huge_pmd() mm/huge_memory: use mm instead of tlb->mm mm/huge_memory: remove unnecessary sanity checks mm/huge_memory: deduplicate zap deposited table call mm/huge_memory: remove unnecessary VM_BUG_ON_PAGE() mm/huge_memory: add a common exit path to zap_huge_pmd() mm/huge_memory: handle buggy PMD entry in zap_huge_pmd() mm/huge_memory: have zap_huge_pmd return a boolean, add kdoc mm/huge: avoid big else branch in zap_huge_pmd() mm/huge_memory: simplify vma_is_specal_huge() mm: on remap assert that input range within the proposed VMA mm: add mmap_action_map_kernel_pages[_full]() uio: replace deprecated mmap hook with mmap_prepare in uio_info drivers: hv: vmbus: replace deprecated mmap hook with mmap_prepare mm: allow handling of stacked mmap_prepare hooks in more drivers ...	2026-04-15 12:59:16 -07:00
Filipe Manana	7801f3ea95	btrfs: tag as unlikely if statements that check for fs in error state Having the filesystem in an error state, meaning we had a transaction abort, is unexpected. Mark every check for the error state with the unlikely annotation to convey that and to allow the compiler to generate better code. On x86_64, using gcc 14.2.0-19 from Debian, resulted in a slightly reduced object size and better code. Before: $ size fs/btrfs/btrfs.ko text data bss dec hex filename 2008598 175912 15592 2200102 219226 fs/btrfs/btrfs.ko After: $ size fs/btrfs/btrfs.ko text data bss dec hex filename 2008450 175912 15592 2199954 219192 fs/btrfs/btrfs.ko Reviewed-by: Anand Jain <asj@kernel.org> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2026-04-07 19:41:42 +02:00
David Sterba	463626a2ec	btrfs: use common eb range validation in read_extent_buffer_to_user_nofault() The extent buffer access is checked in other helpers by check_eb_range(), which validates the requested start, length against the extent buffer. While this almost never fails we should still handle it as an error and not just warn. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>	2026-04-07 18:56:07 +02:00
David Sterba	b8aa337121	btrfs: read eb folio index right before loops There are generic helpers to access extent buffer folio data of any length, potentially iterating over a few of them. This is a slow path, either we use the type based accessors or the eb folio allocation is contiguous and we can use the memcpy/memcmp helpers. The initialization of 'i' is done at the beginning though it may not be needed. Move it right before the folio loop, this has minor effect on generated code in __write_extent_buffer(). Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>	2026-04-07 18:56:07 +02:00
Qu Wenruo	908ab56347	btrfs: remove atomic parameter from btrfs_buffer_uptodate() That parameter was introduced by commit `b9fab919b7` ("Btrfs: avoid sleeping in verify_parent_transid while atomic"). At that time we needed to lock the extent buffer range inside the io tree to avoid content changes, thus it could sleep. But that behavior is no longer there, as later commit `9e2aff90fc` ("btrfs: stop using lock_extent in btrfs_buffer_uptodate") dropped the io tree lock. We can remove the @atomic parameter safely now. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2026-04-07 18:56:02 +02:00
ZhengYuan Huang	f04c6475c2	btrfs: revalidate cached tree blocks on the uptodate path read_extent_buffer_pages_nowait() returns immediately when an extent buffer is already marked uptodate. On that cache-hit path, the caller supplied btrfs_tree_parent_check is not re-run. This can let read_tree_root_path() accept a cached tree block whose actual header level/owner does not match the expected value derived from the parent. E.g. a corrupted root item that points to a tree block which doesn't even belong to that root, and has mismatching level/owner. But that tree block is already read and cached, later the corrupted tree root got read from disk and hit the cached tree block. Fix this by re-validating cached extent buffers against the supplied btrfs_tree_parent_check on the uptodate path, and make read_tree_root_path() pass its check to btrfs_buffer_uptodate(). This makes cache hits and fresh reads follow the same tree-parent verification rules, and turns the corruption into a read failure instead of constructing an inconsistent root object. Signed-off-by: ZhengYuan Huang <gality369@gmail.com> Reviewed-by: Qu Wenruo <wqu@suse.com> [ Resolve the conflict with extent_buffer_uptodate() helper, handle transid mismatch case ] Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2026-04-07 18:56:02 +02:00
Qu Wenruo	9d7db41000	btrfs: move the mapping_set_error() out of the loop in end_bbio_data_write() Previously we have to call mapping_set_error() inside the for_each_folio_all() loop, because we do not have a better way to grab an inode, other than through folio->mapping. But nowadays every btrfs_bio has its inode member populated, thus we can easily grab the inode and its i_mapping easily, without the help from a folio. Now we can move that mapping_set_error() out of the loop, and use bbio->inode to grab the i_mapping. Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2026-04-07 18:56:00 +02:00
Qu Wenruo	99fe7e57d3	btrfs: remove the alignment check in end_bbio_data_write() The check is not necessary because: - There is already assert_bbio_alignment() at btrfs_submit_bbio() - There is also btrfs_subpage_assert() for all btrfs_folio_*() helpers - The original commit mentions the check may go away in the future Commit `17a5adccf3` ("btrfs: do away with non-whole_page extent I/O") introduced the check first, and in the commit message: I've replaced the whole_page computations with warnings, just to be sure that we're not issuing partial page reads or writes. The warnings should probably just go away some time. - No similar check in all other endio functions No matter if it's data read, compressed read or write. - There is no such report for very long I do not even remember if there is any such report. Thus the need to do such check in end_bbio_data_write() is very weak, and we can just get rid of it. Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2026-04-07 18:56:00 +02:00
Leo Martins	f9a48549a1	btrfs: inhibit extent buffer writeback to prevent COW amplification Inhibit writeback on COW'd extent buffers for the lifetime of the transaction handle, preventing background writeback from setting BTRFS_HEADER_FLAG_WRITTEN and causing unnecessary re-COW. COW amplification occurs when background writeback flushes an extent buffer that a transaction handle is still actively modifying. When lock_extent_buffer_for_io() transitions a buffer from dirty to writeback, it sets BTRFS_HEADER_FLAG_WRITTEN, marking the block as having been persisted to disk at its current bytenr. Once WRITTEN is set, should_cow_block() must either COW the block again or overwrite it in place, both of which are unnecessary overhead when the buffer is still being modified by the same handle that allocated it. By inhibiting background writeback on actively-used buffers, WRITTEN is never set while a transaction handle holds a reference to the buffer, avoiding this overhead entirely. Add an atomic_t writeback_inhibitors counter to struct extent_buffer, which fits in an existing 6-byte hole without increasing struct size. When a buffer is COW'd in btrfs_force_cow_block(), call btrfs_inhibit_eb_writeback() to store the eb in the transaction handle's writeback_inhibited_ebs xarray (keyed by eb->start), take a reference, and increment writeback_inhibitors. The function handles dedup (same eb inhibited twice by the same handle) and replacement (different eb at the same logical address). Allocation failure is graceful: the buffer simply falls back to the pre-existing behavior where it may be written back and re-COW'd. Also inhibit writeback in should_cow_block() when COW is skipped, so that every transaction handle that reuses an already-COW'd buffer also inhibits its writeback. Without this, if handle A COWs a block and inhibits it, and handle B later reuses the same block without inhibiting, handle A's uninhibit on end_transaction leaves the buffer unprotected while handle B is still using it. This ensures all handles that access a COW'd buffer contribute to the inhibitor count, and the buffer remains protected until the last handle releases it. In lock_extent_buffer_for_io(), when writeback_inhibitors is non-zero and the writeback mode is WB_SYNC_NONE, skip the buffer. WB_SYNC_NONE is used by the VM flusher threads for background and periodic writeback, which are the only paths that cause COW amplification by opportunistically writing out dirty extent buffers mid-transaction. Skipping these is safe because the buffers remain dirty in the page cache and will be written out at transaction commit time. WB_SYNC_ALL must always proceed regardless of writeback_inhibitors. This is required for correctness in the fsync path: btrfs_sync_log() writes log tree blocks via filemap_fdatawrite_range() (WB_SYNC_ALL) while the transaction handle that inhibited those same blocks is still active. Without the WB_SYNC_ALL bypass, those inhibited log tree blocks would be silently skipped, resulting in an incomplete log on disk and corruption on replay. btrfs_write_and_wait_transaction() also uses WB_SYNC_ALL via filemap_fdatawrite_range(); for that path, inhibitors are already cleared beforehand, but the bypass ensures correctness regardless. Uninhibit in __btrfs_end_transaction() before atomic_dec(num_writers) to prevent a race where the committer proceeds while buffers are still inhibited. Also uninhibit in btrfs_commit_transaction() before writing and in cleanup_transaction() for the error path. Reviewed-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Sun YangKai <sunk67188@gmail.com> Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Leo Martins <loemra.dev@gmail.com> Signed-off-by: David Sterba <dsterba@suse.com>	2026-04-07 18:56:00 +02:00
Qu Wenruo	a11d6912fd	btrfs: remove folio parameter from ordered io related functions Both functions btrfs_finish_ordered_extent() and btrfs_mark_ordered_io_finished() are accepting an optional folio parameter. That @folio is passed into can_finish_ordered_extent(), which later will test and clear the ordered flag for the involved range. However I do not think there is any other call site that can clear ordered flags of an page cache folio and can affect can_finish_ordered_extent(). There are limited _clear_ordered() callers out of can_finish_ordered_extent() function: - btrfs_migrate_folio() This is completely unrelated, it's just migrating the ordered flag to the new folio. - btrfs_cleanup_ordered_extents() We manually clean the ordered flags of all involved folios, then call btrfs_mark_ordered_io_finished() without a @folio parameter. So it doesn't need and didn't pass a @folio parameter in the first place. - btrfs_writepage_fixup_worker() This function is going to be removed soon, and we should not hit that function anymore. - btrfs_invalidate_folio() This is the real call site we need to bother with. If we already have a bio running, btrfs_finish_ordered_extent() in end_bbio_data_write() will be executed first, as btrfs_invalidate_folio() will wait for the writeback to finish. Thus if there is a running bio, it will not see the range has ordered flags, and just skip to the next range. If there is no bio running, meaning the ordered extent is created but the folio is not yet submitted. In that case btrfs_invalidate_folio() will manually clear the folio ordered range, but then manually finish the ordered extent with btrfs_dec_test_ordered_pending() without bothering the folio ordered flags. Meaning if the OE range with folio ordered flags will be finished manually without the need to call can_finish_ordered_extent(). This means all can_finish_ordered_extent() call sites should get a range that has folio ordered flag set, thus the old "return false" branch should never be triggered. Now we can: - Remove the @folio parameter from involved functions btrfs_mark_ordered_io_finished() * btrfs_finish_ordered_extent() For call sites passing a @folio into those functions, let them manually clear the ordered flag of involved folios. - Move btrfs_finish_ordered_extent() out of the loop in end_bbio_data_write() We only need to call btrfs_finish_ordered_extent() once per bbio, not per folio. - Add an ASSERT() to make sure all folio ranges have ordered flags It's only for end_bbio_data_write(). And we already have enough safe nets to catch over-accounting of ordered extents. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2026-04-07 18:55:57 +02:00
Qu Wenruo	e677ccac5a	btrfs: remove out-of-date comments in btree_writepages() There is a lengthy comment introduced in commit `b3ff8f1d38` ("btrfs: Don't submit any btree write bio if the fs has errors") and commit `c9583ada8c` ("btrfs: avoid double clean up when submit_one_bio() failed"), explaining two things: - Why we don't want to submit metadata write if the fs has errors - Why we re-set @ret to 0 if it's positive However it's no longer uptodate by the following reasons: - We have better checks nowadays Commit `2618849f31` ("btrfs: ensure no dirty metadata is written back for an fs with errors") has introduced better checks, that if the fs is in an error state, metadata writes will not result in any bio but instead complete immediately. That covers all metadata writes better. - Mentioned incorrect function name The commit `c9583ada8c` ("btrfs: avoid double clean up when submit_one_bio() failed") introduced this ret > 0 handling, but at that time the function name submit_extent_page() was already incorrect. It was submit_eb_page() that could return >0 at that time, and submit_extent_page() could only return 0 or <0 for errors, never >0. Later commit `b35397d1d3` ("btrfs: convert submit_extent_page() to use a folio") changed "submit_extent_page()" to "submit_extent_folio()" in the comment, but it doesn't make any difference since the function name is wrong from day 1. Finally commit `5e121ae687` ("btrfs: use buffer xarray for extent buffer writeback operations") completely reworked how metadata writeback works, and removed submit_eb_page(), leaving only the wrong function name in the comment. Furthermore the function submit_extent_folio() still exists in the latest code base, but is never utilized for metadata writeback, causing more confusion. Just remove the lengthy comment, and replace the "if (ret > 0)" check with an ASSERT(), since only btrfs_check_meta_write_pointer() can modify @ret and it returns 0 or <0 for errors. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2026-04-07 18:55:57 +02:00
Filipe Manana	6ee5c986b0	btrfs: use the helper extent_buffer_uptodate() everywhere Instead of open coding testing the uptodate bit on the extent buffer's flags, use the existing helper extent_buffer_uptodate() (which is even shorter to type). Also change the helper's return value from int to bool, since we always use it in a boolean context. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2026-04-07 18:55:55 +02:00
Tal Zussman	511f04aac4	folio_batch: rename PAGEVEC_SIZE to FOLIO_BATCH_SIZE struct pagevec no longer exists. Rename the macro appropriately. Link: https://lkml.kernel.org/r/20260225-pagevec_cleanup-v2-4-716868cc2d11@columbia.edu Signed-off-by: Tal Zussman <tz2294@columbia.edu> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Reviewed-by: Jan Kara <jack@suse.cz> Acked-by: Zi Yan <ziy@nvidia.com> Reviewed-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Cc: Chris Li <chrisl@kernel.org> Cc: Christian Brauner <brauner@kernel.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-04-05 13:53:07 -07:00
Tal Zussman	4e1d77a8f3	folio_batch: rename pagevec.h to folio_batch.h struct pagevec was removed in commit `1e0877d58b` ("mm: remove struct pagevec"). Rename include/linux/pagevec.h to reflect reality and update includes tree-wide. Add the new filename to MAINTAINERS explicitly, as it no longer matches the "include/linux/page[-_]*" pattern in MEMORY MANAGEMENT - CORE. Link: https://lkml.kernel.org/r/20260225-pagevec_cleanup-v2-3-716868cc2d11@columbia.edu Signed-off-by: Tal Zussman <tz2294@columbia.edu> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Reviewed-by: Jan Kara <jack@suse.cz> Acked-by: Zi Yan <ziy@nvidia.com> Reviewed-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Cc: Chris Li <chrisl@kernel.org> Cc: Christian Brauner <brauner@kernel.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-04-05 13:53:07 -07:00
Linus Torvalds	e0b38d286e	for-7.0-rc3-tag -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmmytygACgkQxWXV+ddt WDuiKw/+IyoS2IE95Jd4G2hezcsaqJNAvv5q90vVyLepT/7jzRbYrHytk7z/0v56 +Pjc1JHgp+TidYKZa52E/R09eEvYCyuZvEq4bnClOnO3CAJDCixqTKWV70CcYiHX HoSCuPML2CJMLZY3u6MxATgk46y1ey3UkbmQ4oufoUHrjAE+D3pDQsYQ9BWR/P6J 4LbyZ214uIUPvbp0wPJ14cVUMpNxs116JmWvK9dxWQNZSzFcq2IVuLwQUcZBPAdb dursd0kt6HXXmNCITmUD8O+ipG4U1akn8FuCjaLqo4LF1AH2f6OzNjucsisfa1tE 4MD+ATzmNsew5q6dtyfcvv8Btl+olbTP4KGibLl9PCCM9vlkzl3EN5GPUllGi6S9 T8jqe2pILXZHEx1DIQjHaXJsnuHG9aEkUbkSHUxIa15QDj66omrZZkZG3EF5Buy9 TogJJgESYU19dt/y110Q/vD/LPOgJhB3NBXwIx1FDYA1OwaAjt6hUcVRVRI5PtpL moAIG4nNrPz5Pa875NvguFmrXFIudpTqyANHCPVio4eqDR7LeSg9vVHj9zeOfeRD gmn3WGuGNb6L+86TNet/nFQhjxhkKqsnSbzO2zoPQKhG6FS+HSLnUwvJYL4/YnkW QEeveeI/6Duiei4DHuw7FXmZVNQxeBEWW6MTFDHA1UNV3HMKewA= =TgSD -----END PGP SIGNATURE----- Merge tag 'for-7.0-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: - detect possible file name hash collision earlier so it does not lead to transaction abort - handle b-tree leaf overflows when snapshotting a subvolume with set received UUID, leading to transaction abort - in zoned mode, reorder relocation block group initialization after the transaction kthread start - fix orphan cleanup state tracking of subvolume, this could lead to invalid dentries under some conditions - add locking around updates of dynamic reclain state update - in subpage mode, add missing RCU unlock when trying to releae extent buffer - remap tree fixes: - add missing description strings for the newly added remap tree - properly update search key when iterating backrefs * tag 'for-7.0-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: remove duplicated definition of btrfs_printk_in_rcu() btrfs: remove unnecessary transaction abort in the received subvol ioctl btrfs: abort transaction on failure to update root in the received subvol ioctl btrfs: fix transaction abort on set received ioctl due to item overflow btrfs: fix transaction abort when snapshotting received subvolumes btrfs: fix transaction abort on file creation due to name hash collision btrfs: read key again after incrementing slot in move_existing_remaps() btrfs: add missing RCU unlock in error path in try_release_subpage_extent_buffer() btrfs: set BTRFS_ROOT_ORPHAN_CLEANUP during subvol create btrfs: zoned: move btrfs_zoned_reserve_data_reloc_bg() after kthread start btrfs: hold space_info->lock when clearing periodic reclaim ready btrfs: print-tree: add remap tree definitions	2026-03-12 12:15:27 -07:00
Bart Van Assche	b2840e3312	btrfs: add missing RCU unlock in error path in try_release_subpage_extent_buffer() Call rcu_read_lock() before exiting the loop in try_release_subpage_extent_buffer() because there is a rcu_read_unlock() call past the loop. This has been detected by the Clang thread-safety analyzer. Fixes: `ad580dfa38` ("btrfs: fix subpage deadlock in try_release_subpage_extent_buffer()") CC: stable@vger.kernel.org # 6.18+ Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2026-03-03 17:03:51 +01:00
Linus Torvalds	997f9640c9	fsverity updates for 7.0 fsverity cleanups, speedup, and memory usage optimization from Christoph Hellwig: - Move some logic into common code - Fix btrfs to reject truncates of fsverity files - Improve the readahead implementation - Store each inode's fsverity_info in a hash table instead of using a pointer in the filesystem-specific part of the inode. This optimizes for memory usage in the usual case where most files don't have fsverity enabled. - Look up the fsverity_info fewer times during verification, to amortize the hash table overhead -----BEGIN PGP SIGNATURE----- iIoEABYIADIWIQSacvsUNc7UX4ntmEPzXCl4vpKOKwUCaY0nZhQcZWJpZ2dlcnNA a2VybmVsLm9yZwAKCRDzXCl4vpKOK/AVAP9wSLEYsG3dqnNIHjIvLeK+9NC3Ni4d m+fvT1JfuideOwEA9r2EfztusLU5iyqWJlHyxekibXItUDgYGltaYb7eXAU= =a+To -----END PGP SIGNATURE----- Merge tag 'fsverity-for-linus' of git://git.kernel.org/pub/scm/fs/fsverity/linux Pull fsverity updates from Eric Biggers: "fsverity cleanups, speedup, and memory usage optimization from Christoph Hellwig: - Move some logic into common code - Fix btrfs to reject truncates of fsverity files - Improve the readahead implementation - Store each inode's fsverity_info in a hash table instead of using a pointer in the filesystem-specific part of the inode. This optimizes for memory usage in the usual case where most files don't have fsverity enabled. - Look up the fsverity_info fewer times during verification, to amortize the hash table overhead" * tag 'fsverity-for-linus' of git://git.kernel.org/pub/scm/fs/fsverity/linux: fsverity: remove inode from fsverity_verification_ctx fsverity: use a hashtable to find the fsverity_info btrfs: consolidate fsverity_info lookup f2fs: consolidate fsverity_info lookup ext4: consolidate fsverity_info lookup fs: consolidate fsverity_info lookup in buffer.c fsverity: push out fsverity_info lookup fsverity: deconstify the inode pointer in struct fsverity_info fsverity: kick off hash readahead at data I/O submission time ext4: move ->read_folio and ->readahead to readpage.c readahead: push invalidate_lock out of page_cache_ra_unbounded fsverity: don't issue readahead for non-ENOENT errors from __filemap_get_folio fsverity: start consolidating pagecache code fsverity: pass struct file to ->write_merkle_tree_block f2fs: don't build the fsverity work handler for !CONFIG_FS_VERITY ext4: don't build the fsverity work handler for !CONFIG_FS_VERITY fs,fsverity: clear out fsverity_info from common code fs,fsverity: reject size changes on fsverity files in setattr_prepare	2026-02-12 10:41:34 -08:00
Christoph Hellwig	b0160e4501	btrfs: consolidate fsverity_info lookup Look up the fsverity_info once in btrfs_do_readpage, and then use it for all operations performed there, and do the same in end_folio_read for all folios processed there. The latter is also changed to derive the inode from the btrfs_bio - while bbio->inode is optional, it is always set for buffered reads. This amortizes the lookup better once it becomes less efficient. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: David Sterba <dsterba@suse.com> Link: https://lore.kernel.org/r/20260202060754.270269-11-hch@lst.de Signed-off-by: Eric Biggers <ebiggers@kernel.org>	2026-02-04 11:31:54 -08:00
Filipe Manana	bb09b9a491	btrfs: remove out_failed label in find_lock_delalloc_range() There is no point in having the label since all it does is return the value in the 'found' variable. Instead make every goto return directly and remove the label. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2026-02-03 07:56:22 +01:00
Filipe Manana	ea7ab405c5	btrfs: use the btrfs_extent_map_end() helper everywhere We have a helper to calculate an extent map's exclusive end offset, but we only use it in some places. Update every site that open codes the calculation to use the helper. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2026-02-03 07:54:36 +01:00
Qu Wenruo	e6698b34fa	btrfs: replace for_each_set_bit() with for_each_set_bitmap() Inside extent_io.c, there are several simple call sites doing things like: for_each_set_bit(bit, bitmap, bitmap_size) { /* handle one fs block */ } The workload includes: - set_bit() Inside extent_writepage_io(). This can be replaced with a bitmap_set(). - btrfs_folio_set_lock() - btrfs_mark_ordered_io_finished() Inside writepage_delalloc(). Instead of calling it multiple times, we can pass a range into the function with one call. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2026-02-03 06:38:32 +01:00
Qu Wenruo	44820d8002	btrfs: concentrate the error handling of submit_one_sector() Currently submit_one_sector() has only one failure path from btrfs_get_extent(). However the error handling is split into two parts, one inside submit_one_sector(), which clears the dirty flag and finishes the writeback for the fs block. The other part is to submit any remaining bio inside bio_ctrl and mark the ordered extent finished for the fs block. There is no special reason that we must split the error handling, let's just concentrate all the error handling into submit_one_sector(). Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2026-02-03 06:38:32 +01:00
Qu Wenruo	3970da5c3b	btrfs: search for larger extent maps inside btrfs_do_readpage() [CORNER CASE] If we have the following file extents layout, btrfs_get_extent() can return a smaller hole during read, and cause unnecessary extra tree searches: item 6 key (257 EXTENT_DATA 0) itemoff 15810 itemsize 53 generation 9 type 1 (regular) extent data disk byte 13631488 nr 4096 extent data offset 0 nr 4096 ram 4096 extent compression 0 (none) item 7 key (257 EXTENT_DATA 32768) itemoff 15757 itemsize 53 generation 9 type 1 (regular) extent data disk byte 13635584 nr 4096 extent data offset 0 nr 4096 ram 4096 extent compression 0 (none) In above case, range [0, 4K) and [32K, 36K) are regular extents, and there is a hole in range [4K, 32K), and the fs has "no-holes" feature, meaning the hole will not have a file extent item. [INEFFICIENCY] Assume the system has 4K page size, and we're doing readahead for range [4K, 32K), no large folio yet. btrfs_readahead() for range [4K, 32K) \|- btrfs_do_readpage() for folio 4K \| \|- get_extent_map() for range [4K, 8K) \| \|- btrfs_get_extent() for range [4K, 8K) \| We hit item 6, then for the next item 7. \| At this stage we know range [4K, 32K) is a hole. \| But our search range is only [4K, 8K), not reaching 32K, thus \| we go into not_found: tag, returning a hole em for [4K, 8K). \| \|- btrfs_do_readpage() for folio 8K \| \|- get_extent_map() for range [8K, 12K) \| \|- btrfs_get_extent() for range [8K, 12K) \| We hit the same item 6, and then item 7. \| But still we goto not_found tag, inserting a new hole em, \| which will be merged with previous one. \| \| [ Repeat the same btrfs_get_extent() calls until the end ] So we're calling btrfs_get_extent() again and again, just for a different part of the same hole range [4K, 32K). [ENHANCEMENT] Make btrfs_do_readpage() to search for a larger extent map if readahead is involved. For btrfs_readahead() we have bio_ctrl::ractl set, and lock extents for the whole readahead range. If we find bio_ctrl::ractl is set, we can use that end range as extent map search end, this allows btrfs_get_extent() to return a much larger hole, thus reduce the need to call btrfs_get_extent() again and again. btrfs_readahead() for range [4K, 32K) \|- btrfs_do_readpage() for folio 4K \| \|- get_extent_map() for range [4K, 32K) \| \|- btrfs_get_extent() for range [4K, 32K) \| We hit item 6, then for the next item 7. \| At this stage we know range [4K, 32K) is a hole. \| So the hole em for range [4K, 32K) is returned. \| \|- btrfs_do_readpage() for folio 8K \| \|- get_extent_map() for range [8K, 32K) \| The cached hole em range [4K, 32K) covers the range, \| and reuse that em. \| \| [ Repeat the same btrfs_get_extent() calls until the end ] Now we only call btrfs_get_extent() once for the whole range [4K, 32K), other than the old 8 times. Such change will reduce the overhead of reading large holes a little. For current experimental build (with larger folios) on aarch64, there will be a tiny but consistent ~1% improvement reading a large hole file: Reading a 1GiB sparse file (all hole) using xfs_io, with 64K block size, the result is the time needed to read the whole file, reported from xfs_io. 32 runs, experimental build (with large folios). 64K page size, 4K fs block size. - Avg before: 0.20823 s - Avg after: 0.20635 s - Diff: -0.9% Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2026-02-03 06:38:32 +01:00
Christoph Hellwig	47bc2ac9b6	fsverity: push out fsverity_info lookup Pass a struct fsverity_info to the verification and readahead helpers, and push the lookup into the callers. Right now this is a very dumb almost mechanic move that open codes a lot of fsverity_info_addr() calls in the file systems. The subsequent patches will clean this up. This prepares for reducing the number of fsverity_info lookups, which will allow to amortize them better when using a more expensive lookup method. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Acked-by: David Sterba <dsterba@suse.com> # btrfs Link: https://lore.kernel.org/r/20260202060754.270269-7-hch@lst.de Signed-off-by: Eric Biggers <ebiggers@kernel.org>	2026-02-02 17:15:26 -08:00
Linus Torvalds	e829083bc4	for-6.19-rc7-tag -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAml7a7AACgkQxWXV+ddt WDt5pA//RX16rimnWzNzWZOWAAKhwrhJuand4Bn3qBVLGPRTPgPxiEVk5eTiMAy7 gZfiT6R0AhAlNHeT55D9AAYnrAE0f/QObhKEUuJgFQPMt8EOgfmjUJ54Eilhz5R1 18tyN4H6WDm6ckth8Pkwy/rcSKPYTlGbyv2te4q/LSXRNgOVVAL8fwMkXgV2mZca SW76xCoKjKXLgRbxihzI+TWSS8M9WRiX+qefL/apJ0Xzkt9e620uf8vkA3jEuMn9 6NhoXuqWpWb1c3TL7fdziGm2BTXqr9Sxh9KuyVA+182hs8SgXMza+k3Vc1aw+r1F WzWvxkqLAsybT4X5NdkKXNCdDStztO+OmS//c9+l/Hlo7jshMtIOAsdaFC8E45n4 50r0K4d7XNdHRukb5zg5i60TlnzJJLx7c7X4V1vc/vSV0oGafDuKn/73P4uBppEg rQYp0/Ed4KuxbiTneB7ZdPXoKo6H2jnQizo00v9OfE09wXjsRQDcxTsM8aVPqCpN 4xUcFdeYk76jmRouvH4RX6JcZk+VBSSqPE879JhR+Nb7szxYdtjXuFe0opX27Oil 30c/4qAW/ljd8gxpXizz4JMZOMrxJwICQ+S3FW55uKLgiuHXOWmMPlsVD+ksUB2z FMOa8n9jatDf+xEMI9Mjm61DyqMgQ47uSrPpb94rsDlutidCs8Q= =2Wjs -----END PGP SIGNATURE----- Merge tag 'for-6.19-rc7-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: - fix leaked folio refcount on s390x when using hw zlib compression acceleration - remove own threshold from ->writepages() which could collide with cgroup limits and lead to a deadlock when metadadata are not written because the amount is under the internal limit * tag 'for-6.19-rc7-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: zlib: fix the folio leak on S390 hardware acceleration btrfs: do not strictly require dirty metadata threshold for metadata writepages	2026-01-29 09:07:17 -08:00
Qu Wenruo	4e159150a9	btrfs: do not strictly require dirty metadata threshold for metadata writepages [BUG] There is an internal report that over 1000 processes are waiting at the io_schedule_timeout() of balance_dirty_pages(), causing a system hang and trigger a kernel coredump. The kernel is v6.4 kernel based, but the root problem still applies to any upstream kernel before v6.18. [CAUSE] From Jan Kara for his wisdom on the dirty page balance behavior first. This cgroup dirty limit was what was actually playing the role here because the cgroup had only a small amount of memory and so the dirty limit for it was something like 16MB. Dirty throttling is responsible for enforcing that nobody can dirty (significantly) more dirty memory than there's dirty limit. Thus when a task is dirtying pages it periodically enters into balance_dirty_pages() and we let it sleep there to slow down the dirtying. When the system is over dirty limit already (either globally or within a cgroup of the running task), we will not let the task exit from balance_dirty_pages() until the number of dirty pages drops below the limit. So in this particular case, as I already mentioned, there was a cgroup with relatively small amount of memory and as a result with dirty limit set at 16MB. A task from that cgroup has dirtied about 28MB worth of pages in btrfs btree inode and these were practically the only dirty pages in that cgroup. So that means the only way to reduce the dirty pages of that cgroup is to writeback the dirty pages of btrfs btree inode, and only after that those processes can exit balance_dirty_pages(). Now back to the btrfs part, btree_writepages() is responsible for writing back dirty btree inode pages. The problem here is, there is a btrfs internal threshold that if the btree inode's dirty bytes are below the 32M threshold, it will not do any writeback. This behavior is to batch as much metadata as possible so we won't write back those tree blocks and then later re-COW them again for another modification. This internal 32MiB is higher than the existing dirty page size (28MiB), meaning no writeback will happen, causing a deadlock between btrfs and cgroup: - Btrfs doesn't want to write back btree inode until more dirty pages - Cgroup/MM doesn't want more dirty pages for btrfs btree inode Thus any process touching that btree inode is put into sleep until the number of dirty pages is reduced. Thanks Jan Kara a lot for the analysis of the root cause. [ENHANCEMENT] Since kernel commit `b55102826d` ("btrfs: set AS_KERNEL_FILE on the btree_inode"), btrfs btree inode pages will only be charged to the root cgroup which should have a much larger limit than btrfs' 32MiB threshold. So it should not affect newer kernels. But for all current LTS kernels, they are all affected by this problem, and backporting the whole AS_KERNEL_FILE may not be a good idea. Even for newer kernels I still think it's a good idea to get rid of the internal threshold at btree_writepages(), since for most cases cgroup/MM has a better view of full system memory usage than btrfs' fixed threshold. For internal callers using btrfs_btree_balance_dirty() since that function is already doing internal threshold check, we don't need to bother them. But for external callers of btree_writepages(), just respect their requests and write back whatever they want, ignoring the internal btrfs threshold to avoid such deadlock on btree inode dirty page balancing. CC: stable@vger.kernel.org CC: Jan Kara <jack@suse.cz> Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2026-01-21 19:35:00 +01:00
Linus Torvalds	7f98ab9da0	for-6.19-rc4-tag -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmlb99EACgkQxWXV+ddt WDtJfQ//cppHAHSxb3NNDGXDiKx4ccCp9CWiOF7z+BTFfngsNGvbs2FzKFnYI2f3 dT/DlPV8uBgVX3uYL3ZI1na/5MShXvS+sajIRhz3woyKBb2shVqVnFmfA8A3pKf6 3Dfm6FWrJHGCgV28Oi5pbg/UQeTAHAmA2aPLYJKRnNwIq8pSSzDWRCVNFfYrt4o2 7UUW1PzasZ7tuqL55HcwzuXjVTYr/t3puLjq+ydVfGSJSZlmlMd3pnZXz8S7/BC6 jVQGOT6nK9SWCnfXD9plqqr4CY+ThJZJNSdhVTwfVxkxVHmEBWfqfhAToqZaLKX9 co3rXvvZyIQf5KeHMmtbb2P736zaAcKb7G41liRN7EZg/gOsROE+UziYRkTg+Xyg rztTksc913DsuHj19sZhIgcKRcym2h57wyZyt7vYAdsv9uksLUgKUo3U9CiTbEsb 8d/vgt1e3+ELoVcc+xVZSSGRDVzvZnxVmRHQV2dAtIXK34FXzqCDeKnFG0wsjqtF Kw6bV93cXLohfcB7fPPBdAHzVN89kfUXTBT8mrri7HnjSnZTJNeHrGpcRNNQ76BT 8RL6gSP32Mpo9HZOYYhl1Xj2hRonRiJrUQAb6x9CY1MMUP2vwVvVBUVj2NAohWdM vAYwRQDigw92RoKIYvHu+X+E5PXgX2AQ9NV8qiL79od+A7NFLgY= =hmbc -----END PGP SIGNATURE----- Merge tag 'for-6.19-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: - fix potential deadlock due to mismatching transaction states when waiting for the current transaction - fix squota accounting with nested snapshots - fix quota inheritance of qgroups with multiple parent qgroups - fix NULL inode pointer in evict tracepoint - fix writes beyond end of file on systems with 64K page size and 4K block size - fix logging of inodes after exchange rename - fix use after free when using ref_tracker feature - space reservation fixes * tag 'for-6.19-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: fix reservation leak in some error paths when inserting inline extent btrfs: do not free data reservation in fallback from inline due to -ENOSPC btrfs: fix use-after-free warning in btrfs_get_or_create_delayed_node() btrfs: always detect conflicting inodes when logging inode refs btrfs: fix beyond-EOF write handling btrfs: fix deadlock in wait_current_trans() due to ignored transaction type btrfs: fix NULL dereference on root when tracing inode eviction btrfs: qgroup: update all parent qgroups when doing quick inherit btrfs: fix qgroup_snapshot_quick_inherit() squota bug	2026-01-05 14:10:48 -08:00
Qu Wenruo	e9e3b22ddf	btrfs: fix beyond-EOF write handling [BUG] For the following write sequence with 64K page size and 4K fs block size, it will lead to file extent items to be inserted without any data checksum: mkfs.btrfs -s 4k -f $dev > /dev/null mount $dev $mnt xfs_io -f -c "pwrite 0 16k" -c "pwrite 32k 4k" -c pwrite "60k 64K" \ -c "truncate 16k" $mnt/foobar umount $mnt This will result the following 2 file extent items to be inserted (extra trace point added to insert_ordered_extent_file_extent()): btrfs_finish_one_ordered: root=5 ino=257 file_off=61440 num_bytes=4096 csum_bytes=0 btrfs_finish_one_ordered: root=5 ino=257 file_off=0 num_bytes=16384 csum_bytes=16384 Note for file offset 60K, we're inserting a file extent without any data checksum. Also note that range [32K, 36K) didn't reach insert_ordered_extent_file_extent(), which is the correct behavior as that OE is fully truncated, should not result any file extent. Although file extent at 60K will be later dropped by btrfs_truncate(), if the transaction got committed after file extent inserted but before the file extent dropping, we will have a small window where we have a file extent beyond EOF and without any data checksum. That will cause "btrfs check" to report error. [CAUSE] The sequence happens like this: - Buffered write dirtied the page cache and updated isize Now the inode size is 64K, with the following page cache layout: 0 16K 32K 48K 64K \|/////////////\| \|//\| \|//\| - Truncate the inode to 16K Which will trigger writeback through: btrfs_setsize() \|- truncate_setsize() \| Now the inode size is set to 16K \| \|- btrfs_truncate() \|- btrfs_wait_ordered_range() for [16K, u64(-1)] \|- btrfs_fdatawrite_range() for [16K, u64(-1)} \|- extent_writepage() for folio 0 \|- writepage_delalloc() \| Generated OE for [0, 16K), [32K, 36K] and [60K, 64K) \| \|- extent_writepage_io() Then inside extent_writepage_io(), the dirty fs blocks are handled differently: - Submit write for range [0, 16K) As they are still inside the inode size (16K). - Mark OE [32K, 36K) as truncated Since we only call btrfs_lookup_first_ordered_range() once, which returned the first OE after file offset 16K. - Mark all OEs inside range [16K, 64K) as finished Which will mark OE ranges [32K, 36K) and [60K, 64K) as finished. For OE [32K, 36K) since it's already marked as truncated, and its truncated length is 0, no file extent will be inserted. For OE [60K, 64K) it has never been submitted thus has no data checksum, and we insert the file extent as usual. This is the root cause of file extent at 60K to be inserted without any data checksum. - Clear dirty flags for range [16K, 64K) It is the function btrfs_folio_clear_dirty() which searches and clears any dirty blocks inside that range. [FIX] The bug itself was introduced a long time ago, way before subpage and large folio support. At that time, fs block size must match page size, thus the range [cur, end) is just one fs block. But later with subpage and large folios, the same range [cur, end) can have multiple blocks and ordered extents. Later commit `18de34daa7` ("btrfs: truncate ordered extent when skipping writeback past i_size") was fixing a bug related to subpage/large folios, but it's still utilizing the old range [cur, end), meaning only the first OE will be marked as truncated. The proper fix here is to make EOF handling block-by-block, not trying to handle the whole range to @end. By this we always locate and truncate the OE for every dirty block. CC: stable@vger.kernel.org # 5.15+ Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-12-16 22:53:14 +01:00
Linus Torvalds	7696286034	for-6.19-tag -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmko/DUACgkQxWXV+ddt WDtyCw//UaFOTX/k72HgA1n2MWfegWbWyD+OGNbGosoZljKOrAe/mRnjXTyF9lyW 8GDzGvJzF4Tkl5lyuGyiequlrO2F3veTpwHo94xnBTOYCeiTpMqTN/e/5SkasBpN 4YlWq7OGYR4hwghRvZpaW7nsmVCKDLIlZVkH77x9Bmvx0NLO24EJlEZusQT4zYew ntC/i9x3DW0ZxYyfRhFIFvk9JUUdgXfxJ6dNexz0zi3dKUSUIR9hI0J9Nwl++1cF SgjAzbtO064htWoCvsKykgA6YGbJCZjw8XO8D2eJonkN24VbqSMaY44TPXmCMLVs ZXw871jV2E/urfWhRNdxv/kJdCFudPk0qXG5ZtfHO4UUwS/nZ+qAig+LHawgAOCJ 9CgWy4zrfiYCqULRuqF1wzWu/z22++zIlZC552VAZd1RQ+JjqJY/aje4xhY5nUF4 n1uVBReZaI9sH3jJOsMWpwLMptbhpH9RZp3QPgqZlUHo6GtPJJmNKfw8KgMAhZ7L wf7iy6v9yo+7VZ2ACwu2qJ+lZRxbZ0yvCnFatN3O5G1O0kkIrZFUM3MwdKtufZ0u LHWkGfoaq7zR6E6DhIaxIhiTTXMlOfLTikNKgBUO3NEdrRZwrDhr7K07S25jFxSx ZCNV6OdSCeziShPqT0ntcwecnJ41/kOcm13732NHF+QgzMK5LrI= =rO4x -----END PGP SIGNATURE----- Merge tag 'for-6.19-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs updates from David Sterba: "Features: - shutdown ioctl support (needs CONFIG_BTRFS_EXPERIMENTAL for now): - set filesystem state as being shut down (also named going down in other filesystems), where all active operations return EIO and this cannot be changed until unmount - pending operations are attempted to be finished but error messages may still show up depending on where exactly the shutdown happened - scrub (and device replace) vs suspend/hibernate: - a running scrub will prevent suspend, which can be annoying as suspend is an immediate request and scrub is not critical - filesystem freezing before suspend was not sufficient as the problem was in process freezing - behaviour change: on suspend scrub and device replace are cancelled, where scrub can record the last state and continue from there; the device replace has to be restarted from the beginning - zone stats exported in sysfs, from the perspective of the filesystem this includes active, reclaimable, relocation etc zones Performance: - improvements when processing space reservation tickets by optimizing locking and shrinking critical sections, cumulative improvements in lockstat numbers show +15% Notable fixes: - use vmalloc fallback when allocating bios as high order allocations can happen with wide checksums (like sha256) - scrub will always track the last position of progress so it's not starting from zero after an error Core: - under experimental config, checksum calculations are offloaded to process context, simplifies locking and allows to remove compression write worker kthread(s): - speed improvement in direct IO throughput with buffered IO fallback is +15% when not offloaded but this is more related to internal crypto subsystem improvements - this will be probably default in the future removing the sysfs tunable - (experimental) block size > page size updates: - support more operations when not using large folios (encoded read/write and send) - raid56 - more preparations for fscrypt support Other: - more conversions to auto-cleaned variables - parameter cleanups and removals - extended warning fixes - improved printing of structured values like keys - lots of other cleanups and refactoring" * tag 'for-6.19-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (147 commits) btrfs: remove unnecessary inode key in btrfs_log_all_parents() btrfs: remove redundant zero/NULL initializations in btrfs_alloc_root() btrfs: remaining BTRFS_PATH_AUTO_FREE conversions btrfs: send: do not allocate memory for xattr data when checking it exists btrfs: send: add unlikely to all unexpected overflow checks btrfs: reduce arguments to btrfs_del_inode_ref_in_log() btrfs: remove root argument from btrfs_del_dir_entries_in_log() btrfs: use test_and_set_bit() in btrfs_delayed_delete_inode_ref() btrfs: don't search back for dir inode item in INO_LOOKUP_USER btrfs: don't rewrite ret from inode_permission btrfs: add orig_logical to btrfs_bio for encryption btrfs: disable verity on encrypted inodes btrfs: disable various operations on encrypted inodes btrfs: remove redundant level reset in btrfs_del_items() btrfs: simplify leaf traversal after path release in btrfs_next_old_leaf() btrfs: optimize balance_level() path reference handling btrfs: factor out root promotion logic into promote_child_to_root() btrfs: raid56: remove the "_step" infix btrfs: raid56: enable bs > ps support btrfs: raid56: prepare finish_parity_scrub() to support bs > ps cases ...	2025-12-03 20:03:46 -08:00
Linus Torvalds	f2e74ecfba	vfs-6.19-rc1.folio -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCaSmOZQAKCRCRxhvAZXjc onGBAQDtqeO0jZzS7q9UxlJ84Wj/H9w+9INpO4jMxtWK4svhUAEAghG4qVxRvkE2 Qh+wrpTPIC7OCQ78k8psDRmkj9cn8QA= =FCVN -----END PGP SIGNATURE----- Merge tag 'vfs-6.19-rc1.folio' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull folio updates from Christian Brauner: "Add a new folio_next_pos() helper function that returns the file position of the first byte after the current folio. This is a common operation in filesystems when needing to know the end of the current folio. The helper is lifted from btrfs which already had its own version, and is now used across multiple filesystems and subsystems: - btrfs - buffer - ext4 - f2fs - gfs2 - iomap - netfs - xfs - mm This fixes a long-standing bug in ocfs2 on 32-bit systems with files larger than 2GiB. Presumably this is not a common configuration, but the fix is backported anyway. The other filesystems did not have bugs, they were just mildly inefficient. This also introduce uoff_t as the unsigned version of loff_t. A recent commit inadvertently changed a comparison from being unsigned (on 64-bit systems) to being signed (which it had always been on 32-bit systems), leading to sporadic fstests failures. Generally file sizes are restricted to being a signed integer, but in places where -1 is passed to indicate "up to the end of the file", it is convenient to have an unsigned type to ensure comparisons are always unsigned regardless of architecture" * tag 'vfs-6.19-rc1.folio' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: fs: Add uoff_t mm: Use folio_next_pos() xfs: Use folio_next_pos() netfs: Use folio_next_pos() iomap: Use folio_next_pos() gfs2: Use folio_next_pos() f2fs: Use folio_next_pos() ext4: Use folio_next_pos() buffer: Use folio_next_pos() btrfs: Use folio_next_pos() filemap: Add folio_next_pos()	2025-12-01 10:26:38 -08:00
Linus Torvalds	ebaeabfa5a	vfs-6.19-rc1.writeback -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCaSmOZQAKCRCRxhvAZXjc or4UAP9FbpFsZd0DpsYnKuv7kFepl291PuR0x2dKmseJ/wcf8AEAzI8FR5wd/fey 25ZNdExoUojAOj5wVn+jUep3u54jBws= =/toi -----END PGP SIGNATURE----- Merge tag 'vfs-6.19-rc1.writeback' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull writeback updates from Christian Brauner: "Features: - Allow file systems to increase the minimum writeback chunk size. The relatively low minimal writeback size of 4MiB means that written back inodes on rotational media are switched a lot. Besides introducing additional seeks, this also can lead to extreme file fragmentation on zoned devices when a lot of files are cached relative to the available writeback bandwidth. This adds a superblock field that allows the file system to override the default size, and sets it to the zone size for zoned XFS. - Add logging for slow writeback when it exceeds sysctl_hung_task_timeout_secs. This helps identify tasks waiting for a long time and pinpoint potential issues. Recording the starting jiffies is also useful when debugging a crashed vmcore. - Wake up waiting tasks when finishing the writeback of a chunk Cleanups: - filemap_* writeback interface cleanups. Adding filemap_fdatawrite_wbc ended up being a mistake, as all but the original btrfs caller should be using better high level interfaces instead. This series removes all these low-level interfaces, switches btrfs to a more specific interface, and cleans up other too low-level interfaces. With this the writeback_control that is passed to the writeback code is only initialized in three places. - Remove __filemap_fdatawrite, __filemap_fdatawrite_range, and filemap_fdatawrite_wbc - Add filemap_flush_nr helper for btrfs - Push struct writeback_control into start_delalloc_inodes in btrfs - Rename filemap_fdatawrite_range_kick to filemap_flush_range - Stop opencoding filemap_fdatawrite_range in 9p, ocfs2, and mm - Make wbc_to_tag() inline and use it in fs" * tag 'vfs-6.19-rc1.writeback' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: fs: Make wbc_to_tag() inline and use it in fs. xfs: set s_min_writeback_pages for zoned file systems writeback: allow the file system to override MIN_WRITEBACK_PAGES writeback: cleanup writeback_chunk_size mm: rename filemap_fdatawrite_range_kick to filemap_flush_range mm: remove __filemap_fdatawrite_range mm: remove filemap_fdatawrite_wbc mm: remove __filemap_fdatawrite mm,btrfs: add a filemap_flush_nr helper btrfs: push struct writeback_control into start_delalloc_inodes btrfs: use the local tmp_inode variable in start_delalloc_inodes ocfs2: don't opencode filemap_fdatawrite_range in ocfs2_journal_submit_inode_data_buffers 9p: don't opencode filemap_fdatawrite_range in v9fs_mmap_vm_close mm: don't opencode filemap_fdatawrite_range in filemap_invalidate_inode writeback: Add logging for slow writeback (exceeds sysctl_hung_task_timeout_secs) writeback: Wake up waiting tasks when finishing the writeback of a chunk.	2025-12-01 09:20:51 -08:00
Qu Wenruo	39bc80216a	btrfs: relax btrfs_inode::ordered_tree_lock IRQ locking context We used IRQ version of spinlock for ordered_tree_lock, as btrfs_finish_ordered_extent() can be called in end_bbio_data_write() which was in IRQ context. However since we're moving all the btrfs_bio::end_io() calls into task context, there is no more need to support IRQ context thus we can relax to regular spin_lock()/spin_unlock() for btrfs_inode::ordered_tree_lock. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-11-24 22:42:21 +01:00
Qu Wenruo	81cea6cd70	btrfs: remove btrfs_bio::fs_info by extracting it from btrfs_bio::inode Currently there is only one caller which doesn't populate btrfs_bio::inode, and that's scrub. The idea is scrub doesn't want any automatic csum verification nor read-repair, as everything will be handled by scrub itself. However that behavior is really no different than metadata inode, thus we can reuse btree_inode as btrfs_bio::inode for scrub. The only exception is in btrfs_submit_chunk() where if a bbio is from scrub or data reloc inode, we set rst_search_commit_root to true. This means we still need a way to distinguish scrub from metadata, but that can be done by a new flag inside btrfs_bio. Now btrfs_bio::inode is a mandatory parameter, we can extract fs_info from that inode thus can remove btrfs_bio::fs_info to save 8 bytes from btrfs_bio structure. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-11-24 22:40:16 +01:00
Filipe Manana	28fe58ce6a	btrfs: add unlikely to unexpected error case in extent_writepages() We don't expect to hit errors and log the error message, so add the unlikely annotation to make it clear and to hint the compiler that it may reorganize code to be more efficient. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Anand Jain <asj@kernel.org> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-11-24 21:59:08 +01:00
Filipe Manana	74ca34f79e	btrfs: split assertion into two in extent_writepage_io() If the assertion fails we don't get to know which of the two expressions failed and neither the values used in each expression. So split the assertion into two, each for a single expression, so that if any is triggered we see a line number reported in a stack trace that points to which expression failed. Also make the assertions use the verbose mode to print the values involved in the computations. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Anand Jain <asj@kernel.org> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-11-24 21:59:08 +01:00
Filipe Manana	46a2390859	btrfs: use variable for end offset in extent_writepage_io() Instead of repeating the expression "start + len" multiple times, store it in a variable and use it where needed. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Anand Jain <asj@kernel.org> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-11-24 21:59:08 +01:00
Filipe Manana	18de34daa7	btrfs: truncate ordered extent when skipping writeback past i_size While running test case btrfs/192 from fstests with support for large folios (needs CONFIG_BTRFS_EXPERIMENTAL=y) I ended up getting very sporadic btrfs check failures reporting that csum items were missing. Looking into the issue it turned out that btrfs check searches for csum items of a file extent item with a range that spans beyond the i_size of a file and we don't have any, because the kernel's writeback code skips submitting bios for ranges beyond eof. It's not expected however to find a file extent item that crosses the rounded up (by the sector size) i_size value, but there is a short time window where we can end up with a transaction commit leaving this small inconsistency between the i_size and the last file extent item. Example btrfs check output when this happens: $ btrfs check /dev/sdc Opening filesystem to check... Checking filesystem on /dev/sdc UUID: 69642c61-5efb-4367-aa31-cdfd4067f713 [1/8] checking log skipped (none written) [2/8] checking root items [3/8] checking extents [4/8] checking free space tree [5/8] checking fs roots root 5 inode 332 errors 1000, some csum missing ERROR: errors found in fs roots (...) Looking at a tree dump of the fs tree (root 5) for inode 332 we have: $ btrfs inspect-internal dump-tree -t 5 /dev/sdc (...) item 28 key (332 INODE_ITEM 0) itemoff 2006 itemsize 160 generation 17 transid 19 size 610969 nbytes 86016 block group 0 mode 100666 links 1 uid 0 gid 0 rdev 0 sequence 11 flags 0x0(none) atime 1759851068.391327881 (2025-10-07 16:31:08) ctime 1759851068.410098267 (2025-10-07 16:31:08) mtime 1759851068.410098267 (2025-10-07 16:31:08) otime 1759851068.391327881 (2025-10-07 16:31:08) item 29 key (332 INODE_REF 340) itemoff 1993 itemsize 13 index 2 namelen 3 name: f1f item 30 key (332 EXTENT_DATA 589824) itemoff 1940 itemsize 53 generation 19 type 1 (regular) extent data disk byte 21745664 nr 65536 extent data offset 0 nr 65536 ram 65536 extent compression 0 (none) (...) We can see that the file extent item for file offset 589824 has a length of 64K and its number of bytes is 64K. Looking at the inode item we see that its i_size is 610969 bytes which falls within the range of that file extent item [589824, 655360[. Looking into the csum tree: $ btrfs inspect-internal dump-tree /dev/sdc (...) item 15 key (EXTENT_CSUM EXTENT_CSUM 21565440) itemoff 991 itemsize 200 range start 21565440 end 21770240 length 204800 item 16 key (EXTENT_CSUM EXTENT_CSUM 1104576512) itemoff 983 itemsize 8 range start 1104576512 end 1104584704 length 8192 (..) We see that the csum item number 15 covers the first 24K of the file extent item - it ends at offset 21770240 and the extent's disk_bytenr is 21745664, so we have: 21770240 - 21745664 = 24K We see that the next csum item (number 16) is completely outside the range, so the remaining 40K of the extent doesn't have csum items in the tree. If we round up the i_size to the sector size, we get: round_up(610969, 4096) = 614400 If we subtract from that the file offset for the extent item we get: 614400 - 589824 = 24K So the missing 40K corresponds to the end of the file extent item's range minus the rounded up i_size: 655360 - 614400 = 40K Normally we don't expect a file extent item to span over the rounded up i_size of an inode, since when truncating, doing hole punching and other operations that trim a file extent item, the number of bytes is adjusted. There is however a short time window where the kernel can end up, temporarily,persisting an inode with an i_size that falls in the middle of the last file extent item and the file extent item was not yet trimmed (its number of bytes reduced so that it doesn't cross i_size rounded up by the sector size). The steps (in the kernel) that lead to such scenario are the following: 1) We have inode I as an empty file, no allocated extents, i_size is 0; 2) A buffered write is done for file range [589824, 655360[ (length of 64K) and the i_size is updated to 655360. Note that we got a single large folio for the range (64K); 3) A truncate operation starts that reduces the inode's i_size down to 610969 bytes. The truncate sets the inode's new i_size at btrfs_setsize() by calling truncate_setsize() and before calling btrfs_truncate(); 4) At btrfs_truncate() we trigger writeback for the range starting at 610304 (which is the new i_size rounded down to the sector size) and ending at (u64)-1; 5) During the writeback, at extent_write_cache_pages(), we get from the call to filemap_get_folios_tag(), the 64K folio that starts at file offset 589824 since it contains the start offset of the writeback range (610304); 6) At writepage_delalloc() we find the whole range of the folio is dirty and therefore we run delalloc for that 64K range ([589824, 655360[), reserving a 64K extent, creating an ordered extent, etc; 7) At extent_writepage_io() we submit IO only for subrange [589824, 614400[ because the inode's i_size is 610969 bytes (rounded up by sector size is 614400). There, in the while loop we intentionally skip IO beyond i_size to avoid any unnecessay work and just call btrfs_mark_ordered_io_finished() for the range [614400, 655360[ (which has a 40K length); 8) Once the IO finishes we finish the ordered extent by ending up at btrfs_finish_one_ordered(), join transaction N, insert a file extent item in the inode's subvolume tree for file offset 589824 with a number of bytes of 64K, and update the inode's delayed inode item or directly the inode item with a call to btrfs_update_inode_fallback(), which results in storing the new i_size of 610969 bytes; 9) Transaction N is committed either by the transaction kthread or some other task committed it (in response to a sync or fsync for example). At this point we have inode I persisted with an i_size of 610969 bytes and file extent item that starts at file offset 589824 and has a number of bytes of 64K, ending at an offset of 655360 which is beyond the i_size rounded up to the sector size (614400). --> So after a crash or power failure here, the btrfs check program reports that error about missing checksum items for this inode, as it tries to lookup for checksums covering the whole range of the extent; 10) Only after transaction N is committed that at btrfs_truncate() the call to btrfs_start_transaction() starts a new transaction, N + 1, instead of joining transaction N. And it's with transaction N + 1 that it calls btrfs_truncate_inode_items() which updates the file extent item at file offset 589824 to reduce its number of bytes from 64K down to 24K, so that the file extent item's range ends at the i_size rounded up to the sector size (614400 bytes). Fix this by truncating the ordered extent at extent_writepage_io() when we skip writeback because the current offset in the folio is beyond i_size. This ensures we don't ever persist a file extent item with a number of bytes beyond the rounded up (by sector size) value of the i_size. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Anand Jain <asj@kernel.org> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-11-24 21:59:08 +01:00
Qu Wenruo	4e700ac62a	btrfs: remove unnecessary NULL fs_info check from find_lock_delalloc_range() [STATIC CHECK REPORT] Smatch is reporting that find_lock_delalloc_range() used to do a null pointer check before accessing fs_info, but now we're accessing it for sectorsize unconditionally. [FALSE ALERT] This is a false alert, the existing null pointer check is introduced in commit `f7b12a62f0` ("btrfs: replace BTRFS_MAX_EXTENT_SIZE with fs_info->max_extent_size"), but way before that, commit `7c0260ee09` ("btrfs: tests, require fs_info for root") is already forcing every btrfs_root to have a correct fs_info pointer. So there is no way that btrfs_root::fs_info is NULL. [FIX] Just remove the unnecessary NULL pointer checker. Reported-by: kernel test robot <lkp@intel.com> Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Fixes: `f7b12a62f0` ("btrfs: replace BTRFS_MAX_EXTENT_SIZE with fs_info->max_extent_size") Closes: https://lore.kernel.org/r/202509250925.4L4JQTtn-lkp@intel.com/ Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-11-24 21:37:36 +01:00
Matthew Wilcox (Oracle)	48f3784b17	btrfs: Use folio_next_pos() btrfs defined its own variant of folio_next_pos() called folio_end(). This is an ambiguous name as 'end' might be exclusive or inclusive. Switch to the new folio_next_pos(). Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Link: https://patch.msgid.link/20251024170822.1427218-3-willy@infradead.org Acked-by: David Sterba <dsterba@suse.com> Cc: Chris Mason <clm@fb.com> Cc: David Sterba <dsterba@suse.com> Cc: linux-btrfs@vger.kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-10-31 13:11:37 +01:00
Qu Wenruo	2618849f31	btrfs: ensure no dirty metadata is written back for an fs with errors [BUG] During development of a minor feature (make sure all btrfs_bio::end_io() is called in task context), I noticed a crash in generic/388, where metadata writes triggered new works after btrfs_stop_all_workers(). It turns out that it can even happen without any code modification, just using RAID5 for metadata and the same workload from generic/388 is going to trigger the use-after-free. [CAUSE] If btrfs hits an error, the fs is marked as error, no new transaction is allowed thus metadata is in a frozen state. But there are some metadata modifications before that error, and they are still in the btree inode page cache. Since there will be no real transaction commit, all those dirty folios are just kept as is in the page cache, and they can not be invalidated by invalidate_inode_pages2() call inside close_ctree(), because they are dirty. And finally after btrfs_stop_all_workers(), we call iput() on btree inode, which triggers writeback of those dirty metadata. And if the fs is using RAID56 metadata, this will trigger RMW and queue new works into rmw_workers, which is already stopped, causing warning from queue_work() and use-after-free. [FIX] Add a special handling for write_one_eb(), that if the fs is already in an error state, immediately mark the bbio as failure, instead of really submitting them. Then during close_ctree(), iput() will just discard all those dirty tree blocks without really writing them back, thus no more new jobs for already stopped-and-freed workqueues. The extra discard in write_one_eb() also acts as an extra safenet. E.g. the transaction abort is triggered by some extent/free space tree corruptions, and since extent/free space tree is already corrupted some tree blocks may be allocated where they shouldn't be (overwriting existing tree blocks). In that case writing them back will further corrupting the fs. CC: stable@vger.kernel.org # 6.6+ Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-10-30 19:16:01 +01:00
Julian Sun	4952f35f05	fs: Make wbc_to_tag() inline and use it in fs. The logic in wbc_to_tag() is widely used in file systems, so modify this function to be inline and use it in file systems. This patch has only passed compilation tests, but it should be fine. Signed-off-by: Julian Sun <sunjunchao@bytedance.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-10-29 23:33:48 +01:00
Boris Burkov	8ab2fa6969	btrfs: fix incorrect readahead expansion length The intent of btrfs_readahead_expand() was to expand to the length of the current compressed extent being read. However, "ram_bytes" is not that, in the case where a single physical compressed extent is used for multiple file extents. Consider this case with a large compressed extent C and then later two non-compressed extents N1 and N2 written over C, leaving C1 and C2 pointing to offset/len pairs of C: [ C ] [ N1 ][ C1 ][ N2 ][ C2 ] In such a case, ram_bytes for both C1 and C2 is the full uncompressed length of C. So starting readahead in C1 will expand the readahead past the end of C1, past N2, and into C2. This will then expand readahead again, to C2_start + ram_bytes, way past EOF. First of all, this is totally undesirable, we don't want to read the whole file in arbitrary chunks of the large underlying extent if it happens to exist. Secondly, it results in zeroing the range past the end of C2 up to ram_bytes. This is particularly unpleasant with fs-verity as it can zero and set uptodate pages in the verity virtual space past EOF. This incorrect readahead behavior can lead to verity verification errors, if we iterate in a way that happens to do the wrong readahead. Fix this by using em->len for readahead expansion, not em->ram_bytes, resulting in the expected behavior of stopping readahead at the extent boundary. Reported-by: Max Chernoff <git@maxchernoff.ca> Link: https://bugzilla.redhat.com/show_bug.cgi?id=2399898 Fixes: `9e9ff875e4` ("btrfs: use readahead_expand() on compressed extents") CC: stable@vger.kernel.org # 6.17 Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>	2025-10-13 22:34:08 +02:00
David Sterba	cc53bd2085	btrfs: add unlikely annotations to branches leading to EIO The unlikely() annotation is a static prediction hint that compiler may use to reorder code out of hot path. We use it elsewhere (namely tree-checker.c) for error branches that almost never happen, where EIO is one of them. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-09-23 08:49:26 +02:00
Qu Wenruo	c2ffb1ec1a	btrfs: prepare compression folio alloc/free for bs > ps cases This includes the following preparation for bs > ps cases: - Always alloc/free the folio directly if bs > ps This adds a new @fs_info parameter for btrfs_alloc_compr_folio(), thus affecting all compression algorithms. For btrfs_free_compr_folio() it needs no parameter for now, as we can use the folio size to skip the caching part. For now the change is just to passing a @fs_info into the function, all the folio size assumption is still based on page size. - Properly zero the last folio in compress_file_range() Since the compressed folios can be larger than a page, we need to properly zero the whole folio. - Use correct folio size for btrfs_add_compressed_bio_folios() Instead of page size, use the correct folio size. - Use correct folio size/shift for btrfs_compress_filemap_get_folio() As we are not only using simple page sized folios anymore. - Use correct folio size for btrfs_decompress() There is an ASSERT() making sure the decompressed range is no larger than a page, which will be triggered for bs > ps cases. - Skip readahead for compressed pages Similar to subpage cases. - Make btrfs_alloc_folio_array() to accept a new @order parameter - Add a helper to calculate the minimal folio size All those changes should not affect the existing bs <= ps handling. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-09-23 08:49:24 +02:00
Qu Wenruo	7b26da4074	btrfs: fix the incorrect max_bytes value for find_lock_delalloc_range() [BUG] With my local branch to enable bs > ps support for btrfs, sometimes I hit the following ASSERT() inside submit_one_sector(): ASSERT(block_start != EXTENT_MAP_HOLE); Please note that it's not yet possible to hit this ASSERT() in the wild yet, as it requires btrfs bs > ps support, which is not even in the development branch. But on the other hand, there is also a very low chance to hit above ASSERT() with bs < ps cases, so this is an existing bug affect not only the incoming bs > ps support but also the existing bs < ps support. [CAUSE] Firstly that ASSERT() means we're trying to submit a dirty block but without a real extent map nor ordered extent map backing it. Furthermore with extra debugging, the folio triggering such ASSERT() is always larger than the fs block size in my bs > ps case. (8K block size, 4K page size) After some more debugging, the ASSERT() is trigger by the following sequence: extent_writepage() \| We got a 32K folio (4 fs blocks) at file offset 0, and the fs block \| size is 8K, page size is 4K. \| And there is another 8K folio at file offset 32K, which is also \| dirty. \| So the filemap layout looks like the following: \| \| "\|\|" is the filio boundary in the filemap. \| "//\| is the dirty range. \| \| 0 8K 16K 24K 32K 40K \| \|////////\| \|//////////////////////\|\|////////\| \| \|- writepage_delalloc() \| \|- find_lock_delalloc_range() for [0, 8K) \| \| Now range [0, 8K) is properly locked. \| \| \| \|- find_lock_delalloc_range() for [16K, 40K) \| \| \|- btrfs_find_delalloc_range() returned range [16K, 40K) \| \| \|- lock_delalloc_folios() locked folio 0 successfully \| \| \| \| \| \| The filemap range [32K, 40K) got dropped from filemap. \| \| \| \| \| \|- lock_delalloc_folios() failed with -EAGAIN on folio 32K \| \| \| As the folio at 32K is dropped. \| \| \| \| \| \|- loops = 1; \| \| \|- max_bytes = PAGE_SIZE; \| \| \|- goto again; \| \| \| This will re-do the lookup for dirty delalloc ranges. \| \| \| \| \| \|- btrfs_find_delalloc_range() called with @max_bytes == 4K \| \| \| This is smaller than block size, so \| \| \| btrfs_find_delalloc_range() is unable to return any range. \| \| \- return false; \| \| \| \- Now only range [0, 8K) has an OE for it, but for dirty range \| [16K, 32K) it's dirty without an OE. \| This breaks the assumption that writepage_delalloc() will find \| and lock all dirty ranges inside the folio. \| \|- extent_writepage_io() \|- submit_one_sector() for [0, 8K) \| Succeeded \| \|- submit_one_sector() for [16K, 24K) Triggering the ASSERT(), as there is no OE, and the original extent map is a hole. Please note that, this also exposed the same problem for bs < ps support. E.g. with 64K page size and 4K block size. If we failed to lock a folio, and falls back into the "loops = 1;" branch, we will re-do the search using 64K as max_bytes. Which may fail again to lock the next folio, and exit early without handling all dirty blocks inside the folio. [FIX] Instead of using the fixed size PAGE_SIZE as @max_bytes, use @sectorsize, so that we are ensured to find and lock any remaining blocks inside the folio. And since we're here, add an extra ASSERT() to before calling btrfs_find_delalloc_range() to make sure the @max_bytes is at least no smaller than a block to avoid false negative. Cc: stable@vger.kernel.org # 5.15+ Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-09-23 08:49:24 +02:00
Qu Wenruo	2d83ed6c6c	btrfs: return any hit error from extent_writepage_io() Since the support of bs < ps support, extent_writepage_io() will submit multiple blocks inside the folio. But if we hit error submitting one sector, but the next sector can still be submitted successfully, the function extent_writepage_io() will still return 0. This will make btrfs to silently ignore the error without setting error flag for the filemap. Fix it by recording the first error hit, and always return that value. Fixes: `8bf334beb3` ("btrfs: fix double accounting race when extent_writepage_io() failed") Reviewed-by: Daniel Vacek <neelx@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-09-23 08:49:23 +02:00
Filipe Manana	8f0534ec96	btrfs: mark extent buffer alignment checks as unlikely We are not expecting to ever fail the extent buffer alignment checks, so mark them as unlikely to allow the compiler to potentially generate more optimized code. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-09-23 08:49:23 +02:00
Filipe Manana	6a9e1d1a65	btrfs: store and use node size in local variable in check_eb_alignment() Instead of dereferencing fs_info every time we need to access the node size, store in a local variable to make the code less verbose and avoid a line split too. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-09-23 08:49:23 +02:00

1 2 3 4 5 ...

1562 Commits