linux

mirror of https://github.com/torvalds/linux.git synced 2026-05-12 16:18:45 +02:00

Author	SHA1	Message	Date
Linus Torvalds	a436a0b847	Various clean ups and bug fixes in ext4 for 7.1: * Refactor code paths involved with partial block zero-out in prearation for converting ext4 to use iomap for buffered writes. * Remove use of d_alloc() from ext4 in preparation for the deprecation of this interface. * Replace some J_ASSERTS with a journal abort so we can avoid a kernel panic for a localized file system error * Simplify various code paths in mballoc, move_extent, and fast commit * Fix rare deadlock in jbd2_journal_cancel_revoke() that can be triggered by generic/013 when blocksize < pagesize. * Fix memory leak when releasing an extended attribute when its value is stored in an ea_inode * Fix various potential kunit test bugs in fs/ext4/extents.c * Fix potential out-of-bounds access in check_xattr() with a corrupted file system * Make the jbd2_inode dirty range tracking safe for lockless reads * Avoid a WARN_ON when writeback files due to a corrupted file system; we already print an ext4 warning indicatign that data will be lost, so the WARN_ON is not necessary and doesn't add any new information -----BEGIN PGP SIGNATURE----- iQEzBAABCAAdFiEEK2m5VNv+CHkogTfJ8vlZVpUNgaMFAmniS4wACgkQ8vlZVpUN gaM1lgf/e9Csrz5Zxyf+VEdE5+kQjuaT0feTe/gUMM53jqMiT7cC30JBF1istkaz O3qBe6x8O5iHDv1Z/umcI+A4GtAUj6SsTjwggQPJyJvUqHDPsruGfyD6wBmmNyJ2 N9Y3OZ8I5W7c7ecES6/a7W7VheSe9kkgPDD5eBhw7R4l71Q+B2foW4snQ8wcGTNm uO/QosbaOArGsXHTTcOmOdwxEDIUV16g/iJMhszCI1gaukmL14DIzFtwYkateUBD Wi1+tmZmMvqSkKJ/6PELLoawiUPhTL6OuPANyIid/1Yv5dxMec8XiO/CXRt7r2X3 q9Nrpqfs3fxv63r2IUJ7+yQVYGXK9g== =MG/0 -----END PGP SIGNATURE----- Merge tag 'ext4_for_linux-7.0-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 updates from Ted Ts'o: - Refactor code paths involved with partial block zero-out in prearation for converting ext4 to use iomap for buffered writes - Remove use of d_alloc() from ext4 in preparation for the deprecation of this interface - Replace some J_ASSERTS with a journal abort so we can avoid a kernel panic for a localized file system error - Simplify various code paths in mballoc, move_extent, and fast commit - Fix rare deadlock in jbd2_journal_cancel_revoke() that can be triggered by generic/013 when blocksize < pagesize - Fix memory leak when releasing an extended attribute when its value is stored in an ea_inode - Fix various potential kunit test bugs in fs/ext4/extents.c - Fix potential out-of-bounds access in check_xattr() with a corrupted file system - Make the jbd2_inode dirty range tracking safe for lockless reads - Avoid a WARN_ON when writeback files due to a corrupted file system; we already print an ext4 warning indicatign that data will be lost, so the WARN_ON is not necessary and doesn't add any new information * tag 'ext4_for_linux-7.0-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (37 commits) jbd2: fix deadlock in jbd2_journal_cancel_revoke() ext4: fix missing brelse() in ext4_xattr_inode_dec_ref_all() ext4: fix possible null-ptr-deref in mbt_kunit_exit() ext4: fix possible null-ptr-deref in extents_kunit_exit() ext4: fix the error handling process in extents_kunit_init). ext4: call deactivate_super() in extents_kunit_exit() ext4: fix miss unlock 'sb->s_umount' in extents_kunit_init() ext4: fix bounds check in check_xattrs() to prevent out-of-bounds access ext4: zero post-EOF partial block before appending write ext4: move pagecache_isize_extended() out of active handle ext4: remove ctime/mtime update from ext4_alloc_file_blocks() ext4: unify SYNC mode checks in fallocate paths ext4: ensure zeroed partial blocks are persisted in SYNC mode ext4: move zero partial block range functions out of active handle ext4: pass allocate range as loff_t to ext4_alloc_file_blocks() ext4: remove handle parameters from zero partial block functions ext4: move ordered data handling out of ext4_block_do_zero_range() ext4: rename ext4_block_zero_page_range() to ext4_block_zero_range() ext4: factor out journalled block zeroing range ext4: rename and extend ext4_block_truncate_page() ...	2026-04-17 17:08:31 -07:00
Linus Torvalds	b7d74ea0fd	vfs-7.1-rc1.kino Please consider pulling these changes from the signed vfs-7.1-rc1.kino tag. Thanks! Christian -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCadjZCgAKCRCRxhvAZXjc otmnAP4sbsxZQdz2TG2hJuOwnEZOkkxZQOUMc3ERVyZaWXIeTAEA7e5M+8FpoG9n 8ipO76UoaXdGLESrqVdp9EOhLqOW7QY= =uMeJ -----END PGP SIGNATURE----- Merge tag 'vfs-7.1-rc1.kino' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs i_ino updates from Christian Brauner: "For historical reasons, the inode->i_ino field is an unsigned long, which means that it's 32 bits on 32 bit architectures. This has caused a number of filesystems to implement hacks to hash a 64-bit identifier into a 32-bit field, and deprives us of a universal identifier field for an inode. This changes the inode->i_ino field from an unsigned long to a u64. This shouldn't make any material difference on 64-bit hosts, but 32-bit hosts will see struct inode grow by at least 4 bytes. This could have effects on slabcache sizes and field alignment. The bulk of the changes are to format strings and tracepoints, since the kernel itself doesn't care that much about the i_ino field. The first patch changes some vfs function arguments, so check that one out carefully. With this change, we may be able to shrink some inode structures. For instance, struct nfs_inode has a fileid field that holds the 64-bit inode number. With this set of changes, that field could be eliminated. I'd rather leave that sort of cleanups for later just to keep this simple" * tag 'vfs-7.1-rc1.kino' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: nilfs2: fix 64-bit division operations in nilfs_bmap_find_target_in_group() EVM: add comment describing why ino field is still unsigned long vfs: remove externs from fs.h on functions modified by i_ino widening treewide: fix missed i_ino format specifier conversions ext4: fix signed format specifier in ext4_load_inode trace event treewide: change inode->i_ino from unsigned long to u64 nilfs2: widen trace event i_ino fields to u64 f2fs: widen trace event i_ino fields to u64 ext4: widen trace event i_ino fields to u64 zonefs: widen trace event i_ino fields to u64 hugetlbfs: widen trace event i_ino fields to u64 ext2: widen trace event i_ino fields to u64 cachefiles: widen trace event i_ino fields to u64 vfs: widen trace event i_ino fields to u64 net: change sock.sk_ino and sock_i_ino() to u64 audit: widen ino fields to u64 vfs: widen inode hash/lookup functions to u64	2026-04-13 12:19:01 -07:00
Zhang Yi	1ad0f42823	ext4: move pagecache_isize_extended() out of active handle In ext4_alloc_file_blocks(), pagecache_isize_extended() is called under an active handle and may also hold folio lock if the block size is smaller than the folio size. This also breaks the "folio lock -> transaction start" lock ordering for the upcoming iomap buffered I/O path. Therefore, move pagecache_isize_extended() outside of an active handle. Additionally, it is unnecessary to update the file length during each iteration of the allocation loop. Instead, update the file length only to the position where the allocation is successful. Postpone updating the inode size until after the allocation loop completes or is interrupted due to an error. Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20260327102939.1095257-13-yi.zhang@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2026-04-09 21:57:52 -04:00
Zhang Yi	116c0bdac2	ext4: remove ctime/mtime update from ext4_alloc_file_blocks() The ctime and mtime update is already handled by file_modified() in ext4_fallocate(), the caller of ext4_alloc_file_blocks(). So remove the redundant calls to inode_set_ctime_current() and inode_set_mtime_to_ts() in ext4_alloc_file_blocks(). Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20260327102939.1095257-12-yi.zhang@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2026-04-09 21:57:52 -04:00
Zhang Yi	c3688d212f	ext4: unify SYNC mode checks in fallocate paths In the ext4 fallocate call chain, SYNC mode handling is inconsistent: some places check the inode state, while others check the open file descriptor state. Unify these checks by evaluating both conditions to ensure consistent behavior across all fallocate operations. Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20260327102939.1095257-11-yi.zhang@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2026-04-09 21:57:52 -04:00
Zhang Yi	7d81ec0246	ext4: ensure zeroed partial blocks are persisted in SYNC mode In ext4_zero_range() and ext4_punch_hole(), when operating in SYNC mode and zeroing a partial block, only data=journal modes guarantee that the zeroed data is synchronously persisted after the operation completes. For data=ordered/writeback mode and non-journal modes, this guarantee is missing. Introduce a partial_zero parameter to explicitly trigger writeback for all scenarios where a partial block is zeroed, ensuring the zeroed data is durably persisted. Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Link: https://patch.msgid.link/20260327102939.1095257-10-yi.zhang@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2026-04-09 21:57:52 -04:00
Zhang Yi	c4602a1d09	ext4: move zero partial block range functions out of active handle Move ext4_block_zero_eof() and ext4_zero_partial_blocks() calls out of the active handle context, making them independent operations, and also add return value checks. This is safe because it still ensures data is updated before metadata for data=ordered mode and data=journal mode because we still zero data and ordering data before modifying the metadata. This change is required for iomap infrastructure conversion because the iomap buffered I/O path does not use the same journal infrastructure for partial block zeroing. The lock ordering of folio lock and starting transactions is "folio lock -> transaction start", which is opposite of the current path. Therefore, zeroing partial blocks cannot be performed under the active handle. Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20260327102939.1095257-9-yi.zhang@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2026-04-09 21:57:52 -04:00
Zhang Yi	ad1876bc4c	ext4: pass allocate range as loff_t to ext4_alloc_file_blocks() Change ext4_alloc_file_blocks() to accept offset and len in byte granularity instead of block granularity. This allows callers to pass byte offsets and lengths directly, and this prepares for moving the ext4_zero_partial_blocks() call from the while(len) loop for unaligned append writes, where it only needs to be invoked once before doing block allocation. Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20260327102939.1095257-8-yi.zhang@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2026-04-09 21:57:52 -04:00
Zhang Yi	d3609a71b7	ext4: remove handle parameters from zero partial block functions Only journal data mode requires an active journal handle when zeroing partial blocks. Stop passing handle_t *handle to ext4_zero_partial_blocks() and related functions, and make ext4_block_journalled_zero_range() start a handle independently. This change has no practical impact now because all callers invoke these functions within the context of an active handle. It prepares for moving ext4_block_zero_eof() out of an active handle in the next patch, which is a prerequisite for converting block zero range operations to iomap infrastructure. Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20260327102939.1095257-7-yi.zhang@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2026-04-09 21:57:52 -04:00
Zhang Yi	bd099a0565	ext4: rename and extend ext4_block_truncate_page() Rename ext4_block_truncate_page() to ext4_block_zero_eof() and extend its signature to accept an explicit 'end' offset instead of calculating the block boundary. This helper function now can replace all cases requiring zeroing of the partial EOF block, including the append buffered write paths in ext4_*_write_end(). Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20260327102939.1095257-3-yi.zhang@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2026-04-09 21:57:51 -04:00
hongao	3ceda17325	ext4: skip split extent recovery on corruption ext4_split_extent_at() retries after ext4_ext_insert_extent() fails by refinding the original extent and restoring its length. That recovery is only safe for transient resource failures such as -ENOSPC, -EDQUOT, and -ENOMEM. When ext4_ext_insert_extent() fails because the extent tree is already corrupted, ext4_find_extent() can return a leaf path without p_ext. ext4_split_extent_at() then dereferences path[depth].p_ext while trying to fix up the original extent length, causing a NULL pointer dereference while handling a pre-existing filesystem corruption. Do not enter the recovery path for corruption errors, and validate p_ext after refinding the extent before touching it. This keeps the recovery path limited to cases it can actually repair and turns the syzbot-triggered crash into a proper corruption report. Fixes: `716b9c23b8` ("ext4: refactor split and convert extents") Reported-by: syzbot+1ffa5d865557e51cb604@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=1ffa5d865557e51cb604 Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Signed-off-by: hongao <hongao@uniontech.com> Link: https://patch.msgid.link/EF77870F23FF9C90+20260324015815.35248-1-hongao@uniontech.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org	2026-03-27 23:38:52 -04:00
Ye Bin	9e1b14320b	ext4: fix extents-test.c is not compiled when EXT4_KUNIT_TESTS=M Now, only EXT4_KUNIT_TESTS=Y testcase will be compiled in 'extents.c'. To solve this issue, the ext4 test code needs to be decoupled. The 'extents-test' module is compiled into 'ext4-test' module. Fixes: `cb1e0c1d1f` ("ext4: kunit tests for extent splitting and conversion") Signed-off-by: Ye Bin <yebin10@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20260314075258.1317579-4-yebin@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2026-03-27 23:36:06 -04:00
Edward Adam Davis	5422fe71d2	ext4: avoid infinite loops caused by residual data On the mkdir/mknod path, when mapping logical blocks to physical blocks, if inserting a new extent into the extent tree fails (in this example, because the file system disabled the huge file feature when marking the inode as dirty), ext4_ext_map_blocks() only calls ext4_free_blocks() to reclaim the physical block without deleting the corresponding data in the extent tree. This causes subsequent mkdir operations to reference the previously reclaimed physical block number again, even though this physical block is already being used by the xattr block. Therefore, a situation arises where both the directory and xattr are using the same buffer head block in memory simultaneously. The above causes ext4_xattr_block_set() to enter an infinite loop about "inserted" and cannot release the inode lock, ultimately leading to the 143s blocking problem mentioned in [1]. If the metadata is corrupted, then trying to remove some extent space can do even more harm. Also in case EXT4_GET_BLOCKS_DELALLOC_RESERVE was passed, remove space wrongly update quota information. Jan Kara suggests distinguishing between two cases: 1) The error is ENOSPC or EDQUOT - in this case the filesystem is fully consistent and we must maintain its consistency including all the accounting. However these errors can happen only early before we've inserted the extent into the extent tree. So current code works correctly for this case. 2) Some other error - this means metadata is corrupted. We should strive to do as few modifications as possible to limit damage. So I'd just skip freeing of allocated blocks. [1] INFO: task syz.0.17:5995 blocked for more than 143 seconds. Call Trace: inode_lock_nested include/linux/fs.h:1073 [inline] __start_dirop fs/namei.c:2923 [inline] start_dirop fs/namei.c:2934 [inline] Reported-by: syzbot+512459401510e2a9a39f@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=1659aaaaa8d9d11265d7 Tested-by: syzbot+1659aaaaa8d9d11265d7@syzkaller.appspotmail.com Reported-by: syzbot+1659aaaaa8d9d11265d7@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=512459401510e2a9a39f Tested-by: syzbot+1659aaaaa8d9d11265d7@syzkaller.appspotmail.com Signed-off-by: Edward Adam Davis <eadavis@qq.com> Reviewed-by: Jan Kara <jack@suse.cz> Tested-by: syzbot+512459401510e2a9a39f@syzkaller.appspotmail.com Link: https://patch.msgid.link/tencent_43696283A68450B761D76866C6F360E36705@qq.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org	2026-03-27 23:33:56 -04:00
Tejas Bharambe	2acb5c12eb	ext4: validate p_idx bounds in ext4_ext_correct_indexes ext4_ext_correct_indexes() walks up the extent tree correcting index entries when the first extent in a leaf is modified. Before accessing path[k].p_idx->ei_block, there is no validation that p_idx falls within the valid range of index entries for that level. If the on-disk extent header contains a corrupted or crafted eh_entries value, p_idx can point past the end of the allocated buffer, causing a slab-out-of-bounds read. Fix this by validating path[k].p_idx against EXT_LAST_INDEX() at both access sites: before the while loop and inside it. Return -EFSCORRUPTED if the index pointer is out of range, consistent with how other bounds violations are handled in the ext4 extent tree code. Reported-by: syzbot+04c4e65cab786a2e5b7e@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=04c4e65cab786a2e5b7e Signed-off-by: Tejas Bharambe <tejas.bharambe@outlook.com> Link: https://patch.msgid.link/JH0PR06MB66326016F9B6AD24097D232B897CA@JH0PR06MB6632.apcprd06.prod.outlook.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org	2026-03-27 23:33:46 -04:00
Ojaswin Mujoo	c4a48e9eee	ext4: minor fix for ext4_split_extent_zeroout() We missed storing the error which triggerd smatch warning: fs/ext4/extents.c:3369 ext4_split_extent_zeroout() warn: duplicate zero check 'err' (previous on line 3363) fs/ext4/extents.c 3361 3362 err = ext4_ext_get_access(handle, inode, path + depth); 3363 if (err) 3364 return err; 3365 3366 ext4_ext_mark_initialized(ex); 3367 3368 ext4_ext_dirty(handle, inode, path + depth); --> 3369 if (err) 3370 return err; 3371 3372 return 0; 3373 } Fix it by correctly storing the err value from ext4_ext_dirty(). Link: https://lore.kernel.org/all/aYXvVgPnKltX79KE@stanley.mountain/ Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Fixes: `a985e07c26` ("ext4: refactor zeroout path and handle all cases") Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Baokun Li <libaokun@linux.alibaba.com> Reviewed-by: Andreas Dilger <adilger@dilger.ca> Link: https://patch.msgid.link/20260302143811.605174-1-ojaswin@linux.ibm.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org	2026-03-27 23:33:22 -04:00
Jeff Layton	0b2600f81c	treewide: change inode->i_ino from unsigned long to u64 On 32-bit architectures, unsigned long is only 32 bits wide, which causes 64-bit inode numbers to be silently truncated. Several filesystems (NFS, XFS, BTRFS, etc.) can generate inode numbers that exceed 32 bits, and this truncation can lead to inode number collisions and other subtle bugs on 32-bit systems. Change the type of inode->i_ino from unsigned long to u64 to ensure that inode numbers are always represented as 64-bit values regardless of architecture. Update all format specifiers treewide from %lu/%lx to %llu/%llx to match the new type, along with corresponding local variable types. This is the bulk treewide conversion. Earlier patches in this series handled trace events separately to allow trace field reordering for better struct packing on 32-bit. Signed-off-by: Jeff Layton <jlayton@kernel.org> Link: https://patch.msgid.link/20260304-iino-u64-v3-12-2257ad83d372@kernel.org Acked-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Christian Brauner <brauner@kernel.org>	2026-03-06 14:31:28 +01:00
Kees Cook	69050f8d6d	treewide: Replace kmalloc with kmalloc_obj for non-scalar types This is the result of running the Coccinelle script from scripts/coccinelle/api/kmalloc_objs.cocci. The script is designed to avoid scalar types (which need careful case-by-case checking), and instead replace kmalloc-family calls that allocate struct or union object instances: Single allocations: kmalloc(sizeof(TYPE), ...) are replaced with: kmalloc_obj(TYPE, ...) Array allocations: kmalloc_array(COUNT, sizeof(TYPE), ...) are replaced with: kmalloc_objs(TYPE, COUNT, ...) Flex array allocations: kmalloc(struct_size(PTR, FAM, COUNT), ...) are replaced with: kmalloc_flex(PTR, FAM, COUNT, ...) (where TYPE may also be VAR) The resulting allocations no longer return "void ", instead returning "TYPE ". Signed-off-by: Kees Cook <kees@kernel.org>	2026-02-21 01:02:28 -08:00
Ojaswin Mujoo	4f5e8e6f01	et4: allow zeroout when doing written to unwritten split Currently, when we are doing an extent split and convert operation of written to unwritten extent (example, as done by ZERO_RANGE), we don't allow the zeroout fallback in case the extent tree manipulation fails. This is mostly because zeroout might take unsually long and the fact that this code path is more tolerant to failures than endio. Since we have zeroout machinery in place, we might as well use it hence lift this restriction. To mitigate zeroout taking too long respect the max zeroout limit here so that the operation finishes relatively fast. Also, add kunit tests for this case. Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Link: https://patch.msgid.link/1c3349020b8e098a63f293b84bc8a9b56011cef4.1769149131.git.ojaswin@linux.ibm.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2026-01-23 16:50:11 -05:00
Ojaswin Mujoo	716b9c23b8	ext4: refactor split and convert extents ext4_split_convert_extents() has been historically prone to subtle bugs and inconsistent behavior due to the way all the various flags interact with the extent split and conversion process. For example, callers like ext4_convert_unwritten_extents_endio() and convert_initialized_extents() needed to open code extent conversion despite passing CONVERT or CONVERT_UNWRITTEN flags because ext4_split_convert_extents() wasn't performing the conversion. Hence, refactor ext4_split_convert_extents() to clearly enforce the semantics of each flag. The major changes here are: * Clearly separate the split and convert process: * ext4_split_extent() and ext4_split_extent_at() are now only responsible to perform the split. * ext4_split_convert_extents() is now responsible to perform extent conversion after calling ext4_split_extent() for splitting. * This helps get rid of all the MARK_UNWRIT* flags. * Clearly enforce the semantics of flags passed to ext4_split_convert_extents(): * EXT4_GET_BLOCKS_CONVERT: Will convert the split extent to written * EXT4_GET_BLOCKS_CONVERT_UNWRITTEN: Will convert the split extent to unwritten * Modify all callers to enforce the above semantics. * Use ext4_split_convert_extents() instead of ext4_split_extents() in ext4_ext_convert_to_initialized() for uniformity. * Now that ext4_split_convert_extents() is handling caching to es, we dont need to do it in ext4_split_extent_zeroout(). * Cleanup all callers open coding the conversion logic. Further, modify kuniy tests to pass flags based on the new semantics. >From an end user point of view, we should not see any changes in behavior of ext4. Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Link: https://patch.msgid.link/2084a383d69ceefbaa293b8fcf725365eca0a349.1769149131.git.ojaswin@linux.ibm.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2026-01-23 16:50:01 -05:00
Ojaswin Mujoo	a985e07c26	ext4: refactor zeroout path and handle all cases Currently, zeroout is used as a fallback in case we fail to split/convert extents in the "traditional" modify-the-extent-tree way. This is essential to mitigate failures in critical paths like extent splitting during endio. However, the logic is very messy and not easy to follow. Further, the fragile use of various flags has made it prone to errors. Refactor zeroout out logic by moving it up to ext4_split_extents(). Further, zeroout correctly based on the type of conversion we want, ie: - unwritten to written: Zeroout everything around the mapped range. - written to unwritten: Zeroout only the mapped range. Also, ext4_ext_convert_to_initialized() now passes EXT4_GET_BLOCKS_CONVERT to make the intention clear. Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Link: https://patch.msgid.link/e1b51dedeca7c0b1f702141d91edfe4230560e7b.1769149131.git.ojaswin@linux.ibm.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2026-01-23 16:49:53 -05:00
Ojaswin Mujoo	6066990c99	ext4: propagate flags to ext4_convert_unwritten_extents_endio() Currently, callers like ext4_convert_unwritten_extents() pass EXT4_EX_NOCACHE flag to avoid caching extents however this is not respected by ext4_convert_unwritten_extents_endio(). Hence, modify it to accept flags from the caller and to pass the flags on to other extent manipulation functions it calls. This makes sure the NOCACHE flag is respected throughout the code path. Also, since the caller already passes METADATA_NOFAIL and CONVERT flags we don't need to explicitly pass it anymore. Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Link: https://patch.msgid.link/7c2139e0ad32c49c19b194f72219e15d613de284.1769149131.git.ojaswin@linux.ibm.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2026-01-23 16:49:53 -05:00
Ojaswin Mujoo	3fffa44b6e	ext4: propagate flags to convert_initialized_extent() Currently, ext4_zero_range passes EXT4_EX_NOCACHE flag to avoid caching extents however this is not respected by convert_initialized_extent(). Hence, modify it to accept flags from the caller and to pass the flags on to other extent manipulation functions it calls. This makes sure the NOCACHE flag is respected throughout the code path. Also, we no longer explicitly pass CONVERT_UNWRITTEN as the caller takes care of this. Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Link: https://patch.msgid.link/07008fbb14db727fddcaf4c30e2346c49f6c8fe0.1769149131.git.ojaswin@linux.ibm.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2026-01-23 16:49:53 -05:00
Ojaswin Mujoo	82f80e2e3b	ext4: add extent status cache support to kunit tests Add support in Kunit tests to ensure that the extent status cache is also in sync after the extent split and conversion operations. Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Link: https://patch.msgid.link/5f9d2668feeb89a3f3e9d03dadab8c10cbea3741.1769149131.git.ojaswin@linux.ibm.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2026-01-23 16:49:43 -05:00
Ojaswin Mujoo	cb1e0c1d1f	ext4: kunit tests for extent splitting and conversion Add multiple KUnit tests to test various permutations of extent splitting and conversion. We test the following cases: 1. Split of unwritten extent into 2 parts and convert 1 part to written 2. Split of unwritten extent into 3 parts and convert 1 part to written 3. Split of written extent into 2 parts and convert 1 part to unwritten 4. Split of written extent into 3 parts and convert 1 part to unwritten 5. Zeroout fallback for all the above cases except 3-4 because zeroout is not supported for written to unwritten splits The main function we test here is ext4_split_convert_extents(). Currently some of the tests are failing due to issues in implementation. All failures are mitigated at other layers in ext4 [1] but still point out the mismatch in expectation of what the caller wants vs what the function does. The aim is to eventually fix all the failures we see here. More detailed implementation notes can be found in the topmost commit in the test file. [1] for example, EXT4_GET_BLOCKS_CONVERT doesn't really convert the split extent to written, but rather the callers end up doing the conversion. Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Link: https://patch.msgid.link/22bb9d17cd88c1318a2edde48887ca7488cb8a13.1769149131.git.ojaswin@linux.ibm.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2026-01-23 16:48:02 -05:00
Baolin Liu	591a4ab9b8	ext4: remove redundant NULL check after __GFP_NOFAIL Remove redundant NULL check after kcalloc() with GFP_NOFS \| __GFP_NOFAIL. Signed-off-by: Baolin Liu <liubaolin@kylinos.cn> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Link: https://patch.msgid.link/20260106062016.154573-1-liubaolin12138@163.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2026-01-19 22:37:06 -05:00
Zhang Yi	5f18f60d56	ext4: remove EXT4_GET_BLOCKS_IO_CREATE_EXT We do not use EXT4_GET_BLOCKS_IO_CREATE_EXT or split extents before submitting I/O; therefore, remove the related code. Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Link: https://patch.msgid.link/20260105014522.1937690-8-yi.zhang@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2026-01-19 22:28:30 -05:00
Zhang Yi	ea96cb5c4a	ext4: don't split extent before submitting I/O Currently, when writing back dirty pages to the filesystem with the dioread_nolock feature enabled and when doing DIO, if the area to be written back is part of an unwritten extent, the EXT4_GET_BLOCKS_IO_CREATE_EXT flag is set during block allocation before submitting I/O. The function ext4_split_convert_extents() then attempts to split this extent in advance. This approach is designed to prevents extent splitting and conversion to the written type from failing due to insufficient disk space at the time of I/O completion, which could otherwise result in data loss. However, we already have two mechanisms to ensure successful extent conversion. The first is the EXT4_GET_BLOCKS_METADATA_NOFAIL flag, which is a best effort, it permits the use of 2% of the reserved space or 4,096 blocks in the file system when splitting extents. This flag covers most scenarios where extent splitting might fail. The second is the EXT4_EXT_MAY_ZEROOUT flag, which is also set during extent splitting. If the reserved space is insufficient and splitting fails, it does not retry the allocation. Instead, it directly zeros out the extra part of the extent, thereby avoiding splitting and directly converting the entire extent to the written type. These two mechanisms also exist when I/Os are completed because there is a concurrency window between write-back and fallocate, which may still require us to split extents upon I/O completion. There is no much difference between splitting extents before submitting I/O. Therefore, It seems possible to defer the splitting until I/O completion, it won't increase the risk of I/O failure and data loss. On the contrary, if some I/Os can be merged when I/O completion, it can also reduce unnecessary splitting operations, thereby alleviating the pressure on reserved space. In addition, deferring extent splitting until I/O completion can also simplify the IO submission process and avoid initiating unnecessary journal handles when writing unwritten extents. Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Link: https://patch.msgid.link/20260105014522.1937690-3-yi.zhang@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2026-01-19 22:28:30 -05:00
Zhang Yi	01942af95a	ext4: use reserved metadata blocks when splitting extent on endio When performing buffered writes, we may need to split and convert an unwritten extent into a written one during the end I/O process. However, we do not reserve space specifically for these metadata changes, we only reserve 2% of space or 4096 blocks. To address this, we use EXT4_GET_BLOCKS_PRE_IO to potentially split extents in advance and EXT4_GET_BLOCKS_METADATA_NOFAIL to utilize reserved space if necessary. These two approaches can reduce the likelihood of running out of space and losing data. However, these methods are merely best efforts, we could still run out of space, and there is not much difference between converting an extent during the writeback process and the end I/O process, it won't increase the risk of losing data if we postpone the conversion. Therefore, also use EXT4_GET_BLOCKS_METADATA_NOFAIL in ext4_convert_unwritten_extents_endio() to prepare for the buffered I/O iomap conversion, which may perform extent conversion during the end I/O process. Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Link: https://patch.msgid.link/20260105014522.1937690-2-yi.zhang@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2026-01-19 22:28:30 -05:00
Zilin Guan	ca81109d4a	ext4: fix memory leak in ext4_ext_shift_extents() In ext4_ext_shift_extents(), if the extent is NULL in the while loop, the function returns immediately without releasing the path obtained via ext4_find_extent(), leading to a memory leak. Fix this by jumping to the out label to ensure the path is properly released. Fixes: `a18ed359bd` ("ext4: always check ext4_ext_find_extent result") Signed-off-by: Zilin Guan <zilin@seu.edu.cn> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Baokun Li <libaokun1@huawei.com> Link: https://patch.msgid.link/20251225084800.905701-1-zilin@seu.edu.cn Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org	2026-01-19 22:28:30 -05:00
Zhang Yi	a5567347b6	ext4: replace ext4_es_insert_extent() when caching on-disk extents In ext4, the remaining places for inserting extents into the extent status tree within ext4_ext_determine_insert_hole() and ext4_map_query_blocks() directly cache on-disk extents. We can use ext4_es_cache_extent() instead of ext4_es_insert_extent() in these cases. This will help reduce unnecessary increases in extent sequence numbers and cache invalidations after supporting IOMAP in the future. Suggested-by: Jan Kara <jack@suse.cz> Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Baokun Li <libaokun1@huawei.com> Message-ID: <20251129103247.686136-14-yi.zhang@huaweicloud.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2026-01-18 11:23:34 -05:00
Zhang Yi	c0329d0288	ext4: cleanup zeroout in ext4_split_extent_at() zero_ex is a temporary variable used only for writing zeros and inserting extent status entry, it will not be directly inserted into the tree. Therefore, it can be assigned values from the target extent in various scenarios, eliminating the need to explicitly assign values to each variable individually. Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Baokun Li <libaokun1@huawei.com> Message-ID: <20251129103247.686136-9-yi.zhang@huaweicloud.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2026-01-18 11:23:33 -05:00
Zhang Yi	79b592e8f1	ext4: drop extent cache when splitting extent fails When the split extent fails, we might leave some extents still being processed and return an error directly, which will result in stale extent entries remaining in the extent status tree. So drop all of the remaining potentially stale extents if the splitting fails. Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Baokun Li <libaokun1@huawei.com> Cc: stable@kernel.org Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Message-ID: <20251129103247.686136-8-yi.zhang@huaweicloud.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2026-01-18 11:23:33 -05:00
Zhang Yi	6d882ea3b0	ext4: drop extent cache after doing PARTIAL_VALID1 zeroout When splitting an unwritten extent in the middle and converting it to initialized in ext4_split_extent() with the EXT4_EXT_MAY_ZEROOUT and EXT4_EXT_DATA_VALID2 flags set, it could leave a stale unwritten extent. Assume we have an unwritten file and buffered write in the middle of it without dioread_nolock enabled, it will allocate blocks as written extent. 0 A B N [UUUUUUUUUUUU] on-disk extent U: unwritten extent [UUUUUUUUUUUU] extent status tree [--DDDDDDDD--] D: valid data \|<- ->\| ----> this range needs to be initialized ext4_split_extent() first try to split this extent at B with EXT4_EXT_DATA_PARTIAL_VALID1 and EXT4_EXT_MAY_ZEROOUT flag set, but ext4_split_extent_at() failed to split this extent due to temporary lack of space. It zeroout B to N and leave the entire extent as unwritten. 0 A B N [UUUUUUUUUUUU] on-disk extent [UUUUUUUUUUUU] extent status tree [--DDDDDDDDZZ] Z: zeroed data ext4_split_extent() then try to split this extent at A with EXT4_EXT_DATA_VALID2 flag set. This time, it split successfully and leave an written extent from A to N. 0 A B N [UUWWWWWWWWWW] on-disk extent W: written extent [UUUUUUUUUUUU] extent status tree [--DDDDDDDDZZ] Finally ext4_map_create_blocks() only insert extent A to B to the extent status tree, and leave an stale unwritten extent in the status tree. 0 A B N [UUWWWWWWWWWW] on-disk extent W: written extent [UUWWWWWWWWUU] extent status tree [--DDDDDDDDZZ] Fix this issue by always cached extent status entry after zeroing out the second part. Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Baokun Li <libaokun1@huawei.com> Cc: stable@kernel.org Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Message-ID: <20251129103247.686136-7-yi.zhang@huaweicloud.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2026-01-18 11:23:33 -05:00
Zhang Yi	8b4b19a2f9	ext4: don't cache extent during splitting extent Caching extents during the splitting process is risky, as it may result in stale extents remaining in the status tree. Moreover, in most cases, the corresponding extent block entries are likely already cached before the split happens, making caching here not particularly useful. Assume we have an unwritten extent, and then DIO writes the first half. [UUUUUUUUUUUUUUUU] on-disk extent U: unwritten extent [UUUUUUUUUUUUUUUU] extent status tree \|<- ->\| ----> dio write this range First, when ext4_split_extent_at() splits this extent, it truncates the existing extent and then inserts a new one. During this process, this extent status entry may be shrunk, and calls to ext4_find_extent() and ext4_cache_extents() may occur, which could potentially insert the truncated range as a hole into the extent status tree. After the split is completed, this hole is not replaced with the correct status. [UUUUUUU\|UUUUUUUU] on-disk extent U: unwritten extent [UUUUUUU\|HHHHHHHH] extent status tree H: hole Then, the outer calling functions will not correct this remaining hole extent either. Finally, if we perform a delayed buffer write on this latter part, it will re-insert the delayed extent and cause an error in space accounting. In adition, if the unwritten extent cache is not shrunk during the splitting, ext4_cache_extents() also conflicts with existing extents when caching extents. In the future, we will add checks when caching extents, which will trigger a warning. Therefore, Do not cache extents that are being split. Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Reviewed-by: Baokun Li <libaokun1@huawei.com> Cc: stable@kernel.org Message-ID: <20251129103247.686136-6-yi.zhang@huaweicloud.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2026-01-18 11:23:33 -05:00
Zhang Yi	5f1a1ccceb	ext4: correct the mapping status if the extent has been zeroed Before submitting I/O and allocating blocks with the EXT4_GET_BLOCKS_PRE_IO flag set, ext4_split_convert_extents() may convert the target extent range to initialized due to ENOSPC, ENOMEM, or EQUOTA errors. However, it still marks the mapping as incorrectly unwritten. Although this may not seem to cause any practical problems, it will result in an unnecessary extent conversion operation after I/O completion. Therefore, it's better to correct the returned mapping status. Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Reviewed-by: Baokun Li <libaokun1@huawei.com> Message-ID: <20251129103247.686136-5-yi.zhang@huaweicloud.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2026-01-18 11:23:33 -05:00
Zhang Yi	feaf2a80e7	ext4: don't set EXT4_GET_BLOCKS_CONVERT when splitting before submitting I/O When allocating blocks during within-EOF DIO and writeback with dioread_nolock enabled, EXT4_GET_BLOCKS_PRE_IO was set to split an existing large unwritten extent. However, EXT4_GET_BLOCKS_CONVERT was set when calling ext4_split_convert_extents(), which may potentially result in stale data issues. Assume we have an unwritten extent, and then DIO writes the second half. [UUUUUUUUUUUUUUUU] on-disk extent U: unwritten extent [UUUUUUUUUUUUUUUU] extent status tree \|<- ->\| ----> dio write this range First, ext4_iomap_alloc() call ext4_map_blocks() with EXT4_GET_BLOCKS_PRE_IO, EXT4_GET_BLOCKS_UNWRIT_EXT and EXT4_GET_BLOCKS_CREATE flags set. ext4_map_blocks() find this extent and call ext4_split_convert_extents() with EXT4_GET_BLOCKS_CONVERT and the above flags set. Then, ext4_split_convert_extents() calls ext4_split_extent() with EXT4_EXT_MAY_ZEROOUT, EXT4_EXT_MARK_UNWRIT2 and EXT4_EXT_DATA_VALID2 flags set, and it calls ext4_split_extent_at() to split the second half with EXT4_EXT_DATA_VALID2, EXT4_EXT_MARK_UNWRIT1, EXT4_EXT_MAY_ZEROOUT and EXT4_EXT_MARK_UNWRIT2 flags set. However, ext4_split_extent_at() failed to insert extent since a temporary lack -ENOSPC. It zeroes out the first half but convert the entire on-disk extent to written since the EXT4_EXT_DATA_VALID2 flag set, but left the second half as unwritten in the extent status tree. [0000000000SSSSSS] data S: stale data, 0: zeroed [WWWWWWWWWWWWWWWW] on-disk extent W: written extent [WWWWWWWWWWUUUUUU] extent status tree Finally, if the DIO failed to write data to the disk, the stale data in the second half will be exposed once the cached extent entry is gone. Fix this issue by not passing EXT4_GET_BLOCKS_CONVERT when splitting an unwritten extent before submitting I/O, and make ext4_split_convert_extents() to zero out the entire extent range to zero for this case, and also mark the extent in the extent status tree for consistency. Fixes: `b8a8684502` ("ext4: Introduce FALLOC_FL_ZERO_RANGE flag for fallocate") Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Reviewed-by: Baokun Li <libaokun1@huawei.com> Cc: stable@kernel.org Message-ID: <20251129103247.686136-4-yi.zhang@huaweicloud.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2026-01-18 11:23:33 -05:00
Zhang Yi	1bf6974822	ext4: don't zero the entire extent if EXT4_EXT_DATA_PARTIAL_VALID1 When allocating initialized blocks from a large unwritten extent, or when splitting an unwritten extent during end I/O and converting it to initialized, there is currently a potential issue of stale data if the extent needs to be split in the middle. 0 A B N [UUUUUUUUUUUU] U: unwritten extent [--DDDDDDDD--] D: valid data \|<- ->\| ----> this range needs to be initialized ext4_split_extent() first try to split this extent at B with EXT4_EXT_DATA_ENTIRE_VALID1 and EXT4_EXT_MAY_ZEROOUT flag set, but ext4_split_extent_at() failed to split this extent due to temporary lack of space. It zeroout B to N and mark the entire extent from 0 to N as written. 0 A B N [WWWWWWWWWWWW] W: written extent [SSDDDDDDDDZZ] Z: zeroed, S: stale data ext4_split_extent() then try to split this extent at A with EXT4_EXT_DATA_VALID2 flag set. This time, it split successfully and left a stale written extent from 0 to A. 0 A B N [WW\|WWWWWWWWWW] [SS\|DDDDDDDDZZ] Fix this by pass EXT4_EXT_DATA_PARTIAL_VALID1 to ext4_split_extent_at() when splitting at B, don't convert the entire extent to written and left it as unwritten after zeroing out B to N. The remaining work is just like the standard two-part split. ext4_split_extent() will pass the EXT4_EXT_DATA_VALID2 flag when it calls ext4_split_extent_at() for the second time, allowing it to properly handle the split. If the split is successful, it will keep extent from 0 to A as unwritten. Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Reviewed-by: Baokun Li <libaokun1@huawei.com> Cc: stable@kernel.org Message-ID: <20251129103247.686136-3-yi.zhang@huaweicloud.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2026-01-18 11:23:33 -05:00
Zhang Yi	22784ca541	ext4: subdivide EXT4_EXT_DATA_VALID1 When splitting an extent, if the EXT4_GET_BLOCKS_CONVERT flag is set and it is necessary to split the target extent in the middle, ext4_split_extent() first handles splitting the latter half of the extent and passes the EXT4_EXT_DATA_VALID1 flag. This flag implies that all blocks before the split point contain valid data; however, this assumption is incorrect. Therefore, subdivid EXT4_EXT_DATA_VALID1 into EXT4_EXT_DATA_ENTIRE_VALID1 and EXT4_EXT_DATA_PARTIAL_VALID1, which indicate that the first half of the extent is either entirely valid or only partially valid, respectively. These two flags cannot be set simultaneously. This patch does not use EXT4_EXT_DATA_PARTIAL_VALID1, it only replaces EXT4_EXT_DATA_VALID1 with EXT4_EXT_DATA_ENTIRE_VALID1 at the location where it is set, no logical changes. Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Reviewed-by: Baokun Li <libaokun1@huawei.com> Cc: stable@kernel.org Message-ID: <20251129103247.686136-2-yi.zhang@huaweicloud.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2026-01-18 11:23:33 -05:00
Baokun Li	125d1f6a5a	ext4: add EXT4_LBLK_TO_B macro for logical block to bytes conversion No functional changes. Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Message-ID: <20251121090654.631996-10-libaokun@huaweicloud.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2025-11-28 22:35:27 -05:00
Yang Erkun	cc742fd1d1	ext4: correct the comments place for EXT4_EXT_MAY_ZEROOUT Move the comments just before we set EXT4_EXT_MAY_ZEROOUT in ext4_split_convert_extents. Signed-off-by: Yang Erkun <yangerkun@huawei.com> Message-ID: <20251112084538.1658232-4-yangerkun@huawei.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2025-11-26 17:13:34 -05:00
Yang Erkun	dac092195b	ext4: rename EXT4_GET_BLOCKS_PRE_IO This flag has been generalized to split an unwritten extent when we do dio or dioread_nolock writeback, or to avoid merge new extents which was created by extents split. Update some related comments too. Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Baokun Li <libaokun1@huawei.com> Signed-off-by: Yang Erkun <yangerkun@huawei.com> Message-ID: <20251112084538.1658232-2-yangerkun@huawei.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2025-11-26 17:13:33 -05:00
Zhang Yi	7da5565cab	ext4: make ext4_es_lookup_extent() pass out the extent seq counter When querying extents in the extent status tree, we should hold the data_sem if we want to obtain the sequence number as a valid cookie simultaneously. However, currently, ext4_map_blocks() calls ext4_es_lookup_extent() without holding data_sem. Therefore, we should acquire i_es_lock instead, which also ensures that the sequence cookie and the extent remain consistent. Consequently, make ext4_es_lookup_extent() to pass out the sequence number when necessary. Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Message-ID: <20251013015128.499308-4-yi.zhang@huaweicloud.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2025-11-06 10:44:39 -05:00
Linus Torvalds	ff7dcfedf9	Major ext4 changes for 6.17: - Better scalability for ext4 block allocation - Fix insufficient credits when writing back large folios Miscellaneous bug fixes, especially when handling exteded attriutes, inline data, and fast commit. -----BEGIN PGP SIGNATURE----- iQEzBAABCAAdFiEEK2m5VNv+CHkogTfJ8vlZVpUNgaMFAmiIQEoACgkQ8vlZVpUN gaPB9wf/QursT7eLjx9Gz+4PYNWPKptBERQtmmDAnNYxDlEQ28+CHdMdEeiIPPoP IW1DIHfR7VaTI2K7gy6D5632VAhDDKiXBpIYu1yh3KPClAxjTZbhrif8J5UBXj1K ZwmCeLDF40jijua4rVKq3Fqf4iTJUyU2NqLpvcze7BZg7FwstXiNJrZ3DjAwi1BW j/5veWwh/KrNMzT5u0+RpMs4FBrdXQXvwSe/4pSx6d75r6WAdzhgUMy09os1wAWU 3N0JU+R5hAG6iFfbWQRURB6oLMmmxl4x2F7r5BvM27uQtELNLNcxBKZhMW97HpiE uSwKgo/59DKpWX0xQ2x/yugQIzd62w== =oPHD -----END PGP SIGNATURE----- Merge tag 'ext4_for_linus_6.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 updates from Ted Ts'o: "Major ext4 changes for 6.17: - Better scalability for ext4 block allocation - Fix insufficient credits when writing back large folios Miscellaneous bug fixes, especially when handling exteded attriutes, inline data, and fast commit" * tag 'ext4_for_linus_6.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (39 commits) ext4: do not BUG when INLINE_DATA_FL lacks system.data xattr ext4: implement linear-like traversal across order xarrays ext4: refactor choose group to scan group ext4: convert free groups order lists to xarrays ext4: factor out ext4_mb_scan_group() ext4: factor out ext4_mb_might_prefetch() ext4: factor out __ext4_mb_scan_group() ext4: fix largest free orders lists corruption on mb_optimize_scan switch ext4: fix zombie groups in average fragment size lists ext4: merge freed extent with existing extents before insertion ext4: convert sbi->s_mb_free_pending to atomic_t ext4: fix typo in CR_GOAL_LEN_SLOW comment ext4: get rid of some obsolete EXT4_MB_HINT flags ext4: utilize multiple global goals to reduce contention ext4: remove unnecessary s_md_lock on update s_mb_last_group ext4: remove unnecessary s_mb_last_start ext4: separate stream goal hits from s_bal_goals for better tracking ext4: add ext4_try_lock_group() to skip busy groups ext4: initialize superblock fields in the kballoc-test.c kunit tests ext4: refactor the inline directory conversion and new directory codepaths ...	2025-07-31 10:02:44 -07:00
Zhang Yi	57661f2875	ext4: replace ext4_writepage_trans_blocks() After ext4 supports large folios, the semantics of reserving credits in pages is no longer applicable. In most scenarios, reserving credits in extents is sufficient. Therefore, introduce ext4_chunk_trans_extent() to replace ext4_writepage_trans_blocks(). move_extent_per_page() is the only remaining location where we are still processing extents in pages. Suggested-by: Jan Kara <jack@suse.cz> Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20250707140814.542883-10-yi.zhang@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2025-07-13 23:41:52 -04:00
Zhang Yi	f4265b8d32	ext4: add FALLOC_FL_WRITE_ZEROES support Add support for FALLOC_FL_WRITE_ZEROES if the underlying device enable the unmap write zeroes operation. This first allocates blocks as unwritten, then issues a zero command outside of the running journal handle, and finally converts them to a written state. Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Link: https://lore.kernel.org/20250619111806.3546162-10-yi.zhang@huaweicloud.com Reviewed-by: "Martin K. Petersen" <martin.petersen@oracle.com> Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-06-23 12:45:14 +02:00
Ritesh Harjani (IBM)	7acd1b315c	ext4: Add a WARN_ON_ONCE for querying LAST_IN_LEAF instead We added the documentation in ext4_map_blocks() for usage of EXT4_GET_BLOCKS_QUERY_LAST_IN_LEAF flag. But It's better to add a WARN_ON_ONCE in case if anyone tries using this flag with CREATE to avoid a random issue later. Since depth can change with CREATE and it needs to be re-calculated before using it in there. Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Link: https://patch.msgid.link/ee6e82a224c50b432df9ce1ce3333c50182d8473.1747677758.git.ritesh.list@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2025-05-20 14:21:00 -04:00
Ritesh Harjani (IBM)	3be4bb7e71	ext4: Unwritten to written conversion requires EXT4_EX_NOCACHE This fixes the atomic write patch series after it was rebased on top of extent status cache cleanup series i.e. 'commit `402e38e6b7` ("ext4: prevent stale extent cache entries caused by concurrent I/O writeback")' After the above series, EXT4_GET_BLOCKS_IO_CONVERT_EXT flag which has EXT4_GET_BLOCKS_IO_SUBMIT flag set, requires that the io submit context of any kind should pass EXT4_EX_NOCACHE to avoid caching unncecessary extents in the extent status cache. This patch fixes that by adding the EXT4_EX_NOCACHE flag in ext4_convert_unwritten_extents_atomic() for unwritten to written conversion calls to ext4_map_blocks(). Fixes: `b86629c2b2` ("ext4: Add multi-fsblock atomic write support with bigalloc") Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Link: https://patch.msgid.link/ea0ad9378ff6d31d73f4e53f87548e3a20817689.1747677758.git.ritesh.list@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2025-05-20 14:21:00 -04:00
Ritesh Harjani (IBM)	b86629c2b2	ext4: Add multi-fsblock atomic write support with bigalloc EXT4 supports bigalloc feature which allows the FS to work in size of clusters (group of blocks) rather than individual blocks. This patch adds atomic write support for bigalloc so that systems with bs = ps can also create FS using - mkfs.ext4 -F -O bigalloc -b 4096 -C 16384 <dev> With bigalloc ext4 can support multi-fsblock atomic writes. We will have to adjust ext4's atomic write unit max value to cluster size. This can then support atomic write of size anywhere between [blocksize, clustersize]. This patch adds the required changes to enable multi-fsblock atomic write support using bigalloc in the next patch. In this patch for block allocation: we first query the underlying region of the requested range by calling ext4_map_blocks() call. Here are the various cases which we then handle depending upon the underlying mapping type: 1. If the underlying region for the entire requested range is a mapped extent, then we don't call ext4_map_blocks() to allocate anything. We don't need to even start the jbd2 txn in this case. 2. For an append write case, we create a mapped extent. 3. If the underlying region is entirely a hole, then we create an unwritten extent for the requested range. 4. If the underlying region is a large unwritten extent, then we split the extent into 2 unwritten extent of required size. 5. If the underlying region has any type of mixed mapping, then we call ext4_map_blocks() in a loop to zero out the unwritten and the hole regions within the requested range. This then provide a single mapped extent type mapping for the requested range. Note: We invoke ext4_map_blocks() in a loop with the EXT4_GET_BLOCKS_ZERO flag only when the underlying extent mapping of the requested range is not entirely a hole, an unwritten extent, or a fully mapped extent. That is, if the underlying region contains a mix of hole(s), unwritten extent(s), and mapped extent(s), we use this loop to ensure that all the short mappings are zeroed out. This guarantees that the entire requested range becomes a single, uniformly mapped extent. It is ok to do so because we know this is being done on a bigalloc enabled filesystem where the block bitmap represents the entire cluster unit. Note having a single contiguous underlying region of type mapped, unwrittn or hole is not a problem. But the reason to avoid writing on top of mixed mapping region is because, atomic writes requires all or nothing should get written for the userspace pwritev2 request. So if at any point in time during the write if a crash or a sudden poweroff occurs, the region undergoing atomic write should read either complete old data or complete new data. But it should never have a mix of both old and new data. So, we first convert any mixed mapping region to a single contiguous mapped extent before any data gets written to it. This is because normally FS will only convert unwritten extents to written at the end of the write in ->end_io() call. And if we allow the writes over a mixed mapping and if a sudden power off happens in between, we will end up reading mix of new data (over mapped extents) and old data (over unwritten extents), because unwritten to written conversion never went through. So to avoid this and to avoid writes getting torned due to mixed mapping, we first allocate a single contiguous block mapping and then do the write. Acked-by: Darrick J. Wong <djwong@kernel.org> Co-developed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Link: https://patch.msgid.link/c4965ac3407cbc773f0bc954d0966d9696f5038a.1747337952.git.ritesh.list@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2025-05-20 10:31:12 -04:00
Ritesh Harjani (IBM)	5bb12b1837	ext4: Add support for EXT4_GET_BLOCKS_QUERY_LEAF_BLOCKS There can be a case where there are contiguous extents on the adjacent leaf nodes of on-disk extent trees. So when someone tries to write to this contiguous range, ext4_map_blocks() call will split by returning 1 extent at a time if this is not already cached in extent_status tree cache (where if these extents when cached can get merged since they are contiguous). This is fine for a normal write however in case of atomic writes, it can't afford to break the write into two. Now this is also something that will only happen in the slow write case where we call ext4_map_blocks() for each of these extents spread across different leaf nodes. However, there is no guarantee that these extent status cache cannot be reclaimed before the last call to ext4_map_blocks() in ext4_map_blocks_atomic_write_slow(). Hence this patch adds support of EXT4_GET_BLOCKS_QUERY_LEAF_BLOCKS. This flag checks if the requested range can be fully found in extent status cache and return. If not, it looks up in on-disk extent tree via ext4_map_query_blocks(). If the found extent is the last entry in the leaf node, then it goes and queries the next lblk to see if there is an adjacent contiguous extent in the adjacent leaf node of the on-disk extent tree. Even though there can be a case where there are multiple adjacent extent entries spread across multiple leaf nodes. But we only read an adjacent leaf block i.e. in total of 2 extent entries spread across 2 leaf nodes. The reason for this is that we are mostly only going to support atomic writes with upto 64KB or maybe max upto 1MB of atomic write support. Acked-by: Darrick J. Wong <djwong@kernel.org> Co-developed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Link: https://patch.msgid.link/6bb563e661f5fbd80e266a9e6ce6e29178f555f6.1747337952.git.ritesh.list@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2025-05-20 10:31:12 -04:00
Eric Biggers	6cbab5f95e	ext4: remove sbi argument from ext4_chksum() Since ext4_chksum() no longer uses its sbi argument, remove it. Signed-off-by: Eric Biggers <ebiggers@google.com> Reviewed-by: Baokun Li <libaokun1@huawei.com> Link: https://patch.msgid.link/20250513053809.699974-2-ebiggers@kernel.org Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2025-05-20 10:31:12 -04:00

1 2 3 4 5 ...

735 Commits