mirror of
https://github.com/torvalds/linux.git
synced 2026-05-26 16:12:59 +02:00
btrfs: fix race between async reclaim worker and close_ctree()
Syzbot reported an assertion failure due to an attempt to add a delayed iput after we have set BTRFS_FS_STATE_NO_DELAYED_IPUT in the fs_info state: WARNING: CPU: 0 PID: 65 at fs/btrfs/inode.c:3420 btrfs_add_delayed_iput+0x2f8/0x370 fs/btrfs/inode.c:3420 Modules linked in: CPU: 0 UID: 0 PID: 65 Comm: kworker/u8:4 Not tainted 6.15.0-next-20250530-syzkaller #0 PREEMPT(full) Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 05/07/2025 Workqueue: btrfs-endio-write btrfs_work_helper RIP: 0010:btrfs_add_delayed_iput+0x2f8/0x370 fs/btrfs/inode.c:3420 Code: 4e ad 5d (...) RSP: 0018:ffffc9000213f780 EFLAGS: 00010293 RAX: ffffffff83c635b7 RBX: ffff888058920000 RCX: ffff88801c769e00 RDX: 0000000000000000 RSI: 0000000000000100 RDI: 0000000000000000 RBP: 0000000000000001 R08: ffff888058921b67 R09: 1ffff1100b12436c R10: dffffc0000000000 R11: ffffed100b12436d R12: 0000000000000001 R13: dffffc0000000000 R14: ffff88807d748000 R15: 0000000000000100 FS: 0000000000000000(0000) GS:ffff888125c53000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00002000000bd038 CR3: 000000006a142000 CR4: 00000000003526f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <TASK> btrfs_put_ordered_extent+0x19f/0x470 fs/btrfs/ordered-data.c:635 btrfs_finish_one_ordered+0x11d8/0x1b10 fs/btrfs/inode.c:3312 btrfs_work_helper+0x399/0xc20 fs/btrfs/async-thread.c:312 process_one_work kernel/workqueue.c:3238 [inline] process_scheduled_works+0xae1/0x17b0 kernel/workqueue.c:3321 worker_thread+0x8a0/0xda0 kernel/workqueue.c:3402 kthread+0x70e/0x8a0 kernel/kthread.c:464 ret_from_fork+0x3fc/0x770 arch/x86/kernel/process.c:148 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245 </TASK> This can happen due to a race with the async reclaim worker like this: 1) The async metadata reclaim worker enters shrink_delalloc(), which calls btrfs_start_delalloc_roots() with an nr_pages argument that has a value less than LONG_MAX, and that in turn enters start_delalloc_inodes(), which sets the local variable 'full_flush' to false because wbc->nr_to_write is less than LONG_MAX; 2) There it finds inode X in a root's delalloc list, grabs a reference for inode X (with igrab()), and triggers writeback for it with filemap_fdatawrite_wbc(), which creates an ordered extent for inode X; 3) The unmount sequence starts from another task, we enter close_ctree() and we flush the workqueue fs_info->endio_write_workers, which waits for the ordered extent for inode X to complete and when dropping the last reference of the ordered extent, with btrfs_put_ordered_extent(), when we call btrfs_add_delayed_iput() we don't add the inode to the list of delayed iputs because it has a refcount of 2, so we decrement it to 1 and return; 4) Shortly after at close_ctree() we call btrfs_run_delayed_iputs() which runs all delayed iputs, and then we set BTRFS_FS_STATE_NO_DELAYED_IPUT in the fs_info state; 5) The async reclaim worker, after calling filemap_fdatawrite_wbc(), now calls btrfs_add_delayed_iput() for inode X and there we trigger an assertion failure since the fs_info state has the flag BTRFS_FS_STATE_NO_DELAYED_IPUT set. Fix this by setting BTRFS_FS_STATE_NO_DELAYED_IPUT only after we wait for the async reclaim workers to finish, after we call cancel_work_sync() for them at close_ctree(), and by running delayed iputs after wait for the reclaim workers to finish and before setting the bit. This race was recently introduced by commit19e60b2a95("btrfs: add extra warning if delayed iput is added when it's not allowed"). Without the new validation at btrfs_add_delayed_iput(), this described scenario was safe because close_ctree() later calls btrfs_commit_super(). That will run any final delayed iputs added by reclaim workers in the window between the btrfs_run_delayed_iputs() and the the reclaim workers being shut down. Reported-by: syzbot+0ed30ad435bf6f5b7a42@syzkaller.appspotmail.com Link: https://lore.kernel.org/linux-btrfs/6840481c.a00a0220.d4325.000c.GAE@google.com/T/#u Fixes:19e60b2a95("btrfs: add extra warning if delayed iput is added when it's not allowed") Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
This commit is contained in:
parent
1961d20f6f
commit
a26bf338cd
|
|
@ -4312,8 +4312,8 @@ void __cold close_ctree(struct btrfs_fs_info *fs_info)
|
|||
*
|
||||
* So wait for all ongoing ordered extents to complete and then run
|
||||
* delayed iputs. This works because once we reach this point no one
|
||||
* can either create new ordered extents nor create delayed iputs
|
||||
* through some other means.
|
||||
* can create new ordered extents, but delayed iputs can still be added
|
||||
* by a reclaim worker (see comments further below).
|
||||
*
|
||||
* Also note that btrfs_wait_ordered_roots() is not safe here, because
|
||||
* it waits for BTRFS_ORDERED_COMPLETE to be set on an ordered extent,
|
||||
|
|
@ -4324,15 +4324,29 @@ void __cold close_ctree(struct btrfs_fs_info *fs_info)
|
|||
btrfs_flush_workqueue(fs_info->endio_write_workers);
|
||||
/* Ordered extents for free space inodes. */
|
||||
btrfs_flush_workqueue(fs_info->endio_freespace_worker);
|
||||
/*
|
||||
* Run delayed iputs in case an async reclaim worker is waiting for them
|
||||
* to be run as mentioned above.
|
||||
*/
|
||||
btrfs_run_delayed_iputs(fs_info);
|
||||
/* There should be no more workload to generate new delayed iputs. */
|
||||
set_bit(BTRFS_FS_STATE_NO_DELAYED_IPUT, &fs_info->fs_state);
|
||||
|
||||
cancel_work_sync(&fs_info->async_reclaim_work);
|
||||
cancel_work_sync(&fs_info->async_data_reclaim_work);
|
||||
cancel_work_sync(&fs_info->preempt_reclaim_work);
|
||||
cancel_work_sync(&fs_info->em_shrinker_work);
|
||||
|
||||
/*
|
||||
* Run delayed iputs again because an async reclaim worker may have
|
||||
* added new ones if it was flushing delalloc:
|
||||
*
|
||||
* shrink_delalloc() -> btrfs_start_delalloc_roots() ->
|
||||
* start_delalloc_inodes() -> btrfs_add_delayed_iput()
|
||||
*/
|
||||
btrfs_run_delayed_iputs(fs_info);
|
||||
|
||||
/* There should be no more workload to generate new delayed iputs. */
|
||||
set_bit(BTRFS_FS_STATE_NO_DELAYED_IPUT, &fs_info->fs_state);
|
||||
|
||||
/* Cancel or finish ongoing discard work */
|
||||
btrfs_discard_cleanup(fs_info);
|
||||
|
||||
|
|
|
|||
Loading…
Reference in New Issue
Block a user