linux/fs
Qu Wenruo f3a5367c67 btrfs: protect folio::private when attaching extent buffer folios
[BUG]
Since v6.8 there are rare kernel crashes reported by various people,
the common factor is bad page status error messages like this:

  BUG: Bad page state in process kswapd0  pfn:d6e840
  page: refcount:0 mapcount:0 mapping:000000007512f4f2 index:0x2796c2c7c
  pfn:0xd6e840
  aops:btree_aops ino:1
  flags: 0x17ffffe0000008(uptodate|node=0|zone=2|lastcpupid=0x3fffff)
  page_type: 0xffffffff()
  raw: 0017ffffe0000008 dead000000000100 dead000000000122 ffff88826d0be4c0
  raw: 00000002796c2c7c 0000000000000000 00000000ffffffff 0000000000000000
  page dumped because: non-NULL mapping

[CAUSE]
Commit 09e6cef19c ("btrfs: refactor alloc_extent_buffer() to
allocate-then-attach method") changes the sequence when allocating a new
extent buffer.

Previously we always called grab_extent_buffer() under
mapping->i_private_lock, to ensure the safety on modification on
folio::private (which is a pointer to extent buffer for regular
sectorsize).

This can lead to the following race:

Thread A is trying to allocate an extent buffer at bytenr X, with 4
4K pages, meanwhile thread B is trying to release the page at X + 4K
(the second page of the extent buffer at X).

           Thread A                |                 Thread B
-----------------------------------+-------------------------------------
                                   | btree_release_folio()
				   | | This is for the page at X + 4K,
				   | | Not page X.
				   | |
alloc_extent_buffer()              | |- release_extent_buffer()
|- filemap_add_folio() for the     | |  |- atomic_dec_and_test(eb->refs)
|  page at bytenr X (the first     | |  |
|  page).                          | |  |
|  Which returned -EEXIST.         | |  |
|                                  | |  |
|- filemap_lock_folio()            | |  |
|  Returned the first page locked. | |  |
|                                  | |  |
|- grab_extent_buffer()            | |  |
|  |- atomic_inc_not_zero()        | |  |
|  |  Returned false               | |  |
|  |- folio_detach_private()       | |  |- folio_detach_private() for X
|     |- folio_test_private()      | |     |- folio_test_private()
      |  Returned true             | |     |  Returned true
      |- folio_put()               |       |- folio_put()

Now there are two puts on the same folio at folio X, leading to refcount
underflow of the folio X, and eventually causing the BUG_ON() on the
page->mapping.

The condition is not that easy to hit:

- The release must be triggered for the middle page of an eb
  If the release is on the same first page of an eb, page lock would kick
  in and prevent the race.

- folio_detach_private() has a very small race window
  It's only between folio_test_private() and folio_clear_private().

That's exactly when mapping->i_private_lock is used to prevent such race,
and commit 09e6cef19c ("btrfs: refactor alloc_extent_buffer() to
allocate-then-attach method") screwed that up.

At that time, I thought the page lock would kick in as
filemap_release_folio() also requires the page to be locked, but forgot
the filemap_release_folio() only locks one page, not all pages of an
extent buffer.

[FIX]
Move all the code requiring i_private_lock into
attach_eb_folio_to_filemap(), so that everything is done with proper
lock protection.

Furthermore to prevent future problems, add an extra
lockdep_assert_locked() to ensure we're holding the proper lock.

To reproducer that is able to hit the race (takes a few minutes with
instrumented code inserting delays to alloc_extent_buffer()):

  #!/bin/sh
  drop_caches () {
	  while(true); do
		  echo 3 > /proc/sys/vm/drop_caches
		  echo 1 > /proc/sys/vm/compact_memory
	  done
  }

  run_tar () {
	  while(true); do
		  for x in `seq 1 80` ; do
			  tar cf /dev/zero /mnt > /dev/null &
		  done
		  wait
	  done
  }

  mkfs.btrfs -f -d single -m single /dev/vda
  mount -o noatime /dev/vda /mnt
  # create 200,000 files, 1K each
  ./simoop -n 200000 -E -f 1k /mnt
  drop_caches &
  (run_tar)

Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/linux-btrfs/CAHk-=wgt362nGfScVOOii8cgKn2LVVHeOvOA7OBwg1OwbuJQcw@mail.gmail.com/
Reported-by: Mikhail Gavrilov <mikhail.v.gavrilov@gmail.com>
Link: https://lore.kernel.org/lkml/CABXGCsPktcHQOvKTbPaTwegMExije=Gpgci5NW=hqORo-s7diA@mail.gmail.com/
Reported-by: Toralf Förster <toralf.foerster@gmx.de>
Link: https://lore.kernel.org/linux-btrfs/e8b3311c-9a75-4903-907f-fc0f7a3fe423@gmx.de/
Reported-by: syzbot+f80b066392366b4af85e@syzkaller.appspotmail.com
Fixes: 09e6cef19c ("btrfs: refactor alloc_extent_buffer() to allocate-then-attach method")
CC: stable@vger.kernel.org # 6.8+
CC: Chris Mason <clm@fb.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-06-06 21:42:22 +02:00
..
9p fs/9p: mitigate inode collisions 2024-04-22 15:34:27 +00:00
adfs mm, slab: remove last vestiges of SLAB_MEM_SPREAD 2024-03-12 20:32:19 -07:00
affs affs: remove SLAB_MEM_SPREAD flag usage 2024-02-26 11:36:28 +01:00
afs afs: Fix occasional rmdir-then-VNOVNODE with generic/011 2024-03-14 12:13:21 +01:00
autofs dcache stuff for this cycle 2024-01-11 20:11:35 -08:00
bcachefs bcachefs: fix integer conversion bug 2024-04-28 21:34:29 -04:00
befs mm, slab: remove last vestiges of SLAB_MEM_SPREAD 2024-03-12 20:32:19 -07:00
bfs mm, slab: remove last vestiges of SLAB_MEM_SPREAD 2024-03-12 20:32:19 -07:00
btrfs btrfs: protect folio::private when attaching extent buffer folios 2024-06-06 21:42:22 +02:00
cachefiles cachefiles: fix memory leak in cachefiles_add_cache() 2024-02-20 09:46:07 +01:00
ceph ceph: switch to use cap_delay_lock for the unlink delay list 2024-04-11 22:56:28 +02:00
coda mm, slab: remove last vestiges of SLAB_MEM_SPREAD 2024-03-12 20:32:19 -07:00
configfs
cramfs fs,block: yield devices early 2024-03-27 13:17:15 +01:00
crypto fscrypt updates for 6.9 2024-03-12 13:17:36 -07:00
debugfs debugfs: fix wait/cancellation handling during remove 2024-03-07 22:08:15 +00:00
devpts fs: Remove the now superfluous sentinel elements from ctl_table array 2023-12-28 04:57:57 -08:00
dlm dlm for 6.9 2024-03-18 15:39:48 -07:00
ecryptfs Merge tag 'exportfs-6.9' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/cel/linux 2024-01-23 17:56:30 +01:00
efivarfs efivarfs: Drop 'duplicates' bool parameter on efivar_init() 2024-02-25 09:43:39 +01:00
efs efs: remove SLAB_MEM_SPREAD flag usage 2024-02-27 11:21:33 +01:00
erofs erofs: reliably distinguish block based and fscache mode 2024-04-28 20:36:52 +08:00
exfat Description for this pull request: 2024-03-21 09:47:12 -07:00
exportfs fs: Create a generic is_dot_dotdot() utility 2024-01-23 10:58:56 -05:00
ext2 \n 2024-03-13 14:30:58 -07:00
ext4 fs,block: yield devices early 2024-03-27 13:17:15 +01:00
f2fs fs,block: yield devices early 2024-03-27 13:17:15 +01:00
fat - Kuan-Wei Chiu has developed the well-named series "lib min_heap: Min 2024-03-14 18:03:09 -07:00
freevxfs mm, slab: remove last vestiges of SLAB_MEM_SPREAD 2024-03-12 20:32:19 -07:00
fuse cuse: add kernel-doc comments to cuse_process_init_reply() 2024-04-15 11:02:10 +02:00
gfs2 gfs2 fix 2024-03-25 10:53:39 -07:00
hfs hfs: really remove hfs_writepage 2023-12-29 11:58:34 -08:00
hfsplus vfs-6.9.misc 2024-03-11 09:38:17 -07:00
hostfs hostfs: use d_splice_alias() calling conventions to simplify failure exits 2023-12-21 12:51:00 -05:00
hpfs mm, slab: remove last vestiges of SLAB_MEM_SPREAD 2024-03-12 20:32:19 -07:00
hugetlbfs vfs-6.9.misc 2024-03-11 09:38:17 -07:00
iomap vfs-6.9.rw_hint 2024-03-04 18:35:21 +01:00
isofs mm, slab: remove last vestiges of SLAB_MEM_SPREAD 2024-03-12 20:32:19 -07:00
jbd2 jbd2: abort journal when detecting metadata writeback error of fs dev 2024-01-04 23:42:21 -05:00
jffs2 mm, slab: remove last vestiges of SLAB_MEM_SPREAD 2024-03-12 20:32:19 -07:00
jfs fs,block: yield devices early 2024-03-27 13:17:15 +01:00
kernfs kernfs: annotate different lockdep class for of->mutex of writable files 2024-04-14 06:55:46 -04:00
lockd NFSD 6.9 Release Notes 2024-03-12 14:27:37 -07:00
minix minix: remove SLAB_MEM_SPREAD flag usage 2024-02-27 11:21:32 +01:00
netfs netfs: Fix the pre-flush when appending to a file in writethrough mode 2024-04-26 14:56:18 +02:00
nfs NFS client bugfixes for Linux 6.9 2024-04-29 12:07:37 -07:00
nfs_common
nfsd nfsd-6.9 fixes: 2024-04-29 14:22:24 -07:00
nilfs2 nilfs2: fix OOB in nilfs_set_de_type 2024-04-16 15:39:52 -07:00
nls
notify fanotify: allow freeze when waiting response for permission events 2024-03-07 12:59:51 +01:00
ntfs3 ntfs3: add legacy ntfs file operations 2024-04-23 09:39:07 +02:00
ocfs2 - Kuan-Wei Chiu has developed the well-named series "lib min_heap: Min 2024-03-14 18:03:09 -07:00
omfs
openpromfs openpromfs: remove SLAB_MEM_SPREAD flag usage 2024-02-27 11:21:32 +01:00
orangefs Julia Lawall reported this null pointer dereference, this should fix it. 2024-02-14 15:57:53 -05:00
overlayfs ovl: relax WARN_ON in ovl_verify_area() 2024-03-17 15:59:41 +02:00
proc mm: support page_mapcount() on page_has_type() pages 2024-04-24 19:34:26 -07:00
pstore pstore/zone: Don't clear memory twice 2024-03-09 12:33:22 -08:00
qnx4 mm, slab: remove last vestiges of SLAB_MEM_SPREAD 2024-03-12 20:32:19 -07:00
qnx6 qnx6: remove SLAB_MEM_SPREAD flag usage 2024-02-27 11:21:32 +01:00
quota mm, slab: remove last vestiges of SLAB_MEM_SPREAD 2024-03-12 20:32:19 -07:00
ramfs ramfs: Initialize security of in-memory inodes 2024-01-26 09:08:16 -08:00
reiserfs fs,block: yield devices early 2024-03-27 13:17:15 +01:00
romfs fs,block: yield devices early 2024-03-27 13:17:15 +01:00
smb smb3: fix lock ordering potential deadlock in cifs_sync_mid_result 2024-04-25 12:49:50 -05:00
squashfs Squashfs: check the inode number is not the invalid value of zero 2024-04-16 15:39:50 -07:00
sysfs fs: sysfs: Fix reference leak in sysfs_break_active_protection() 2024-04-11 15:16:48 +02:00
sysv sysv: remove SLAB_MEM_SPREAD flag usage 2024-02-27 11:21:31 +01:00
tracefs eventfs: Have "events" directory get permissions from its parent 2024-05-04 04:25:37 -04:00
ubifs This pull request contains updates for UBI and UBIFS: 2024-03-21 15:09:29 -07:00
udf mm, slab: remove last vestiges of SLAB_MEM_SPREAD 2024-03-12 20:32:19 -07:00
ufs mm, slab: remove last vestiges of SLAB_MEM_SPREAD 2024-03-12 20:32:19 -07:00
unicode
vboxsf vboxsf: explicitly deny setlease attempts 2024-04-03 16:06:39 +02:00
verity Networking changes for 6.9. 2024-03-12 17:44:08 -07:00
xfs Bug fixes for 6.9-rc3: 2024-04-06 09:14:18 -07:00
zonefs zonefs: Use str_plural() to fix Coccinelle warning 2024-04-10 07:23:47 +09:00
aio.c aio: Fix null ptr deref in aio_complete() wakeup 2024-04-05 11:20:28 +02:00
anon_inodes.c Merge branch 'kvm-guestmemfd' into HEAD 2023-11-14 08:31:31 -05:00
attr.c lsm/stable-6.9 PR 20240312 2024-03-12 20:03:34 -07:00
backing-file.c fs: Use KMEM_CACHE instead of kmem_cache_create 2024-02-02 13:11:50 +01:00
bad_inode.c
binfmt_elf_fdpic.c binfmt: replace deprecated strncpy 2024-03-21 20:20:52 -07:00
binfmt_elf_test.c
binfmt_elf.c
binfmt_flat.c
binfmt_misc.c execve updates for v6.7-rc1 2023-10-30 19:28:19 -10:00
binfmt_script.c
buffer.c vfs-6.9.iomap 2024-03-11 10:07:03 -07:00
char_dev.c As usual, lots of singleton and doubleton patches all over the tree and 2023-11-02 20:53:31 -10:00
compat_binfmt_elf.c
coredump.c iov_iter: get rid of 'copy_mc' flag 2024-03-06 10:52:12 +01:00
d_path.c
dax.c fs : Fix warning using plain integer as NULL 2023-11-18 15:00:01 +01:00
dcache.c vfs-6.9.misc 2024-03-11 09:38:17 -07:00
direct-io.c block, fs: Restore the per-bio/request data lifetime fields 2024-02-06 14:31:05 +01:00
drop_caches.c
eventfd.c eventfd: strictly check the count parameter of eventfd_write to avoid inputting illegal strings 2024-02-08 10:12:26 +01:00
eventpoll.c epoll: be better about file lifetimes 2024-05-05 14:00:48 -07:00
exec.c execve fixes for v6.9-rc2 2024-03-27 09:57:30 -07:00
fcntl.c vfs-6.9.iomap 2024-03-11 10:07:03 -07:00
fhandle.c do_sys_name_to_handle(): use kzalloc() to fix kernel-infoleak 2024-01-22 15:33:38 +01:00
file_table.c lsm/stable-6.9 PR 20240312 2024-03-12 20:03:34 -07:00
file.c file: remove __receive_fd() 2023-12-12 14:24:14 +01:00
filesystems.c
fs_context.c
fs_parser.c __fs_parse: Correct a documentation comment 2024-02-02 13:11:50 +01:00
fs_pin.c
fs_struct.c
fs_types.c
fs-writeback.c writeback: move wb_wakeup_delayed defination to fs-writeback.c 2024-01-22 15:33:38 +01:00
fsopen.c
init.c
inode.c bcachefs updates for 6.9 2024-03-15 09:00:09 -07:00
internal.h pidfs: remove config option 2024-03-13 12:53:53 -07:00
ioctl.c fs: Return ENOTTY directly if FS_IOC_GETUUID or FS_IOC_GETFSSYSFSPATH fail 2024-04-09 12:03:49 +02:00
Kconfig - Sumanth Korikkar has taught s390 to allocate hotplug-time page frames 2024-03-14 17:43:30 -07:00
Kconfig.binfmt
kernel_read_file.c
libfs.c pidfs: remove config option 2024-03-13 12:53:53 -07:00
locks.c filelock: fix deadlock detection in POSIX locking 2024-02-20 09:53:33 +01:00
Makefile vfs-6.9.pidfd 2024-03-11 10:21:06 -07:00
mbcache.c vfs: remove SLAB_MEM_SPREAD flag usage 2024-02-27 11:21:31 +01:00
mnt_idmapping.c fs/mnt_idmapping.c: Return -EINVAL when no map is written 2024-02-08 10:12:37 +01:00
mount.h mounts: keep list of mounts in an rbtree 2023-11-18 14:56:16 +01:00
mpage.c block, fs: Restore the per-bio/request data lifetime fields 2024-02-06 14:31:05 +01:00
namei.c security: Place security_path_post_mknod() where the original IMA call was 2024-04-03 10:21:32 -07:00
namespace.c fs: relax mount_setattr() permission checks 2024-02-07 21:16:29 +01:00
nsfs.c pidfs: remove config option 2024-03-13 12:53:53 -07:00
open.c lsm/stable-6.9 PR 20240312 2024-03-12 20:03:34 -07:00
pidfs.c pidfs: remove config option 2024-03-13 12:53:53 -07:00
pipe.c fs/pipe: Convert to lockdep_cmp_fn 2024-02-02 13:11:49 +01:00
pnode.c mounts: keep list of mounts in an rbtree 2023-11-18 14:56:16 +01:00
pnode.h
posix_acl.c lsm/stable-6.9 PR 20240312 2024-03-12 20:03:34 -07:00
proc_namespace.c namespace: extract show_path() helper 2023-11-18 14:56:16 +01:00
read_write.c fsnotify: optionally pass access range in file permission hooks 2023-12-12 16:20:02 +01:00
readdir.c fsnotify: optionally pass access range in file permission hooks 2023-12-12 16:20:02 +01:00
remap_range.c remap_range: merge do_clone_file_range() into vfs_clone_file_range() 2024-02-06 17:07:21 +01:00
select.c fs/select: rework stack allocation hack for clang 2024-02-20 09:23:52 +01:00
seq_file.c
signalfd.c
splice.c fs: use splice_copy_file_range() inline helper 2023-12-12 16:20:02 +01:00
stack.c
stat.c vfs-6.8.mount 2024-01-08 10:57:34 -08:00
statfs.c
super.c fs,block: yield devices early 2024-03-27 13:17:15 +01:00
sync.c
sysctls.c fs: Remove the now superfluous sentinel elements from ctl_table array 2023-12-28 04:57:57 -08:00
timerfd.c
userfaultfd.c userfaultfd: use per-vma locks in userfaultfd operations 2024-02-22 15:27:20 -08:00
utimes.c
xattr.c evm: Move to LSM infrastructure 2024-02-15 23:43:47 -05:00