linux/fs
Boris Burkov 8ee0358025 btrfs: fix fatal extent_buffer readahead vs releasepage race
commit 6bf9cd2eed upstream.

Under somewhat convoluted conditions, it is possible to attempt to
release an extent_buffer that is under io, which triggers a BUG_ON in
btrfs_release_extent_buffer_pages.

This relies on a few different factors. First, extent_buffer reads done
as readahead for searching use WAIT_NONE, so they free the local extent
buffer reference while the io is outstanding. However, they should still
be protected by TREE_REF. However, if the system is doing signficant
reclaim, and simultaneously heavily accessing the extent_buffers, it is
possible for releasepage to race with two concurrent readahead attempts
in a way that leaves TREE_REF unset when the readahead extent buffer is
released.

Essentially, if two tasks race to allocate a new extent_buffer, but the
winner who attempts the first io is rebuffed by a page being locked
(likely by the reclaim itself) then the loser will still go ahead with
issuing the readahead. The loser's call to find_extent_buffer must also
race with the reclaim task reading the extent_buffer's refcount as 1 in
a way that allows the reclaim to re-clear the TREE_REF checked by
find_extent_buffer.

The following represents an example execution demonstrating the race:

            CPU0                                                         CPU1                                           CPU2
reada_for_search                                            reada_for_search
  readahead_tree_block                                        readahead_tree_block
    find_create_tree_block                                      find_create_tree_block
      alloc_extent_buffer                                         alloc_extent_buffer
                                                                  find_extent_buffer // not found
                                                                  allocates eb
                                                                  lock pages
                                                                  associate pages to eb
                                                                  insert eb into radix tree
                                                                  set TREE_REF, refs == 2
                                                                  unlock pages
                                                              read_extent_buffer_pages // WAIT_NONE
                                                                not uptodate (brand new eb)
                                                                                                            lock_page
                                                                if !trylock_page
                                                                  goto unlock_exit // not an error
                                                              free_extent_buffer
                                                                release_extent_buffer
                                                                  atomic_dec_and_test refs to 1
        find_extent_buffer // found
                                                                                                            try_release_extent_buffer
                                                                                                              take refs_lock
                                                                                                              reads refs == 1; no io
          atomic_inc_not_zero refs to 2
          mark_buffer_accessed
            check_buffer_tree_ref
              // not STALE, won't take refs_lock
              refs == 2; TREE_REF set // no action
    read_extent_buffer_pages // WAIT_NONE
                                                                                                              clear TREE_REF
                                                                                                              release_extent_buffer
                                                                                                                atomic_dec_and_test refs to 1
                                                                                                                unlock_page
      still not uptodate (CPU1 read failed on trylock_page)
      locks pages
      set io_pages > 0
      submit io
      return
    free_extent_buffer
      release_extent_buffer
        dec refs to 0
        delete from radix tree
        btrfs_release_extent_buffer_pages
          BUG_ON(io_pages > 0)!!!

We observe this at a very low rate in production and were also able to
reproduce it in a test environment by introducing some spurious delays
and by introducing probabilistic trylock_page failures.

To fix it, we apply check_tree_ref at a point where it could not
possibly be unset by a competing task: after io_pages has been
incremented. All the codepaths that clear TREE_REF check for io, so they
would not be able to clear it after this point until the io is done.

Stack trace, for reference:
[1417839.424739] ------------[ cut here ]------------
[1417839.435328] kernel BUG at fs/btrfs/extent_io.c:4841!
[1417839.447024] invalid opcode: 0000 [#1] SMP
[1417839.502972] RIP: 0010:btrfs_release_extent_buffer_pages+0x20/0x1f0
[1417839.517008] Code: ed e9 ...
[1417839.558895] RSP: 0018:ffffc90020bcf798 EFLAGS: 00010202
[1417839.570816] RAX: 0000000000000002 RBX: ffff888102d6def0 RCX: 0000000000000028
[1417839.586962] RDX: 0000000000000002 RSI: ffff8887f0296482 RDI: ffff888102d6def0
[1417839.603108] RBP: ffff88885664a000 R08: 0000000000000046 R09: 0000000000000238
[1417839.619255] R10: 0000000000000028 R11: ffff88885664af68 R12: 0000000000000000
[1417839.635402] R13: 0000000000000000 R14: ffff88875f573ad0 R15: ffff888797aafd90
[1417839.651549] FS:  00007f5a844fa700(0000) GS:ffff88885f680000(0000) knlGS:0000000000000000
[1417839.669810] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[1417839.682887] CR2: 00007f7884541fe0 CR3: 000000049f609002 CR4: 00000000003606e0
[1417839.699037] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[1417839.715187] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[1417839.731320] Call Trace:
[1417839.737103]  release_extent_buffer+0x39/0x90
[1417839.746913]  read_block_for_search.isra.38+0x2a3/0x370
[1417839.758645]  btrfs_search_slot+0x260/0x9b0
[1417839.768054]  btrfs_lookup_file_extent+0x4a/0x70
[1417839.778427]  btrfs_get_extent+0x15f/0x830
[1417839.787665]  ? submit_extent_page+0xc4/0x1c0
[1417839.797474]  ? __do_readpage+0x299/0x7a0
[1417839.806515]  __do_readpage+0x33b/0x7a0
[1417839.815171]  ? btrfs_releasepage+0x70/0x70
[1417839.824597]  extent_readpages+0x28f/0x400
[1417839.833836]  read_pages+0x6a/0x1c0
[1417839.841729]  ? startup_64+0x2/0x30
[1417839.849624]  __do_page_cache_readahead+0x13c/0x1a0
[1417839.860590]  filemap_fault+0x6c7/0x990
[1417839.869252]  ? xas_load+0x8/0x80
[1417839.876756]  ? xas_find+0x150/0x190
[1417839.884839]  ? filemap_map_pages+0x295/0x3b0
[1417839.894652]  __do_fault+0x32/0x110
[1417839.902540]  __handle_mm_fault+0xacd/0x1000
[1417839.912156]  handle_mm_fault+0xaa/0x1c0
[1417839.921004]  __do_page_fault+0x242/0x4b0
[1417839.930044]  ? page_fault+0x8/0x30
[1417839.937933]  page_fault+0x1e/0x30
[1417839.945631] RIP: 0033:0x33c4bae
[1417839.952927] Code: Bad RIP value.
[1417839.960411] RSP: 002b:00007f5a844f7350 EFLAGS: 00010206
[1417839.972331] RAX: 000000000000006e RBX: 1614b3ff6a50398a RCX: 0000000000000000
[1417839.988477] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000002
[1417840.004626] RBP: 00007f5a844f7420 R08: 000000000000006e R09: 00007f5a94aeccb8
[1417840.020784] R10: 00007f5a844f7350 R11: 0000000000000000 R12: 00007f5a94aecc79
[1417840.036932] R13: 00007f5a94aecc78 R14: 00007f5a94aecc90 R15: 00007f5a94aecc40

CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-07-16 08:17:27 +02:00
..
9p 9p: avoid attaching writeback_fid on mmap with type PRIVATE 2019-10-11 18:21:13 +02:00
adfs fs/adfs: super: fix use-after-free bug 2019-08-06 19:06:49 +02:00
affs affs: fix a memory leak in affs_remount 2020-01-27 14:51:21 +01:00
afs afs: afs_write_end() should change i_size under the right lock 2020-06-25 15:33:06 +02:00
autofs autofs: fix a leak in autofs_expire_indirect() 2019-12-13 08:51:01 +01:00
befs
bfs bfs: add sanity check at bfs_fill_super() 2018-12-01 09:37:27 +01:00
btrfs btrfs: fix fatal extent_buffer readahead vs releasepage race 2020-07-16 08:17:27 +02:00
cachefiles cachefiles: Fix race between read_waiter and read_copier involving op->to_do 2020-06-03 08:19:29 +02:00
ceph ceph: fix double unlock in handle_cap_export() 2020-05-27 17:37:34 +02:00
cifs cifs: update ctime and mtime during truncate 2020-07-16 08:17:23 +02:00
coda coda: add error handling for fget 2019-08-06 19:06:51 +02:00
configfs configfs: fix config_item refcnt leak in configfs_rmdir() 2020-05-27 17:37:32 +02:00
cramfs Cramfs: fix abad comparison when wrap-arounds occur 2018-11-13 11:08:55 -08:00
crypto fscrypt: clean up some BUG_ON()s in block encryption/decryption 2019-07-26 09:14:02 +02:00
debugfs debugfs: fix use-after-free on symlink traversal 2019-05-08 07:21:48 +02:00
devpts fs/devpts: always delete dcache dentry-s in dput() 2019-03-23 20:09:59 +01:00
dlm dlm: remove BUG() before panic() 2020-06-25 15:32:56 +02:00
ecryptfs ecryptfs: replace BUG_ON with error handling code 2020-02-28 16:38:59 +01:00
efivarfs
efs
exofs exofs_mount(): fix leaks on failure exits 2019-12-05 09:20:32 +01:00
exportfs exportfs: fix 'passing zero to ERR_PTR()' warning 2020-01-27 14:50:02 +01:00
ext2 ext2: fix debug reference to ext2_xattr_cache 2020-04-23 10:30:21 +02:00
ext4 ext4: avoid race conditions when remounting with options that change dax 2020-06-25 15:33:07 +02:00
f2fs f2fs: report delalloc reserve as non-free in statfs for project quota 2020-06-25 15:32:49 +02:00
fat fat: don't allow to mount if the FAT length == 0 2020-06-22 09:05:08 +02:00
freevxfs
fscache fscache: fix race between enablement and dropping of object 2018-12-17 09:24:40 +01:00
fuse fuse: verify attributes 2019-12-13 08:52:36 +01:00
gfs2 gfs2: fix use-after-free on transaction ail lists 2020-06-25 15:33:03 +02:00
hfs fs/hfs/extent.c: fix array out of bounds read of array extent 2019-12-01 09:17:10 +01:00
hfsplus hfsplus: fix crash and filesystem corruption when deleting files 2020-04-17 10:48:52 +02:00
hostfs vfs: discard ATTR_ATTR_FLAG 2018-08-17 16:20:28 -07:00
hpfs hpfs: remove unnecessary checks on the value of r when assigning error code 2018-08-25 12:42:33 -07:00
hugetlbfs hugetlb: use same fault hash key for shared and private mappings 2019-05-22 07:37:40 +02:00
isofs isofs: reject hardware sector size > 2048 bytes 2018-08-21 11:37:41 +02:00
jbd2 jbd2: improve comments about freeing data buffers whose page mapping is NULL 2020-04-21 09:03:06 +02:00
jffs2 jffs2: fix use-after-free on symlink traversal 2019-05-08 07:21:48 +02:00
jfs jfs: fix bogus variable self-initialization 2020-01-27 14:50:33 +01:00
kernfs kernfs: fix ino wrap-around detection 2019-12-13 08:52:43 +01:00
lockd lockd: fix decoding of TEST results 2019-12-13 08:51:59 +01:00
minix
nfs NFSv4 fix CLOSE not waiting for direct IO compeletion 2020-06-30 23:17:18 -04:00
nfs_common
nfsd nfsd: apply umask on fs without ACL support 2020-07-09 09:37:11 +02:00
nilfs2 nilfs2: fix null pointer dereference at nilfs_segctor_do_construct() 2020-06-22 09:05:03 +02:00
nls
notify fanotify: fix ignore mask logic for events on child and on dir 2020-06-30 23:17:00 -04:00
ntfs ntfs: mft: remove VLA usage 2018-08-17 16:20:27 -07:00
ocfs2 ocfs2: fix panic on nfs server over ocfs2 2020-06-30 23:17:17 -04:00
omfs
openpromfs
orangefs help_next should increase position index 2020-02-24 08:34:53 +01:00
overlayfs ovl: initialize error in ovl_copy_xattr 2020-06-22 09:05:06 +02:00
proc proc: Use new_inode not new_inode_pseudo 2020-06-22 09:05:06 +02:00
pstore pstore: pstore_ftrace_seq_next should increase position index 2020-04-17 10:48:47 +02:00
qnx4
qnx6
quota fs: avoid softlockups in s_inodes iterators 2020-01-12 12:17:20 +01:00
ramfs
reiserfs reiserfs: prevent NULL pointer dereference in reiserfs_insert_item() 2020-02-24 08:34:52 +01:00
romfs
squashfs
sysfs Driver core patches for 4.19-rc1 2018-08-18 11:44:53 -07:00
sysv sysv: return 'err' instead of 0 in __sysv_write_inode 2018-12-17 09:24:30 +01:00
tracefs
ubifs ubifs: remove broken lazytime support 2020-05-27 17:37:30 +02:00
udf udf: Fix free space reporting for metadata and virtual partitions 2020-02-24 08:34:45 +01:00
ufs ufs: fix braino in ufs_get_inode_gid() for solaris UFS flavour 2019-05-25 18:23:46 +02:00
xfs xfs: add agf freeblocks verify in xfs_agf_verify 2020-06-30 23:17:19 -04:00
aio.c aio: fix async fsync creds 2020-06-22 09:05:01 +02:00
anon_inodes.c
attr.c
bad_inode.c
binfmt_aout.c
binfmt_elf_fdpic.c
binfmt_elf.c fs/binfmt_elf.c: allocate initialized memory in fill_thread_core_info() 2020-06-03 08:19:41 +02:00
binfmt_em86.c
binfmt_flat.c fs/binfmt_flat.c: make load_flat_shared_library() work 2019-07-03 13:14:44 +02:00
binfmt_misc.c
binfmt_script.c exec: load_script: Do not exec truncated interpreter path 2019-11-06 13:05:37 +01:00
block_dev.c block: Fix use-after-free in blkdev_get() 2020-06-25 15:33:06 +02:00
buffer.c ext4: use non-movable memory for superblock readahead 2020-04-23 10:30:12 +02:00
char_dev.c chardev: Avoid potential use-after-free in 'chrdev_open()' 2020-01-14 20:06:57 +01:00
compat_binfmt_elf.c
compat_ioctl.c fix compat handling of FICLONERANGE, FIDEDUPERANGE and FS_IOC_FIEMAP 2020-01-09 10:19:07 +01:00
compat.c
coredump.c coredump: fix crash when umh is disabled 2020-05-14 07:57:21 +02:00
d_path.c
dax.c dax: pass NOWAIT flag to iomap_apply 2020-03-05 16:42:12 +01:00
dcache.c dcache: sort the freeing-without-RCU-delay mess for good. 2019-05-25 18:23:26 +02:00
dcookies.c
direct-io.c direct-io: allow direct writes to empty inodes 2019-03-05 17:58:50 +01:00
drop_caches.c fs: avoid softlockups in s_inodes iterators 2020-01-12 12:17:20 +01:00
eventfd.c eventfd: track eventfd_signal() recursion depth 2020-02-11 04:34:08 -08:00
eventpoll.c fs/epoll: drop ovflist branch prediction 2019-02-12 19:47:19 +01:00
exec.c exec: Move would_dump into flush_old_exec 2020-05-20 08:18:50 +02:00
fcntl.c signal: Don't send signals to tasks that don't exist 2018-08-15 23:03:20 -05:00
fhandle.c
file_table.c overlayfs update for 4.19 2018-08-21 18:19:09 -07:00
file.c fix multiplication overflow in copy_fdtable() 2020-05-27 17:37:29 +02:00
filesystems.c fs/filesystems.c: downgrade user-reachable WARN_ONCE() to pr_warn_once() 2020-04-17 10:48:51 +02:00
fs_pin.c
fs_struct.c
fs-writeback.c cgroup, blkcg: Prepare some symbols for module and !CONFIG_CGROUP usages 2020-06-22 09:05:03 +02:00
inode.c futex: Fix inode life-time issue 2020-03-25 08:06:14 +01:00
internal.h acct_on(): don't mess with freeze protection 2019-05-31 06:46:05 -07:00
ioctl.c vfs: fix FIGETBSZ ioctl on an overlayfs file 2018-11-21 09:19:14 +01:00
iomap.c iomap: partially revert 4721a60109 (simulated directio short read on EFAULT) 2019-12-13 08:52:56 +01:00
Kconfig
Kconfig.binfmt
libfs.c libfs: fix infoleak in simple_attr_read() 2020-04-02 15:28:21 +02:00
locks.c locks: print unsigned ino in /proc/locks 2020-01-09 10:19:00 +01:00
Makefile
mbcache.c
mount.h
mpage.c mpage: mpage_readpages() should submit IO as read-ahead 2018-08-17 16:20:29 -07:00
namei.c namei: only return -ECHILD from follow_dotdot_rcu() 2020-03-05 16:42:20 +01:00
namespace.c fs/namespace.c: fix mountpoint reference counter race 2020-04-29 16:31:26 +02:00
no-block.c
nsfs.c dcache: sort the freeing-without-RCU-delay mess for good. 2019-05-25 18:23:26 +02:00
open.c cifs_atomic_open(): fix double-put on late allocation failure 2020-03-18 07:14:21 +01:00
pipe.c fs: prevent page refcount overflow in pipe_buf_get 2019-05-04 09:20:11 +02:00
pnode.c propagate_one(): mnt_set_mountpoint() needs mount_lock 2020-05-02 17:26:01 +02:00
pnode.h
posix_acl.c
proc_namespace.c
read_write.c vfs: avoid problematic remapping requests into partial EOF block 2019-12-01 09:17:04 +01:00
readdir.c filldir[64]: remove WARN_ON_ONCE() for bad directory entries 2020-01-04 19:13:26 +01:00
select.c
seq_file.c seq_file: fix problem when seeking mid-record 2019-08-25 10:47:43 +02:00
signalfd.c
splice.c splice: only read in as much information as there is pipe buffer space 2019-12-17 20:35:43 +01:00
stack.c
stat.c
statfs.c vfs: Fix EOVERFLOW testing in put_compat_statfs64 2019-10-11 18:21:39 +02:00
super.c Merge branch 'ida-4.19' of git://git.infradead.org/users/willy/linux-dax 2018-08-26 11:48:42 -07:00
sync.c
timerfd.c Merge branch 'work.aio' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2018-08-13 20:56:23 -07:00
userfaultfd.c userfaultfd: require CAP_SYS_PTRACE for UFFD_FEATURE_EVENT_FORK 2020-01-04 19:13:18 +01:00
utimes.c
xattr.c sysfs: Do not return POSIX ACL xattrs via listxattr 2018-09-18 07:30:48 -04:00