linux/fs
Omar Sandoval 3e6f71144a btrfs: don't prematurely free work in run_ordered_work()
[ Upstream commit c495dcd6fb ]

We hit the following very strange deadlock on a system with Btrfs on a
loop device backed by another Btrfs filesystem:

1. The top (loop device) filesystem queues an async_cow work item from
   cow_file_range_async(). We'll call this work X.
2. Worker thread A starts work X (normal_work_helper()).
3. Worker thread A executes the ordered work for the top filesystem
   (run_ordered_work()).
4. Worker thread A finishes the ordered work for work X and frees X
   (work->ordered_free()).
5. Worker thread A executes another ordered work and gets blocked on I/O
   to the bottom filesystem (still in run_ordered_work()).
6. Meanwhile, the bottom filesystem allocates and queues an async_cow
   work item which happens to be the recently-freed X.
7. The workqueue code sees that X is already being executed by worker
   thread A, so it schedules X to be executed _after_ worker thread A
   finishes (see the find_worker_executing_work() call in
   process_one_work()).

Now, the top filesystem is waiting for I/O on the bottom filesystem, but
the bottom filesystem is waiting for the top filesystem to finish, so we
deadlock.

This happens because we are breaking the workqueue assumption that a
work item cannot be recycled while it still depends on other work. Fix
it by waiting to free the work item until we are done with all of the
related ordered work.

P.S.:

One might ask why the workqueue code doesn't try to detect a recycled
work item. It actually does try by checking whether the work item has
the same work function (find_worker_executing_work()), but in our case
the function is the same. This is the only key that the workqueue code
has available to compare, short of adding an additional, layer-violating
"custom key". Considering that we're the only ones that have ever hit
this, we should just play by the rules.

Unfortunately, we haven't been able to create a minimal reproducer other
than our full container setup using a compress-force=zstd filesystem on
top of another compress-force=zstd filesystem.

Suggested-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-12-31 16:36:05 +01:00
..
9p 9p: avoid attaching writeback_fid on mmap with type PRIVATE 2019-10-11 18:21:13 +02:00
adfs fs/adfs: super: fix use-after-free bug 2019-08-06 19:06:49 +02:00
affs
afs afs: Fix leak in afs_lookup_cell_rcu() 2019-09-10 10:33:52 +01:00
autofs autofs: fix a leak in autofs_expire_indirect() 2019-12-13 08:51:01 +01:00
befs
bfs bfs: add sanity check at bfs_fill_super() 2018-12-01 09:37:27 +01:00
btrfs btrfs: don't prematurely free work in run_ordered_work() 2019-12-31 16:36:05 +01:00
cachefiles fscache, cachefiles: remove redundant variable 'cache' 2018-12-17 09:24:40 +01:00
ceph ceph: return -EINVAL if given fsc mount option on kernel w/o support 2019-12-05 09:19:45 +01:00
cifs CIFS: Close open handle after interrupted close 2019-12-21 10:57:35 +01:00
coda coda: add error handling for fget 2019-08-06 19:06:51 +02:00
configfs configfs: fix a deadlock in configfs_symlink() 2019-11-12 19:20:47 +01:00
cramfs Cramfs: fix abad comparison when wrap-arounds occur 2018-11-13 11:08:55 -08:00
crypto fscrypt: clean up some BUG_ON()s in block encryption/decryption 2019-07-26 09:14:02 +02:00
debugfs debugfs: fix use-after-free on symlink traversal 2019-05-08 07:21:48 +02:00
devpts fs/devpts: always delete dcache dentry-s in dput() 2019-03-23 20:09:59 +01:00
dlm dlm: fix invalid cluster name warning 2019-12-13 08:52:23 +01:00
ecryptfs ecryptfs_lookup_interpose(): lower_dentry->d_parent is not stable either 2019-11-20 18:45:18 +01:00
efivarfs efivars: Call guid_parse() against guid_t type of variable 2018-07-22 14:13:44 +02:00
efs
exofs exofs_mount(): fix leaks on failure exits 2019-12-05 09:20:32 +01:00
exportfs exportfs_decode_fh(): negative pinned may become positive without the parent locked 2019-12-13 08:51:02 +01:00
ext2 ext2: check err when partial != NULL 2019-12-17 20:35:18 +01:00
ext4 ext4: fix a bug in ext4_wait_for_tail_page_commit 2019-12-17 20:35:44 +01:00
f2fs f2fs: fix to allow node segment for GC by ioctl path 2019-12-13 08:51:51 +01:00
fat fat: work around race with userspace's read via blockdev while mounting 2019-10-07 18:57:14 +02:00
freevxfs
fscache fscache: fix race between enablement and dropping of object 2018-12-17 09:24:40 +01:00
fuse fuse: verify attributes 2019-12-13 08:52:36 +01:00
gfs2 gfs2: fix glock reference problem in gfs2_trans_remove_revoke 2019-12-17 20:35:55 +01:00
hfs fs/hfs/extent.c: fix array out of bounds read of array extent 2019-12-01 09:17:10 +01:00
hfsplus hfsplus: update timestamps on truncate() 2019-12-01 09:17:09 +01:00
hostfs vfs: discard ATTR_ATTR_FLAG 2018-08-17 16:20:28 -07:00
hpfs hpfs: remove unnecessary checks on the value of r when assigning error code 2018-08-25 12:42:33 -07:00
hugetlbfs hugetlb: use same fault hash key for shared and private mappings 2019-05-22 07:37:40 +02:00
isofs isofs: reject hardware sector size > 2048 bytes 2018-08-21 11:37:41 +02:00
jbd2 jbd2: introduce jbd2_inode dirty range scoping 2019-07-28 08:29:29 +02:00
jffs2 jffs2: fix use-after-free on symlink traversal 2019-05-08 07:21:48 +02:00
jfs Just one jfs patch for 4.19 2018-08-15 22:47:23 -07:00
kernfs kernfs: fix ino wrap-around detection 2019-12-13 08:52:43 +01:00
lockd lockd: fix decoding of TEST results 2019-12-13 08:51:59 +01:00
minix
nfs NFSv4.x: fix lock recovery during delegation recall 2019-11-24 08:20:24 +01:00
nfs_common
nfsd nfsd: Return EPERM, not EACCES, in some SETATTR cases 2019-12-13 08:52:26 +01:00
nilfs2 nilfs2: convert to SPDX license tags 2018-09-04 16:45:02 -07:00
nls
notify memcg, fsnotify: no oom-kill for remote memcg charging 2019-07-31 07:27:08 +02:00
ntfs ntfs: mft: remove VLA usage 2018-08-17 16:20:27 -07:00
ocfs2 quota: Check that quota is not dirty before release 2019-12-17 20:35:17 +01:00
omfs
openpromfs
orangefs orangefs: rate limit the client not running info message 2019-11-24 08:20:57 +01:00
overlayfs ovl: relax WARN_ON() on rename to self 2019-12-17 20:34:51 +01:00
proc mm, thp, proc: report THP eligibility for each vma 2019-12-17 20:35:45 +01:00
pstore pstore/ram: Avoid NULL deref in ftrace merging failure path 2019-12-13 08:52:24 +01:00
qnx4
qnx6
quota quota: fix livelock in dquot_writeback_dquots 2019-12-17 20:35:19 +01:00
ramfs
reiserfs reiserfs: fix extended attributes on the root directory 2019-12-17 20:35:20 +01:00
romfs
squashfs Squashfs: Compute expected length from inode size rather than block length 2018-08-02 09:34:02 -07:00
sysfs Driver core patches for 4.19-rc1 2018-08-18 11:44:53 -07:00
sysv sysv: return 'err' instead of 0 in __sysv_write_inode 2018-12-17 09:24:30 +01:00
tracefs tracefs: Annotate tracefs_ops with __ro_after_init 2018-07-31 11:32:44 -04:00
ubifs ubifs: Fix default compression selection in ubifs 2019-12-05 09:20:09 +01:00
udf udf: Fix crash during mount 2019-11-20 18:46:04 +01:00
ufs ufs: fix braino in ufs_get_inode_gid() for solaris UFS flavour 2019-05-25 18:23:46 +02:00
xfs xfs: add missing error check in xfs_prepare_shift() 2019-12-13 08:52:57 +01:00
aio.c Fix aio_poll() races 2019-05-02 09:58:59 +02:00
anon_inodes.c anon_inode_getfile(): switch to alloc_file_pseudo() 2018-07-12 10:04:27 -04:00
attr.c
bad_inode.c get rid of 'opened' argument of ->atomic_open() - part 3 2018-07-12 10:04:20 -04:00
binfmt_aout.c
binfmt_elf_fdpic.c
binfmt_elf.c binfmt_elf: Do not move brk for INTERP-less ET_EXEC 2019-10-05 13:10:06 +02:00
binfmt_em86.c
binfmt_flat.c fs/binfmt_flat.c: make load_flat_shared_library() work 2019-07-03 13:14:44 +02:00
binfmt_misc.c
binfmt_script.c exec: load_script: Do not exec truncated interpreter path 2019-11-06 13:05:37 +01:00
block_dev.c block: fix the return errno for direct IO 2019-04-17 08:38:52 +02:00
buffer.c fs: fix guard_bio_eod to check for real EOD errors 2019-04-05 22:33:00 +02:00
char_dev.c chardev: add additional check for minor range overlap 2019-05-31 06:46:27 -07:00
compat_binfmt_elf.c
compat_ioctl.c media: dvb: fix compat ioctl translation 2019-11-20 18:46:33 +01:00
compat.c
coredump.c
d_path.c
dax.c dax: dax_layout_busy_page() should not unmap cow pages 2019-08-16 10:12:52 +02:00
dcache.c dcache: sort the freeing-without-RCU-delay mess for good. 2019-05-25 18:23:26 +02:00
dcookies.c
direct-io.c direct-io: allow direct writes to empty inodes 2019-03-05 17:58:50 +01:00
drop_caches.c fs/drop_caches.c: avoid softlockups in drop_pagecache_sb() 2019-03-13 14:02:32 -07:00
eventfd.c
eventpoll.c fs/epoll: drop ovflist branch prediction 2019-02-12 19:47:19 +01:00
exec.c sched/fair: Don't free p->numa_faults with concurrent readers 2019-08-04 09:30:56 +02:00
fcntl.c signal: Don't send signals to tasks that don't exist 2018-08-15 23:03:20 -05:00
fhandle.c
file_table.c overlayfs update for 4.19 2018-08-21 18:19:09 -07:00
file.c fs/file.c: initialize init_files.resize_wait 2019-04-05 22:32:59 +02:00
filesystems.c
fs_pin.c
fs_struct.c
fs-writeback.c cgroup,writeback: don't switch wbs immediately on dead wbs if the memcg is dead 2019-11-12 19:21:20 +01:00
inode.c Abort file_remove_privs() for non-reg. files 2019-06-22 08:15:21 +02:00
internal.h acct_on(): don't mess with freeze protection 2019-05-31 06:46:05 -07:00
ioctl.c vfs: fix FIGETBSZ ioctl on an overlayfs file 2018-11-21 09:19:14 +01:00
iomap.c iomap: partially revert 4721a60109 (simulated directio short read on EFAULT) 2019-12-13 08:52:56 +01:00
Kconfig
Kconfig.binfmt kconfig: move the "Executable file formats" menu to fs/Kconfig.binfmt 2018-08-02 08:06:55 +09:00
libfs.c Fix the locking in dcache_readdir() and friends 2019-10-17 13:45:35 -07:00
locks.c overlayfs update for 4.19 2018-08-21 18:19:09 -07:00
Makefile
mbcache.c
mount.h
mpage.c mpage: mpage_readpages() should submit IO as read-ahead 2018-08-17 16:20:29 -07:00
namei.c Revert "vfs: Allow userns root to call mknod on owned filesystems." 2018-12-29 13:37:54 +01:00
namespace.c mnt: fix __detach_mounts infinite loop 2018-11-21 09:19:22 +01:00
no-block.c
nsfs.c dcache: sort the freeing-without-RCU-delay mess for good. 2019-05-25 18:23:26 +02:00
open.c access: avoid the RCU grace period for the temporary subjective credentials 2019-07-31 07:27:11 +02:00
pipe.c fs: prevent page refcount overflow in pipe_buf_get 2019-05-04 09:20:11 +02:00
pnode.c
pnode.h
posix_acl.c
proc_namespace.c
read_write.c vfs: avoid problematic remapping requests into partial EOF block 2019-12-01 09:17:04 +01:00
readdir.c
select.c
seq_file.c seq_file: fix problem when seeking mid-record 2019-08-25 10:47:43 +02:00
signalfd.c
splice.c splice: only read in as much information as there is pipe buffer space 2019-12-17 20:35:43 +01:00
stack.c
stat.c
statfs.c vfs: Fix EOVERFLOW testing in put_compat_statfs64 2019-10-11 18:21:39 +02:00
super.c Merge branch 'ida-4.19' of git://git.infradead.org/users/willy/linux-dax 2018-08-26 11:48:42 -07:00
sync.c
timerfd.c Merge branch 'work.aio' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2018-08-13 20:56:23 -07:00
userfaultfd.c userfaultfd_release: always remove uffd flags and clear vm_userfaultfd_ctx 2019-08-29 08:28:52 +02:00
utimes.c
xattr.c sysfs: Do not return POSIX ACL xattrs via listxattr 2018-09-18 07:30:48 -04:00