linux/fs
Eric Ren 6bbb8ff3b6 ocfs2: fix crash caused by stale lvb with fsdlm plugin
commit e7ee2c089e upstream.

The crash happens rather often when we reset some cluster nodes while
nodes contend fiercely to do truncate and append.

The crash backtrace is below:

   dlm: C21CBDA5E0774F4BA5A9D4F317717495: dlm_recover_grant 1 locks on 971 resources
   dlm: C21CBDA5E0774F4BA5A9D4F317717495: dlm_recover 9 generation 5 done: 4 ms
   ocfs2: Begin replay journal (node 318952601, slot 2) on device (253,18)
   ocfs2: End replay journal (node 318952601, slot 2) on device (253,18)
   ocfs2: Beginning quota recovery on device (253,18) for slot 2
   ocfs2: Finishing quota recovery on device (253,18) for slot 2
   (truncate,30154,1):ocfs2_truncate_file:470 ERROR: bug expression: le64_to_cpu(fe->i_size) != i_size_read(inode)
   (truncate,30154,1):ocfs2_truncate_file:470 ERROR: Inode 290321, inode i_size = 732 != di i_size = 937, i_flags = 0x1
   ------------[ cut here ]------------
   kernel BUG at /usr/src/linux/fs/ocfs2/file.c:470!
   invalid opcode: 0000 [#1] SMP
   Modules linked in: ocfs2_stack_user(OEN) ocfs2(OEN) ocfs2_nodemanager ocfs2_stackglue(OEN) quota_tree dlm(OEN) configfs fuse sd_mod    iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi af_packet iscsi_ibft iscsi_boot_sysfs softdog xfs libcrc32c ppdev parport_pc pcspkr parport      joydev virtio_balloon virtio_net i2c_piix4 acpi_cpufreq button processor ext4 crc16 jbd2 mbcache ata_generic cirrus virtio_blk ata_piix               drm_kms_helper ahci syscopyarea libahci sysfillrect sysimgblt fb_sys_fops ttm floppy libata drm virtio_pci virtio_ring uhci_hcd virtio ehci_hcd       usbcore serio_raw usb_common sg dm_multipath dm_mod scsi_dh_rdac scsi_dh_emc scsi_dh_alua scsi_mod autofs4
   Supported: No, Unsupported modules are loaded
   CPU: 1 PID: 30154 Comm: truncate Tainted: G           OE   N  4.4.21-69-default #1
   Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.1-0-g4adadbd-20151112_172657-sheep25 04/01/2014
   task: ffff88004ff6d240 ti: ffff880074e68000 task.ti: ffff880074e68000
   RIP: 0010:[<ffffffffa05c8c30>]  [<ffffffffa05c8c30>] ocfs2_truncate_file+0x640/0x6c0 [ocfs2]
   RSP: 0018:ffff880074e6bd50  EFLAGS: 00010282
   RAX: 0000000000000074 RBX: 000000000000029e RCX: 0000000000000000
   RDX: 0000000000000001 RSI: 0000000000000246 RDI: 0000000000000246
   RBP: ffff880074e6bda8 R08: 000000003675dc7a R09: ffffffff82013414
   R10: 0000000000034c50 R11: 0000000000000000 R12: ffff88003aab3448
   R13: 00000000000002dc R14: 0000000000046e11 R15: 0000000000000020
   FS:  00007f839f965700(0000) GS:ffff88007fc80000(0000) knlGS:0000000000000000
   CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
   CR2: 00007f839f97e000 CR3: 0000000036723000 CR4: 00000000000006e0
   Call Trace:
     ocfs2_setattr+0x698/0xa90 [ocfs2]
     notify_change+0x1ae/0x380
     do_truncate+0x5e/0x90
     do_sys_ftruncate.constprop.11+0x108/0x160
     entry_SYSCALL_64_fastpath+0x12/0x6d
   Code: 24 28 ba d6 01 00 00 48 c7 c6 30 43 62 a0 8b 41 2c 89 44 24 08 48 8b 41 20 48 c7 c1 78 a3 62 a0 48 89 04 24 31 c0 e8 a0 97 f9 ff <0f> 0b 3d 00 fe ff ff 0f 84 ab fd ff ff 83 f8 fc 0f 84 a2 fd ff
   RIP  [<ffffffffa05c8c30>] ocfs2_truncate_file+0x640/0x6c0 [ocfs2]

It's because ocfs2_inode_lock() get us stale LVB in which the i_size is
not equal to the disk i_size.  We mistakenly trust the LVB because the
underlaying fsdlm dlm_lock() doesn't set lkb_sbflags with
DLM_SBF_VALNOTVALID properly for us.  But, why?

The current code tries to downconvert lock without DLM_LKF_VALBLK flag
to tell o2cb don't update RSB's LVB if it's a PR->NULL conversion, even
if the lock resource type needs LVB.  This is not the right way for
fsdlm.

The fsdlm plugin behaves different on DLM_LKF_VALBLK, it depends on
DLM_LKF_VALBLK to decide if we care about the LVB in the LKB.  If
DLM_LKF_VALBLK is not set, fsdlm will skip recovering RSB's LVB from
this lkb and set the right DLM_SBF_VALNOTVALID appropriately when node
failure happens.

The following diagram briefly illustrates how this crash happens:

RSB1 is inode metadata lock resource with LOCK_TYPE_USES_LVB;

The 1st round:

             Node1                                    Node2
RSB1: PR
                                                  RSB1(master): NULL->EX
ocfs2_downconvert_lock(PR->NULL, set_lvb==0)
  ocfs2_dlm_lock(no DLM_LKF_VALBLK)

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

dlm_lock(no DLM_LKF_VALBLK)
  convert_lock(overwrite lkb->lkb_exflags
               with no DLM_LKF_VALBLK)

RSB1: NULL                                        RSB1: EX
                                                  reset Node2
dlm_recover_rsbs()
  recover_lvb()

/* The LVB is not trustable if the node with EX fails and
 * no lock >= PR is left. We should set RSB_VALNOTVALID for RSB1.
 */

 if(!(kb_exflags & DLM_LKF_VALBLK)) /* This means we miss the chance to
           return;                   * to invalid the LVB here.
                                     */

The 2nd round:

         Node 1                                Node2
RSB1(become master from recovery)

ocfs2_setattr()
  ocfs2_inode_lock(NULL->EX)
    /* dlm_lock() return the stale lvb without setting DLM_SBF_VALNOTVALID */
    ocfs2_meta_lvb_is_trustable() return 1 /* so we don't refresh inode from disk */
  ocfs2_truncate_file()
      mlog_bug_on_msg(disk isize != i_size_read(inode))  /* crash! */

The fix is quite straightforward.  We keep to set DLM_LKF_VALBLK flag
for dlm_lock() if the lock resource type needs LVB and the fsdlm plugin
is uesed.

Link: http://lkml.kernel.org/r/1481275846-6604-1-git-send-email-zren@suse.com
Signed-off-by: Eric Ren <zren@suse.com>
Reviewed-by: Joseph Qi <jiangqi903@gmail.com>
Cc: Mark Fasheh <mfasheh@versity.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-01-19 20:17:19 +01:00
..
9p posix_acl: Clear SGID bit when setting file permissions 2016-10-31 04:13:58 -06:00
adfs
affs affs: fix remount failure when there are no options changed 2016-06-07 18:14:32 -07:00
afs
autofs4 autofs: use dentry flags to block walks during expire 2016-09-30 10:18:37 +02:00
befs
bfs
btrfs btrfs: make file clone aware of fatal signals 2017-01-06 11:16:11 +01:00
cachefiles FS-Cache: Add missing initialization of ret in cachefiles_write_page() 2015-11-16 20:38:43 -05:00
ceph posix_acl: Clear SGID bit when setting file permissions 2016-10-31 04:13:58 -06:00
cifs CIFS: Fix a possible memory corruption in push locks 2017-01-06 11:16:16 +01:00
coda fs/coda: fix readlink buffer overflow 2015-09-10 13:29:01 -07:00
configfs configfs: allow dynamic group creation 2015-11-20 16:17:32 -08:00
cramfs
debugfs debugfs: Make automount point inodes permanently empty 2016-05-04 14:48:41 -07:00
devpts devpts: clean up interface to pty drivers 2016-08-16 09:30:49 +02:00
dlm dlm: free workqueues after the connections 2016-10-22 12:26:56 +02:00
ecryptfs ecryptfs: fix handling of directory opening 2016-09-15 08:27:47 +02:00
efivarfs efi: Make efivarfs entries immutable by default 2016-03-03 15:07:09 -08:00
efs
exofs osd fs: __r4w_get_page rely on PageUptodate for uptodate 2015-12-12 10:15:34 -08:00
exportfs
ext2 posix_acl: Clear SGID bit when setting file permissions 2016-10-31 04:13:58 -06:00
ext4 ext4: do not perform data journaling when data is encrypted 2017-01-06 11:16:13 +01:00
f2fs f2fs: set ->owner for debugfs status file's file_operations 2017-01-06 11:16:13 +01:00
fat fat: fix fake_offset handling on error path 2015-11-20 16:17:32 -08:00
freevxfs
fscache FS-Cache: Handle a write to the page immediately beyond the EOF marker 2015-11-11 02:11:02 -05:00
fuse fuse: fix fuse_write_end() if zero bytes were copied 2016-11-26 09:54:52 +01:00
gfs2 posix_acl: Clear SGID bit when setting file permissions 2016-10-31 04:13:58 -06:00
hfs hfs: fix B-tree corruption after insertion at position 0 2015-09-10 13:29:01 -07:00
hfsplus posix_acl: Clear SGID bit when setting file permissions 2016-10-31 04:13:58 -06:00
hostfs hostfs: Freeing an ERR_PTR in hostfs_fill_sb_common() 2016-09-30 10:18:39 +02:00
hpfs hpfs: implement the show_options method 2016-06-01 12:15:54 -07:00
hugetlbfs fs/hugetlbfs/inode.c: fix bugs in hugetlb_vmtruncate_list() 2016-02-25 12:01:22 -08:00
isofs isofs: Do not return EACCES for unknown filesystems 2016-10-28 03:01:34 -04:00
jbd2 jbd2: fix incorrect unlock on j_list_lock 2016-10-28 03:01:35 -04:00
jffs2 posix_acl: Clear SGID bit when setting file permissions 2016-10-31 04:13:58 -06:00
jfs posix_acl: Clear SGID bit when setting file permissions 2016-10-31 04:13:58 -06:00
kernfs kernfs: don't depend on d_find_any_alias() when generating notifications 2016-09-24 10:07:36 +02:00
lockd Mainly smaller bugfixes and cleanup. We're still finding some bugs from 2015-11-11 20:11:28 -08:00
logfs mm, fs: introduce mapping_gfp_constraint() 2015-11-06 17:50:42 -08:00
minix
ncpfs ncpfs: fix a braino in OOM handling in ncp_fill_cache() 2016-03-16 08:42:59 -07:00
nfs nfs_write_end(): fix handling of short copies 2017-01-09 08:07:52 +01:00
nfs_common lockd: NLM grace period shouldn't block NFSv4 opens 2015-08-13 10:22:06 -04:00
nfsd nfsd: Close race between nfsd4_release_lockowner and nfsd4_lock 2016-09-24 10:07:36 +02:00
nilfs2 fs/nilfs2: fix potential underflow in call to crc32_le 2016-08-10 11:49:25 +02:00
nls
notify fanotify: fix list corruption in fanotify_get_response() 2016-09-30 10:18:37 +02:00
ntfs mm, fs: introduce mapping_gfp_constraint() 2015-11-06 17:50:42 -08:00
ocfs2 ocfs2: fix crash caused by stale lvb with fsdlm plugin 2017-01-19 20:17:19 +01:00
omfs
openpromfs
overlayfs ovl: fsync after copy-up 2016-11-10 16:36:34 +01:00
proc mm: introduce get_task_exe_file 2016-09-24 10:07:36 +02:00
pstore pstore/ram: Use memcpy_fromio() to save old buffer 2016-10-28 03:01:27 -04:00
qnx4
qnx6
quota quota: Fix possible GPF due to uninitialised pointers 2016-04-12 09:08:56 -07:00
ramfs mm, fs: obey gfp_mapping for add_to_page_cache() 2015-10-16 11:42:28 -07:00
reiserfs posix_acl: Clear SGID bit when setting file permissions 2016-10-31 04:13:58 -06:00
romfs
squashfs squashfs: xattr simplifications 2015-11-13 20:34:33 -05:00
sysfs sysfs: correctly handle read offset on PREALLOC attrs 2016-09-07 08:32:46 +02:00
sysv fix sysvfs symlinks 2015-11-23 21:11:08 -05:00
tracefs tracefs: Fix refcount imbalance in start_creating() 2015-11-04 22:13:45 -05:00
ubifs ubifs: Fix regression in ubifs_readdir() 2016-11-10 16:36:33 +01:00
udf udf: Check output buffer length when converting name to CS0 2016-02-25 12:01:18 -08:00
ufs fix ufs write vs readpage race when writing into a hole 2015-09-09 10:43:12 -07:00
xfs xfs: set AGI buffer type in xlog_recover_clear_agi_bucket 2017-01-06 11:16:17 +01:00
aio.c aio: mark AIO pseudo-fs noexec 2016-10-07 15:23:47 +02:00
anon_inodes.c
attr.c vfs: move permission checking into notify_change() for utimes(NULL) 2016-10-22 12:26:56 +02:00
bad_inode.c
binfmt_aout.c
binfmt_elf_fdpic.c libnvdimm for 4.4: 2015-11-10 12:07:22 -08:00
binfmt_elf.c Merge branch 'for-linus-2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2015-11-11 09:45:24 -08:00
binfmt_em86.c
binfmt_flat.c
binfmt_misc.c
binfmt_script.c
block_dev.c block: protect iterate_bdevs() against concurrent close 2017-01-09 08:07:47 +01:00
buffer.c vfs: remove unused wrapper block_page_mkwrite() 2015-11-11 02:19:33 -05:00
char_dev.c
compat_binfmt_elf.c
compat_ioctl.c i2c-dev: Fix typo in ioctl name reference 2015-10-23 23:26:43 +02:00
compat.c
coredump.c coredump: fix unfreezable coredumping task 2016-11-18 10:48:34 +01:00
dax.c dax: disable pmd mappings 2015-11-16 23:54:45 -08:00
dcache.c fs/dcache.c: avoid soft-lockup in dput() 2016-08-16 09:30:50 +02:00
dcookies.c
direct-io.c block: fix use-after-free in dio_bio_complete 2016-03-03 15:07:28 -08:00
drop_caches.c inode: convert inode_sb_list_lock to per-sb 2015-08-17 18:39:46 -04:00
eventfd.c
eventpoll.c
exec.c exec: Ensure mm->user_ns contains the execed files 2017-01-06 11:16:14 +01:00
fcntl.c
fhandle.c fs/coredump: prevent fsuid=0 dumps into user-controlled directories 2016-04-12 09:08:58 -07:00
file_table.c
file.c vfs: clear remainder of 'full_fds_bits' in dup_fd() 2015-11-05 23:05:32 -08:00
filesystems.c
fs_pin.c
fs_struct.c
fs-writeback.c writeback, cgroup: fix use of the wrong bdi_writeback which mismatches the inode 2016-04-12 09:09:04 -07:00
inode.c vfs: fix deadlock in file_remove_privs() on overlayfs 2016-08-10 11:49:30 +02:00
internal.h inode: rename i_wb_list to i_io_list 2015-08-17 23:38:10 -04:00
ioctl.c
Kconfig dax: disable pmd mappings 2015-11-16 23:54:45 -08:00
Kconfig.binfmt
libfs.c fs: Set the size of empty dirs to 0. 2015-08-12 15:28:45 -05:00
locks.c locks: use file_inode() 2016-08-10 11:49:27 +02:00
Makefile ext4: promote ext4 over ext2 in the default probe order 2015-10-15 10:33:21 -04:00
mbcache.c
mount.h
mpage.c mm, fs: introduce mapping_gfp_constraint() 2015-11-06 17:50:42 -08:00
namei.c fs: Check for invalid i_uid in may_follow_link() 2016-09-15 08:27:49 +02:00
namespace.c namespace: update event counter when umounting a deleted dentry 2016-08-10 11:49:27 +02:00
no-block.c
nsfs.c fs/seq_file: convert int seq_vprint/seq_printf/etc... returns to void 2015-09-11 15:21:34 -07:00
open.c vfs: add vfs_select_inode() helper 2016-05-18 17:06:48 -07:00
pipe.c pipe: limit the per-user amount of pages allocated in pipes 2016-06-07 18:14:35 -07:00
pnode.c propogate_mnt: Handle the first propogated copy being a slave 2016-05-11 11:21:19 +02:00
pnode.h
posix_acl.c posix_acl: Clear SGID bit when setting file permissions 2016-10-31 04:13:58 -06:00
proc_namespace.c vfs: show_vfsstat: do not ignore errors from show_devname method 2016-04-12 09:08:55 -07:00
read_write.c
readdir.c
select.c
seq_file.c fs/seq_file: fix out-of-bounds read 2016-09-07 08:32:43 +02:00
signalfd.c
splice.c splice: handle zero nr_pages in splice_to_pipe() 2016-04-12 09:08:55 -07:00
stack.c
stat.c fs/stat.c: remove unnecessary new_valid_dev() check 2015-11-09 15:11:24 -08:00
statfs.c
super.c fs/super.c: fix race between freeze_super() and thaw_super() 2016-10-28 03:01:32 -04:00
sync.c fs/sync.c: make sync_file_range(2) use WB_SYNC_NONE writeback 2015-11-06 17:50:42 -08:00
timerfd.c timerfd: Handle relative timers with CONFIG_TIME_LOW_RES proper 2016-02-25 12:01:25 -08:00
userfaultfd.c userfaultfd: don't block on the last VM updates at exit time 2016-03-16 08:43:01 -07:00
utimes.c vfs: move permission checking into notify_change() for utimes(NULL) 2016-10-22 12:26:56 +02:00
xattr.c 9p: xattr simplifications 2015-11-13 20:34:33 -05:00