linux/fs
Hin-Tak Leung 61cabc7d54 hfs,hfsplus: cache pages correctly between bnode_create and bnode_free
commit 7cb74be6fd upstream.

Pages looked up by __hfs_bnode_create() (called by hfs_bnode_create() and
hfs_bnode_find() for finding or creating pages corresponding to an inode)
are immediately kmap()'ed and used (both read and write) and kunmap()'ed,
and should not be page_cache_release()'ed until hfs_bnode_free().

This patch fixes a problem I first saw in July 2012: merely running "du"
on a large hfsplus-mounted directory a few times on a reasonably loaded
system would get the hfsplus driver all confused and complaining about
B-tree inconsistencies, and generates a "BUG: Bad page state".  Most
recently, I can generate this problem on up-to-date Fedora 22 with shipped
kernel 4.0.5, by running "du /" (="/" + "/home" + "/mnt" + other smaller
mounts) and "du /mnt" simultaneously on two windows, where /mnt is a
lightly-used QEMU VM image of the full Mac OS X 10.9:

$ df -i / /home /mnt
Filesystem                  Inodes   IUsed      IFree IUse% Mounted on
/dev/mapper/fedora-root    3276800  551665    2725135   17% /
/dev/mapper/fedora-home   52879360  716221   52163139    2% /home
/dev/nbd0p2             4294967295 1387818 4293579477    1% /mnt

After applying the patch, I was able to run "du /" (60+ times) and "du
/mnt" (150+ times) continuously and simultaneously for 6+ hours.

There are many reports of the hfsplus driver getting confused under load
and generating "BUG: Bad page state" or other similar issues over the
years.  [1]

The unpatched code [2] has always been wrong since it entered the kernel
tree.  The only reason why it gets away with it is that the
kmap/memcpy/kunmap follow very quickly after the page_cache_release() so
the kernel has not had a chance to reuse the memory for something else,
most of the time.

The current RW driver appears to have followed the design and development
of the earlier read-only hfsplus driver [3], where-by version 0.1 (Dec
2001) had a B-tree node-centric approach to
read_cache_page()/page_cache_release() per bnode_get()/bnode_put(),
migrating towards version 0.2 (June 2002) of caching and releasing pages
per inode extents.  When the current RW code first entered the kernel [2]
in 2005, there was an REF_PAGES conditional (and "//" commented out code)
to switch between B-node centric paging to inode-centric paging.  There
was a mistake with the direction of one of the REF_PAGES conditionals in
__hfs_bnode_create().  In a subsequent "remove debug code" commit [4], the
read_cache_page()/page_cache_release() per bnode_get()/bnode_put() were
removed, but a page_cache_release() was mistakenly left in (propagating
the "REF_PAGES <-> !REF_PAGE" mistake), and the commented-out
page_cache_release() in bnode_release() (which should be spanned by
!REF_PAGES) was never enabled.

References:
[1]:
Michael Fox, Apr 2013
http://www.spinics.net/lists/linux-fsdevel/msg63807.html
("hfsplus volume suddenly inaccessable after 'hfs: recoff %d too large'")

Sasha Levin, Feb 2015
http://lkml.org/lkml/2015/2/20/85 ("use after free")

https://bugs.launchpad.net/ubuntu/+source/linux/+bug/740814
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1027887
https://bugzilla.kernel.org/show_bug.cgi?id=42342
https://bugzilla.kernel.org/show_bug.cgi?id=63841
https://bugzilla.kernel.org/show_bug.cgi?id=78761

[2]:
http://git.kernel.org/cgit/linux/kernel/git/tglx/history.git/commit/\
fs/hfs/bnode.c?id=d1081202f1d0ee35ab0beb490da4b65d4bc763db
commit d1081202f1d0ee35ab0beb490da4b65d4bc763db
Author: Andrew Morton <akpm@osdl.org>
Date:   Wed Feb 25 16:17:36 2004 -0800

    [PATCH] HFS rewrite

http://git.kernel.org/cgit/linux/kernel/git/tglx/history.git/commit/\
fs/hfsplus/bnode.c?id=91556682e0bf004d98a529bf829d339abb98bbbd

commit 91556682e0bf004d98a529bf829d339abb98bbbd
Author: Andrew Morton <akpm@osdl.org>
Date:   Wed Feb 25 16:17:48 2004 -0800

    [PATCH] HFS+ support

[3]:
http://sourceforge.net/projects/linux-hfsplus/

http://sourceforge.net/projects/linux-hfsplus/files/Linux%202.4.x%20patch/hfsplus%200.1/
http://sourceforge.net/projects/linux-hfsplus/files/Linux%202.4.x%20patch/hfsplus%200.2/

http://linux-hfsplus.cvs.sourceforge.net/viewvc/linux-hfsplus/linux/\
fs/hfsplus/bnode.c?r1=1.4&r2=1.5

Date:   Thu Jun 6 09:45:14 2002 +0000
Use buffer cache instead of page cache in bnode.c. Cache inode extents.

[4]:
http://git.kernel.org/cgit/linux/kernel/git/\
stable/linux-stable.git/commit/?id=a5e3985fa014029eb6795664c704953720cc7f7d

commit a5e3985fa0
Author: Roman Zippel <zippel@linux-m68k.org>
Date:   Tue Sep 6 15:18:47 2005 -0700

[PATCH] hfs: remove debug code

Signed-off-by: Hin-Tak Leung <htl10@users.sourceforge.net>
Signed-off-by: Sergei Antonov <saproj@gmail.com>
Reviewed-by: Anton Altaparmakov <anton@tuxera.com>
Reported-by: Sasha Levin <sasha.levin@oracle.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Vyacheslav Dubeyko <slava@dubeyko.com>
Cc: Sougata Santra <sougata@tuxera.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-10-01 12:07:34 +02:00
..
9p 9p: don't leave a half-initialized inode sitting around 2015-08-03 09:29:47 -07:00
adfs fs: Limit sys_mount to only request filesystem modules. 2013-03-03 19:36:31 -08:00
affs move d_rcu from overlapping d_child to overlapping d_alias 2015-04-29 10:34:00 +02:00
afs aio: don't include aio.h in sched.h 2013-05-07 20:16:25 -07:00
autofs4 move d_rcu from overlapping d_child to overlapping d_alias 2015-04-29 10:34:00 +02:00
befs befs_readdir(): do not increment ->f_pos if filldir tells us to stop 2013-05-31 15:17:56 -04:00
bfs fs: Limit sys_mount to only request filesystem modules. 2013-03-03 19:36:31 -08:00
btrfs Btrfs: use kmem_cache_free when freeing entry in inode cache 2015-08-03 09:29:46 -07:00
cachefiles lift sb_start_write() out of ->write() 2013-04-09 14:12:56 -04:00
ceph move d_rcu from overlapping d_child to overlapping d_alias 2015-04-29 10:34:00 +02:00
cifs move d_rcu from overlapping d_child to overlapping d_alias 2015-04-29 10:34:00 +02:00
coda move d_rcu from overlapping d_child to overlapping d_alias 2015-04-29 10:34:00 +02:00
configfs configfs: fix race between dentry put and lookup 2013-11-29 11:11:53 -08:00
cramfs fs: Limit sys_mount to only request filesystem modules. 2013-03-03 19:36:31 -08:00
debugfs move d_rcu from overlapping d_child to overlapping d_alias 2015-04-29 10:34:00 +02:00
devpts devpts: plug the memory leak in kill_sb 2013-12-04 10:55:49 -08:00
dlm Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next 2013-05-01 14:08:52 -07:00
ecryptfs eCryptfs: Remove buggy and unnecessary write in file name decode routine 2015-01-08 09:58:17 -08:00
efivarfs efivarfs: Never return ENOENT from firmware again 2013-05-13 20:12:10 +01:00
efs fs: Limit sys_mount to only request filesystem modules. 2013-03-03 19:36:31 -08:00
exofs ore: Fix wrong math in allocation of per device BIO 2014-02-13 13:48:00 -08:00
exportfs move d_rcu from overlapping d_child to overlapping d_alias 2015-04-29 10:34:00 +02:00
ext2 ext2: Fix oops in ext2_get_block() called from ext2_quota_write() 2014-12-16 09:09:43 -08:00
ext3 ext3: Don't check quota format when there are no quota files 2014-11-14 08:48:00 -08:00
ext4 ext4: replace open coded nofail allocation in ext4_free_blocks() 2015-08-03 09:29:43 -07:00
f2fs f2fs updates for v3.10 2013-05-08 15:11:48 -07:00
fat fat: fix possible overflow for fat_clusters 2013-05-24 16:22:50 -07:00
freevxfs fs: Readd the fs module aliases. 2013-03-12 18:55:21 -07:00
fscache fs/fscache/stats.c: fix memory leak 2013-04-29 15:54:27 -07:00
fuse fuse: initialize fc->release before calling it 2015-08-03 09:29:46 -07:00
gfs2 GFS2: Increase i_writecount during gfs2_setattr_chown 2014-01-25 08:27:11 -08:00
hfs hfs,hfsplus: cache pages correctly between bnode_create and bnode_free 2015-10-01 12:07:34 +02:00
hfsplus hfs,hfsplus: cache pages correctly between bnode_create and bnode_free 2015-10-01 12:07:34 +02:00
hostfs hostfs: use kmalloc instead of kzalloc 2013-05-04 15:48:45 -04:00
hpfs hpfs: update ctime and mtime on directory modification 2015-09-21 10:00:10 -07:00
hppfs hppfs: get rid of ->fsync() 2013-04-29 15:41:42 -04:00
hugetlbfs cope with potentially long ->d_dname() output for shmem/hugetlb 2013-10-18 07:45:45 -07:00
isofs isofs: Fix unchecked printing of ER records 2015-01-08 09:58:15 -08:00
jbd Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs 2013-05-03 09:56:25 -07:00
jbd2 jbd2: fix ocfs2 corrupt when updating journal superblock fails 2015-08-03 09:29:43 -07:00
jffs2 jffs2: fix handling of corrupted summary length 2015-03-06 14:40:53 -08:00
jfs jfs: fix readdir regression 2015-04-29 10:33:57 +02:00
lockd LOCKD: Fix a race when initialising nlmsvc_timeout 2015-01-27 07:52:33 -08:00
logfs block: Remove bi_idx references 2013-03-23 14:15:31 -07:00
minix fs: Limit sys_mount to only request filesystem modules. 2013-03-03 19:36:31 -08:00
ncpfs move d_rcu from overlapping d_child to overlapping d_alias 2015-04-29 10:34:00 +02:00
nfs NFS: nfs_set_pgio_error sometimes misses errors 2015-10-01 12:07:31 +02:00
nfs_common nfs_common: Update the translation between nfsv3 acls linux posix acls 2013-02-13 06:15:14 -08:00
nfsd nfsd4: fix xdr4 inclusion of escaped char 2015-01-16 06:59:02 -08:00
nilfs2 nilfs2: fix sanity check of btree level in nilfs_btree_root_broken() 2015-05-17 09:51:32 -07:00
nls
notify fsnotify: fix oops in fsnotify_clear_marks_by_group_flags() 2015-08-16 20:51:35 -07:00
ntfs aio: don't include aio.h in sched.h 2013-05-07 20:16:25 -07:00
ocfs2 ocfs2: fix BUG in ocfs2_downconvert_thread_do_work() 2015-08-16 20:51:40 -07:00
omfs fs, omfs: add NULL terminator in the end up the token list 2015-06-05 23:19:54 -07:00
openpromfs fs: Limit sys_mount to only request filesystem modules. 2013-03-03 19:36:31 -08:00
proc pagemap: do not leak physical addresses to non-privileged userspace 2015-04-19 10:10:51 +02:00
pstore pstore/ram: avoid atomic accesses for ioremapped regions 2015-02-05 22:35:40 -08:00
qnx4 fs: Limit sys_mount to only request filesystem modules. 2013-03-03 19:36:31 -08:00
qnx6 qnx6: qnx6_readdir() has a braino in pos calculation 2013-05-31 15:17:31 -04:00
quota quota: provide interface for readding allocated space into reserved space 2015-01-29 17:40:57 -08:00
ramfs Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2013-02-26 20:16:07 -08:00
reiserfs remove extra definitions of U32_MAX 2015-04-29 10:33:54 +02:00
romfs romfs: fix nommu map length to keep inside filesystem 2013-04-29 09:17:57 +10:00
squashfs fs: Limit sys_mount to only request filesystem modules. (Part 3) 2013-03-11 07:09:48 -07:00
sysfs sysfs: check if one entry has been removed before freeing 2013-04-05 15:35:52 -07:00
sysv sysv: Add forgotten superblock lock init for v7 fs 2013-10-05 07:13:09 -07:00
ubifs UBIFS: fix free log space calculation 2014-11-14 08:47:54 -08:00
udf udf: Verify symlink size before loading it 2015-01-08 09:58:17 -08:00
ufs Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial 2013-04-30 09:36:50 -07:00
xfs xfs: fix remote symlinks on V5/CRC filesystems 2015-08-03 09:29:45 -07:00
aio.c aio: fix kernel memory disclosure in io_getevents() introduced in v3.10 2014-06-30 20:09:45 -07:00
anon_inodes.c get_empty_filp()/alloc_file() leave both ->f_pos and ->f_version zero 2013-02-26 02:46:11 -05:00
attr.c fs,userns: Change inode_capable to capable_wrt_inode_uidgid 2014-06-16 13:42:52 -07:00
bad_inode.c
binfmt_aout.c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2013-05-01 17:51:54 -07:00
binfmt_elf_fdpic.c Merge branch 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc 2013-05-02 10:16:16 -07:00
binfmt_elf.c fs/binfmt_elf.c:load_elf_binary(): return -EINVAL on zero-length mappings 2015-06-05 23:20:00 -07:00
binfmt_em86.c
binfmt_flat.c new helper: read_code() 2013-04-29 15:40:23 -04:00
binfmt_misc.c binfmt_misc: reuse string_unescape_inplace() 2013-04-30 17:04:03 -07:00
binfmt_script.c
binfmt_som.c
bio-integrity.c bio-integrity: Fix bio_integrity_verify segment start bug 2014-03-23 21:38:21 -07:00
bio.c block: Fix bio_copy_data() 2013-10-05 07:13:09 -07:00
block_dev.c writeback: Fix periodic writeback after fs mount 2013-07-28 16:29:40 -07:00
buffer.c vfs: fix data corruption when blocksize < pagesize for mmaped data 2014-11-14 08:47:54 -08:00
char_dev.c
compat_binfmt_elf.c
compat_ioctl.c Removed unused typedef to avoid "unused local typedef" warnings. 2013-05-04 15:03:05 -04:00
compat.c aio: don't include aio.h in sched.h 2013-05-07 20:16:25 -07:00
coredump.c fs: if a coredump already exists, unlink and recreate with O_EXCL 2015-10-01 12:07:32 +02:00
coredump.h
dcache.c freeing unlinked file indefinitely delayed 2015-08-10 12:20:29 -07:00
dcookies.c fs/compat: fix lookup_dcookie() parameter handling 2014-02-13 13:48:00 -08:00
direct-io.c Merge branch 'for-3.10/core' of git://git.kernel.dk/linux-block 2013-05-08 10:13:35 -07:00
drop_caches.c
eventfd.c
eventpoll.c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal 2013-05-01 07:21:43 -07:00
exec.c fs: take i_mutex during prepare_binprm for set[ug]id executables 2015-07-03 19:48:09 -07:00
fcntl.c new helper: file_inode(file) 2013-02-22 23:31:31 -05:00
fhandle.c vfs: read file_handle only once in handle_to_path 2015-06-05 23:20:00 -07:00
file_table.c get rid of s_files and files_lock 2015-07-03 19:48:08 -07:00
file.c fs/file.c:fdtable: avoid triggering OOMs from alloc_fdmem 2014-02-22 12:41:25 -08:00
filesystems.c fs: Limit sys_mount to only request filesystem modules. 2013-03-03 19:36:31 -08:00
fs_struct.c constify path_get/path_put and fs_struct.c stuff 2013-03-01 23:51:07 -05:00
fs-writeback.c writeback: fix a subtle race condition in I_DIRTY clearing 2015-01-16 06:59:02 -08:00
generic_acl.c
inode.c fs: Fix S_NOSEC handling 2015-07-10 10:40:22 -07:00
internal.h get rid of s_files and files_lock 2015-07-03 19:48:08 -07:00
ioctl.c new helper: file_inode(file) 2013-02-22 23:31:31 -05:00
ioprio.c block: Fix computation of merged request priority 2014-11-21 09:22:53 -08:00
Kconfig efivarfs: Move to fs/efivarfs 2013-04-17 13:25:09 +01:00
Kconfig.binfmt fs: make binfmt support for #! scripts modular and removable 2013-04-30 17:04:04 -07:00
libfs.c move d_rcu from overlapping d_child to overlapping d_alias 2015-04-29 10:34:00 +02:00
locks.c locks: allow __break_lease to sleep even when break_time is 0 2014-05-13 13:59:44 +02:00
Makefile Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2013-05-01 17:51:54 -07:00
mbcache.c
mount.h vfs: Is mounted should be testing mnt_ns for NULL or error. 2014-02-06 11:08:16 -08:00
mpage.c
namei.c RCU pathwalk breakage when running into a symlink overmounting something 2015-05-06 21:56:27 +02:00
namespace.c umount: Disallow unprivileged mount force 2015-01-08 09:58:16 -08:00
no-block.c
open.c get rid of s_files and files_lock 2015-07-03 19:48:08 -07:00
pipe.c pipe: iovec: Fix memory corruption when retrying atomic copy as non-atomic 2015-06-29 12:08:34 -07:00
pnode.c vfs: Fix invalid ida_remove() call 2013-05-31 15:16:33 -04:00
pnode.h Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2013-05-01 17:51:54 -07:00
posix_acl.c posix_acl: handle NULL ACL in posix_acl_equiv_mode 2014-06-07 13:25:33 -07:00
proc_namespace.c
read_write.c fs/compat: fix parameter handling for compat readv/writev syscalls 2014-02-13 13:48:00 -08:00
readdir.c new helper: file_inode(file) 2013-02-22 23:31:31 -05:00
select.c sched/rt: Move rt specific bits into new header file 2013-02-07 20:51:08 +01:00
seq_file.c seq_file: always update file->f_pos in seq_lseek() 2013-11-13 12:05:34 +09:00
signalfd.c signalfd: fix information leak in signalfd_copyinfo 2015-08-16 20:51:42 -07:00
splice.c splice: Apply generic position and size checks to each write 2015-04-29 10:33:57 +02:00
stack.c
stat.c quota: provide interface for readding allocated space into reserved space 2015-01-29 17:40:57 -08:00
statfs.c vfs: allow O_PATH file descriptors for fstatfs() 2013-10-18 07:45:44 -07:00
super.c get rid of s_files and files_lock 2015-07-03 19:48:08 -07:00
sync.c teach SYSCALL_DEFINE<n> how to deal with long long/unsigned long long 2013-03-03 22:46:22 -05:00
timerfd.c compat: restore timerfd settime and gettime compat syscalls 2013-03-02 09:35:13 -05:00
utimes.c
xattr_acl.c
xattr.c