linux/fs
Mel Gorman 2457aec637 mm: non-atomically mark page accessed during page cache allocation where possible
aops->write_begin may allocate a new page and make it visible only to have
mark_page_accessed called almost immediately after.  Once the page is
visible the atomic operations are necessary which is noticable overhead
when writing to an in-memory filesystem like tmpfs but should also be
noticable with fast storage.  The objective of the patch is to initialse
the accessed information with non-atomic operations before the page is
visible.

The bulk of filesystems directly or indirectly use
grab_cache_page_write_begin or find_or_create_page for the initial
allocation of a page cache page.  This patch adds an init_page_accessed()
helper which behaves like the first call to mark_page_accessed() but may
called before the page is visible and can be done non-atomically.

The primary APIs of concern in this care are the following and are used
by most filesystems.

	find_get_page
	find_lock_page
	find_or_create_page
	grab_cache_page_nowait
	grab_cache_page_write_begin

All of them are very similar in detail to the patch creates a core helper
pagecache_get_page() which takes a flags parameter that affects its
behavior such as whether the page should be marked accessed or not.  Then
old API is preserved but is basically a thin wrapper around this core
function.

Each of the filesystems are then updated to avoid calling
mark_page_accessed when it is known that the VM interfaces have already
done the job.  There is a slight snag in that the timing of the
mark_page_accessed() has now changed so in rare cases it's possible a page
gets to the end of the LRU as PageReferenced where as previously it might
have been repromoted.  This is expected to be rare but it's worth the
filesystem people thinking about it in case they see a problem with the
timing change.  It is also the case that some filesystems may be marking
pages accessed that previously did not but it makes sense that filesystems
have consistent behaviour in this regard.

The test case used to evaulate this is a simple dd of a large file done
multiple times with the file deleted on each iterations.  The size of the
file is 1/10th physical memory to avoid dirty page balancing.  In the
async case it will be possible that the workload completes without even
hitting the disk and will have variable results but highlight the impact
of mark_page_accessed for async IO.  The sync results are expected to be
more stable.  The exception is tmpfs where the normal case is for the "IO"
to not hit the disk.

The test machine was single socket and UMA to avoid any scheduling or NUMA
artifacts.  Throughput and wall times are presented for sync IO, only wall
times are shown for async as the granularity reported by dd and the
variability is unsuitable for comparison.  As async results were variable
do to writback timings, I'm only reporting the maximum figures.  The sync
results were stable enough to make the mean and stddev uninteresting.

The performance results are reported based on a run with no profiling.
Profile data is based on a separate run with oprofile running.

async dd
                                    3.15.0-rc3            3.15.0-rc3
                                       vanilla           accessed-v2
ext3    Max      elapsed     13.9900 (  0.00%)     11.5900 ( 17.16%)
tmpfs	Max      elapsed      0.5100 (  0.00%)      0.4900 (  3.92%)
btrfs   Max      elapsed     12.8100 (  0.00%)     12.7800 (  0.23%)
ext4	Max      elapsed     18.6000 (  0.00%)     13.3400 ( 28.28%)
xfs	Max      elapsed     12.5600 (  0.00%)      2.0900 ( 83.36%)

The XFS figure is a bit strange as it managed to avoid a worst case by
sheer luck but the average figures looked reasonable.

        samples percentage
ext3       86107    0.9783  vmlinux-3.15.0-rc4-vanilla        mark_page_accessed
ext3       23833    0.2710  vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed
ext3        5036    0.0573  vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed
ext4       64566    0.8961  vmlinux-3.15.0-rc4-vanilla        mark_page_accessed
ext4        5322    0.0713  vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed
ext4        2869    0.0384  vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed
xfs        62126    1.7675  vmlinux-3.15.0-rc4-vanilla        mark_page_accessed
xfs         1904    0.0554  vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed
xfs          103    0.0030  vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed
btrfs      10655    0.1338  vmlinux-3.15.0-rc4-vanilla        mark_page_accessed
btrfs       2020    0.0273  vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed
btrfs        587    0.0079  vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed
tmpfs      59562    3.2628  vmlinux-3.15.0-rc4-vanilla        mark_page_accessed
tmpfs       1210    0.0696  vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed
tmpfs         94    0.0054  vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed

[akpm@linux-foundation.org: don't run init_page_accessed() against an uninitialised pointer]
Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Jan Kara <jack@suse.cz>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Tested-by: Prabhakar Lad <prabhakar.csengg@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-04 16:54:10 -07:00
..
9p fs/9p: kerneldoc fixes 2014-06-04 16:53:55 -07:00
adfs fs/adfs/super.c: add __init to init_inodecache() 2014-04-07 16:36:08 -07:00
affs fs/affs/super.c: bugfix / double free 2014-05-06 13:05:00 -07:00
afs File locking related changes for v3.16 2014-06-04 08:12:50 -07:00
autofs4 autofs: fix lockref lookup 2014-05-06 13:04:59 -07:00
befs Major changes for 3.14 include support for the newly added ZERO_RANGE 2014-04-04 15:39:39 -07:00
bfs fs/bfs/inode.c: add __init to init_inodecache() 2014-04-07 16:36:08 -07:00
btrfs mm: non-atomically mark page accessed during page cache allocation where possible 2014-06-04 16:54:10 -07:00
cachefiles Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2014-04-12 14:49:50 -07:00
ceph locks: ensure that fl_owner is always initialized properly in flock and lease codepaths 2014-06-02 08:09:29 -04:00
cifs cifs: fix actimeo=0 corner case when cifs_i->time == jiffies 2014-04-24 22:37:03 -05:00
coda Major changes for 3.14 include support for the newly added ZERO_RANGE 2014-04-04 15:39:39 -07:00
configfs fs/configfs: use pr_fmt 2014-06-04 16:53:53 -07:00
cramfs Major changes for 3.14 include support for the newly added ZERO_RANGE 2014-04-04 15:39:39 -07:00
debugfs Major changes for 3.14 include support for the newly added ZERO_RANGE 2014-04-04 15:39:39 -07:00
devpts fs: push sync_filesystem() down to the file system's remount_fs() 2014-03-13 10:14:33 -04:00
dlm net: Fix use after free by removing length arg from sk_data_ready callbacks. 2014-04-11 16:15:36 -04:00
ecryptfs Merge branch 'cross-rename' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs 2014-04-04 14:03:05 -07:00
efivarfs efivarfs: 'efivarfs_file_write' function reorganization 2014-03-04 16:16:16 +00:00
efs Major changes for 3.14 include support for the newly added ZERO_RANGE 2014-04-04 15:39:39 -07:00
exofs Merge branch 'for-linus' of git://git.open-osd.org/linux-open-osd 2014-04-10 14:33:02 -07:00
exportfs exportfs: fix quadratic behavior in filehandle lookup 2013-11-09 00:16:38 -05:00
ext2 Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs 2014-04-07 17:59:17 -07:00
ext3 Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs 2014-04-07 17:59:17 -07:00
ext4 mm: non-atomically mark page accessed during page cache allocation where possible 2014-06-04 16:54:10 -07:00
f2fs mm: non-atomically mark page accessed during page cache allocation where possible 2014-06-04 16:54:10 -07:00
fat Major changes for 3.14 include support for the newly added ZERO_RANGE 2014-04-04 15:39:39 -07:00
freevxfs Major changes for 3.14 include support for the newly added ZERO_RANGE 2014-04-04 15:39:39 -07:00
fscache fs/fscache: replace seq_printf by seq_puts 2014-06-04 16:53:52 -07:00
fuse mm: non-atomically mark page accessed during page cache allocation where possible 2014-06-04 16:54:10 -07:00
gfs2 mm: non-atomically mark page accessed during page cache allocation where possible 2014-06-04 16:54:10 -07:00
hfs Major changes for 3.14 include support for the newly added ZERO_RANGE 2014-04-04 15:39:39 -07:00
hfsplus Major changes for 3.14 include support for the newly added ZERO_RANGE 2014-04-04 15:39:39 -07:00
hostfs mm + fs: store shadow entries in page cache 2014-04-03 16:21:01 -07:00
hpfs Major changes for 3.14 include support for the newly added ZERO_RANGE 2014-04-04 15:39:39 -07:00
hppfs
hugetlbfs fs/hugetlbfs/inode.c: complete conversion to pr_foo() 2014-06-04 16:54:00 -07:00
isofs Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs 2014-04-07 17:59:17 -07:00
jbd jbd: Revise KERN_EMERG error messages 2013-12-04 12:27:46 +01:00
jbd2 arch: Mass conversion of smp_mb__*() 2014-04-18 14:20:48 +02:00
jffs2 MTD updates for 3.15: 2014-04-07 10:17:30 -07:00
jfs fs/jfs/super.c: convert simple_str to kstr 2014-06-03 14:14:00 -05:00
kernfs kernfs: move the last knowledge of sysfs out from kernfs 2014-05-27 14:33:17 -07:00
lockd lockd: ensure we tear down any live sockets when socket creation fails during lockd_up 2014-03-28 10:43:08 -04:00
logfs mm + fs: store shadow entries in page cache 2014-04-03 16:21:01 -07:00
minix Major changes for 3.14 include support for the newly added ZERO_RANGE 2014-04-04 15:39:39 -07:00
ncpfs Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2014-04-12 17:31:22 -07:00
nfs File locking related changes for v3.16 2014-06-04 08:12:50 -07:00
nfs_common
nfsd nfsd4: warn on finding lockowner without stateid's 2014-05-21 11:11:21 -04:00
nilfs2 mm: implement ->map_pages for page cache 2014-04-07 16:35:53 -07:00
nls nls: have register_nls() set ->owner 2014-01-25 03:14:05 -05:00
notify fanotify: check file flags passed in fanotify_init 2014-06-04 16:53:52 -07:00
ntfs mm: non-atomically mark page accessed during page cache allocation where possible 2014-06-04 16:54:10 -07:00
ocfs2 fs/buffer.c: remove block_write_full_page_endio() 2014-06-04 16:54:02 -07:00
omfs mm + fs: store shadow entries in page cache 2014-04-03 16:21:01 -07:00
openpromfs fs: push sync_filesystem() down to the file system's remount_fs() 2014-03-13 10:14:33 -04:00
proc mm: softdirty: clear VM_SOFTDIRTY flag inside clear_refs_write() instead of clear_soft_dirty() 2014-06-04 16:53:56 -07:00
pstore Major changes for 3.14 include support for the newly added ZERO_RANGE 2014-04-04 15:39:39 -07:00
qnx4 fs: push sync_filesystem() down to the file system's remount_fs() 2014-03-13 10:14:33 -04:00
qnx6 fs: push sync_filesystem() down to the file system's remount_fs() 2014-03-13 10:14:33 -04:00
quota Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs 2014-04-07 17:59:17 -07:00
ramfs fs/ramfs: move ramfs_aops to inode.c 2014-01-23 16:36:58 -08:00
reiserfs Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs 2014-04-07 17:59:17 -07:00
romfs fs: push sync_filesystem() down to the file system's remount_fs() 2014-03-13 10:14:33 -04:00
squashfs fs/squashfs/squashfs.h: replace pr_warning by pr_warn 2014-06-04 16:53:52 -07:00
sysfs kernfs: move the last knowledge of sysfs out from kernfs 2014-05-27 14:33:17 -07:00
sysv Major changes for 3.14 include support for the newly added ZERO_RANGE 2014-04-04 15:39:39 -07:00
ubifs Merge branch 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip into next 2014-06-03 12:57:53 -07:00
udf Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2014-04-12 14:49:50 -07:00
ufs fs/ufs: remove unused ufs_super_block_third pointer 2014-04-07 16:36:16 -07:00
xfs xfs: list_lru_init returns a negative error 2014-05-15 09:23:24 +10:00
aio.c aio: fix potential leak in aio_run_iocb(). 2014-05-01 08:37:43 -04:00
anon_inodes.c vfs: Allocate anon_inode_inode in anon_inode_init() 2014-03-27 09:52:54 -07:00
attr.c fs: fix iversion handling 2013-12-05 16:36:21 -06:00
bad_inode.c
binfmt_aout.c dump_skip(): dump_seek() replacement taking coredump_params 2013-11-09 00:16:26 -05:00
binfmt_elf_fdpic.c elf{,_fdpic} coredump: get rid of pointless if (siginfo->si_signo) 2013-11-09 00:16:30 -05:00
binfmt_elf.c exec: kill the unnecessary mm->def_flags setting in load_elf_binary() 2014-04-07 16:35:52 -07:00
binfmt_em86.c file->f_op is never NULL... 2013-10-24 23:34:54 -04:00
binfmt_flat.c
binfmt_misc.c binfmt_misc: add missing 'break' statement 2014-04-03 16:21:16 -07:00
binfmt_script.c
binfmt_som.c
block_dev.c fs/block_dev.c: add bdev_read_page() and bdev_write_page() 2014-06-04 16:54:02 -07:00
buffer.c mm: non-atomically mark page accessed during page cache allocation where possible 2014-06-04 16:54:10 -07:00
char_dev.c Merge branch 'for-3.13/core' of git://git.kernel.dk/linux-block 2013-11-14 12:08:14 +09:00
compat_binfmt_elf.c binfmt_elf: add ELF_HWCAP2 to compat auxv entries 2014-03-04 08:05:21 +00:00
compat_ioctl.c fs/compat: convert to COMPAT_SYSCALL_DEFINE with changing parameter types 2014-03-06 16:30:44 +01:00
compat.c locks: rename file-private locks to "open file description locks" 2014-04-22 08:23:58 -04:00
coredump.c coredump: fix va_list corruption 2014-04-19 13:23:31 -07:00
dcache.c dcache: add missing lockdep annotation 2014-05-31 09:13:21 -07:00
dcookies.c fs/compat: fix lookup_dcookie() parameter handling 2014-01-29 16:22:40 -08:00
direct-io.c xfs: update for 3.15-rc1 2014-04-04 15:50:08 -07:00
drop_caches.c drop_caches: add some documentation and info message 2014-04-03 16:21:04 -07:00
eventfd.c eventfd_ctx_fdget(): use fdget() instead of fget() 2014-01-25 03:13:04 -05:00
eventpoll.c epoll: do not take the nested ep->mtx on EPOLL_CTL_DEL 2014-01-02 14:40:30 -08:00
exec.c metag: Reduce maximum stack size to 256MB 2014-05-15 00:00:35 +01:00
fcntl.c locks: rename file-private locks to "open file description locks" 2014-04-22 08:23:58 -04:00
fhandle.c
file_table.c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2014-04-12 14:49:50 -07:00
file.c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2014-04-12 14:49:50 -07:00
filesystems.c sys_sysfs: Add CONFIG_SYSFS_SYSCALL 2014-04-03 16:21:05 -07:00
fs_struct.c seqcount: Add lockdep functionality to seqcount/seqlock structures 2013-11-06 12:40:26 +01:00
fs-writeback.c One of the main highlights this time, is not the patches themselves 2014-04-04 14:49:16 -07:00
inode.c Major changes for 3.14 include support for the newly added ZERO_RANGE 2014-04-04 15:39:39 -07:00
internal.h get rid of s_files and files_lock 2013-11-09 00:16:20 -05:00
ioctl.c file->f_op is never NULL... 2013-10-24 23:34:54 -04:00
Kconfig kernfs: add CONFIG_KERNFS 2014-02-07 16:08:57 -08:00
Kconfig.binfmt
libfs.c fs/libfs.c: add generic data flush to fsync 2014-06-04 16:53:55 -07:00
locks.c locks: add some tracepoints in the lease handling code 2014-06-02 08:09:30 -04:00
Makefile block: move ioprio.c from fs/ to block/ 2014-05-19 11:02:18 -06:00
mbcache.c ext4: each filesystem creates and uses its own mb_cache 2014-03-18 19:24:49 -04:00
mount.h reduce m_start() cost... 2014-04-01 23:19:09 -04:00
mpage.c fs/block_dev.c: add bdev_read_page() and bdev_write_page() 2014-06-04 16:54:02 -07:00
namei.c fix races between __d_instantiate() and checks of dentry flags 2014-04-19 12:30:58 -04:00
namespace.c VFS: Make delayed_free() call free_vfsmnt() 2014-04-01 23:19:18 -04:00
no-block.c
open.c These are regression and bug fixes for ext4. 2014-04-20 20:43:47 -07:00
pipe.c switch pipe_read() to copy_page_to_iter() 2014-04-01 23:19:22 -04:00
pnode.c smarter propagate_mnt() 2014-04-01 23:19:08 -04:00
pnode.h smarter propagate_mnt() 2014-04-01 23:19:08 -04:00
posix_acl.c posix_acl: handle NULL ACL in posix_acl_equiv_mode 2014-05-06 13:58:42 -04:00
proc_namespace.c reduce m_start() cost... 2014-04-01 23:19:09 -04:00
read_write.c Merge branch 'compat' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux 2014-03-31 14:32:17 -07:00
readdir.c fanotify: create FAN_ACCESS event for readdir 2014-06-04 16:53:52 -07:00
select.c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2013-11-13 15:34:18 +09:00
seq_file.c seq_file: always clear m->count when we free m->buf 2013-11-18 19:07:53 -08:00
signalfd.c
splice.c vfs: fix vmplice_to_user() 2014-05-28 01:54:52 -04:00
stack.c
stat.c vfs: split out vfs_getattr_nosec 2013-11-09 00:16:31 -05:00
statfs.c vfs: allow O_PATH file descriptors for fstatfs() 2013-10-12 13:12:31 -07:00
super.c fs: Don't return 0 from get_anon_bdev 2014-04-16 11:53:08 -07:00
sync.c Revert "writeback: do not sync data dirtied after sync start" 2014-02-22 02:02:28 +01:00
timerfd.c timerfd: support CLOCK_BOOTTIME clock 2014-01-23 16:57:40 -08:00
utimes.c locks: break delegations on any attribute modification 2013-11-09 00:16:44 -05:00
xattr.c