linux/mm
Wu Fengguang 94739dc4e2 vmscan: raise the bar to PAGEOUT_IO_SYNC stalls
commit e31f3698cd upstream.

Fix "system goes unresponsive under memory pressure and lots of
dirty/writeback pages" bug.

	http://lkml.org/lkml/2010/4/4/86

In the above thread, Andreas Mohr described that

	Invoking any command locked up for minutes (note that I'm
	talking about attempted additional I/O to the _other_,
	_unaffected_ main system HDD - such as loading some shell
	binaries -, NOT the external SSD18M!!).

This happens when the two conditions are both meet:
- under memory pressure
- writing heavily to a slow device

OOM also happens in Andreas' system.  The OOM trace shows that 3 processes
are stuck in wait_on_page_writeback() in the direct reclaim path.  One in
do_fork() and the other two in unix_stream_sendmsg().  They are blocked on
this condition:

	(sc->order && priority < DEF_PRIORITY - 2)

which was introduced in commit 78dc583d (vmscan: low order lumpy reclaim
also should use PAGEOUT_IO_SYNC) one year ago.  That condition may be too
permissive.  In Andreas' case, 512MB/1024 = 512KB.  If the direct reclaim
for the order-1 fork() allocation runs into a range of 512KB
hard-to-reclaim LRU pages, it will be stalled.

It's a severe problem in three ways.

Firstly, it can easily happen in daily desktop usage.  vmscan priority can
easily go below (DEF_PRIORITY - 2) on _local_ memory pressure.  Even if
the system has 50% globally reclaimable pages, it still has good
opportunity to have 0.1% sized hard-to-reclaim ranges.  For example, a
simple dd can easily create a big range (up to 20%) of dirty pages in the
LRU lists.  And order-1 to order-3 allocations are more than common with
SLUB.  Try "grep -v '1 :' /proc/slabinfo" to get the list of high order
slab caches.  For example, the order-1 radix_tree_node slab cache may
stall applications at swap-in time; the order-3 inode cache on most
filesystems may stall applications when trying to read some file; the
order-2 proc_inode_cache may stall applications when trying to open a
/proc file.

Secondly, once triggered, it will stall unrelated processes (not doing IO
at all) in the system.  This "one slow USB device stalls the whole system"
avalanching effect is very bad.

Thirdly, once stalled, the stall time could be intolerable long for the
users.  When there are 20MB queued writeback pages and USB 1.1 is writing
them in 1MB/s, wait_on_page_writeback() will stuck for up to 20 seconds.
Not to mention it may be called multiple times.

So raise the bar to only enable PAGEOUT_IO_SYNC when priority goes below
DEF_PRIORITY/3, or 6.25% LRU size.  As the default dirty throttle ratio is
20%, it will hardly be triggered by pure dirty pages.  We'd better treat
PAGEOUT_IO_SYNC as some last resort workaround -- its stall time is so
uncomfortably long (easily goes beyond 1s).

The bar is only raised for (order < PAGE_ALLOC_COSTLY_ORDER) allocations,
which are easy to satisfy in 1TB memory boxes.  So, although 6.25% of
memory could be an awful lot of pages to scan on a system with 1TB of
memory, it won't really have to busy scan that much.

Andreas tested an older version of this patch and reported that it mostly
fixed his problem.  Mel Gorman helped improve it and KOSAKI Motohiro will
fix it further in the next patch.

Reported-by: Andreas Mohr <andi@lisas.de>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2010-08-26 16:41:53 -07:00
..
allocpercpu.c percpu: use dynamic percpu allocator as the default percpu allocator 2009-06-24 15:13:35 +09:00
backing-dev.c Thaw refrigerated bdi flusher threads before invoking kthread_stop on them 2009-11-12 13:08:11 +01:00
bootmem.c kmemleak: Do not report alloc_bootmem blocks as leaks 2009-08-27 14:29:17 +01:00
bounce.c block: remove some includings of blktrace_api.h 2009-06-16 11:19:36 +02:00
debug-pagealloc.c generic debug pagealloc 2009-04-01 08:59:13 -07:00
dmapool.c dmapools: protect page_list walk in show_pools() 2009-06-30 18:56:00 -07:00
fadvise.c readahead: introduce FMODE_RANDOM for POSIX_FADV_RANDOM 2010-03-15 08:49:37 -07:00
failslab.c kmemtrace, mm: fix slab.h dependency problem in mm/failslab.c 2009-04-03 12:23:01 +02:00
filemap_xip.c const: mark struct vm_struct_operations 2009-09-27 11:39:25 -07:00
filemap.c do_generic_file_read: clear page errors when issuing a fresh read of the page 2010-07-05 11:10:57 -07:00
fremap.c Do not account for the address space used by hugetlbfs using VM_ACCOUNT 2009-02-10 10:48:42 -08:00
highmem.c highmem: Fix debug_kmap_atomic() to also handle KM_IRQ_PTE, KM_NMI, and KM_NMI_PTE 2009-11-10 04:15:47 +01:00
hugetlb.c mm: hugetlb: fix clear_huge_page() 2010-07-05 11:10:42 -07:00
hwpoison-inject.c HWPOISON: Add simple debugfs interface to inject hwpoison on arbitary PFNs 2009-09-16 11:50:17 +02:00
init-mm.c mm: consolidate init_mm definition 2009-06-16 19:47:28 -07:00
internal.h ksm: fix mlockfreed to munlocked 2010-01-06 15:05:22 -08:00
Kconfig NOMMU: Optimise away the {dac_,}mmap_min_addr tests 2010-01-06 15:04:30 -08:00
Kconfig.debug trivial: improve help text for mm debug config options 2009-09-21 15:14:57 +02:00
kmemcheck.c kmemcheck: add hooks for the page allocator 2009-06-15 15:48:33 +02:00
kmemleak-test.c percpu: clean up percpu variable definitions 2009-06-24 15:13:48 +09:00
kmemleak.c kmemleak: Check for NULL pointer returned by create_object() 2009-10-09 13:28:47 -07:00
ksm.c ksm: fix mlockfreed to munlocked 2010-01-06 15:05:22 -08:00
maccess.c [S390] maccess: add weak attribute to probe_kernel_write 2009-06-12 10:27:37 +02:00
madvise.c Merge branch 'hwpoison' of git://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-mce-2.6 2009-09-24 07:53:22 -07:00
Makefile procfs: disable per-task stack usage on NOMMU 2009-09-24 17:11:24 -07:00
memcontrol.c memcg: fix prepare migration 2010-05-12 14:57:00 -07:00
memory_hotplug.c mm: allow memory hotplug and hibernation in the same kernel 2009-11-17 17:40:33 -08:00
memory-failure.c HWPOISON: abort on failed unmap 2010-08-13 13:20:17 -07:00
memory.c mm: make stack guard page logic use vm_prev pointer 2010-08-26 16:41:45 -07:00
mempolicy.c tmpfs: cleanup mpol_parse_str() 2010-04-01 15:58:27 -07:00
mempool.c mm: remove broken 'kzalloc' mempool 2009-09-22 07:17:35 -07:00
migrate.c Fix potential crash with sys_move_pages 2010-02-23 07:37:42 -08:00
mincore.c mm: hugetlb: fix hugepage memory leak in mincore() 2009-12-18 14:04:29 -08:00
mlock.c mm: make the mlock() stack guard page checks stricter 2010-08-26 16:41:44 -07:00
mm_init.c mm: mminit_loglevel cannot be __meminitdata anymore 2008-08-20 15:40:30 -07:00
mmap.c mm: make the vma list be doubly linked 2010-08-26 16:41:44 -07:00
mmu_context.c mm: reduce atomic use on use_mm fast path 2009-09-22 07:17:42 -07:00
mmu_notifier.c ksm: add mmu_notifier set_pte_at_notify() 2009-09-22 07:17:31 -07:00
mmzone.c [ARM] Double check memmap is actually valid with a memmap has unexpected holes V2 2009-05-18 11:22:24 +01:00
mprotect.c perf: Do the big rename: Performance Counters -> Performance Events 2009-09-21 14:28:04 +02:00
mremap.c untangle the do_mremap() mess 2010-01-18 10:19:11 -08:00
msync.c [CVE-2009-0029] System call wrappers part 13 2009-01-14 14:15:23 +01:00
nommu.c mm: make the vma list be doubly linked 2010-08-26 16:41:44 -07:00
oom_kill.c memcg: fix oom killing a child process in an other cgroup 2010-03-15 08:49:33 -07:00
page_alloc.c mm: fix migratetype bug which slowed swapping 2010-02-09 04:50:49 -08:00
page_cgroup.c memory hotplug: alloc page from other node in memory online 2009-09-22 07:17:26 -07:00
page_io.c mm: remove file argument from swap_readpage() 2009-06-16 19:47:44 -07:00
page_isolation.c memory hotplug: fix page_zone() calculation in test_pages_isolated() 2008-11-06 15:41:19 -08:00
page-writeback.c writeback: account IO throttling wait as iowait 2009-10-09 12:40:42 +02:00
pagewalk.c mm: hugetlb: fix hugepage memory leak in walk_page_range() 2009-12-18 14:04:30 -08:00
percpu.c percpu: restructure pcpu_extend_area_map() to fix bugs and improve readability 2009-11-13 00:55:35 +09:00
prio_tree.c
quicklist.c cpumask: use new-style cpumask ops in mm/quicklist. 2009-09-24 09:34:52 +09:30
readahead.c readahead: fix NULL filp dereference 2010-04-26 07:41:19 -07:00
rmap.c mm/rmap.c: fix comment 2009-10-01 16:11:12 -07:00
shmem_acl.c shmfs: use 'check_acl' instead of 'permission' 2009-09-08 11:08:46 -07:00
shmem.c const: mark struct vm_struct_operations 2009-09-27 11:39:25 -07:00
slab.c slab: fix object alignment 2010-08-26 16:41:46 -07:00
slob.c slab: remove duplicate kmem_cache_init_late() declarations 2009-08-06 11:36:25 +03:00
slub.c mm: kmem_cache_create(): make it easier to catch NULL cache names 2009-09-22 07:17:33 -07:00
sparse-vmemmap.c memory hotplug: alloc page from other node in memory online 2009-09-22 07:17:26 -07:00
sparse.c memory hotplug: alloc page from other node in memory online 2009-09-22 07:17:26 -07:00
swap_state.c mm: add_to_swap_cache() does not return -EEXIST 2009-09-22 07:17:35 -07:00
swap.c mm: replace various uses of num_physpages by totalram_pages 2009-09-22 07:17:38 -07:00
swapfile.c mm: fix corruption of hibernation caused by reusing swap during image saving 2010-08-13 13:20:26 -07:00
thrash.c mm: pass mm to grab_swap_token 2009-06-23 12:50:05 -07:00
truncate.c vfs: Fix vmtruncate() regression 2010-01-22 15:18:41 -08:00
util.c untangle the do_mremap() mess 2010-01-18 10:19:11 -08:00
vmalloc.c mm: purge fragmented percpu vmap blocks 2010-02-09 04:50:58 -08:00
vmscan.c vmscan: raise the bar to PAGEOUT_IO_SYNC stalls 2010-08-26 16:41:53 -07:00
vmstat.c mm: vmstat: add isolate pages 2009-09-22 07:17:29 -07:00