linux/mm
Hugh Dickins 7dc7fb432b shmem: fix splicing from a hole while it's punched
commit b1a366500b upstream.

shmem_fault() is the actual culprit in trinity's hole-punch starvation,
and the most significant cause of such problems: since a page faulted is
one that then appears page_mapped(), needing unmap_mapping_range() and
i_mmap_mutex to be unmapped again.

But it is not the only way in which a page can be brought into a hole in
the radix_tree while that hole is being punched; and Vlastimil's testing
implies that if enough other processors are busy filling in the hole,
then shmem_undo_range() can be kept from completing indefinitely.

shmem_file_splice_read() is the main other user of SGP_CACHE, which can
instantiate shmem pagecache pages in the read-only case (without holding
i_mutex, so perhaps concurrently with a hole-punch).  Probably it's
silly not to use SGP_READ already (using the ZERO_PAGE for holes): which
ought to be safe, but might bring surprises - not a change to be rushed.

shmem_read_mapping_page_gfp() is an internal interface used by
drivers/gpu/drm GEM (and next by uprobes): it should be okay.  And
shmem_file_read_iter() uses the SGP_DIRTY variant of SGP_CACHE, when
called internally by the kernel (perhaps for a stacking filesystem,
which might rely on holes to be reserved): it's unclear whether it could
be provoked to keep hole-punch busy or not.

We could apply the same umbrella as now used in shmem_fault() to
shmem_file_splice_read() and the others; but it looks ugly, and use over
a range raises questions - should it actually be per page? can these get
starved themselves?

The origin of this part of the problem is my v3.1 commit d0823576bf
("mm: pincer in truncate_inode_pages_range"), once it was duplicated
into shmem.c.  It seemed like a nice idea at the time, to ensure
(barring RCU lookup fuzziness) that there's an instant when the entire
hole is empty; but the indefinitely repeated scans to ensure that make
it vulnerable.

Revert that "enhancement" to hole-punch from shmem_undo_range(), but
retain the unproblematic rescanning when it's truncating; add a couple
of comments there.

Remove the "indices[0] >= end" test: that is now handled satisfactorily
by the inner loop, and mem_cgroup_uncharge_start()/end() are too light
to be worth avoiding here.

But if we do not always loop indefinitely, we do need to handle the case
of swap swizzled back to page before shmem_free_swap() gets it: add a
retry for that case, as suggested by Konstantin Khlebnikov; and for the
case of page swizzled back to swap, as suggested by Johannes Weiner.

Signed-off-by: Hugh Dickins <hughd@google.com>
Reported-by: Sasha Levin <sasha.levin@oracle.com>
Suggested-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Konstantin Khlebnikov <koct9i@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Lukas Czerner <lczerner@redhat.com>
Cc: Dave Jones <davej@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-07-28 08:00:03 -07:00
..
backing-dev.c bdi: avoid oops on device removal 2014-04-26 17:15:35 -07:00
balloon_compaction.c mm: introduce a common interface for balloon pages mobility 2012-12-11 17:22:26 -08:00
bootmem.c mm: Add alloc_bootmem_low_pages_nopanic() 2013-01-29 19:32:59 -08:00
bounce.c mm/bounce.c: fix a regression where MS_SNAP_STABLE (stable pages snapshotting) was ignored 2013-10-13 16:08:33 -07:00
cleancache.c mm: cleancache: clean up cleancache_enabled 2013-04-30 17:04:01 -07:00
compaction.c mm/compaction: make isolate_freepages start at pageblock boundary 2014-06-16 13:42:53 -07:00
debug-pagealloc.c mm, x86: Remove debug_pagealloc_enabled 2011-12-06 09:24:07 +01:00
dmapool.c dmapool: make DMAPOOL_DEBUG detect corruption of free marker 2012-12-11 17:22:24 -08:00
fadvise.c teach SYSCALL_DEFINE<n> how to deal with long long/unsigned long long 2013-03-03 22:46:22 -05:00
failslab.c switch debugfs to umode_t 2012-01-03 22:54:56 -05:00
filemap_xip.c lift sb_start_write() out of ->write() 2013-04-09 14:12:56 -04:00
filemap.c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2013-05-01 17:51:54 -07:00
fremap.c mm: fix use-after-free in sys_remap_file_pages 2014-01-09 12:24:24 -08:00
frontswap.c frontswap: fix incorrect zeroing and allocation size for frontswap_map 2013-06-12 16:29:46 -07:00
highmem.c Some nice cleanups, and even a patch my wife did as a "live" demo for 2012-12-20 08:37:05 -08:00
huge_memory.c thp: fix copy_page_rep GPF by testing is_huge_zero_pmd once only 2014-01-25 08:27:12 -08:00
hugetlb_cgroup.c mm/hugetlb: create hugetlb cgroup file in hugetlb_init 2012-12-18 15:02:15 -08:00
hugetlb.c hugetlb: fix copy_hugetlb_page_range() to handle migration/hwpoisoned entry 2014-07-09 11:14:03 -07:00
hwpoison-inject.c memcg: rename config variables 2012-07-31 18:42:43 -07:00
init-mm.c atomic: use <linux/atomic.h> 2011-07-26 16:49:47 -07:00
internal.h mm: accelerate munlock() treatment of THP pages 2013-02-27 19:10:09 -08:00
interval_tree.c mm: add CONFIG_DEBUG_VM_RB build option 2012-10-09 16:22:42 +09:00
Kconfig mmKconfig: add an option to disable bounce 2013-04-29 15:54:40 -07:00
Kconfig.debug mm: more intensive memory corruption debugging 2012-01-10 16:30:42 -08:00
kmemcheck.c
kmemleak-test.c kmemleak: remove memset by using kzalloc 2011-01-27 18:31:51 +00:00
kmemleak.c hlist: drop the node parameter from iterators 2013-02-27 19:10:24 -08:00
ksm.c mm: close PageTail race 2014-04-03 12:01:05 -07:00
maccess.c mm: Map most files to use export.h instead of module.h 2011-10-31 09:20:12 -04:00
madvise.c mm: madvise: complete input validation before taking lock 2013-04-29 15:54:37 -07:00
Makefile memcg: add memory.pressure_level events 2013-04-29 15:54:38 -07:00
memblock.c memblock: fix missing comment of memblock_insert_region() 2013-04-29 15:54:38 -07:00
memcontrol.c memcg: reparent charges of children before processing parent 2014-03-23 21:38:20 -07:00
memory_hotplug.c mm/memory_hotplug.c: fix printk format warnings 2013-05-24 16:22:52 -07:00
memory-failure.c mm/memory-failure.c: don't let collect_procs() skip over processes for MF_ACTION_REQUIRED 2014-06-30 20:09:42 -07:00
memory.c mm: make fixup_user_fault() check the vma access rights too 2014-06-07 13:25:29 -07:00
mempolicy.c cpuset,mempolicy: fix sleeping function called from invalid context 2014-07-17 15:58:00 -07:00
mempool.c mempool: add @gfp_mask to mempool_create_node() 2012-06-25 11:53:47 +02:00
migrate.c mm: numa: avoid unnecessary work on the failure path 2014-01-09 12:24:23 -08:00
mincore.c swap: make each swap partition have one address_space 2013-02-23 17:50:17 -08:00
mlock.c mm: try_to_unmap_cluster() should lock_page() before mlocking 2014-05-06 07:55:32 -07:00
mm_init.c mm: init: report on last-nid information stored in page->flags 2013-02-23 17:50:18 -08:00
mmap.c mm: ensure get_unmapped_area() returns higher address than mmap_min_addr 2013-12-04 10:56:39 -08:00
mmu_context.c mm: remove old aio use_mm() comment 2013-05-07 18:38:27 -07:00
mmu_notifier.c mm: mmu_notifier: re-fix freed page still mapped in secondary MMU 2013-05-24 16:22:51 -07:00
mmzone.c mm: rename page struct field helpers 2013-02-23 17:50:18 -08:00
mprotect.c mm: fix TLB flush race between migration, and change_protection_range 2014-01-09 12:24:23 -08:00
mremap.c mm, thp: close race between mremap() and split_huge_page() 2014-06-07 13:25:31 -07:00
msync.c
nobootmem.c mm, nobootmem: do memset() after memblock_reserve() 2013-04-29 15:54:39 -07:00
nommu.c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal 2013-05-01 07:21:43 -07:00
oom_kill.c mm, oom: base root bonus on current usage 2014-02-13 13:48:02 -08:00
page_alloc.c mm: close PageTail race 2014-04-03 12:01:05 -07:00
page_cgroup.c memcontrol: use N_MEMORY instead N_HIGH_MEMORY 2012-12-12 17:38:32 -08:00
page_io.c Merge branch 'for-3.10/core' of git://git.kernel.dk/linux-block 2013-05-08 10:13:35 -07:00
page_isolation.c mm: fix zone_watermark_ok_safe() accounting of isolated pages 2013-01-04 16:11:46 -08:00
page-writeback.c mm: __set_page_dirty_nobuffers() uses spin_lock_irqsave() instead of spin_lock_irq() 2014-02-20 11:06:11 -08:00
pagewalk.c mm/pagewalk.c: fix walk_page_range() access of wrong PTEs 2013-11-13 12:05:34 +09:00
percpu-km.c percpu: clear memory allocated with the km allocator 2010-10-02 10:28:42 +03:00
percpu-vm.c mm: fix kernel-doc warnings 2012-06-20 14:39:36 -07:00
percpu.c percpu: make pcpu_alloc_chunk() use pcpu_mem_free() instead of kfree() 2014-06-07 13:25:37 -07:00
pgtable-generic.c mm: fix TLB flush race between migration, and change_protection_range 2014-01-09 12:24:23 -08:00
process_vm_access.c Fix: compat_rw_copy_check_uvector() misuse in aio, readv, writev, and security keys 2013-03-12 11:05:45 -07:00
quicklist.c mm: delete various needless include <linux/module.h> 2011-10-31 09:20:11 -04:00
readahead.c teach SYSCALL_DEFINE<n> how to deal with long long/unsigned long long 2013-03-03 22:46:22 -05:00
rmap.c mm: fix sleeping function warning from __put_anon_vma 2014-06-30 20:09:42 -07:00
shmem.c shmem: fix splicing from a hole while it's punched 2014-07-28 08:00:03 -07:00
slab_common.c slab: prevent warnings when allocating with __GFP_NOWARN 2013-06-13 10:01:58 +03:00
slab.c slab: fix init_lock_keys 2013-07-21 18:21:26 -07:00
slab.h memcg: check that kmem_cache has memcg_params before accessing it 2013-09-07 22:09:58 -07:00
slob.c mm: rename page struct field helpers 2013-02-23 17:50:18 -08:00
slub.c slub: Fix calculation of cpu slabs 2014-02-13 13:48:00 -08:00
sparse-vmemmap.c sparse-vmemmap: specify vmemmap population range in bytes 2013-04-29 15:54:35 -07:00
sparse.c mm, hotplug: avoid compiling memory hotremove functions when disabled 2013-04-29 15:54:37 -07:00
swap_state.c swap: avoid read_swap_cache_async() race to deadlock while waiting on discard I/O completion 2013-06-12 16:29:45 -07:00
swap.c mm: close PageTail race 2014-04-03 12:01:05 -07:00
swapfile.c frontswap: fix incorrect zeroing and allocation size for frontswap_map 2013-06-12 16:29:46 -07:00
truncate.c mm: drop vmtruncate 2012-12-20 18:46:29 -05:00
util.c swap: make each swap partition have one address_space 2013-02-23 17:50:17 -08:00
vmalloc.c mm/vmalloc.c: fix an overflow bug in alloc_vmap_area() 2013-11-13 12:05:34 +09:00
vmpressure.c memcg: add memory.pressure_level events 2013-04-29 15:54:38 -07:00
vmscan.c mm: vmscan: clear kswapd's special reclaim powers before exiting 2014-06-30 20:09:42 -07:00
vmstat.c mm: numa: return the number of base pages altered by protection changes 2013-12-08 07:29:27 -08:00