linux/mm
Mel Gorman f1181047ff mm, mprotect: flush TLB if potentially racing with a parallel reclaim leaving stale TLB entries
commit 3ea277194d upstream.

Stable note for 4.4: The upstream patch patches madvise(MADV_FREE) but 4.4
	does not have support for that feature. The changelog is left
	as-is but the hunk related to madvise is omitted from the backport.

Nadav Amit identified a theoritical race between page reclaim and
mprotect due to TLB flushes being batched outside of the PTL being held.

He described the race as follows:

        CPU0                            CPU1
        ----                            ----
                                        user accesses memory using RW PTE
                                        [PTE now cached in TLB]
        try_to_unmap_one()
        ==> ptep_get_and_clear()
        ==> set_tlb_ubc_flush_pending()
                                        mprotect(addr, PROT_READ)
                                        ==> change_pte_range()
                                        ==> [ PTE non-present - no flush ]

                                        user writes using cached RW PTE
        ...

        try_to_unmap_flush()

The same type of race exists for reads when protecting for PROT_NONE and
also exists for operations that can leave an old TLB entry behind such
as munmap, mremap and madvise.

For some operations like mprotect, it's not necessarily a data integrity
issue but it is a correctness issue as there is a window where an
mprotect that limits access still allows access.  For munmap, it's
potentially a data integrity issue although the race is massive as an
munmap, mmap and return to userspace must all complete between the
window when reclaim drops the PTL and flushes the TLB.  However, it's
theoritically possible so handle this issue by flushing the mm if
reclaim is potentially currently batching TLB flushes.

Other instances where a flush is required for a present pte should be ok
as either the page lock is held preventing parallel reclaim or a page
reference count is elevated preventing a parallel free leading to
corruption.  In the case of page_mkclean there isn't an obvious path
that userspace could take advantage of without using the operations that
are guarded by this patch.  Other users such as gup as a race with
reclaim looks just at PTEs.  huge page variants should be ok as they
don't race with reclaim.  mincore only looks at PTEs.  userfault also
should be ok as if a parallel reclaim takes place, it will either fault
the page back in or read some of the data before the flush occurs
triggering a fault.

Note that a variant of this patch was acked by Andy Lutomirski but this
was for the x86 parts on top of his PCID work which didn't make the 4.13
merge window as expected.  His ack is dropped from this version and
there will be a follow-on patch on top of PCID that will include his
ack.

[akpm@linux-foundation.org: tweak comments]
[akpm@linux-foundation.org: fix spello]
Link: http://lkml.kernel.org/r/20170717155523.emckq2esjro6hf3z@suse.de
Reported-by: Nadav Amit <nadav.amit@gmail.com>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-08-11 09:08:50 -07:00
..
kasan kasan: respect /proc/sys/kernel/traceoff_on_warning 2017-06-17 06:39:36 +02:00
backing-dev.c block: fix double-free in the failure path of cgwb_bdi_init() 2017-02-26 11:07:51 +01:00
balloon_compaction.c virtio_balloon: fix race between migration and ballooning 2016-03-03 15:07:18 -08:00
bootmem.c bootmem: avoid freeing to bootmem after bootmem is done 2015-09-08 15:35:28 -07:00
cleancache.c cleancache: remove limit on the number of cleancache enabled filesystems 2015-04-14 16:49:03 -07:00
cma_debug.c mm/cma_debug: correct size input to bitmap function 2015-07-17 16:39:54 -07:00
cma.c mm/cma: silence warnings due to max() usage 2016-11-10 16:36:36 +01:00
cma.h mm: cma: mark cma_bitmap_maxno() inline in header 2015-08-14 15:56:32 -07:00
compaction.c mm, compaction: prevent VM_BUG_ON when terminating freeing scanner 2016-08-10 11:49:25 +02:00
debug-pagealloc.c
debug.c mm: make compound_head() robust 2015-11-06 17:50:42 -08:00
dmapool.c mm, page_alloc: distinguish between being unable to sleep, unwilling to sleep and avoiding waking kswapd 2015-11-06 17:50:42 -08:00
early_ioremap.c mm/early_ioremap: use offset_in_page macro 2015-11-05 19:34:48 -08:00
fadvise.c writeback: implement and use inode_congested() 2015-06-02 08:33:35 -06:00
failslab.c mm, page_alloc: rename __GFP_WAIT to __GFP_RECLAIM 2015-11-06 17:50:42 -08:00
filemap.c mm: do not access page->mapping directly on page_endio 2017-03-12 06:37:25 +01:00
frame_vector.c mm: fix docbook comment for get_vaddr_frames() 2015-11-05 19:34:48 -08:00
frontswap.c frontswap: allow multiple backends 2015-06-24 17:49:45 -07:00
gup.c mm: larger stack guard gap, between vmas 2017-06-26 07:13:11 +02:00
highmem.c
huge_memory.c mm: numa: avoid waiting on freed migrated pages 2017-07-05 14:37:16 +02:00
hugetlb_cgroup.c mm: make compound_head() robust 2015-11-06 17:50:42 -08:00
hugetlb.c mm, hugetlb: use pte_present() instead of pmd_present() in follow_huge_pmd() 2017-04-08 09:53:32 +02:00
hwpoison-inject.c hwpoison: use page_cgroup_ino for filtering by memcg 2015-09-10 13:29:01 -07:00
init-mm.c mm: Add a user_ns owner to mm_struct and fix ptrace permission checks 2017-01-06 11:16:11 +01:00
internal.h mm, mprotect: flush TLB if potentially racing with a parallel reclaim leaving stale TLB entries 2017-08-11 09:08:50 -07:00
interval_tree.c mm: replace vma->sharead.linear with vma->shared 2015-02-10 14:30:31 -08:00
Kconfig mm: make compound_head() robust 2015-11-06 17:50:42 -08:00
Kconfig.debug mm/debug_pagealloc: remove obsolete Kconfig options 2015-01-08 15:10:52 -08:00
kmemcheck.c
kmemleak-test.c
kmemleak.c mm/kmemleak.c: remove unneeded initialization of object to NULL 2015-11-05 19:34:48 -08:00
ksm.c mm,ksm: fix endless looping in allocating memory when ksm enable 2016-10-07 15:23:40 +02:00
list_lru.c mm/list_lru.c: fix list_lru_count_node() to be race free 2017-07-21 07:44:56 +02:00
maccess.c mm/maccess.c: actually return -EFAULT from strncpy_from_unsafe 2015-11-05 19:34:48 -08:00
madvise.c mm: madvise allow remove operation for hugetlbfs 2015-09-08 15:35:28 -07:00
Makefile media updates for v4.3-rc1 2015-09-11 16:42:39 -07:00
memblock.c mm: consider memblock reservations for deferred memory initialization sizing 2017-06-14 13:16:26 +02:00
memcontrol.c mm: memcontrol: avoid unused function warning 2017-03-18 19:09:57 +08:00
memory_hotplug.c base/memory, hotplug: fix a kernel oops in show_valid_zones() 2017-02-09 08:02:47 +01:00
memory-failure.c mm/memory-failure.c: use compound_head() flags for huge pages 2017-06-26 07:13:10 +02:00
memory.c mm, mprotect: flush TLB if potentially racing with a parallel reclaim leaving stale TLB entries 2017-08-11 09:08:50 -07:00
mempolicy.c mm/mempolicy.c: fix error handling in set_mempolicy and mbind. 2017-04-12 12:38:35 +02:00
mempool.c mm, page_alloc: distinguish between being unable to sleep, unwilling to sleep and avoiding waking kswapd 2015-11-06 17:50:42 -08:00
memtest.c memtest: remove unused header files 2015-09-08 15:35:28 -07:00
migrate.c mm: Export migrate_page_move_mapping and migrate_page_copy 2016-07-27 09:47:31 -07:00
mincore.c mm/mincore: use offset_in_page macro 2015-11-05 19:34:48 -08:00
mlock.c mlock: fix mlock count can not decrease in race condition 2017-06-07 12:06:01 +02:00
mm_init.c mm: meminit: remove mminit_verify_page_links 2015-06-30 19:44:56 -07:00
mmap.c mm: fix overflow check in expand_upwards() 2017-07-21 07:44:58 +02:00
mmu_context.c
mmu_notifier.c mmu-notifier: add clear_young callback 2015-09-10 13:29:01 -07:00
mmzone.c mm: microoptimize zonelist operations 2015-02-11 17:06:02 -08:00
mprotect.c mm, mprotect: flush TLB if potentially racing with a parallel reclaim leaving stale TLB entries 2017-08-11 09:08:50 -07:00
mremap.c mm, mprotect: flush TLB if potentially racing with a parallel reclaim leaving stale TLB entries 2017-08-11 09:08:50 -07:00
msync.c mm/msync: use offset_in_page macro 2015-11-05 19:34:48 -08:00
nobootmem.c mm: page_alloc: pass PFN to __free_pages_bootmem 2015-06-30 19:44:55 -07:00
nommu.c mm/nommu.c: drop unlikely inside BUG_ON() 2015-11-05 19:34:48 -08:00
oom_kill.c mm/oom_kill.c: avoid attempting to kill init sharing same memory 2015-12-12 10:15:34 -08:00
page_alloc.c mm/page_alloc: Remove kernel address exposure in free_reserved_area() 2017-08-11 09:08:47 -07:00
page_counter.c mm: page_counter: let page_counter_try_charge() return bool 2015-11-05 19:34:48 -08:00
page_ext.c mm: introduce idle page tracking 2015-09-10 13:29:01 -07:00
page_idle.c mm: introduce idle page tracking 2015-09-10 13:29:01 -07:00
page_io.c fs: use helper bio_add_page() instead of open coding on bi_io_vec 2015-08-13 12:32:00 -06:00
page_isolation.c mm: fix invalid node in alloc_migrate_target() 2016-04-20 15:41:53 +09:00
page_owner.c mm/page_owner: set correct gfp_mask on page_owner 2015-07-17 16:39:54 -07:00
page-writeback.c writeback: use higher precision calculation in domain_dirty_limits() 2016-07-27 09:47:29 -07:00
pagewalk.c mm/pagewalk.c: prevent positive return value of walk_page_test() from being passed to callers 2015-03-25 16:20:30 -07:00
percpu-km.c
percpu-vm.c
percpu.c percpu: acquire pcpu_lock when updating pcpu_nr_empty_pop_pages 2017-03-26 12:13:20 +02:00
pgtable-generic.c mm,thp: khugepaged: call pte flush at the time of collapse 2016-02-25 12:01:23 -08:00
process_vm_access.c ptrace: use fsuid, fsgid, effective creds for fs access checks 2016-02-25 12:01:16 -08:00
quicklist.c
readahead.c mm, fs: introduce mapping_gfp_constraint() 2015-11-06 17:50:42 -08:00
rmap.c mm, mprotect: flush TLB if potentially racing with a parallel reclaim leaving stale TLB entries 2017-08-11 09:08:50 -07:00
shmem.c tmpfs: fix regression hang in fallocate undo 2016-07-27 09:47:40 -07:00
slab_common.c mm: memcontrol: fix cgroup creation failure after many small jobs 2016-08-16 09:30:51 +02:00
slab.c slab/slub: adjust kmem_cache_alloc_bulk API 2015-11-22 11:58:44 -08:00
slab.h slab/slub: adjust kmem_cache_alloc_bulk API 2015-11-22 11:58:44 -08:00
slob.c slab/slub: adjust kmem_cache_alloc_bulk API 2015-11-22 11:58:44 -08:00
slub.c slub/memcg: cure the brainless abuse of sysfs attributes 2017-06-07 12:06:01 +02:00
sparse-vmemmap.c
sparse.c
swap_cgroup.c mm, swap_cgroup: reschedule when neeed in swap_cgroup_swapoff() 2017-07-05 14:37:15 +02:00
swap_state.c mm: swap: zswap: maybe_preload & refactoring 2015-09-08 15:35:28 -07:00
swap.c mm: make compound_head() robust 2015-11-06 17:50:42 -08:00
swapfile.c swapfile: fix memory corruption via malformed swapfile 2016-11-18 10:48:34 +01:00
truncate.c fs: add i_blocksize() 2017-06-14 13:16:24 +02:00
userfaultfd.c userfaultfd: avoid mmap_sem read recursion in mcopy_atomic 2015-09-04 16:54:41 -07:00
util.c proc: revert /proc/<pid>/maps [stack:TID] annotation 2016-09-15 08:27:46 +02:00
vmacache.c mm/vmacache: inline vmacache_valid_mm() 2015-11-05 19:34:48 -08:00
vmalloc.c mm: vmalloc: don't remove inexistent guard hole in remove_vm_area() 2015-11-20 16:17:32 -08:00
vmpressure.c mm: vmpressure: fix sending wrong events on underflow 2017-03-12 06:37:25 +01:00
vmscan.c mm: fix classzone_idx underflow in shrink_zones() 2017-07-15 11:57:44 +02:00
vmstat.c vmstat: allocate vmstat_wq before it is used 2016-01-08 23:47:54 -08:00
workingset.c mm: workingset: fix crash in shadow node shrinker caused by replace_page_cache_page() 2016-10-28 03:01:34 -04:00
zbud.c mm: zsmalloc: constify struct zs_pool name 2015-11-06 17:50:42 -08:00
zpool.c mm: zsmalloc: constify struct zs_pool name 2015-11-06 17:50:42 -08:00
zsmalloc.c zsmalloc: fix zs_can_compact() integer overflow 2016-05-18 17:06:44 -07:00
zswap.c zswap: disable changing params if init fails 2017-02-09 08:02:45 +01:00