linux/mm
Johannes Weiner b52b7b5a5c mm: filemap: don't plant shadow entries without radix tree node
commit d3798ae8c6 upstream.

When the underflow checks were added to workingset_node_shadow_dec(),
they triggered immediately:

  kernel BUG at ./include/linux/swap.h:276!
  invalid opcode: 0000 [#1] SMP
  Modules linked in: isofs usb_storage fuse xt_CHECKSUM ipt_MASQUERADE nf_nat_masquerade_ipv4 tun nf_conntrack_netbios_ns nf_conntrack_broadcast ip6t_REJECT nf_reject_ipv6
   soundcore wmi acpi_als pinctrl_sunrisepoint kfifo_buf tpm_tis industrialio acpi_pad pinctrl_intel tpm_tis_core tpm nfsd auth_rpcgss nfs_acl lockd grace sunrpc dm_crypt
  CPU: 0 PID: 20929 Comm: blkid Not tainted 4.8.0-rc8-00087-gbe67d60ba944 #1
  Hardware name: System manufacturer System Product Name/Z170-K, BIOS 1803 05/06/2016
  task: ffff8faa93ecd940 task.stack: ffff8faa7f478000
  RIP: page_cache_tree_insert+0xf1/0x100
  Call Trace:
    __add_to_page_cache_locked+0x12e/0x270
    add_to_page_cache_lru+0x4e/0xe0
    mpage_readpages+0x112/0x1d0
    blkdev_readpages+0x1d/0x20
    __do_page_cache_readahead+0x1ad/0x290
    force_page_cache_readahead+0xaa/0x100
    page_cache_sync_readahead+0x3f/0x50
    generic_file_read_iter+0x5af/0x740
    blkdev_read_iter+0x35/0x40
    __vfs_read+0xe1/0x130
    vfs_read+0x96/0x130
    SyS_read+0x55/0xc0
    entry_SYSCALL_64_fastpath+0x13/0x8f
  Code: 03 00 48 8b 5d d8 65 48 33 1c 25 28 00 00 00 44 89 e8 75 19 48 83 c4 18 5b 41 5c 41 5d 41 5e 5d c3 0f 0b 41 bd ef ff ff ff eb d7 <0f> 0b e8 88 68 ef ff 0f 1f 84 00
  RIP  page_cache_tree_insert+0xf1/0x100

This is a long-standing bug in the way shadow entries are accounted in
the radix tree nodes. The shrinker needs to know when radix tree nodes
contain only shadow entries, no pages, so node->count is split in half
to count shadows in the upper bits and pages in the lower bits.

Unfortunately, the radix tree implementation doesn't know of this and
assumes all entries are in node->count. When there is a shadow entry
directly in root->rnode and the tree is later extended, the radix tree
implementation will copy that entry into the new node and and bump its
node->count, i.e. increases the page count bits. Once the shadow gets
removed and we subtract from the upper counter, node->count underflows
and triggers the warning. Afterwards, without node->count reaching 0
again, the radix tree node is leaked.

Limit shadow entries to when we have actual radix tree nodes and can
count them properly. That means we lose the ability to detect refaults
from files that had only the first page faulted in at eviction time.

Fixes: 449dd6984d ("mm: keep page cache radix tree nodes in check")
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reported-and-tested-by: Linus Torvalds <torvalds@linux-foundation.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-10-28 03:01:32 -04:00
..
kasan kasan: fix kmemleak false-positive in kasan_module_alloc() 2015-11-20 16:17:32 -08:00
backing-dev.c block: fix bdi vs gendisk lifetime mismatch 2016-08-20 18:09:24 +02:00
balloon_compaction.c virtio_balloon: fix race between migration and ballooning 2016-03-03 15:07:18 -08:00
bootmem.c
cleancache.c
cma_debug.c
cma.c mm/cma.c: suppress warning 2015-11-05 19:34:48 -08:00
cma.h
compaction.c mm, compaction: prevent VM_BUG_ON when terminating freeing scanner 2016-08-10 11:49:25 +02:00
debug-pagealloc.c
debug.c mm: make compound_head() robust 2015-11-06 17:50:42 -08:00
dmapool.c mm, page_alloc: distinguish between being unable to sleep, unwilling to sleep and avoiding waking kswapd 2015-11-06 17:50:42 -08:00
early_ioremap.c mm/early_ioremap: use offset_in_page macro 2015-11-05 19:34:48 -08:00
fadvise.c
failslab.c mm, page_alloc: rename __GFP_WAIT to __GFP_RECLAIM 2015-11-06 17:50:42 -08:00
filemap.c mm: filemap: don't plant shadow entries without radix tree node 2016-10-28 03:01:32 -04:00
frame_vector.c mm: fix docbook comment for get_vaddr_frames() 2015-11-05 19:34:48 -08:00
frontswap.c
gup.c mm: remove gup_flags FOLL_WRITE games from __get_user_pages() 2016-10-20 10:00:47 +02:00
highmem.c
huge_memory.c mm/huge_memory: replace VM_NO_THP VM_BUG_ON with actual VMA check 2016-05-04 14:48:49 -07:00
hugetlb_cgroup.c mm: make compound_head() robust 2015-11-06 17:50:42 -08:00
hugetlb.c hugetlb: fix nr_pmds accounting with shared page tables 2016-09-07 08:32:35 +02:00
hwpoison-inject.c hwpoison: use page_cgroup_ino for filtering by memcg 2015-09-10 13:29:01 -07:00
init-mm.c
internal.h mm, sl[au]b: add __GFP_ATOMIC to the GFP reclaim mask 2016-08-10 11:49:25 +02:00
interval_tree.c
Kconfig mm: make compound_head() robust 2015-11-06 17:50:42 -08:00
Kconfig.debug
kmemcheck.c
kmemleak-test.c
kmemleak.c mm/kmemleak.c: remove unneeded initialization of object to NULL 2015-11-05 19:34:48 -08:00
ksm.c mm,ksm: fix endless looping in allocating memory when ksm enable 2016-10-07 15:23:40 +02:00
list_lru.c memcg: simplify and inline __mem_cgroup_from_kmem 2015-11-05 19:34:48 -08:00
maccess.c mm/maccess.c: actually return -EFAULT from strncpy_from_unsafe 2015-11-05 19:34:48 -08:00
madvise.c
Makefile media updates for v4.3-rc1 2015-09-11 16:42:39 -07:00
memblock.c mm/memblock: make memblock_remove_range() static 2015-11-05 19:34:48 -08:00
memcontrol.c mm: memcontrol: fix memcg id ref counter on swap charge move 2016-08-16 09:30:51 +02:00
memory_hotplug.c mm/memory_hotplug.c: check for missing sections in test_pages_in_a_zone() 2015-12-29 17:45:49 -08:00
memory-failure.c mm: soft-offline: check return value in second __get_any_page() call 2016-02-25 12:01:21 -08:00
memory.c numa: fix /proc/<pid>/numa_maps for THP 2016-05-04 14:48:49 -07:00
mempolicy.c
mempool.c mm, page_alloc: distinguish between being unable to sleep, unwilling to sleep and avoiding waking kswapd 2015-11-06 17:50:42 -08:00
memtest.c
migrate.c mm: Export migrate_page_move_mapping and migrate_page_copy 2016-07-27 09:47:31 -07:00
mincore.c mm/mincore: use offset_in_page macro 2015-11-05 19:34:48 -08:00
mlock.c mm: fix mlock accouting 2016-02-25 12:01:21 -08:00
mm_init.c
mmap.c mm: fix regression in remap_file_pages() emulation 2016-02-25 12:01:21 -08:00
mmu_context.c
mmu_notifier.c mmu-notifier: add clear_young callback 2015-09-10 13:29:01 -07:00
mmzone.c
mprotect.c
mremap.c mm/mremap: use offset_in_page macro 2015-11-05 19:34:48 -08:00
msync.c mm/msync: use offset_in_page macro 2015-11-05 19:34:48 -08:00
nobootmem.c
nommu.c mm/nommu.c: drop unlikely inside BUG_ON() 2015-11-05 19:34:48 -08:00
oom_kill.c mm/oom_kill.c: avoid attempting to kill init sharing same memory 2015-12-12 10:15:34 -08:00
page_alloc.c mm, meminit: ensure node is online before checking whether pages are uninitialised 2016-08-10 11:49:25 +02:00
page_counter.c mm: page_counter: let page_counter_try_charge() return bool 2015-11-05 19:34:48 -08:00
page_ext.c mm: introduce idle page tracking 2015-09-10 13:29:01 -07:00
page_idle.c mm: introduce idle page tracking 2015-09-10 13:29:01 -07:00
page_io.c
page_isolation.c mm: fix invalid node in alloc_migrate_target() 2016-04-20 15:41:53 +09:00
page_owner.c
page-writeback.c writeback: use higher precision calculation in domain_dirty_limits() 2016-07-27 09:47:29 -07:00
pagewalk.c
percpu-km.c
percpu-vm.c
percpu.c percpu: fix synchronization between synchronous map extension and chunk destruction 2016-07-27 09:47:33 -07:00
pgtable-generic.c mm,thp: khugepaged: call pte flush at the time of collapse 2016-02-25 12:01:23 -08:00
process_vm_access.c ptrace: use fsuid, fsgid, effective creds for fs access checks 2016-02-25 12:01:16 -08:00
quicklist.c
readahead.c mm, fs: introduce mapping_gfp_constraint() 2015-11-06 17:50:42 -08:00
rmap.c mm: page migration use migration entry for swapcache too 2015-11-05 19:34:48 -08:00
shmem.c tmpfs: fix regression hang in fallocate undo 2016-07-27 09:47:40 -07:00
slab_common.c mm: memcontrol: fix cgroup creation failure after many small jobs 2016-08-16 09:30:51 +02:00
slab.c slab/slub: adjust kmem_cache_alloc_bulk API 2015-11-22 11:58:44 -08:00
slab.h slab/slub: adjust kmem_cache_alloc_bulk API 2015-11-22 11:58:44 -08:00
slob.c slab/slub: adjust kmem_cache_alloc_bulk API 2015-11-22 11:58:44 -08:00
slub.c slub: clean up code for kmem cgroup support to kmem_cache_free_bulk 2016-05-04 14:48:49 -07:00
sparse-vmemmap.c
sparse.c
swap_cgroup.c
swap_state.c mm: swap: zswap: maybe_preload & refactoring 2015-09-08 15:35:28 -07:00
swap.c mm: make compound_head() robust 2015-11-06 17:50:42 -08:00
swapfile.c
truncate.c
userfaultfd.c
util.c proc: revert /proc/<pid>/maps [stack:TID] annotation 2016-09-15 08:27:46 +02:00
vmacache.c mm/vmacache: inline vmacache_valid_mm() 2015-11-05 19:34:48 -08:00
vmalloc.c mm: vmalloc: don't remove inexistent guard hole in remove_vm_area() 2015-11-20 16:17:32 -08:00
vmpressure.c
vmscan.c mm: delete unnecessary and unsafe init_tlb_ubc() 2016-09-30 10:18:38 +02:00
vmstat.c vmstat: allocate vmstat_wq before it is used 2016-01-08 23:47:54 -08:00
workingset.c
zbud.c mm: zsmalloc: constify struct zs_pool name 2015-11-06 17:50:42 -08:00
zpool.c mm: zsmalloc: constify struct zs_pool name 2015-11-06 17:50:42 -08:00
zsmalloc.c zsmalloc: fix zs_can_compact() integer overflow 2016-05-18 17:06:44 -07:00
zswap.c mm/zswap: provide unique zpool name 2016-05-11 11:21:14 +02:00