linux/mm
Nishanth Aravamudan 59d76fa1c3 bootmem/sparsemem: remove limit constraint in alloc_bootmem_section
commit f5bf18fa22 upstream.

While testing AMS (Active Memory Sharing) / CMO (Cooperative Memory
Overcommit) on powerpc, we tripped the following:

  kernel BUG at mm/bootmem.c:483!
  cpu 0x0: Vector: 700 (Program Check) at [c000000000c03940]
      pc: c000000000a62bd8: .alloc_bootmem_core+0x90/0x39c
      lr: c000000000a64bcc: .sparse_early_usemaps_alloc_node+0x84/0x29c
      sp: c000000000c03bc0
     msr: 8000000000021032
    current = 0xc000000000b0cce0
    paca    = 0xc000000001d80000
      pid   = 0, comm = swapper
  kernel BUG at mm/bootmem.c:483!
  enter ? for help
  [c000000000c03c80] c000000000a64bcc
  .sparse_early_usemaps_alloc_node+0x84/0x29c
  [c000000000c03d50] c000000000a64f10 .sparse_init+0x12c/0x28c
  [c000000000c03e20] c000000000a474f4 .setup_arch+0x20c/0x294
  [c000000000c03ee0] c000000000a4079c .start_kernel+0xb4/0x460
  [c000000000c03f90] c000000000009670 .start_here_common+0x1c/0x2c

This is

        BUG_ON(limit && goal + size > limit);

and after some debugging, it seems that

	goal = 0x7ffff000000
	limit = 0x80000000000

and sparse_early_usemaps_alloc_node ->
sparse_early_usemaps_alloc_pgdat_section calls

	return alloc_bootmem_section(usemap_size() * count, section_nr);

This is on a system with 8TB available via the AMS pool, and as a quirk
of AMS in firmware, all of that memory shows up in node 0.  So, we end
up with an allocation that will fail the goal/limit constraints.

In theory, we could "fall-back" to alloc_bootmem_node() in
sparse_early_usemaps_alloc_node(), but since we actually have HOTREMOVE
defined, we'll BUG_ON() instead.  A simple solution appears to be to
unconditionally remove the limit condition in alloc_bootmem_section,
meaning allocations are allowed to cross section boundaries (necessary
for systems of this size).

Johannes Weiner pointed out that if alloc_bootmem_section() no longer
guarantees section-locality, we need check_usemap_section_nr() to print
possible cross-dependencies between node descriptors and the usemaps
allocated through it.  That makes the two loops in
sparse_early_usemaps_alloc_node() identical, so re-factor the code a
bit.

[akpm@linux-foundation.org: code simplification]
Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Anton Blanchard <anton@au1.ibm.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Ben Herrenschmidt <benh@kernel.crashing.org>
Cc: Robert Jennings <rcj@linux.vnet.ibm.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-04-02 09:27:11 -07:00
..
backing-dev.c backing-dev: ensure wakeup_timer is deleted 2011-11-21 14:31:25 -08:00
bootmem.c bootmem/sparsemem: remove limit constraint in alloc_bootmem_section 2012-04-02 09:27:11 -07:00
bounce.c bounce: call flush_dcache_page() after bounce_copy_vec() 2010-09-09 18:57:25 -07:00
cleancache.c mm: cleancache core ops functions and config 2011-05-26 10:01:36 -06:00
compaction.c mm: compaction: check for overlapping nodes during isolation for migration 2012-02-13 11:06:11 -08:00
debug-pagealloc.c
dmapool.c mm/dmapool.c: use TASK_UNINTERRUPTIBLE in dma_pool_alloc() 2011-01-13 17:32:48 -08:00
fadvise.c
failslab.c
filemap_xip.c mm/filemap_xip.c: fix race condition in xip_file_fault() 2012-02-13 11:06:07 -08:00
filemap.c readahead: fix pipeline break caused by block plug 2012-02-13 11:06:04 -08:00
fremap.c mm: don't access vm_flags as 'int' 2011-05-26 09:20:31 -07:00
highmem.c mm,x86: fix kmap_atomic_push vs ioremap_32.c 2010-10-27 18:03:05 -07:00
huge_memory.c mm: thp: fix BUG on mm->nr_ptes 2012-03-12 10:32:56 -07:00
hugetlb.c mm: hugetlb: fix non-atomic enqueue of huge page 2012-01-06 14:14:00 -08:00
hwpoison-inject.c Fix common misspellings 2011-03-31 11:26:23 -03:00
init-mm.c mm: convert mm->cpu_vm_cpumask into cpumask_var_t 2011-05-25 08:39:21 -07:00
internal.h mm: thp: tail page refcounting fix 2011-11-11 09:36:29 -08:00
Kconfig mm: cleancache core ops functions and config 2011-05-26 10:01:36 -06:00
Kconfig.debug mm: debug-pagealloc: fix kconfig dependency warning 2011-03-22 17:44:02 -07:00
kmemcheck.c
kmemleak-test.c kmemleak: remove memset by using kzalloc 2011-01-27 18:31:51 +00:00
kmemleak.c kmemleak: Do not return a pointer to an object that kmemleak did not get 2011-05-19 17:35:28 +01:00
ksm.c ksm: fix NULL pointer dereference in scan_get_next_rmap_item() 2011-06-15 20:04:02 -07:00
maccess.c maccess,probe_kernel: Make write/read src const void * 2011-05-25 19:56:23 -04:00
madvise.c thp: khugepaged: make khugepaged aware about madvise 2011-01-13 17:32:47 -08:00
Makefile mm: cleancache core ops functions and config 2011-05-26 10:01:36 -06:00
memblock.c mm/memblock: properly handle overlaps and fix error path 2011-03-22 17:44:09 -07:00
memcontrol.c mm: thp: fix pmd_bad() triggering in code paths holding mmap_sem read mode 2012-04-02 09:27:10 -07:00
memory_hotplug.c mm, hotplug: protect zonelist building with zonelists_mutex 2011-06-22 21:06:48 -07:00
memory-failure.c mm/memory-failure.c: fix spinlock vs mutex order 2011-06-27 18:00:13 -07:00
memory.c mm: thp: fix pmd_bad() triggering in code paths holding mmap_sem read mode 2012-04-02 09:27:10 -07:00
mempolicy.c mm: thp: fix pmd_bad() triggering in code paths holding mmap_sem read mode 2012-04-02 09:27:10 -07:00
mempool.c
migrate.c mm: fix race between mremap and removing migration entry 2011-10-25 07:10:17 +02:00
mincore.c mm: thp: fix pmd_bad() triggering in code paths holding mmap_sem read mode 2012-04-02 09:27:10 -07:00
mlock.c mm: don't access vm_flags as 'int' 2011-05-26 09:20:31 -07:00
mm_init.c
mmap.c mm: get rid of the most spurious find_vma_prev() users 2011-06-16 00:35:09 -07:00
mmu_context.c
mmu_notifier.c thp: mmu_notifier_test_young 2011-01-13 17:32:46 -08:00
mmzone.c mm: page allocator: adjust the per-cpu counter threshold when memory is low 2011-01-13 17:32:31 -08:00
mprotect.c thp: mprotect: transparent huge page support 2011-01-13 17:32:44 -08:00
mremap.c mm: Convert i_mmap_lock to a mutex 2011-05-25 08:39:18 -07:00
msync.c sanitize vfs_fsync calling conventions 2010-05-21 18:31:21 -04:00
nobootmem.c memblock/nobootmem: remove unneeded code from alloc_bootmem_node_high() 2011-05-25 08:39:31 -07:00
nommu.c NOMMU: Don't need to clear vm_mm when deleting a VMA 2012-03-12 10:32:56 -07:00
oom_kill.c oom: fix integer overflow of points in oom_badness 2012-01-06 14:13:51 -08:00
page_alloc.c mm: fix NULL ptr dereference in __count_immobile_pages 2012-01-25 17:25:05 -08:00
page_cgroup.c memcg: fix init_page_cgroup nid with sparsemem 2011-06-15 20:04:01 -07:00
page_io.c block: kill off REQ_UNPLUG 2011-03-10 08:52:27 +01:00
page_isolation.c mm: page_isolation: codeclean fix comment and rm unneeded val init 2010-10-26 16:52:11 -07:00
page-writeback.c writeback: introduce .tagged_writepages for the WB_SYNC_NONE sync stage 2011-10-03 11:40:43 -07:00
pagewalk.c mm: thp: fix pmd_bad() triggering in code paths holding mmap_sem read mode 2012-04-02 09:27:10 -07:00
percpu-km.c percpu: clear memory allocated with the km allocator 2010-10-02 10:28:42 +03:00
percpu-vm.c percpu: fix chunk range calculation 2011-12-21 12:57:37 -08:00
percpu.c percpu: fix per_cpu_ptr_to_phys() handling of non-page-aligned addresses 2012-01-06 14:13:50 -08:00
pgtable-generic.c mm/pgtable-generic.c: fix CONFIG_SWAP=n build 2011-01-26 10:49:58 +10:00
prio_tree.c sanitize <linux/prefetch.h> usage 2011-05-20 12:50:29 -07:00
quicklist.c
readahead.c readahead: readahead page allocations are OK to fail 2011-05-25 08:39:25 -07:00
rmap.c mm/memory-failure.c: fix spinlock vs mutex order 2011-06-27 18:00:13 -07:00
shmem.c tmpfs: add shmem_read_mapping_page_gfp 2011-06-27 18:00:12 -07:00
slab.c SLAB: Record actual last user of freed objects. 2011-06-03 19:33:50 +03:00
slob.c mm: Remove support for kmem_cache_name() 2011-01-23 21:00:05 +02:00
slub.c slub: fix a possible memleak in __slab_alloc() 2012-02-20 12:48:14 -08:00
sparse-vmemmap.c tree-wide: fix comment/printk typos 2010-11-01 15:38:34 -04:00
sparse.c bootmem/sparsemem: remove limit constraint in alloc_bootmem_section 2012-04-02 09:27:11 -07:00
swap_state.c block: remove per-queue plugging 2011-03-10 08:52:07 +01:00
swap.c mm: fix UP THP spin_is_locked BUGs 2012-02-13 11:06:11 -08:00
swapfile.c mm: thp: fix pmd_bad() triggering in code paths holding mmap_sem read mode 2012-04-02 09:27:10 -07:00
thrash.c vmscan: implement swap token priority aging 2011-06-15 20:03:59 -07:00
truncate.c mm: fix assertion mapping->nrpages == 0 in end_writeback() 2011-06-27 18:00:13 -07:00
util.c mm: nommu: sort mm->mmap list properly 2011-05-25 08:39:05 -07:00
vmalloc.c mm: vmalloc: check for page allocation failure before vmlist insertion 2011-12-21 12:57:36 -08:00
vmscan.c memcg: fix vmscan count in small memcgs 2011-10-03 11:41:08 -07:00
vmstat.c mm, mem-hotplug: update pcp->stat_threshold when memory hotplug occur 2011-05-25 08:39:09 -07:00