linux/include
Michal Hocko ab5ac90aec mm, hugetlb: do not rely on overcommit limit during migration
hugepage migration relies on __alloc_buddy_huge_page to get a new page.
This has 2 main disadvantages.

1) it doesn't allow to migrate any huge page if the pool is used
   completely which is not an exceptional case as the pool is static and
   unused memory is just wasted.

2) it leads to a weird semantic when migration between two numa nodes
   might increase the pool size of the destination NUMA node while the
   page is in use.  The issue is caused by per NUMA node surplus pages
   tracking (see free_huge_page).

Address both issues by changing the way how we allocate and account
pages allocated for migration.  Those should temporal by definition.  So
we mark them that way (we will abuse page flags in the 3rd page) and
update free_huge_page to free such pages to the page allocator.  Page
migration path then just transfers the temporal status from the new page
to the old one which will be freed on the last reference.  The global
surplus count will never change during this path but we still have to be
careful when migrating a per-node suprlus page.  This is now handled in
move_hugetlb_state which is called from the migration path and it copies
the hugetlb specific page state and fixes up the accounting when needed

Rename __alloc_buddy_huge_page to __alloc_surplus_huge_page to better
reflect its purpose.  The new allocation routine for the migration path
is __alloc_migrate_huge_page.

The user visible effect of this patch is that migrated pages are really
temporal and they travel between NUMA nodes as per the migration
request:

Before migration
  /sys/devices/system/node/node0/hugepages/hugepages-2048kB/free_hugepages:0
  /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages:1
  /sys/devices/system/node/node0/hugepages/hugepages-2048kB/surplus_hugepages:0
  /sys/devices/system/node/node1/hugepages/hugepages-2048kB/free_hugepages:0
  /sys/devices/system/node/node1/hugepages/hugepages-2048kB/nr_hugepages:0
  /sys/devices/system/node/node1/hugepages/hugepages-2048kB/surplus_hugepages:0

After
  /sys/devices/system/node/node0/hugepages/hugepages-2048kB/free_hugepages:0
  /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages:0
  /sys/devices/system/node/node0/hugepages/hugepages-2048kB/surplus_hugepages:0
  /sys/devices/system/node/node1/hugepages/hugepages-2048kB/free_hugepages:0
  /sys/devices/system/node/node1/hugepages/hugepages-2048kB/nr_hugepages:1
  /sys/devices/system/node/node1/hugepages/hugepages-2048kB/surplus_hugepages:0

with the previous implementation, both nodes would have nr_hugepages:1
until the page is freed.

Link: http://lkml.kernel.org/r/20180103093213.26329-4-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Andrea Reale <ar@linux.vnet.ibm.com>
Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Zi Yan <zi.yan@cs.rutgers.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-01-31 17:18:40 -08:00
..
acpi Merge branches 'acpi-gpio', 'acpi-button', 'acpi-battery' and 'acpi-video' 2018-01-18 03:02:16 +01:00
asm-generic mm/thp: remove pmd_huge_split_prepare() 2018-01-31 17:18:38 -08:00
clocksource arm64 updates for 4.15 2017-11-15 10:56:56 -08:00
crypto Merge branch 'misc.poll' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2018-01-30 17:58:07 -08:00
drm Merge branch 'misc.poll' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2018-01-30 17:58:07 -08:00
dt-bindings fixes/cleanups for rc1, non-desktop flags for VR 2017-11-23 21:04:56 -10:00
keys
kvm KVM: arm/arm64: timer: Don't set irq as forwarded if no usable GIC 2017-12-18 10:53:23 +01:00
linux mm, hugetlb: do not rely on overcommit limit during migration 2018-01-31 17:18:40 -08:00
math-emu
media media: annotate ->poll() instances 2017-11-27 16:20:06 -05:00
memory
misc the rest of drivers/*: annotate ->poll() instances 2017-11-28 11:06:58 -05:00
net Merge branch 'misc.poll' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2018-01-30 17:58:07 -08:00
pcmcia
ras
rdma RDMA/restrack: Add general infrastructure to track RDMA resources 2018-01-29 20:21:39 -07:00
scsi First merge window pull request for 4.16 2018-01-31 12:05:10 -08:00
soc We have two changes to the core framework this time around. The first being a 2017-11-17 20:04:24 -08:00
sound Merge branch 'misc.poll' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2018-01-30 17:58:07 -08:00
target Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/nab/target-pending 2017-11-24 19:19:20 -10:00
trace mm: use sc->priority for slab shrink targets 2018-01-31 17:18:36 -08:00
uapi First merge window pull request for 4.16 2018-01-31 12:05:10 -08:00
video fbdev changes for v4.15: 2017-11-20 21:50:24 -10:00
xen xen: fixes for 4.15-rc5 2017-12-22 12:30:10 -08:00