Linux kernel source tree
Go to file
Andrey Ryabinin 6b2cdcc8f5 mm: mempolicy: fix THP allocations escaping mempolicy restrictions
commit 3386353406 upstream.

alloc_pages_vma() may try to allocate THP page on the local NUMA node
first:

	page = __alloc_pages_node(hpage_node,
		gfp | __GFP_THISNODE | __GFP_NORETRY, order);

And if the allocation fails it retries allowing remote memory:

	if (!page && (gfp & __GFP_DIRECT_RECLAIM))
    		page = __alloc_pages_node(hpage_node,
					gfp, order);

However, this retry allocation completely ignores memory policy nodemask
allowing allocation to escape restrictions.

The first appearance of this bug seems to be the commit ac5b2c1891
("mm: thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings").

The bug disappeared later in the commit 89c83fb539 ("mm, thp:
consolidate THP gfp handling into alloc_hugepage_direct_gfpmask") and
reappeared again in slightly different form in the commit 76e654cc91
("mm, page_alloc: allow hugepage fallback to remote nodes when
madvised")

Fix this by passing correct nodemask to the __alloc_pages() call.

The demonstration/reproducer of the problem:

    $ mount -oremount,size=4G,huge=always /dev/shm/
    $ echo always > /sys/kernel/mm/transparent_hugepage/defrag
    $ cat mbind_thp.c
    #include <unistd.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <fcntl.h>
    #include <assert.h>
    #include <stdlib.h>
    #include <stdio.h>
    #include <numaif.h>

    #define SIZE 2ULL << 30
    int main(int argc, char **argv)
    {
        int fd;
        unsigned long long i;
        char *addr;
        pid_t pid;
        char buf[100];
        unsigned long nodemask = 1;

        fd = open("/dev/shm/test", O_RDWR|O_CREAT);
        assert(fd > 0);
        assert(ftruncate(fd, SIZE) == 0);

        addr = mmap(NULL, SIZE, PROT_READ|PROT_WRITE,
                           MAP_SHARED, fd, 0);

        assert(mbind(addr, SIZE, MPOL_BIND, &nodemask, 2, MPOL_MF_STRICT|MPOL_MF_MOVE)==0);
        for (i = 0; i < SIZE; i+=4096) {
          addr[i] = 1;
        }
        pid = getpid();
        snprintf(buf, sizeof(buf), "grep shm /proc/%d/numa_maps", pid);
        system(buf);
        sleep(10000);

        return 0;
    }
    $ gcc mbind_thp.c -o mbind_thp -lnuma
    $ numactl -H
    available: 2 nodes (0-1)
    node 0 cpus: 0 2
    node 0 size: 1918 MB
    node 0 free: 1595 MB
    node 1 cpus: 1 3
    node 1 size: 2014 MB
    node 1 free: 1731 MB
    node distances:
    node   0   1
      0:  10  20
      1:  20  10
    $ rm -f /dev/shm/test; taskset -c 0 ./mbind_thp
    7fd970a00000 bind:0 file=/dev/shm/test dirty=524288 active=0 N0=396800 N1=127488 kernelpagesize_kB=4

Link: https://lkml.kernel.org/r/20211208165343.22349-1-arbn@yandex-team.com
Fixes: ac5b2c1891 ("mm: thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings")
Signed-off-by: Andrey Ryabinin <arbn@yandex-team.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-12-29 12:28:58 +01:00
arch ARM: 9169/1: entry: fix Thumb2 bug in iWMMXt exception handling 2021-12-29 12:28:57 +01:00
block iocost: Fix divide-by-zero on donation from low hweight cgroup 2021-12-22 09:32:48 +01:00
certs certs: Add support for using elliptic curve keys for signing modules 2021-08-23 19:55:42 +03:00
crypto crypto: pcrypt - Delay write to padata->info 2021-11-18 19:16:44 +01:00
Documentation ALSA: hda/realtek: Add new alc285-hp-amp-init model 2021-12-29 12:28:51 +01:00
drivers mmc: mmci: stm32: clear DLYB_CR after sending tuning command 2021-12-29 12:28:57 +01:00
fs ksmbd: disable SMB2_GLOBAL_CAP_ENCRYPTION for SMB 3.1.1 2021-12-29 12:28:57 +01:00
include tee: handle lookup of shm with reference count 0 2021-12-29 12:28:54 +01:00
init init: make unknown command line param message clearer 2021-11-18 19:17:11 +01:00
ipc shm: extend forced shm destroy to support objects from several IPC nses 2021-11-25 09:48:42 +01:00
kernel kernel/crash_core: suppress unknown crashkernel parameter warning 2021-12-29 12:28:49 +01:00
lib siphash: use _unaligned version by default 2021-12-08 09:04:47 +01:00
LICENSES LICENSES/dual/CC-BY-4.0: Git rid of "smart quotes" 2021-07-15 06:31:24 -06:00
mm mm: mempolicy: fix THP allocations escaping mempolicy restrictions 2021-12-29 12:28:58 +01:00
net mac80211: fix locking in ieee80211_start_ap error path 2021-12-29 12:28:58 +01:00
samples samples/bpf: Fix incorrect use of strlen in xdp_redirect_cpu 2021-11-25 09:48:33 +01:00
scripts recordmcount.pl: look for jgnop instruction as well as bcrl on s390 2021-12-22 09:32:36 +01:00
security selinux: fix sleeping function called from invalid context 2021-12-22 09:32:47 +01:00
sound ASoC: tegra: Restore headphones jack name on Nyan Big 2021-12-29 12:28:52 +01:00
tools selftests: KVM: Fix non-x86 compiling 2021-12-29 12:28:37 +01:00
usr .gitignore: prefix local generated files with a slash 2021-05-02 00:43:35 +09:00
virt KVM: downgrade two BUG_ONs to WARN_ON_ONCE 2021-12-22 09:32:34 +01:00
.clang-format clang-format: Update with the latest for_each macro list 2021-05-12 23:32:39 +02:00
.cocciconfig
.get_maintainer.ignore
.gitattributes
.gitignore .gitignore: ignore only top-level modules.builtin 2021-05-02 00:43:35 +09:00
.mailmap mailmap: add Andrej Shadura 2021-10-18 20:22:03 -10:00
COPYING
CREDITS MAINTAINERS: Move Daniel Drake to credits 2021-09-21 08:34:58 +03:00
Kbuild
Kconfig kbuild: ensure full rebuild when the compiler is updated 2020-05-12 13:28:33 +09:00
MAINTAINERS drm fixes for 5.15 final 2021-10-28 12:17:01 -07:00
Makefile Linux 5.15.11 2021-12-22 09:32:52 +01:00
README

Linux kernel
============

There are several guides for kernel developers and users. These guides can
be rendered in a number of formats, like HTML and PDF. Please read
Documentation/admin-guide/README.rst first.

In order to build the documentation, use ``make htmldocs`` or
``make pdfdocs``.  The formatted documentation can also be read online at:

    https://www.kernel.org/doc/html/latest/

There are various text files in the Documentation/ subdirectory,
several of them using the Restructured Text markup notation.

Please read the Documentation/process/changes.rst file, as it contains the
requirements for building and running the kernel, and information about
the problems which may result by upgrading your kernel.