linux

mirror of https://github.com/torvalds/linux.git synced 2026-06-08 06:25:52 +02:00

Greg Thelen 43f47331a4 mm: writeback: use exact memcg dirty counts

commit `0b3d6e6f2d` upstream.

Since commit `a983b5ebee` ("mm: memcontrol: fix excessive complexity in memory.stat reporting") memcg dirty and writeback counters are managed as:

 1) per-memcg per-cpu values in range of [-32..32]
 2) per-memcg atomic counter

When a per-cpu counter cannot fit in [-32..32] it's flushed to the atomic. Stat readers only check the atomic. Thus readers such as balance_dirty_pages() may see a nontrivial error margin: 32 pages per cpu.

Assuming 100 cpus:
   4k x86 page_size:  13 MiB error per memcg
  64k ppc page_size: 200 MiB error per memcg

Considering that dirty+writeback are used together for some decisions the errors double.

This inaccuracy can lead to undeserved oom kills. One nasty case is when all per-cpu counters hold positive values offsetting an atomic negative value (i.e. per_cpu[*]=32, atomic=n_cpu*-32). balance_dirty_pages() only consults the atomic and does not consider throttling the next n_cpu*32 dirty pages. If the file_lru is in the 13..200 MiB range then there's absolutely no dirty throttling, which burdens vmscan with only dirty+writeback pages thus resorting to oom kill.

It could be argued that tiny containers are not supported, but it's more subtle. It's the amount of space available for file lru that matters. If a container has memory.max-200MiB of non reclaimable memory, then it will also suffer such oom kills on a 100 cpu machine.

The following test reliably ooms without this patch. This patch avoids oom kills.

  $ cat test
  mount -t cgroup2 none /dev/cgroup
  cd /dev/cgroup
  echo +io +memory > cgroup.subtree_control
  mkdir test
  cd test
  echo 10M > memory.max
  (echo $BASHPID > cgroup.procs && exec /memcg-writeback-stress /foo)
  (echo $BASHPID > cgroup.procs && exec dd if=/dev/zero of=/foo bs=2M count=100)

  $ cat memcg-writeback-stress.c
  /*
   * Dirty pages from all but one cpu.
   * Clean pages from the non dirtying cpu.
   * This is to stress per cpu counter imbalance.
   * On a 100 cpu machine:
   * - per memcg per cpu dirty count is 32 pages for each of 99 cpus
   * - per memcg atomic is -99*32 pages
   * - thus the complete dirty limit: sum of all counters 0
   * - balance_dirty_pages() only sees atomic count -99*32 pages, which
   *   it max()s to 0.
   * - So a workload can dirty 99*32 pages before balance_dirty_pages()
   *   cares.
   */
  #define _GNU_SOURCE
  #include <err.h>
  #include <fcntl.h>
  #include <sched.h>
  #include <stdlib.h>
  #include <stdio.h>
  #include <sys/stat.h>
  #include <sys/sysinfo.h>
  #include <sys/types.h>
  #include <unistd.h>

  static char *buf;
  static int bufSize;

  /* Pin the calling process to a single cpu. */
  static void set_affinity(int cpu)
  {
  	cpu_set_t affinity;

  	CPU_ZERO(&affinity);
  	CPU_SET(cpu, &affinity);
  	if (sched_setaffinity(0, sizeof(affinity), &affinity))
  		err(1, "sched_setaffinity");
  }

  /* Dirty 32 pages of output_fd while running on the given cpu. */
  static void dirty_on(int output_fd, int cpu)
  {
  	int i, wrote;

  	set_affinity(cpu);
  	for (i = 0; i < 32; i++) {
  		for (wrote = 0; wrote < bufSize; ) {
  			int ret = write(output_fd, buf+wrote, bufSize-wrote);
  			if (ret == -1)
  				err(1, "write");
  			wrote += ret;
  		}
  	}
  }

  int main(int argc, char **argv)
  {
  	int cpu, flush_cpu = 1, output_fd;
  	const char *output;

  	if (argc != 2)
  		errx(1, "usage: output_file");

  	output = argv[1];
  	bufSize = getpagesize();
  	buf = malloc(getpagesize());
  	if (buf == NULL)
  		errx(1, "malloc failed");

  	output_fd = open(output, O_CREAT|O_RDWR);
  	if (output_fd == -1)
  		err(1, "open(%s)", output);

  	/* Dirty from every cpu except flush_cpu. */
  	for (cpu = 0; cpu < get_nprocs(); cpu++) {
  		if (cpu != flush_cpu)
  			dirty_on(output_fd, cpu);
  	}

  	/* Clean all dirtied pages from flush_cpu. */
  	set_affinity(flush_cpu);
  	if (fsync(output_fd))
  		err(1, "fsync(%s)", output);
  	if (close(output_fd))
  		err(1, "close(%s)", output);
  	free(buf);
  }

Make balance_dirty_pages() and wb_over_bg_thresh() work harder to collect exact per memcg counters. This avoids the aforementioned oom kills.

This does not affect the overhead of memory.stat, which still reads the single atomic counter.

Why not use percpu_counter? memcg already handles cpus going offline, so no need for that overhead from percpu_counter. And the percpu_counter spinlocks are more heavyweight than is required.

It probably also makes sense to use exact dirty and writeback counters in memcg oom reports. But that is saved for later.

Link: http://lkml.kernel.org/r/20190329174609.164344-1-gthelen@google.com
Signed-off-by: Greg Thelen <gthelen@google.com>
Reviewed-by: Roman Gushchin <guro@fb.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: <stable@vger.kernel.org> [4.16+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

2019-04-17 08:38:51 +02:00
..
acpi	ACPICA: Update version to 20180810	2018-08-14 23:49:13 +02:00
asm-generic	x86/unwind/orc: Fix ORC unwind table alignment	2019-03-23 20:10:10 +01:00
clocksource
crypto	crypto: speck - remove Speck	2018-11-13 11:08:46 -08:00
drm	drm: disable uncached DMA optimization for ARM and arm64	2019-03-13 14:02:40 -07:00
dt-bindings	ARM: SoC: late updates	2018-08-25 14:12:36 -07:00
keys	keys: Fix dependency loop between construction record and auth key	2019-03-23 20:09:48 +01:00
kvm	KVM: arm/arm64: vgic: Make vgic_dist->lpi_list_lock a raw_spinlock	2019-03-23 20:09:42 +01:00
linux	mm: writeback: use exact memcg dirty counts	2019-04-17 08:38:51 +02:00
math-emu
media	media: cec: keep track of outstanding transmits	2019-01-09 17:38:46 +01:00
memory
misc
net	vrf: check accept_source_route on the original netdevice	2019-04-17 08:38:42 +02:00
pcmcia	pcmcia: remove long deprecated pcmcia_request_exclusive_irq() function	2018-08-18 12:30:42 -07:00
ras
rdma	IB/rxe: Revise the ib_wr_opcode enum	2018-11-13 11:08:43 -08:00
scsi	scsi: fcoe: make use of fip_mode enum complete	2019-04-05 22:33:04 +02:00
soc	soc: fsl: qbman: add APIs to retrieve the probing status	2018-09-27 15:43:35 -05:00
sound	ALSA: compress: Fix stop handling on compressed capture streams	2019-02-12 19:47:23 +01:00
target	scsi: target/core: Make sure that target_wait_for_sess_cmds() waits long enough	2019-01-26 09:32:38 +01:00
trace	sunrpc: use-after-free in svc_process_common()	2019-01-16 22:04:37 +01:00
uapi	inet_diag: fix reporting cgroup classid and fallback to priority	2019-02-27 10:08:58 +01:00
video	udlfb: handle unplug properly	2019-02-27 10:09:03 +01:00
xen	Revert "xen/balloon: Mark unallocated host memory as UNUSABLE"	2018-12-17 09:24:39 +01:00