linux/include
Johannes Weiner 8627c7750a mm: memcontrol: fix cgroup creation failure after many small jobs
commit 73f576c04b upstream.

The memory controller has quite a bit of state that usually outlives the
cgroup and pins its CSS until said state disappears.  At the same time
it imposes a 16-bit limit on the CSS ID space to economically store IDs
in the wild.  Consequently, when we use cgroups to contain frequent but
small and short-lived jobs that leave behind some page cache, we quickly
run into the 64k limitations of outstanding CSSs.  Creating a new cgroup
fails with -ENOSPC while there are only a few, or even no user-visible
cgroups in existence.

Although pinning CSSs past cgroup removal is common, there are only two
instances that actually need an ID after a cgroup is deleted: cache
shadow entries and swapout records.

Cache shadow entries reference the ID weakly and can deal with the CSS
having disappeared when it's looked up later.  They pose no hurdle.

Swap-out records do need to pin the css to hierarchically attribute
swapins after the cgroup has been deleted; though the only pages that
remain swapped out after offlining are tmpfs/shmem pages.  And those
references are under the user's control, so they are manageable.

This patch introduces a private 16-bit memcg ID and switches swap and
cache shadow entries over to using that.  This ID can then be recycled
after offlining when the CSS remains pinned only by objects that don't
specifically need it.

This script demonstrates the problem by faulting one cache page in a new
cgroup and deleting it again:

  set -e
  mkdir -p pages
  for x in `seq 128000`; do
    [ $((x % 1000)) -eq 0 ] && echo $x
    mkdir /cgroup/foo
    echo $$ >/cgroup/foo/cgroup.procs
    echo trex >pages/$x
    echo $$ >/cgroup/cgroup.procs
    rmdir /cgroup/foo
  done

When run on an unpatched kernel, we eventually run out of possible IDs
even though there are no visible cgroups:

  [root@ham ~]# ./cssidstress.sh
  [...]
  65000
  mkdir: cannot create directory '/cgroup/foo': No space left on device

After this patch, the IDs get released upon cgroup destruction and the
cache and css objects get released once memory reclaim kicks in.

[hannes@cmpxchg.org: init the IDR]
  Link: http://lkml.kernel.org/r/20160621154601.GA22431@cmpxchg.org
Fixes: b2052564e6 ("mm: memcontrol: continue cache reclaim from offlined groups")
Link: http://lkml.kernel.org/r/20160617162516.GD19084@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reported-by: John Garcia <john.garcia@mesosphere.io>
Reviewed-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Nikolay Borisov <kernel@kyup.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-08-16 09:30:51 +02:00
..
acpi
asm-generic vmlinux.lds: account for destructor sections 2016-08-10 11:49:25 +02:00
clocksource
crypto crypto: af_alg - Allow af_af_alg_release_parent to be called on nokey path 2016-02-17 12:31:03 -08:00
drm drm/ttm: Make ttm_bo_mem_compat available 2016-07-27 09:47:35 -07:00
dt-bindings ARM: DT updates for v4.4 2015-11-10 15:06:26 -08:00
keys
kvm KVM: arm/arm64: arch_timer: Preserve physical dist. active state on LR.active 2015-11-24 18:07:40 +01:00
linux mm: memcontrol: fix cgroup creation failure after many small jobs 2016-08-16 09:30:51 +02:00
math-emu
media videobuf2-core: Check user space planes array in dqbuf 2016-05-04 14:48:50 -07:00
memory
misc
net vxlan, gre, geneve: Set a large MTU on ovs-created tunnel devices 2016-06-24 10:18:18 -07:00
pcmcia
ras
rdma IB/security: Restrict use of the write() interface 2016-05-04 14:48:48 -07:00
rxrpc
scsi scsi: Add intermediate STARGET_REMOVE state to scsi_target_state 2016-06-01 12:15:54 -07:00
soc ARM: SoC driver updates for v4.4 2015-11-10 15:00:03 -08:00
sound ALSA: rawmidi: Make snd_rawmidi_transmit() race-free 2016-02-17 12:30:58 -08:00
target target: Fix WRITE_SAME/DISCARD conversion to linux 512b sectors 2016-03-09 15:34:51 -08:00
trace Merge branch 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/rzhang/linux 2015-11-11 09:03:01 -08:00
uapi uapi glibc compat: fix compilation when !__USE_MISC in glibc 2016-06-24 10:18:17 -07:00
video drm/imx: Match imx-ipuv3-crtc components using device node in platform data 2016-06-07 18:14:37 -07:00
xen xen: Fix page <-> pfn conversion on 32 bit systems 2016-05-11 11:21:14 +02:00
Kbuild