linux/Documentation
David Finkel c6f53ed8f2 mm, memcg: cg2 memory{.swap,}.peak write handlers
Patch series "mm, memcg: cg2 memory{.swap,}.peak write handlers", v7.


This patch (of 2):

Other mechanisms for querying the peak memory usage of either a process or
v1 memory cgroup allow for resetting the high watermark.  Restore parity
with those mechanisms, but with a less racy API.

For example:
 - Any write to memory.max_usage_in_bytes in a cgroup v1 mount resets
   the high watermark.
 - writing "5" to the clear_refs pseudo-file in a processes's proc
   directory resets the peak RSS.

This change is an evolution of a previous patch, which mostly copied the
cgroup v1 behavior, however, there were concerns about races/ownership
issues with a global reset, so instead this change makes the reset
filedescriptor-local.

Writing any non-empty string to the memory.peak and memory.swap.peak
pseudo-files reset the high watermark to the current usage for subsequent
reads through that same FD.

Notably, following Johannes's suggestion, this implementation moves the
O(FDs that have written) behavior onto the FD write(2) path.  Instead, on
the page-allocation path, we simply add one additional watermark to
conditionally bump per-hierarchy level in the page-counter.

Additionally, this takes Longman's suggestion of nesting the
page-charging-path checks for the two watermarks to reduce the number of
common-case comparisons.

This behavior is particularly useful for work scheduling systems that need
to track memory usage of worker processes/cgroups per-work-item.  Since
memory can't be squeezed like CPU can (the OOM-killer has opinions), these
systems need to track the peak memory usage to compute system/container
fullness when binpacking workitems.

Most notably, Vimeo's use-case involves a system that's doing global
binpacking across many Kubernetes pods/containers, and while we can use
PSI for some local decisions about overload, we strive to avoid packing
workloads too tightly in the first place.  To facilitate this, we track
the peak memory usage.  However, since we run with long-lived workers (to
amortize startup costs) we need a way to track the high watermark while a
work-item is executing.  Polling runs the risk of missing short spikes
that last for timescales below the polling interval, and peak memory
tracking at the cgroup level is otherwise perfect for this use-case.

As this data is used to ensure that binpacked work ends up with sufficient
headroom, this use-case mostly avoids the inaccuracies surrounding
reclaimable memory.

Link: https://lkml.kernel.org/r/20240730231304.761942-1-davidf@vimeo.com
Link: https://lkml.kernel.org/r/20240729143743.34236-1-davidf@vimeo.com
Link: https://lkml.kernel.org/r/20240729143743.34236-2-davidf@vimeo.com
Signed-off-by: David Finkel <davidf@vimeo.com>
Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Suggested-by: Waiman Long <longman@redhat.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Michal Koutný <mkoutny@suse.com>
Acked-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Zefan Li <lizefan.x@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-09-01 20:25:53 -07:00
..
ABI powerpc fixes for 6.11 #2 2024-08-17 19:23:02 -07:00
accel
accounting
admin-guide mm, memcg: cg2 memory{.swap,}.peak write handlers 2024-09-01 20:25:53 -07:00
arch Merge patch series "RISC-V: hwprobe: Misaligned scalar perf fix and rename" 2024-08-15 13:12:21 -07:00
block
bpf
cdrom
core-api workqueue: doc: Fix function name, remove markers 2024-08-05 18:33:36 -10:00
cpu-freq
crypto
dev-tools - 875fa64577da ("mm/hugetlb_vmemmap: fix race with speculative PFN 2024-07-21 17:15:46 -07:00
devicetree USB fixes for 6.11-rc6 2024-09-01 07:06:28 +12:00
doc-guide
driver-api thermal: core: Update thermal zone registration documentation 2024-08-02 13:22:37 +02:00
fault-injection
fb
features LoongArch: Add ARCH_HAS_DEBUG_VM_PGTABLE support 2024-07-20 22:40:59 +08:00
filesystems Changes since last update: 2024-08-22 06:06:09 +08:00
firmware_class
firmware-guide
fpga
gpu Documentation/amdgpu: Fix duplicate declaration 2024-07-16 11:45:22 -04:00
hid
hwmon hwmon updates for v6.11-rc1 2024-07-15 17:39:13 -07:00
i2c This release includes significant updates, with the primary 2024-07-13 11:10:54 +02:00
iio
images
infiniband
input
isdn
kbuild Documentation/llvm: turn make command for ccache into code block 2024-08-16 21:34:12 +09:00
kernel-hacking
leds
litmus-tests
livepatch
locking
maintainer docs: maintainer: discourage taking conversations off-list 2024-07-16 11:08:26 -06:00
mhi
misc-devices misc: mrvl-cn10k-dpi: add Octeon CN10K DPI administrative driver 2024-07-10 14:58:29 +02:00
mm - 875fa64577da ("mm/hugetlb_vmemmap: fix race with speculative PFN 2024-07-21 17:15:46 -07:00
netlabel
netlink ethtool: rss: echo the context number back 2024-07-25 16:23:47 -07:00
networking ethtool: rss: echo the context number back 2024-07-25 16:23:47 -07:00
nvdimm
nvme
PCI Merge branch 'pci/misc' 2024-07-19 10:10:33 -05:00
pcmcia
peci
power
process net: drop special comment style 2024-08-23 10:21:02 +01:00
RCU Merge branches 'doc.2024.06.06a', 'fixes.2024.07.04a', 'mb.2024.06.28a', 'nocb.2024.06.03a', 'rcu-tasks.2024.06.06a', 'rcutorture.2024.06.06a' and 'srcu.2024.06.18a' into HEAD 2024-07-04 13:54:17 -07:00
rust Rust changes for v6.11 2024-07-27 13:44:54 -07:00
scheduler docs/sp_SP: Add translation for scheduler/sched-design-CFS.rst 2024-07-09 09:14:33 -06:00
scsi
security
sound
sphinx
sphinx-static
spi
staging
target
tee
timers
tools Documentation/tools/rv: fix document header 2024-07-03 16:36:21 -06:00
trace ftrace: Rewrite of function graph tracer 2024-07-18 13:36:33 -07:00
translations pci-v6.11-changes 2024-07-19 19:03:18 -07:00
usb
userspace-api media: v4l: Fix missing tabular column hint for Y14P format 2024-07-30 08:36:29 +02:00
virt KVM/arm64 fixes for 6.11, round #1 2024-08-13 06:06:27 -04:00
w1
watchdog
wmi platform/x86: msi-wmi-platform: Fix spelling mistakes 2024-07-31 12:37:01 +03:00
.gitignore
atomic_bitops.txt
atomic_t.txt
Changes
CodingStyle
conf.py
docutils.conf
dontdiff
index.rst
Kconfig
Makefile
memory-barriers.txt
SubmittingPatches
subsystem-apis.rst