linux/kernel
Kairui Song 3697615914 mm, swap: cleanup swap entry management workflow
The current swap entry allocation/freeing workflow has never had a clear
definition.  This makes it hard to debug or add new optimizations.

This commit introduces a proper definition of how swap entries would be
allocated and freed.  Now, most operations are folio based, so they will
never exceed one swap cluster, and we now have a cleaner border between
swap and the rest of mm, making it much easier to follow and debug,
especially with new added sanity checks.  Also making more optimization
possible.

Swap entry will be mostly freed and free with a folio bound.  The folio
lock will be useful for resolving many swap related races.

Now swap allocation (except hibernation) always starts with a folio in the
swap cache, and gets duped/freed protected by the folio lock:

- folio_alloc_swap() - The only allocation entry point now.
  Context: The folio must be locked.
  This allocates one or a set of continuous swap slots for a folio and
  binds them to the folio by adding the folio to the swap cache. The
  swap slots' swap count start with zero value.

- folio_dup_swap() - Increase the swap count of one or more entries.
  Context: The folio must be locked and in the swap cache. For now, the
  caller still has to lock the new swap entry owner (e.g., PTL).
  This increases the ref count of swap entries allocated to a folio.
  Newly allocated swap slots' count has to be increased by this helper
  as the folio got unmapped (and swap entries got installed).

- folio_put_swap() - Decrease the swap count of one or more entries.
  Context: The folio must be locked and in the swap cache. For now, the
  caller still has to lock the new swap entry owner (e.g., PTL).
  This decreases the ref count of swap entries allocated to a folio.
  Typically, swapin will decrease the swap count as the folio got
  installed back and the swap entry got uninstalled

  This won't remove the folio from the swap cache and free the
  slot. Lazy freeing of swap cache is helpful for reducing IO.
  There is already a folio_free_swap() for immediate cache reclaim.
  This part could be further optimized later.

The above locking constraints could be further relaxed when the swap table
is fully implemented.  Currently dup still needs the caller to lock the
swap entry container (e.g.  PTL), or a concurrent zap may underflow the
swap count.

Some swap users need to interact with swap count without involving folio
(e.g.  forking/zapping the page table or mapping truncate without swapin).
In such cases, the caller has to ensure there is no race condition on
whatever owns the swap count and use the below helpers:

- swap_put_entries_direct() - Decrease the swap count directly.
  Context: The caller must lock whatever is referencing the slots to
  avoid a race.

  Typically the page table zapping or shmem mapping truncate will need
  to free swap slots directly. If a slot is cached (has a folio bound),
  this will also try to release the swap cache.

- swap_dup_entry_direct() - Increase the swap count directly.
  Context: The caller must lock whatever is referencing the entries to
  avoid race, and the entries must already have a swap count > 1.

  Typically, forking will need to copy the page table and hence needs to
  increase the swap count of the entries in the table. The page table is
  locked while referencing the swap entries, so the entries all have a
  swap count > 1 and can't be freed.

Hibernation subsystem is a bit different, so two special wrappers are here:

- swap_alloc_hibernation_slot() - Allocate one entry from one device.
- swap_free_hibernation_slot() - Free one entry allocated by the above
  helper.

All hibernation entries are exclusive to the hibernation subsystem and
should not interact with ordinary swap routines.

By separating the workflows, it will be possible to bind folio more
tightly with swap cache and get rid of the SWAP_HAS_CACHE as a temporary
pin.

This commit should not introduce any behavior change

[kasong@tencent.com: fix leak, per Chris Mason.  Remove WARN_ON, per Lai Yi]
  Link: https://lkml.kernel.org/r/CAMgjq7AUz10uETVm8ozDWcB3XohkOqf0i33KGrAquvEVvfp5cg@mail.gmail.com
[ryncsn@gmail.com: fix KSM copy pages for swapoff, per Chris]
  Link: https://lkml.kernel.org/r/aXxkANcET3l2Xu6J@KASONG-MC4
Link: https://lkml.kernel.org/r/20251220-swap-table-p2-v5-14-8862a265a033@tencent.com
Signed-off-by: Kairui Song <kasong@tencent.com>
Signed-off-by: Kairui Song <ryncsn@gmail.com>
Acked-by: Rafael J. Wysocki (Intel) <rafael@kernel.org>
Reviewed-by: Baoquan He <bhe@redhat.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Chris Li <chrisl@kernel.org>
Cc: Nhat Pham <nphamcs@gmail.com>
Cc: Yosry Ahmed <yosry.ahmed@linux.dev>
Cc: Deepanshu Kartikey <kartikey406@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kairui Song <ryncsn@gmail.com>
Cc: Chris Mason <clm@fb.com>
Cc: Chris Mason <clm@meta.com>
Cc: Lai Yi <yi1.lai@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-31 14:22:56 -08:00
..
bpf bpf: Reject BPF_MAP_TYPE_INSN_ARRAY in check_reg_const_str() 2026-01-07 19:03:46 -08:00
cgroup cgroup: use nodes_and() output where appropriate 2026-01-26 20:02:37 -08:00
configs hung_task: panic when there are more than N hung tasks at the same time 2025-11-12 10:00:14 -08:00
debug kdb: Adapt kdb_msg_write to work with NBCON consoles 2025-10-24 12:56:20 +02:00
dma dma/pool: Avoid allocating redundant pools 2026-01-14 11:00:00 +01:00
entry rseq: Switch to TIF_RSEQ if supported 2025-11-04 08:35:37 +01:00
events Fix perf swevent hrtimer deinit regression. 2026-01-11 06:55:27 -10:00
futex Futex changes for v6.19: 2025-12-10 17:21:30 +09:00
gcov gcov: add support for GCC 15 2025-11-09 21:19:44 -08:00
irq treewide: Update email address 2026-01-11 06:09:11 -10:00
kcsan Kernel Concurrency Sanitizer (KCSAN) updates for v6.18 2025-10-02 08:31:44 -07:00
livepatch livepatching changes for 6.19 2025-12-03 13:46:48 -08:00
liveupdate kho: kho_preserve_vmalloc(): don't return 0 when ENOMEM 2026-01-26 19:03:48 -08:00
locking RCU pull request for v6.19 2025-12-03 12:18:07 -08:00
module kernel: modules: Add SPDX license identifier to kmod.c 2026-01-15 16:58:28 -08:00
power mm, swap: cleanup swap entry management workflow 2026-01-31 14:22:56 -08:00
printk printk fixup for 6.19 rc6 2026-01-16 09:46:59 -08:00
rcu RCU pull request for v6.19 2025-12-03 12:18:07 -08:00
sched sched/deadline: Use ENQUEUE_MOVE to allow priority change 2026-01-15 21:57:53 +01:00
time hrtimer: Fix softirq base check in update_needs_ipi() 2026-01-13 11:04:41 +01:00
trace ftrace: Do not over-allocate ftrace memory 2026-01-15 10:17:53 -05:00
unwind unwind_user/x86: Teach FP unwind about start of function 2025-10-29 10:29:58 +01:00
.gitignore kheaders: rebuild kheaders_data.tar.xz when a file is modified within a minute 2025-06-24 20:30:37 +09:00
acct.c act: use credential guards in acct_write_process() 2025-11-04 12:36:49 +01:00
async.c
audit_fsnotify.c VFS/audit: introduce kern_path_parent() for audit 2025-09-23 12:37:35 +02:00
audit_tree.c mount-related stuff for this cycle 2025-10-03 10:19:44 -07:00
audit_watch.c VFS/audit: introduce kern_path_parent() for audit 2025-09-23 12:37:35 +02:00
audit.c audit: fix skb leak when audit rate limit is exceeded 2025-09-10 19:55:00 -04:00
audit.h audit: fix comment misindentation in audit.h 2025-10-22 19:28:06 -04:00
auditfilter.c audit: Use kzalloc() instead of kmalloc()/memset() in audit_krule_to_data() 2025-11-07 16:38:34 -05:00
auditsc.c audit: merge loops in __audit_inode_child() 2025-11-07 16:50:42 -05:00
backtracetest.c
bounds.c x86/asm: Remove ANNOTATE_DATA_SPECIAL usage 2025-12-03 16:53:19 +01:00
capability.c capability: Remove unused has_capability 2025-03-07 22:03:09 -06:00
cfi.c cfi: Move BPF CFI types and helpers to generic code 2025-07-31 18:23:53 -07:00
compat.c
configs.c
context_tracking.c context_tracking: Make RCU watch ct_kernel_exit_state() warning 2025-03-04 18:44:29 -08:00
cpu_pm.c syscore: Pass context data to callbacks 2025-11-14 10:01:52 +01:00
cpu.c cpu: Make atomic hotplug callbacks run with interrupts disabled on UP 2025-12-10 15:49:11 +09:00
crash_core_test.c crash: add KUnit tests for crash_exclude_mem_range 2025-09-13 17:32:55 -07:00
crash_core.c crash: fix crashkernel resource shrink 2025-11-15 10:52:01 -08:00
crash_dump_dm_crypt.c crash_dump: retrieve dm crypt keys in kdump kernel 2025-05-21 10:48:21 -07:00
crash_reserve.c crash: let architecture decide crash memory export to iomem_resource 2025-11-12 10:00:15 -08:00
cred.c kernel-6.19-rc1.cred 2025-12-01 13:45:41 -08:00
delayacct.c delayacct: remove redundant code and adjust indentation 2025-05-27 19:40:33 -07:00
dma.c
elfcorehdr.c
exec_domain.c
exit.c Significant patch series in this pull request: 2025-12-06 14:01:20 -08:00
exit.h
extable.c
fail_function.c
fork.c Significant patch series in this pull request: 2025-12-06 14:01:20 -08:00
freezer.c freezer: Clarify that only cgroup1 freezer uses PM freezer 2025-10-30 20:10:27 +01:00
gen_kheaders.sh kheaders: make it possible to override TAR 2025-08-06 10:23:36 +09:00
groups.c
hung_task.c hung_task: add hung_task_sys_info sysctl to dump sys info on task-hung 2025-11-20 14:03:43 -08:00
iomem.c mm/memremap: Pass down MEMREMAP_* flags to arch_memremap_wb() 2025-02-21 15:05:38 +01:00
irq_work.c kasan: make kasan_record_aux_stack_noalloc() the default behaviour 2025-01-13 22:40:36 -08:00
jump_label.c jump_label: Use RCU in all users of __module_text_address(). 2025-03-10 11:54:46 +01:00
kallsyms_internal.h
kallsyms_selftest.c kallsyms: use kmalloc_array() instead of kmalloc() 2025-09-28 11:36:14 -07:00
kallsyms_selftest.h
kallsyms.c kallsyms: Fix wrong "big" kernel symbol type read from procfs 2025-11-24 16:56:24 +01:00
kcmp.c kcmp: improve performance adding an unlikely hint to task comparisons 2025-02-21 10:25:33 +01:00
Kconfig.freezer
Kconfig.hz kernel: Fix "select" wording on HZ_250 description 2025-02-21 09:20:30 +01:00
Kconfig.kexec liveupdate: kho: move to kernel/liveupdate 2025-11-27 14:24:33 -08:00
Kconfig.locks
Kconfig.preempt softirq: Allow to drop the softirq-BKL lock on PREEMPT_RT 2025-09-17 16:25:41 +02:00
kcov.c kcov: use write memory barrier after memcpy() in kcov_move_area() 2025-09-13 17:32:44 -07:00
kexec_core.c kernel/kexec: fix IMA when allocation happens in CMA area 2025-12-23 11:23:14 -08:00
kexec_elf.c kexec: initialize ELF lowest address to ULONG_MAX 2025-03-16 22:30:47 -07:00
kexec_file.c x86/kexec: carry forward the boot DTB on kexec 2025-09-13 17:32:43 -07:00
kexec_internal.h kexec: enable CMA based contiguous allocation 2025-08-02 12:01:38 -07:00
kexec.c kexec: enable CMA based contiguous allocation 2025-08-02 12:01:38 -07:00
kheaders.c kheaders: Simplify attribute through __BIN_ATTR_SIMPLE_RO() 2024-12-24 09:46:49 +01:00
kprobes.c kprobes: Add missing kerneldoc for __get_insn_slot 2025-07-15 18:45:34 +09:00
kstack_erase.c sysctl: remove __user qualifier from stack_erasing_sysctl buffer argument 2025-11-27 15:44:53 +01:00
ksyms_common.c
ksysfs.c kexec: move sysfs entries to /sys/kernel/kexec 2025-11-27 14:24:42 -08:00
kthread.c kthread: Warn if mm_struct lacks user_ns in kthread_use_mm() 2025-12-24 21:32:58 +01:00
latencytop.c treewide: const qualify ctl_tables where applicable 2025-01-28 13:48:37 +01:00
Makefile liveupdate: kho: move to kernel/liveupdate 2025-11-27 14:24:33 -08:00
module_signature.c
notifier.c reboot: move reboot_notifier_list to kernel/reboot.c 2024-11-05 17:12:31 -08:00
nscommon.c ns: rename is_initial_namespace() 2025-11-11 10:01:31 +01:00
nsproxy.c nsproxy: fix free_nsproxy() and simplify create_new_namespaces() 2025-11-14 13:10:38 +01:00
nstree.c nstree: fix kernel-doc comments for internal functions 2025-11-14 13:10:38 +01:00
padata.c padata: remove __padata_list_init() 2025-11-14 18:15:49 +08:00
panic.c panic: only warn about deprecated panic_print on write access 2026-01-19 12:30:01 -08:00
params.c params: Replace deprecated strcpy() with strscpy() and memcpy() 2025-08-16 21:47:25 +02:00
pid_namespace.c pid: rely on common reference count behavior 2025-11-11 10:01:32 +01:00
pid_sysctl.h treewide: const qualify ctl_tables where applicable 2025-01-28 13:48:37 +01:00
pid.c ns: drop custom reference count initialization for initial namespaces 2025-11-11 10:01:32 +01:00
profile.c
ptrace.c rseq: Introduce struct rseq_data 2025-11-04 08:30:50 +01:00
range.c
reboot.c - The 7 patch series "powerpc/crash: use generic crashkernel 2025-04-01 10:06:52 -07:00
regset.c
relay.c relay: update relay to use mmap_prepare 2025-11-16 17:28:11 -08:00
resource_kunit.c
resource.c Reinstate "resource: avoid unnecessary lookups in find_next_iomem_res()" 2025-11-27 14:24:45 -08:00
rseq.c rseq: Switch to fast path processing on exit to user 2025-11-04 08:34:39 +01:00
scftorture.c scftorture: Handle NULL argument passed to scf_add_to_free_list(). 2024-11-14 16:09:51 -08:00
scs.c scs: fix a wrong parameter in __scs_magic 2025-11-12 10:00:13 -08:00
seccomp.c Performance events updates for v6.18: 2025-09-30 11:11:21 -07:00
signal.c signal: Move MMCID exit out of sighand lock 2025-11-25 19:45:40 +01:00
smp.c smp: Introduce a helper function to check for pending IPIs 2025-11-19 18:06:50 +01:00
smpboot.c sched/smp: Use the SMP version of idle_thread_set_boot_cpu() 2025-06-13 08:47:20 +02:00
smpboot.h
softirq.c softirq: Allow to drop the softirq-BKL lock on PREEMPT_RT 2025-09-17 16:25:41 +02:00
stacktrace.c
static_call_inline.c Modules changes for 6.15-rc1 2025-03-30 15:44:36 -07:00
static_call.c
stop_machine.c sched/core: Fix migrate_swap() vs. hotplug 2025-07-01 15:02:03 +02:00
sys_ni.c uprobes/x86: Add uprobe syscall to speed up uprobe 2025-08-21 20:09:20 +02:00
sys.c Patch series in this pull request: 2025-10-02 18:44:54 -07:00
sysctl-test.c sysctl: move u8 register test to lib/test_sysctl.c 2025-04-14 14:13:41 +02:00
sysctl.c sysctl: Wrap do_proc_douintvec with the public function proc_douintvec_conv 2025-11-27 15:45:38 +01:00
task_work.c task_work: Fix NMI race condition 2025-10-29 10:29:54 +01:00
taskstats.c fdget(), more trivial conversions 2024-11-03 01:28:06 -05:00
torture.c torture: Delay CPU-hotplug operations until boot completes 2025-08-14 15:26:30 -07:00
tracepoint.c tracepoint: Print the function symbol when tracepoint_debug is set 2025-03-21 15:30:10 -04:00
tsacct.c pid: change bacct_add_tsk() to use task_ppid_nr_ns() 2025-08-19 13:38:20 +02:00
ucount.c ucount: use atomic_long_try_cmpxchg() in atomic_long_inc_below() 2025-08-02 12:01:38 -07:00
uid16.c
uid16.h
umh.c treewide: const qualify ctl_tables where applicable 2025-01-28 13:48:37 +01:00
up.c
user_namespace.c ns: move ns type into struct ns_common 2025-09-25 09:23:54 +02:00
user-return-notifier.c
user.c ns: drop custom reference count initialization for initial namespaces 2025-11-11 10:01:32 +01:00
utsname_sysctl.c treewide: const qualify ctl_tables where applicable 2025-01-28 13:48:37 +01:00
utsname.c namespace-6.18-rc1 2025-09-29 11:20:29 -07:00
vhost_task.c vhost: Take a reference on the task in struct vhost_task. 2025-09-21 17:44:20 -04:00
vmcore_info.c vmcoreinfo: make hwerr_data visible for debugging 2026-01-26 19:03:49 -08:00
watch_queue.c watch_queue: Use local kmap in post_one_notification() 2025-11-19 12:17:28 +01:00
watchdog_buddy.c watchdog: fix opencoded cpumask_next_wrap() in watchdog_next_cpu() 2025-07-31 11:28:03 -04:00
watchdog_perf.c watchdog: skip checks when panic is in progress 2025-09-13 17:32:53 -07:00
watchdog.c powerpc/watchdog: add support for hardlockup_sys_info sysctl 2026-01-14 22:16:22 -08:00
workqueue_internal.h
workqueue.c workqueue: Don't rely on wq->rescuer to stop rescuer 2025-11-21 09:45:36 -10:00