linux/kernel
Johannes Weiner a3835ce695 UPSTREAM: psi: Fix psi state corruption when schedule() races with cgroup move
4117cebf1a ("psi: Optimize task switch inside shared cgroups")
introduced a race condition that corrupts internal psi state. This
manifests as kernel warnings, sometimes followed by bogusly high IO
pressure:

  psi: task underflow! cpu=1 t=2 tasks=[0 0 0 0] clear=c set=0
  (schedule() decreasing RUNNING and ONCPU, both of which are 0)

  psi: inconsistent task state! task=2412744:systemd cpu=17 psi_flags=e clear=3 set=0
  (cgroup_move_task() clearing MEMSTALL and IOWAIT, but task is MEMSTALL | RUNNING | ONCPU)
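
The hex values in these warnings are bitmasks of psi task states. The
helper below is an illustrative sketch for decoding them, not kernel
code. It assumes the bit layout IOWAIT=bit 0, MEMSTALL=bit 1,
RUNNING=bit 2, ONCPU=bit 3, which matches the parenthetical
explanations above; psi_flags_str() is a hypothetical name.

```c
#include <stddef.h>
#include <string.h>

/* Assumed bit layout, consistent with the warnings quoted above. */
enum {
	TSK_IOWAIT   = 1 << 0,
	TSK_MEMSTALL = 1 << 1,
	TSK_RUNNING  = 1 << 2,
	TSK_ONCPU    = 1 << 3,
};

/*
 * Hypothetical helper: render a psi flag mask as "A | B | C" into buf.
 * len must be at least 1. Output is truncated if buf is too small.
 */
static const char *psi_flags_str(unsigned int flags, char *buf, size_t len)
{
	static const char *const names[] = {
		"IOWAIT", "MEMSTALL", "RUNNING", "ONCPU"
	};
	size_t i;

	buf[0] = '\0';
	for (i = 0; i < sizeof(names) / sizeof(names[0]); i++) {
		if (!(flags & (1u << i)))
			continue;
		if (buf[0])
			strncat(buf, " | ", len - strlen(buf) - 1);
		strncat(buf, names[i], len - strlen(buf) - 1);
	}
	return buf;
}
```

With this layout, clear=c decodes to RUNNING | ONCPU, psi_flags=e to
MEMSTALL | RUNNING | ONCPU, and clear=3 to IOWAIT | MEMSTALL.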

What the offending commit does is batch the two psi callbacks in
schedule() to reduce the number of cgroup tree updates. When prev is
deactivated and removed from the runqueue, nothing is done in psi at
first; when the task switch completes, TSK_RUNNING and TSK_IOWAIT are
updated along with TSK_ONCPU.

However, the deactivation and the task switch inside schedule() aren't
atomic: pick_next_task() may drop the rq lock for load balancing. When
this happens, cgroup_move_task() can run after the task has been
physically dequeued but while the psi updates are still pending.
Because cgroup_move_task() derives the flags from the task's scheduler
state, it does not move to the new cgroup everything that the
subsequent task switch is about to clear from it: it leaks the
TSK_RUNNING count in the old cgroup, and psi_sched_switch() then
underflows it in the new cgroup.

A similar thing can happen for iowait. TSK_IOWAIT is usually set when
a p->in_iowait task is dequeued, but again this update is deferred to
the switch. cgroup_move_task() can see an unqueued p->in_iowait task
and move a non-existent TSK_IOWAIT. This triggers the inconsistent
task state warning, as well as a counter underflow that causes
permanent ghost IO pressure to be reported.

Fix this bug by making cgroup_move_task() use task->psi_flags instead
of looking at the potentially mismatching scheduler state.
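
The race and the fix can be illustrated with a small user-space model.
This is a sketch under simplifying assumptions, not the kernel code:
each task belongs to one flat group (no cgroup hierarchy), memstall is
left out, a queued task is treated as always on the CPU, and all
struct and function names (group, task, task_change, move_task,
simulate) are hypothetical. The flag bit values are assumed to match
the warnings above.

```c
#include <string.h>

enum { TSK_IOWAIT = 1, TSK_MEMSTALL = 2, TSK_RUNNING = 4, TSK_ONCPU = 8 };
#define NR_COUNTS 4

struct group { int tasks[NR_COUNTS]; int underflows; };

struct task {
	unsigned int psi_flags;	/* what psi has actually accounted */
	int on_rq;		/* scheduler state: still queued? */
	int in_iowait;
	int inconsistent;	/* number of mismatching updates seen */
	struct group *grp;
};

/* Apply a clear/set update to the task and to its current group. */
static void task_change(struct task *t, unsigned int clear, unsigned int set)
{
	int i;

	if ((t->psi_flags & set) || (t->psi_flags & clear) != clear)
		t->inconsistent++;
	t->psi_flags = (t->psi_flags & ~clear) | set;

	for (i = 0; i < NR_COUNTS; i++) {
		if (clear & (1u << i)) {
			if (t->grp->tasks[i] == 0)
				t->grp->underflows++;
			t->grp->tasks[i]--;
		}
		if (set & (1u << i))
			t->grp->tasks[i]++;
	}
}

/* Model of the group move: old behaviour guesses from scheduler state. */
static void move_task(struct task *t, struct group *to, int use_psi_flags)
{
	unsigned int flags;

	if (use_psi_flags)
		flags = t->psi_flags;			/* the fix */
	else if (t->on_rq)
		flags = TSK_RUNNING | TSK_ONCPU;	/* simplified */
	else
		flags = t->in_iowait ? TSK_IOWAIT : 0;

	if (flags)
		task_change(t, flags, 0);
	t->grp = to;
	if (flags)
		task_change(t, 0, flags);
}

struct outcome { int underflows; int inconsistent; int leaked; };

/* Dequeue, move during the unlocked window, then run the deferred update. */
static struct outcome simulate(int use_psi_flags, int in_iowait)
{
	struct group old_grp, new_grp;
	struct task t;
	struct outcome out;
	int i;

	memset(&old_grp, 0, sizeof(old_grp));
	memset(&new_grp, 0, sizeof(new_grp));
	memset(&t, 0, sizeof(t));
	t.on_rq = 1;
	t.in_iowait = in_iowait;
	t.grp = &old_grp;
	task_change(&t, 0, TSK_RUNNING | TSK_ONCPU);

	t.on_rq = 0;				/* dequeued, psi update pending */
	move_task(&t, &new_grp, use_psi_flags);	/* runs in the race window */
	task_change(&t, TSK_RUNNING | TSK_ONCPU,
		    in_iowait ? TSK_IOWAIT : 0);	/* deferred switch update */

	out.underflows = old_grp.underflows + new_grp.underflows;
	out.inconsistent = t.inconsistent;
	out.leaked = 0;
	for (i = 0; i < NR_COUNTS; i++)
		out.leaked += old_grp.tasks[i] != 0;
	return out;
}
```

In this model, simulate(0, 0) leaves counts behind in the old group and
underflows the new one, and simulate(0, 1) additionally trips the
inconsistency check. simulate(1, 0) and simulate(1, 1), which move
whatever task->psi_flags says is accounted, finish with no underflow,
no inconsistency and nothing left in the old group.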

[ We used the scheduler state historically in order to not rely on
  task->psi_flags for anything but debugging. But that ship has sailed
  anyway, and this is simpler and more robust.

  We previously already batched TSK_ONCPU clearing with the
  TSK_RUNNING update inside the deactivation call from schedule(). But
  that ordering was safe and didn't result in TSK_ONCPU corruption:
  unlike most places in the scheduler, cgroup_move_task() only checked
  task_current() and handled TSK_ONCPU if the task was still queued. ]

bug: b/253347377

Fixes: 4117cebf1a ("psi: Optimize task switch inside shared cgroups")
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20210503174917.38579-1-hannes@cmpxchg.org
(cherry picked from commit d583d360a6)
Change-Id: Id0a292058d4bffb716d8e1496f72139e8d435410
2022-10-13 05:36:38 +00:00
bpf This is the 5.10.134 stable release 2022-08-03 12:42:13 +02:00
cgroup BACKPORT: cgroup: Fix threadgroup_rwsem <-> cpus_read_lock() deadlock 2022-08-19 18:44:41 +00:00
configs
debug lockdown: also lock down previous kgdb use 2022-05-30 09:33:22 +02:00
dma Merge tag 'android12-5.10.136_r00' into android12-5.10 2022-09-28 09:54:28 +02:00
entry KVM: rseq: Update rseq when processing NOTIFY_RESUME on xfer to KVM guest 2021-10-06 15:55:49 +02:00
events This is the 5.10.134 stable release 2022-08-03 12:42:13 +02:00
gcov gcov: re-fix clang-11+ support 2021-04-14 08:41:58 +02:00
irq Merge 5.10.119 into android12-5.10-lts 2022-07-14 14:31:17 +02:00
kcsan kcsan: Fix debugfs initcall return type 2021-05-26 12:06:54 +02:00
livepatch livepatch: Fix build failure on 32 bits processors 2022-04-08 14:40:15 +02:00
locking ANDROID: vendor_hook: rename the the name of hooks 2022-09-22 10:18:45 +00:00
power This is the 5.10.110 stable release 2022-04-18 17:41:18 +02:00
printk This is the 5.10.110 stable release 2022-04-18 17:41:18 +02:00
rcu This is the 5.10.121 stable release 2022-07-23 16:10:22 +02:00
sched UPSTREAM: psi: Fix psi state corruption when schedule() races with cgroup move 2022-10-13 05:36:38 +00:00
time This is the 5.10.132 stable release 2022-07-28 17:17:55 +02:00
trace This is the 5.10.132 stable release 2022-07-28 17:17:55 +02:00
.gitignore kbuild: update config_data.gz only when the content of .config is changed 2021-05-11 14:47:37 +02:00
acct.c kernel: acct.c: fix some kernel-doc nits 2020-10-16 11:11:19 -07:00
async.c Revert "module, async: async_synchronize_full() on module init iff async is used" 2022-02-23 12:01:00 +01:00
audit_fsnotify.c fsnotify: generalize handle_inode_event() 2020-12-30 11:54:18 +01:00
audit_tree.c audit: move put_tree() to avoid trim_trees refcount underflow and UAF 2021-09-03 10:09:31 +02:00
audit_watch.c fsnotify: generalize handle_inode_event() 2020-12-30 11:54:18 +01:00
audit.c audit: improve audit queue handling when "audit=1" on cmdline 2022-02-08 18:30:34 +01:00
audit.h audit: log AUDIT_TIME_* records only from rules 2022-04-08 14:40:00 +02:00
auditfilter.c treewide: Use fallthrough pseudo-keyword 2020-08-23 17:36:59 -05:00
auditsc.c audit: log AUDIT_TIME_* records only from rules 2022-04-08 14:40:00 +02:00
backtracetest.c treewide: Replace DECLARE_TASKLET() with DECLARE_TASKLET_OLD() 2020-07-30 11:15:58 -07:00
bounds.c
capability.c LSM: Signal to SafeSetID when setting group IDs 2020-10-13 09:17:34 -07:00
cfi.c ANDROID: cfi: explicitly clear diag in __cfi_slowpath 2021-09-02 08:55:56 +00:00
compat.c treewide: Use fallthrough pseudo-keyword 2020-08-23 17:36:59 -05:00
configs.c
context_tracking.c
cpu_pm.c PM: cpu: Make notifier chain use a raw_spinlock_t 2021-09-15 09:50:40 +02:00
cpu.c ANDROID: Fix kenelci build-break for !CONFIG_PERF_EVENTS 2022-10-06 18:00:02 +00:00
crash_core.c crash_core, vmcoreinfo: append 'SECTION_SIZE_BITS' to vmcoreinfo 2021-06-23 14:42:52 +02:00
crash_dump.c
cred.c Revert "Add a reference to ucounts for each cred" 2021-09-08 08:49:00 +02:00
delayacct.c
dma.c
exec_domain.c
exit.c This is the 5.10.132 stable release 2022-07-28 17:17:55 +02:00
extable.c
fail_function.c fail_function: Remove a redundant mutex unlock 2020-11-19 11:58:16 -08:00
fork.c Revert "ANDROID: vendor_hooks:vendor hook for mmput" 2022-08-17 05:54:29 +00:00
freezer.c ANDROID: freezer: Add vendor hook to freezer for GKI purpose. 2021-06-07 16:07:44 +00:00
futex.c ANDROID: vendor_hooks: Add hooks for oem futex optimization 2022-08-24 00:15:59 +00:00
gen_kheaders.sh
groups.c LSM: Signal to SafeSetID when setting group IDs 2020-10-13 09:17:34 -07:00
hung_task.c FROMLIST: freezer: Add frozen_or_skipped() helper function 2021-06-02 15:42:01 +00:00
iomem.c
irq_work.c ANDROID: Sched: Export scheduler symbols needed by vendor modules 2020-12-03 16:50:04 +00:00
jump_label.c jump_label: Fix jump_label_text_reserved() vs __init 2021-07-20 16:05:58 +02:00
kallsyms.c ANDROID: kallsyms: cfi: strip hashes from static functions 2021-01-14 16:31:46 +00:00
kcmp.c exec: Transform exec_update_mutex into a rw_semaphore 2021-01-09 13:46:24 +01:00
Kconfig.freezer
Kconfig.hz
Kconfig.locks
Kconfig.preempt
kcov.c kcov: make some symbols static 2020-08-12 10:58:02 -07:00
kexec_core.c kernel: kexec: remove the lock operation of system_transition_mutex 2021-02-03 23:28:37 +01:00
kexec_elf.c
kexec_file.c ima: force signature verification when CONFIG_KEXEC_SIG is configured 2022-07-21 21:20:11 +02:00
kexec_internal.h
kexec.c LSM: Introduce kernel_post_load_data() hook 2020-10-05 13:37:03 +02:00
kheaders.c
kmod.c kmod: remove redundant "be an" in the comment 2020-08-12 10:58:01 -07:00
kprobes.c kprobes: Limit max data_size of the kretprobe instances 2021-12-08 09:03:20 +01:00
ksysfs.c
kthread.c This is the 5.10.62 stable release 2021-09-03 10:51:56 +02:00
latencytop.c
Makefile kbuild: update config_data.gz only when the content of .config is changed 2021-05-11 14:47:37 +02:00
module_signature.c module: harden ELF info handling 2021-03-25 09:04:11 +01:00
module_signing.c module: harden ELF info handling 2021-03-25 09:04:11 +01:00
module-internal.h
module.c This is the 5.10.118 stable release 2022-06-06 16:37:12 +02:00
notifier.c notifier: Fix broken error handling pattern 2020-09-01 09:58:03 +02:00
nsproxy.c nsproxy: support CLONE_NEWTIME with setns() 2020-07-08 11:14:22 +02:00
padata.c padata: fix possible padata_works_lock deadlock 2020-09-04 17:51:55 +10:00
panic.c panic: don't dump stack twice on warn 2020-11-14 11:26:04 -08:00
params.c params: Replace zero-length array with flexible-array member 2020-10-29 17:22:59 -05:00
pid_namespace.c memcg: enable accounting for pids in nested pid namespaces 2021-09-18 13:40:36 +02:00
pid.c Revert "ANDROID: vendor_hooks:vendor hook for pidfd_open" 2022-08-17 07:58:13 +02:00
profile.c profiling: fix shift-out-of-bounds bugs 2021-09-26 14:08:58 +02:00
ptrace.c ptrace: Reimplement PTRACE_KILL by always sending SIGKILL 2022-06-09 10:20:49 +02:00
range.c kernel.h: split out min()/max() et al. helpers 2020-10-16 11:11:19 -07:00
reboot.c Merge e28c0d7c92 ("Merge branch 'akpm' (patches from Andrew)") into android-mainline 2020-11-15 14:37:09 +01:00
regset.c regset: kill ->get() 2020-07-27 14:31:12 -04:00
relay.c kernel/relay.c: drop unneeded initialization 2020-10-16 11:11:22 -07:00
resource.c kernel/resource: make walk_mem_res() find all busy IORESOURCE_MEM resources 2021-05-19 10:13:09 +02:00
rseq.c rseq: Remove broken uapi field layout on 32-bit little endian 2022-04-08 14:40:03 +02:00
scftorture.c scftorture: Fix distribution of short handler delays 2022-06-09 10:21:01 +02:00
scs.c FROMGIT: scs: Release kasan vmalloc poison in scs_free process 2021-10-04 15:44:53 +00:00
seccomp.c This is the 5.10.60 stable release 2021-08-27 17:14:51 +02:00
signal.c Merge tag 'android12-5.10.136_r00' into android12-5.10 2022-09-28 09:54:28 +02:00
smp.c This is the 5.10.112 stable release 2022-04-29 09:15:09 +02:00
smpboot.c sched/core: Initialize the idle task with preemption disabled 2021-07-14 16:55:50 +02:00
smpboot.h
softirq.c ANDROID: softirq: Export irq_handler_exit tracepoint 2020-12-21 17:48:06 +00:00
stackleak.c gcc-plugins/stackleak: Use noinstr in favor of notrace 2022-02-23 12:01:00 +01:00
stacktrace.c ANDROID: stacktrace: export stack_trace_save_tsk/regs 2021-04-13 13:18:04 +00:00
static_call.c static_call: Fix unused variable warn w/o MODULE 2021-09-08 08:49:00 +02:00
stop_machine.c ANDROID: stop_machine: stop_one_cpu_async 2020-12-08 19:07:21 +00:00
sys_ni.c UPSTREAM: mm: wire up syscall process_mrelease 2022-01-06 17:37:36 +00:00
sys.c This is the 5.10.69 stable release 2021-09-30 18:36:17 +02:00
sysctl-test.c
sysctl.c Merge branch 'android12-5.10' into branch 'android12-5.10-lts' 2022-08-16 14:34:54 +02:00
task_work.c FROMGIT: kasan: record task_work_add() call stack 2021-03-24 15:09:18 -07:00
taskstats.c taskstats: move specifying netlink policy back to ops 2020-10-02 19:11:12 -07:00
test_kprobes.c
torture.c
tracepoint.c This is the 5.10.62 stable release 2021-09-03 10:51:56 +02:00
tsacct.c taskstats: Cleanup the use of task->exit_code 2022-01-27 10:54:33 +01:00
ucount.c Revert "Add a reference to ucounts for each cred" 2021-09-08 08:49:00 +02:00
uid16.c
uid16.h
umh.c usermodehelper: reset umask to default before executing user process 2020-10-06 10:31:52 -07:00
up.c smp: Fix smp_call_function_single_async prototype 2021-05-14 09:50:46 +02:00
user_namespace.c Revert "Add a reference to ucounts for each cred" 2021-09-08 08:49:00 +02:00
user-return-notifier.c
user.c Revert "ANDROID: user: Add vendor hook to user for GKI purpose" 2022-01-20 10:04:49 +01:00
usermode_driver.c bpf: Fix umd memory leak in copy_process() 2021-03-30 14:32:03 +02:00
utsname_sysctl.c
utsname.c
watch_queue.c BACKPORT: watchqueue: make sure to serialize 'wqueue->defunct' properly 2022-09-12 07:24:14 +00:00
watchdog_hld.c
watchdog.c Merge 5.10.38 into android12-5.10 2021-05-20 15:35:25 +02:00
workqueue_internal.h
workqueue.c Merge branch 'android12-5.10' into android12-5.10-lts 2022-02-09 18:16:30 +01:00