mirror of
https://github.com/torvalds/linux.git
synced 2026-05-12 16:18:45 +02:00
Problem
=======
Commit 658eb5ab91 ("delayacct: add delay max to record delay peak")
introduced the delay max for getdelays, which records abnormal latency
peaks and helps us understand the magnitude of such delays. However, the
peak latency value alone is insufficient for effective root cause
analysis. Without the precise timestamp of when the peak occurred, we
still lack the critical context needed to correlate it with other system
events.
Solution
========
To address this, we need to additionally record a precise timestamp when
the maximum latency occurs. By correlating this timestamp with system
logs and monitoring metrics, we can identify processes with abnormal
resource usage at the same moment, which can help us to pinpoint root
causes.
Use Case
========
bash-4.4# ./getdelays -d -t 227
print delayacct stats ON
TGID 227
CPU count real total virtual total delay total delay average delay max delay min delay max timestamp
46 188000000 192348334 4098012 0.089ms 0.429260ms 0.051205ms 2026-01-15T15:06:58
IO count delay total delay average delay max delay min delay max timestamp
0 0 0.000ms 0.000000ms 0.000000ms N/A
SWAP count delay total delay average delay max delay min delay max timestamp
0 0 0.000ms 0.000000ms 0.000000ms N/A
RECLAIM count delay total delay average delay max delay min delay max timestamp
0 0 0.000ms 0.000000ms 0.000000ms N/A
THRAS HING count delay total delay average delay max delay min delay max timestamp
0 0 0.000ms 0.000000ms 0.000000ms N/A
COMPACT count delay total delay average delay max delay min delay max timestamp
0 0 0.000ms 0.000000ms 0.000000ms N/A
WPCOPY count delay total delay average delay max delay min delay max timestamp
182 19413338 0.107ms 0.547353ms 0.022462ms 2026-01-15T15:05:24
IRQ count delay total delay average delay max delay min delay max timestamp
0 0 0.000ms 0.000000ms 0.000000ms N/A
Link: https://lkml.kernel.org/r/20260119100241520gWubW8-5QfhSf9gjqcc_E@zte.com.cn
Signed-off-by: Wang Yaxin <wang.yaxin@zte.com.cn>
Cc: Fan Yu <fan.yu9@zte.com.cn>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: xu xin <xu.xin16@zte.com.cn>
Cc: Yang Yang <yang.yang29@zte.com.cn>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
215 lines
8.9 KiB
ReStructuredText
215 lines
8.9 KiB
ReStructuredText
================
|
|
Delay accounting
|
|
================
|
|
|
|
Tasks encounter delays in execution when they wait
|
|
for some kernel resource to become available e.g. a
|
|
runnable task may wait for a free CPU to run on.
|
|
|
|
The per-task delay accounting functionality measures
|
|
the delays experienced by a task while
|
|
|
|
a) waiting for a CPU (while being runnable)
|
|
b) completion of synchronous block I/O initiated by the task
|
|
c) swapping in pages
|
|
d) memory reclaim
|
|
e) thrashing
|
|
f) direct compact
|
|
g) write-protect copy
|
|
h) IRQ/SOFTIRQ
|
|
|
|
and makes these statistics available to userspace through
|
|
the taskstats interface.
|
|
|
|
Such delays provide feedback for setting a task's cpu priority,
|
|
io priority and rss limit values appropriately. Long delays for
|
|
important tasks could be a trigger for raising its corresponding priority.
|
|
|
|
The functionality, through its use of the taskstats interface, also provides
|
|
delay statistics aggregated for all tasks (or threads) belonging to a
|
|
thread group (corresponding to a traditional Unix process). This is a commonly
|
|
needed aggregation that is more efficiently done by the kernel.
|
|
|
|
Userspace utilities, particularly resource management applications, can also
|
|
aggregate delay statistics into arbitrary groups. To enable this, delay
|
|
statistics of a task are available both during its lifetime as well as on its
|
|
exit, ensuring continuous and complete monitoring can be done.
|
|
|
|
|
|
Interface
|
|
---------
|
|
|
|
Delay accounting uses the taskstats interface which is described
|
|
in detail in a separate document in this directory. Taskstats returns a
|
|
generic data structure to userspace corresponding to per-pid and per-tgid
|
|
statistics. The delay accounting functionality populates specific fields of
|
|
this structure. See
|
|
|
|
include/uapi/linux/taskstats.h
|
|
|
|
for a description of the fields pertaining to delay accounting.
|
|
It will generally be in the form of counters returning the cumulative
|
|
delay seen for cpu, sync block I/O, swapin, memory reclaim, thrash page
|
|
cache, direct compact, write-protect copy, IRQ/SOFTIRQ etc.
|
|
|
|
Taking the difference of two successive readings of a given
|
|
counter (say cpu_delay_total) for a task will give the delay
|
|
experienced by the task waiting for the corresponding resource
|
|
in that interval.
|
|
|
|
When a task exits, records containing the per-task statistics
|
|
are sent to userspace without requiring a command. If it is the last exiting
|
|
task of a thread group, the per-tgid statistics are also sent. More details
|
|
are given in the taskstats interface description.
|
|
|
|
The getdelays.c userspace utility in tools/accounting directory allows simple
|
|
commands to be run and the corresponding delay statistics to be displayed. It
|
|
also serves as an example of using the taskstats interface.
|
|
|
|
Usage
|
|
-----
|
|
|
|
Compile the kernel with::
|
|
|
|
CONFIG_TASK_DELAY_ACCT=y
|
|
CONFIG_TASKSTATS=y
|
|
|
|
Delay accounting is disabled by default at boot up.
|
|
To enable, add::
|
|
|
|
delayacct
|
|
|
|
to the kernel boot options. The rest of the instructions below assume this has
|
|
been done. Alternatively, use sysctl kernel.task_delayacct to switch the state
|
|
at runtime. Note however that only tasks started after enabling it will have
|
|
delayacct information.
|
|
|
|
After the system has booted up, use a utility
|
|
similar to getdelays.c to access the delays
|
|
seen by a given task or a task group (tgid).
|
|
The utility also allows a given command to be
|
|
executed and the corresponding delays to be
|
|
seen.
|
|
|
|
General format of the getdelays command::
|
|
|
|
getdelays [-dilv] [-t tgid] [-p pid]
|
|
|
|
Get delays, since system boot, for pid 10::
|
|
|
|
# ./getdelays -d -p 10
|
|
(output similar to next case)
|
|
|
|
Get sum and peak of delays, since system boot, for all pids with tgid 242::
|
|
|
|
bash-4.4# ./getdelays -d -t 242
|
|
print delayacct stats ON
|
|
TGID 242
|
|
|
|
|
|
CPU count real total virtual total delay total delay average delay max delay min delay max timestamp
|
|
46 188000000 192348334 4098012 0.089ms 0.429260ms 0.051205ms 2026-01-15T15:06:58
|
|
IO count delay total delay average delay max delay min delay max timestamp
|
|
0 0 0.000ms 0.000000ms 0.000000ms N/A
|
|
SWAP count delay total delay average delay max delay min delay max timestamp
|
|
0 0 0.000ms 0.000000ms 0.000000ms N/A
|
|
RECLAIM count delay total delay average delay max delay min delay max timestamp
|
|
0 0 0.000ms 0.000000ms 0.000000ms N/A
|
|
THRASHING count delay total delay average delay max delay min delay max timestamp
|
|
0 0 0.000ms 0.000000ms 0.000000ms N/A
|
|
COMPACT count delay total delay average delay max delay min delay max timestamp
|
|
0 0 0.000ms 0.000000ms 0.000000ms N/A
|
|
WPCOPY count delay total delay average delay max delay min delay max timestamp
|
|
182 19413338 0.107ms 0.547353ms 0.022462ms 2026-01-15T15:05:24
|
|
IRQ count delay total delay average delay max delay min delay max timestamp
|
|
0 0 0.000ms 0.000000ms 0.000000ms N/A
|
|
|
|
Get IO accounting for pid 1, it works only with -p::
|
|
|
|
# ./getdelays -i -p 1
|
|
printing IO accounting
|
|
linuxrc: read=65536, write=0, cancelled_write=0
|
|
|
|
The above command can be used with -v to get more debug information.
|
|
|
|
After the system starts, use `delaytop` to get the system-wide delay information,
|
|
which includes system-wide PSI information and Top-N high-latency tasks.
|
|
Note: PSI support requires `CONFIG_PSI=y` and `psi=1` for full functionality.
|
|
|
|
`delaytop` is an interactive tool for monitoring system pressure and task delays.
|
|
It supports multiple sorting options, display modes, and real-time keyboard controls.
|
|
|
|
Basic usage with default settings (sorts by CPU delay, shows top 20 tasks, refreshes every 2 seconds)::
|
|
|
|
bash# ./delaytop
|
|
System Pressure Information: (avg10/avg60vg300/total)
|
|
CPU some: 0.0%/ 0.0%/ 0.0%/ 106137(ms)
|
|
CPU full: 0.0%/ 0.0%/ 0.0%/ 0(ms)
|
|
Memory full: 0.0%/ 0.0%/ 0.0%/ 0(ms)
|
|
Memory some: 0.0%/ 0.0%/ 0.0%/ 0(ms)
|
|
IO full: 0.0%/ 0.0%/ 0.0%/ 2240(ms)
|
|
IO some: 0.0%/ 0.0%/ 0.0%/ 2783(ms)
|
|
IRQ full: 0.0%/ 0.0%/ 0.0%/ 0(ms)
|
|
[o]sort [M]memverbose [q]quit
|
|
Top 20 processes (sorted by cpu delay):
|
|
PID TGID COMMAND CPU(ms) IO(ms) IRQ(ms) MEM(ms)
|
|
------------------------------------------------------------------------
|
|
110 110 kworker/15:0H-s 27.91 0.00 0.00 0.00
|
|
57 57 cpuhp/7 3.18 0.00 0.00 0.00
|
|
99 99 cpuhp/14 2.97 0.00 0.00 0.00
|
|
51 51 cpuhp/6 0.90 0.00 0.00 0.00
|
|
44 44 kworker/4:0H-sy 0.80 0.00 0.00 0.00
|
|
60 60 ksoftirqd/7 0.74 0.00 0.00 0.00
|
|
76 76 idle_inject/10 0.31 0.00 0.00 0.00
|
|
100 100 idle_inject/14 0.30 0.00 0.00 0.00
|
|
1309 1309 systemsettings 0.29 0.00 0.00 0.00
|
|
45 45 cpuhp/5 0.22 0.00 0.00 0.00
|
|
63 63 cpuhp/8 0.20 0.00 0.00 0.00
|
|
87 87 cpuhp/12 0.18 0.00 0.00 0.00
|
|
93 93 cpuhp/13 0.17 0.00 0.00 0.00
|
|
1265 1265 acpid 0.17 0.00 0.00 0.00
|
|
1552 1552 sshd 0.17 0.00 0.00 0.00
|
|
2584 2584 sddm-helper 0.16 0.00 0.00 0.00
|
|
1284 1284 rtkit-daemon 0.15 0.00 0.00 0.00
|
|
1326 1326 nde-netfilter 0.14 0.00 0.00 0.00
|
|
27 27 cpuhp/2 0.13 0.00 0.00 0.00
|
|
631 631 kworker/11:2-rc 0.11 0.00 0.00 0.00
|
|
|
|
Interactive keyboard controls during runtime::
|
|
|
|
o - Select sort field (CPU, IO, IRQ, Memory, etc.)
|
|
M - Toggle display mode (Default/Memory Verbose)
|
|
q - Quit
|
|
|
|
Available sort fields(use -s/--sort or interactive command)::
|
|
|
|
cpu(c) - CPU delay
|
|
blkio(i) - I/O delay
|
|
irq(q) - IRQ delay
|
|
mem(m) - Total memory delay
|
|
swapin(s) - Swapin delay (memory verbose mode only)
|
|
freepages(r) - Freepages reclaim delay (memory verbose mode only)
|
|
thrashing(t) - Thrashing delay (memory verbose mode only)
|
|
compact(p) - Compaction delay (memory verbose mode only)
|
|
wpcopy(w) - Write page copy delay (memory verbose mode only)
|
|
|
|
Advanced usage examples::
|
|
|
|
# ./delaytop -s blkio
|
|
Sorted by IO delay
|
|
|
|
# ./delaytop -s mem -M
|
|
Sorted by memory delay in memory verbose mode
|
|
|
|
# ./delaytop -p pid
|
|
Print delayacct stats
|
|
|
|
# ./delaytop -P num
|
|
Display the top N tasks
|
|
|
|
# ./delaytop -n num
|
|
Set delaytop refresh frequency (num times)
|
|
|
|
# ./delaytop -d secs
|
|
Specify refresh interval as secs
|