linux/include
Eric Dumazet 6c4d4334e5 tcp: fix potential xmit stalls caused by TCP_NOTSENT_LOWAT
[ Upstream commit 4bfe744ff1 ]

I had this bug sitting for too long in my pile, it is time to fix it.

Thanks to Doug Porter for reminding me of it!

We had various attempts in the past, including commit
0cbe6a8f08 ("tcp: remove SOCK_QUEUE_SHRUNK"),
but the issue is that the TCP stack currently only generates
EPOLLOUT from the input path, when tp->snd_una has advanced
and skb(s) have been cleaned from the rtx queue.

If a flow has a big RTT, and/or receives SACKs, it is possible
that the notsent part (tp->write_seq - tp->snd_nxt) reaches 0
and no more data can be sent until tp->snd_una finally advances.

What is needed is to also check whether EPOLLOUT needs to be generated
whenever tp->snd_nxt is advanced, from the output path.
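
A minimal userspace sketch of the condition described above, not the
kernel code. struct flow_model, notsent_bytes(), writable() and
model_transmit() are hypothetical names, and the model ignores send
buffer accounting, the congestion window and pacing. It assumes the
socket counts as writable when the not-sent byte count is below the
low-water mark, and shows that advancing snd_nxt alone, with snd_una
unchanged, can cross that threshold:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of the sequence numbers named in the text. */
struct flow_model {
    uint32_t snd_una;       /* oldest unacknowledged byte */
    uint32_t snd_nxt;       /* next byte to transmit */
    uint32_t write_seq;     /* end of data queued by the application */
    uint32_t notsent_lowat; /* assumed writability threshold, in bytes */
};

/* Queued but not yet transmitted; unsigned math survives wraparound. */
static inline uint32_t notsent_bytes(const struct flow_model *f)
{
    return f->write_seq - f->snd_nxt;
}

/* Assumption: writable means the not-sent part is below the threshold. */
static inline bool writable(const struct flow_model *f)
{
    return notsent_bytes(f) < f->notsent_lowat;
}

/*
 * Output path of the model: transmit up to n bytes by advancing snd_nxt.
 * Returns true when this transmit made the flow writable, which is the
 * point where a wakeup (EPOLLOUT) would have to be generated.
 */
static inline bool model_transmit(struct flow_model *f, uint32_t n)
{
    bool was_writable = writable(f);
    uint32_t pending = notsent_bytes(f);

    if (n > pending)
        n = pending;
    f->snd_nxt += n;

    /* snd_nxt must never move past write_seq. */
    assert(f->snd_nxt - f->snd_una <= f->write_seq - f->snd_una);

    return !was_writable && writable(f);
}
```

In this model a sender that only re-evaluates writable() when snd_una
moves would miss the transition reported by model_transmit(), which is
the stall the text describes.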

This bug triggers more often after an idle period, as
we do not receive an ACK for at least one RTT. tcp_notsent_lowat
could be a fraction of what CWND and pacing rate would allow us to
send during this RTT.

In a followup patch, I will remove the bogus call
to tcp_chrono_stop(sk, TCP_CHRONO_SNDBUF_LIMITED)
from tcp_check_space(). The fact that we have decided to generate
an EPOLLOUT does not mean the application has immediately
refilled the transmit queue. This optimistic call
might have been the reason the bug seemed not too serious.

Tested:

200 ms rtt, 1% packet loss, 32 MB tcp_rmem[2] and tcp_wmem[2]

$ echo 500000 >/proc/sys/net/ipv4/tcp_notsent_lowat
$ cat bench_rr.sh
SUM=0
for i in {1..10}
do
 V=$(netperf -H remote_host -l30 -t TCP_RR -- -r 10000000,10000 -o LOCAL_BYTES_SENT | grep -Ev "MIGRATED|Bytes")
 echo "$V"
 SUM=$((SUM + V))
done
echo SUM=$SUM

Before patch:
$ bench_rr.sh
130000000
80000000
140000000
140000000
140000000
140000000
130000000
40000000
90000000
110000000
SUM=1140000000

After patch:
$ bench_rr.sh
430000000
590000000
530000000
450000000
450000000
350000000
450000000
490000000
480000000
460000000
SUM=4680000000  # This is 410 % of the value before patch.
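
The totals and the ratio quoted above can be rechecked with a short
sketch. The per-run values are copied from the two runs listed; the
array and function names (before_patch, after_patch, sum_bytes,
percent_of) are hypothetical, and percent_of() truncates toward zero:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* LOCAL_BYTES_SENT per run, copied from the "Before patch" output. */
static const uint64_t before_patch[10] = {
    130000000, 80000000, 140000000, 140000000, 140000000,
    140000000, 130000000, 40000000, 90000000, 110000000,
};

/* LOCAL_BYTES_SENT per run, copied from the "After patch" output. */
static const uint64_t after_patch[10] = {
    430000000, 590000000, 530000000, 450000000, 450000000,
    350000000, 450000000, 490000000, 480000000, 460000000,
};

/* Sum n byte counts. */
static inline uint64_t sum_bytes(const uint64_t *v, size_t n)
{
    uint64_t total = 0;

    for (size_t i = 0; i < n; i++)
        total += v[i];
    return total;
}

/* value expressed as an integer percentage of base, truncated. */
static inline uint64_t percent_of(uint64_t value, uint64_t base)
{
    assert(base != 0);
    return value * 100 / base;
}
```

With these inputs, sum_bytes() gives 1140000000 and 4680000000, and
percent_of() of the two totals gives 410, matching the figures above.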

Fixes: c9bee3b7fd ("tcp: TCP_NOTSENT_LOWAT socket option")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Doug Porter <dsp@fb.com>
Cc: Soheil Hassas Yeganeh <soheil@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-05-09 09:14:37 +02:00
acpi ACPICA: actypes.h: Expand the ACPI_ACCESS_ definitions 2022-01-27 11:04:49 +01:00
asm-generic tlb: hugetlb: Add more sizes to tlb_remove_huge_tlb_entry 2022-04-20 09:34:16 +02:00
clocksource
crypto
drm drm/connector: Fix typo in documentation 2022-04-08 14:24:12 +02:00
dt-bindings
keys
kunit kunit: fix kernel-doc warnings due to mismatched arg names 2021-10-06 17:54:07 -06:00
kvm KVM: arm64: Fix PMU probe ordering 2021-09-20 12:43:34 +01:00
linux mtd: fix 'part' field data corruption in mtd_info 2022-05-09 09:14:34 +02:00
math-emu
media media: cec: fix a deadlock situation 2022-01-27 11:02:53 +01:00
memory memory: renesas-rpc-if: Fix HF/OSPI data transfer in Manual Mode 2022-05-09 09:14:34 +02:00
misc
net tcp: fix potential xmit stalls caused by TCP_NOTSENT_LOWAT 2022-05-09 09:14:37 +02:00
pcmcia
ras
rdma RDMA/netlink: Add __maybe_unused to static inline in C file 2021-11-25 09:49:07 +01:00
scsi scsi: iscsi: Fix NOP handling during conn recovery 2022-04-27 14:38:56 +02:00
soc net: dsa: tag_ocelot_8021q: break circular dependency with ocelot switch lib 2021-10-12 17:35:18 -07:00
sound ALSA: core: Add snd_card_free_on_error() helper 2022-04-20 09:34:05 +02:00
target scsi: target: Fix ordered tag handling 2021-11-25 09:48:29 +01:00
trace SUNRPC: Fix the svc_deferred_event trace class 2022-04-20 09:34:09 +02:00
uapi bpf: Make remote_port field in struct bpf_sk_lookup 16-bit wide 2022-04-13 20:59:25 +02:00
vdso
video
xen xen/gnttab: fix gnttab_end_foreign_access() without page specified 2022-03-11 12:22:37 +01:00