linux/net
Eric Dumazet 8a9d6ca360 tcp: fix potential xmit stalls caused by TCP_NOTSENT_LOWAT
[ Upstream commit 4bfe744ff1 ]

I had this bug sitting for too long in my pile, it is time to fix it.

Thanks to Doug Porter for reminding me of it!

We had various attempts in the past, including commit
0cbe6a8f08 ("tcp: remove SOCK_QUEUE_SHRUNK"),
but the issue is that TCP stack currently only generates
EPOLLOUT from input path, when tp->snd_una has advanced
and skb(s) cleaned from rtx queue.

If a flow has a big RTT, and/or receives SACKs, it is possible
that the notsent part (tp->write_seq - tp->snd_nxt) reaches 0
and no more data can be sent until tp->snd_una finally advances.

What is needed is to also check if POLLOUT needs to be generated
whenever tp->snd_nxt is advanced, from output path.

This bug triggers more often after an idle period, as
we do not receive ACK for at least one RTT. tcp_notsent_lowat
could be a fraction of what CWND and pacing rate would allow to
send during this RTT.

In a followup patch, I will remove the bogus call
to tcp_chrono_stop(sk, TCP_CHRONO_SNDBUF_LIMITED)
from tcp_check_space(). Fact that we have decided to generate
an EPOLLOUT does not mean the application has immediately
refilled the transmit queue. This optimistic call
might have been the reason the bug seemed not too serious.

Tested:

200 ms rtt, 1% packet loss, 32 MB tcp_rmem[2] and tcp_wmem[2]

$ echo 500000 >/proc/sys/net/ipv4/tcp_notsent_lowat
$ cat bench_rr.sh
SUM=0
for i in {1..10}
do
 V=`netperf -H remote_host -l30 -t TCP_RR -- -r 10000000,10000 -o LOCAL_BYTES_SENT | egrep -v "MIGRATED|Bytes"`
 echo $V
 SUM=$(($SUM + $V))
done
echo SUM=$SUM

Before patch:
$ bench_rr.sh
130000000
80000000
140000000
140000000
140000000
140000000
130000000
40000000
90000000
110000000
SUM=1140000000

After patch:
$ bench_rr.sh
430000000
590000000
530000000
450000000
450000000
350000000
450000000
490000000
480000000
460000000
SUM=4680000000  # This is 410 % of the value before patch.

Fixes: c9bee3b7fd ("tcp: TCP_NOTSENT_LOWAT socket option")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Doug Porter <dsp@fb.com>
Cc: Soheil Hassas Yeganeh <soheil@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-05-09 09:05:04 +02:00
..
6lowpan 6lowpan: iphc: Fix an off-by-one check of array index 2021-09-15 09:50:34 +02:00
9p xen/9p: use alloc/free_pages_exact() 2022-03-11 12:11:54 +01:00
802 net/802/garp: fix memleak in garp_request_join() 2021-07-31 08:16:11 +02:00
8021q net: vlan: fix underflow for the real_dev refcnt 2021-12-01 09:19:08 +01:00
appletalk appletalk: Fix skb allocation size in loopback case 2021-04-07 15:00:08 +02:00
atm
ax25 ax25: Fix UAF bugs in ax25 timers 2022-04-20 09:23:32 +02:00
batman-adv ipv6: make mc_forwarding atomic 2022-04-13 21:00:56 +02:00
bluetooth Bluetooth: Fix use after free in hci_send_acl 2022-04-13 21:01:00 +02:00
bpf bpf, test, cgroup: Use sk_{alloc,free} for test cases 2021-10-27 09:56:56 +02:00
bpfilter bpfilter: Specify the log level for the kmsg message 2021-07-14 16:56:29 +02:00
bridge net: bridge: vlan: fix memory leak in __allowed_ingress 2022-02-01 17:25:48 +01:00
caif net-caif: avoid user-triggerable WARN_ON(1) 2021-09-22 12:27:56 +02:00
can can: isotp: stop timeout monitoring when no first frame was sent 2022-04-27 13:53:57 +02:00
ceph
core bpf, lwt: Fix crash when using bpf_skb_set_tunnel_key() from bpf_xmit lwt hook 2022-05-09 09:05:02 +02:00
dcb net: dcb: disable softirqs in dcbnl_flush_dev() 2022-03-08 19:09:37 +01:00
dccp tcp: switch orphan_count to bare per-cpu counters 2021-11-18 14:04:08 +01:00
decnet net: decnet: Fix sleeping inside in af_decnet 2021-07-28 14:35:38 +02:00
dns_resolver
dsa net: dsa: Add missing of_node_put() in dsa_port_link_register_of 2022-05-09 09:05:02 +02:00
ethernet
ethtool ethtool: do not perform operations on net devices being unregistered 2021-12-17 10:14:41 +01:00
hsr net: hsr: fix mac_len checks 2021-06-03 09:00:50 +02:00
ieee802154 net: ieee802154: Return meaningful error codes from the netlink helpers 2022-02-08 18:30:37 +01:00
ife
ipv4 tcp: fix potential xmit stalls caused by TCP_NOTSENT_LOWAT 2022-05-09 09:05:04 +02:00
ipv6 ip_gre, ip6_gre: Fix race condition on o_seqno in collect_md mode 2022-05-09 09:05:04 +02:00
iucv net/af_iucv: remove WARN_ONCE on malformed RX packets 2021-03-07 12:34:05 +01:00
kcm
key af_key: add __GFP_ZERO flag for compose_sadb_supported in function pfkey_register 2022-04-08 14:39:48 +02:00
l2tp net/l2tp: Fix reference count leak in l2tp_udp_recv_core 2021-09-22 12:27:56 +02:00
l3mdev l3mdev: l3mdev_master_upper_ifindex_by_index_rcu should be using netdev_master_upper_dev_get_rcu 2022-04-27 13:53:50 +02:00
lapb
llc llc: only change llc->dev when bind() succeeds 2022-03-28 09:57:10 +02:00
mac80211 mac80211: fix potential double free on mesh join 2022-03-28 09:57:10 +02:00
mac802154 net: mac802154: Fix general protection fault 2021-04-14 08:42:13 +02:00
mpls net: mpls: Fix notifications when deleting a device 2021-12-08 09:03:23 +01:00
mptcp mptcp: clear 'kern' flag from fallback sockets 2021-12-22 09:30:54 +01:00
ncsi net/ncsi: check for error return from call to nla_put_u32 2022-01-05 12:40:32 +01:00
netfilter netfilter: nft_set_rbtree: overlap detection with element re-addition after deletion 2022-05-09 09:05:02 +02:00
netlabel netlabel: fix out-of-bounds memory accesses 2022-04-13 21:01:00 +02:00
netlink netlink: reset network and mac headers in netlink_dump() 2022-04-27 13:53:51 +02:00
netrom netrom: fix api breakage in nr_setsockopt() 2022-01-27 10:54:03 +01:00
nfc nfc: nci: add flush_workqueue to prevent uaf 2022-04-20 09:23:18 +02:00
nsh
openvswitch openvswitch: fix OOB access in reserve_sfa_size() 2022-04-27 13:53:55 +02:00
packet net/packet: fix packet_sock xmit return value checking 2022-04-27 13:53:50 +02:00
phonet phonet: refcount leak in pep_sock_accep 2022-01-11 15:25:01 +01:00
psample net: psample: Fix netlink skb length with tunnel info 2021-03-07 12:34:07 +01:00
qrtr net: qrtr: fix another OOB Read in qrtr_endpoint_post 2021-09-03 10:09:21 +02:00
rds rds: memory leak in __rds_conn_create() 2021-12-22 09:30:54 +01:00
rfkill
rose
rxrpc rxrpc: Restore removed timer deletion 2022-04-27 13:53:49 +02:00
sched net/sched: cls_u32: fix possible leak in u32_init_knode() 2022-04-27 13:53:50 +02:00
sctp sctp: check asoc strreset_chunk in sctp_generate_reconf_event 2022-05-09 09:05:03 +02:00
smc net/smc: sync err code when tcp connection was refused 2022-05-09 09:05:04 +02:00
strparser bpf: sockmap, strparser, and tls are reusing qdisc_skb_cb and colliding 2021-11-18 14:04:27 +01:00
sunrpc SUNRPC: svc_tcp_sendmsg() should handle errors from xdr_alloc_bvec() 2022-04-13 21:01:07 +02:00
switchdev
tipc tipc: fix the timer expires after interval 100ms 2022-04-08 14:40:23 +02:00
tls net/tls: fix slab-out-of-bounds bug in decrypt_internal 2022-04-13 21:01:04 +02:00
unix af_unix: annote lockless accesses to unix_tot_inflight & gc_in_progress 2022-01-27 10:54:31 +01:00
vmw_vsock vsock: each transport cycles only on its own sockets 2022-03-23 09:13:27 +01:00
wimax
wireless nl80211: correctly check NL80211_ATTR_REG_ALPHA2 size 2022-04-20 09:23:28 +02:00
x25 net/x25: Fix null-ptr-deref caused by x25_disconnect 2022-04-08 14:40:30 +02:00
xdp Revert "xsk: Do not sleep in poll() when need_wakeup set" 2021-12-22 09:30:59 +01:00
xfrm xfrm: fix tunnel model fragmentation behavior 2022-04-08 14:39:47 +02:00
compat.c net: Return the correct errno code 2021-06-18 10:00:06 +02:00
devres.c
Kconfig
Makefile
socket.c ethtool: improve compat ioctl handling 2021-09-18 13:40:21 +02:00
sysctl_net.c