linux/net
Daniel Borkmann 67bce5821c net, sched: fix soft lockup in tc_classify
[ Upstream commit 628185cfdd ]

Shahar reported a soft lockup in tc_classify(), where we run into an
endless loop when walking the classifier chain due to tp->next == tp
which is a state we should never run into. The issue only seems to
trigger under load in the tc control path.

What happens is that in tc_ctl_tfilter(), thread A allocates a new
tp, initializes it, sets tp_created to 1, and calls into tp->ops->change()
with it. In that classifier callback we had to unlock/lock the rtnl
mutex and returned with -EAGAIN. One reason why we need to drop there
is, for example, that we need to request an action module to be loaded.

This happens via tcf_exts_validate() -> tcf_action_init/_1() meaning
after we loaded and found the requested action, we need to redo the
whole request so we don't race against others. While we had to unlock
rtnl in that time, thread B's request was processed next on that CPU.
Thread B added a new tp instance successfully to the classifier chain.
When thread A returned grabbing the rtnl mutex again, propagating -EAGAIN
and destroying its tp instance which never got linked, we goto replay
and redo A's request.

This time when walking the classifier chain in tc_ctl_tfilter() for
checking for existing tp instances we had a priority match and found
the tp instance that was created and linked by thread B. Now calling
again into tp->ops->change() with that tp was successful and returned
without error.

tp_created was never cleared in the second round, thus kernel thinks
that we need to link it into the classifier chain (once again). tp and
*back point to the same object due to the match we had earlier on. Thus
for thread B's already public tp, we reset tp->next to tp itself and
link it into the chain, which eventually causes the mentioned endless
loop in tc_classify() once a packet hits the data path.

Fix is to clear tp_created at the beginning of each request, also when
we replay it. On the paths that can cause -EAGAIN we already destroy
the original tp instance we had and on replay we really need to start
from scratch. It seems that this issue was first introduced in commit
12186be7d2 ("net_cls: fix unconfigured struct tcf_proto keeps chaining
and avoid kernel panic when we use cls_cgroup").

Fixes: 12186be7d2 ("net_cls: fix unconfigured struct tcf_proto keeps chaining and avoid kernel panic when we use cls_cgroup")
Reported-by: Shahar Klein <shahark@mellanox.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Tested-by: Shahar Klein <shahark@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-01-15 13:41:34 +01:00
..
6lowpan
9p IB/cma: Add support for network namespaces 2015-10-28 12:32:48 -04:00
802
8021q net: add recursion limit to GRO 2016-11-15 07:46:38 +01:00
appletalk
atm
ax25 AX.25: Close socket connection on session completion 2016-07-11 09:31:12 -07:00
batman-adv batman-adv: Check for alloc errors when preparing TT local data 2016-12-15 08:49:23 -08:00
bluetooth Bluetooth: Fix l2cap_sock_setsockopt() with optname BT_RCVMTU 2016-08-20 18:09:19 +02:00
bridge bridge: multicast: restore perm router ports on multicast enable 2016-11-15 07:46:38 +01:00
caif net: caif: fix misleading indentation 2016-09-30 10:18:35 +02:00
can can: raw: raw_setsockopt: limit number of can_filter that can be set 2016-12-15 08:49:22 -08:00
ceph libceph: verify authorize reply on connect 2017-01-09 08:07:52 +01:00
core net: avoid signed overflows for SO_{SND|RCV}BUFFORCE 2016-12-10 19:07:24 +01:00
dcb
dccp net/dccp: fix use-after-free in dccp_invalid_packet 2016-12-10 19:07:24 +01:00
decnet decnet: Do not build routes to devices without decnet private data. 2016-05-18 17:06:35 -07:00
dns_resolver net: dns_resolver: convert time_t to time64_t 2015-11-18 16:27:46 -05:00
dsa net: dsa: use switchdev obj for VLAN add/del ops 2015-11-01 15:56:11 -05:00
ethernet net: add recursion limit to GRO 2016-11-15 07:46:38 +01:00
hsr net/hsr: fix a warning message 2015-11-23 14:56:15 -05:00
ieee802154 net: fix percpu memory leaks 2015-11-02 22:47:14 -05:00
ipv4 esp4: Fix integrity verification when ESN are used 2016-12-10 19:07:26 +01:00
ipv6 ipv6: handle -EFAULT from skb_copy_bits 2017-01-15 13:41:34 +01:00
ipx
irda net/irda: handle iriap_register_lsap() allocation failure 2016-09-30 10:18:36 +02:00
iucv af_iucv: Validate socket address length in iucv_sock_bind() 2016-03-03 15:07:03 -08:00
key af_key: fix two typos 2015-10-23 03:05:19 -07:00
l2tp l2tp: fix racy SOCK_ZAPPED flag check in l2tp_ip{,6}_bind() 2016-12-10 19:07:23 +01:00
l3mdev
lapb
llc net: fix infoleak in llc 2016-05-18 17:06:40 -07:00
mac80211 mac80211: initialize fast-xmit 'info' later 2017-01-12 11:22:43 +01:00
mac802154
mpls mpls: find_outdev: check for err ptr in addition to NULL check 2016-04-20 15:42:07 +09:00
netfilter netfilter: nft_dynset: fix element timeout for HZ != 1000 2016-11-26 09:54:54 +01:00
netlabel netlabel: add address family checks to netlbl_{sock,req}_delattr() 2016-08-20 18:09:22 +02:00
netlink netlink: Do not schedule work from sk_destruct 2016-12-10 19:07:23 +01:00
netrom
nfc net: rename SOCK_ASYNC_NOSPACE and SOCK_ASYNC_WAITDATA 2015-12-01 15:45:05 -05:00
openvswitch vxlan, gre, geneve: Set a large MTU on ovs-created tunnel devices 2016-06-24 10:18:18 -07:00
packet packet: fix race condition in packet_set_ring 2016-12-10 19:07:24 +01:00
phonet phonet: properly unshare skbs in phonet_rcv() 2016-01-31 11:29:00 -08:00
rds rds: fix an infoleak in rds_inc_info_copy 2016-09-15 08:27:51 +02:00
rfkill rfkill: fix rfkill_fop_read wait_event usage 2016-03-03 15:07:26 -08:00
rose
rxrpc net: rename SOCK_ASYNC_NOSPACE and SOCK_ASYNC_WAITDATA 2015-12-01 15:45:05 -05:00
sched net, sched: fix soft lockup in tc_classify 2017-01-15 13:41:34 +01:00
sctp sctp: assign assoc_id earlier in __sctp_connect 2016-11-21 10:06:40 +01:00
sunrpc sunrpc: fix write space race causing stalls 2016-10-28 03:01:31 -04:00
switchdev switchdev: pass pointer to fib_info instead of copy 2016-06-24 10:18:16 -07:00
tipc tipc: fix NULL pointer dereference in shutdown() 2016-09-30 10:18:36 +02:00
unix af_unix: conditionally use freezable blocking calls in read 2016-12-10 19:07:22 +01:00
vmw_vsock VSOCK: do not disconnect socket when peer has shutdown SEND only 2016-05-18 17:06:41 -07:00
wimax
wireless cfg80211/mac80211: fix BSS leaks when abandoning assoc attempts 2017-01-09 08:07:42 +01:00
x25 net: fix a kernel infoleak in x25 module 2016-05-18 17:06:43 -07:00
xfrm xfrm: Fix crash observed during device unregistration and decryption 2016-04-20 15:42:05 +09:00
compat.c
Kconfig
Makefile
socket.c sock: fix sendmmsg for partial sendmsg 2016-11-21 10:06:40 +01:00
sysctl_net.c net: Use ns_capable_noaudit() when determining net sysctl permissions 2016-09-15 08:27:50 +02:00