linux

mirror of https://github.com/torvalds/linux.git synced 2026-05-26 16:12:59 +02:00

Author	SHA1	Message	Date
Gerd Rausch	9d27a0fb12	net/rds: Trigger rds_send_ping() more than once Even though a peer may have already received a non-zero value for "RDS_EXTHDR_NPATHS" from a node in the past, the current peer may not. Therefore it is important to initiate another rds_send_ping() after a re-connect to any peer: It is unknown at that time if we're still talking to the same instance of RDS kernel modules on the other side. Otherwise, the peer may just operate on a single lane ("c_npaths == 0"), not knowing that more lanes are available. However, if "c_with_sport_idx" is supported, we also need to check that the connection we accepted on lane#0 meets the proper source port modulo requirement, as we fan out: Since the exchange of "RDS_EXTHDR_NPATHS" and "RDS_EXTHDR_SPORT_IDX" is asynchronous, initially we have no choice but to accept an incoming connection (via "accept") in the first slot ("cp_index == 0") for backwards compatibility. But that very connection may have come from a different lane with "cp_index != 0", since the peer thought that we already understood and handled "c_with_sport_idx" properly, as indicated by a previous exchange before a module was reloaded. In short: If a module gets reloaded, we recover from that, but do not allow a downgrade to support fewer lanes. Downgrades would require us to merge messages from separate lanes, which is rather tricky with the current RDS design. Each lane has its own sequence number space and all messages would need to be re-sequenced as we merge, all while handling "RDS_FLAG_RETRANSMITTED" and "cp_retrans" properly. Signed-off-by: Gerd Rausch <gerd.rausch@oracle.com> Signed-off-by: Allison Henderson <allison.henderson@oracle.com> Link: https://patch.msgid.link/20260203055723.1085751-9-achender@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-04 20:46:39 -08:00
Gerd Rausch	826c1004d4	net/rds: rds_tcp_conn_path_shutdown must not discard messages RDS/TCP differs from RDS/RDMA in that message acknowledgment is done based on TCP sequence numbers: As soon as the last byte of a message has been acknowledged by the TCP stack of a peer, rds_tcp_write_space() goes on to discard prior messages from the send queue. Which is fine, for as long as the receiver never throws any messages away. The dequeuing of messages in RDS/TCP is done either from the "sk_data_ready" callback pointing to rds_tcp_data_ready() (the most common case), or from the receive worker pointing to rds_tcp_recv_path() which is called for as long as the connection is "RDS_CONN_UP". However, as soon as rds_conn_path_drop() is called for whatever reason, including "DR_USER_RESET", "cp_state" transitions to "RDS_CONN_ERROR", and rds_tcp_restore_callbacks() ends up restoring the callbacks and thereby disabling message receipt. So messages already acknowledged to the sender were dropped. Furthermore, the "->shutdown" callback was always called with an invalid parameter ("RCV_SHUTDOWN \| SEND_SHUTDOWN == 3"), instead of the correct pre-increment value ("SHUT_RDWR == 2"). inet_shutdown() returns "-EINVAL" in such cases, rendering this call a NOOP. So we change rds_tcp_conn_path_shutdown() to do the proper "->shutdown(SHUT_WR)" call in order to signal EOF to the peer and make it transition to "TCP_CLOSE_WAIT" (RFC 793). This should make the peer also enter rds_tcp_conn_path_shutdown() and do the same. This allows us to dequeue all messages already received and acknowledged to the peer. We do so, until we know that the receive queue no longer has data (skb_queue_empty()) and that we couldn't have any data in flight anymore, because the socket transitioned to any of the states "CLOSING", "TIME_WAIT", "CLOSE_WAIT", "LAST_ACK", or "CLOSE" (RFC 793). However, if we do just that, we suddenly see duplicate RDS messages being delivered to the application. So what gives? Turns out that with MPRDS and its multitude of backend connections, retransmitted messages ("RDS_FLAG_RETRANSMITTED") can outrace the dequeuing of their original counterparts. And the duplicate check implemented in rds_recv_local() only discards duplicates if flag "RDS_FLAG_RETRANSMITTED" is set. Rather curious, because a duplicate is a duplicate; it shouldn't matter which copy is looked at and delivered first. To avoid this entire situation, we simply make the sender discard messages from the send-queue right from within rds_tcp_conn_path_shutdown(). Just like rds_tcp_write_space() would have done, were it called in time or still called. This makes sure that we no longer have messages that we know the receiver already dequeued sitting in our send-queue, and therefore avoid the entire "RDS_FLAG_RETRANSMITTED" fiasco. Now we got rid of the duplicate RDS message delivery, but we still run into cases where RDS messages are dropped. This time it is due to the delayed setting of the socket-callbacks in rds_tcp_accept_one() via either rds_tcp_reset_callbacks() or rds_tcp_set_callbacks(). By the time rds_tcp_accept_one() gets there, the socket may already have transitioned into state "TCP_CLOSE_WAIT", but rds_tcp_state_change() was never called. Subsequently, "->shutdown(SHUT_WR)" did not happen either. So the peer ends up getting stuck in state "TCP_FIN_WAIT2". We fix that by checking for states "TCP_CLOSE_WAIT", "TCP_LAST_ACK", or "TCP_CLOSE" and drop the freshly accepted socket in that case. This problem is observable by running "rds-stress --reset" frequently on either of the two sides of a RDS connection, or both while other "rds-stress" processes are exchanging data. Those "rds-stress" processes reported out-of-sequence errors, with the expected sequence number being smaller than the one actually received (due to the dropped messages). Signed-off-by: Gerd Rausch <gerd.rausch@oracle.com> Signed-off-by: Allison Henderson <allison.henderson@oracle.com> Link: https://patch.msgid.link/20260203055723.1085751-4-achender@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-04 20:46:38 -08:00
Gerd Rausch	a20a699255	net/rds: Encode cp_index in TCP source port Upon "sendmsg", RDS/TCP selects a backend connection based on a hash calculated from the source-port ("RDS_MPATH_HASH"). However, "rds_tcp_accept_one" accepts connections in the order they arrive, which is non-deterministic. Therefore the mapping of the sender's "cp->cp_index" to that of the receiver changes if the backend connections are dropped and reconnected. However, connection state that's preserved across reconnects (e.g. "cp_next_rx_seq") relies on that sender<->receiver mapping to never change. So we make sure that client and server of the TCP connection have the exact same "cp->cp_index" across reconnects by encoding "cp->cp_index" in the lower three bits of the client's TCP source port. A new extension "RDS_EXTHDR_SPORT_IDX" is introduced, that allows the server to tell the difference between clients that do the "cp->cp_index" encoding, and legacy clients that pick source ports randomly. Signed-off-by: Gerd Rausch <gerd.rausch@oracle.com> Signed-off-by: Allison Henderson <allison.henderson@oracle.com> Link: https://patch.msgid.link/20260203055723.1085751-3-achender@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-04 20:46:38 -08:00
Gerd Rausch	db69e9b838	net/rds: rds_tcp_accept_one ought to not discard messages RDS/TCP differs from RDS/RDMA in that message acknowledgment is done based on TCP sequence numbers: As soon as the last byte of a message has been acknowledged by the TCP stack of a peer, "rds_tcp_write_space()" goes on to discard prior messages from the send queue. Which is fine, for as long as the receiver never throws any messages away. Unfortunately, that is not the case since the introduction of MPRDS: commit `1a0e100fb2` "RDS: TCP: Enable multipath RDS for TCP" A new function "rds_tcp_accept_one_path" was introduced, which is entitled to return "NULL", if no connection path is currently available. Unfortunately, this happens after the "->accept()" call, and the new socket often already contains messages, since the peer already transitioned to "RDS_CONN_UP" on behalf of "TCP_ESTABLISHED". That's also the case after this [1]: commit `1a0e100fb2` "RDS: TCP: Force every connection to be initiated by numerically smaller IP address" which tried to address the situation of pending data by only transitioning connections from a smaller IP address to "RDS_CONN_UP". But even in those cases, and in particular if the "RDS_EXTHDR_NPATHS" handshake has not occurred yet, and therefore we're working with "c_npaths <= 1", "c_conn[0]" may be in a state distinct from "RDS_CONN_DOWN", and therefore all messages on the just accepted socket will be tossed away. This fix changes "rds_tcp_accept_one": * If connected from a peer with a larger IP address, the new socket will continue to get closed right away. With commit [1] above, there should not be any messages in the socket receive buffer, since the peer never transitioned to "RDS_CONN_UP". Therefore it should be okay to not make any efforts to dispatch the socket receive buffer. * If connected from a peer with a smaller IP address, we call "rds_tcp_accept_one_path" to find a free slot/"path". If found, business goes on as usual. If none was found, we save/stash the newly accepted socket into "rds_tcp_accepted_sock", in order to not lose any messages that may have arrived already. We then return from "rds_tcp_accept_one" with "-ENOBUFS". Later on, when a slot/"path" does become available again (e.g. state transitioned to "RDS_CONN_DOWN", or HS extension header was received with "c_npaths > 1") we call "rds_tcp_conn_slots_available" that simply re-issues a "rds_tcp_accept_one_path" worker-callback and picks up the new socket from "rds_tcp_accepted_sock", and thereby continuing where it left with "-ENOBUFS" last time. Since a new slot has become available, those messages won't be lost, since processing proceeds as if that slot had been available the first time around. Signed-off-by: Gerd Rausch <gerd.rausch@oracle.com> Signed-off-by: Jack Vogel <jack.vogel@oracle.com> Signed-off-by: Allison Henderson <allison.henderson@oracle.com> Link: https://patch.msgid.link/20260122055213.83608-3-achender@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-01-23 11:51:31 -08:00
Yue Haibing	2b8893b639	net/rds: Remove unused function declarations Commit `39de828179` ("RDS: Main header file") declared but never implemented rds_trans_init() and rds_trans_exit(), remove it. Commit `d37c935905` ("RDS: Move loop-only function to loop.c") removed the implementation rds_message_inc_free() but not the declaration. Since commit `55b7ed0b58` ("RDS: Common RDMA transport code") rds_rdma_conn_connect() is never implemented and used. rds_tcp_map_seq() is never implemented and used since commit `70041088e3` ("RDS: Add TCP transport to RDS"). Signed-off-by: Yue Haibing <yuehaibing@huawei.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2023-08-13 12:25:42 +01:00
Tetsuo Handa	6997fbd7a3	net: rds: use maybe_get_net() when acquiring refcount on TCP sockets Eric Dumazet is reporting addition on 0 problem at rds_tcp_tune(), for delayed works queued in rds_wq might be invoked after a net namespace's refcount already reached 0. Since rds_tcp_exit_net() from cleanup_net() calls flush_workqueue(rds_wq), it is guaranteed that we can instead use maybe_get_net() from delayed work functions until rds_tcp_exit_net() returns. Note that I'm not convinced that all works which might access a net namespace are already queued in rds_wq by the moment rds_tcp_exit_net() calls flush_workqueue(rds_wq). If some race is there, rds_tcp_exit_net() will fail to wait for work functions, and kmem_cache_free() could be called from net_free() before maybe_get_net() is called from rds_tcp_tune(). Reported-by: Eric Dumazet <edumazet@google.com> Fixes: `3a58f13a88` ("net: rds: acquire refcount on TCP sockets") Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/41d09faf-bc78-1a87-dfd1-c6d1b5984b61@I-love.SAKURA.ne.jp Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2022-05-05 16:44:49 -07:00
Rao Shoaib	aced3ce57c	RDS tcp loopback connection can hang When TCP is used as transport and a program on the system connects to RDS port 16385, connection is accepted but denied per the rules of RDS. However, RDS connections object is left in the list. Next loopback connection will select that connection object as it is at the head of list. The connection attempt will hang as the connection object is set to connect over TCP which is not allowed The issue can be reproduced easily, use rds-ping to ping a local IP address. After that use any program like ncat to connect to the same IP address and port 16385. This will hang so ctrl-c out. Now try rds-ping, it will hang. To fix the issue this patch adds checks to disallow the connection object creation and destroys the connection object. Signed-off-by: Rao Shoaib <rao.shoaib@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-05-21 14:46:59 -07:00
Christoph Hellwig	480aeb9639	tcp: add tcp_sock_set_keepcnt Add a helper to directly set the TCP_KEEPCNT sockopt from kernel space without going through a fake uaccess. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David S. Miller <davem@davemloft.net>	2020-05-28 11:11:45 -07:00
Christoph Hellwig	12abc5ee78	tcp: add tcp_sock_set_nodelay Add a helper to directly set the TCP_NODELAY sockopt from kernel space without going through a fake uaccess. Cleanup the callers to avoid pointless wrappers now that this is a simple function call. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Sagi Grimberg <sagi@grimberg.me> Acked-by: Jason Gunthorpe <jgg@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2020-05-28 11:11:45 -07:00
Christoph Hellwig	c433594c07	net: add sock_no_linger Add a helper to directly set the SO_LINGER sockopt from kernel space with onoff set to true and a linger time of 0 without going through a fake uaccess. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: David S. Miller <davem@davemloft.net>	2020-05-28 11:11:44 -07:00
Ka-Cheong Poon	1e2b44e78e	rds: Enable RDS IPv6 support This patch enables RDS to use IPv6 addresses. For RDS/TCP, the listener is now an IPv6 endpoint which accepts both IPv4 and IPv6 connection requests. RDS/RDMA/IB uses a private data (struct rds_ib_connect_private) exchange between endpoints at RDS connection establishment time to support RDMA. This private data exchange uses a 32 bit integer to represent an IP address. This needs to be changed in order to support IPv6. A new private data struct rds6_ib_connect_private is introduced to handle this. To ensure backward compatibility, an IPv6 capable RDS stack uses another RDMA listener port (RDS_CM_PORT) to accept IPv6 connection. And it continues to use the original RDS_PORT for IPv4 RDS connections. When it needs to communicate with an IPv6 peer, it uses the RDS_CM_PORT to send the connection set up request. v5: Fixed syntax problem (David Miller). v4: Changed port history comments in rds.h (Sowmini Varadhan). v3: Added support to set up IPv4 connection using mapped address (David Miller). Added support to set up connection between link local and non-link addresses. Various review comments from Santosh Shilimkar and Sowmini Varadhan. v2: Fixed bound and peer address scope mismatched issue. Added back rds_connect() IPv6 changes. Signed-off-by: Ka-Cheong Poon <ka-cheong.poon@oracle.com> Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-23 21:17:44 -07:00
David S. Miller	5ca114400d	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net en_rx_am.c was deleted in 'net-next' but had a bug fixed in it in 'net'. The esp{4,6}_offload.c conflicts were overlapping changes. The 'out' label is removed so we just return ERR_PTR(-EINVAL) directly. Signed-off-by: David S. Miller <davem@davemloft.net>	2018-01-23 13:51:56 -05:00
Sowmini Varadhan	b589513e63	rds: tcp: compute m_ack_seq as offset from ->write_seq rds-tcp uses m_ack_seq to track the tcp ack# that indicates that the peer has received a rds_message. The m_ack_seq is used in rds_tcp_is_acked() to figure out when it is safe to drop the rds_message from the RDS retransmit queue. The m_ack_seq must be calculated as an offset from the right edge of the in-flight tcp buffer, i.e., it should be based on the ->write_seq, not the ->snd_nxt. Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-01-22 15:43:54 -05:00
Sowmini Varadhan	f10b4cff98	rds: tcp: atomically purge entries from rds_tcp_conn_list during netns delete The rds_tcp_kill_sock() function parses the rds_tcp_conn_list to find the rds_connection entries marked for deletion as part of the netns deletion under the protection of the rds_tcp_conn_lock. Since the rds_tcp_conn_list tracks rds_tcp_connections (which have a 1:1 mapping with rds_conn_path), multiple tc entries in the rds_tcp_conn_list will map to a single rds_connection, and will be deleted as part of the rds_conn_destroy() operation that is done outside the rds_tcp_conn_lock. The rds_tcp_conn_list traversal done under the protection of rds_tcp_conn_lock should not leave any doomed tc entries in the list after the rds_tcp_conn_lock is released, else another concurrently executiong netns delete (for a differnt netns) thread may trip on these entries. Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-12-01 15:25:15 -05:00
Greg Kroah-Hartman	b24413180f	License cleanup: add SPDX GPL-2.0 license identifier to files with no license Many source files in the tree are missing licensing information, which makes it harder for compliance tools to determine the correct license. By default all files without license information are under the default license of the kernel, which is GPL version 2. Update the files which contain no license information with the 'GPL-2.0' SPDX license identifier. The SPDX identifier is a legally binding shorthand, which can be used instead of the full boiler plate text. This patch is based on work done by Thomas Gleixner and Kate Stewart and Philippe Ombredanne. How this work was done: Patches were generated and checked against linux-4.14-rc6 for a subset of the use cases: - file had no licensing information it it. - file was a /uapi/ one with no licensing information in it, - file was a /uapi/ one with existing licensing information, Further patches will be generated in subsequent months to fix up cases where non-standard license headers were used, and references to license had to be inferred by heuristics based on keywords. The analysis to determine which SPDX License Identifier to be applied to a file was done in a spreadsheet of side by side results from of the output of two independent scanners (ScanCode & Windriver) producing SPDX tag:value files created by Philippe Ombredanne. Philippe prepared the base worksheet, and did an initial spot review of a few 1000 files. The 4.13 kernel was the starting point of the analysis with 60,537 files assessed. Kate Stewart did a file by file comparison of the scanner results in the spreadsheet to determine which SPDX license identifier(s) to be applied to the file. She confirmed any determination that was not immediately clear with lawyers working with the Linux Foundation. Criteria used to select files for SPDX license identifier tagging was: - Files considered eligible had to be source code files. - Make and config files were included as candidates if they contained >5 lines of source - File already had some variant of a license header in it (even if <5 lines). All documentation files were explicitly excluded. The following heuristics were used to determine which SPDX license identifiers to apply. - when both scanners couldn't find any license traces, file was considered to have no license information in it, and the top level COPYING file license applied. For non /uapi/ files that summary was: SPDX license identifier # files ---------------------------------------------------\|------- GPL-2.0 11139 and resulted in the first patch in this series. If that file was a /uapi/ path one, it was "GPL-2.0 WITH Linux-syscall-note" otherwise it was "GPL-2.0". Results of that was: SPDX license identifier # files ---------------------------------------------------\|------- GPL-2.0 WITH Linux-syscall-note 930 and resulted in the second patch in this series. - if a file had some form of licensing information in it, and was one of the /uapi/ ones, it was denoted with the Linux-syscall-note if any GPL family license was found in the file or had no licensing in it (per prior point). Results summary: SPDX license identifier # files ---------------------------------------------------\|------ GPL-2.0 WITH Linux-syscall-note 270 GPL-2.0+ WITH Linux-syscall-note 169 ((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause) 21 ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause) 17 LGPL-2.1+ WITH Linux-syscall-note 15 GPL-1.0+ WITH Linux-syscall-note 14 ((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause) 5 LGPL-2.0+ WITH Linux-syscall-note 4 LGPL-2.1 WITH Linux-syscall-note 3 ((GPL-2.0 WITH Linux-syscall-note) OR MIT) 3 ((GPL-2.0 WITH Linux-syscall-note) AND MIT) 1 and that resulted in the third patch in this series. - when the two scanners agreed on the detected license(s), that became the concluded license(s). - when there was disagreement between the two scanners (one detected a license but the other didn't, or they both detected different licenses) a manual inspection of the file occurred. - In most cases a manual inspection of the information in the file resulted in a clear resolution of the license that should apply (and which scanner probably needed to revisit its heuristics). - When it was not immediately clear, the license identifier was confirmed with lawyers working with the Linux Foundation. - If there was any question as to the appropriate license identifier, the file was flagged for further research and to be revisited later in time. In total, over 70 hours of logged manual review was done on the spreadsheet to determine the SPDX license identifiers to apply to the source files by Kate, Philippe, Thomas and, in some cases, confirmation by lawyers working with the Linux Foundation. Kate also obtained a third independent scan of the 4.13 code base from FOSSology, and compared selected files where the other two scanners disagreed against that SPDX file, to see if there was new insights. The Windriver scanner is based on an older version of FOSSology in part, so they are related. Thomas did random spot checks in about 500 files from the spreadsheets for the uapi headers and agreed with SPDX license identifier in the files he inspected. For the non-uapi files Thomas did random spot checks in about 15000 files. In initial set of patches against 4.14-rc6, 3 files were found to have copy/paste license identifier errors, and have been fixed to reflect the correct identifier. Additionally Philippe spent 10 hours this week doing a detailed manual inspection and review of the 12,461 patched files from the initial patch version early this week with: - a full scancode scan run, collecting the matched texts, detected license ids and scores - reviewing anything where there was a license detected (about 500+ files) to ensure that the applied SPDX license was correct - reviewing anything where there was no detection but the patch license was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied SPDX license was correct This produced a worksheet with 20 files needing minor correction. This worksheet was then exported into 3 different .csv files for the different types of files to be modified. These .csv files were then reviewed by Greg. Thomas wrote a script to parse the csv files and add the proper SPDX tag to the file, in the format that the file expected. This script was further refined by Greg based on the output to detect more types of files automatically and to distinguish between header and source .c files (which need different comment types.) Finally Greg ran the script using the .csv files to generate the patches. Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org> Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2017-11-02 11:10:55 +01:00
Sowmini Varadhan	c14b036681	rds: tcp: set linger to 1 when unloading a rds-tcp If we are unloading the rds_tcp module, we can set linger to 1 and drop pending packets to accelerate reconnect. The peer will end up resetting the connection based on new generation numbers of the new incarnation, so hanging on to unsent TCP packets via linger is mostly pointless in this case. Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Tested-by: Jenny Xu <jenny.x.xu@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-06-22 11:34:04 -04:00
Sowmini Varadhan	b21dd4506b	rds: tcp: Sequence teardown of listen and acceptor sockets to avoid races Commit `a93d01f577` ("RDS: TCP: avoid bad page reference in rds_tcp_listen_data_ready") added the function rds_tcp_listen_sock_def_readable() to handle the case when a partially set-up acceptor socket drops into rds_tcp_listen_data_ready(). However, if the listen socket (rtn->rds_tcp_listen_sock) is itself going through a tear-down via rds_tcp_listen_stop(), the (ready)() will be null and we would hit a panic of the form BUG: unable to handle kernel NULL pointer dereference at (null) IP: (null) : ? rds_tcp_listen_data_ready+0x59/0xb0 [rds_tcp] tcp_data_queue+0x39d/0x5b0 tcp_rcv_established+0x2e5/0x660 tcp_v4_do_rcv+0x122/0x220 tcp_v4_rcv+0x8b7/0x980 : In the above case, it is not fatal to encounter a NULL value for ready- we should just drop the packet and let the flush of the acceptor thread finish gracefully. In general, the tear-down sequence for listen() and accept() socket that is ensured by this commit is: rtn->rds_tcp_listen_sock = NULL; / prevent any new accepts */ In rds_tcp_listen_stop(): serialize with, and prevent, further callbacks using lock_sock() flush rds_wq flush acceptor workq sock_release(listen socket) Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-03-07 14:09:59 -08:00
Sowmini Varadhan	a93d01f577	RDS: TCP: avoid bad page reference in rds_tcp_listen_data_ready As the existing comments in rds_tcp_listen_data_ready() indicate, it is possible under some race-windows to get to this function with the accept() socket. If that happens, we could run into a sequence whereby thread 1 thread 2 rds_tcp_accept_one() thread sets up new_sock via ->accept(). The sk_user_data is now sock_def_readable data comes in for new_sock, ->sk_data_ready is called, and we land in rds_tcp_listen_data_ready rds_tcp_set_callbacks() takes the sk_callback_lock and sets up sk_user_data to be the cp read_lock sk_callback_lock ready = cp unlock sk_callback_lock page fault on ready In the above sequence, we end up with a panic on a bad page reference when trying to execute (*ready)(). Instead we need to call sock_def_readable() safely, which is what this patch achieves. Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-07-15 11:36:57 -07:00
Sowmini Varadhan	b04e8554f7	RDS: TCP: Hooks to set up a single connection path This patch adds ->conn_path_connect callbacks in the rds_transport that are used to set up a single connection path. Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-07-01 16:45:17 -04:00
Sowmini Varadhan	2da43c4a1b	RDS: TCP: make receive path use the rds_conn_path The ->sk_user_data contains a pointer to the rds_conn_path for the socket. Use this consistently in the rds_tcp_data_ready callbacks to get the rds_conn_path for rds_recv_incoming. Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-07-01 16:45:17 -04:00
Sowmini Varadhan	ea3b1ea539	RDS: TCP: make ->sk_user_data point to a rds_conn_path The socket callbacks should all operate on a struct rds_conn_path, in preparation for a MP capable RDS-TCP. Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-07-01 16:45:17 -04:00
Sowmini Varadhan	02105b2ccd	RDS: TCP: Make rds_tcp_connection track the rds_conn_path The struct rds_tcp_connection is the transport-specific private data structure that tracks TCP information per rds_conn_path. Modify this structure to have a back-pointer to the rds_conn_path for which it is the ->cp_transport_data. Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-07-01 16:45:17 -04:00
Sowmini Varadhan	226f7a7d97	RDS: Rework path specific indirections Refactor code to avoid separate indirections for single-path and multipath transports. All transports (both single and mp-capable) will get a pointer to the rds_conn_path, and can trivially derive the rds_connection from the ->cp_conn. Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-07-01 16:45:17 -04:00
Joshua Houghton	5c3da57d70	net: rds: fix coding style issues Fix coding style issues in the following files: ib_cm.c: add space loop.c: convert spaces to tabs sysctl.c: add space tcp.h: convert spaces to tabs tcp_connect.c:remove extra indentation in switch statement tcp_recv.c: convert spaces to tabs tcp_send.c: convert spaces to tabs transport.c: move brace up one line on for statement Signed-off-by: Joshua Houghton <josh@awful.name> Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-06-18 21:34:09 -07:00
Sowmini Varadhan	335b48d980	RDS: TCP: Add/use rds_tcp_reset_callbacks to reset tcp socket safely When rds_tcp_accept_one() has to replace the existing tcp socket with a newer tcp socket (duelling-syn resolution), it must lock_sock() to suppress the rds_tcp_data_recv() path while callbacks are being changed. Also, existing RDS datagram reassembly state must be reset, so that the next datagram on the new socket does not have corrupted state. Similarly when resetting the newly accepted socket, appropriate locks and synchronization is needed. This commit ensures correct synchronization by invoking kernel_sock_shutdown to reset a newly accepted sock, and by taking appropriate lock_sock()s (for old and new sockets) when resetting existing callbacks. Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-06-07 15:10:15 -07:00
Sowmini Varadhan	bd7c5f983f	RDS: TCP: Synchronize accept() and connect() paths on t_conn_lock. An arbitration scheme for duelling SYNs is implemented as part of commit `241b271952` ("RDS-TCP: Reset tcp callbacks if re-using an outgoing socket in rds_tcp_accept_one()") which ensures that both nodes involved will arrive at the same arbitration decision. However, this needs to be synchronized with an outgoing SYN to be generated by rds_tcp_conn_connect(). This commit achieves the synchronization through the t_conn_lock mutex in struct rds_tcp_connection. The rds_conn_state is checked in rds_tcp_conn_connect() after acquiring the t_conn_lock mutex. A SYN is sent out only if the RDS connection is not already UP (an UP would indicate that rds_tcp_accept_one() has completed 3WH, so no SYN needs to be generated). Similarly, the rds_conn_state is checked in rds_tcp_accept_one() after acquiring the t_conn_lock mutex. The only acceptable states (to allow continuation of the arbitration logic) are UP (i.e., outgoing SYN was SYN-ACKed by peer after it sent us the SYN) or CONNECTING (we sent outgoing SYN before we saw incoming SYN). Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-05-03 16:03:44 -04:00
Sowmini Varadhan	467fa15356	RDS-TCP: Support multiple RDS-TCP listen endpoints, one per netns. Register pernet subsys init/stop functions that will set up and tear down per-net RDS-TCP listen endpoints. Unregister pernet subusys functions on 'modprobe -r' to clean up these end points. Enable keepalive on both accept and connect socket endpoints. The keepalive timer expiration will ensure that client socket endpoints will be removed as appropriate from the netns when an interface is removed from a namespace. Register a device notifier callback that will clean up all sockets (and thus avoid the need to wait for keepalive timeout) when the loopback device is unregistered from the netns indicating that the netns is getting deleted. Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-08-07 11:29:58 -07:00
Al Viro	c310e72c89	rds: switch ->inc_copy_to_user() to passing iov_iter instances get considerably simpler from that... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2014-11-24 05:16:43 -05:00
David S. Miller	676d23690f	net: Fix use after free by removing length arg from sk_data_ready callbacks. Several spots in the kernel perform a sequence like: skb_queue_tail(&sk->s_receive_queue, skb); sk->sk_data_ready(sk, skb->len); But at the moment we place the SKB onto the socket receive queue it can be consumed and freed up. So this skb->len access is potentially to freed up memory. Furthermore, the skb->len can be modified by the consumer so it is possible that the value isn't accurate. And finally, no actual implementation of this callback actually uses the length argument. And since nobody actually cared about it's value, lots of call sites pass arbitrary values in such as '0' and even '1'. So just remove the length argument from the callback, that way there is no confusion whatsoever and all of these use-after-free cases get fixed as a side effect. Based upon a patch by Eric Dumazet and his suggestion to audit this issue tree-wide. Signed-off-by: David S. Miller <davem@davemloft.net>	2014-04-11 16:15:36 -04:00
stephen hemminger	ff51bf8415	rds: make local functions/variables static The RDS protocol has lots of functions that should be declared static. rds_message_get/add_version_extension is removed since it defined but never used. Signed-off-by: Stephen Hemminger <shemminger@vyatta.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-10-21 04:26:39 -07:00
Zach Brown	ef87b7ea39	RDS: remove __init and __exit annotation The trivial amount of memory saved isn't worth the cost of dealing with section mismatches. Signed-off-by: Zach Brown <zach.brown@oracle.com>	2010-09-08 18:16:39 -07:00
Andy Grover	77dd550e55	RDS: Stop supporting old cong map sending method We now ask the transport to give us a rm for the congestion map, and then we handle it normally. Previously, the transport defined a function that we would call to send a congestion map. Convert TCP and loop transports to new cong map method. Signed-off-by: Andy Grover <andy.grover@oracle.com>	2010-09-08 18:12:10 -07:00
Andy Grover	809fa148a2	RDS: inc_purge() transport function unused - remove it Signed-off-by: Andy Grover <andy.grover@oracle.com>	2010-09-08 18:11:46 -07:00
Andy Grover	70041088e3	RDS: Add TCP transport to RDS This code allows RDS to be tunneled over a TCP connection. RDMA operations are disabled when using TCP transport, but this frees RDS from the IB/RDMA stack dependency, and allows it to be used with standard Ethernet adapters, or in a VM. Signed-off-by: Andy Grover <andy.grover@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2009-08-23 19:13:02 -07:00

34 Commits