mirror of
https://github.com/torvalds/linux.git
synced 2026-05-24 07:03:03 +02:00
Merge branch 'doc-mptcp-new-general-doc-and-fixes'
Matthieu Baerts says:

====================
doc: mptcp: new general doc and fixes

A general documentation about MPTCP was missing since its introduction
in v5.6. The last patch adds a new 'mptcp' page in the 'networking'
documentation. The first patch is a fix for a missing sysctl entry
introduced in v6.10 rc0, and the second one reorders the sysctl entries.

Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
====================

v2: https://lore.kernel.org/r/20240528-upstream-net-20240520-mptcp-doc-v2-0-47f2d5bc2ef3@kernel.org
v1: https://lore.kernel.org/r/20240520-upstream-net-20240520-mptcp-doc-v1-0-e3ad294382cb@kernel.org
Link: https://lore.kernel.org/r/20240530-upstream-net-20240520-mptcp-doc-v3-0-e94cdd9f2673@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
commit d1f9e6513e
@ -72,6 +72,7 @@ Contents:
mac80211-injection
mctp
mpls-sysctl
mptcp
mptcp-sysctl
multiqueue
multi-pf-netdev
@ -7,14 +7,6 @@ MPTCP Sysfs variables
/proc/sys/net/mptcp/* Variables
===============================

enabled - BOOLEAN
        Control whether MPTCP sockets can be created.

        MPTCP sockets can be created if the value is 1. This is a
        per-namespace sysctl.

        Default: 1 (enabled)

add_addr_timeout - INTEGER (seconds)
        Set the timeout after which an ADD_ADDR control message will be
        resent to an MPTCP peer that has not acknowledged a previous
@ -25,25 +17,6 @@ add_addr_timeout - INTEGER (seconds)

        Default: 120

close_timeout - INTEGER (seconds)
        Set the make-after-break timeout: in absence of any close or
        shutdown syscall, MPTCP sockets will maintain the status
        unchanged for such time, after the last subflow removal, before
        moving to TCP_CLOSE.

        The default value matches TCP_TIMEWAIT_LEN. This is a per-namespace
        sysctl.

        Default: 60

checksum_enabled - BOOLEAN
        Control whether DSS checksum can be enabled.

        DSS checksum can be enabled if the value is nonzero. This is a
        per-namespace sysctl.

        Default: 0

allow_join_initial_addr_port - BOOLEAN
        Allow peers to send join requests to the IP address and port number used
        by the initial subflow if the value is 1. This controls a flag that is
@ -57,6 +30,37 @@ allow_join_initial_addr_port - BOOLEAN

        Default: 1

available_schedulers - STRING
        Shows the available scheduler choices that are registered. More packet
        schedulers may be available, but not loaded.

checksum_enabled - BOOLEAN
        Control whether DSS checksum can be enabled.

        DSS checksum can be enabled if the value is nonzero. This is a
        per-namespace sysctl.

        Default: 0

close_timeout - INTEGER (seconds)
        Set the make-after-break timeout: in absence of any close or
        shutdown syscall, MPTCP sockets will maintain the status
        unchanged for such time, after the last subflow removal, before
        moving to TCP_CLOSE.

        The default value matches TCP_TIMEWAIT_LEN. This is a per-namespace
        sysctl.

        Default: 60

enabled - BOOLEAN
        Control whether MPTCP sockets can be created.

        MPTCP sockets can be created if the value is 1. This is a
        per-namespace sysctl.

        Default: 1 (enabled)

pm_type - INTEGER
        Set the default path manager type to use for each new MPTCP
        socket. In-kernel path management will control subflow
@ -74,6 +78,14 @@ pm_type - INTEGER

        Default: 0

scheduler - STRING
        Select the scheduler of your choice.

        Support for selection of different schedulers. This is a per-namespace
        sysctl.

        Default: "default"

stale_loss_cnt - INTEGER
        The number of MPTCP-level retransmission intervals with no traffic and
        pending outstanding data on a given subflow required to declare it stale.
@ -85,11 +97,3 @@ stale_loss_cnt - INTEGER
        This is a per-namespace sysctl.

        Default: 4

scheduler - STRING
        Select the scheduler of your choice.

        Support for selection of different schedulers. This is a per-namespace
        sysctl.

        Default: "default"
156
Documentation/networking/mptcp.rst
Normal file
@ -0,0 +1,156 @@
.. SPDX-License-Identifier: GPL-2.0

=====================
Multipath TCP (MPTCP)
=====================

Introduction
============

Multipath TCP or MPTCP is an extension to the standard TCP and is described in
`RFC 8684 (MPTCPv1) <https://www.rfc-editor.org/rfc/rfc8684.html>`_. It allows a
device to make use of multiple interfaces at once to send and receive TCP
packets over a single MPTCP connection. MPTCP can aggregate the bandwidth of
multiple interfaces or prefer the one with the lowest latency. It also allows a
fail-over if one path is down, and the traffic is seamlessly reinjected on other
paths.

For more details about Multipath TCP in the Linux kernel, please see the
official website: `mptcp.dev <https://www.mptcp.dev>`_.


Use cases
=========

Because MPTCP can use multiple paths in parallel, it enables use cases that
are not possible with plain TCP:

- Seamless handovers: switching from one path to another while preserving
  established connections, e.g. to be used in mobility use-cases, like on
  smartphones.
- Best network selection: using the "best" available path depending on some
  conditions, e.g. latency, losses, cost, bandwidth, etc.
- Network aggregation: using multiple paths at the same time to have a higher
  throughput, e.g. to combine fixed and mobile networks to send files faster.


Concepts
========

Technically, when a new socket is created with the ``IPPROTO_MPTCP`` protocol
(Linux-specific), a *subflow* (or *path*) is created. This *subflow* consists of
a regular TCP connection that is used to transmit data through one interface.
Additional *subflows* can be negotiated later between the hosts. For the remote
host to be able to detect the use of MPTCP, a new field is added to the TCP
*option* field of the underlying TCP *subflow*. This field contains, amongst
other things, an ``MP_CAPABLE`` option that tells the other host to use MPTCP if
it is supported. If the remote host or any middlebox in between does not support
it, the returned ``SYN+ACK`` packet will not contain MPTCP options in the TCP
*option* field. In that case, the connection will be "downgraded" to plain TCP,
and it will continue with a single path.

This behavior is made possible by two internal components: the path manager, and
the packet scheduler.

Path Manager
------------

The Path Manager is in charge of *subflows*, from creation to deletion, and also
address announcements. Typically, it is the client side that initiates subflows,
and the server side that announces additional addresses via the ``ADD_ADDR`` and
``REMOVE_ADDR`` options.

Path managers are controlled by the ``net.mptcp.pm_type`` sysctl knob -- see
mptcp-sysctl.rst. There are two types: the in-kernel one (type ``0``) where the
same rules are applied for all the connections (see ``ip mptcp``); and the
userspace one (type ``1``), controlled by a userspace daemon (e.g. `mptcpd
<https://mptcpd.mptcp.dev/>`_) where different rules can be applied for each
connection. The path managers can be controlled via a Netlink API; see
netlink_spec/mptcp_pm.rst.

To be able to use multiple IP addresses on a host to create multiple *subflows*
(paths), the default in-kernel MPTCP path-manager needs to know which IP
addresses can be used. This can be configured with ``ip mptcp endpoint`` for
example.

Packet Scheduler
----------------

The Packet Scheduler is in charge of selecting which available *subflow(s)* to
use to send the next data packet. It can decide to maximize the use of the
available bandwidth, to pick only the path with the lowest latency, or to apply
any other policy depending on the configuration.

Packet schedulers are controlled by the ``net.mptcp.scheduler`` sysctl knob --
see mptcp-sysctl.rst.
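
As an illustration of the two knobs mentioned above, the following minimal
sketch reads ``pm_type`` and ``scheduler`` from ``/proc/sys/net/mptcp/``. The
helper names ``read_sysctl_line()`` and ``print_mptcp_config()`` are
hypothetical and only serve this example; they are not part of any kernel or
libc API.

.. code-block:: C

    #include <stdio.h>
    #include <string.h>

    /*
     * Hypothetical helper: read the first line of a procfs file into buf and
     * drop the trailing newline. Returns 0 on success, -1 if the file cannot
     * be opened or read (e.g. MPTCP is not available on this kernel).
     */
    static int read_sysctl_line(const char *path, char *buf, size_t len)
    {
        FILE *f = fopen(path, "r");

        if (!f)
            return -1;
        if (!fgets(buf, (int)len, f)) {
            fclose(f);
            return -1;
        }
        fclose(f);
        buf[strcspn(buf, "\n")] = '\0';
        return 0;
    }

    /* Print the path manager type and the packet scheduler, if readable. */
    static void print_mptcp_config(void)
    {
        char val[128];

        if (read_sysctl_line("/proc/sys/net/mptcp/pm_type", val, sizeof(val)) == 0)
            printf("pm_type: %s\n", val);
        if (read_sysctl_line("/proc/sys/net/mptcp/scheduler", val, sizeof(val)) == 0)
            printf("scheduler: %s\n", val);
    }

Both sysctls are per-namespace, so the values printed are those of the network
namespace the calling process runs in.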


Sockets API
===========

Creating MPTCP sockets
----------------------

On Linux, MPTCP can be used by selecting MPTCP instead of TCP when creating the
``socket``:

.. code-block:: C

    int sd = socket(AF_INET, SOCK_STREAM, IPPROTO_MPTCP); /* or AF_INET6 */

Note that ``IPPROTO_MPTCP`` is defined as ``262``.

If MPTCP is not supported, ``errno`` will be set to:

- ``EINVAL`` (*Invalid argument*): MPTCP is not available, on kernels < v5.6.
- ``EPROTONOSUPPORT`` (*Protocol not supported*): MPTCP has not been compiled,
  on kernels >= v5.6.
- ``ENOPROTOOPT`` (*Protocol not available*): MPTCP has been disabled using
  ``net.mptcp.enabled`` sysctl knob; see mptcp-sysctl.rst.

MPTCP is therefore opt-in: applications need to explicitly request it. Note that
applications can be forced to use MPTCP with different techniques, e.g.
``LD_PRELOAD`` (see ``mptcpize``), eBPF (see ``mptcpify``), SystemTap,
``GODEBUG`` (``GODEBUG=multipathtcp=1``), etc.

Switching to ``IPPROTO_MPTCP`` instead of ``IPPROTO_TCP`` should be as
transparent as possible for the userspace applications.
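
One possible way to keep an application working on kernels without MPTCP is to
fall back to plain TCP when ``socket()`` fails with one of the ``errno`` values
listed in this section. The sketch below assumes this fallback policy; the
helper names ``mptcp_errno_reason()`` and ``socket_mptcp_or_tcp()`` are
hypothetical and not part of any existing API.

.. code-block:: C

    #include <errno.h>
    #include <netinet/in.h>
    #include <sys/socket.h>
    #include <unistd.h>

    #ifndef IPPROTO_MPTCP
    #define IPPROTO_MPTCP 262 /* value stated in this document */
    #endif

    /* Hypothetical helper: explain why socket(..., IPPROTO_MPTCP) failed. */
    static const char *mptcp_errno_reason(int err)
    {
        switch (err) {
        case EINVAL:
            return "MPTCP is not available (kernel < v5.6)";
        case EPROTONOSUPPORT:
            return "MPTCP has not been compiled in";
        case ENOPROTOOPT:
            return "MPTCP is disabled via net.mptcp.enabled";
        default:
            return "error not related to MPTCP support";
        }
    }

    /*
     * Hypothetical helper: try MPTCP first, then fall back to TCP only if the
     * failure means "MPTCP is not supported here". *is_mptcp reports which
     * protocol was requested for the returned socket.
     */
    static int socket_mptcp_or_tcp(int family, int *is_mptcp)
    {
        int sd = socket(family, SOCK_STREAM, IPPROTO_MPTCP);

        *is_mptcp = sd >= 0;
        if (sd >= 0)
            return sd;
        if (errno != EINVAL && errno != EPROTONOSUPPORT && errno != ENOPROTOOPT)
            return -1; /* unrelated failure: do not hide it */
        return socket(family, SOCK_STREAM, IPPROTO_TCP);
    }

A caller would use ``socket_mptcp_or_tcp(AF_INET, &is_mptcp)`` in place of the
plain ``socket()`` call and could log ``mptcp_errno_reason(errno)`` when
``is_mptcp`` ends up being ``0``.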

Socket options
--------------

MPTCP supports most socket options handled by TCP. Some less common options
may not be supported yet, but contributions are welcome.

Generally, the same value is propagated to all subflows, including the ones
created after the calls to ``setsockopt()``. eBPF can be used to set different
values per subflow.

There are some MPTCP specific socket options at the ``SOL_MPTCP`` (284) level to
retrieve info. They fill the ``optval`` buffer of the ``getsockopt()`` system
call:

- ``MPTCP_INFO``: Uses ``struct mptcp_info``.
- ``MPTCP_TCPINFO``: Uses ``struct mptcp_subflow_data``, followed by an array of
  ``struct tcp_info``.
- ``MPTCP_SUBFLOW_ADDRS``: Uses ``struct mptcp_subflow_data``, followed by an
  array of ``struct mptcp_subflow_addrs``.
- ``MPTCP_FULL_INFO``: Uses ``struct mptcp_full_info``, with one pointer to an
  array of ``struct mptcp_subflow_info`` (including the
  ``struct mptcp_subflow_addrs``), and one pointer to an array of
  ``struct tcp_info``, followed by the content of ``struct mptcp_info``.

Note that at the TCP level, the ``TCP_IS_MPTCP`` socket option can be used to know
if MPTCP is currently being used: the value will be set to 1 if it is.
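
As a minimal sketch of the ``MPTCP_INFO`` option described above, the function
below queries ``struct mptcp_info`` and returns its ``mptcpi_subflows``
counter. It assumes UAPI headers recent enough to provide ``struct mptcp_info``
in ``<linux/mptcp.h>``; the fallback defines and the helper name
``mptcp_subflows_counter()`` are assumptions made for this example.

.. code-block:: C

    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <linux/mptcp.h> /* struct mptcp_info, MPTCP_INFO */

    #ifndef SOL_MPTCP
    #define SOL_MPTCP 284 /* value stated in this document */
    #endif
    #ifndef MPTCP_INFO
    #define MPTCP_INFO 1 /* assumption: for older UAPI headers only */
    #endif

    /*
     * Hypothetical helper: return the mptcpi_subflows counter of an MPTCP
     * socket, or -1 if getsockopt() fails (e.g. sd is not an MPTCP socket).
     */
    static int mptcp_subflows_counter(int sd)
    {
        struct mptcp_info info = { 0 };
        socklen_t len = sizeof(info);

        if (getsockopt(sd, SOL_MPTCP, MPTCP_INFO, &info, &len) < 0)
            return -1;
        return info.mptcpi_subflows;
    }

The other ``SOL_MPTCP`` options follow the same ``getsockopt()`` pattern, but
they need larger buffers because they return arrays after the header
structure.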


Design choices
==============

A new socket type has been added for MPTCP for the userspace-facing socket. The
kernel is in charge of creating subflow sockets: they are TCP sockets where the
behavior is modified using TCP-ULP.

MPTCP listen sockets will create "plain" *accepted* TCP sockets if the
connection request from the client didn't ask for MPTCP, making the performance
impact minimal when MPTCP is enabled by default.
@ -15753,7 +15753,7 @@ B: https://github.com/multipath-tcp/mptcp_net-next/issues
T: git https://github.com/multipath-tcp/mptcp_net-next.git export-net
T: git https://github.com/multipath-tcp/mptcp_net-next.git export
F: Documentation/netlink/specs/mptcp_pm.yaml
F: Documentation/networking/mptcp-sysctl.rst
F: Documentation/networking/mptcp*.rst
F: include/net/mptcp.h
F: include/trace/events/mptcp.h
F: include/uapi/linux/mptcp*.h