bpf-next-for-netdev

-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQQ6NaUOruQGUkvPdG4raS+Z+3y5EwUCZ7ffOQAKCRAraS+Z+3y5
 EzVHAP9h/QkeYoOZW9gul08I8vFiZsFe/lbOSLJWxeVfxb9JhgD/cMqby3qAxQK6
 lsdNQ9jYG2232Wym89ag7fvTBK15Wg4=
 =gkN2
 -----END PGP SIGNATURE-----

Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next

Martin KaFai Lau says:

====================
pull-request: bpf-next 2025-02-20

We've added 19 non-merge commits during the last 8 day(s) which contain
a total of 35 files changed, 1126 insertions(+), 53 deletions(-).

The main changes are:

1) Add TCP_RTO_MAX_MS support to bpf_set/getsockopt, from Jason Xing

2) Add network TX timestamping support to BPF sock_ops, from Jason Xing

3) Add TX metadata Launch Time support, from Song Yoong Siang

* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next:
  igc: Add launch time support to XDP ZC
  igc: Refactor empty frame insertion for launch time support
  net: stmmac: Add launch time support to XDP ZC
  selftests/bpf: Add launch time request to xdp_hw_metadata
  xsk: Add launch time hardware offload support to XDP Tx metadata
  selftests/bpf: Add simple bpf tests in the tx path for timestamping feature
  bpf: Support selective sampling for bpf timestamping
  bpf: Add BPF_SOCK_OPS_TSTAMP_SENDMSG_CB callback
  bpf: Add BPF_SOCK_OPS_TSTAMP_ACK_CB callback
  bpf: Add BPF_SOCK_OPS_TSTAMP_SND_HW_CB callback
  bpf: Add BPF_SOCK_OPS_TSTAMP_SND_SW_CB callback
  bpf: Add BPF_SOCK_OPS_TSTAMP_SCHED_CB callback
  net-timestamp: Prepare for isolating two modes of SO_TIMESTAMPING
  bpf: Disable unsafe helpers in TX timestamping callbacks
  bpf: Prevent unsafe access to the sock fields in the BPF timestamping callback
  bpf: Prepare the sock_ops ctx and call bpf prog for TX timestamping
  bpf: Add networking timestamping support to bpf_get/setsockopt()
  selftests/bpf: Add rto max for bpf_setsockopt test
  bpf: Support TCP_RTO_MAX_MS for bpf_setsockopt
====================

Link: https://patch.msgid.link/20250221022104.386462-1-martin.lau@linux.dev
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski 2025-02-21 15:59:47 -08:00
commit e87700965a
35 changed files with 1126 additions and 53 deletions

View File

@ -70,6 +70,10 @@ definitions:
name: tx-checksum
doc:
L3 checksum HW offload is supported by the driver.
-
name: tx-launch-time-fifo
doc:
Launch time HW offload is supported by the driver.
-
name: queue-type
type: enum

View File

@ -50,6 +50,10 @@ The flags field enables the particular offload:
checksum. ``csum_start`` specifies byte offset of where the checksumming
should start and ``csum_offset`` specifies byte offset where the
device should store the computed checksum.
- ``XDP_TXMD_FLAGS_LAUNCH_TIME``: requests the device to schedule the
packet for transmission at a pre-determined time called launch time. The
value of launch time is indicated by the ``launch_time`` field of
``struct xsk_tx_metadata``.
Besides the flags above, in order to trigger the offloads, the first
packet's ``struct xdp_desc`` descriptor should set ``XDP_TX_METADATA``
@ -65,6 +69,63 @@ In this case, when running in ``XDP_COPY`` mode, the TX checksum
is calculated on the CPU. Do not enable this option in production because
it will negatively affect performance.
Launch Time
===========
The value of the requested launch time should be based on the device's PTP
Hardware Clock (PHC) to ensure accuracy. AF_XDP takes a different data path
compared to the ETF queuing discipline, which organizes packets and delays
their transmission. Instead, AF_XDP immediately hands off the packets to
the device driver without rearranging their order or holding them prior to
transmission. Since the driver maintains FIFO behavior and does not perform
packet reordering, a packet with a launch time request will block other
packets in the same Tx Queue until it is sent. Therefore, it is recommended
to allocate a separate queue for scheduling traffic that is intended for
future transmission.
In scenarios where the launch time offload feature is disabled, the device
driver is expected to disregard the launch time request. For correct
interpretation and meaningful operation, the launch time should never be
set to a value larger than the farthest programmable time in the future
(the horizon). Different devices have different hardware limitations on the
launch time offload feature.
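As a rough illustration of the request side described above, the following
sketch sets the launch time flag and value on a metadata area. The struct
``sketch_tx_metadata`` and the helper ``sketch_request_launch_time()`` are
hypothetical stand-ins, not the uapi definitions; only the flag value
(bit 2) is taken from this series. Reading the PHC and keeping the delay
within the device's horizon are assumed to be done by the caller.

```c
#include <stdint.h>

/* Mirrors the value of XDP_TXMD_FLAGS_LAUNCH_TIME added by this series. */
#define SKETCH_TXMD_FLAGS_LAUNCH_TIME (1u << 2)

/* Hypothetical, reduced stand-in for struct xsk_tx_metadata: it holds only
 * the two fields this sketch touches and is not the uapi layout.
 */
struct sketch_tx_metadata {
	uint64_t flags;
	uint64_t launch_time;	/* nanoseconds, PHC time base */
};

/* Hypothetical helper: request transmission delay_ns after the current PHC
 * time. The caller is assumed to have read phc_now_ns from the device's PHC.
 */
static void sketch_request_launch_time(struct sketch_tx_metadata *meta,
				       uint64_t phc_now_ns, uint64_t delay_ns)
{
	meta->flags |= SKETCH_TXMD_FLAGS_LAUNCH_TIME;
	meta->launch_time = phc_now_ns + delay_ns;
}
```

Because the driver keeps FIFO order, a request like this delays every later
packet on the same Tx Queue, which is why a dedicated queue is recommended.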
stmmac driver
-------------
For stmmac, TSO and launch time (TBS) features are mutually exclusive for
each individual Tx Queue. By default, the driver configures Tx Queue 0 to
support TSO and the rest of the Tx Queues to support TBS. The launch time
hardware offload feature can be enabled or disabled by using the tc-etf
command to call the driver's ndo_setup_tc() callback.
The value of the launch time that is programmed in the Enhanced Normal
Transmit Descriptors is a 32-bit value, where the most significant 8 bits
represent the time in seconds and the remaining 24 bits represent the time
in 256 ns increments. The programmed launch time is compared against the
PTP time (bits[39:8]) and rolls over after 256 seconds. Therefore, the
horizon of the launch time for dwmac4 and dwxlgmac2 is 128 seconds in the
future.
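The 32-bit layout described above can be sketched as follows. This is only
a reading of the prose (8 bits of seconds, 24 bits of 256 ns units), not
the driver's descriptor code; ``sketch_pack_launch_time()`` and
``sketch_within_horizon()`` are hypothetical names, and the horizon check
assumes both times are PHC nanosecond values.

```c
#include <stdint.h>

#define SKETCH_NSEC_PER_SEC 1000000000ULL

/* Hypothetical helper: pack a PHC time in nanoseconds into 8 bits of
 * seconds (which wrap every 256 s) and 24 bits of 256 ns units.
 */
static uint32_t sketch_pack_launch_time(uint64_t phc_ns)
{
	uint64_t sec = phc_ns / SKETCH_NSEC_PER_SEC;
	uint64_t nsec = phc_ns % SKETCH_NSEC_PER_SEC;

	return (uint32_t)(((sec & 0xff) << 24) | ((nsec >> 8) & 0xffffff));
}

/* Hypothetical check: a launch time is usable if it is not in the past and
 * is less than 128 s ahead of the current PHC time.
 */
static int sketch_within_horizon(uint64_t now_ns, uint64_t launch_ns)
{
	return launch_ns >= now_ns &&
	       launch_ns - now_ns < 128 * SKETCH_NSEC_PER_SEC;
}
```

For instance, a PHC time of 3 s + 512 ns packs to ``0x03000002``, and a
time of 257 s packs to ``0x01000000`` because the seconds field wraps.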
igc driver
----------
For igc, all four Tx Queues support the launch time feature. The launch
time hardware offload feature can be enabled or disabled by using the
tc-etf command to call the driver's ndo_setup_tc() callback. When entering
TSN mode, the igc driver will reset the device and create a default Qbv
schedule with a 1-second cycle time, with all Tx Queues open at all times.
The value of the launch time that is programmed in the Advanced Transmit
Context Descriptor is a relative offset to the starting time of the Qbv
transmission window of the queue. The Frst flag of the descriptor can be
set to schedule the packet for the next Qbv cycle. Therefore, the horizon
of the launch time for i225 and i226 is the ending time of the next cycle
of the Qbv transmission window of the queue. For example, when the Qbv
cycle time is set to 1 second, the horizon of the launch time ranges
from 1 second to 2 seconds, depending on where the Qbv cycle is currently
running.
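The 1 to 2 second range in the example above can be reproduced with a
small calculation. The sketch assumes the default schedule (the queue's
window spans the whole cycle), that ``now_ns`` is not earlier than the
cycle base time, and that the cycle time is non-zero;
``sketch_igc_horizon_ns()`` is a hypothetical name, not a driver function.

```c
#include <stdint.h>

/* Hypothetical helper: time from now until the end of the next Qbv cycle.
 * Assumes now_ns >= base_ns and cycle_ns > 0.
 */
static uint64_t sketch_igc_horizon_ns(uint64_t now_ns, uint64_t base_ns,
				      uint64_t cycle_ns)
{
	/* How far the current cycle has already progressed. */
	uint64_t elapsed = (now_ns - base_ns) % cycle_ns;

	/* Remainder of the current cycle plus one full next cycle. */
	return 2 * cycle_ns - elapsed;
}
```

With a 1-second cycle, the result is 2 seconds at the start of a cycle and
approaches 1 second just before the cycle ends.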
Querying Device Capabilities
============================
@ -74,6 +135,7 @@ Refer to ``xsk-flags`` features bitmask in
- ``tx-timestamp``: device supports ``XDP_TXMD_FLAGS_TIMESTAMP``
- ``tx-checksum``: device supports ``XDP_TXMD_FLAGS_CHECKSUM``
- ``tx-launch-time-fifo``: device supports ``XDP_TXMD_FLAGS_LAUNCH_TIME``
See ``tools/net/ynl/samples/netdev.c`` on how to query this information.

View File

@ -579,6 +579,7 @@ struct igc_metadata_request {
struct xsk_tx_metadata *meta;
struct igc_ring *tx_ring;
u32 cmd_type;
u16 used_desc;
};
struct igc_q_vector {

View File

@ -1092,7 +1092,8 @@ static int igc_init_empty_frame(struct igc_ring *ring,
dma = dma_map_single(ring->dev, skb->data, size, DMA_TO_DEVICE);
if (dma_mapping_error(ring->dev, dma)) {
netdev_err_once(ring->netdev, "Failed to map DMA for TX\n");
net_err_ratelimited("%s: DMA mapping error for empty frame\n",
netdev_name(ring->netdev));
return -ENOMEM;
}
@ -1108,20 +1109,12 @@ static int igc_init_empty_frame(struct igc_ring *ring,
return 0;
}
static int igc_init_tx_empty_descriptor(struct igc_ring *ring,
struct sk_buff *skb,
struct igc_tx_buffer *first)
static void igc_init_tx_empty_descriptor(struct igc_ring *ring,
struct sk_buff *skb,
struct igc_tx_buffer *first)
{
union igc_adv_tx_desc *desc;
u32 cmd_type, olinfo_status;
int err;
if (!igc_desc_unused(ring))
return -EBUSY;
err = igc_init_empty_frame(ring, first, skb);
if (err)
return err;
cmd_type = IGC_ADVTXD_DTYP_DATA | IGC_ADVTXD_DCMD_DEXT |
IGC_ADVTXD_DCMD_IFCS | IGC_TXD_DCMD |
@ -1140,8 +1133,6 @@ static int igc_init_tx_empty_descriptor(struct igc_ring *ring,
ring->next_to_use++;
if (ring->next_to_use == ring->count)
ring->next_to_use = 0;
return 0;
}
#define IGC_EMPTY_FRAME_SIZE 60
@ -1567,6 +1558,40 @@ static bool igc_request_tx_tstamp(struct igc_adapter *adapter, struct sk_buff *s
return false;
}
static int igc_insert_empty_frame(struct igc_ring *tx_ring)
{
struct igc_tx_buffer *empty_info;
struct sk_buff *empty_skb;
void *data;
int ret;
empty_info = &tx_ring->tx_buffer_info[tx_ring->next_to_use];
empty_skb = alloc_skb(IGC_EMPTY_FRAME_SIZE, GFP_ATOMIC);
if (unlikely(!empty_skb)) {
net_err_ratelimited("%s: skb alloc error for empty frame\n",
netdev_name(tx_ring->netdev));
return -ENOMEM;
}
data = skb_put(empty_skb, IGC_EMPTY_FRAME_SIZE);
memset(data, 0, IGC_EMPTY_FRAME_SIZE);
/* Prepare DMA mapping and Tx buffer information */
ret = igc_init_empty_frame(tx_ring, empty_info, empty_skb);
if (unlikely(ret)) {
dev_kfree_skb_any(empty_skb);
return ret;
}
/* Prepare advanced context descriptor for empty packet */
igc_tx_ctxtdesc(tx_ring, 0, false, 0, 0, 0);
/* Prepare advanced data descriptor for empty packet */
igc_init_tx_empty_descriptor(tx_ring, empty_skb, empty_info);
return 0;
}
static netdev_tx_t igc_xmit_frame_ring(struct sk_buff *skb,
struct igc_ring *tx_ring)
{
@ -1586,6 +1611,7 @@ static netdev_tx_t igc_xmit_frame_ring(struct sk_buff *skb,
* + 1 desc for skb_headlen/IGC_MAX_DATA_PER_TXD,
* + 2 desc gap to keep tail from touching head,
* + 1 desc for context descriptor,
* + 2 desc for inserting an empty packet for launch time,
* otherwise try next time
*/
for (f = 0; f < skb_shinfo(skb)->nr_frags; f++)
@ -1605,24 +1631,16 @@ static netdev_tx_t igc_xmit_frame_ring(struct sk_buff *skb,
launch_time = igc_tx_launchtime(tx_ring, txtime, &first_flag, &insert_empty);
if (insert_empty) {
struct igc_tx_buffer *empty_info;
struct sk_buff *empty;
void *data;
empty_info = &tx_ring->tx_buffer_info[tx_ring->next_to_use];
empty = alloc_skb(IGC_EMPTY_FRAME_SIZE, GFP_ATOMIC);
if (!empty)
goto done;
data = skb_put(empty, IGC_EMPTY_FRAME_SIZE);
memset(data, 0, IGC_EMPTY_FRAME_SIZE);
igc_tx_ctxtdesc(tx_ring, 0, false, 0, 0, 0);
if (igc_init_tx_empty_descriptor(tx_ring,
empty,
empty_info) < 0)
dev_kfree_skb_any(empty);
/* Reset the launch time if the required empty frame fails to
* be inserted. However, this packet is not dropped, so it
* "dirties" the current Qbv cycle. This ensures that the
* upcoming packet, which is scheduled in the next Qbv cycle,
* does not require an empty frame. This way, the launch time
* continues to function correctly despite the current failure
* to insert the empty frame.
*/
if (igc_insert_empty_frame(tx_ring))
launch_time = 0;
}
done:
@ -2953,9 +2971,48 @@ static u64 igc_xsk_fill_timestamp(void *_priv)
return *(u64 *)_priv;
}
static void igc_xsk_request_launch_time(u64 launch_time, void *_priv)
{
struct igc_metadata_request *meta_req = _priv;
struct igc_ring *tx_ring = meta_req->tx_ring;
__le32 launch_time_offset;
bool insert_empty = false;
bool first_flag = false;
u16 used_desc = 0;
if (!tx_ring->launchtime_enable)
return;
launch_time_offset = igc_tx_launchtime(tx_ring,
ns_to_ktime(launch_time),
&first_flag, &insert_empty);
if (insert_empty) {
/* Disregard the launch time request if the required empty frame
* fails to be inserted.
*/
if (igc_insert_empty_frame(tx_ring))
return;
meta_req->tx_buffer =
&tx_ring->tx_buffer_info[tx_ring->next_to_use];
/* Inserting an empty packet requires two descriptors:
* one data descriptor and one context descriptor.
*/
used_desc += 2;
}
/* Use one context descriptor to specify launch time and first flag. */
igc_tx_ctxtdesc(tx_ring, launch_time_offset, first_flag, 0, 0, 0);
used_desc += 1;
/* Update the number of used descriptors in this request */
meta_req->used_desc += used_desc;
}
const struct xsk_tx_metadata_ops igc_xsk_tx_metadata_ops = {
.tmo_request_timestamp = igc_xsk_request_timestamp,
.tmo_fill_timestamp = igc_xsk_fill_timestamp,
.tmo_request_launch_time = igc_xsk_request_launch_time,
};
static void igc_xdp_xmit_zc(struct igc_ring *ring)
@ -2978,7 +3035,13 @@ static void igc_xdp_xmit_zc(struct igc_ring *ring)
ntu = ring->next_to_use;
budget = igc_desc_unused(ring);
while (xsk_tx_peek_desc(pool, &xdp_desc) && budget--) {
/* Packets with launch time require one data descriptor and one context
* descriptor. When the launch time falls into the next Qbv cycle, we
* may need to insert an empty packet, which requires two more
* descriptors. Therefore, to be safe, we always ensure we have at least
* 4 descriptors available.
*/
while (xsk_tx_peek_desc(pool, &xdp_desc) && budget >= 4) {
struct igc_metadata_request meta_req;
struct xsk_tx_metadata *meta = NULL;
struct igc_tx_buffer *bi;
@ -2999,9 +3062,19 @@ static void igc_xdp_xmit_zc(struct igc_ring *ring)
meta_req.tx_ring = ring;
meta_req.tx_buffer = bi;
meta_req.meta = meta;
meta_req.used_desc = 0;
xsk_tx_metadata_request(meta, &igc_xsk_tx_metadata_ops,
&meta_req);
/* xsk_tx_metadata_request() may have updated next_to_use */
ntu = ring->next_to_use;
/* xsk_tx_metadata_request() may have updated Tx buffer info */
bi = meta_req.tx_buffer;
/* xsk_tx_metadata_request() may use a few descriptors */
budget -= meta_req.used_desc;
tx_desc = IGC_TX_DESC(ring, ntu);
tx_desc->read.cmd_type_len = cpu_to_le32(meta_req.cmd_type);
tx_desc->read.olinfo_status = cpu_to_le32(olinfo_status);
@ -3019,9 +3092,11 @@ static void igc_xdp_xmit_zc(struct igc_ring *ring)
ntu++;
if (ntu == ring->count)
ntu = 0;
ring->next_to_use = ntu;
budget--;
}
ring->next_to_use = ntu;
if (tx_desc) {
igc_flush_tx_descriptors(ring);
xsk_tx_release(pool);
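The descriptor accounting in igc_xdp_xmit_zc() above can be modelled in
isolation. The following is a simplified sketch, not driver code: the
``sketch_`` names are hypothetical, and the per-packet costs (1, 2 or 4
descriptors) are taken from the comments in the hunks above.

```c
#include <stddef.h>

/* Hypothetical packet classes for the model. */
enum sketch_pkt_kind {
	SKETCH_PKT_PLAIN,		/* data descriptor only */
	SKETCH_PKT_LAUNCH,		/* context + data descriptor */
	SKETCH_PKT_LAUNCH_EMPTY,	/* empty frame (2) + context + data */
};

static unsigned int sketch_descs_for(enum sketch_pkt_kind kind)
{
	switch (kind) {
	case SKETCH_PKT_LAUNCH_EMPTY:
		return 4;
	case SKETCH_PKT_LAUNCH:
		return 2;
	default:
		return 1;
	}
}

/* Count how many packets are sent before fewer than 4 descriptors remain,
 * which is the worst-case cost the loop always reserves.
 */
static size_t sketch_count_sendable(const enum sketch_pkt_kind *pkts,
				    size_t n, unsigned int budget)
{
	size_t sent = 0;

	while (sent < n && budget >= 4) {
		budget -= sketch_descs_for(pkts[sent]);
		sent++;
	}

	return sent;
}
```

In this model the conservative check means that a ring with only three
free descriptors sends nothing, even if the next packet needs just one.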

View File

@ -106,6 +106,8 @@ struct stmmac_metadata_request {
struct stmmac_priv *priv;
struct dma_desc *tx_desc;
bool *set_ic;
struct dma_edesc *edesc;
int tbs;
};
struct stmmac_xsk_tx_complete {

View File

@ -2486,9 +2486,20 @@ static u64 stmmac_xsk_fill_timestamp(void *_priv)
return 0;
}
static void stmmac_xsk_request_launch_time(u64 launch_time, void *_priv)
{
struct timespec64 ts = ns_to_timespec64(launch_time);
struct stmmac_metadata_request *meta_req = _priv;
if (meta_req->tbs & STMMAC_TBS_EN)
stmmac_set_desc_tbs(meta_req->priv, meta_req->edesc, ts.tv_sec,
ts.tv_nsec);
}
static const struct xsk_tx_metadata_ops stmmac_xsk_tx_metadata_ops = {
.tmo_request_timestamp = stmmac_xsk_request_timestamp,
.tmo_fill_timestamp = stmmac_xsk_fill_timestamp,
.tmo_request_launch_time = stmmac_xsk_request_launch_time,
};
static bool stmmac_xdp_xmit_zc(struct stmmac_priv *priv, u32 queue, u32 budget)
@ -2572,6 +2583,8 @@ static bool stmmac_xdp_xmit_zc(struct stmmac_priv *priv, u32 queue, u32 budget)
meta_req.priv = priv;
meta_req.tx_desc = tx_desc;
meta_req.set_ic = &set_ic;
meta_req.tbs = tx_q->tbs;
meta_req.edesc = &tx_q->dma_entx[entry];
xsk_tx_metadata_request(meta, &stmmac_xsk_tx_metadata_ops,
&meta_req);
if (set_ic) {

View File

@ -1508,6 +1508,7 @@ struct bpf_sock_ops_kern {
void *skb_data_end;
u8 op;
u8 is_fullsock;
u8 is_locked_tcp_sock;
u8 remaining_opt_len;
u64 temp; /* temp and everything after is not
* initialized to 0 before calling

View File

@ -470,7 +470,7 @@ struct skb_shared_hwtstamps {
/* Definitions for tx_flags in struct skb_shared_info */
enum {
/* generate hardware time stamp */
SKBTX_HW_TSTAMP = 1 << 0,
SKBTX_HW_TSTAMP_NOBPF = 1 << 0,
/* generate software time stamp when queueing packet to NIC */
SKBTX_SW_TSTAMP = 1 << 1,
@ -489,10 +489,16 @@ enum {
/* generate software time stamp when entering packet scheduling */
SKBTX_SCHED_TSTAMP = 1 << 6,
/* used for bpf extension when a bpf program is loaded */
SKBTX_BPF = 1 << 7,
};
#define SKBTX_HW_TSTAMP (SKBTX_HW_TSTAMP_NOBPF | SKBTX_BPF)
#define SKBTX_ANY_SW_TSTAMP (SKBTX_SW_TSTAMP | \
SKBTX_SCHED_TSTAMP)
SKBTX_SCHED_TSTAMP | \
SKBTX_BPF)
#define SKBTX_ANY_TSTAMP (SKBTX_HW_TSTAMP | \
SKBTX_HW_TSTAMP_USE_CYCLES | \
SKBTX_ANY_SW_TSTAMP)
@ -4564,7 +4570,7 @@ void skb_tstamp_tx(struct sk_buff *orig_skb,
static inline void skb_tx_timestamp(struct sk_buff *skb)
{
skb_clone_tx_timestamp(skb);
if (skb_shinfo(skb)->tx_flags & SKBTX_SW_TSTAMP)
if (skb_shinfo(skb)->tx_flags & (SKBTX_SW_TSTAMP | SKBTX_BPF))
skb_tstamp_tx(skb, NULL);
}

View File

@ -303,6 +303,7 @@ struct sk_filter;
* @sk_stamp: time stamp of last packet received
* @sk_stamp_seq: lock for accessing sk_stamp on 32 bit architectures only
* @sk_tsflags: SO_TIMESTAMPING flags
* @sk_bpf_cb_flags: used in bpf_setsockopt()
* @sk_use_task_frag: allow sk_page_frag() to use current->task_frag.
* Sockets that can be used under memory reclaim should
* set this to false.
@ -525,6 +526,8 @@ struct sock {
u8 sk_txtime_deadline_mode : 1,
sk_txtime_report_errors : 1,
sk_txtime_unused : 6;
#define SK_BPF_CB_FLAG_TEST(SK, FLAG) ((SK)->sk_bpf_cb_flags & (FLAG))
u8 sk_bpf_cb_flags;
void *sk_user_data;
#ifdef CONFIG_SECURITY
@ -2909,6 +2912,13 @@ int sock_set_timestamping(struct sock *sk, int optname,
struct so_timestamping timestamping);
void sock_enable_timestamps(struct sock *sk);
#if defined(CONFIG_CGROUP_BPF)
void bpf_skops_tx_timestamping(struct sock *sk, struct sk_buff *skb, int op);
#else
static inline void bpf_skops_tx_timestamping(struct sock *sk, struct sk_buff *skb, int op)
{
}
#endif
void sock_no_linger(struct sock *sk);
void sock_set_keepalive(struct sock *sk);
void sock_set_priority(struct sock *sk, u32 priority);

View File

@ -978,10 +978,12 @@ struct tcp_skb_cb {
__u8 sacked; /* State flags for SACK. */
__u8 ip_dsfield; /* IPv4 tos or IPv6 dsfield */
__u8 txstamp_ack:1, /* Record TX timestamp for ack? */
#define TSTAMP_ACK_SK 0x1
#define TSTAMP_ACK_BPF 0x2
__u8 txstamp_ack:2, /* Record TX timestamp for ack? */
eor:1, /* Is skb MSG_EOR marked? */
has_rxtstamp:1, /* SKB has a RX timestamp */
unused:5;
unused:4;
__u32 ack_seq; /* Sequence number ACK'd */
union {
struct {
@ -2671,6 +2673,7 @@ static inline int tcp_call_bpf(struct sock *sk, int op, u32 nargs, u32 *args)
memset(&sock_ops, 0, offsetof(struct bpf_sock_ops_kern, temp));
if (sk_fullsock(sk)) {
sock_ops.is_fullsock = 1;
sock_ops.is_locked_tcp_sock = 1;
sock_owned_by_me(sk);
}

View File

@ -110,11 +110,16 @@ struct xdp_sock {
* indicates position where checksumming should start.
* csum_offset indicates position where checksum should be stored.
*
* void (*tmo_request_launch_time)(u64 launch_time, void *priv)
* Called when AF_XDP frame requested launch time HW offload support.
* launch_time indicates the PTP time at which the device can schedule the
* packet for transmission.
*/
struct xsk_tx_metadata_ops {
void (*tmo_request_timestamp)(void *priv);
u64 (*tmo_fill_timestamp)(void *priv);
void (*tmo_request_checksum)(u16 csum_start, u16 csum_offset, void *priv);
void (*tmo_request_launch_time)(u64 launch_time, void *priv);
};
#ifdef CONFIG_XDP_SOCKETS
@ -162,6 +167,11 @@ static inline void xsk_tx_metadata_request(const struct xsk_tx_metadata *meta,
if (!meta)
return;
if (ops->tmo_request_launch_time)
if (meta->flags & XDP_TXMD_FLAGS_LAUNCH_TIME)
ops->tmo_request_launch_time(meta->request.launch_time,
priv);
if (ops->tmo_request_timestamp)
if (meta->flags & XDP_TXMD_FLAGS_TIMESTAMP)
ops->tmo_request_timestamp(priv);

View File

@ -216,6 +216,7 @@ xsk_buff_raw_get_ctx(const struct xsk_buff_pool *pool, u64 addr)
#define XDP_TXMD_FLAGS_VALID ( \
XDP_TXMD_FLAGS_TIMESTAMP | \
XDP_TXMD_FLAGS_CHECKSUM | \
XDP_TXMD_FLAGS_LAUNCH_TIME | \
0)
static inline bool

View File

@ -6913,6 +6913,12 @@ enum {
BPF_SOCK_OPS_ALL_CB_FLAGS = 0x7F,
};
enum {
SK_BPF_CB_TX_TIMESTAMPING = 1<<0,
SK_BPF_CB_MASK = (SK_BPF_CB_TX_TIMESTAMPING - 1) |
SK_BPF_CB_TX_TIMESTAMPING
};
/* List of known BPF sock_ops operators.
* New entries can only be added at the end
*/
@ -7025,6 +7031,29 @@ enum {
* by the kernel or the
* earlier bpf-progs.
*/
BPF_SOCK_OPS_TSTAMP_SCHED_CB, /* Called when skb is passing
* through dev layer when
* SK_BPF_CB_TX_TIMESTAMPING
* feature is on.
*/
BPF_SOCK_OPS_TSTAMP_SND_SW_CB, /* Called when skb is about to send
* to the nic when SK_BPF_CB_TX_TIMESTAMPING
* feature is on.
*/
BPF_SOCK_OPS_TSTAMP_SND_HW_CB, /* Called in hardware phase when
* SK_BPF_CB_TX_TIMESTAMPING feature
* is on.
*/
BPF_SOCK_OPS_TSTAMP_ACK_CB, /* Called when all the skbs in the
* same sendmsg call are acked
* when SK_BPF_CB_TX_TIMESTAMPING
* feature is on.
*/
BPF_SOCK_OPS_TSTAMP_SENDMSG_CB, /* Called when every sendmsg syscall
* is triggered. It's used to correlate
* sendmsg timestamp with corresponding
* tskey.
*/
};
/* List of TCP states. There is a build check in net/ipv4/tcp.c to detect
@ -7091,6 +7120,7 @@ enum {
TCP_BPF_SYN_IP = 1006, /* Copy the IP[46] and TCP header */
TCP_BPF_SYN_MAC = 1007, /* Copy the MAC, IP[46], and TCP header */
TCP_BPF_SOCK_OPS_CB_FLAGS = 1008, /* Get or Set TCP sock ops flags */
SK_BPF_CB_FLAGS = 1009, /* Get or set sock ops flags in socket */
};
enum {

View File

@ -127,6 +127,12 @@ struct xdp_options {
*/
#define XDP_TXMD_FLAGS_CHECKSUM (1 << 1)
/* Request launch time hardware offload. The device will schedule the packet for
* transmission at a pre-determined time called launch time. The value of
* launch time is communicated via launch_time field of struct xsk_tx_metadata.
*/
#define XDP_TXMD_FLAGS_LAUNCH_TIME (1 << 2)
/* AF_XDP offloads request. 'request' union member is consumed by the driver
* when the packet is being transmitted. 'completion' union member is
* filled by the driver when the transmit completion arrives.
@ -142,6 +148,10 @@ struct xsk_tx_metadata {
__u16 csum_start;
/* Offset from csum_start where checksum should be stored. */
__u16 csum_offset;
/* XDP_TXMD_FLAGS_LAUNCH_TIME */
/* Launch time in nanoseconds against the PTP HW Clock */
__u64 launch_time;
} request;
struct {

View File

@ -59,10 +59,13 @@ enum netdev_xdp_rx_metadata {
* by the driver.
* @NETDEV_XSK_FLAGS_TX_CHECKSUM: L3 checksum HW offload is supported by the
* driver.
* @NETDEV_XSK_FLAGS_TX_LAUNCH_TIME_FIFO: Launch time HW offload is supported
* by the driver.
*/
enum netdev_xsk_flags {
NETDEV_XSK_FLAGS_TX_TIMESTAMP = 1,
NETDEV_XSK_FLAGS_TX_CHECKSUM = 2,
NETDEV_XSK_FLAGS_TX_LAUNCH_TIME_FIFO = 4,
};
enum netdev_queue_type {

View File

@ -8524,6 +8524,7 @@ static int bpf_prog_type_to_kfunc_hook(enum bpf_prog_type prog_type)
case BPF_PROG_TYPE_CGROUP_SOCK_ADDR:
case BPF_PROG_TYPE_CGROUP_SOCKOPT:
case BPF_PROG_TYPE_CGROUP_SYSCTL:
case BPF_PROG_TYPE_SOCK_OPS:
return BTF_KFUNC_HOOK_CGROUP;
case BPF_PROG_TYPE_SCHED_ACT:
return BTF_KFUNC_HOOK_SCHED_ACT;

View File

@ -4572,7 +4572,8 @@ int __dev_queue_xmit(struct sk_buff *skb, struct net_device *sb_dev)
skb_reset_mac_header(skb);
skb_assert_len(skb);
if (unlikely(skb_shinfo(skb)->tx_flags & SKBTX_SCHED_TSTAMP))
if (unlikely(skb_shinfo(skb)->tx_flags &
(SKBTX_SCHED_TSTAMP | SKBTX_BPF)))
__skb_tstamp_tx(skb, NULL, NULL, skb->sk, SCM_TSTAMP_SCHED);
/* Disable soft irqs for various locks below. Also

View File

@ -5222,6 +5222,25 @@ static const struct bpf_func_proto bpf_get_socket_uid_proto = {
.arg1_type = ARG_PTR_TO_CTX,
};
static int sk_bpf_set_get_cb_flags(struct sock *sk, char *optval, bool getopt)
{
u32 sk_bpf_cb_flags;
if (getopt) {
*(u32 *)optval = sk->sk_bpf_cb_flags;
return 0;
}
sk_bpf_cb_flags = *(u32 *)optval;
if (sk_bpf_cb_flags & ~SK_BPF_CB_MASK)
return -EINVAL;
sk->sk_bpf_cb_flags = sk_bpf_cb_flags;
return 0;
}
static int sol_socket_sockopt(struct sock *sk, int optname,
char *optval, int *optlen,
bool getopt)
@ -5238,6 +5257,7 @@ static int sol_socket_sockopt(struct sock *sk, int optname,
case SO_MAX_PACING_RATE:
case SO_BINDTOIFINDEX:
case SO_TXREHASH:
case SK_BPF_CB_FLAGS:
if (*optlen != sizeof(int))
return -EINVAL;
break;
@ -5247,6 +5267,9 @@ static int sol_socket_sockopt(struct sock *sk, int optname,
return -EINVAL;
}
if (optname == SK_BPF_CB_FLAGS)
return sk_bpf_set_get_cb_flags(sk, optval, getopt);
if (getopt) {
if (optname == SO_BINDTODEVICE)
return -EINVAL;
@ -5382,6 +5405,7 @@ static int sol_tcp_sockopt(struct sock *sk, int optname,
case TCP_USER_TIMEOUT:
case TCP_NOTSENT_LOWAT:
case TCP_SAVE_SYN:
case TCP_RTO_MAX_MS:
if (*optlen != sizeof(int))
return -EINVAL;
break;
@ -5500,6 +5524,11 @@ static int __bpf_setsockopt(struct sock *sk, int level, int optname,
return -EINVAL;
}
static bool is_locked_tcp_sock_ops(struct bpf_sock_ops_kern *bpf_sock)
{
return bpf_sock->op <= BPF_SOCK_OPS_WRITE_HDR_OPT_CB;
}
static int _bpf_setsockopt(struct sock *sk, int level, int optname,
char *optval, int optlen)
{
@ -5650,6 +5679,9 @@ static const struct bpf_func_proto bpf_sock_addr_getsockopt_proto = {
BPF_CALL_5(bpf_sock_ops_setsockopt, struct bpf_sock_ops_kern *, bpf_sock,
int, level, int, optname, char *, optval, int, optlen)
{
if (!is_locked_tcp_sock_ops(bpf_sock))
return -EOPNOTSUPP;
return _bpf_setsockopt(bpf_sock->sk, level, optname, optval, optlen);
}
@ -5735,6 +5767,9 @@ static int bpf_sock_ops_get_syn(struct bpf_sock_ops_kern *bpf_sock,
BPF_CALL_5(bpf_sock_ops_getsockopt, struct bpf_sock_ops_kern *, bpf_sock,
int, level, int, optname, char *, optval, int, optlen)
{
if (!is_locked_tcp_sock_ops(bpf_sock))
return -EOPNOTSUPP;
if (IS_ENABLED(CONFIG_INET) && level == SOL_TCP &&
optname >= TCP_BPF_SYN && optname <= TCP_BPF_SYN_MAC) {
int ret, copy_len = 0;
@ -5777,6 +5812,9 @@ BPF_CALL_2(bpf_sock_ops_cb_flags_set, struct bpf_sock_ops_kern *, bpf_sock,
struct sock *sk = bpf_sock->sk;
int val = argval & BPF_SOCK_OPS_ALL_CB_FLAGS;
if (!is_locked_tcp_sock_ops(bpf_sock))
return -EOPNOTSUPP;
if (!IS_ENABLED(CONFIG_INET) || !sk_fullsock(sk))
return -EINVAL;
@ -7586,6 +7624,9 @@ BPF_CALL_4(bpf_sock_ops_load_hdr_opt, struct bpf_sock_ops_kern *, bpf_sock,
u8 search_kind, search_len, copy_len, magic_len;
int ret;
if (!is_locked_tcp_sock_ops(bpf_sock))
return -EOPNOTSUPP;
/* 2 byte is the minimal option len except TCPOPT_NOP and
* TCPOPT_EOL which are useless for the bpf prog to learn
* and this helper disallow loading them also.
@ -10358,10 +10399,10 @@ static u32 sock_ops_convert_ctx_access(enum bpf_access_type type,
} \
*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF( \
struct bpf_sock_ops_kern, \
is_fullsock), \
is_locked_tcp_sock), \
fullsock_reg, si->src_reg, \
offsetof(struct bpf_sock_ops_kern, \
is_fullsock)); \
is_locked_tcp_sock)); \
*insn++ = BPF_JMP_IMM(BPF_JEQ, fullsock_reg, 0, jmp); \
if (si->dst_reg == si->src_reg) \
*insn++ = BPF_LDX_MEM(BPF_DW, reg, si->src_reg, \
@ -10446,10 +10487,10 @@ static u32 sock_ops_convert_ctx_access(enum bpf_access_type type,
temp)); \
*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF( \
struct bpf_sock_ops_kern, \
is_fullsock), \
is_locked_tcp_sock), \
reg, si->dst_reg, \
offsetof(struct bpf_sock_ops_kern, \
is_fullsock)); \
is_locked_tcp_sock)); \
*insn++ = BPF_JMP_IMM(BPF_JEQ, reg, 0, 2); \
*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF( \
struct bpf_sock_ops_kern, sk),\
@ -12062,6 +12103,25 @@ __bpf_kfunc int bpf_sk_assign_tcp_reqsk(struct __sk_buff *s, struct sock *sk,
#endif
}
__bpf_kfunc int bpf_sock_ops_enable_tx_tstamp(struct bpf_sock_ops_kern *skops,
u64 flags)
{
struct sk_buff *skb;
if (skops->op != BPF_SOCK_OPS_TSTAMP_SENDMSG_CB)
return -EOPNOTSUPP;
if (flags)
return -EINVAL;
skb = skops->skb;
skb_shinfo(skb)->tx_flags |= SKBTX_BPF;
TCP_SKB_CB(skb)->txstamp_ack |= TSTAMP_ACK_BPF;
skb_shinfo(skb)->tskey = TCP_SKB_CB(skb)->seq + skb->len - 1;
return 0;
}
__bpf_kfunc_end_defs();
int bpf_dynptr_from_skb_rdonly(struct __sk_buff *skb, u64 flags,
@ -12095,6 +12155,10 @@ BTF_KFUNCS_START(bpf_kfunc_check_set_tcp_reqsk)
BTF_ID_FLAGS(func, bpf_sk_assign_tcp_reqsk, KF_TRUSTED_ARGS)
BTF_KFUNCS_END(bpf_kfunc_check_set_tcp_reqsk)
BTF_KFUNCS_START(bpf_kfunc_check_set_sock_ops)
BTF_ID_FLAGS(func, bpf_sock_ops_enable_tx_tstamp, KF_TRUSTED_ARGS)
BTF_KFUNCS_END(bpf_kfunc_check_set_sock_ops)
static const struct btf_kfunc_id_set bpf_kfunc_set_skb = {
.owner = THIS_MODULE,
.set = &bpf_kfunc_check_set_skb,
@ -12115,6 +12179,11 @@ static const struct btf_kfunc_id_set bpf_kfunc_set_tcp_reqsk = {
.set = &bpf_kfunc_check_set_tcp_reqsk,
};
static const struct btf_kfunc_id_set bpf_kfunc_set_sock_ops = {
.owner = THIS_MODULE,
.set = &bpf_kfunc_check_set_sock_ops,
};
static int __init bpf_kfunc_init(void)
{
int ret;
@ -12133,7 +12202,8 @@ static int __init bpf_kfunc_init(void)
ret = ret ?: register_btf_kfunc_id_set(BPF_PROG_TYPE_XDP, &bpf_kfunc_set_xdp);
ret = ret ?: register_btf_kfunc_id_set(BPF_PROG_TYPE_CGROUP_SOCK_ADDR,
&bpf_kfunc_set_sock_addr);
return ret ?: register_btf_kfunc_id_set(BPF_PROG_TYPE_SCHED_CLS, &bpf_kfunc_set_tcp_reqsk);
ret = ret ?: register_btf_kfunc_id_set(BPF_PROG_TYPE_SCHED_CLS, &bpf_kfunc_set_tcp_reqsk);
return ret ?: register_btf_kfunc_id_set(BPF_PROG_TYPE_SOCK_OPS, &bpf_kfunc_set_sock_ops);
}
late_initcall(bpf_kfunc_init);
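The flag validation done by sk_bpf_set_get_cb_flags() earlier in this
file's diff can be exercised outside the kernel. The sketch below mirrors
the two SK_BPF_CB_* values from the uapi hunk; ``sketch_set_cb_flags()``
is a hypothetical stand-in that stores into a plain ``uint8_t`` rather
than a socket.

```c
#include <errno.h>
#include <stdint.h>

/* Values mirrored from the uapi enum added by this series. */
#define SK_BPF_CB_TX_TIMESTAMPING	(1 << 0)
#define SK_BPF_CB_MASK	((SK_BPF_CB_TX_TIMESTAMPING - 1) | \
			 SK_BPF_CB_TX_TIMESTAMPING)

/* Hypothetical stand-in for the set path: reject any bit outside the mask
 * and leave the stored flags untouched on error.
 */
static int sketch_set_cb_flags(uint8_t *stored, uint32_t requested)
{
	if (requested & ~(uint32_t)SK_BPF_CB_MASK)
		return -EINVAL;

	*stored = (uint8_t)requested;
	return 0;
}
```

Since only one flag exists so far, the mask evaluates to 1, so any value
other than 0 or 1 is rejected.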

View File

@ -53,6 +53,8 @@ XDP_METADATA_KFUNC_xxx
xsk_features |= NETDEV_XSK_FLAGS_TX_TIMESTAMP;
if (netdev->xsk_tx_metadata_ops->tmo_request_checksum)
xsk_features |= NETDEV_XSK_FLAGS_TX_CHECKSUM;
if (netdev->xsk_tx_metadata_ops->tmo_request_launch_time)
xsk_features |= NETDEV_XSK_FLAGS_TX_LAUNCH_TIME_FIFO;
}
if (nla_put_u32(rsp, NETDEV_A_DEV_IFINDEX, netdev->ifindex) ||

View File

@ -5449,6 +5449,52 @@ void skb_complete_tx_timestamp(struct sk_buff *skb,
}
EXPORT_SYMBOL_GPL(skb_complete_tx_timestamp);
static bool skb_tstamp_tx_report_so_timestamping(struct sk_buff *skb,
struct skb_shared_hwtstamps *hwtstamps,
int tstype)
{
switch (tstype) {
case SCM_TSTAMP_SCHED:
return skb_shinfo(skb)->tx_flags & SKBTX_SCHED_TSTAMP;
case SCM_TSTAMP_SND:
return skb_shinfo(skb)->tx_flags & (hwtstamps ? SKBTX_HW_TSTAMP_NOBPF :
SKBTX_SW_TSTAMP);
case SCM_TSTAMP_ACK:
return TCP_SKB_CB(skb)->txstamp_ack & TSTAMP_ACK_SK;
}
return false;
}
static void skb_tstamp_tx_report_bpf_timestamping(struct sk_buff *skb,
struct skb_shared_hwtstamps *hwtstamps,
struct sock *sk,
int tstype)
{
int op;
switch (tstype) {
case SCM_TSTAMP_SCHED:
op = BPF_SOCK_OPS_TSTAMP_SCHED_CB;
break;
case SCM_TSTAMP_SND:
if (hwtstamps) {
op = BPF_SOCK_OPS_TSTAMP_SND_HW_CB;
*skb_hwtstamps(skb) = *hwtstamps;
} else {
op = BPF_SOCK_OPS_TSTAMP_SND_SW_CB;
}
break;
case SCM_TSTAMP_ACK:
op = BPF_SOCK_OPS_TSTAMP_ACK_CB;
break;
default:
return;
}
bpf_skops_tx_timestamping(sk, skb, op);
}
void __skb_tstamp_tx(struct sk_buff *orig_skb,
const struct sk_buff *ack_skb,
struct skb_shared_hwtstamps *hwtstamps,
@ -5461,6 +5507,13 @@ void __skb_tstamp_tx(struct sk_buff *orig_skb,
if (!sk)
return;
if (skb_shinfo(orig_skb)->tx_flags & SKBTX_BPF)
skb_tstamp_tx_report_bpf_timestamping(orig_skb, hwtstamps,
sk, tstype);
if (!skb_tstamp_tx_report_so_timestamping(orig_skb, hwtstamps, tstype))
return;
tsflags = READ_ONCE(sk->sk_tsflags);
if (!hwtstamps && !(tsflags & SOF_TIMESTAMPING_OPT_TX_SWHW) &&
skb_shinfo(orig_skb)->tx_flags & SKBTX_IN_PROGRESS)

View File

@ -949,6 +949,20 @@ int sock_set_timestamping(struct sock *sk, int optname,
return 0;
}
#if defined(CONFIG_CGROUP_BPF)
void bpf_skops_tx_timestamping(struct sock *sk, struct sk_buff *skb, int op)
{
struct bpf_sock_ops_kern sock_ops;
memset(&sock_ops, 0, offsetof(struct bpf_sock_ops_kern, temp));
sock_ops.op = op;
sock_ops.is_fullsock = 1;
sock_ops.sk = sk;
bpf_skops_init_skb(&sock_ops, skb, 0);
__cgroup_bpf_run_filter_sock_ops(sk, &sock_ops, CGROUP_SOCK_OPS);
}
#endif
void sock_set_keepalive(struct sock *sk)
{
lock_sock(sk);

View File

@ -897,7 +897,7 @@ static void dsa_skb_tx_timestamp(struct dsa_user_priv *p,
{
struct dsa_switch *ds = p->dp->ds;
if (!(skb_shinfo(skb)->tx_flags & SKBTX_HW_TSTAMP))
if (!(skb_shinfo(skb)->tx_flags & SKBTX_HW_TSTAMP_NOBPF))
return;
if (!ds->ops->port_txtstamp)

View File

@ -492,10 +492,14 @@ static void tcp_tx_timestamp(struct sock *sk, struct sockcm_cookie *sockc)
sock_tx_timestamp(sk, sockc, &shinfo->tx_flags);
if (tsflags & SOF_TIMESTAMPING_TX_ACK)
tcb->txstamp_ack = 1;
tcb->txstamp_ack |= TSTAMP_ACK_SK;
if (tsflags & SOF_TIMESTAMPING_TX_RECORD_MASK)
shinfo->tskey = TCP_SKB_CB(skb)->seq + skb->len - 1;
}
if (cgroup_bpf_enabled(CGROUP_SOCK_OPS) &&
SK_BPF_CB_FLAG_TEST(sk, SK_BPF_CB_TX_TIMESTAMPING) && skb)
bpf_skops_tx_timestamping(sk, skb, BPF_SOCK_OPS_TSTAMP_SENDMSG_CB);
}
static bool tcp_stream_is_readable(struct sock *sk, int target)

View File

@ -169,6 +169,7 @@ static void bpf_skops_parse_hdr(struct sock *sk, struct sk_buff *skb)
memset(&sock_ops, 0, offsetof(struct bpf_sock_ops_kern, temp));
sock_ops.op = BPF_SOCK_OPS_PARSE_HDR_OPT_CB;
sock_ops.is_fullsock = 1;
sock_ops.is_locked_tcp_sock = 1;
sock_ops.sk = sk;
bpf_skops_init_skb(&sock_ops, skb, tcp_hdrlen(skb));
@@ -185,6 +186,7 @@ static void bpf_skops_established(struct sock *sk, int bpf_op,
memset(&sock_ops, 0, offsetof(struct bpf_sock_ops_kern, temp));
sock_ops.op = bpf_op;
sock_ops.is_fullsock = 1;
sock_ops.is_locked_tcp_sock = 1;
sock_ops.sk = sk;
/* sk with TCP_REPAIR_ON does not have skb in tcp_finish_connect */
if (skb)

@@ -525,6 +525,7 @@ static void bpf_skops_hdr_opt_len(struct sock *sk, struct sk_buff *skb,
sock_owned_by_me(sk);
sock_ops.is_fullsock = 1;
sock_ops.is_locked_tcp_sock = 1;
sock_ops.sk = sk;
}
@@ -570,6 +571,7 @@ static void bpf_skops_write_hdr_opt(struct sock *sk, struct sk_buff *skb,
sock_owned_by_me(sk);
sock_ops.is_fullsock = 1;
sock_ops.is_locked_tcp_sock = 1;
sock_ops.sk = sk;
}

@@ -681,7 +681,7 @@ void __sock_tx_timestamp(__u32 tsflags, __u8 *tx_flags)
u8 flags = *tx_flags;
if (tsflags & SOF_TIMESTAMPING_TX_HARDWARE) {
-flags |= SKBTX_HW_TSTAMP;
+flags |= SKBTX_HW_TSTAMP_NOBPF;
/* PTP hardware clocks can provide a free running cycle counter
* as a time base for virtual clocks. Tell driver to use the

@@ -742,6 +742,9 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
goto free_err;
}
}
if (meta->flags & XDP_TXMD_FLAGS_LAUNCH_TIME)
skb->skb_mstamp_ns = meta->request.launch_time;
}
}

@@ -6913,6 +6913,12 @@ enum {
BPF_SOCK_OPS_ALL_CB_FLAGS = 0x7F,
};
enum {
SK_BPF_CB_TX_TIMESTAMPING = 1<<0,
SK_BPF_CB_MASK = (SK_BPF_CB_TX_TIMESTAMPING - 1) |
SK_BPF_CB_TX_TIMESTAMPING
};
/* List of known BPF sock_ops operators.
* New entries can only be added at the end
*/
@@ -7025,6 +7031,29 @@ enum {
* by the kernel or the
* earlier bpf-progs.
*/
BPF_SOCK_OPS_TSTAMP_SCHED_CB, /* Called when skb is passing
* through dev layer when
* SK_BPF_CB_TX_TIMESTAMPING
* feature is on.
*/
BPF_SOCK_OPS_TSTAMP_SND_SW_CB, /* Called when skb is about to be sent
* to the NIC when SK_BPF_CB_TX_TIMESTAMPING
* feature is on.
*/
BPF_SOCK_OPS_TSTAMP_SND_HW_CB, /* Called in hardware phase when
* SK_BPF_CB_TX_TIMESTAMPING feature
* is on.
*/
BPF_SOCK_OPS_TSTAMP_ACK_CB, /* Called when all the skbs in the
* same sendmsg call are acked
* when SK_BPF_CB_TX_TIMESTAMPING
* feature is on.
*/
BPF_SOCK_OPS_TSTAMP_SENDMSG_CB, /* Called when every sendmsg syscall
* is triggered. It's used to correlate
* sendmsg timestamp with corresponding
* tskey.
*/
};
/* List of TCP states. There is a build check in net/ipv4/tcp.c to detect
@@ -7091,6 +7120,7 @@ enum {
TCP_BPF_SYN_IP = 1006, /* Copy the IP[46] and TCP header */
TCP_BPF_SYN_MAC = 1007, /* Copy the MAC, IP[46], and TCP header */
TCP_BPF_SOCK_OPS_CB_FLAGS = 1008, /* Get or Set TCP sock ops flags */
SK_BPF_CB_FLAGS = 1009, /* Get or set sock ops flags in socket */
};
enum {
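
SK_BPF_CB_MASK above is built so that it covers every defined flag bit up to the highest one. A rough sketch of how such a mask can be used to reject unknown bits follows. `cb_flags_valid` is a hypothetical name, and the check itself is an assumption about how a setter might validate its argument, not a copy of the kernel code.

```c
#include <stdbool.h>

enum {
	SK_BPF_CB_TX_TIMESTAMPING = 1 << 0,
	/* (highest_flag - 1) | highest_flag sets every bit up to and
	 * including the highest defined flag.
	 */
	SK_BPF_CB_MASK = (SK_BPF_CB_TX_TIMESTAMPING - 1) |
			 SK_BPF_CB_TX_TIMESTAMPING
};

/* Hypothetical validator: true only if no bit outside the mask is set. */
static bool cb_flags_valid(unsigned int flags)
{
	return (flags & ~(unsigned int)SK_BPF_CB_MASK) == 0;
}
```

Because the mask is derived from the highest flag, adding a new flag later only requires updating the last term of the expression.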

@@ -127,6 +127,12 @@ struct xdp_options {
*/
#define XDP_TXMD_FLAGS_CHECKSUM (1 << 1)
/* Request launch time hardware offload. The device will schedule the packet for
* transmission at a pre-determined time called launch time. The value of
* launch time is communicated via launch_time field of struct xsk_tx_metadata.
*/
#define XDP_TXMD_FLAGS_LAUNCH_TIME (1 << 2)
/* AF_XDP offloads request. 'request' union member is consumed by the driver
* when the packet is being transmitted. 'completion' union member is
* filled by the driver when the transmit completion arrives.
@@ -142,6 +148,10 @@ struct xsk_tx_metadata {
__u16 csum_start;
/* Offset from csum_start where checksum should be stored. */
__u16 csum_offset;
/* XDP_TXMD_FLAGS_LAUNCH_TIME */
/* Launch time in nanosecond against the PTP HW Clock */
__u64 launch_time;
} request;
struct {

@@ -59,10 +59,13 @@ enum netdev_xdp_rx_metadata {
* by the driver.
* @NETDEV_XSK_FLAGS_TX_CHECKSUM: L3 checksum HW offload is supported by the
* driver.
* @NETDEV_XSK_FLAGS_TX_LAUNCH_TIME_FIFO: Launch time HW offload is supported
* by the driver.
*/
enum netdev_xsk_flags {
NETDEV_XSK_FLAGS_TX_TIMESTAMP = 1,
NETDEV_XSK_FLAGS_TX_CHECKSUM = 2,
NETDEV_XSK_FLAGS_TX_LAUNCH_TIME_FIFO = 4,
};
enum netdev_queue_type {

@@ -0,0 +1,239 @@
#include <linux/net_tstamp.h>
#include <sys/time.h>
#include <linux/errqueue.h>
#include "test_progs.h"
#include "network_helpers.h"
#include "net_timestamping.skel.h"
#define CG_NAME "/net-timestamping-test"
#define NSEC_PER_SEC 1000000000LL
static const char addr4_str[] = "127.0.0.1";
static const char addr6_str[] = "::1";
static struct net_timestamping *skel;
static const int cfg_payload_len = 30;
static struct timespec usr_ts;
static u64 delay_tolerance_nsec = 10000000000; /* 10 seconds */
int SK_TS_SCHED;
int SK_TS_TXSW;
int SK_TS_ACK;
static int64_t timespec_to_ns64(struct timespec *ts)
{
return ts->tv_sec * NSEC_PER_SEC + ts->tv_nsec;
}
static void validate_key(int tskey, int tstype)
{
static int expected_tskey = -1;
if (tstype == SCM_TSTAMP_SCHED)
expected_tskey = cfg_payload_len - 1;
ASSERT_EQ(expected_tskey, tskey, "tskey mismatch");
expected_tskey = tskey;
}
static void validate_timestamp(struct timespec *cur, struct timespec *prev)
{
int64_t cur_ns, prev_ns;
cur_ns = timespec_to_ns64(cur);
prev_ns = timespec_to_ns64(prev);
ASSERT_LT(cur_ns - prev_ns, delay_tolerance_nsec, "latency");
}
static void test_socket_timestamp(struct scm_timestamping *tss, int tstype,
int tskey)
{
static struct timespec prev_ts;
validate_key(tskey, tstype);
switch (tstype) {
case SCM_TSTAMP_SCHED:
validate_timestamp(&tss->ts[0], &usr_ts);
SK_TS_SCHED += 1;
break;
case SCM_TSTAMP_SND:
validate_timestamp(&tss->ts[0], &prev_ts);
SK_TS_TXSW += 1;
break;
case SCM_TSTAMP_ACK:
validate_timestamp(&tss->ts[0], &prev_ts);
SK_TS_ACK += 1;
break;
}
prev_ts = tss->ts[0];
}
static void test_recv_errmsg_cmsg(struct msghdr *msg)
{
struct sock_extended_err *serr = NULL;
struct scm_timestamping *tss = NULL;
struct cmsghdr *cm;
for (cm = CMSG_FIRSTHDR(msg);
cm && cm->cmsg_len;
cm = CMSG_NXTHDR(msg, cm)) {
if (cm->cmsg_level == SOL_SOCKET &&
cm->cmsg_type == SCM_TIMESTAMPING) {
tss = (void *)CMSG_DATA(cm);
} else if ((cm->cmsg_level == SOL_IP &&
cm->cmsg_type == IP_RECVERR) ||
(cm->cmsg_level == SOL_IPV6 &&
cm->cmsg_type == IPV6_RECVERR) ||
(cm->cmsg_level == SOL_PACKET &&
cm->cmsg_type == PACKET_TX_TIMESTAMP)) {
serr = (void *)CMSG_DATA(cm);
ASSERT_EQ(serr->ee_origin, SO_EE_ORIGIN_TIMESTAMPING,
"cmsg type");
}
if (serr && tss)
test_socket_timestamp(tss, serr->ee_info,
serr->ee_data);
}
}
static bool socket_recv_errmsg(int fd)
{
static char ctrl[1024 /* overprovision*/];
char data[cfg_payload_len];
static struct msghdr msg;
struct iovec entry;
int n = 0;
memset(&msg, 0, sizeof(msg));
memset(&entry, 0, sizeof(entry));
memset(ctrl, 0, sizeof(ctrl));
entry.iov_base = data;
entry.iov_len = cfg_payload_len;
msg.msg_iov = &entry;
msg.msg_iovlen = 1;
msg.msg_name = NULL;
msg.msg_namelen = 0;
msg.msg_control = ctrl;
msg.msg_controllen = sizeof(ctrl);
n = recvmsg(fd, &msg, MSG_ERRQUEUE);
if (n == -1)
ASSERT_EQ(errno, EAGAIN, "recvmsg MSG_ERRQUEUE");
if (n >= 0)
test_recv_errmsg_cmsg(&msg);
return n == -1;
}
static void test_socket_timestamping(int fd)
{
while (!socket_recv_errmsg(fd));
ASSERT_EQ(SK_TS_SCHED, 1, "SCM_TSTAMP_SCHED");
ASSERT_EQ(SK_TS_TXSW, 1, "SCM_TSTAMP_SND");
ASSERT_EQ(SK_TS_ACK, 1, "SCM_TSTAMP_ACK");
SK_TS_SCHED = 0;
SK_TS_TXSW = 0;
SK_TS_ACK = 0;
}
static void test_tcp(int family, bool enable_socket_timestamping)
{
struct net_timestamping__bss *bss;
char buf[cfg_payload_len];
int sfd = -1, cfd = -1;
unsigned int sock_opt;
struct netns_obj *ns;
int cg_fd;
int ret;
cg_fd = test__join_cgroup(CG_NAME);
if (!ASSERT_OK_FD(cg_fd, "join cgroup"))
return;
ns = netns_new("net_timestamping_ns", true);
if (!ASSERT_OK_PTR(ns, "create ns"))
goto out;
skel = net_timestamping__open_and_load();
if (!ASSERT_OK_PTR(skel, "open and load skel"))
goto out;
if (!ASSERT_OK(net_timestamping__attach(skel), "attach skel"))
goto out;
skel->links.skops_sockopt =
bpf_program__attach_cgroup(skel->progs.skops_sockopt, cg_fd);
if (!ASSERT_OK_PTR(skel->links.skops_sockopt, "attach cgroup"))
goto out;
bss = skel->bss;
memset(bss, 0, sizeof(*bss));
skel->bss->monitored_pid = getpid();
sfd = start_server(family, SOCK_STREAM,
family == AF_INET6 ? addr6_str : addr4_str, 0, 0);
if (!ASSERT_OK_FD(sfd, "start_server"))
goto out;
cfd = connect_to_fd(sfd, 0);
if (!ASSERT_OK_FD(cfd, "connect_to_fd_server"))
goto out;
if (enable_socket_timestamping) {
sock_opt = SOF_TIMESTAMPING_SOFTWARE |
SOF_TIMESTAMPING_OPT_ID |
SOF_TIMESTAMPING_TX_SCHED |
SOF_TIMESTAMPING_TX_SOFTWARE |
SOF_TIMESTAMPING_TX_ACK;
ret = setsockopt(cfd, SOL_SOCKET, SO_TIMESTAMPING,
(char *) &sock_opt, sizeof(sock_opt));
if (!ASSERT_OK(ret, "setsockopt SO_TIMESTAMPING"))
goto out;
ret = clock_gettime(CLOCK_REALTIME, &usr_ts);
if (!ASSERT_OK(ret, "get user time"))
goto out;
}
ret = write(cfd, buf, sizeof(buf));
if (!ASSERT_EQ(ret, sizeof(buf), "send to server"))
goto out;
if (enable_socket_timestamping)
test_socket_timestamping(cfd);
ASSERT_EQ(bss->nr_active, 1, "nr_active");
ASSERT_EQ(bss->nr_snd, 2, "nr_snd");
ASSERT_EQ(bss->nr_sched, 1, "nr_sched");
ASSERT_EQ(bss->nr_txsw, 1, "nr_txsw");
ASSERT_EQ(bss->nr_ack, 1, "nr_ack");
out:
if (sfd >= 0)
close(sfd);
if (cfd >= 0)
close(cfd);
net_timestamping__destroy(skel);
netns_free(ns);
close(cg_fd);
}
void test_net_timestamping(void)
{
if (test__start_subtest("INET4: bpf timestamping"))
test_tcp(AF_INET, false);
if (test__start_subtest("INET4: bpf and socket timestamping"))
test_tcp(AF_INET, true);
if (test__start_subtest("INET6: bpf timestamping"))
test_tcp(AF_INET6, false);
if (test__start_subtest("INET6: bpf and socket timestamping"))
test_tcp(AF_INET6, true);
}

@@ -49,6 +49,7 @@
#define TCP_SAVED_SYN 28
#define TCP_CA_NAME_MAX 16
#define TCP_NAGLE_OFF 1
#define TCP_RTO_MAX_MS 44
#define TCP_ECN_OK 1
#define TCP_ECN_QUEUE_CWR 2

@@ -0,0 +1,248 @@
#include "vmlinux.h"
#include "bpf_tracing_net.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>
#include "bpf_misc.h"
#include "bpf_kfuncs.h"
#include <errno.h>
__u32 monitored_pid = 0;
int nr_active;
int nr_snd;
int nr_passive;
int nr_sched;
int nr_txsw;
int nr_ack;
struct sk_stg {
__u64 sendmsg_ns; /* record ts when sendmsg is called */
};
struct sk_tskey {
u64 cookie;
u32 tskey;
};
struct delay_info {
u64 sendmsg_ns; /* record ts when sendmsg is called */
u32 sched_delay; /* SCHED_CB - sendmsg_ns */
u32 snd_sw_delay; /* SND_SW_CB - SCHED_CB */
u32 ack_delay; /* ACK_CB - SND_SW_CB */
};
struct {
__uint(type, BPF_MAP_TYPE_SK_STORAGE);
__uint(map_flags, BPF_F_NO_PREALLOC);
__type(key, int);
__type(value, struct sk_stg);
} sk_stg_map SEC(".maps");
struct {
__uint(type, BPF_MAP_TYPE_HASH);
__type(key, struct sk_tskey);
__type(value, struct delay_info);
__uint(max_entries, 1024);
} time_map SEC(".maps");
static u64 delay_tolerance_nsec = 10000000000; /* 10 seconds as an example */
extern int bpf_sock_ops_enable_tx_tstamp(struct bpf_sock_ops_kern *skops, u64 flags) __ksym;
static int bpf_test_sockopt(void *ctx, const struct sock *sk, int expected)
{
int tmp, new = SK_BPF_CB_TX_TIMESTAMPING;
int opt = SK_BPF_CB_FLAGS;
int level = SOL_SOCKET;
if (bpf_setsockopt(ctx, level, opt, &new, sizeof(new)) != expected)
return 1;
if (bpf_getsockopt(ctx, level, opt, &tmp, sizeof(tmp)) != expected ||
(!expected && tmp != new))
return 1;
return 0;
}
static bool bpf_test_access_sockopt(void *ctx, const struct sock *sk)
{
if (bpf_test_sockopt(ctx, sk, -EOPNOTSUPP))
return true;
return false;
}
static bool bpf_test_access_load_hdr_opt(struct bpf_sock_ops *skops)
{
u8 opt[3] = {0};
int load_flags = 0;
int ret;
ret = bpf_load_hdr_opt(skops, opt, sizeof(opt), load_flags);
if (ret != -EOPNOTSUPP)
return true;
return false;
}
static bool bpf_test_access_cb_flags_set(struct bpf_sock_ops *skops)
{
int ret;
ret = bpf_sock_ops_cb_flags_set(skops, 0);
if (ret != -EOPNOTSUPP)
return true;
return false;
}
/* In the timestamping callbacks, the following BPF helpers must not be
* called, for safety reasons. Return false if they fail as expected.
*/
static bool bpf_test_access_bpf_calls(struct bpf_sock_ops *skops,
const struct sock *sk)
{
if (bpf_test_access_sockopt(skops, sk))
return true;
if (bpf_test_access_load_hdr_opt(skops))
return true;
if (bpf_test_access_cb_flags_set(skops))
return true;
return false;
}
static bool bpf_test_delay(struct bpf_sock_ops *skops, const struct sock *sk)
{
struct bpf_sock_ops_kern *skops_kern;
u64 timestamp = bpf_ktime_get_ns();
struct skb_shared_info *shinfo;
struct delay_info dinfo = {0};
struct sk_tskey key = {0};
struct delay_info *val;
struct sk_buff *skb;
struct sk_stg *stg;
u64 prior_ts, delay;
if (bpf_test_access_bpf_calls(skops, sk))
return false;
skops_kern = bpf_cast_to_kern_ctx(skops);
skb = skops_kern->skb;
shinfo = bpf_core_cast(skb->head + skb->end, struct skb_shared_info);
key.cookie = bpf_get_socket_cookie(skops);
if (!key.cookie)
return false;
if (skops->op == BPF_SOCK_OPS_TSTAMP_SENDMSG_CB) {
stg = bpf_sk_storage_get(&sk_stg_map, (void *)sk, 0, 0);
if (!stg)
return false;
dinfo.sendmsg_ns = stg->sendmsg_ns;
bpf_sock_ops_enable_tx_tstamp(skops_kern, 0);
key.tskey = shinfo->tskey;
if (!key.tskey)
return false;
bpf_map_update_elem(&time_map, &key, &dinfo, BPF_ANY);
return true;
}
key.tskey = shinfo->tskey;
if (!key.tskey)
return false;
val = bpf_map_lookup_elem(&time_map, &key);
if (!val)
return false;
switch (skops->op) {
case BPF_SOCK_OPS_TSTAMP_SCHED_CB:
val->sched_delay = timestamp - val->sendmsg_ns;
delay = val->sched_delay;
break;
case BPF_SOCK_OPS_TSTAMP_SND_SW_CB:
prior_ts = val->sched_delay + val->sendmsg_ns;
val->snd_sw_delay = timestamp - prior_ts;
delay = val->snd_sw_delay;
break;
case BPF_SOCK_OPS_TSTAMP_ACK_CB:
prior_ts = val->snd_sw_delay + val->sched_delay + val->sendmsg_ns;
val->ack_delay = timestamp - prior_ts;
delay = val->ack_delay;
break;
}
if (delay >= delay_tolerance_nsec)
return false;
/* Since it's the last one, remove from the map after latency check */
if (skops->op == BPF_SOCK_OPS_TSTAMP_ACK_CB)
bpf_map_delete_elem(&time_map, &key);
return true;
}
SEC("fentry/tcp_sendmsg_locked")
int BPF_PROG(trace_tcp_sendmsg_locked, struct sock *sk, struct msghdr *msg,
size_t size)
{
__u32 pid = bpf_get_current_pid_tgid() >> 32;
u64 timestamp = bpf_ktime_get_ns();
u32 flag = sk->sk_bpf_cb_flags;
struct sk_stg *stg;
if (pid != monitored_pid || !flag)
return 0;
stg = bpf_sk_storage_get(&sk_stg_map, sk, 0,
BPF_SK_STORAGE_GET_F_CREATE);
if (!stg)
return 0;
stg->sendmsg_ns = timestamp;
nr_snd += 1;
return 0;
}
SEC("sockops")
int skops_sockopt(struct bpf_sock_ops *skops)
{
struct bpf_sock *bpf_sk = skops->sk;
const struct sock *sk;
if (!bpf_sk)
return 1;
sk = (struct sock *)bpf_skc_to_tcp_sock(bpf_sk);
if (!sk)
return 1;
switch (skops->op) {
case BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB:
nr_active += !bpf_test_sockopt(skops, sk, 0);
break;
case BPF_SOCK_OPS_TSTAMP_SENDMSG_CB:
if (bpf_test_delay(skops, sk))
nr_snd += 1;
break;
case BPF_SOCK_OPS_TSTAMP_SCHED_CB:
if (bpf_test_delay(skops, sk))
nr_sched += 1;
break;
case BPF_SOCK_OPS_TSTAMP_SND_SW_CB:
if (bpf_test_delay(skops, sk))
nr_txsw += 1;
break;
case BPF_SOCK_OPS_TSTAMP_ACK_CB:
if (bpf_test_delay(skops, sk))
nr_ack += 1;
break;
}
return 1;
}
char _license[] SEC("license") = "GPL";
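
bpf_test_delay() above stores only per-stage deltas and rebuilds each earlier absolute timestamp by adding them back onto sendmsg_ns. The following user-space sketch replays that bookkeeping with made-up timestamps. `struct delays` and the `record_*` helpers are hypothetical names used only for illustration, not part of the selftest.

```c
#include <stdint.h>

/* Same layout idea as struct delay_info: one absolute start time plus
 * 32-bit deltas between consecutive stages.
 */
struct delays {
	uint64_t sendmsg_ns;
	uint32_t sched_delay;   /* SCHED - sendmsg */
	uint32_t snd_sw_delay;  /* SND_SW - SCHED */
	uint32_t ack_delay;     /* ACK - SND_SW */
};

static void record_sched(struct delays *d, uint64_t now)
{
	d->sched_delay = now - d->sendmsg_ns;
}

static void record_snd_sw(struct delays *d, uint64_t now)
{
	/* Rebuild the SCHED timestamp from the stored delta. */
	uint64_t prior = d->sendmsg_ns + d->sched_delay;

	d->snd_sw_delay = now - prior;
}

static void record_ack(struct delays *d, uint64_t now)
{
	/* Rebuild the SND_SW timestamp from both stored deltas. */
	uint64_t prior = d->sendmsg_ns + d->sched_delay + d->snd_sw_delay;

	d->ack_delay = now - prior;
}
```

Storing deltas keeps the map value small, at the cost that a 32-bit nanosecond delta wraps after roughly 4.29 seconds.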

@@ -61,6 +61,7 @@ static const struct sockopt_test sol_tcp_tests[] = {
{ .opt = TCP_NOTSENT_LOWAT, .new = 1314, .expected = 1314, },
{ .opt = TCP_BPF_SOCK_OPS_CB_FLAGS, .new = BPF_SOCK_OPS_ALL_CB_FLAGS,
.expected = BPF_SOCK_OPS_ALL_CB_FLAGS, },
{ .opt = TCP_RTO_MAX_MS, .new = 2000, .expected = 2000, },
{ .opt = 0, },
};

@@ -13,6 +13,7 @@
* - UDP 9091 packets trigger TX reply
* - TX HW timestamp is requested and reported back upon completion
* - TX checksum is requested
* - TX launch time HW offload is requested for transmission
*/
#include <test_progs.h>
@@ -37,6 +38,15 @@
#include <time.h>
#include <unistd.h>
#include <libgen.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/pkt_sched.h>
#include <linux/pkt_cls.h>
#include <linux/ethtool.h>
#include <sys/socket.h>
#include <arpa/inet.h>
#include "xdp_metadata.h"
@@ -64,6 +74,18 @@ int rxq;
bool skip_tx;
__u64 last_hw_rx_timestamp;
__u64 last_xdp_rx_timestamp;
__u64 last_launch_time;
__u64 launch_time_delta_to_hw_rx_timestamp;
int launch_time_queue;
#define run_command(cmd, ...) \
({ \
char command[1024]; \
memset(command, 0, sizeof(command)); \
snprintf(command, sizeof(command), cmd, ##__VA_ARGS__); \
fprintf(stderr, "Running: %s\n", command); \
system(command); \
})
void test__fail(void) { /* for network_helpers.c */ }
@@ -298,6 +320,12 @@ static bool complete_tx(struct xsk *xsk, clockid_t clock_id)
if (meta->completion.tx_timestamp) {
__u64 ref_tstamp = gettime(clock_id);
if (launch_time_delta_to_hw_rx_timestamp) {
print_tstamp_delta("HW Launch-time",
"HW TX-complete-time",
last_launch_time,
meta->completion.tx_timestamp);
}
print_tstamp_delta("HW TX-complete-time", "User TX-complete-time",
meta->completion.tx_timestamp, ref_tstamp);
print_tstamp_delta("XDP RX-time", "User TX-complete-time",
@@ -395,6 +423,17 @@ static void ping_pong(struct xsk *xsk, void *rx_packet, clockid_t clock_id)
xsk, ntohs(udph->check), ntohs(want_csum),
meta->request.csum_start, meta->request.csum_offset);
/* Set the value of launch time */
if (launch_time_delta_to_hw_rx_timestamp) {
meta->flags |= XDP_TXMD_FLAGS_LAUNCH_TIME;
meta->request.launch_time = last_hw_rx_timestamp +
launch_time_delta_to_hw_rx_timestamp;
last_launch_time = meta->request.launch_time;
print_tstamp_delta("HW RX-time", "HW Launch-time",
last_hw_rx_timestamp,
meta->request.launch_time);
}
memcpy(data, rx_packet, len); /* don't share umem chunk for simplicity */
tx_desc->options |= XDP_TX_METADATA;
tx_desc->len = len;
@@ -407,6 +446,7 @@ static int verify_metadata(struct xsk *rx_xsk, int rxq, int server_fd, clockid_t
const struct xdp_desc *rx_desc;
struct pollfd fds[rxq + 1];
__u64 comp_addr;
__u64 deadline;
__u64 addr;
__u32 idx = 0;
int ret;
@@ -477,9 +517,15 @@ static int verify_metadata(struct xsk *rx_xsk, int rxq, int server_fd, clockid_t
if (ret)
printf("kick_tx ret=%d\n", ret);
-for (int j = 0; j < 500; j++) {
+/* wait 1 second + cover launch time */
+deadline = gettime(clock_id) +
+NANOSEC_PER_SEC +
+launch_time_delta_to_hw_rx_timestamp;
+while (true) {
if (complete_tx(xsk, clock_id))
break;
if (gettime(clock_id) >= deadline)
break;
usleep(10);
}
}
@@ -608,6 +654,10 @@ static void print_usage(void)
" -h Display this help and exit\n\n"
" -m Enable multi-buffer XDP for larger MTU\n"
" -r Don't generate AF_XDP reply (rx metadata only)\n"
" -l Delta of launch time relative to HW RX-time in ns\n"
" default: 0 ns (launch time request is disabled)\n"
" -L Tx Queue to be enabled with launch time offload\n"
" default: 0 (Tx Queue 0)\n"
"Generate test packets on the other machine with:\n"
" echo -n xdp | nc -u -q1 <dst_ip> 9091\n";
@@ -618,7 +668,7 @@ static void read_args(int argc, char *argv[])
{
int opt;
-while ((opt = getopt(argc, argv, "chmr")) != -1) {
+while ((opt = getopt(argc, argv, "chmrl:L:")) != -1) {
switch (opt) {
case 'c':
bind_flags &= ~XDP_USE_NEED_WAKEUP;
@@ -634,6 +684,12 @@ static void read_args(int argc, char *argv[])
case 'r':
skip_tx = true;
break;
case 'l':
launch_time_delta_to_hw_rx_timestamp = atoll(optarg);
break;
case 'L':
launch_time_queue = atoll(optarg);
break;
case '?':
if (isprint(optopt))
fprintf(stderr, "Unknown option: -%c\n", optopt);
@@ -657,23 +713,118 @@ static void read_args(int argc, char *argv[])
error(-1, errno, "Invalid interface name");
}
void clean_existing_configurations(void)
{
/* Check and delete root qdisc if exists */
if (run_command("sudo tc qdisc show dev %s | grep -q 'qdisc mqprio 8001:'", ifname) == 0)
run_command("sudo tc qdisc del dev %s root", ifname);
/* Check and delete ingress qdisc if exists */
if (run_command("sudo tc qdisc show dev %s | grep -q 'qdisc ingress ffff:'", ifname) == 0)
run_command("sudo tc qdisc del dev %s ingress", ifname);
/* Check and delete ethtool filters if any exist */
if (run_command("sudo ethtool -n %s | grep -q 'Filter:'", ifname) == 0) {
run_command("sudo ethtool -n %s | grep 'Filter:' | awk '{print $2}' | xargs -n1 sudo ethtool -N %s delete >&2",
ifname, ifname);
}
}
#define MAX_TC 16
int main(int argc, char *argv[])
{
clockid_t clock_id = CLOCK_TAI;
struct bpf_program *prog;
int server_fd = -1;
size_t map_len = 0;
size_t que_len = 0;
char *buf = NULL;
char *map = NULL;
char *que = NULL;
char *tmp = NULL;
int tc = 0;
int ret;
int i;
-struct bpf_program *prog;
read_args(argc, argv);
rxq = rxq_num(ifname);
printf("rxq: %d\n", rxq);
if (launch_time_queue >= rxq || launch_time_queue < 0)
error(1, 0, "Invalid launch_time_queue.");
clean_existing_configurations();
sleep(1);
/* Enable tx and rx hardware timestamping */
hwtstamp_enable(ifname);
/* Prepare priority to traffic class map for tc-mqprio */
for (i = 0; i < MAX_TC; i++) {
if (i < rxq)
tc = i;
if (asprintf(&buf, "%d ", tc) == -1) {
printf("Failed to malloc buf for tc map.\n");
goto free_mem;
}
map_len += strlen(buf);
tmp = realloc(map, map_len + 1);
if (!tmp) {
printf("Failed to realloc tc map.\n");
goto free_mem;
}
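/* realloc(NULL, ...) returns uninitialized memory; terminate before strcat */
tmp[map_len - strlen(buf)] = '\0';
map = tmp;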
strcat(map, buf);
free(buf);
buf = NULL;
}
/* Prepare traffic class to hardware queue map for tc-mqprio */
for (i = 0; i <= tc; i++) {
if (asprintf(&buf, "1@%d ", i) == -1) {
printf("Failed to malloc buf for tc queues.\n");
goto free_mem;
}
que_len += strlen(buf);
tmp = realloc(que, que_len + 1);
if (!tmp) {
printf("Failed to realloc tc queues.\n");
goto free_mem;
}
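/* realloc(NULL, ...) returns uninitialized memory; terminate before strcat */
tmp[que_len - strlen(buf)] = '\0';
que = tmp;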
strcat(que, buf);
free(buf);
buf = NULL;
}
/* Add mqprio qdisc */
run_command("sudo tc qdisc add dev %s handle 8001: parent root mqprio num_tc %d map %squeues %shw 0",
ifname, tc + 1, map, que);
/* To test launch time, send UDP packet with VLAN priority 1 to port 9091 */
if (launch_time_delta_to_hw_rx_timestamp) {
/* Enable launch time hardware offload on launch_time_queue */
run_command("sudo tc qdisc replace dev %s parent 8001:%d etf offload clockid CLOCK_TAI delta 500000",
ifname, launch_time_queue + 1);
sleep(1);
/* Route incoming packet with VLAN priority 1 into launch_time_queue */
if (run_command("sudo ethtool -N %s flow-type ether vlan 0x2000 vlan-mask 0x1FFF action %d",
ifname, launch_time_queue)) {
run_command("sudo tc qdisc add dev %s ingress", ifname);
run_command("sudo tc filter add dev %s parent ffff: protocol 802.1Q flower vlan_prio 1 hw_tc %d",
ifname, launch_time_queue);
}
/* Enable VLAN tag stripping offload */
run_command("sudo ethtool -K %s rxvlan on", ifname);
}
rx_xsk = malloc(sizeof(struct xsk) * rxq);
if (!rx_xsk)
error(1, ENOMEM, "malloc");
@@ -733,4 +884,11 @@ int main(int argc, char *argv[])
cleanup();
if (ret)
error(1, -ret, "verify_metadata");
clean_existing_configurations();
free_mem:
free(buf);
free(map);
free(que);
}