linux

mirror of https://github.com/torvalds/linux.git synced 2026-05-31 18:43:33 +02:00

Author	SHA1	Message	Date
Chuck Lever	f508262ae9	tls: Preserve sk_err across recvmsg() when data has been copied The sk_err check in tls_rx_rec_wait() consumes the error via sock_error(), which clears sk_err atomically. When the caller (tls_sw_recvmsg, tls_sw_splice_read, or tls_sw_read_sock) already has bytes copied to userspace, it returns those bytes and discards the error from this call. sk_err is now zero on the socket, so the next read syscall observes only RCV_SHUTDOWN and reports a clean EOF instead of the actual error (typically -ECONNRESET). The race is reachable when tls_read_flush_backlog()'s periodic sk_flush_backlog() triggers tcp_reset() in the middle of a multi-record read. Pass a has_copied flag to tls_rx_rec_wait(). When has_copied is false, consume sk_err via sock_error() as before. When has_copied is true, report the error from READ_ONCE() but leave sk_err set: the caller returns the byte count and discards the err from this call, and the next read syscall surfaces the preserved sk_err. This mirrors the tcp_recvmsg() preserve-and-surface pattern. The decrypt-abort path is unaffected: tls_err_abort() raises sk_err to EBADMSG after tls_rx_rec_wait() returns, and nothing on the caller's return path consumes it, so the EBADMSG surfaces on the next read. tls_sw_splice_read() passes has_copied=false: it processes one record per call, so no bytes have been copied within the function when tls_rx_rec_wait() runs. A reset that arrives between iterations of splice_direct_to_actor() (the sendfile() path) is still consumed by sock_error() in the later call, and the outer loop returns the prior iterations' byte count and drops the error. tcp_splice_read() exhibits the same pattern at the iteration boundary; addressing it belongs at the splice_direct_to_actor() layer and is out of scope here. Fixes: `c46b01839f` ("tls: rx: periodically flush socket backlog") Suggested-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Link: https://patch.msgid.link/20260513125825.205189-1-cel@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-05-14 18:19:44 -07:00
Jakub Kicinski	ff26a0e837	net: tls: prevent chain-after-chain in plain text SG Sashiko points out that if end = 0 (start != 0) the current code will create a chain link to content type right after the wrap link: This would create a chain where the wrap link points directly to another chain link. The scatterlist API sg_next iterator does not recursively resolve consecutive chain links. meaning this is illegal input to crypto. The wrapping link is unnecessary if end = 0. end is the entry after the last one used so end = 0 means there's nothing pushed after the wrap: end start i v v v [ ]...[ ][ d ][ d ][ d ][ d ][rsv for wrap] Skip the wrapping in this case. TLS 1.3 can use the "wrapping slot" for it's chaining if end = 0. This avoids the chain-after-chain. Move the wrap chaining before marking END and chaining off content type, that feels like more logical ordering to me, but should not matter from functional perspective. Reported-by: Sashiko <sashiko-bot@kernel.org> Fixes: `9aaaa56845` ("bpf: Sockmap/tls, skmsg can have wrapped skmsg that needs extra chaining") Signed-off-by: Jakub Kicinski <kuba@kernel.org> Link: https://patch.msgid.link/20260511174920.433155-3-kuba@kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-05-14 13:18:40 +02:00
Jakub Kicinski	285943c6e7	net: tls: fix off-by-one in sg_chain entry count for wrapped sk_msg ring When an sk_msg scatterlist ring wraps (sg.end < sg.start), tls_push_record() chains the tail portion of the ring to the head using sg_chain(). An extra entry in the sg array is reserved for this: struct sk_msg_sg { [...] /* The extra two elements: * 1) used for chaining the front and sections when the list becomes * partitioned (e.g. end < start). The crypto APIs require the * chaining; * 2) to chain tailer SG entries after the message. */ struct scatterlist data[MAX_MSG_FRAGS + 2]; The current code uses MAX_SKB_FRAGS + 1 as the ring size: sg_chain(&msg_pl->sg.data[msg_pl->sg.start], MAX_SKB_FRAGS - msg_pl->sg.start + 1, msg_pl->sg.data); This places the chain pointer at sg_chain(data[start], (MAX_SKB_FRAGS - msg_start + 1) .. = &data[start] + (MAX_SKB_FRAGS - msg_start + 1) - 1 = data[start + (MAX_SKB_FRAGS - start + 1) - 1] = data[MAX_SKB_FRAGS] instead of the true last entry. This is likely due to a "race" of the commit under Fixes landing close to commit `031097d9e0` ("bpf: sk_msg, zap ingress queue on psock down") Convert to ARRAY_SIZE and drop the data[start] / - start (as suggested by Sabrina). Reported-by: 钱一铭 <yimingqian591@gmail.com> Fixes: `9aaaa56845` ("bpf: Sockmap/tls, skmsg can have wrapped skmsg that needs extra chaining") Signed-off-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Sabrina Dubroca <sd@queasysnail.net> Link: https://patch.msgid.link/20260511174920.433155-2-kuba@kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-05-14 13:18:40 +02:00
Jakub Kicinski	7e7be31bfd	net: tls: fix silent data drop under pipe back-pressure tls_sw_splice_read() uses len when advancing rxm->offset / rxm->full_len after skb_splice_bits(), rather than copied (the actual number of bytes successfully spliced into the pipe). When the destination pipe cannot accept all the requested bytes, splice_to_pipe() returns fewer bytes than len, and 'len - copied' of data is effectively skipped over. Fixes: `e062fe99cc` ("tls: splice_read: fix accessing pre-processed records") Link: https://patch.msgid.link/20260429222944.2139041-2-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-05-02 18:27:14 -07:00
Jakub Kicinski	58689498ca	net: tls: fix strparser anchor skb leak on offload RX setup failure When tls_set_device_offload_rx() fails at tls_dev_add(), the error path calls tls_sw_free_resources_rx() to clean up the SW context that was initialized by tls_set_sw_offload(). This function calls tls_sw_release_resources_rx() (which stops the strparser via tls_strp_stop()) and tls_sw_free_ctx_rx() (which kfrees the context), but never frees the anchor skb that was allocated by alloc_skb(0) in tls_strp_init(). Note that tls_sw_free_resources_rx() is exclusively used for this "failed to start offload" code path, there's no other caller. The leak did not exist before commit `84c61fe1a7` ("tls: rx: do not use the standard strparser"), because the standard strparser doesn't try to pre-allocate an skb. The normal close path in tls_sk_proto_close() handles cleanup by calling tls_sw_strparser_done() (which calls tls_strp_done()) after dropping the socket lock, because tls_strp_done() does cancel_work_sync() and the strparser work handler takes the socket lock. Fixes: `84c61fe1a7` ("tls: rx: do not use the standard strparser") Signed-off-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Vadim Fedorenko <vadim.fedorenko@linux.dev> Link: https://patch.msgid.link/20260428231559.1358502-1-kuba@kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-04-30 13:38:29 +02:00
Jakub Kicinski	b6e39e4846	Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Cross-merge networking fixes after downstream PR (net-7.0-rc8). Conflicts: net/ipv6/seg6_iptunnel.c `c3812651b5` ("seg6: separate dst_cache for input and output paths in seg6 lwtunnel") `78723a62b9` ("seg6: add per-route tunnel source address") https://lore.kernel.org/adZhwtOYfo-0ImSa@sirena.org.uk net/ipv4/icmp.c `fde29fd934` ("ipv4: icmp: fix null-ptr-deref in icmp_build_probe()") `d98adfbdd5` ("ipv4: drop ipv6_stub usage and use direct function calls") https://lore.kernel.org/adO3dccqnr6j-BL9@sirena.org.uk Adjacent changes: drivers/net/ethernet/stmicro/stmmac/chain_mode.c `51f4e090b9` ("net: stmmac: fix integer underflow in chain mode") `6b4286e055` ("net: stmmac: rename STMMAC_GET_ENTRY() -> STMMAC_NEXT_ENTRY()") Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-04-09 13:20:59 -07:00
Muhammad Alifa Ramdhan	a9b8b18364	net/tls: fix use-after-free in -EBUSY error path of tls_do_encryption The -EBUSY handling in tls_do_encryption(), introduced by commit `8590541473` ("net: tls: handle backlogging of crypto requests"), has a use-after-free due to double cleanup of encrypt_pending and the scatterlist entry. When crypto_aead_encrypt() returns -EBUSY, the request is enqueued to the cryptd backlog and the async callback tls_encrypt_done() will be invoked upon completion. That callback unconditionally restores the scatterlist entry (sge->offset, sge->length) and decrements ctx->encrypt_pending. However, if tls_encrypt_async_wait() returns an error, the synchronous error path in tls_do_encryption() performs the same cleanup again, double-decrementing encrypt_pending and double-restoring the scatterlist. The double-decrement corrupts the encrypt_pending sentinel (initialized to 1), making tls_encrypt_async_wait() permanently skip the wait for pending async callbacks. A subsequent sendmsg can then free the tls_rec via bpf_exec_tx_verdict() while a cryptd callback is still pending, resulting in a use-after-free when the callback fires on the freed record. Fix this by skipping the synchronous cleanup when the -EBUSY async wait returns an error, since the callback has already handled encrypt_pending and sge restoration. Fixes: `8590541473` ("net: tls: handle backlogging of crypto requests") Cc: stable@vger.kernel.org Signed-off-by: Muhammad Alifa Ramdhan <ramdhan@starlabs.sg> Reviewed-by: Sabrina Dubroca <sd@queasysnail.net> Link: https://patch.msgid.link/20260403013617.2838875-1-ramdhan@starlabs.sg Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-04-07 14:53:42 +02:00
Jakub Kicinski	9ebcf66cd6	Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Cross-merge networking fixes after downstream PR (net-7.0-rc6). No conflicts, or adjacent changes. Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-03-26 12:09:57 -07:00
Chuck Lever	84a8335d83	tls: Purge async_hold in tls_decrypt_async_wait() The async_hold queue pins encrypted input skbs while the AEAD engine references their scatterlist data. Once tls_decrypt_async_wait() returns, every AEAD operation has completed and the engine no longer references those skbs, so they can be freed unconditionally. A subsequent patch adds batch async decryption to tls_sw_read_sock(), introducing a new call site that must drain pending AEAD operations and release held skbs. Move __skb_queue_purge(&ctx->async_hold) into tls_decrypt_async_wait() so the purge is centralized and every caller -- recvmsg's drain path, the -EBUSY fallback in tls_do_decryption(), and the new read_sock batch path -- releases held skbs on synchronization without each site managing the purge independently. This fixes a leak when tls_strp_msg_hold() fails part-way through, after having added some cloned skbs to the async_hold queue. tls_decrypt_sg() will then call tls_decrypt_async_wait() to process all pending decrypts, and drop back to synchronous mode, but tls_sw_recvmsg() only flushes the async_hold queue when one record has been processed in "fully-async" mode, which may not be the case here. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Reported-by: Yiming Qian <yimingqian591@gmail.com> Fixes: `b8a6ff84ab` ("tls: wait for pending async decryptions if tls_strp_msg_hold fails") Link: https://patch.msgid.link/20260324-tls-read-sock-v5-1-5408befe5774@oracle.com [pabeni@redhat.com: added leak comment] Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-03-26 09:55:53 +01:00
Eric Dumazet	8341c989ac	net: remove addr_len argument of recvmsg() handlers Use msg->msg_namelen as a place holder instead of a temporary variable, notably in inet[6]_recvmsg(). This removes stack canaries and allows tail-calls. $ scripts/bloat-o-meter -t vmlinux.old vmlinux add/remove: 0/0 grow/shrink: 2/19 up/down: 26/-532 (-506) Function old new delta rawv6_recvmsg 744 767 +23 vsock_dgram_recvmsg 55 58 +3 vsock_connectible_recvmsg 50 47 -3 unix_stream_recvmsg 161 158 -3 unix_seqpacket_recvmsg 62 59 -3 unix_dgram_recvmsg 42 39 -3 tcp_recvmsg 546 543 -3 mptcp_recvmsg 1568 1565 -3 ping_recvmsg 806 800 -6 tcp_bpf_recvmsg_parser 983 974 -9 ip_recv_error 588 576 -12 ipv6_recv_rxpmtu 442 428 -14 udp_recvmsg 1243 1224 -19 ipv6_recv_error 1046 1024 -22 udpv6_recvmsg 1487 1461 -26 raw_recvmsg 465 437 -28 udp_bpf_recvmsg 1027 984 -43 sock_common_recvmsg 103 27 -76 inet_recvmsg 257 175 -82 inet6_recvmsg 257 175 -82 tcp_bpf_recvmsg 663 568 -95 Total: Before=25143834, After=25143328, chg -0.00% Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://patch.msgid.link/20260227151120.1346573-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-03-02 18:17:17 -08:00
Linus Torvalds	b9c8fc2cae	Including fixes from IPsec, Bluetooth and netfilter Current release - regressions: - wifi: fix dev_alloc_name() return value check - rds: fix recursive lock in rds_tcp_conn_slots_available Current release - new code bugs: - vsock: lock down child_ns_mode as write-once Previous releases - regressions: - core: - do not pass flow_id to set_rps_cpu() - consume xmit errors of GSO frames - netconsole: avoid OOB reads, msg is not nul-terminated - netfilter: h323: fix OOB read in decode_choice() - tcp: re-enable acceptance of FIN packets when RWIN is 0 - udplite: fix null-ptr-deref in __udp_enqueue_schedule_skb(). - wifi: brcmfmac: fix potential kernel oops when probe fails - phy: register phy led_triggers during probe to avoid AB-BA deadlock - eth: bnxt_en: fix deleting of Ntuple filters - eth: wan: farsync: fix use-after-free bugs caused by unfinished tasklets - eth: xscale: check for PTP support properly Previous releases - always broken: - tcp: fix potential race in tcp_v6_syn_recv_sock() - kcm: fix zero-frag skb in frag_list on partial sendmsg error - xfrm: - fix race condition in espintcp_close() - always flush state and policy upon NETDEV_UNREGISTER event - bluetooth: - purge error queues in socket destructors - fix response to L2CAP_ECRED_CONN_REQ - eth: mlx5: - fix circular locking dependency in dump - fix "scheduling while atomic" in IPsec MAC address query - eth: gve: fix incorrect buffer cleanup for QPL - eth: team: avoid NETDEV_CHANGEMTU event when unregistering slave - eth: usb: validate USB endpoints Signed-off-by: Paolo Abeni <pabeni@redhat.com> -----BEGIN PGP SIGNATURE----- iQJGBAABCgAwFiEEg1AjqC77wbdLX2LbKSR5jcyPE6QFAmmgYU4SHHBhYmVuaUBy ZWRoYXQuY29tAAoJECkkeY3MjxOkLBgQAINazHstJ0DoDkvmwXapRSN0Ffauyd46 oX6nfeWOT3BzZbAhZHtGgCSs4aULifJWMevtT7pq7a7PgZwMwfa47BugR1G/u5UE hCqalNjRTB/U2KmFk6eViKSacD4FvUIAyAMOotn1aEdRRAkBIJnIW/o/ZR9ZUkm0 5+UigO64aq57+FOc5EQdGjYDcTVdzW12iOZ8ZqwtSATdNd9aC+gn3voRomTEo+Fm kQinkFEPAy/YyHGmfpC/z87/RTgkYLpagmsT4ZvBJeNPrIRvFEibSpPNhuzTzg81 /BW5M8sJmm3XFiTiRp6Blv+0n6HIpKjAZMHn5c9hzX9cxPZQ24EjkXEex9ClaxLd OMef79rr1HBwqBTpIlK7xfLKCdT5Iex88s8HxXRB/Psqk9pVP469cSoK6cpyiGiP I+4WT0wn9ukTiu/yV2L2byVr1sanlu54P+UBYJpDwqq3lZ1ngWtkJ+SY369jhwAS FYIBmUSKhmWz3FEULaGpgPy4m9Fl/fzN8IFh2Buoc/Puq61HH7MAMjRty2ZSFTqj gbHrRhlkCRqubytgjsnCDPLoJF4ZYcXtpo/8ogG3641H1I+dN+DyGGVZ/ioswkks My1ds0rKqA3BHCmn+pN/qqkuopDCOB95dqOpgDqHG7GePrpa/FJ1guhxexsCd+nL Run2RcgDmd+d =HBOu -----END PGP SIGNATURE----- Merge tag 'net-7.0-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Paolo Abeni: "Including fixes from IPsec, Bluetooth and netfilter Current release - regressions: - wifi: fix dev_alloc_name() return value check - rds: fix recursive lock in rds_tcp_conn_slots_available Current release - new code bugs: - vsock: lock down child_ns_mode as write-once Previous releases - regressions: - core: - do not pass flow_id to set_rps_cpu() - consume xmit errors of GSO frames - netconsole: avoid OOB reads, msg is not nul-terminated - netfilter: h323: fix OOB read in decode_choice() - tcp: re-enable acceptance of FIN packets when RWIN is 0 - udplite: fix null-ptr-deref in __udp_enqueue_schedule_skb(). - wifi: brcmfmac: fix potential kernel oops when probe fails - phy: register phy led_triggers during probe to avoid AB-BA deadlock - eth: - bnxt_en: fix deleting of Ntuple filters - wan: farsync: fix use-after-free bugs caused by unfinished tasklets - xscale: check for PTP support properly Previous releases - always broken: - tcp: fix potential race in tcp_v6_syn_recv_sock() - kcm: fix zero-frag skb in frag_list on partial sendmsg error - xfrm: - fix race condition in espintcp_close() - always flush state and policy upon NETDEV_UNREGISTER event - bluetooth: - purge error queues in socket destructors - fix response to L2CAP_ECRED_CONN_REQ - eth: - mlx5: - fix circular locking dependency in dump - fix "scheduling while atomic" in IPsec MAC address query - gve: fix incorrect buffer cleanup for QPL - team: avoid NETDEV_CHANGEMTU event when unregistering slave - usb: validate USB endpoints" * tag 'net-7.0-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (72 commits) netfilter: nf_conntrack_h323: fix OOB read in decode_choice() dpaa2-switch: validate num_ifs to prevent out-of-bounds write net: consume xmit errors of GSO frames vsock: document write-once behavior of the child_ns_mode sysctl vsock: lock down child_ns_mode as write-once selftests/vsock: change tests to respect write-once child ns mode net/mlx5e: Fix "scheduling while atomic" in IPsec MAC address query net/mlx5: Fix missing devlink lock in SRIOV enable error path net/mlx5: E-switch, Clear legacy flag when moving to switchdev net/mlx5: LAG, disable MPESW in lag_disable_change() net/mlx5: DR, Fix circular locking dependency in dump selftests: team: Add a reference count leak test team: avoid NETDEV_CHANGEMTU event when unregistering slave net: mana: Fix double destroy_workqueue on service rescan PCI path MAINTAINERS: Update maintainer entry for QUALCOMM ETHQOS ETHERNET DRIVER dpll: zl3073x: Remove redundant cleanup in devm_dpll_init() selftests/net: packetdrill: Verify acceptance of FIN packets when RWIN is 0 tcp: re-enable acceptance of FIN packets when RWIN is 0 vsock: Use container_of() to get net namespace in sysctl handlers net: usb: kaweth: validate USB endpoints ...	2026-02-26 08:00:13 -08:00
Hyunwoo Kim	7bb09315f9	tls: Fix race condition in tls_sw_cancel_work_tx() This issue was discovered during a code audit. After cancel_delayed_work_sync() is called from tls_sk_proto_close(), tx_work_handler() can still be scheduled from paths such as the Delayed ACK handler or ksoftirqd. As a result, the tx_work_handler() worker may dereference a freed TLS object. The following is a simple race scenario: cpu0 cpu1 tls_sk_proto_close() tls_sw_cancel_work_tx() tls_write_space() tls_sw_write_space() if (!test_and_set_bit(BIT_TX_SCHEDULED, &tx_ctx->tx_bitmask)) set_bit(BIT_TX_SCHEDULED, &ctx->tx_bitmask); cancel_delayed_work_sync(&ctx->tx_work.work); schedule_delayed_work(&tx_ctx->tx_work.work, 0); To prevent this race condition, cancel_delayed_work_sync() is replaced with disable_delayed_work_sync(). Fixes: `f87e62d45e` ("net/tls: remove close callback sock unlock/lock around TX work flush") Signed-off-by: Hyunwoo Kim <imv4bel@gmail.com> Reviewed-by: Simon Horman <horms@kernel.org> Reviewed-by: Sabrina Dubroca <sd@queasysnail.net> Link: https://patch.msgid.link/aZgsFO6nfylfvLE7@v4bel Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-23 17:08:14 -08:00
Linus Torvalds	bf4afc53b7	Convert 'alloc_obj' family to use the new default GFP_KERNEL argument This was done entirely with mindless brute force, using git grep -l '\<k[vmz]alloc_objs(., GFP_KERNEL)' \| xargs sed -i 's/$alloc_objs(.*$, GFP_KERNEL)/\1)/' to convert the new alloc_obj() users that had a simple GFP_KERNEL argument to just drop that argument. Note that due to the extreme simplicity of the scripting, any slightly more complex cases spread over multiple lines would not be triggered: they definitely exist, but this covers the vast bulk of the cases, and the resulting diff is also then easier to check automatically. For the same reason the 'flex' versions will be done as a separate conversion. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2026-02-21 17:09:51 -08:00
Kees Cook	69050f8d6d	treewide: Replace kmalloc with kmalloc_obj for non-scalar types This is the result of running the Coccinelle script from scripts/coccinelle/api/kmalloc_objs.cocci. The script is designed to avoid scalar types (which need careful case-by-case checking), and instead replace kmalloc-family calls that allocate struct or union object instances: Single allocations: kmalloc(sizeof(TYPE), ...) are replaced with: kmalloc_obj(TYPE, ...) Array allocations: kmalloc_array(COUNT, sizeof(TYPE), ...) are replaced with: kmalloc_objs(TYPE, COUNT, ...) Flex array allocations: kmalloc(struct_size(PTR, FAM, COUNT), ...) are replaced with: kmalloc_flex(PTR, FAM, COUNT, ...) (where TYPE may also be VAR) The resulting allocations no longer return "void ", instead returning "TYPE ". Signed-off-by: Kees Cook <kees@kernel.org>	2026-02-21 01:02:28 -08:00
Wilfred Mallawa	82cb5be6ad	net/tls: support setting the maximum payload size During a handshake, an endpoint may specify a maximum record size limit. Currently, the kernel defaults to TLS_MAX_PAYLOAD_SIZE (16KB) for the maximum record size. Meaning that, the outgoing records from the kernel can exceed a lower size negotiated during the handshake. In such a case, the TLS endpoint must send a fatal "record_overflow" alert [1], and thus the record is discarded. Upcoming Western Digital NVMe-TCP hardware controllers implement TLS support. For these devices, supporting TLS record size negotiation is necessary because the maximum TLS record size supported by the controller is less than the default 16KB currently used by the kernel. Currently, there is no way to inform the kernel of such a limit. This patch adds support to a new setsockopt() option `TLS_TX_MAX_PAYLOAD_LEN` that allows for setting the maximum plaintext fragment size. Once set, outgoing records are no larger than the size specified. This option can be used to specify the record size limit. [1] https://www.rfc-editor.org/rfc/rfc8449 Signed-off-by: Wilfred Mallawa <wilfred.mallawa@wdc.com> Reviewed-by: Sabrina Dubroca <sd@queasysnail.net> Link: https://patch.msgid.link/20251022001937.20155-1-wilfred.opensource@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-10-27 16:13:42 -07:00
Sabrina Dubroca	7f846c65ca	tls: don't rely on tx_work during send() With async crypto, we rely on tx_work to actually transmit records once encryption completes. But while send() is running, both the tx_lock and socket lock are held, so tx_work_handler cannot process the queue of encrypted records, and simply reschedules itself. During a large send(), this could last a long time, and use a lot of memory. Transmit any pending encrypted records before restarting the main loop of tls_sw_sendmsg_locked. Fixes: `a42055e8d2` ("net/tls: Add support for async encryption of records for performance") Reported-by: Jann Horn <jannh@google.com> Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Link: https://patch.msgid.link/8396631478f70454b44afb98352237d33f48d34d.1760432043.git.sd@queasysnail.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-10-15 17:41:45 -07:00
Sabrina Dubroca	b8a6ff84ab	tls: wait for pending async decryptions if tls_strp_msg_hold fails Async decryption calls tls_strp_msg_hold to create a clone of the input skb to hold references to the memory it uses. If we fail to allocate that clone, proceeding with async decryption can lead to various issues (UAF on the skb, writing into userspace memory after the recv() call has returned). In this case, wait for all pending decryption requests. Fixes: `84c61fe1a7` ("tls: rx: do not use the standard strparser") Reported-by: Jann Horn <jannh@google.com> Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Link: https://patch.msgid.link/b9fe61dcc07dab15da9b35cf4c7d86382a98caf2.1760432043.git.sd@queasysnail.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-10-15 17:41:45 -07:00
Sabrina Dubroca	b014a4e066	tls: wait for async encrypt in case of error during latter iterations of sendmsg If we hit an error during the main loop of tls_sw_sendmsg_locked (eg failed allocation), we jump to send_end and immediately return. Previous iterations may have queued async encryption requests that are still pending. We should wait for those before returning, as we could otherwise be reading from memory that userspace believes we're not using anymore, which would be a sort of use-after-free. This is similar to what tls_sw_recvmsg already does: failures during the main loop jump to the "wait for async" code, not straight to the unlock/return. Fixes: `a42055e8d2` ("net/tls: Add support for async encryption of records for performance") Reported-by: Jann Horn <jannh@google.com> Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Link: https://patch.msgid.link/c793efe9673b87f808d84fdefc0f732217030c52.1760432043.git.sd@queasysnail.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-10-15 17:41:45 -07:00
Sabrina Dubroca	ce5af41e32	tls: trim encrypted message to match the plaintext on short splice During tls_sw_sendmsg_locked, we pre-allocate the encrypted message for the size we're expecting to send during the current iteration, but we may end up sending less, for example when splicing: if we're getting the data from small fragments of memory, we may fill up all the slots in the skmsg with less data than expected. In this case, we need to trim the encrypted message to only the length we actually need, to avoid pushing uninitialized bytes down the underlying TCP socket. Fixes: `fe1e81d4f7` ("tls/sw: Support MSG_SPLICE_PAGES") Reported-by: Jann Horn <jannh@google.com> Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Link: https://patch.msgid.link/66a0ae99c9efc15f88e9e56c1f58f902f442ce86.1760432043.git.sd@queasysnail.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-10-15 17:41:45 -07:00
Jakub Kicinski	0aeb54ac4c	tls: make sure to abort the stream if headers are bogus Normally we wait for the socket to buffer up the whole record before we service it. If the socket has a tiny buffer, however, we read out the data sooner, to prevent connection stalls. Make sure that we abort the connection when we find out late that the record is actually invalid. Retrying the parsing is fine in itself but since we copy some more data each time before we parse we can overflow the allocated skb space. Constructing a scenario in which we're under pressure without enough data in the socket to parse the length upfront is quite hard. syzbot figured out a way to do this by serving us the header in small OOB sends, and then filling in the recvbuf with a large normal send. Make sure that tls_rx_msg_size() aborts strp, if we reach an invalid record there's really no way to recover. Reported-by: Lee Jones <lee@kernel.org> Fixes: `84c61fe1a7` ("tls: rx: do not use the standard strparser") Reviewed-by: Sabrina Dubroca <sd@queasysnail.net> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Link: https://patch.msgid.link/20250917002814.1743558-1-kuba@kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2025-09-18 12:43:54 +02:00
Jakub Kicinski	62708b9452	tls: fix handling of zero-length records on the rx_list Each recvmsg() call must process either - only contiguous DATA records (any number of them) - one non-DATA record If the next record has different type than what has already been processed we break out of the main processing loop. If the record has already been decrypted (which may be the case for TLS 1.3 where we don't know type until decryption) we queue the pending record to the rx_list. Next recvmsg() will pick it up from there. Queuing the skb to rx_list after zero-copy decrypt is not possible, since in that case we decrypted directly to the user space buffer, and we don't have an skb to queue (darg.skb points to the ciphertext skb for access to metadata like length). Only data records are allowed zero-copy, and we break the processing loop after each non-data record. So we should never zero-copy and then find out that the record type has changed. The corner case we missed is when the initial record comes from rx_list, and it's zero length. Reported-by: Muhammad Alifa Ramdhan <ramdhan@starlabs.sg> Reported-by: Billy Jheng Bing-Jhong <billy@starlabs.sg> Fixes: `84c61fe1a7` ("tls: rx: do not use the standard strparser") Reviewed-by: Sabrina Dubroca <sd@queasysnail.net> Link: https://patch.msgid.link/20250820021952.143068-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-08-21 07:52:30 -07:00
Jakub Kicinski	6db015fc4b	tls: handle data disappearing from under the TLS ULP TLS expects that it owns the receive queue of the TCP socket. This cannot be guaranteed in case the reader of the TCP socket entered before the TLS ULP was installed, or uses some non-standard read API (eg. zerocopy ones). Replace the WARN_ON() and a buggy early exit (which leaves anchor pointing to a freed skb) with real error handling. Wipe the parsing state and tell the reader to retry. We already reload the anchor every time we (re)acquire the socket lock, so the only condition we need to avoid is an out of bounds read (not having enough bytes in the socket for previously parsed record len). If some data was read from under TLS but there's enough in the queue we'll reload and decrypt what is most likely not a valid TLS record. Leading to some undefined behavior from TLS perspective (corrupting a stream? missing an alert? missing an attack?) but no kernel crash should take place. Reported-by: William Liu <will@willsroot.io> Reported-by: Savino Dicanosa <savy@syst3mfailure.io> Link: https://lore.kernel.org/tFjq_kf7sWIG3A7CrCg_egb8CVsT_gsmHAK0_wxDPJXfIzxFAMxqmLwp3MlU5EHiet0AwwJldaaFdgyHpeIUCS-3m3llsmRzp9xIOBR4lAI=@syst3mfailure.io Fixes: `84c61fe1a7` ("tls: rx: do not use the standard strparser") Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250807232907.600366-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-08-12 18:59:05 -07:00
Jiayuan Chen	178f6a5c8c	bpf, ktls: Fix data corruption when using bpf_msg_pop_data() in ktls When sending plaintext data, we initially calculated the corresponding ciphertext length. However, if we later reduced the plaintext data length via socket policy, we failed to recalculate the ciphertext length. This results in transmitting buffers containing uninitialized data during ciphertext transmission. This causes uninitialized bytes to be appended after a complete "Application Data" packet, leading to errors on the receiving end when parsing TLS record. Fixes: `d3b18ad31f` ("tls: add bpf support to sk_msg handling") Reported-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: John Fastabend <john.fastabend@gmail.com> Acked-by: Jakub Kicinski <kuba@kernel.org> Link: https://lore.kernel.org/bpf/20250609020910.397930-2-jiayuan.chen@linux.dev	2025-06-11 16:59:42 +02:00
Jiayuan Chen	79f0c39ae7	ktls, sockmap: Fix missing uncharge operation When we specify apply_bytes, we divide the msg into multiple segments, each with a length of 'send', and every time we send this part of the data using tcp_bpf_sendmsg_redir(), we use sk_msg_return_zero() to uncharge the memory of the specified 'send' size. However, if the first segment of data fails to send, for example, the peer's buffer is full, we need to release all of the msg. When releasing the msg, we haven't uncharged the memory of the subsequent segments. This modification does not make significant logical changes, but only fills in the missing uncharge places. This issue has existed all along, until it was exposed after we added the apply test in test_sockmap: commit `3448ad23b3` ("selftests/bpf: Add apply_bytes test to test_txmsg_redir_wait_sndmem in test_sockmap") Fixes: `d3b18ad31f` ("tls: add bpf support to sk_msg handling") Reported-by: Cong Wang <xiyou.wangcong@gmail.com> Closes: https://lore.kernel.org/bpf/aAmIi0vlycHtbXeb@pop-os.localdomain/T/#t Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Acked-by: John Fastabend <john.fastabend@gmail.com> Reviewed-by: Cong Wang <xiyou.wangcong@gmail.com> Link: https://lore.kernel.org/r/20250425060015.6968-2-jiayuan.chen@linux.dev	2025-05-09 18:09:59 -07:00
Jiayuan Chen	54a3ecaeee	bpf: fix ktls panic with sockmap [ 2172.936997] ------------[ cut here ]------------ [ 2172.936999] kernel BUG at lib/iov_iter.c:629! ...... [ 2172.944996] PKRU: 55555554 [ 2172.945155] Call Trace: [ 2172.945299] <TASK> [ 2172.945428] ? die+0x36/0x90 [ 2172.945601] ? do_trap+0xdd/0x100 [ 2172.945795] ? iov_iter_revert+0x178/0x180 [ 2172.946031] ? iov_iter_revert+0x178/0x180 [ 2172.946267] ? do_error_trap+0x7d/0x110 [ 2172.946499] ? iov_iter_revert+0x178/0x180 [ 2172.946736] ? exc_invalid_op+0x50/0x70 [ 2172.946961] ? iov_iter_revert+0x178/0x180 [ 2172.947197] ? asm_exc_invalid_op+0x1a/0x20 [ 2172.947446] ? iov_iter_revert+0x178/0x180 [ 2172.947683] ? iov_iter_revert+0x5c/0x180 [ 2172.947913] tls_sw_sendmsg_locked.isra.0+0x794/0x840 [ 2172.948206] tls_sw_sendmsg+0x52/0x80 [ 2172.948420] ? inet_sendmsg+0x1f/0x70 [ 2172.948634] __sys_sendto+0x1cd/0x200 [ 2172.948848] ? find_held_lock+0x2b/0x80 [ 2172.949072] ? syscall_trace_enter+0x140/0x270 [ 2172.949330] ? __lock_release.isra.0+0x5e/0x170 [ 2172.949595] ? find_held_lock+0x2b/0x80 [ 2172.949817] ? syscall_trace_enter+0x140/0x270 [ 2172.950211] ? lockdep_hardirqs_on_prepare+0xda/0x190 [ 2172.950632] ? ktime_get_coarse_real_ts64+0xc2/0xd0 [ 2172.951036] __x64_sys_sendto+0x24/0x30 [ 2172.951382] do_syscall_64+0x90/0x170 ...... After calling bpf_exec_tx_verdict(), the size of msg_pl->sg may increase, e.g., when the BPF program executes bpf_msg_push_data(). If the BPF program sets cork_bytes and sg.size is smaller than cork_bytes, it will return -ENOSPC and attempt to roll back to the non-zero copy logic. However, during rollback, msg->msg_iter is reset, but since msg_pl->sg.size has been increased, subsequent executions will exceed the actual size of msg_iter. ''' iov_iter_revert(&msg->msg_iter, msg_pl->sg.size - orig_size); ''' The changes in this commit are based on the following considerations: 1. When cork_bytes is set, rolling back to non-zero copy logic is pointless and can directly go to zero-copy logic. 2. We can not calculate the correct number of bytes to revert msg_iter. Assume the original data is "abcdefgh" (8 bytes), and after 3 pushes by the BPF program, it becomes 11-byte data: "abc?de?fgh?". Then, we set cork_bytes to 6, which means the first 6 bytes have been processed, and the remaining 5 bytes "?fgh?" will be cached until the length meets the cork_bytes requirement. However, some data in "?fgh?" is not within 'sg->msg_iter' (but in msg_pl instead), especially the data "?" we pushed. So it doesn't seem as simple as just reverting through an offset of msg_iter. 3. For non-TLS sockets in tcp_bpf_sendmsg, when a "cork" situation occurs, the user-space send() doesn't return an error, and the returned length is the same as the input length parameter, even if some data is cached. Additionally, I saw that the current non-zero-copy logic for handling corking is written as: ''' line 1177 else if (ret != -EAGAIN) { if (ret == -ENOSPC) ret = 0; goto send_end; ''' So it's ok to just return 'copied' without error when a "cork" situation occurs. Fixes: `fcb14cb1bd` ("new iov_iter flavour - ITER_UBUF") Fixes: `d3b18ad31f` ("tls: add bpf support to sk_msg handling") Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev> Acked-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/r/20250219052015.274405-2-jiayuan.chen@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-04-09 19:57:13 -07:00
Jakub Kicinski	14ea4cd1b1	Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Cross-merge networking fixes after downstream PR (net-6.13-rc7). Conflicts: `a42d71e322` ("net_sched: sch_cake: Add drop reasons") `737d4d91d3` ("sched: sch_cake: add bounds checks to host bulk flow fairness counts") Adjacent changes: drivers/net/ethernet/meta/fbnic/fbnic.h `3a856ab347` ("eth: fbnic: add IRQ reuse support") `95978931d5` ("eth: fbnic: Revert "eth: fbnic: Add hardware monitoring support via HWMON interface"") Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-01-09 16:11:47 -08:00
Benjamin Coddington	b341ca51d2	tls: Fix tls_sw_sendmsg error handling We've noticed that NFS can hang when using RPC over TLS on an unstable connection, and investigation shows that the RPC layer is stuck in a tight loop attempting to transmit, but forever getting -EBADMSG back from the underlying network. The loop begins when tcp_sendmsg_locked() returns -EPIPE to tls_tx_records(), but that error is converted to -EBADMSG when calling the socket's error reporting handler. Instead of converting errors from tcp_sendmsg_locked(), let's pass them along in this path. The RPC layer handles -EPIPE by reconnecting the transport, which prevents the endless attempts to transmit on a broken connection. Signed-off-by: Benjamin Coddington <bcodding@redhat.com> Fixes: `a42055e8d2` ("net/tls: Add support for async encryption of records for performance") Link: https://patch.msgid.link/9594185559881679d81f071b181a10eb07cd079f.1736004079.git.bcodding@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-01-07 17:00:19 -08:00
Sabrina Dubroca	510128b30f	tls: add counters for rekey This introduces 5 counters to keep track of key updates: Tls{Rx,Tx}Rekey{Ok,Error} and TlsRxRekeyReceived. Suggested-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2024-12-16 12:47:30 +00:00
Sabrina Dubroca	47069594e6	tls: implement rekey for TLS1.3 This adds the possibility to change the key and IV when using TLS1.3. Changing the cipher or TLS version is not supported. Once we have updated the RX key, we can unblock the receive side. If the rekey fails, the context is unmodified and userspace is free to retry the update or close the socket. This change only affects tls_sw, since 1.3 offload isn't supported. Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Acked-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2024-12-16 12:47:29 +00:00
Sabrina Dubroca	0471b1093e	tls: block decryption when a rekey is pending When a TLS handshake record carrying a KeyUpdate message is received, all subsequent records will be encrypted with a new key. We need to stop decrypting incoming records with the old key, and wait until userspace provides a new key. Make a note of this in the RX context just after decrypting that record, and stop recvmsg/splice calls with EKEYEXPIRED until the new key is available. key_update_pending can't be combined with the existing bitfield, because we will read it locklessly in ->poll. Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2024-12-16 12:47:29 +00:00
Sascha Hauer	54001d0f2f	net: tls: wait for async completion on last message When asynchronous encryption is used KTLS sends out the final data at proto->close time. This becomes problematic when the task calling close() receives a signal. In this case it can happen that tcp_sendmsg_locked() called at close time returns -ERESTARTSYS and the final data is not sent. The described situation happens when KTLS is used in conjunction with io_uring, as io_uring uses task_work_add() to add work to the current userspace task. A discussion of the problem along with a reproducer can be found in [1] and [2] Fix this by waiting for the asynchronous encryption to be completed on the final message. With this there is no data left to be sent at close time. [1] https://lore.kernel.org/all/20231010141932.GD3114228@pengutronix.de/ [2] https://lore.kernel.org/all/20240315100159.3898944-1-s.hauer@pengutronix.de/ Signed-off-by: Sascha Hauer <s.hauer@pengutronix.de> Link: https://patch.msgid.link/20240904-ktls-wait-async-v1-1-a62892833110@pengutronix.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-09-06 18:20:55 -07:00
Colin Ian King	f7ac8fbd32	tls: remove redundant assignment to variable decrypted The variable decrypted is being assigned a value that is never read, the control of flow after the assignment is via an return path and decrypted is not referenced in this path. The assignment is redundant and can be removed. Cleans up clang scan warning: net/tls/tls_sw.c:2150:4: warning: Value stored to 'decrypted' is never read [deadcode.DeadStores] Signed-off-by: Colin Ian King <colin.i.king@gmail.com> Link: https://lore.kernel.org/r/20240410144136.289030-1-colin.i.king@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-04-11 20:00:22 -07:00
Sabrina Dubroca	417e91e856	tls: get psock ref after taking rxlock to avoid leak At the start of tls_sw_recvmsg, we take a reference on the psock, and then call tls_rx_reader_lock. If that fails, we return directly without releasing the reference. Instead of adding a new label, just take the reference after locking has succeeded, since we don't need it before. Fixes: `4cbc325ed6` ("tls: rx: allow only one reader at a time") Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://lore.kernel.org/r/fe2ade22d030051ce4c3638704ed58b67d0df643.1711120964.git.sd@queasysnail.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-03-26 20:48:24 -07:00
Sabrina Dubroca	85eef9a41d	tls: adjust recv return with async crypto and failed copy to userspace process_rx_list may not copy as many bytes as we want to the userspace buffer, for example in case we hit an EFAULT during the copy. If this happens, we should only count the bytes that were actually copied, which may be 0. Subtracting async_copy_bytes is correct in both peek and !peek cases, because decrypted == async_copy_bytes + peeked for the peek case: peek is always !ZC, and we can go through either the sync or async path. In the async case, we add chunk to both decrypted and async_copy_bytes. In the sync case, we add chunk to both decrypted and peeked. I missed that in commit `6caaf10442` ("tls: fix peeking with sync+async decryption"). Fixes: `4d42cd6bc2` ("tls: rx: fix return value for async crypto") Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://lore.kernel.org/r/1b5a1eaab3c088a9dd5d9f1059ceecd7afe888d1.1711120964.git.sd@queasysnail.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-03-26 20:48:24 -07:00
Sabrina Dubroca	7608a971fd	tls: recv: process_rx_list shouldn't use an offset with kvec Only MSG_PEEK needs to copy from an offset during the final process_rx_list call, because the bytes we copied at the beginning of tls_sw_recvmsg were left on the rx_list. In the KVEC case, we removed data from the rx_list as we were copying it, so there's no need to use an offset, just like in the normal case. Fixes: `692d7b5d1f` ("tls: Fix recvmsg() to be able to peek across multiple records") Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://lore.kernel.org/r/e5487514f828e0347d2b92ca40002c62b58af73d.1711120964.git.sd@queasysnail.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-03-26 20:48:24 -07:00
Sabrina Dubroca	13114dc554	tls: fix use-after-free on failed backlog decryption When the decrypt request goes to the backlog and crypto_aead_decrypt returns -EBUSY, tls_do_decryption will wait until all async decryptions have completed. If one of them fails, tls_do_decryption will return -EBADMSG and tls_decrypt_sg jumps to the error path, releasing all the pages. But the pages have been passed to the async callback, and have already been released by tls_decrypt_done. The only true async case is when crypto_aead_decrypt returns -EINPROGRESS. With -EBUSY, we already waited so we can tell tls_sw_recvmsg that the data is available for immediate copy, but we need to notify tls_decrypt_sg (via the new ->async_done flag) that the memory has already been released. Fixes: `8590541473` ("net: tls: handle backlogging of crypto requests") Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Link: https://lore.kernel.org/r/4755dd8d9bebdefaa19ce1439b833d6199d4364c.1709132643.git.sd@queasysnail.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-02-29 09:07:16 -08:00
Sabrina Dubroca	41532b785e	tls: separate no-async decryption request handling from async If we're not doing async, the handling is much simpler. There's no reference counting, we just need to wait for the completion to wake us up and return its result. We should preferably also use a separate crypto_wait. I'm not seeing a UAF as I did in the past, I think `aec7961916` ("tls: fix race between async notify and socket close") took care of it. This will make the next fix easier. Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Link: https://lore.kernel.org/r/47bde5f649707610eaef9f0d679519966fc31061.1709132643.git.sd@queasysnail.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-02-29 09:07:16 -08:00
Sabrina Dubroca	6caaf10442	tls: fix peeking with sync+async decryption If we peek from 2 records with a currently empty rx_list, and the first record is decrypted synchronously but the second record is decrypted async, the following happens: 1. decrypt record 1 (sync) 2. copy from record 1 to the userspace's msg 3. queue the decrypted record to rx_list for future read(!PEEK) 4. decrypt record 2 (async) 5. queue record 2 to rx_list 6. call process_rx_list to copy data from the 2nd record We currently pass copied=0 as skip offset to process_rx_list, so we end up copying once again from the first record. We should skip over the data we've already copied. Seen with selftest tls.12_aes_gcm.recv_peek_large_buf_mult_recs Fixes: `692d7b5d1f` ("tls: Fix recvmsg() to be able to peek across multiple records") Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Link: https://lore.kernel.org/r/1b132d2b2b99296bfde54e8a67672d90d6d16e71.1709132643.git.sd@queasysnail.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-02-29 09:07:16 -08:00
Sabrina Dubroca	f7fa16d498	tls: decrement decrypt_pending if no async completion will be called With mixed sync/async decryption, or failures of crypto_aead_decrypt, we increment decrypt_pending but we never do the corresponding decrement since tls_decrypt_done will not be called. In this case, we should decrement decrypt_pending immediately to avoid getting stuck. For example, the prequeue prequeue test gets stuck with mixed modes (one async decrypt + one sync decrypt). Fixes: `94524d8fc9` ("net/tls: Add support for async decryption of tls records") Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Link: https://lore.kernel.org/r/c56d5fc35543891d5319f834f25622360e1bfbec.1709132643.git.sd@queasysnail.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-02-29 09:07:16 -08:00
Sabrina Dubroca	ec823bf3a4	tls: don't skip over different type records from the rx_list If we queue 3 records: - record 1, type DATA - record 2, some other type - record 3, type DATA and do a recv(PEEK), the rx_list will contain the first two records. The next large recv will walk through the rx_list and copy data from record 1, then stop because record 2 is a different type. Since we haven't filled up our buffer, we will process the next available record. It's also DATA, so we can merge it with the current read. We shouldn't do that, since there was a record in between that we ignored. Add a flag to let process_rx_list inform tls_sw_recvmsg that it had more data available. Fixes: `692d7b5d1f` ("tls: Fix recvmsg() to be able to peek across multiple records") Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Link: https://lore.kernel.org/r/f00c0c0afa080c60f016df1471158c1caf983c34.1708007371.git.sd@queasysnail.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-02-21 14:25:51 -08:00
Sabrina Dubroca	fdfbaec592	tls: stop recv() if initial process_rx_list gave us non-DATA If we have a non-DATA record on the rx_list and another record of the same type still on the queue, we will end up merging them: - process_rx_list copies the non-DATA record - we start the loop and process the first available record since it's of the same type - we break out of the loop since the record was not DATA Just check the record type and jump to the end in case process_rx_list did some work. Fixes: `692d7b5d1f` ("tls: Fix recvmsg() to be able to peek across multiple records") Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Link: https://lore.kernel.org/r/bd31449e43bd4b6ff546f5c51cf958c31c511deb.1708007371.git.sd@queasysnail.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-02-21 14:25:51 -08:00
Sabrina Dubroca	10f41d0710	tls: break out of main loop when PEEK gets a non-data record PEEK needs to leave decrypted records on the rx_list so that we can receive them later on, so it jumps back into the async code that queues the skb. Unfortunately that makes us skip the TLS_RECORD_TYPE_DATA check at the bottom of the main loop, so if two records of the same (non-DATA) type are queued, we end up merging them. Add the same record type check, and make it unlikely to not penalize the async fastpath. Async decrypt only applies to data record, so this check is only needed for PEEK. process_rx_list also has similar issues. Fixes: `692d7b5d1f` ("tls: Fix recvmsg() to be able to peek across multiple records") Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Link: https://lore.kernel.org/r/3df2eef4fdae720c55e69472b5bea668772b45a2.1708007371.git.sd@queasysnail.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-02-21 14:25:51 -08:00
Jakub Kicinski	ac437a51ce	net: tls: fix returned read length with async decrypt We double count async, non-zc rx data. The previous fix was lucky because if we fully zc async_copy_bytes is 0 so we add 0. Decrypted already has all the bytes we handled, in all cases. We don't have to adjust anything, delete the erroneous line. Fixes: `4d42cd6bc2` ("tls: rx: fix return value for async crypto") Co-developed-by: Sabrina Dubroca <sd@queasysnail.net> Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2024-02-10 21:38:19 +00:00
Sabrina Dubroca	32b55c5ff9	net: tls: fix use-after-free with partial reads and async decrypt tls_decrypt_sg doesn't take a reference on the pages from clear_skb, so the put_page() in tls_decrypt_done releases them, and we trigger a use-after-free in process_rx_list when we try to read from the partially-read skb. Fixes: `fd31f3996a` ("tls: rx: decrypt into a fresh skb") Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2024-02-10 21:38:19 +00:00
Jakub Kicinski	8590541473	net: tls: handle backlogging of crypto requests Since we're setting the CRYPTO_TFM_REQ_MAY_BACKLOG flag on our requests to the crypto API, crypto_aead_{encrypt,decrypt} can return -EBUSY instead of -EINPROGRESS in valid situations. For example, when the cryptd queue for AESNI is full (easy to trigger with an artificially low cryptd.cryptd_max_cpu_qlen), requests will be enqueued to the backlog but still processed. In that case, the async callback will also be called twice: first with err == -EINPROGRESS, which it seems we can just ignore, then with err == 0. Compared to Sabrina's original patch this version uses the new tls_*crypt_async_wait() helpers and converts the EBUSY to EINPROGRESS to avoid having to modify all the error handling paths. The handling is identical. Fixes: `a54667f672` ("tls: Add support for encryption using async offload accelerator") Fixes: `94524d8fc9` ("net/tls: Add support for async decryption of tls records") Co-developed-by: Sabrina Dubroca <sd@queasysnail.net> Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Link: https://lore.kernel.org/netdev/9681d1febfec295449a62300938ed2ae66983f28.1694018970.git.sd@queasysnail.net/ Signed-off-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2024-02-10 21:38:19 +00:00
Jakub Kicinski	e01e3934a1	tls: fix race between tx work scheduling and socket close Similarly to previous commit, the submitting thread (recvmsg/sendmsg) may exit as soon as the async crypto handler calls complete(). Reorder scheduling the work before calling complete(). This seems more logical in the first place, as it's the inverse order of what the submitting thread will do. Reported-by: valis <sec@valis.email> Fixes: `a42055e8d2` ("net/tls: Add support for async encryption of records for performance") Signed-off-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Simon Horman <horms@kernel.org> Reviewed-by: Sabrina Dubroca <sd@queasysnail.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2024-02-10 21:38:19 +00:00
Jakub Kicinski	aec7961916	tls: fix race between async notify and socket close The submitting thread (one which called recvmsg/sendmsg) may exit as soon as the async crypto handler calls complete() so any code past that point risks touching already freed data. Try to avoid the locking and extra flags altogether. Have the main thread hold an extra reference, this way we can depend solely on the atomic ref counter for synchronization. Don't futz with reiniting the completion, either, we are now tightly controlling when completion fires. Reported-by: valis <sec@valis.email> Fixes: `0cada33241` ("net/tls: fix race condition causing kernel panic") Signed-off-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Simon Horman <horms@kernel.org> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Sabrina Dubroca <sd@queasysnail.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2024-02-10 21:38:19 +00:00
Jakub Kicinski	c57ca512f3	net: tls: factor out tls_*crypt_async_wait() Factor out waiting for async encrypt and decrypt to finish. There are already multiple copies and a subsequent fix will need more. No functional changes. Note that crypto_wait_req() returns wait->err Signed-off-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Simon Horman <horms@kernel.org> Reviewed-by: Sabrina Dubroca <sd@queasysnail.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2024-02-10 21:38:19 +00:00
John Fastabend	dc9dfc8dc6	net: tls, fix WARNIING in __sk_msg_free A splice with MSG_SPLICE_PAGES will cause tls code to use the tls_sw_sendmsg_splice path in the TLS sendmsg code to move the user provided pages from the msg into the msg_pl. This will loop over the msg until msg_pl is full, checked by sk_msg_full(msg_pl). The user can also set the MORE flag to hint stack to delay sending until receiving more pages and ideally a full buffer. If the user adds more pages to the msg than can fit in the msg_pl scatterlist (MAX_MSG_FRAGS) we should ignore the MORE flag and send the buffer anyways. What actually happens though is we abort the msg to msg_pl scatterlist setup and then because we forget to set 'full record' indicating we can no longer consume data without a send we fallthrough to the 'continue' path which will check if msg_data_left(msg) has more bytes to send and then attempts to fit them in the already full msg_pl. Then next iteration of sender doing send will encounter a full msg_pl and throw the warning in the syzbot report. To fix simply check if we have a full_record in splice code path and if not send the msg regardless of MORE flag. Reported-and-tested-by: syzbot+f2977222e0e95cec15c8@syzkaller.appspotmail.com Reported-by: Edward Adam Davis <eadavis@qq.com> Fixes: `fe1e81d4f7` ("tls/sw: Support MSG_SPLICE_PAGES") Reviewed-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2024-01-14 12:17:14 +00:00
John Fastabend	c5a595000e	net: tls, update curr on splice as well The curr pointer must also be updated on the splice similar to how we do this for other copy types. Fixes: `d829e9c411` ("tls: convert to generic sk_msg interface") Signed-off-by: John Fastabend <john.fastabend@gmail.com> Reported-by: Jann Horn <jannh@google.com> Link: https://lore.kernel.org/r/20231206232706.374377-2-john.fastabend@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-12-07 09:52:28 -08:00

1 2 3 4 5 ...

332 Commits