linux

mirror of https://github.com/torvalds/linux.git synced 2026-05-14 01:08:22 +02:00

Author	SHA1	Message	Date
Linus Torvalds	b85900e91c	NFS client updates for Linux 7.1 Highlights include: Bugfixes: - NFS: Fix handling of ENOSPC so that if we have to resend writes, they are written synchronously. - SUNRPC: RDMA transport fixes from Chuck - NFSv4.2: Several fixes for delegated timestamps - NFSv4: Failure to obtain a directory delegation should not cause stat() to fail. - NFSv4: Rename was failing to update timestamps when a directory delegation is held. - NFSv4: Ensure we check rsize/wsize after crossing a NFSv4 filesystem boundary. - NFSv4/pnfs: If the server is down, retry the layout returns on reboot - NFSv4/pnfs: Fallback to MDS could result in a short write being incorrectly logged. Cleanups: - NFS: use memcpy_and_pad in decode_fh Merge tag 'nfs-for-7.1-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs Pull NFS client updates from Trond Myklebust: "Bugfixes: - Fix handling of ENOSPC so that if we have to resend writes, they are written synchronously - SUNRPC RDMA transport fixes from Chuck - Several fixes for delegated timestamps in NFSv4.2 - Failure to obtain a directory delegation should not cause stat() to fail with NFSv4 - Rename was failing to update timestamps when a directory delegation is held on NFSv4 - Ensure we check rsize/wsize after crossing a NFSv4 filesystem boundary - NFSv4/pnfs: - If the server is down, retry the layout returns on reboot - Fallback to MDS could result in a short write being incorrectly logged Cleanups: - Use memcpy_and_pad in decode_fh" * tag 'nfs-for-7.1-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (21 commits) NFS: Fix RCU dereference of cl_xprt in nfs_compare_super_address NFS: remove redundant __private attribute from nfs_page_class NFSv4.2: fix CLONE/COPY attrs in presence of delegated attributes NFS: fix writeback in presence of errors nfs: use memcpy_and_pad in decode_fh NFSv4.1: Apply session size limits on clone path NFSv4: retry GETATTR if GET_DIR_DELEGATION failed NFS: fix RENAME attr in presence of directory delegations pnfs/flexfiles: validate ds_versions_cnt is non-zero NFS/blocklayout: print each device used for SCSI layouts xprtrdma: Post receive buffers after RPC completion xprtrdma: Scale receive batch size with credit window xprtrdma: Replace rpcrdma_mr_seg with xdr_buf cursor xprtrdma: Decouple frwr_wp_create from frwr_map xprtrdma: Close lost-wakeup race in xprt_rdma_alloc_slot xprtrdma: Avoid 250 ms delay on backlog wakeup xprtrdma: Close sendctx get/put race that can block a transport nfs: update inode ctime after removexattr operation nfs: fix utimensat() for atime with delegated timestamps NFS: improve "Server wrote zero bytes" error ...	2026-04-24 14:20:03 -07:00
Linus Torvalds	36d179fd6b	NFSD 7.1 Release Notes Benjamin Coddington contributed filehandle signing to defend against filehandle-guessing attacks. The server now appends a SipHash-2-4 MAC to each filehandle when the new "sign_fh" export option is enabled. NFSD then verifies filehandles received from clients against the expected MAC; mismatches return NFS error STALE. Chuck Lever converted the entire NLMv4 server-side XDR layer from hand-written C to xdrgen-generated code, spanning roughly thirty patches. XDR functions are generally boilerplate code and are easy to get wrong. The goals of this conversion are improved memory safety, lower maintenance burden, and groundwork for eventual Rust code generation for these functions. Dai Ngo improved pNFS block/SCSI layout robustness with two related changes. SCSI persistent reservation fencing is now tracked per client and per device via an xarray, to avoid both redundant preempt operations on devices already fenced and a potential NFSD deadlock when all nfsd threads are waiting for a layout return. The remaining patches deliver scalability and infrastructure improvements. Sincere thanks to all contributors, reviewers, testers, and bug reporters who participated in the v7.1 NFSD development cycle. Merge tag 'nfsd-7.1' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux Pull nfsd updates from Chuck Lever: - filehandle signing to defend against filehandle-guessing attacks (Benjamin Coddington) The server now appends a SipHash-2-4 MAC to each filehandle when the new "sign_fh" export option is enabled. NFSD then verifies filehandles received from clients against the expected MAC; mismatches return NFS error STALE - convert the entire NLMv4 server-side XDR layer from hand-written C to xdrgen-generated code, spanning roughly thirty patches (Chuck Lever) XDR functions are generally boilerplate code and are easy to get wrong. The goals of this conversion are improved memory safety, lower maintenance burden, and groundwork for eventual Rust code generation for these functions. - improve pNFS block/SCSI layout robustness with two related changes (Dai Ngo) SCSI persistent reservation fencing is now tracked per client and per device via an xarray, to avoid both redundant preempt operations on devices already fenced and a potential NFSD deadlock when all nfsd threads are waiting for a layout return. - scalability and infrastructure improvements Sincere thanks to all contributors, reviewers, testers, and bug reporters who participated in the v7.1 NFSD development cycle. * tag 'nfsd-7.1' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux: (83 commits) NFSD: Docs: clean up pnfs server timeout docs nfsd: fix comment typo in nfsxdr nfsd: fix comment typo in nfs3xdr NFSD: convert callback RPC program to per-net namespace NFSD: use per-operation statidx for callback procedures svcrdma: Use contiguous pages for RDMA Read sink buffers SUNRPC: Add svc_rqst_page_release() helper SUNRPC: xdr.h: fix all kernel-doc warnings svcrdma: Factor out WR chain linking into helper svcrdma: Add Write chunk WRs to the RPC's Send WR chain svcrdma: Clean up use of rdma->sc_pd->device svcrdma: Clean up use of rdma->sc_pd->device in Receive paths svcrdma: Add fair queuing for Send Queue access SUNRPC: Optimize rq_respages allocation in svc_alloc_arg SUNRPC: Track consumed rq_pages entries svcrdma: preserve rq_next_page in svc_rdma_save_io_pages SUNRPC: Handle NULL entries in svc_rqst_release_pages SUNRPC: Allocate a separate Reply page array SUNRPC: Tighten bounds checking in svc_rqst_replace_page NFSD: Sign filehandles ...	2026-04-20 10:44:02 -07:00
Chuck Lever	704f3f640f	xprtrdma: Post receive buffers after RPC completion rpcrdma_post_recvs() runs in CQ poll context and its cost falls on the latency-critical path between polling a Receive completion and waking the RPC consumer. Every cycle spent refilling the Receive Queue delays delivery of the reply to the NFS layer. Move the rpcrdma_post_recvs() call in rpcrdma_reply_handler() to after the RPC has been decoded and completed. The larger batch size from the preceding patch provides sufficient Receive Queue headroom to absorb the brief delay before buffers are replenished. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>	2026-04-13 12:05:01 -07:00
Chuck Lever	93b4791adb	xprtrdma: Scale receive batch size with credit window The fixed RPCRDMA_MAX_RECV_BATCH of 7 results in frequent small ib_post_recv batches during high-rate workloads. With a 128-slot credit window, receives are reposted every 7th completion, each batch incurring atomic serialization and a doorbell write. Replace the fixed batch constant with a per-endpoint value scaled to 25% of the negotiated credit window. For a typical 128-credit connection this raises the batch from 7 to 32, reducing doorbell frequency by roughly 4x and amortizing the per-batch atomic and MMIO costs over a larger group of receive WRs. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>	2026-04-13 12:05:00 -07:00
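The arithmetic in the commit above (25% of a 128-credit window gives a batch of 32) can be sketched in plain C. This is only an illustration under stated assumptions: `recv_batch_size` is a hypothetical name, and the floor of one buffer is an assumption, not something the commit message states.

```c
/* Hypothetical helper, not the kernel's function: derive a receive
 * batch size as 25% of the negotiated credit window. */
static unsigned int recv_batch_size(unsigned int credits)
{
    unsigned int batch = credits / 4;   /* 25% of the credit window */

    /* Assumption: never let the batch drop to zero for tiny windows. */
    return batch ? batch : 1;
}
```

With the commit's example of a 128-credit connection, `recv_batch_size(128)` yields 32, compared with the previous fixed batch of 7.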
Chuck Lever	7a079ab57c	xprtrdma: Replace rpcrdma_mr_seg with xdr_buf cursor The FRWR registration path converts data through three representations: xdr_buf -> rpcrdma_mr_seg[] -> scatterlist[] -> ib_map_mr_sg(). The rpcrdma_mr_seg intermediate is a relic of when multiple registration strategies existed (FMR, physical, FRWR). Only FRWR remains, so this indirection and the 6240-byte rl_segments[260] array embedded in each rpcrdma_req serve no purpose. Introduce struct rpcrdma_xdr_cursor to track position within an xdr_buf during iterative MR registration. Rewrite frwr_map to populate scatterlist entries directly from the xdr_buf regions (head kvec, page list, tail kvec). The boundary logic for non-SG_GAPS devices is simpler because the xdr_buf structure guarantees that page-region entries after the first start at offset 0, and that head/tail kvecs are separate regions that naturally break at MR boundaries. Fix a pre-existing bug in rpcrdma_encode_write_list where the write-pad statistics accumulator added mr->mr_length from the last data MR rather than the write-pad MR. The refactored code uses ep->re_write_pad_mr->mr_length. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>	2026-04-13 12:04:58 -07:00
Chuck Lever	6f2e565fb3	xprtrdma: Decouple frwr_wp_create from frwr_map frwr_wp_create is the only caller of frwr_map outside the encode path. It registers a single 4-byte write-pad region from a stack-local rpcrdma_mr_seg. Inlining the registration logic directly (sg_init_table + sg_set_page + ib_dma_map_sg + ib_map_mr_sg + IOVA mangle + reg_wr setup) eliminates the coupling that would otherwise complicate the removal of rpcrdma_mr_seg from frwr_map's interface. The inlined version adds a proper error-unwind ladder: on failure, the DMA mapping (if established) is released, ep->re_write_pad_mr is cleared, and the MR is returned to the transport free list. The old frwr_map-based code relied on rpcrdma_mrs_destroy at teardown to reclaim partially-initialized MRs. This is a one-time setup path; duplicating ~20 lines is a reasonable tradeoff for decoupling the write-pad registration from the data-path MR registration. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>	2026-04-13 11:56:27 -07:00
Chuck Lever	765bde47fe	xprtrdma: Close lost-wakeup race in xprt_rdma_alloc_slot xprt_rdma_alloc_slot() and xprt_rdma_free_slot() lack serialization between the buffer pool and the backlog queue. A buffer freed after rpcrdma_buffer_get() finds the pool empty but before rpc_sleep_on() places the task on the backlog is returned to the pool with no waiter to wake, leaving the task stuck on the backlog indefinitely. After joining the backlog, re-check the pool and route any recovered buffer through xprt_wake_up_backlog(), whose queue lock serializes with concurrent wakeups and avoids double-assignment of slots. Because xprt_rdma_free_slot() does not hold reserve_lock, the XPRT_CONGESTED double-check in xprt_throttle_congested() is ineffective: a task can join the backlog through that path after free_slot has already found it empty and cleared the bit. Avoid this by using xprt_add_backlog_noncongested(), which queues the task without setting XPRT_CONGESTED, so every allocation reaches xprt_rdma_alloc_slot() and its post-sleep re-check. Fixes: `edb41e61a5` ("xprtrdma: Make rpc_rqst part of rpcrdma_req") Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>	2026-04-13 11:56:20 -07:00
Chuck Lever	100142093e	xprtrdma: Avoid 250 ms delay on backlog wakeup Commit `a721035477` ("SUNRPC/xprt: async tasks mustn't block waiting for memory") changed xprt_rdma_alloc_slot() to set tk_status to -ENOMEM so that call_reserveresult() would sleep HZ/4 before retrying. That rationale applies to xprt_dynamic_alloc_slot(), where an immediate retry under memory pressure wastes CPU, but not to the RDMA backlog path: a task woken from the backlog has a slot waiting for it, so the 250 ms rpc_delay adds latency without benefit. This also aligns the code with the existing kernel-doc for xprt_rdma_alloc_slot(), which already documented %-EAGAIN. Fixes: `a721035477` ("SUNRPC/xprt: async tasks mustn't block waiting for memory") Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>	2026-04-13 11:52:54 -07:00
Chuck Lever	24297c7cd3	xprtrdma: Close sendctx get/put race that can block a transport rpcrdma_sendctx_get_locked() and rpcrdma_sendctx_put_locked() can race in a way that leaves XPRT_WRITE_SPACE set permanently, blocking all further sends on the transport: (1) get_locked: read rb_sc_tail -> ring full; (2) put_locked (Send completion): advance rb_sc_tail; (3) put_locked (Send completion): xprt_write_space(): test_bit(WRITE_SPACE) -> not set, return; (4) get_locked: set_bit(WRITE_SPACE); (5) get_locked: return NULL (-EAGAIN). After the sender releases XPRT_LOCKED, the release path refuses to wake the next task because XPRT_WRITE_SPACE is set. The sender retries, finds XPRT_WRITE_SPACE still set, and sleeps on xprt_sending. No further Send completions arrive to clear the flag because no new Sends can be posted. With nconnect, the stalled transport's share of congestion credits are never returned, starving the remaining transports as well. Fixes: `05eb06d866` ("xprtrdma: Fix occasional transport deadlock") Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>	2026-04-13 11:52:49 -07:00
Christian Brauner	e3b2cf6e5d	kernfs: pass struct ns_common instead of const void * for namespace tags kernfs has historically used const void * to pass around namespace tags used for directory-level namespace filtering. The only current user of this is sysfs network namespace tagging where struct net pointers are cast to void *. Replace all const void * namespace parameters with const struct ns_common * throughout the kernfs, sysfs, and kobject namespace layers. This includes the kobj_ns_type_operations callbacks, kobject_namespace(), and all sysfs/kernfs APIs that accept or return namespace tags. Passing struct ns_common is needed because various codepaths require access to the underlying namespace. A struct ns_common can always be converted back to the concrete namespace type (e.g., struct net) via container_of() or to_ns_common() in the reverse direction. This is a preparatory change for switching to ns_id-based directory iteration to prevent a KASLR pointer leak through the current use of raw namespace pointers as hash seeds and comparison keys. Signed-off-by: Christian Brauner <brauner@kernel.org>	2026-04-09 14:36:52 +02:00
Chuck Lever	18755b8c2f	svcrdma: Use contiguous pages for RDMA Read sink buffers svc_rdma_build_read_segment() constructs RDMA Read sink buffers by consuming pages one-at-a-time from rq_pages[] and building one bvec per page. A 64KB NFS READ payload produces 16 separate bvecs, 16 DMA mappings, and potentially multiple RDMA Read WRs (on platforms with 4KB pages). A single higher-order allocation followed by split_page() yields physically contiguous memory while preserving per-page refcounts. A single bvec spanning the contiguous range causes rdma_rw_ctx_init_bvec() to take the rdma_rw_init_single_wr_bvec() fast path: one DMA mapping, one SGE, one WR. The split sub-pages replace the original rq_pages[] entries, so all downstream page tracking, completion handling, and xdr_buf assembly remain unchanged. Allocation uses __GFP_NORETRY | __GFP_NOWARN and falls back through decreasing orders. If even order-1 fails, the existing per-page path handles the segment. When nr_pages is not a power of two, get_order() rounds up and the allocation yields more pages than needed. The extra split pages replace existing rq_pages[] entries (freed via put_page() first), so there is no net increase in per-request page consumption. Successive segments reuse the same padding slots, preventing accumulation. The rq_maxpages guard rejects any allocation that would overrun the array, falling back to the per-page path. Under memory pressure, __GFP_NORETRY causes the higher-order allocation to fail without stalling. The contiguous path is attempted when the segment starts page-aligned (rc_pageoff == 0) and spans at least two pages. NFS WRITE segments carry application-modified byte ranges of arbitrary length, so the optimization is not restricted to power-of-two page counts. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>	2026-04-03 09:26:31 -04:00
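The "fall back through decreasing orders" policy described above can be sketched as a small user-space model. Everything named here is hypothetical: `order_for` only mimics the round-up behaviour the message attributes to get_order(), `pick_order` stands in for the allocation loop, and `only_small_orders` is a stand-in allocator that pretends only blocks of up to four pages are available. It does not show how the kernel handles the remainder of a segment after a smaller order succeeds.

```c
#include <stdbool.h>

/* Smallest order such that (1 << order) pages cover nr_pages. */
static unsigned int order_for(unsigned int nr_pages)
{
    unsigned int order = 0;

    while ((1u << order) < nr_pages)
        order++;
    return order;
}

/* Try the covering order first, then each smaller order down to 1.
 * Returns the order that succeeded, or -1 to signal that the caller
 * should use the per-page path instead. */
static int pick_order(unsigned int nr_pages,
                      bool (*try_alloc)(unsigned int order))
{
    for (unsigned int order = order_for(nr_pages); order >= 1; order--)
        if (try_alloc(order))
            return (int)order;
    return -1;
}

/* Hypothetical allocator stand-in: only orders 0..2 succeed. */
static bool only_small_orders(unsigned int order)
{
    return order <= 2;
}
```

For the 16-page (64KB with 4KB pages) case from the message, `order_for(16)` is 4; with the stand-in allocator, `pick_order(16, only_small_orders)` settles on order 2, and a single-page request returns -1 because the contiguous path needs at least two pages.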
Chuck Lever	4e2866b2ba	SUNRPC: Add svc_rqst_page_release() helper svc_rqst_replace_page() releases displaced pages through a per-rqst folio batch, but exposes the add-or-flush sequence directly. svc_tcp_restore_pages() releases displaced pages individually with put_page(). Introduce svc_rqst_page_release() to encapsulate the batched release mechanism. Convert svc_rqst_replace_page() and svc_tcp_restore_pages() to use it. The latter now benefits from the same batched release that svc_rqst_replace_page() already uses. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>	2026-04-03 09:26:17 -04:00
Chuck Lever	2239535fb0	svcrdma: Factor out WR chain linking into helper svc_rdma_prepare_write_chunk() and svc_rdma_prepare_reply_chunk() contain identical code for linking RDMA R/W work requests onto a Send context's WR chain. This duplication increases maintenance burden and risks divergent bug fixes. Introduce svc_rdma_cc_link_wrs() to consolidate the WR chain linking logic. The helper walks the chunk context's rwctxts list, chains each WR via rdma_rw_ctx_wrs(), and updates the Send context's chain head and SQE count. Completion signaling is requested only for the tail WR (posted first). No functional change. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>	2026-03-29 21:25:09 -04:00
Chuck Lever	d16f060f3e	svcrdma: Add Write chunk WRs to the RPC's Send WR chain Previously, Write chunk RDMA Writes were posted via a separate ib_post_send() call with their own completion handler. Each Write chunk incurred a doorbell and generated a completion event. Link Write chunk WRs onto the RPC Reply's Send WR chain so that a single ib_post_send() call posts both the RDMA Writes and the Send WR. A single completion event signals that all operations have finished. This reduces both doorbell rate and completion rate, as well as eliminating the latency of a round-trip between the Write chunk completion and the subsequent Send WR posting. The lifecycle of Write chunk resources changes: previously, the svc_rdma_write_done() completion handler released Write chunk resources when RDMA Writes completed. With WR chaining, resources remain live until the Send completion. A new sc_write_info_list tracks Write chunk metadata attached to each Send context, and svc_rdma_write_chunk_release() frees these resources when the Send context is released. The svc_rdma_write_done() handler now handles only error cases. On success it returns immediately since the Send completion handles resource release. On failure (WR flush), it closes the connection to signal to the client that the RPC Reply is incomplete. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>	2026-03-29 21:25:09 -04:00
Chuck Lever	c553983efa	svcrdma: Clean up use of rdma->sc_pd->device I can't think of a reason why svcrdma is using the PD's device. Most other consumers of the IB DMA API use the ib_device pointer from the connection's rdma_cm_id. I don't think there's any functional difference between the two, but it is a little confusing to see some uses of rdma_cm_id and some of ib_pd. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>	2026-03-29 21:25:09 -04:00
Chuck Lever	a5f2087f37	svcrdma: Clean up use of rdma->sc_pd->device in Receive paths I can't think of a reason why svcrdma is using the PD's device. Most other consumers of the IB DMA API use the ib_device pointer from the connection's rdma_cm_id. I don't believe there's any functional difference between the two, but it is a little confusing to see some uses of rdma_cm_id->device and some of ib_pd->device. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>	2026-03-29 21:25:09 -04:00
Chuck Lever	ccc89b9d1e	svcrdma: Add fair queuing for Send Queue access When the Send Queue fills, multiple threads may wait for SQ slots. The previous implementation had no ordering guarantee, allowing starvation when one thread repeatedly acquires slots while others wait indefinitely. Introduce a ticket-based fair queuing system. Each waiter takes a ticket number and is served in FIFO order. This ensures forward progress for all waiters when SQ capacity is constrained. The implementation has two phases: 1. Fast path: attempt to reserve SQ slots without waiting 2. Slow path: take a ticket, wait for turn, then wait for slots The ticket system adds two atomic counters to the transport: - sc_sq_ticket_head: next ticket to issue - sc_sq_ticket_tail: ticket currently being served A dedicated wait queue (sc_sq_ticket_wait) handles ticket ordering, separate from sc_send_wait which handles SQ capacity. This separation ensures that send completions (the high-frequency wake source) wake only the current ticket holder rather than all queued waiters. Ticket handoff wakes only the ticket wait queue, and each ticket holder that exits via connection close propagates the wake to the next waiter in line. When a waiter successfully reserves slots, it advances the tail counter and wakes the next waiter. This creates an orderly handoff that prevents starvation while maintaining good throughput on the fast path when contention is low. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>	2026-03-29 21:25:09 -04:00
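The head/tail ticket scheme described above can be sketched with C11 atomics. This is a minimal model, not the svcrdma code: `struct sq_tickets` and the `ticket_*` helpers are hypothetical names, and the wait queues, the fast path, and connection-close handling are left out.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical model of the two counters named in the commit message. */
struct sq_tickets {
    atomic_uint head;   /* next ticket to issue */
    atomic_uint tail;   /* ticket currently being served */
};

static void ticket_init(struct sq_tickets *t)
{
    atomic_init(&t->head, 0);
    atomic_init(&t->tail, 0);
}

/* Slow path entry: each waiter takes a unique, increasing ticket. */
static unsigned int ticket_take(struct sq_tickets *t)
{
    return atomic_fetch_add(&t->head, 1);
}

/* A waiter may proceed only when its ticket is the one being served. */
static bool ticket_is_turn(struct sq_tickets *t, unsigned int ticket)
{
    return atomic_load(&t->tail) == ticket;
}

/* After reserving its slots, the holder hands off to the next ticket. */
static void ticket_done(struct sq_tickets *t)
{
    atomic_fetch_add(&t->tail, 1);
}
```

Because tickets are issued in increasing order and served strictly by `tail`, a later waiter cannot overtake an earlier one, which is the FIFO property the commit relies on to prevent starvation.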
Chuck Lever	d7f3efd9ff	SUNRPC: Optimize rq_respages allocation in svc_alloc_arg svc_alloc_arg() invokes alloc_pages_bulk() with the full rq_maxpages count (~259 for 1MB messages) for the rq_respages array, causing a full-array scan despite most slots holding valid pages. svc_rqst_release_pages() NULLs only the range [rq_respages, rq_next_page) after each RPC, so only that range contains NULL entries. Limit the rq_respages fill in svc_alloc_arg() to that range instead of scanning the full array. svc_init_buffer() initializes rq_next_page to span the entire rq_respages array, so the first svc_alloc_arg() call fills all slots. Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>	2026-03-29 21:25:09 -04:00
Chuck Lever	7ed7504287	SUNRPC: Track consumed rq_pages entries The rq_pages array holds pages allocated for incoming RPC requests. Two transport receive paths NULL entries in rq_pages to prevent svc_rqst_release_pages() from freeing pages that the transport has taken ownership of: - svc_tcp_save_pages() moves partial request data pages to svsk->sk_pages during multi-fragment TCP reassembly. - svc_rdma_clear_rqst_pages() moves request data pages to head->rc_pages because they are targets of active RDMA Read WRs. A new rq_pages_nfree field in struct svc_rqst records how many entries were NULLed. svc_alloc_arg() uses it to refill only those entries rather than scanning the full rq_pages array. In steady state, the transport NULLs a handful of entries per RPC, so the allocator visits only those entries instead of the full ~259 slots (for 1MB messages). Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>	2026-03-29 21:25:09 -04:00
Chuck Lever	26c8e6eb75	svcrdma: preserve rq_next_page in svc_rdma_save_io_pages svc_rdma_save_io_pages() transfers response pages to the send context and sets those slots to NULL. It then resets rq_next_page to equal rq_respages, hiding the NULL region from svc_rqst_release_pages(). Now that svc_rqst_release_pages() handles NULL entries, this reset is no longer necessary. Removing it preserves the invariant that the range [rq_respages, rq_next_page) accurately describes how many response pages were consumed, enabling a subsequent optimization in svc_alloc_arg() that refills only the consumed range. Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>	2026-03-29 21:25:09 -04:00
Chuck Lever	22cc2ba5c2	SUNRPC: Handle NULL entries in svc_rqst_release_pages svc_rqst_release_pages() releases response pages between rq_respages and rq_next_page. It currently passes the entire range to release_pages(), which does not expect NULL entries. A subsequent patch preserves the rq_next_page pointer in svc_rdma_save_io_pages() so that it accurately records how many response pages were consumed. After that change, the range [rq_respages, rq_next_page) can contain NULL entries where pages have already been transferred to a send context. Iterate through the range entry by entry, skipping NULLs, to handle this case correctly. Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>	2026-03-29 21:25:09 -04:00
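The entry-by-entry walk that skips NULLs can be illustrated with a user-space analogue. This sketch assumes heap buffers in place of struct page and uses `free()` where the kernel would drop a page reference; `release_range` is a hypothetical name.

```c
#include <stddef.h>
#include <stdlib.h>

/* Release every non-NULL entry in [begin, end) and clear its slot.
 * NULL slots model pages already handed to another owner. Returns
 * the number of entries actually released. */
static size_t release_range(void **begin, void **end)
{
    size_t released = 0;

    for (void **p = begin; p < end; p++) {
        if (!*p)
            continue;       /* ownership was transferred; skip */
        free(*p);
        *p = NULL;
        released++;
    }
    return released;
}
```

Passing a range with holes, such as `{ buf, NULL, buf, NULL }`, releases only the two live entries and leaves the whole range NULL afterwards.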
Chuck Lever	ee66b9e3e1	SUNRPC: Allocate a separate Reply page array struct svc_rqst uses a single dynamically-allocated page array (rq_pages) for both the incoming RPC Call message and the outgoing RPC Reply message. rq_respages is a sliding pointer into rq_pages that each transport receive path must compute based on how many pages the Call consumed. This boundary tracking is a source of confusion and bugs, and prevents an RPC transaction from having both a large Call and a large Reply simultaneously. Allocate rq_respages as its own page array, eliminating the boundary arithmetic. This decouples Call and Reply buffer lifetimes, following the precedent set by rq_bvec (a separate dynamically-allocated array for I/O vectors). Each svc_rqst now pins twice as many pages as before. For a server running 16 threads with a 1MB maximum payload, the additional cost is roughly 16MB of pinned memory. The new dynamic svc thread count facility keeps this overhead minimal on an idle server. A subsequent patch in this series limits per-request repopulation to only the pages released during the previous RPC, avoiding a full-array scan on each call to svc_alloc_arg(). Note: We've considered several alternatives to maintaining a full second array. Each alternative reintroduces either boundary logic complexity or I/O-path allocation pressure. rq_next_page is initialized in svc_alloc_arg() and svc_process() during Reply construction, and in svc_rdma_recvfrom() as a precaution on error paths. Transport receive paths no longer compute it from the Call size. Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>	2026-03-29 21:25:09 -04:00
Chuck Lever	46ca8dd244	SUNRPC: Tighten bounds checking in svc_rqst_replace_page svc_rqst_replace_page() builds the Reply buffer by advancing rq_next_page through the response page range. The bounds check validates rq_next_page against the full rq_pages array, but the valid range for rq_next_page is [rq_respages, rq_page_end]. Use those bounds instead. This is correct today because rq_respages and rq_page_end both point into rq_pages, and it prepares for a subsequent change that separates the Reply page array from rq_pages. Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>	2026-03-29 21:25:09 -04:00
Jeff Layton	facc4e3c80	sunrpc: split cache_detail queue into request and reader lists Replace the single interleaved queue (which mixed cache_request and cache_reader entries distinguished by a ->reader flag) with two dedicated lists: cd->requests for upcall requests and cd->readers for open file handles. Readers now track their position via a monotonically increasing sequence number (next_seqno) rather than by their position in the shared list. Each cache_request is assigned a seqno when enqueued, and a new cache_next_request() helper finds the next request at or after a given seqno. This eliminates the cache_queue wrapper struct entirely, simplifies the reader-skipping loops in cache_read/cache_poll/cache_ioctl/cache_release, and makes the data flow easier to reason about. Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>	2026-03-29 21:25:09 -04:00
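The lookup "find the next request at or after a given seqno" can be sketched as follows. This is an illustrative model only: `struct request` and `next_request` are hypothetical names, the list is a plain singly linked list rather than the kernel's list type, and it assumes requests stay in ascending seqno order because they are numbered at enqueue time.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a queued upcall request. */
struct request {
    uint64_t seqno;             /* assigned when the request is enqueued */
    struct request *next;
};

/* Return the first request whose seqno is at or after the reader's
 * position, or NULL if the reader has consumed everything queued. */
static const struct request *next_request(const struct request *head,
                                           uint64_t seqno)
{
    for (const struct request *rq = head; rq; rq = rq->next)
        if (rq->seqno >= seqno)
            return rq;
    return NULL;
}
```

A reader that remembers only a number keeps a valid position even when the request it was pointing at is removed: with seqnos 1, 2, and 4 queued, a reader positioned at 3 simply resumes at 4.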
Jeff Layton	552d0e17ea	sunrpc: convert queue_wait from global to per-cache-detail waitqueue The queue_wait waitqueue is currently a file-scoped global, so a wake_up for one cache_detail wakes pollers on all caches. Convert it to a per-cache-detail field so that only pollers on the relevant cache are woken. Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>	2026-03-29 21:25:09 -04:00
Jeff Layton	17c1d66579	sunrpc: convert queue_lock from global spinlock to per-cache-detail lock The global queue_lock serializes all upcall queue operations across every cache_detail instance. Convert it to a per-cache-detail spinlock so that different caches (e.g. auth.unix.ip vs nfsd.fh) no longer contend with each other on queue operations. Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>	2026-03-29 21:25:09 -04:00
Andy Shevchenko	adcc59114c	sunrpc: Kill RPC_IFDEBUG() RPC_IFDEBUG() is used in only two places. In one the user of the definition is guarded by ifdeffery, in the second one it's implied due to dprintk() usage. Kill the macro and move the ifdeffery to the regular condition with the variable defined inside, while in the second case add the same conditional and move the respective code there. Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>	2026-03-29 21:25:09 -04:00
Ryota Sakamoto	ed7f4d323b	SUNRPC: Replace KUnit tests for memcmp() with KUNIT_EXPECT_MEMEQ_MSG() Replace KUnit tests for memcmp() with KUNIT_EXPECT_MEMEQ_MSG() to improve debugging that prints the hex dump of the buffers when the assertion fails, whereas memcmp() only returns an integer difference. Signed-off-by: Ryota Sakamoto <sakamo.ryota@gmail.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>	2026-03-29 21:25:09 -04:00
NeilBrown	7b546bd899	sunrpc/cache: improve RCU safety in cache_list walking. 1/ consistently use hlist_add_head_rcu() when adding to the cachelist to reflect the fact that it can be concurrently walked using RCU. In fact hlist_add_head() has all the needed barriers so this is no safety issue, primarily a clarity issue. 2/ call cache_get() before adding the list with hlist_add_head_rcu(). It is generally safest to inc the refcount before publishing a reference. In this case it doesn't have any behavioural effect as code which does an RCU walk does not depend on precision of the refcount, and it will always be at least one. But it looks more correct to use this order. 3/ avoid possible races between NULL tests and hlist_entry_safe() calls. It is possible that a test will find that .next or .head is not NULL, but hlist_entry_safe() will find that it is NULL. This can lead to incorrect behaviour with the list-walk terminating early. It is safest to always call hlist_entry_safe() and test the result. Also simplify the *ppos calculation by simply assigning the hash shifted 32, rather than masking out low bits and incrementing high bits. Signed-off-by: NeilBrown <neil@brown.name> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>	2026-03-29 21:25:09 -04:00
Linus Torvalds	8a30aeb0d1	nfsd-7.0 fixes: Issues that need expedient stable backports: - Fix cache_request leak in cache_release() - Fix heap overflow in the NFSv4.0 LOCK replay cache - Hold net reference for the lifetime of /proc/fs/nfs/exports fd - Defer sub-object cleanup in export "put" callbacks -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEKLLlsBKG3yQ88j7+M2qzM29mf5cFAmm7ELoACgkQM2qzM29m f5fbMQ/6AjYdEQh56X2G1Y899zsvT4jfOZSc8dYjxK6seNZLQBCOz54w4aRo0TmP keYIew8w2atCwWAlYT1xXqJVt90EG36fGodnw3EN+0g3nxPsIy1JeZwTUz1xagaI hDbFwo6bN4HxU457/XxPO4jNdvpztq8hbTdRkXsD/Ckh2Db1juKkTQ+kX0rCxL5s xZPDgKCsTQeFjfs+gdnbyEixc8vnQMAiUP15Df+HQdwCGD62meQ1S0BBVywRhCAK FoufgPRnCzB189PKYCpivCNSImeSasQ4cS3WYi1i9ZB3OvEzRnqaPAvvRWQTwWfs 7IIekorKagCvXbqEt3dMQn7UaVyFLgV8OMR04JGqpI05GylNBQVONty/BKzQVTdH Hp2C9PCitoPC68UabQZ22rCH8zpMREk+sH785ztLyuKGgC09YLTkxrltHllzKWAQ k5DkeTmySVeobpif4urQKHyxhWZ//ah0MJOsSE4XcPMCWk7RPshj4tZyzvXdbuR1 IZQbOSruUd9aaZ4Q9J8D66oVyBatq9RFP4yxxR7L3CLSXJUsWK0AriEY9EZAeUe7 GeOaiUJ34F2oE4FfF9XaTmsXG9EuXtps6PlYDlHjlSyRJyg3detTJP4YeKJCrlQC x+x7DN5gN2ZUuR+vqlS1BWGm24usmeNBPqvZ2hi6d+NpPgcLoUk= =xX5n -----END PGP SIGNATURE----- Merge tag 'nfsd-7.0-2' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux Pull nfsd fixes from Chuck Lever: - Fix cache_request leak in cache_release() - Fix heap overflow in the NFSv4.0 LOCK replay cache - Hold net reference for the lifetime of /proc/fs/nfs/exports fd - Defer sub-object cleanup in export "put" callbacks * tag 'nfsd-7.0-2' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux: nfsd: fix heap overflow in NFSv4.0 LOCK replay cache sunrpc: fix cache_request leak in cache_release NFSD: Hold net reference for the lifetime of /proc/fs/nfs/exports fd NFSD: Defer sub-object cleanup in export put callbacks	2026-03-18 14:27:11 -07:00
Jeff Layton	17ad31b3a4	sunrpc: fix cache_request leak in cache_release When a reader's file descriptor is closed while in the middle of reading a cache_request (rp->offset != 0), cache_release() decrements the request's readers count but never checks whether it should free the request. In cache_read(), when readers drops to 0 and CACHE_PENDING is clear, the cache_request is removed from the queue and freed along with its buffer and cache_head reference. cache_release() lacks this cleanup. The only other path that frees requests with readers == 0 is cache_dequeue(), but it runs only when CACHE_PENDING transitions from set to clear. If that transition already happened while readers was still non-zero, cache_dequeue() will have skipped the request, and no subsequent call will clean it up. Add the same cleanup logic from cache_read() to cache_release(): after decrementing readers, check if it reached 0 with CACHE_PENDING clear, and if so, dequeue and free the cache_request. Reported-by: NeilBrown <neilb@ownmail.net> Fixes: `1da177e4c3` ("Linux-2.6.12-rc2") Cc: stable@kernel.org Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>	2026-03-14 11:37:13 -04:00
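The cleanup rule described in this commit (free the request once the last reader is gone and the pending flag is clear) can be illustrated with a small userspace sketch. The `struct request` type and `request_put_reader()` helper below are hypothetical stand-ins, not the kernel's `cache_request` code; the point is only that every path that drops a reader should run the same check.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for a queued upcall request. */
struct request {
	int readers;   /* number of readers currently mid-read */
	bool pending;  /* stand-in for a CACHE_PENDING-style flag */
};

/*
 * Drop one reader. If that was the last reader and the request is no
 * longer pending, nothing else will clean it up, so free it here.
 * Returns true when the request was freed.
 */
static bool request_put_reader(struct request *rq)
{
	assert(rq->readers > 0);
	rq->readers--;
	if (rq->readers == 0 && !rq->pending) {
		free(rq);
		return true;
	}
	return false;
}
```

In this sketch both the "finished reading" path and the "file descriptor closed" path would call `request_put_reader()`, so neither can skip the final free.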
Eric Badger	7b6275c80a	xprtrdma: Decrement re_receiving on the early exit paths In the event that rpcrdma_post_recvs() fails to create a work request (due to memory allocation failure, say) or otherwise exits early, we should decrement ep->re_receiving before returning. Otherwise we will hang in rpcrdma_xprt_drain() as re_receiving will never reach zero and the completion will never be triggered. On a system with high memory pressure, this can appear as the following hung task: INFO: task kworker/u385:17:8393 blocked for more than 122 seconds. Tainted: G S E 6.19.0 #3 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. task:kworker/u385:17 state:D stack:0 pid:8393 tgid:8393 ppid:2 task_flags:0x4248060 flags:0x00080000 Workqueue: xprtiod xprt_autoclose [sunrpc] Call Trace: <TASK> __schedule+0x48b/0x18b0 ? ib_post_send_mad+0x247/0xae0 [ib_core] schedule+0x27/0xf0 schedule_timeout+0x104/0x110 __wait_for_common+0x98/0x180 ? __pfx_schedule_timeout+0x10/0x10 wait_for_completion+0x24/0x40 rpcrdma_xprt_disconnect+0x444/0x460 [rpcrdma] xprt_rdma_close+0x12/0x40 [rpcrdma] xprt_autoclose+0x5f/0x120 [sunrpc] process_one_work+0x191/0x3e0 worker_thread+0x2e3/0x420 ? __pfx_worker_thread+0x10/0x10 kthread+0x10d/0x230 ? __pfx_kthread+0x10/0x10 ret_from_fork+0x273/0x2b0 ? __pfx_kthread+0x10/0x10 ret_from_fork_asm+0x1a/0x30 Fixes: `15788d1d10` ("xprtrdma: Do not refresh Receive Queue while it is draining") Signed-off-by: Eric Badger <ebadger@purestorage.com> Reviewed-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>	2026-02-27 15:42:14 -05:00
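The bug class fixed here, a counter incremented on entry but not decremented on an early return, can be sketched in userspace as follows. The `struct endpoint` type, the `post_recvs()` function, and the `fail_alloc` switch are assumptions made for illustration; they do not reproduce `rpcrdma_post_recvs()`.

```c
#include <assert.h>

/* Hypothetical endpoint with an "in progress" counter. */
struct endpoint {
	int receiving;  /* a drain step would wait for this to reach zero */
};

/*
 * Post up to 'needed' receives. 'fail_alloc' simulates an allocation
 * failure. Every exit path funnels through 'out' so the counter is
 * always balanced. Returns the number of receives posted.
 */
static int post_recvs(struct endpoint *ep, int needed, int fail_alloc)
{
	int posted = 0;

	ep->receiving++;
	if (needed <= 0)
		goto out;       /* early exit: nothing to do */
	if (fail_alloc)
		goto out;       /* early exit: could not build a work request */
	posted = needed;
out:
	ep->receiving--;
	assert(ep->receiving >= 0);
	return posted;
}
```

Routing all returns through a single label keeps the increment and decrement visibly paired, which is the property the hung-task trace above shows being violated.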
Linus Torvalds	32a92f8c89	Convert more 'alloc_obj' cases to default GFP_KERNEL arguments This converts some of the visually simpler cases that have been split over multiple lines. I only did the ones that are easy to verify the resulting diff by having just that final GFP_KERNEL argument on the next line. Somebody should probably do a proper coccinelle script for this, but for me the trivial script actually resulted in an assertion failure in the middle of the script. I probably had made it a bit _too_ trivial. So after fighting that for a while I decided to just do some of the syntactically simpler cases with variations of the previous 'sed' scripts. The more syntactically complex multi-line cases would mostly really want whitespace cleanup anyway. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2026-02-21 20:03:00 -08:00
Linus Torvalds	323bbfcf1e	Convert 'alloc_flex' family to use the new default GFP_KERNEL argument This is the exact same thing as the 'alloc_obj()' version, only much smaller because there are a lot fewer users of the alloc_flex() interface. As with alloc_obj() version, this was done entirely with mindless brute force, using the same script, except using 'flex' in the pattern rather than 'objs'. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2026-02-21 17:09:51 -08:00
Linus Torvalds	bf4afc53b7	Convert 'alloc_obj' family to use the new default GFP_KERNEL argument This was done entirely with mindless brute force, using git grep -l '\<k[vmz]*alloc_objs*(.*, GFP_KERNEL)' | xargs sed -i 's/\(alloc_objs*(.*\), GFP_KERNEL)/\1)/' to convert the new alloc_obj() users that had a simple GFP_KERNEL argument to just drop that argument. Note that due to the extreme simplicity of the scripting, any slightly more complex cases spread over multiple lines would not be triggered: they definitely exist, but this covers the vast bulk of the cases, and the resulting diff is also then easier to check automatically. For the same reason the 'flex' versions will be done as a separate conversion. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2026-02-21 17:09:51 -08:00
Kees Cook	69050f8d6d	treewide: Replace kmalloc with kmalloc_obj for non-scalar types This is the result of running the Coccinelle script from scripts/coccinelle/api/kmalloc_objs.cocci. The script is designed to avoid scalar types (which need careful case-by-case checking), and instead replace kmalloc-family calls that allocate struct or union object instances: Single allocations: kmalloc(sizeof(TYPE), ...) are replaced with: kmalloc_obj(TYPE, ...) Array allocations: kmalloc_array(COUNT, sizeof(TYPE), ...) are replaced with: kmalloc_objs(TYPE, COUNT, ...) Flex array allocations: kmalloc(struct_size(PTR, FAM, COUNT), ...) are replaced with: kmalloc_flex(PTR, FAM, COUNT, ...) (where TYPE may also be VAR) The resulting allocations no longer return "void *", instead returning "TYPE *". Signed-off-by: Kees Cook <kees@kernel.org>	2026-02-21 01:02:28 -08:00
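The idea behind the typed allocation helpers can be shown with a userspace analogue. The `malloc_obj()` and `malloc_objs()` macros below are hypothetical stand-ins built on `malloc()` and `calloc()`; they are not the kernel's `kmalloc_obj()` definitions and take no GFP flags. They only illustrate how passing the type instead of a size yields a typed pointer.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical userspace stand-ins, NOT the kernel macros. */
#define malloc_obj(TYPE)         ((TYPE *)malloc(sizeof(TYPE)))
/* calloc() also zeroes the memory and checks COUNT * size for overflow. */
#define malloc_objs(TYPE, COUNT) ((TYPE *)calloc((COUNT), sizeof(TYPE)))

struct point {
	int x;
	int y;
};

/* Before: struct point *p = malloc(sizeof(struct point)); */
static struct point *point_new(int x, int y)
{
	struct point *p = malloc_obj(struct point); /* typed, not void * */

	if (!p)
		return NULL;
	p->x = x;
	p->y = y;
	return p;
}

/* Before: malloc(n * sizeof(struct point)), with manual overflow care. */
static struct point *points_new(size_t n)
{
	assert(n > 0);
	return malloc_objs(struct point, n);
}
```

Because the macro result already has type `struct point *`, assigning it to a pointer of a different struct type draws a compiler diagnostic, which a plain `void *` return would not.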
Linus Torvalds	7449f86baf	NFS Client Updates for Linux 7.0 New Features: * Use an LRU list for returning unused delegations * Introduce a KConfig option to disable NFS v4.0 and make NFS v4.1 the default Bugfixes: * NFS/localio: Handle short writes by retrying * NFS/localio: prevent direct reclaim recursion into NFS via nfs_writepages * NFS/localio: use GFP_NOIO and non-memreclaim workqueue in nfs_local_commit * NFS/localio: remove -EAGAIN handling in nfs_local_doio() * pNFS: fix a missing wake up while waiting on NFS_LAYOUT_DRAIN * fs/nfs: Fix a readdir slow-start regression * SUNRPC: fix gss_auth kref leak in gss_alloc_msg error path Other Cleanups and Improvements: * A few other NFS/localio cleanups * Various other delegation handling cleanups from Christoph * Unify security_inode_listsecurity() calls * Improvements to NFSv4 lease handling * Clean up SUNRPC _debug fields when CONFIG_SUNRPC_DEBUG is not set -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEnZ5MQTpR7cLU7KEp18tUv7ClQOsFAmmORX0ACgkQ18tUv7Cl QOtFORAAyCwTst5iEPRJ9rKZ/Kl39zHbA/QUn3CmmVkGlOBj0j7mWRyU5X0vlIQ9 mUF3Ikm1XYpsxPTKBEELVumPkggT2nfsFx5518BrpRTODibzc/CZ10/z7q4qarvI UhdFlt9SRG4RhhOdAaThF6XVUsRSwGwVZo/YyYemCc/evjNVyXa0wfwbDl9l4Nzr 1Sxt2/zeq3Eu4IfrxQpFM+0UuSScmVODSe8Jm4GnmlU/Q7x+onW35IvyuzTkgDwG 8PAeH4b5uADY9VWnTHpvr1fQNnBoEw8b4qr9a7AXQKRIcPGMvgKkdK+f6hOh1cEs +O+L4+uixo7QXudnWC27brZSyHwDIVVaJGPF/kNv4O2GKDyEcbsHtQv2G1+1+PtR FCtRFGpLq2pZxb9SY/s73FKp6a8bd81FAtzAL7iYU+2FDtvEDKss1nG6sQNG1+Z4 G8rI79PoimR4I6Jr5hk4sl8pM8wJVLZdcW+ytrEKl9FC+rFDrP9lVzHYArTFgIky N/IjEflejRfZ9bYIZ9/CYnFZC3Htrm8K9zerCRDsf96tvhxkX8FZM8tuZpHEMIbx Cx8XKCk+ubqLIF2mT+FKOc5T6CUmMiGRNagLkx0h0mbvRSI8HTpgQZGrbYkMk0Hs abUhvH73pRi0LRvkzPHfcNaZ7Y/mFBYfwBMwTUWJzh6CEgXnpks= =CwZq -----END PGP SIGNATURE----- Merge tag 'nfs-for-7.0-1' of git://git.linux-nfs.org/projects/anna/linux-nfs Pull NFS client updates from Anna Schumaker: "New Features: - Use an LRU list for returning unused delegations - Introduce a KConfig option to disable NFS v4.0 and make 
NFS v4.1 the default Bugfixes: - NFS/localio: - Handle short writes by retrying - Prevent direct reclaim recursion into NFS via nfs_writepages - Use GFP_NOIO and non-memreclaim workqueue in nfs_local_commit - Remove -EAGAIN handling in nfs_local_doio() - pNFS: fix a missing wake up while waiting on NFS_LAYOUT_DRAIN - fs/nfs: Fix a readdir slow-start regression - SUNRPC: fix gss_auth kref leak in gss_alloc_msg error path Other cleanups and improvements: - A few other NFS/localio cleanups - Various other delegation handling cleanups from Christoph - Unify security_inode_listsecurity() calls - Improvements to NFSv4 lease handling - Clean up SUNRPC _debug fields when CONFIG_SUNRPC_DEBUG is not set" * tag 'nfs-for-7.0-1' of git://git.linux-nfs.org/projects/anna/linux-nfs: (60 commits) SUNRPC: fix gss_auth kref leak in gss_alloc_msg error path nfs: nfs4proc: Convert comma to semicolon SUNRPC: Change list definition method sunrpc: rpc_debug and others are defined even if CONFIG_SUNRPC_DEBUG unset NFSv4: limit lease period in nfs4_set_lease_period() NFSv4: pass lease period in seconds to nfs4_set_lease_period() nfs: unify security_inode_listsecurity() calls fs/nfs: Fix readdir slow-start regression pNFS: fix a missing wake up while waiting on NFS_LAYOUT_DRAIN NFS: fix delayed delegation return handling NFS: simplify error handling in nfs_end_delegation_return NFS: fold nfs_abort_delegation_return into nfs_end_delegation_return NFS: remove the delegation == NULL check in nfs_end_delegation_return NFS: use bool for the issync argument to nfs_end_delegation_return NFS: return void from ->return_delegation NFS: return void from nfs4_inode_make_writeable NFS: Merge CONFIG_NFS_V4_1 with CONFIG_NFS_V4 NFS: Add a way to disable NFS v4.0 via KConfig NFS: Move sequence slot operations into minorversion operations NFS: Pass a struct nfs_client to nfs4_init_sequence() ...	2026-02-12 17:49:33 -08:00
Linus Torvalds	311aa68319	RDMA v7.0 merge window Usual smallish cycle: - Various code improvements in irdma, rtrs, qedr, ocrdma, irdma, rxe - Small driver improvements and minor bug fixes to hns, mlx5, rxe, mana, mlx5, irdma - Robusness improvements in completion processing for EFA - New query_port_speed() verb to move past limited IBA defined speed steps - Support for SG_GAPS in rts and many other small improvements - Rare list corruption fix in iwcm - Better support different page sizes in rxe - Device memory support for mana - Direct bio vec to kernel MR for use by NFS-RDMA - QP rate limiting for bnxt_re - Remote triggerable NULL pointer crash in siw - DMA-buf exporter support for RDMA mmaps like doorbells -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQRRRCHOFoQz/8F5bUaFwuHvBreFYQUCaY44vgAKCRCFwuHvBreF YfiZAP91cMZfogN7r1FMD75xDZu55dI3Jvy8OaixyRxlWLGPcQEAjritdL0o7fZp YrD1OXNS/1XG//rPBVw7xj+54Aa8hAU= =AVcu -----END PGP SIGNATURE----- Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma Pull rdma updates from Jason Gunthorpe: "Usual smallish cycle. The NFS biovec work to push it down into RDMA instead of indirecting through a scatterlist is pretty nice to see, been talked about for a long time now. 
- Various code improvements in irdma, rtrs, qedr, ocrdma, irdma, rxe - Small driver improvements and minor bug fixes to hns, mlx5, rxe, mana, mlx5, irdma - Robusness improvements in completion processing for EFA - New query_port_speed() verb to move past limited IBA defined speed steps - Support for SG_GAPS in rts and many other small improvements - Rare list corruption fix in iwcm - Better support different page sizes in rxe - Device memory support for mana - Direct bio vec to kernel MR for use by NFS-RDMA - QP rate limiting for bnxt_re - Remote triggerable NULL pointer crash in siw - DMA-buf exporter support for RDMA mmaps like doorbells" * tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (66 commits) RDMA/mlx5: Implement DMABUF export ops RDMA/uverbs: Add DMABUF object type and operations RDMA/uverbs: Support external FD uobjects RDMA/siw: Fix potential NULL pointer dereference in header processing RDMA/umad: Reject negative data_len in ib_umad_write IB/core: Extend rate limit support for RC QPs RDMA/mlx5: Support rate limit only for Raw Packet QP RDMA/bnxt_re: Report QP rate limit in debugfs RDMA/bnxt_re: Report packet pacing capabilities when querying device RDMA/bnxt_re: Add support for QP rate limiting MAINTAINERS: Drop RDMA files from Hyper-V section RDMA/uverbs: Add __GFP_NOWARN to ib_uverbs_unmarshall_recv() kmalloc svcrdma: use bvec-based RDMA read/write API RDMA/core: add rdma_rw_max_sge() helper for SQ sizing RDMA/core: add MR support for bvec-based RDMA operations RDMA/core: use IOVA-based DMA mapping for bvec RDMA operations RDMA/core: add bio_vec based RDMA read/write API RDMA/irdma: Use kvzalloc for paged memory DMA address array RDMA/rxe: Fix race condition in QP timer handlers RDMA/mana_ib: Add device‑memory support ...	2026-02-12 17:05:20 -08:00
Linus Torvalds	136114e0ab	mm.git review status for linus..mm-nonmm-stable Total patches: 107 Reviews/patch: 1.07 Reviewed rate: 67% - The 2 patch series "ocfs2: give ocfs2 the ability to reclaim suballocator free bg" from Heming Zhao saves disk space by teaching ocfs2 to reclaim suballocator block group space. - The 4 patch series "Add ARRAY_END(), and use it to fix off-by-one bugs" from Alejandro Colomar adds the ARRAY_END() macro and uses it in various places. - The 2 patch series "vmcoreinfo: support VMCOREINFO_BYTES larger than PAGE_SIZE" from Pnina Feder makes the vmcore code future-safe, if VMCOREINFO_BYTES ever exceeds the page size. - The 7 patch series "kallsyms: Prevent invalid access when showing module buildid" from Petr Mladek cleans up kallsyms code related to module buildid and fixes an invalid access crash when printing backtraces. - The 3 patch series "Address page fault in ima_restore_measurement_list()" from Harshit Mogalapalli fixes a kexec-related crash that can occur when booting the second-stage kernel on x86. - The 6 patch series "kho: ABI headers and Documentation updates" from Mike Rapoport updates the kexec handover ABI documentation. - The 4 patch series "Align atomic storage" from Finn Thain adds the __aligned attribute to atomic_t and atomic64_t definitions to get natural alignment of both types on csky, m68k, microblaze, nios2, openrisc and sh. - The 2 patch series "kho: clean up page initialization logic" from Pratyush Yadav simplifies the page initialization logic in kho_restore_page(). - The 6 patch series "Unload linux/kernel.h" from Yury Norov moves several things out of kernel.h and into more appropriate places. - The 7 patch series "don't abuse task_struct.group_leader" from Oleg Nesterov removes the usage of ->group_leader when it is "obviously unnecessary". - The 5 patch series "list private v2 & luo flb" from Pasha Tatashin adds some infrastructure improvements to the live update orchestrator. 
-----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCaY4giAAKCRDdBJ7gKXxA jgusAQDnKkP8UWTqXPC1jI+OrDJGU5ciAx8lzLeBVqMKzoYk9AD/TlhT2Nlx+Ef6 0HCUHUD0FMvAw/7/Dfc6ZKxwBEIxyww= =mmsH -----END PGP SIGNATURE----- Merge tag 'mm-nonmm-stable-2026-02-12-10-48' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull non-MM updates from Andrew Morton: - "ocfs2: give ocfs2 the ability to reclaim suballocator free bg" saves disk space by teaching ocfs2 to reclaim suballocator block group space (Heming Zhao) - "Add ARRAY_END(), and use it to fix off-by-one bugs" adds the ARRAY_END() macro and uses it in various places (Alejandro Colomar) - "vmcoreinfo: support VMCOREINFO_BYTES larger than PAGE_SIZE" makes the vmcore code future-safe, if VMCOREINFO_BYTES ever exceeds the page size (Pnina Feder) - "kallsyms: Prevent invalid access when showing module buildid" cleans up kallsyms code related to module buildid and fixes an invalid access crash when printing backtraces (Petr Mladek) - "Address page fault in ima_restore_measurement_list()" fixes a kexec-related crash that can occur when booting the second-stage kernel on x86 (Harshit Mogalapalli) - "kho: ABI headers and Documentation updates" updates the kexec handover ABI documentation (Mike Rapoport) - "Align atomic storage" adds the __aligned attribute to atomic_t and atomic64_t definitions to get natural alignment of both types on csky, m68k, microblaze, nios2, openrisc and sh (Finn Thain) - "kho: clean up page initialization logic" simplifies the page initialization logic in kho_restore_page() (Pratyush Yadav) - "Unload linux/kernel.h" moves several things out of kernel.h and into more appropriate places (Yury Norov) - "don't abuse task_struct.group_leader" removes the usage of ->group_leader when it is "obviously unnecessary" (Oleg Nesterov) - "list private v2 & luo flb" adds some infrastructure improvements to the live update orchestrator (Pasha Tatashin) * tag 'mm-nonmm-stable-2026-02-12-10-48' of 
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (107 commits) watchdog/hardlockup: simplify perf event probe and remove per-cpu dependency procfs: fix missing RCU protection when reading real_parent in do_task_stat() watchdog/softlockup: fix sample ring index wrap in need_counting_irqs() kcsan, compiler_types: avoid duplicate type issues in BPF Type Format kho: fix doc for kho_restore_pages() tests/liveupdate: add in-kernel liveupdate test liveupdate: luo_flb: introduce File-Lifecycle-Bound global state liveupdate: luo_file: Use private list list: add kunit test for private list primitives list: add primitives for private list manipulations delayacct: fix uapi timespec64 definition panic: add panic_force_cpu= parameter to redirect panic to a specific CPU netclassid: use thread_group_leader(p) in update_classid_task() RDMA/umem: don't abuse current->group_leader drm/pan*: don't abuse current->group_leader drm/amd: kill the outdated "Only the pthreads threading model is supported" checks drm/amdgpu: don't abuse current->group_leader android/binder: use same_thread_group(proc->tsk, current) in binder_mmap() android/binder: don't abuse current->group_leader kho: skip memoryless NUMA nodes when reserving scratch areas ...	2026-02-12 12:13:01 -08:00
Daniel Hodges	dd2fdc3504	SUNRPC: fix gss_auth kref leak in gss_alloc_msg error path Commit `5940d1cf9f` ("SUNRPC: Rebalance a kref in auth_gss.c") added a kref_get(&gss_auth->kref) call to balance the gss_put_auth() done in gss_release_msg(), but forgot to add a corresponding kref_put() on the error path when kstrdup_const() fails. If service_name is non-NULL and kstrdup_const() fails, the function jumps to err_put_pipe_version which calls put_pipe_version() and kfree(gss_msg), but never releases the gss_auth reference. This leads to a kref leak where the gss_auth structure is never freed. Add a forward declaration for gss_free_callback() and call kref_put() in the err_put_pipe_version error path to properly release the reference taken earlier. Fixes: `5940d1cf9f` ("SUNRPC: Rebalance a kref in auth_gss.c") Cc: stable@vger.kernel.org Signed-off-by: Daniel Hodges <git@danielhodges.dev> Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>	2026-02-09 16:39:50 -05:00
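The leak pattern in this commit (a reference taken early, then an error path that frees the message without dropping the reference) can be sketched with a plain counter. All names below (`struct ref`, `struct auth`, `alloc_msg()`, the `fail_dup` switch) are hypothetical and stand in for the kernel's kref and `gss_alloc_msg()` logic only loosely.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

struct ref { int count; };

static void ref_get(struct ref *r) { r->count++; }
static void ref_put(struct ref *r) { assert(r->count > 0); r->count--; }

struct auth { struct ref ref; };

struct msg {
	struct auth *auth;
	char *service_name;
};

/* 'fail_dup' simulates the string duplication failing. */
static struct msg *alloc_msg(struct auth *auth, const char *name, int fail_dup)
{
	struct msg *m = calloc(1, sizeof(*m));

	if (!m)
		return NULL;
	ref_get(&auth->ref);            /* the message now pins 'auth' */
	m->auth = auth;
	if (name) {
		size_t len = strlen(name) + 1;

		m->service_name = fail_dup ? NULL : malloc(len);
		if (!m->service_name)
			goto err_put_auth;
		memcpy(m->service_name, name, len);
	}
	return m;

err_put_auth:
	ref_put(&auth->ref);            /* without this, the reference leaks */
	free(m);
	return NULL;
}
```

After a failed `alloc_msg()` the count on `auth` is back to its starting value, which is the invariant the fix restores.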
Chenguang Zhao	afb24505ff	SUNRPC: Change list definition method The LIST_HEAD macro can both define a linked list and initialize it in one step. To simplify code, we replace the separate operations of linked list definition and manual initialization with the LIST_HEAD macro. Signed-off-by: Chenguang Zhao <zhaochenguang@kylinos.cn> Reviewed-by: Chuck Lever <chuck.lever@oracle.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>	2026-02-09 14:24:19 -05:00
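For readers unfamiliar with the macro, a simplified userspace re-creation of the list head helpers shows why the two forms are equivalent. This sketch mirrors the shape of the kernel's `LIST_HEAD()` and `INIT_LIST_HEAD()` but omits the kernel's `READ_ONCE()`/`WRITE_ONCE()` annotations.

```c
#include <assert.h>

struct list_head {
	struct list_head *next, *prev;
};

/* An empty list is a head whose pointers refer back to itself. */
#define LIST_HEAD_INIT(name) { &(name), &(name) }
#define LIST_HEAD(name) struct list_head name = LIST_HEAD_INIT(name)

static inline void INIT_LIST_HEAD(struct list_head *list)
{
	list->next = list;
	list->prev = list;
}

static inline int list_empty(const struct list_head *head)
{
	return head->next == head;
}

/* Two-step form that the commit replaces. */
static int two_step_is_empty(void)
{
	struct list_head pending;

	INIT_LIST_HEAD(&pending);
	return list_empty(&pending);
}

/* One-step form using the macro. */
static int one_step_is_empty(void)
{
	LIST_HEAD(pending);

	return list_empty(&pending);
}
```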
Jeff Layton	a0022a38be	sunrpc: allow svc_recv() to return -ETIMEDOUT and -EBUSY To dynamically adjust the thread count, nfsd requires some information about how busy things are. Change svc_recv() to take a timeout value, and then allow the wait for work to time out if it's set. If a timeout is not defined, then the schedule will be set to MAX_SCHEDULE_TIMEOUT. If the task waits for the full timeout, then have it return -ETIMEDOUT to the caller. If it wakes up, finds that there is more work and that no threads are available, then attempt to set SP_TASK_STARTING. If it wasn't already set, have the task return -EBUSY to cue to the caller that the service could use more threads. Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>	2026-01-28 10:15:42 -05:00
Jeff Layton	7f221b340d	sunrpc: split new thread creation into a separate function Break out the part of svc_start_kthreads() that creates a thread into svc_new_thread(), as a new exported helper function. Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>	2026-01-28 10:15:42 -05:00
Jeff Layton	7ffc7ade2c	sunrpc: introduce the concept of a minimum number of threads per pool Add a new pool->sp_nrthrmin field to track the minimum number of threads in a pool. Add min_threads parameters to both svc_set_num_threads() and svc_set_pool_threads(). If min_threads is non-zero and less than the max, svc_set_num_threads() will ensure that the number of running threads is between the min and the max. If the min is 0 or greater than the max, then it is ignored, and the maximum number of threads will be started, and never spun down. For now, the min_threads is always 0, but a later patch will pass the proper value through from nfsd. Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>	2026-01-28 10:15:42 -05:00
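The min/max rule stated in this commit message can be written down as a small pure function. `effective_min_threads()` is a hypothetical helper name, not something taken from the patch; it encodes only the rule that a minimum of 0, or one above the maximum, is ignored so the pool stays at the maximum.

```c
#include <assert.h>

/*
 * Lower bound on running threads for a pool, per the rule above:
 * a usable minimum is non-zero and no greater than the maximum.
 * Otherwise the minimum is ignored and the pool is pinned at 'max'.
 */
static unsigned int effective_min_threads(unsigned int min, unsigned int max)
{
	if (min == 0 || min > max)
		return max;
	return min;
}

/* Clamp a proposed thread count into [effective min, max]. */
static unsigned int clamp_threads(unsigned int want, unsigned int min,
				  unsigned int max)
{
	unsigned int lo = effective_min_threads(min, max);

	assert(lo <= max);
	if (want < lo)
		return lo;
	if (want > max)
		return max;
	return want;
}
```

With `min` at 0, as the message says is the case for now, `clamp_threads()` always returns `max`, matching the "never spun down" behaviour described.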
Jeff Layton	6cd60f4274	sunrpc: track the max number of requested threads in a pool The kernel currently tracks the number of threads running in a pool in the "sp_nrthreads" field. In the future, where threads are dynamically spun up and down, it'll be necessary to keep track of the maximum number of requested threads separately from the actual number running. Add a pool->sp_nrthrmax parameter to track this. When userland changes the number of threads in a pool, update that value accordingly. Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>	2026-01-28 10:15:42 -05:00
Jeff Layton	2c01f0cf32	sunrpc: remove special handling of NULL pool from svc_start/stop_kthreads() Now that svc_set_num_threads() handles distributing the threads among the available pools, remove the special handling of a NULL pool pointer from svc_start_kthreads() and svc_stop_kthreads(). Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>	2026-01-28 10:15:42 -05:00
Jeff Layton	e344f87262	sunrpc: split svc_set_num_threads() into two functions svc_set_num_threads() will set the number of running threads for a given pool. If the pool argument is set to NULL however, it will distribute the threads among all of the pools evenly. These divergent codepaths complicate the move to dynamic threading. Simplify the API by splitting these two cases into different helpers: Add a new svc_set_pool_threads() function that sets the number of threads in a single, given pool. Modify svc_set_num_threads() to distribute the threads evenly between all of the pools and then call svc_set_pool_threads() for each. Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>	2026-01-28 10:15:42 -05:00
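The split described here (one function that spreads a total across pools, calling a per-pool setter for each) might look roughly like the sketch below. The commit message says only that threads are distributed evenly; giving the remainder to the first pools is an assumption of this example, and `distribute_threads()` is a hypothetical name.

```c
#include <assert.h>

/*
 * Spread 'total' threads across 'npools' pools as evenly as possible.
 * The first (total % npools) pools receive one extra thread; that
 * remainder policy is an assumption, not taken from the patch.
 */
static void distribute_threads(unsigned int total, unsigned int npools,
			       unsigned int *per_pool)
{
	unsigned int base, rem, i;

	assert(npools > 0);
	base = total / npools;
	rem = total % npools;
	for (i = 0; i < npools; i++)
		per_pool[i] = base + (i < rem ? 1 : 0);
}
```

A caller would then apply each `per_pool[i]` through the single-pool setter, which keeps the per-pool logic free of any "NULL means all pools" special case.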
Chuck Lever	5ee62b4a91	svcrdma: use bvec-based RDMA read/write API Convert svcrdma to the bvec-based RDMA API introduced earlier in this series. The bvec-based RDMA API eliminates the intermediate scatterlist conversion step, allowing direct DMA mapping from bio_vec arrays. This simplifies the svc_rdma_rw_ctxt structure by removing the chained SG table management. The structure retains an inline array approach similar to the previous scatterlist implementation: an inline bvec array sized to max_send_sge handles most I/O operations without additional allocation. Larger requests fall back to dynamic allocation. This preserves the allocation-free fast path for typical NFS operations while supporting arbitrarily large transfers. The bvec API handles all device types internally, including iWARP devices which require memory registration. No explicit fallback path is needed. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Link: https://patch.msgid.link/20260128005400.25147-6-cel@kernel.org Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Leon Romanovsky <leon@kernel.org>	2026-01-28 05:54:53 -05:00
Chuck Lever	afcae7d7b8	RDMA/core: add rdma_rw_max_sge() helper for SQ sizing svc_rdma_accept() computes sc_sq_depth as the sum of rq_depth and the number of rdma_rw contexts (ctxts). This value is used to allocate the Send CQ and to initialize the sc_sq_avail credit pool. However, when the device uses memory registration for RDMA operations, rdma_rw_init_qp() inflates the QP's max_send_wr by a factor of three per context to account for REG and INV work requests. The Send CQ and credit pool remain sized for only one work request per context, causing Send Queue exhaustion under heavy NFS WRITE workloads. Introduce rdma_rw_max_sge() to compute the actual number of Send Queue entries required for a given number of rdma_rw contexts. Upper layer protocols call this helper before creating a Queue Pair so that their Send CQs and credit accounting match the QP's true capacity. Update svc_rdma_accept() to use rdma_rw_max_sge() when computing sc_sq_depth, ensuring the credit pool reflects the work requests that rdma_rw_init_qp() will reserve. Reviewed-by: Christoph Hellwig <hch@lst.de> Fixes: `00bd1439f4` ("RDMA/rw: Support threshold for registration vs scattering to local pages") Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Link: https://patch.msgid.link/20260128005400.25147-5-cel@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org>	2026-01-28 05:54:53 -05:00
Chuck Lever	3e6397b056	SUNRPC: auth_gss: fix memory leaks in XDR decoding error paths The gssx_dec_ctx(), gssx_dec_status(), and gssx_dec_name() functions allocate memory via gssx_dec_buffer(), which calls kmemdup(). When a subsequent decode operation fails, these functions return immediately without freeing previously allocated buffers, causing memory leaks. The leak in gssx_dec_ctx() is particularly relevant because the caller (gssp_accept_sec_context_upcall) initializes several buffer length fields to non-zero values, resulting in memory allocation: struct gssx_ctx rctxh = { .exported_context_token.len = GSSX_max_output_handle_sz, .mech.len = GSS_OID_MAX_LEN, .src_name.display_name.len = GSSX_max_princ_sz, .targ_name.display_name.len = GSSX_max_princ_sz }; If, for example, gssx_dec_name() succeeds for src_name but fails for targ_name, the memory allocated for exported_context_token, mech, and src_name.display_name remains unreferenced and cannot be reclaimed. Add error handling with goto-based cleanup to free any previously allocated buffers before returning an error. Reported-by: Xingjing Deng <micro6947@gmail.com> Closes: https://lore.kernel.org/linux-nfs/CAK+ZN9qttsFDu6h1FoqGadXjMx1QXqPMoYQ=6O9RY4SxVTvKng@mail.gmail.com/ Fixes: `1d658336b0` ("SUNRPC: Add RPC based upcall mechanism for RPCGSS auth") Cc: stable@vger.kernel.org Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>	2026-01-26 10:10:58 -05:00
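The goto-based cleanup this commit adds follows a common C pattern: each successful allocation gets a label that undoes it, and later failures jump to the label for the most recent success. The decoder below is a hypothetical userspace sketch (`dec_buffer()`, `dec_ctx()`, and the three-field `struct ctx` are illustrative names), not the `gssx_dec_*` code.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

struct buf { char *data; size_t len; };
struct ctx { struct buf token, mech, name; };

/* Hypothetical decode step: copies 'src', or fails when 'src' is NULL. */
static int dec_buffer(struct buf *b, const char *src)
{
	if (!src)
		return -1;
	b->len = strlen(src);
	b->data = malloc(b->len + 1);
	if (!b->data)
		return -1;
	memcpy(b->data, src, b->len + 1);
	return 0;
}

static void free_buf(struct buf *b)
{
	free(b->data);
	b->data = NULL;
	b->len = 0;
}

/* Decode three fields; on failure, free whatever was decoded so far. */
static int dec_ctx(struct ctx *c, const char *token, const char *mech,
		   const char *name)
{
	int err;

	memset(c, 0, sizeof(*c));
	err = dec_buffer(&c->token, token);
	if (err)
		return err;
	err = dec_buffer(&c->mech, mech);
	if (err)
		goto out_free_token;
	err = dec_buffer(&c->name, name);
	if (err)
		goto out_free_mech;
	return 0;

out_free_mech:
	free_buf(&c->mech);
out_free_token:
	free_buf(&c->token);
	return err;
}
```

The labels fall through in reverse allocation order, so a failure at any step releases exactly the buffers allocated before it.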

5090 Commits