Highlights include:
Bugfixes:
- NFS: Fix handling of ENOSPC so that if we have to resend writes, they
are written synchronously.
- SUNRPC: RDMA transport fixes from Chuck
- NFSv4.2: Several fixes for delegated timestamps
- NFSv4: Failure to obtain a directory delegation should not cause
stat() to fail.
- NFSv4: Rename was failing to update timestamps when a directory
delegation is held.
- NFSv4: Ensure we check rsize/wsize after crossing a NFSv4 filesystem
boundary.
- NFSv4/pnfs: If the server is down, retry the layout returns on reboot
- NFSv4/pnfs: Fallback to MDS could result in a short write being
incorrectly logged.
Cleanups:
- NFS: use memcpy_and_pad in decode_fh
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQR8xgHcVzJNfOYElJo6EXfx2a6V0QUCaevSUgAKCRA6EXfx2a6V
0ewIAQD+23uMo5sxY10btKATcBBxswY5YMtN1qQBMyn88N0XfwEAz0+zoEbRv4L2
39goJ/WeJ0/gqhfJV9F+Oe2U1DbsEgM=
=l9y/
-----END PGP SIGNATURE-----
Merge tag 'nfs-for-7.1-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs
Pull NFS client updates from Trond Myklebust:
"Bugfixes:
- Fix handling of ENOSPC so that if we have to resend writes, they
are written synchronously
- SUNRPC RDMA transport fixes from Chuck
- Several fixes for delegated timestamps in NFSv4.2
- Failure to obtain a directory delegation should not cause stat() to
fail with NFSv4
- Rename was failing to update timestamps when a directory delegation
is held on NFSv4
- Ensure we check rsize/wsize after crossing a NFSv4 filesystem
boundary
- NFSv4/pnfs:
- If the server is down, retry the layout returns on reboot
- Fallback to MDS could result in a short write being incorrectly
logged
Cleanups:
- Use memcpy_and_pad in decode_fh"
* tag 'nfs-for-7.1-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (21 commits)
NFS: Fix RCU dereference of cl_xprt in nfs_compare_super_address
NFS: remove redundant __private attribute from nfs_page_class
NFSv4.2: fix CLONE/COPY attrs in presence of delegated attributes
NFS: fix writeback in presence of errors
nfs: use memcpy_and_pad in decode_fh
NFSv4.1: Apply session size limits on clone path
NFSv4: retry GETATTR if GET_DIR_DELEGATION failed
NFS: fix RENAME attr in presence of directory delegations
pnfs/flexfiles: validate ds_versions_cnt is non-zero
NFS/blocklayout: print each device used for SCSI layouts
xprtrdma: Post receive buffers after RPC completion
xprtrdma: Scale receive batch size with credit window
xprtrdma: Replace rpcrdma_mr_seg with xdr_buf cursor
xprtrdma: Decouple frwr_wp_create from frwr_map
xprtrdma: Close lost-wakeup race in xprt_rdma_alloc_slot
xprtrdma: Avoid 250 ms delay on backlog wakeup
xprtrdma: Close sendctx get/put race that can block a transport
nfs: update inode ctime after removexattr operation
nfs: fix utimensat() for atime with delegated timestamps
NFS: improve "Server wrote zero bytes" error
...
Benjamin Coddington contributed filehandle signing to defend against
filehandle-guessing attacks. The server now appends a SipHash-2-4
MAC to each filehandle when the new "sign_fh" export option is
enabled. NFSD then verifies filehandles received from clients
against the expected MAC; mismatches return NFS error STALE.
Chuck Lever converted the entire NLMv4 server-side XDR layer from
hand-written C to xdrgen-generated code, spanning roughly thirty
patches. XDR functions are generally boilerplate code and are easy
to get wrong. The goals of this conversion are improved memory
safety, lower maintenance burden, and groundwork for eventual Rust
code generation for these functions.
Dai Ngo improved pNFS block/SCSI layout robustness with two related
changes. SCSI persistent reservation fencing is now tracked per
client and per device via an xarray, to avoid both redundant preempt
operations on devices already fenced and a potential NFSD deadlock
when all nfsd threads are waiting for a layout return.
The remaining patches deliver scalability and infrastructure
improvements. Sincere thanks to all contributors, reviewers,
testers, and bug reporters who participated in the v7.1 NFSD
development cycle.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEKLLlsBKG3yQ88j7+M2qzM29mf5cFAmnlF50ACgkQM2qzM29m
f5dfHBAAi2o1i9/RA6fmxi2qSV7tkg79viuGFRj3c4cjiW8ZqQXos63zmy6BNMFG
joEoirdryUETkrrckXP81HKGSWBQqYjaXeklOw8dggQ8g72HGiqcoT3Ua7L9S7A8
/Db6IwZnJcehHO8XwHV4jSAfIZuvC0iiK02tVrVe/l/9GWcG+bS340GgE9Es2IAW
copBGlTwQah+eRvy2hP+Eo3vUTP8Rdebp9iYFI12xqx2x3LquFR01PpjCzotqAvV
AcvCPa/AGoSOjcL8idloL8F8mSaOCyx15YJH0lm3hRsPtS/VyXWjKvcejWUh/7PH
gHi+5VTsSKbUBj3PJQZU6rBQ67KnwVLZ33KkIF2ZNGllvK0yDGM0UfX/TuaEPjUV
6N0UkRprCHJdrULt9XMXmX3Ddnz1xbYT8CaeIDObw3Ix7SJKedvlLTjvsYCYtsQn
5pkHUuHmr/YAF4AQi/JI4ubZhZ+K3YytNS8YiMUkBWDbPoKzo2yrkzwjGjHdUp0y
l8LfEjePAcIpuFQZegERA9CnjIeKb66DJe8da0EwtreY+sejm/S8zbBUhMkXjo6u
QwdXXeLX3/zni6Op8vRA5JH//S5ovlQFnkUSvHRItSUrDBRVm+wXD7Vnp9bykKcN
leqbSvehnV4PIi0URMvN5ox1WNmsOFIZkv9nv8amyOX8PlRmLoA=
=iFl6
-----END PGP SIGNATURE-----
Merge tag 'nfsd-7.1' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux
Pull nfsd updates from Chuck Lever:
- filehandle signing to defend against filehandle-guessing attacks
(Benjamin Coddington)
The server now appends a SipHash-2-4 MAC to each filehandle when
the new "sign_fh" export option is enabled. NFSD then verifies
filehandles received from clients against the expected MAC;
mismatches return NFS error STALE
- convert the entire NLMv4 server-side XDR layer from hand-written C to
xdrgen-generated code, spanning roughly thirty patches (Chuck Lever)
XDR functions are generally boilerplate code and are easy to get
wrong. The goals of this conversion are improved memory safety, lower
maintenance burden, and groundwork for eventual Rust code generation
for these functions.
- improve pNFS block/SCSI layout robustness with two related changes
(Dai Ngo)
SCSI persistent reservation fencing is now tracked per client and
per device via an xarray, to avoid both redundant preempt operations
on devices already fenced and a potential NFSD deadlock when all nfsd
threads are waiting for a layout return.
- scalability and infrastructure improvements
Sincere thanks to all contributors, reviewers, testers, and bug
reporters who participated in the v7.1 NFSD development cycle.
* tag 'nfsd-7.1' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux: (83 commits)
NFSD: Docs: clean up pnfs server timeout docs
nfsd: fix comment typo in nfsxdr
nfsd: fix comment typo in nfs3xdr
NFSD: convert callback RPC program to per-net namespace
NFSD: use per-operation statidx for callback procedures
svcrdma: Use contiguous pages for RDMA Read sink buffers
SUNRPC: Add svc_rqst_page_release() helper
SUNRPC: xdr.h: fix all kernel-doc warnings
svcrdma: Factor out WR chain linking into helper
svcrdma: Add Write chunk WRs to the RPC's Send WR chain
svcrdma: Clean up use of rdma->sc_pd->device
svcrdma: Clean up use of rdma->sc_pd->device in Receive paths
svcrdma: Add fair queuing for Send Queue access
SUNRPC: Optimize rq_respages allocation in svc_alloc_arg
SUNRPC: Track consumed rq_pages entries
svcrdma: preserve rq_next_page in svc_rdma_save_io_pages
SUNRPC: Handle NULL entries in svc_rqst_release_pages
SUNRPC: Allocate a separate Reply page array
SUNRPC: Tighten bounds checking in svc_rqst_replace_page
NFSD: Sign filehandles
...
rpcrdma_post_recvs() runs in CQ poll context and its cost
falls on the latency-critical path between polling a Receive
completion and waking the RPC consumer. Every cycle spent
refilling the Receive Queue delays delivery of the reply to
the NFS layer.
Move the rpcrdma_post_recvs() call in rpcrdma_reply_handler()
to after the RPC has been decoded and completed. The larger
batch size from the preceding patch provides sufficient
Receive Queue headroom to absorb the brief delay before
buffers are replenished.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
The fixed RPCRDMA_MAX_RECV_BATCH of 7 results in frequent
small ib_post_recv batches during high-rate workloads. With
a 128-slot credit window, receives are reposted every 7th
completion, each batch incurring atomic serialization and a
doorbell write.
Replace the fixed batch constant with a per-endpoint value
scaled to 25% of the negotiated credit window. For a typical
128-credit connection this raises the batch from 7 to 32,
reducing doorbell frequency by roughly 4x and amortizing the
per-batch atomic and MMIO costs over a larger group of
receive WRs.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
The FRWR registration path converts data through three
representations: xdr_buf -> rpcrdma_mr_seg[] -> scatterlist[]
-> ib_map_mr_sg(). The rpcrdma_mr_seg intermediate is a relic
of when multiple registration strategies existed (FMR, physical,
FRWR). Only FRWR remains, so this indirection and the 6240-byte
rl_segments[260] array embedded in each rpcrdma_req serve no
purpose.
Introduce struct rpcrdma_xdr_cursor to track position within
an xdr_buf during iterative MR registration. Rewrite frwr_map to
populate scatterlist entries directly from the xdr_buf regions
(head kvec, page list, tail kvec). The boundary logic for
non-SG_GAPS devices is simpler because the xdr_buf structure
guarantees that page-region entries after the first start at
offset 0, and that head/tail kvecs are separate regions that
naturally break at MR boundaries.
Fix a pre-existing bug in rpcrdma_encode_write_list where the
write-pad statistics accumulator added mr->mr_length from the last
data MR rather than the write-pad MR. The refactored code uses
ep->re_write_pad_mr->mr_length.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
frwr_wp_create is the only caller of frwr_map outside the encode
path. It registers a single 4-byte write-pad region from a stack-
local rpcrdma_mr_seg. Inlining the registration logic directly
(sg_init_table + sg_set_page + ib_dma_map_sg + ib_map_mr_sg +
IOVA mangle + reg_wr setup) eliminates the coupling that would
otherwise complicate the removal of rpcrdma_mr_seg from frwr_map's
interface.
The inlined version adds a proper error-unwind ladder: on failure,
the DMA mapping (if established) is released, ep->re_write_pad_mr is
cleared, and the MR is returned to the transport free list. The old
frwr_map-based code relied on rpcrdma_mrs_destroy at teardown to
reclaim partially-initialized MRs.
This is a one-time setup path; duplicating ~20 lines is a reasonable
tradeoff for decoupling the write-pad registration from the data-
path MR registration.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
xprt_rdma_alloc_slot() and xprt_rdma_free_slot() lack serialization
between the buffer pool and the backlog queue. A buffer freed
after rpcrdma_buffer_get() finds the pool empty but before
rpc_sleep_on() places the task on the backlog is returned to the
pool with no waiter to wake, leaving the task stuck on the backlog
indefinitely.
After joining the backlog, re-check the pool and route any
recovered buffer through xprt_wake_up_backlog(), whose queue lock
serializes with concurrent wakeups and avoids double-assignment
of slots.
Because xprt_rdma_free_slot() does not hold reserve_lock, the
XPRT_CONGESTED double-check in xprt_throttle_congested() is
ineffective: a task can join the backlog through that path after
free_slot has already found it empty and cleared the bit. Avoid
this by using xprt_add_backlog_noncongested(), which queues the
task without setting XPRT_CONGESTED, so every allocation reaches
xprt_rdma_alloc_slot() and its post-sleep re-check.
Fixes: edb41e61a5 ("xprtrdma: Make rpc_rqst part of rpcrdma_req")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Commit a721035477 ("SUNRPC/xprt: async tasks mustn't block waiting
for memory") changed xprt_rdma_alloc_slot() to set tk_status to
-ENOMEM so that call_reserveresult() would sleep HZ/4 before
retrying. That rationale applies to xprt_dynamic_alloc_slot(),
where an immediate retry under memory pressure wastes CPU, but not
to the RDMA backlog path: a task woken from the backlog has a slot
waiting for it, so the 250 ms rpc_delay adds latency without
benefit.
This also aligns the code with the existing kernel-doc for
xprt_rdma_alloc_slot(), which already documented %-EAGAIN.
Fixes: a721035477 ("SUNRPC/xprt: async tasks mustn't block waiting for memory")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
rpcrdma_sendctx_get_locked() and rpcrdma_sendctx_put_locked() can
race in a way that leaves XPRT_WRITE_SPACE set permanently, blocking
all further sends on the transport:
get_locked put_locked (Send completion)
---------- --------------------------
read rb_sc_tail
-> ring full
advance rb_sc_tail
xprt_write_space():
test_bit(WRITE_SPACE)
-> not set, return
set_bit(WRITE_SPACE)
return NULL (-EAGAIN)
After the sender releases XPRT_LOCKED, the release path refuses to
wake the next task because XPRT_WRITE_SPACE is set. The sender
retries, finds XPRT_WRITE_SPACE still set, and sleeps on
xprt_sending. No further Send completions arrive to clear the flag
because no new Sends can be posted.
With nconnect, the stalled transport's share of congestion credits
are never returned, starving the remaining transports as well.
Fixes: 05eb06d866 ("xprtrdma: Fix occasional transport deadlock")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
kernfs has historically used const void * to pass around namespace tags
used for directory-level namespace filtering. The only current user of
this is sysfs network namespace tagging where struct net pointers are
cast to void *.
Replace all const void * namespace parameters with const struct
ns_common * throughout the kernfs, sysfs, and kobject namespace layers.
This includes the kobj_ns_type_operations callbacks, kobject_namespace(),
and all sysfs/kernfs APIs that accept or return namespace tags.
Passing struct ns_common is needed because various codepaths require
access to the underlying namespace. A struct ns_common can always be
converted back to the concrete namespace type (e.g., struct net) via
container_of() or to_ns_common() in the reverse direction.
This is a preparatory change for switching to ns_id-based directory
iteration to prevent a KASLR pointer leak through the current use of
raw namespace pointers as hash seeds and comparison keys.
Signed-off-by: Christian Brauner <brauner@kernel.org>
svc_rdma_build_read_segment() constructs RDMA Read sink
buffers by consuming pages one-at-a-time from rq_pages[]
and building one bvec per page. A 64KB NFS READ payload
produces 16 separate bvecs, 16 DMA mappings, and
potentially multiple RDMA Read WRs (on platforms with
4KB pages).
A single higher-order allocation followed by split_page()
yields physically contiguous memory while preserving
per-page refcounts. A single bvec spanning the contiguous
range causes rdma_rw_ctx_init_bvec() to take the
rdma_rw_init_single_wr_bvec() fast path: one DMA mapping,
one SGE, one WR.
The split sub-pages replace the original rq_pages[] entries,
so all downstream page tracking, completion handling, and
xdr_buf assembly remain unchanged.
Allocation uses __GFP_NORETRY | __GFP_NOWARN and falls back
through decreasing orders. If even order-1 fails, the
existing per-page path handles the segment.
When nr_pages is not a power of two, get_order() rounds up
and the allocation yields more pages than needed. The extra
split pages replace existing rq_pages[] entries (freed via
put_page() first), so there is no net increase in per-
request page consumption. Successive segments reuse the
same padding slots, preventing accumulation. The
rq_maxpages guard rejects any allocation that would
overrun the array, falling back to the per-page path.
Under memory pressure, __GFP_NORETRY causes the higher-
order allocation to fail without stalling.
The contiguous path is attempted when the segment starts
page-aligned (rc_pageoff == 0) and spans at least two
pages. NFS WRITE segments carry application-modified byte
ranges of arbitrary length, so the optimization is not
restricted to power-of-two page counts.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
svc_rqst_replace_page() releases displaced pages through a
per-rqst folio batch, but exposes the add-or-flush sequence
directly. svc_tcp_restore_pages() releases displaced pages
individually with put_page().
Introduce svc_rqst_page_release() to encapsulate the
batched release mechanism. Convert svc_rqst_replace_page()
and svc_tcp_restore_pages() to use it. The latter now
benefits from the same batched release that
svc_rqst_replace_page() already uses.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
svc_rdma_prepare_write_chunk() and svc_rdma_prepare_reply_chunk()
contain identical code for linking RDMA R/W work requests onto a
Send context's WR chain. This duplication increases maintenance
burden and risks divergent bug fixes.
Introduce svc_rdma_cc_link_wrs() to consolidate the WR chain
linking logic. The helper walks the chunk context's rwctxts list,
chains each WR via rdma_rw_ctx_wrs(), and updates the Send
context's chain head and SQE count. Completion signaling is
requested only for the tail WR (posted first).
No functional change.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Previously, Write chunk RDMA Writes were posted via a separate
ib_post_send() call with their own completion handler. Each Write
chunk incurred a doorbell and generated a completion event.
Link Write chunk WRs onto the RPC Reply's Send WR chain so that a
single ib_post_send() call posts both the RDMA Writes and the Send
WR. A single completion event signals that all operations have
finished. This reduces both doorbell rate and completion rate, as
well as eliminating the latency of a round-trip between the Write
chunk completion and the subsequent Send WR posting.
The lifecycle of Write chunk resources changes: previously, the
svc_rdma_write_done() completion handler released Write chunk
resources when RDMA Writes completed. With WR chaining, resources
remain live until the Send completion. A new sc_write_info_list
tracks Write chunk metadata attached to each Send context, and
svc_rdma_write_chunk_release() frees these resources when the
Send context is released.
The svc_rdma_write_done() handler now handles only error cases.
On success it returns immediately since the Send completion handles
resource release. On failure (WR flush), it closes the connection
to signal to the client that the RPC Reply is incomplete.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
I can't think of a reason why svcrdma is using the PD's device. Most
other consumers of the IB DMA API use the ib_device pointer from the
connection's rdma_cm_id.
I don't think there's any functional difference between the two, but
it is a little confusing to see some uses of rdma_cm_id and some of
ib_pd.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
I can't think of a reason why svcrdma is using the PD's device. Most
other consumers of the IB DMA API use the ib_device pointer from the
connection's rdma_cm_id.
I don't believe there's any functional difference between the two,
but it is a little confusing to see some uses of rdma_cm_id->device
and some of ib_pd->device.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
When the Send Queue fills, multiple threads may wait for SQ slots.
The previous implementation had no ordering guarantee, allowing
starvation when one thread repeatedly acquires slots while others
wait indefinitely.
Introduce a ticket-based fair queuing system. Each waiter takes a
ticket number and is served in FIFO order. This ensures forward
progress for all waiters when SQ capacity is constrained.
The implementation has two phases:
1. Fast path: attempt to reserve SQ slots without waiting
2. Slow path: take a ticket, wait for turn, then wait for slots
The ticket system adds two atomic counters to the transport:
- sc_sq_ticket_head: next ticket to issue
- sc_sq_ticket_tail: ticket currently being served
A dedicated wait queue (sc_sq_ticket_wait) handles ticket
ordering, separate from sc_send_wait which handles SQ capacity.
This separation ensures that send completions (the high-frequency
wake source) wake only the current ticket holder rather than all
queued waiters. Ticket handoff wakes only the ticket wait queue,
and each ticket holder that exits via connection close propagates
the wake to the next waiter in line.
When a waiter successfully reserves slots, it advances the tail
counter and wakes the next waiter. This creates an orderly handoff
that prevents starvation while maintaining good throughput on the
fast path when contention is low.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
svc_alloc_arg() invokes alloc_pages_bulk() with the full rq_maxpages
count (~259 for 1MB messages) for the rq_respages array, causing a
full-array scan despite most slots holding valid pages.
svc_rqst_release_pages() NULLs only the range
[rq_respages, rq_next_page)
after each RPC, so only that range contains NULL entries. Limit the
rq_respages fill in svc_alloc_arg() to that range instead of
scanning the full array.
svc_init_buffer() initializes rq_next_page to span the entire
rq_respages array, so the first svc_alloc_arg() call fills all
slots.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
The rq_pages array holds pages allocated for incoming RPC requests.
Two transport receive paths NULL entries in rq_pages to prevent
svc_rqst_release_pages() from freeing pages that the transport has
taken ownership of:
- svc_tcp_save_pages() moves partial request data pages to
svsk->sk_pages during multi-fragment TCP reassembly.
- svc_rdma_clear_rqst_pages() moves request data pages to
head->rc_pages because they are targets of active RDMA Read WRs.
A new rq_pages_nfree field in struct svc_rqst records how many
entries were NULLed. svc_alloc_arg() uses it to refill only those
entries rather than scanning the full rq_pages array. In steady
state, the transport NULLs a handful of entries per RPC, so the
allocator visits only those entries instead of the full ~259 slots
(for 1MB messages).
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
svc_rdma_save_io_pages() transfers response pages to the send
context and sets those slots to NULL. It then resets rq_next_page to
equal rq_respages, hiding the NULL region from
svc_rqst_release_pages().
Now that svc_rqst_release_pages() handles NULL entries, this reset
is no longer necessary. Removing it preserves the invariant that the
range [rq_respages, rq_next_page) accurately describes how many
response pages were consumed, enabling a subsequent optimization in
svc_alloc_arg() that refills only the consumed range.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
svc_rqst_release_pages() releases response pages between rq_respages
and rq_next_page. It currently passes the entire range to
release_pages(), which does not expect NULL entries.
A subsequent patch preserves the rq_next_page pointer in
svc_rdma_save_io_pages() so that it accurately records how many
response pages were consumed. After that change, the range
[rq_respages, rq_next_page)
can contain NULL entries where pages have already been transferred
to a send context.
Iterate through the range entry by entry, skipping NULLs, to handle
this case correctly.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
struct svc_rqst uses a single dynamically-allocated page array
(rq_pages) for both the incoming RPC Call message and the outgoing
RPC Reply message. rq_respages is a sliding pointer into rq_pages
that each transport receive path must compute based on how many
pages the Call consumed. This boundary tracking is a source of
confusion and bugs, and prevents an RPC transaction from having
both a large Call and a large Reply simultaneously.
Allocate rq_respages as its own page array, eliminating the boundary
arithmetic. This decouples Call and Reply buffer lifetimes,
following the precedent set by rq_bvec (a separate dynamically-
allocated array for I/O vectors).
Each svc_rqst now pins twice as many pages as before. For a server
running 16 threads with a 1MB maximum payload, the additional cost
is roughly 16MB of pinned memory. The new dynamic svc thread count
facility keeps this overhead minimal on an idle server. A subsequent
patch in this series limits per-request repopulation to only the
pages released during the previous RPC, avoiding a full-array scan
on each call to svc_alloc_arg().
Note: We've considered several alternatives to maintaining a full
second array. Each alternative reintroduces either boundary logic
complexity or I/O-path allocation pressure.
rq_next_page is initialized in svc_alloc_arg() and svc_process()
during Reply construction, and in svc_rdma_recvfrom() as a
precaution on error paths. Transport receive paths no longer compute
it from the Call size.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
svc_rqst_replace_page() builds the Reply buffer by advancing
rq_next_page through the response page range. The bounds
check validates rq_next_page against the full rq_pages array,
but the valid range for rq_next_page is
[rq_respages, rq_page_end].
Use those bounds instead.
This is correct today because rq_respages and rq_page_end
both point into rq_pages, and it prepares for a subsequent
change that separates the Reply page array from rq_pages.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Replace the single interleaved queue (which mixed cache_request and
cache_reader entries distinguished by a ->reader flag) with two
dedicated lists: cd->requests for upcall requests and cd->readers
for open file handles.
Readers now track their position via a monotonically increasing
sequence number (next_seqno) rather than by their position in the
shared list. Each cache_request is assigned a seqno when enqueued,
and a new cache_next_request() helper finds the next request at or
after a given seqno.
This eliminates the cache_queue wrapper struct entirely, simplifies
the reader-skipping loops in cache_read/cache_poll/cache_ioctl/
cache_release, and makes the data flow easier to reason about.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
The queue_wait waitqueue is currently a file-scoped global, so a
wake_up for one cache_detail wakes pollers on all caches. Convert it
to a per-cache-detail field so that only pollers on the relevant cache
are woken.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
The global queue_lock serializes all upcall queue operations across
every cache_detail instance. Convert it to a per-cache-detail spinlock
so that different caches (e.g. auth.unix.ip vs nfsd.fh) no longer
contend with each other on queue operations.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
RPC_IFDEBUG() is used in only two places. In one the user of
the definition is guarded by ifdeffery, in the second one
it's implied due to dprintk() usage. Kill the macro and move
the ifdeffery to the regular condition with the variable defined
inside, while in the second case add the same conditional and
move the respective code there.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Replace KUnit tests for memcmp() with KUNIT_EXPECT_MEMEQ_MSG() to improve
debugging that prints the hex dump of the buffers when the assertion fails,
whereas memcmp() only returns an integer difference.
Signed-off-by: Ryota Sakamoto <sakamo.ryota@gmail.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
1/ consistently use hlist_add_head_rcu() when adding to
the cachelist to reflect the fact that it can be concurrently
walked using RCU. In fact hlist_add_head() has all the needed
barriers so this is no safety issue, primarily a clarity issue.
2/ call cache_get() *before* adding the list with hlist_add_head_rcu().
It is generally safest to inc the refcount before publishing a
reference. In this case it doesn't have any behavioural effect
as code which does an RCU walk does not depend on precision of
the refcount, and it will always be at least one. But it looks
more correct to use this order.
3/ avoid possible races between NULL tests and hlist_entry_safe()
calls. It is possible that a test will find that .next or .head
is not NULL, but hlist_entry_safe() will find that it is NULL.
This can lead to incorrect behaviour with the list-walk terminating
early.
It is safest to always call hlist_entry_safe() and test the result.
Also simplify the *ppos calculation by simply assigning the hash
shifted 32, rather than masking out low bits and incrementing high
bits.
Signed-off-by: NeilBrown <neil@brown.name>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Issues that need expedient stable backports:
- Fix cache_request leak in cache_release()
- Fix heap overflow in the NFSv4.0 LOCK replay cache
- Hold net reference for the lifetime of /proc/fs/nfs/exports fd
- Defer sub-object cleanup in export "put" callbacks
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEKLLlsBKG3yQ88j7+M2qzM29mf5cFAmm7ELoACgkQM2qzM29m
f5fbMQ/6AjYdEQh56X2G1Y899zsvT4jfOZSc8dYjxK6seNZLQBCOz54w4aRo0TmP
keYIew8w2atCwWAlYT1xXqJVt90EG36fGodnw3EN+0g3nxPsIy1JeZwTUz1xagaI
hDbFwo6bN4HxU457/XxPO4jNdvpztq8hbTdRkXsD/Ckh2Db1juKkTQ+kX0rCxL5s
xZPDgKCsTQeFjfs+gdnbyEixc8vnQMAiUP15Df+HQdwCGD62meQ1S0BBVywRhCAK
FoufgPRnCzB189PKYCpivCNSImeSasQ4cS3WYi1i9ZB3OvEzRnqaPAvvRWQTwWfs
7IIekorKagCvXbqEt3dMQn7UaVyFLgV8OMR04JGqpI05GylNBQVONty/BKzQVTdH
Hp2C9PCitoPC68UabQZ22rCH8zpMREk+sH785ztLyuKGgC09YLTkxrltHllzKWAQ
k5DkeTmySVeobpif4urQKHyxhWZ//ah0MJOsSE4XcPMCWk7RPshj4tZyzvXdbuR1
IZQbOSruUd9aaZ4Q9J8D66oVyBatq9RFP4yxxR7L3CLSXJUsWK0AriEY9EZAeUe7
GeOaiUJ34F2oE4FfF9XaTmsXG9EuXtps6PlYDlHjlSyRJyg3detTJP4YeKJCrlQC
x+x7DN5gN2ZUuR+vqlS1BWGm24usmeNBPqvZ2hi6d+NpPgcLoUk=
=xX5n
-----END PGP SIGNATURE-----
Merge tag 'nfsd-7.0-2' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux
Pull nfsd fixes from Chuck Lever:
- Fix cache_request leak in cache_release()
- Fix heap overflow in the NFSv4.0 LOCK replay cache
- Hold net reference for the lifetime of /proc/fs/nfs/exports fd
- Defer sub-object cleanup in export "put" callbacks
* tag 'nfsd-7.0-2' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux:
nfsd: fix heap overflow in NFSv4.0 LOCK replay cache
sunrpc: fix cache_request leak in cache_release
NFSD: Hold net reference for the lifetime of /proc/fs/nfs/exports fd
NFSD: Defer sub-object cleanup in export put callbacks
When a reader's file descriptor is closed while in the middle of reading
a cache_request (rp->offset != 0), cache_release() decrements the
request's readers count but never checks whether it should free the
request.
In cache_read(), when readers drops to 0 and CACHE_PENDING is clear, the
cache_request is removed from the queue and freed along with its buffer
and cache_head reference. cache_release() lacks this cleanup.
The only other path that frees requests with readers == 0 is
cache_dequeue(), but it runs only when CACHE_PENDING transitions from
set to clear. If that transition already happened while readers was
still non-zero, cache_dequeue() will have skipped the request, and no
subsequent call will clean it up.
Add the same cleanup logic from cache_read() to cache_release(): after
decrementing readers, check if it reached 0 with CACHE_PENDING clear,
and if so, dequeue and free the cache_request.
Reported-by: NeilBrown <neilb@ownmail.net>
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Cc: stable@kernel.org
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
In the event that rpcrdma_post_recvs() fails to create a work request
(due to memory allocation failure, say) or otherwise exits early, we
should decrement ep->re_receiving before returning. Otherwise we will
hang in rpcrdma_xprt_drain() as re_receiving will never reach zero and
the completion will never be triggered.
On a system with high memory pressure, this can appear as the following
hung task:
INFO: task kworker/u385:17:8393 blocked for more than 122 seconds.
Tainted: G S E 6.19.0 #3
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:kworker/u385:17 state:D stack:0 pid:8393 tgid:8393 ppid:2 task_flags:0x4248060 flags:0x00080000
Workqueue: xprtiod xprt_autoclose [sunrpc]
Call Trace:
<TASK>
__schedule+0x48b/0x18b0
? ib_post_send_mad+0x247/0xae0 [ib_core]
schedule+0x27/0xf0
schedule_timeout+0x104/0x110
__wait_for_common+0x98/0x180
? __pfx_schedule_timeout+0x10/0x10
wait_for_completion+0x24/0x40
rpcrdma_xprt_disconnect+0x444/0x460 [rpcrdma]
xprt_rdma_close+0x12/0x40 [rpcrdma]
xprt_autoclose+0x5f/0x120 [sunrpc]
process_one_work+0x191/0x3e0
worker_thread+0x2e3/0x420
? __pfx_worker_thread+0x10/0x10
kthread+0x10d/0x230
? __pfx_kthread+0x10/0x10
ret_from_fork+0x273/0x2b0
? __pfx_kthread+0x10/0x10
ret_from_fork_asm+0x1a/0x30
Fixes: 15788d1d10 ("xprtrdma: Do not refresh Receive Queue while it is draining")
Signed-off-by: Eric Badger <ebadger@purestorage.com>
Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
This converts some of the visually simpler cases that have been split
over multiple lines. I only did the ones that are easy to verify the
resulting diff by having just that final GFP_KERNEL argument on the next
line.
Somebody should probably do a proper coccinelle script for this, but for
me the trivial script actually resulted in an assertion failure in the
middle of the script. I probably had made it a bit _too_ trivial.
So after fighting that far a while I decided to just do some of the
syntactically simpler cases with variations of the previous 'sed'
scripts.
The more syntactically complex multi-line cases would mostly really want
whitespace cleanup anyway.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This is the exact same thing as the 'alloc_obj()' version, only much
smaller because there are a lot fewer users of the *alloc_flex()
interface.
As with alloc_obj() version, this was done entirely with mindless brute
force, using the same script, except using 'flex' in the pattern rather
than 'objs*'.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This was done entirely with mindless brute force, using
git grep -l '\<k[vmz]*alloc_objs*(.*, GFP_KERNEL)' |
xargs sed -i 's/\(alloc_objs*(.*\), GFP_KERNEL)/\1)/'
to convert the new alloc_obj() users that had a simple GFP_KERNEL
argument to just drop that argument.
Note that due to the extreme simplicity of the scripting, any slightly
more complex cases spread over multiple lines would not be triggered:
they definitely exist, but this covers the vast bulk of the cases, and
the resulting diff is also then easier to check automatically.
For the same reason the 'flex' versions will be done as a separate
conversion.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This is the result of running the Coccinelle script from
scripts/coccinelle/api/kmalloc_objs.cocci. The script is designed to
avoid scalar types (which need careful case-by-case checking), and
instead replace kmalloc-family calls that allocate struct or union
object instances:
Single allocations: kmalloc(sizeof(TYPE), ...)
are replaced with: kmalloc_obj(TYPE, ...)
Array allocations: kmalloc_array(COUNT, sizeof(TYPE), ...)
are replaced with: kmalloc_objs(TYPE, COUNT, ...)
Flex array allocations: kmalloc(struct_size(PTR, FAM, COUNT), ...)
are replaced with: kmalloc_flex(*PTR, FAM, COUNT, ...)
(where TYPE may also be *VAR)
The resulting allocations no longer return "void *", instead returning
"TYPE *".
Signed-off-by: Kees Cook <kees@kernel.org>
New Features:
* Use an LRU list for returning unused delegations
* Introduce a KConfig option to disable NFS v4.0 and make NFS v4.1 the default
Bugfixes:
* NFS/localio: Handle short writes by retrying
* NFS/localio: prevent direct reclaim recursion into NFS via nfs_writepages
* NFS/localio: use GFP_NOIO and non-memreclaim workqueue in nfs_local_commit
* NFS/localio: remove -EAGAIN handling in nfs_local_doio()
* pNFS: fix a missing wake up while waiting on NFS_LAYOUT_DRAIN
* fs/nfs: Fix a readdir slow-start regression
* SUNRPC: fix gss_auth kref leak in gss_alloc_msg error path
Other Cleanups and Improvements:
* A few other NFS/localio cleanups
* Various other delegation handling cleanups from Christoph
* Unify security_inode_listsecurity() calls
* Improvements to NFSv4 lease handling
* Clean up SUNRPC *_debug fields when CONFIG_SUNRPC_DEBUG is not set
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEnZ5MQTpR7cLU7KEp18tUv7ClQOsFAmmORX0ACgkQ18tUv7Cl
QOtFORAAyCwTst5iEPRJ9rKZ/Kl39zHbA/QUn3CmmVkGlOBj0j7mWRyU5X0vlIQ9
mUF3Ikm1XYpsxPTKBEELVumPkggT2nfsFx5518BrpRTODibzc/CZ10/z7q4qarvI
UhdFlt9SRG4RhhOdAaThF6XVUsRSwGwVZo/YyYemCc/evjNVyXa0wfwbDl9l4Nzr
1Sxt2/zeq3Eu4IfrxQpFM+0UuSScmVODSe8Jm4GnmlU/Q7x+onW35IvyuzTkgDwG
8PAeH4b5uADY9VWnTHpvr1fQNnBoEw8b4qr9a7AXQKRIcPGMvgKkdK+f6hOh1cEs
+O+L4+uixo7QXudnWC27brZSyHwDIVVaJGPF/kNv4O2GKDyEcbsHtQv2G1+1+PtR
FCtRFGpLq2pZxb9SY/s73FKp6a8bd81FAtzAL7iYU+2FDtvEDKss1nG6sQNG1+Z4
G8rI79PoimR4I6Jr5hk4sl8pM8wJVLZdcW+ytrEKl9FC+rFDrP9lVzHYArTFgIky
N/IjEflejRfZ9bYIZ9/CYnFZC3Htrm8K9zerCRDsf96tvhxkX8FZM8tuZpHEMIbx
Cx8XKCk+ubqLIF2mT+FKOc5T6CUmMiGRNagLkx0h0mbvRSI8HTpgQZGrbYkMk0Hs
abUhvH73pRi0LRvkzPHfcNaZ7Y/mFBYfwBMwTUWJzh6CEgXnpks=
=CwZq
-----END PGP SIGNATURE-----
Merge tag 'nfs-for-7.0-1' of git://git.linux-nfs.org/projects/anna/linux-nfs
Pull NFS client updates from Anna Schumaker:
"New Features:
- Use an LRU list for returning unused delegations
- Introduce a KConfig option to disable NFS v4.0 and make NFS v4.1
the default
Bugfixes:
- NFS/localio:
- Handle short writes by retrying
- Prevent direct reclaim recursion into NFS via nfs_writepages
- Use GFP_NOIO and non-memreclaim workqueue in nfs_local_commit
- Remove -EAGAIN handling in nfs_local_doio()
- pNFS: fix a missing wake up while waiting on NFS_LAYOUT_DRAIN
- fs/nfs: Fix a readdir slow-start regression
- SUNRPC: fix gss_auth kref leak in gss_alloc_msg error path
Other cleanups and improvements:
- A few other NFS/localio cleanups
- Various other delegation handling cleanups from Christoph
- Unify security_inode_listsecurity() calls
- Improvements to NFSv4 lease handling
- Clean up SUNRPC *_debug fields when CONFIG_SUNRPC_DEBUG is not set"
* tag 'nfs-for-7.0-1' of git://git.linux-nfs.org/projects/anna/linux-nfs: (60 commits)
SUNRPC: fix gss_auth kref leak in gss_alloc_msg error path
nfs: nfs4proc: Convert comma to semicolon
SUNRPC: Change list definition method
sunrpc: rpc_debug and others are defined even if CONFIG_SUNRPC_DEBUG unset
NFSv4: limit lease period in nfs4_set_lease_period()
NFSv4: pass lease period in seconds to nfs4_set_lease_period()
nfs: unify security_inode_listsecurity() calls
fs/nfs: Fix readdir slow-start regression
pNFS: fix a missing wake up while waiting on NFS_LAYOUT_DRAIN
NFS: fix delayed delegation return handling
NFS: simplify error handling in nfs_end_delegation_return
NFS: fold nfs_abort_delegation_return into nfs_end_delegation_return
NFS: remove the delegation == NULL check in nfs_end_delegation_return
NFS: use bool for the issync argument to nfs_end_delegation_return
NFS: return void from ->return_delegation
NFS: return void from nfs4_inode_make_writeable
NFS: Merge CONFIG_NFS_V4_1 with CONFIG_NFS_V4
NFS: Add a way to disable NFS v4.0 via KConfig
NFS: Move sequence slot operations into minorversion operations
NFS: Pass a struct nfs_client to nfs4_init_sequence()
...
Usual smallish cycle:
- Various code improvements in irdma, rtrs, qedr, ocrdma, irdma, rxe
- Small driver improvements and minor bug fixes to hns, mlx5, rxe, mana,
mlx5, irdma
- Robusness improvements in completion processing for EFA
- New query_port_speed() verb to move past limited IBA defined speed steps
- Support for SG_GAPS in rts and many other small improvements
- Rare list corruption fix in iwcm
- Better support different page sizes in rxe
- Device memory support for mana
- Direct bio vec to kernel MR for use by NFS-RDMA
- QP rate limiting for bnxt_re
- Remote triggerable NULL pointer crash in siw
- DMA-buf exporter support for RDMA mmaps like doorbells
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRRRCHOFoQz/8F5bUaFwuHvBreFYQUCaY44vgAKCRCFwuHvBreF
YfiZAP91cMZfogN7r1FMD75xDZu55dI3Jvy8OaixyRxlWLGPcQEAjritdL0o7fZp
YrD1OXNS/1XG//rPBVw7xj+54Aa8hAU=
=AVcu
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma
Pull rdma updates from Jason Gunthorpe:
"Usual smallish cycle. The NFS biovec work to push it down into RDMA
instead of indirecting through a scatterlist is pretty nice to see,
been talked about for a long time now.
- Various code improvements in irdma, rtrs, qedr, ocrdma, irdma, rxe
- Small driver improvements and minor bug fixes to hns, mlx5, rxe,
mana, mlx5, irdma
- Robusness improvements in completion processing for EFA
- New query_port_speed() verb to move past limited IBA defined speed
steps
- Support for SG_GAPS in rts and many other small improvements
- Rare list corruption fix in iwcm
- Better support different page sizes in rxe
- Device memory support for mana
- Direct bio vec to kernel MR for use by NFS-RDMA
- QP rate limiting for bnxt_re
- Remote triggerable NULL pointer crash in siw
- DMA-buf exporter support for RDMA mmaps like doorbells"
* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (66 commits)
RDMA/mlx5: Implement DMABUF export ops
RDMA/uverbs: Add DMABUF object type and operations
RDMA/uverbs: Support external FD uobjects
RDMA/siw: Fix potential NULL pointer dereference in header processing
RDMA/umad: Reject negative data_len in ib_umad_write
IB/core: Extend rate limit support for RC QPs
RDMA/mlx5: Support rate limit only for Raw Packet QP
RDMA/bnxt_re: Report QP rate limit in debugfs
RDMA/bnxt_re: Report packet pacing capabilities when querying device
RDMA/bnxt_re: Add support for QP rate limiting
MAINTAINERS: Drop RDMA files from Hyper-V section
RDMA/uverbs: Add __GFP_NOWARN to ib_uverbs_unmarshall_recv() kmalloc
svcrdma: use bvec-based RDMA read/write API
RDMA/core: add rdma_rw_max_sge() helper for SQ sizing
RDMA/core: add MR support for bvec-based RDMA operations
RDMA/core: use IOVA-based DMA mapping for bvec RDMA operations
RDMA/core: add bio_vec based RDMA read/write API
RDMA/irdma: Use kvzalloc for paged memory DMA address array
RDMA/rxe: Fix race condition in QP timer handlers
RDMA/mana_ib: Add device‑memory support
...
Total patches: 107
Reviews/patch: 1.07
Reviewed rate: 67%
- The 2 patch series "ocfs2: give ocfs2 the ability to reclaim
suballocator free bg" from Heming Zhao saves disk space by teaching
ocfs2 to reclaim suballocator block group space.
- The 4 patch series "Add ARRAY_END(), and use it to fix off-by-one
bugs" from Alejandro Colomar adds the ARRAY_END() macro and uses it in
various places.
- The 2 patch series "vmcoreinfo: support VMCOREINFO_BYTES larger than
PAGE_SIZE" from Pnina Feder makes the vmcore code future-safe, if
VMCOREINFO_BYTES ever exceeds the page size.
- The 7 patch series "kallsyms: Prevent invalid access when showing
module buildid" from Petr Mladek cleans up kallsyms code related to
module buildid and fixes an invalid access crash when printing
backtraces.
- The 3 patch series "Address page fault in
ima_restore_measurement_list()" from Harshit Mogalapalli fixes a
kexec-related crash that can occur when booting the second-stage kernel
on x86.
- The 6 patch series "kho: ABI headers and Documentation updates" from
Mike Rapoport updates the kexec handover ABI documentation.
- The 4 patch series "Align atomic storage" from Finn Thain adds the
__aligned attribute to atomic_t and atomic64_t definitions to get
natural alignment of both types on csky, m68k, microblaze, nios2,
openrisc and sh.
- The 2 patch series "kho: clean up page initialization logic" from
Pratyush Yadav simplifies the page initialization logic in
kho_restore_page().
- The 6 patch series "Unload linux/kernel.h" from Yury Norov moves
several things out of kernel.h and into more appropriate places.
- The 7 patch series "don't abuse task_struct.group_leader" from Oleg
Nesterov removes the usage of ->group_leader when it is "obviously
unnecessary".
- The 5 patch series "list private v2 & luo flb" from Pasha Tatashin
adds some infrastructure improvements to the live update orchestrator.
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCaY4giAAKCRDdBJ7gKXxA
jgusAQDnKkP8UWTqXPC1jI+OrDJGU5ciAx8lzLeBVqMKzoYk9AD/TlhT2Nlx+Ef6
0HCUHUD0FMvAw/7/Dfc6ZKxwBEIxyww=
=mmsH
-----END PGP SIGNATURE-----
Merge tag 'mm-nonmm-stable-2026-02-12-10-48' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull non-MM updates from Andrew Morton:
- "ocfs2: give ocfs2 the ability to reclaim suballocator free bg" saves
disk space by teaching ocfs2 to reclaim suballocator block group
space (Heming Zhao)
- "Add ARRAY_END(), and use it to fix off-by-one bugs" adds the
ARRAY_END() macro and uses it in various places (Alejandro Colomar)
- "vmcoreinfo: support VMCOREINFO_BYTES larger than PAGE_SIZE" makes
the vmcore code future-safe, if VMCOREINFO_BYTES ever exceeds the
page size (Pnina Feder)
- "kallsyms: Prevent invalid access when showing module buildid" cleans
up kallsyms code related to module buildid and fixes an invalid
access crash when printing backtraces (Petr Mladek)
- "Address page fault in ima_restore_measurement_list()" fixes a
kexec-related crash that can occur when booting the second-stage
kernel on x86 (Harshit Mogalapalli)
- "kho: ABI headers and Documentation updates" updates the kexec
handover ABI documentation (Mike Rapoport)
- "Align atomic storage" adds the __aligned attribute to atomic_t and
atomic64_t definitions to get natural alignment of both types on
csky, m68k, microblaze, nios2, openrisc and sh (Finn Thain)
- "kho: clean up page initialization logic" simplifies the page
initialization logic in kho_restore_page() (Pratyush Yadav)
- "Unload linux/kernel.h" moves several things out of kernel.h and into
more appropriate places (Yury Norov)
- "don't abuse task_struct.group_leader" removes the usage of
->group_leader when it is "obviously unnecessary" (Oleg Nesterov)
- "list private v2 & luo flb" adds some infrastructure improvements to
the live update orchestrator (Pasha Tatashin)
* tag 'mm-nonmm-stable-2026-02-12-10-48' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (107 commits)
watchdog/hardlockup: simplify perf event probe and remove per-cpu dependency
procfs: fix missing RCU protection when reading real_parent in do_task_stat()
watchdog/softlockup: fix sample ring index wrap in need_counting_irqs()
kcsan, compiler_types: avoid duplicate type issues in BPF Type Format
kho: fix doc for kho_restore_pages()
tests/liveupdate: add in-kernel liveupdate test
liveupdate: luo_flb: introduce File-Lifecycle-Bound global state
liveupdate: luo_file: Use private list
list: add kunit test for private list primitives
list: add primitives for private list manipulations
delayacct: fix uapi timespec64 definition
panic: add panic_force_cpu= parameter to redirect panic to a specific CPU
netclassid: use thread_group_leader(p) in update_classid_task()
RDMA/umem: don't abuse current->group_leader
drm/pan*: don't abuse current->group_leader
drm/amd: kill the outdated "Only the pthreads threading model is supported" checks
drm/amdgpu: don't abuse current->group_leader
android/binder: use same_thread_group(proc->tsk, current) in binder_mmap()
android/binder: don't abuse current->group_leader
kho: skip memoryless NUMA nodes when reserving scratch areas
...
Commit 5940d1cf9f ("SUNRPC: Rebalance a kref in auth_gss.c") added
a kref_get(&gss_auth->kref) call to balance the gss_put_auth() done
in gss_release_msg(), but forgot to add a corresponding kref_put()
on the error path when kstrdup_const() fails.
If service_name is non-NULL and kstrdup_const() fails, the function
jumps to err_put_pipe_version which calls put_pipe_version() and
kfree(gss_msg), but never releases the gss_auth reference. This leads
to a kref leak where the gss_auth structure is never freed.
Add a forward declaration for gss_free_callback() and call kref_put()
in the err_put_pipe_version error path to properly release the
reference taken earlier.
Fixes: 5940d1cf9f ("SUNRPC: Rebalance a kref in auth_gss.c")
Cc: stable@vger.kernel.org
Signed-off-by: Daniel Hodges <git@danielhodges.dev>
Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
The LIST_HEAD macro can both define a linked list and initialize
it in one step. To simplify code, we replace the separate operations
of linked list definition and manual initialization with the LIST_HEAD
macro.
Signed-off-by: Chenguang Zhao <zhaochenguang@kylinos.cn>
Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
To dynamically adjust the thread count, nfsd requires some information
about how busy things are.
Change svc_recv() to take a timeout value, and then allow the wait for
work to time out if it's set. If a timeout is not defined, then the
schedule will be set to MAX_SCHEDULE_TIMEOUT. If the task waits for the
full timeout, then have it return -ETIMEDOUT to the caller.
If it wakes up, finds that there is more work and that no threads are
available, then attempt to set SP_TASK_STARTING. If wasn't already set,
have the task return -EBUSY to cue to the caller that the service could
use more threads.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Break out the part of svc_start_kthreads() that creates a thread into
svc_new_thread(), as a new exported helper function.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Add a new pool->sp_nrthrmin field to track the minimum number of threads
in a pool. Add min_threads parameters to both svc_set_num_threads() and
svc_set_pool_threads(). If min_threads is non-zero and less than the
max, svc_set_num_threads() will ensure that the number of running
threads is between the min and the max.
If the min is 0 or greater than the max, then it is ignored, and the
maximum number of threads will be started, and never spun down.
For now, the min_threads is always 0, but a later patch will pass the
proper value through from nfsd.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
The kernel currently tracks the number of threads running in a pool in
the "sp_nrthreads" field. In the future, where threads are dynamically
spun up and down, it'll be necessary to keep track of the maximum number
of requested threads separately from the actual number running.
Add a pool->sp_nrthrmax parameter to track this. When userland changes
the number of threads in a pool, update that value accordingly.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Now that svc_set_num_threads() handles distributing the threads among
the available pools, remove the special handling of a NULL pool pointer
from svc_start_kthreads() and svc_stop_kthreads().
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
svc_set_num_threads() will set the number of running threads for a given
pool. If the pool argument is set to NULL however, it will distribute
the threads among all of the pools evenly.
These divergent codepaths complicate the move to dynamic threading.
Simplify the API by splitting these two cases into different helpers:
Add a new svc_set_pool_threads() function that sets the number of
threads in a single, given pool. Modify svc_set_num_threads() to
distribute the threads evenly between all of the pools and then call
svc_set_pool_threads() for each.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Convert svcrdma to the bvec-based RDMA API introduced earlier in
this series.
The bvec-based RDMA API eliminates the intermediate scatterlist
conversion step, allowing direct DMA mapping from bio_vec arrays.
This simplifies the svc_rdma_rw_ctxt structure by removing the
chained SG table management.
The structure retains an inline array approach similar to the
previous scatterlist implementation: an inline bvec array sized
to max_send_sge handles most I/O operations without additional
allocation. Larger requests fall back to dynamic allocation.
This preserves the allocation-free fast path for typical NFS
operations while supporting arbitrarily large transfers.
The bvec API handles all device types internally, including iWARP
devices which require memory registration. No explicit fallback
path is needed.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Link: https://patch.msgid.link/20260128005400.25147-6-cel@kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
svc_rdma_accept() computes sc_sq_depth as the sum of rq_depth and the
number of rdma_rw contexts (ctxts). This value is used to allocate the
Send CQ and to initialize the sc_sq_avail credit pool.
However, when the device uses memory registration for RDMA operations,
rdma_rw_init_qp() inflates the QP's max_send_wr by a factor of three
per context to account for REG and INV work requests. The Send CQ and
credit pool remain sized for only one work request per context,
causing Send Queue exhaustion under heavy NFS WRITE workloads.
Introduce rdma_rw_max_sge() to compute the actual number of Send Queue
entries required for a given number of rdma_rw contexts. Upper layer
protocols call this helper before creating a Queue Pair so that their
Send CQs and credit accounting match the QP's true capacity.
Update svc_rdma_accept() to use rdma_rw_max_sge() when computing
sc_sq_depth, ensuring the credit pool reflects the work requests
that rdma_rw_init_qp() will reserve.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Fixes: 00bd1439f4 ("RDMA/rw: Support threshold for registration vs scattering to local pages")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Link: https://patch.msgid.link/20260128005400.25147-5-cel@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
The gssx_dec_ctx(), gssx_dec_status(), and gssx_dec_name()
functions allocate memory via gssx_dec_buffer(), which calls
kmemdup(). When a subsequent decode operation fails, these
functions return immediately without freeing previously
allocated buffers, causing memory leaks.
The leak in gssx_dec_ctx() is particularly relevant because
the caller (gssp_accept_sec_context_upcall) initializes several
buffer length fields to non-zero values, resulting in memory
allocation:
struct gssx_ctx rctxh = {
.exported_context_token.len = GSSX_max_output_handle_sz,
.mech.len = GSS_OID_MAX_LEN,
.src_name.display_name.len = GSSX_max_princ_sz,
.targ_name.display_name.len = GSSX_max_princ_sz
};
If, for example, gssx_dec_name() succeeds for src_name but
fails for targ_name, the memory allocated for
exported_context_token, mech, and src_name.display_name
remains unreferenced and cannot be reclaimed.
Add error handling with goto-based cleanup to free any
previously allocated buffers before returning an error.
Reported-by: Xingjing Deng <micro6947@gmail.com>
Closes: https://lore.kernel.org/linux-nfs/CAK+ZN9qttsFDu6h1FoqGadXjMx1QXqPMoYQ=6O9RY4SxVTvKng@mail.gmail.com/
Fixes: 1d658336b0 ("SUNRPC: Add RPC based upcall mechanism for RPCGSS auth")
Cc: stable@vger.kernel.org
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>