Benjamin Coddington contributed filehandle signing to defend against
filehandle-guessing attacks. The server now appends a SipHash-2-4
MAC to each filehandle when the new "sign_fh" export option is
enabled. NFSD then verifies filehandles received from clients
against the expected MAC; mismatches return NFS error STALE.
Chuck Lever converted the entire NLMv4 server-side XDR layer from
hand-written C to xdrgen-generated code, spanning roughly thirty
patches. XDR functions are generally boilerplate code and are easy
to get wrong. The goals of this conversion are improved memory
safety, lower maintenance burden, and groundwork for eventual Rust
code generation for these functions.
Dai Ngo improved pNFS block/SCSI layout robustness with two related
changes. SCSI persistent reservation fencing is now tracked per
client and per device via an xarray, to avoid both redundant preempt
operations on devices already fenced and a potential NFSD deadlock
when all nfsd threads are waiting for a layout return.
The remaining patches deliver scalability and infrastructure
improvements. Sincere thanks to all contributors, reviewers,
testers, and bug reporters who participated in the v7.1 NFSD
development cycle.
Merge tag 'nfsd-7.1' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux
Pull nfsd updates from Chuck Lever:
- filehandle signing to defend against filehandle-guessing attacks
(Benjamin Coddington)
The server now appends a SipHash-2-4 MAC to each filehandle when
the new "sign_fh" export option is enabled. NFSD then verifies
filehandles received from clients against the expected MAC;
mismatches return NFS error STALE
- convert the entire NLMv4 server-side XDR layer from hand-written C to
xdrgen-generated code, spanning roughly thirty patches (Chuck Lever)
XDR functions are generally boilerplate code and are easy to get
wrong. The goals of this conversion are improved memory safety, lower
maintenance burden, and groundwork for eventual Rust code generation
for these functions.
- improve pNFS block/SCSI layout robustness with two related changes
(Dai Ngo)
SCSI persistent reservation fencing is now tracked per client and
per device via an xarray, to avoid both redundant preempt operations
on devices already fenced and a potential NFSD deadlock when all nfsd
threads are waiting for a layout return.
- scalability and infrastructure improvements
Sincere thanks to all contributors, reviewers, testers, and bug
reporters who participated in the v7.1 NFSD development cycle.
* tag 'nfsd-7.1' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux: (83 commits)
NFSD: Docs: clean up pnfs server timeout docs
nfsd: fix comment typo in nfsxdr
nfsd: fix comment typo in nfs3xdr
NFSD: convert callback RPC program to per-net namespace
NFSD: use per-operation statidx for callback procedures
svcrdma: Use contiguous pages for RDMA Read sink buffers
SUNRPC: Add svc_rqst_page_release() helper
SUNRPC: xdr.h: fix all kernel-doc warnings
svcrdma: Factor out WR chain linking into helper
svcrdma: Add Write chunk WRs to the RPC's Send WR chain
svcrdma: Clean up use of rdma->sc_pd->device
svcrdma: Clean up use of rdma->sc_pd->device in Receive paths
svcrdma: Add fair queuing for Send Queue access
SUNRPC: Optimize rq_respages allocation in svc_alloc_arg
SUNRPC: Track consumed rq_pages entries
svcrdma: preserve rq_next_page in svc_rdma_save_io_pages
SUNRPC: Handle NULL entries in svc_rqst_release_pages
SUNRPC: Allocate a separate Reply page array
SUNRPC: Tighten bounds checking in svc_rqst_replace_page
NFSD: Sign filehandles
...
Please consider pulling these changes from the signed vfs-7.1-rc1.kino tag.
Thanks!
Christian
Merge tag 'vfs-7.1-rc1.kino' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs i_ino updates from Christian Brauner:
"For historical reasons, the inode->i_ino field is an unsigned long,
which means that it's 32 bits on 32 bit architectures. This has caused
a number of filesystems to implement hacks to hash a 64-bit identifier
into a 32-bit field, and deprives us of a universal identifier field
for an inode.
This changes the inode->i_ino field from an unsigned long to a u64.
This shouldn't make any material difference on 64-bit hosts, but
32-bit hosts will see struct inode grow by at least 4 bytes. This
could have effects on slabcache sizes and field alignment.
The bulk of the changes are to format strings and tracepoints, since
the kernel itself doesn't care that much about the i_ino field. The
first patch changes some vfs function arguments, so check that one out
carefully.
With this change, we may be able to shrink some inode structures. For
instance, struct nfs_inode has a fileid field that holds the 64-bit
inode number. With this set of changes, that field could be
eliminated. I'd rather leave that sort of cleanup for later just to
keep this simple"
* tag 'vfs-7.1-rc1.kino' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
nilfs2: fix 64-bit division operations in nilfs_bmap_find_target_in_group()
EVM: add comment describing why ino field is still unsigned long
vfs: remove externs from fs.h on functions modified by i_ino widening
treewide: fix missed i_ino format specifier conversions
ext4: fix signed format specifier in ext4_load_inode trace event
treewide: change inode->i_ino from unsigned long to u64
nilfs2: widen trace event i_ino fields to u64
f2fs: widen trace event i_ino fields to u64
ext4: widen trace event i_ino fields to u64
zonefs: widen trace event i_ino fields to u64
hugetlbfs: widen trace event i_ino fields to u64
ext2: widen trace event i_ino fields to u64
cachefiles: widen trace event i_ino fields to u64
vfs: widen trace event i_ino fields to u64
net: change sock.sk_ino and sock_i_ino() to u64
audit: widen ino fields to u64
vfs: widen inode hash/lookup functions to u64
Please consider pulling these changes from the signed vfs-7.1-rc1.directory tag.
Thanks!
Christian
Merge tag 'vfs-7.1-rc1.directory' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs directory updates from Christian Brauner:
"Recently 'start_creating', 'start_removing', 'start_renaming' and
related interfaces were added which combine the locking and the
lookup.
At that time many callers were changed to use the new interfaces.
However there is still an assortment of places outside of the core
vfs where the directory is locked explicitly, whether with inode_lock()
or lock_rename() or similar. These were missed in the first pass for
an assortment of uninteresting reasons.
This addresses the remaining places where explicit locking is used,
and changes them to use the new interfaces, or otherwise removes the
explicit locking.
The biggest changes are in overlayfs. The other changes are quite
simple, though the cachefiles change is perhaps the least simple of
those"
* tag 'vfs-7.1-rc1.directory' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
VFS: unexport lock_rename(), lock_rename_child(), unlock_rename()
ovl: remove ovl_lock_rename_workdir()
ovl: use is_subdir() for testing if one thing is a subdir of another
ovl: change ovl_create_real() to get a new lock when re-opening created file.
ovl: pass name buffer to ovl_start_creating_temp()
cachefiles: change cachefiles_bury_object to use start_renaming_dentry()
ovl: Simplify ovl_lookup_real_one()
VFS: make lookup_one_qstr_excl() static.
nfsd: switch purge_old() to use start_removing_noperm()
selinux: Use simple_start_creating() / simple_done_creating()
Apparmor: Use simple_start_creating() / simple_done_creating()
libfs: change simple_done_creating() to use end_creating()
VFS: move the start_dirop() kerndoc comment to before start_dirop()
fs/proc: Don't lock root inode when creating "self" and "thread-self"
VFS: note error returns in documentation for various lookup functions
The file contains a spelling error in a source comment (occured).
Typos in comments reduce readability and make text searches less reliable
for developers and maintainers.
Replace 'occured' with 'occurred' in the affected comment. This is a
comment-only cleanup and does not change behavior.
Signed-off-by: Joseph Salisbury <joseph.salisbury@oracle.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
The file contains a spelling error in a source comment (occured).
Typos in comments reduce readability and make text searches less reliable
for developers and maintainers.
Replace 'occured' with 'occurred' in the affected comment. This is a
comment-only cleanup and does not change behavior.
Signed-off-by: Joseph Salisbury <joseph.salisbury@oracle.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
The callback channel's rpc_program, rpc_version, rpc_stat,
and per-procedure counts are declared as file-scope statics in
nfs4callback.c, shared across all network namespaces.
Forechannel RPC statistics are already maintained per-netns
(via nfsd_svcstats in struct nfsd_net); the backchannel
has no such separation. When backchannel statistics are
eventually surfaced to userspace, the global counters would
expose cross-namespace data.
Allocate per-netns copies of these structures through a new
opaque struct nfsd_net_cb, managed by nfsd_net_cb_init()
and nfsd_net_cb_shutdown(). The struct definition is private
to nfs4callback.c; struct nfsd_net holds only a pointer.
Signed-off-by: Dai Ngo <dai.ngo@oracle.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
The callback RPC procedure table uses NFSPROC4_CB_##call for
p_statidx, which maps CB_NULL to index 0 and every
compound-based callback (CB_RECALL, CB_LAYOUT, CB_OFFLOAD,
etc.) to index 1. All compound callback operations therefore
share a single statistics counter, making per-operation
accounting impossible.
Assign p_statidx from the NFSPROC4_CLNT_##proc enum instead,
giving each callback operation its own counter slot. The
counts array is already sized by ARRAY_SIZE(nfs4_cb_procedures),
so no allocation change is needed.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
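The effect of the p_statidx change described above can be illustrated
with a small user-space sketch. This is not the NFSD code: the enum,
the helper names, and the reduced set of procedures are hypothetical
and exist only to contrast indexing a counter array by RPC procedure
number (CB_NULL is 0, every compound callback is 1) with indexing it
by callback operation.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical subset of callback operations, for illustration only. */
enum cb_proc {
	CB_PROC_NULL,
	CB_PROC_RECALL,
	CB_PROC_LAYOUT,
	CB_PROC_OFFLOAD,
	CB_PROC_COUNT
};

/* Old scheme: every compound-based callback lands in slot 1. */
static unsigned int statidx_by_rpc_proc(enum cb_proc p)
{
	return p == CB_PROC_NULL ? 0 : 1;
}

/* New scheme: each callback operation owns its own slot. */
static unsigned int statidx_by_operation(enum cb_proc p)
{
	return (unsigned int)p;
}

/* Account a sequence of calls using the given index function. */
static void count_calls(unsigned int (*idx)(enum cb_proc),
			const enum cb_proc *calls, size_t n,
			unsigned int counts[CB_PROC_COUNT])
{
	for (size_t i = 0; i < n; i++) {
		unsigned int slot = idx(calls[i]);

		/* The array is sized by the number of procedures. */
		assert(slot < CB_PROC_COUNT);
		counts[slot]++;
	}
}
```

With the old index function, one RECALL, two LAYOUT, and one OFFLOAD
call all accumulate in a single slot; with the new one, each operation
can be read back separately from an array of the same size.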
NFS clients may bypass restrictive directory permissions by using
open_by_handle() (or another available OS system call) to guess the
filehandles for files below that directory.
In order to harden knfsd servers against this attack, create a method to
sign and verify filehandles using SipHash-2-4 as a MAC (Message
Authentication Code). According to
https://cr.yp.to/siphash/siphash-20120918.pdf, SipHash can be used as a
MAC, and our use of SipHash-2-4 provides a low 1 in 2^64 chance of forgery.
Filehandles that have been signed cannot be tampered with, nor can
clients reasonably guess correct filehandles and hashes that may exist in
parts of the filesystem they cannot access due to directory permissions.
Append the 8 byte SipHash to encoded filehandles for exports that have set
the "sign_fh" export option. Filehandles received from clients are
verified by comparing the appended hash to the expected hash. If the MAC
does not match, the server responds with NFS error _STALE. If unsigned
filehandles are received for an export with "sign_fh" they are rejected
with NFS error _STALE.
Signed-off-by: Benjamin Coddington <bcodding@hammerspace.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
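The append-and-verify scheme described above can be sketched in
user-space C. This is a minimal illustration, not the NFSD
implementation: fh_sign() and fh_verify() are hypothetical names, the
little-endian placement of the 8-byte tag is an assumption, and the
kernel would use its own siphash helpers and key handling rather than
the open-coded SipHash-2-4 routine shown here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define FH_MAC_LEN 8 /* SipHash-2-4 yields a 64-bit tag */

#define ROTL64(x, b) (((x) << (b)) | ((x) >> (64 - (b))))
#define SIPROUND(v0, v1, v2, v3) do { \
	v0 += v1; v1 = ROTL64(v1, 13); v1 ^= v0; v0 = ROTL64(v0, 32); \
	v2 += v3; v3 = ROTL64(v3, 16); v3 ^= v2; \
	v0 += v3; v3 = ROTL64(v3, 21); v3 ^= v0; \
	v2 += v1; v1 = ROTL64(v1, 17); v1 ^= v2; v2 = ROTL64(v2, 32); \
} while (0)

static uint64_t load_le64(const uint8_t *p)
{
	uint64_t v = 0;

	for (int i = 7; i >= 0; i--)
		v = (v << 8) | p[i];
	return v;
}

/* SipHash-2-4 of in[0..len) under a 128-bit key. */
static uint64_t siphash24(const uint8_t *in, size_t len, const uint8_t key[16])
{
	uint64_t k0 = load_le64(key), k1 = load_le64(key + 8);
	uint64_t v0 = 0x736f6d6570736575ULL ^ k0;
	uint64_t v1 = 0x646f72616e646f6dULL ^ k1;
	uint64_t v2 = 0x6c7967656e657261ULL ^ k0;
	uint64_t v3 = 0x7465646279746573ULL ^ k1;
	uint64_t b = (uint64_t)len << 56;
	size_t full = len - (len % 8), i;

	for (i = 0; i < full; i += 8) {
		uint64_t m = load_le64(in + i);

		v3 ^= m;
		SIPROUND(v0, v1, v2, v3);
		SIPROUND(v0, v1, v2, v3);
		v0 ^= m;
	}
	for (; i < len; i++)	/* trailing bytes join the length word */
		b |= (uint64_t)in[i] << (8 * (i % 8));
	v3 ^= b;
	SIPROUND(v0, v1, v2, v3);
	SIPROUND(v0, v1, v2, v3);
	v0 ^= b;
	v2 ^= 0xff;
	for (i = 0; i < 4; i++)
		SIPROUND(v0, v1, v2, v3);
	return v0 ^ v1 ^ v2 ^ v3;
}

/* Append the tag to fh; returns the signed length, or 0 if it won't fit. */
static size_t fh_sign(uint8_t *fh, size_t len, size_t cap, const uint8_t key[16])
{
	uint64_t mac;

	assert(fh != NULL && key != NULL);
	if (len + FH_MAC_LEN > cap)
		return 0;
	mac = siphash24(fh, len, key);
	for (int i = 0; i < FH_MAC_LEN; i++)
		fh[len + i] = (uint8_t)(mac >> (8 * i));
	return len + FH_MAC_LEN;
}

/* False means the caller should answer with a STALE error. */
static bool fh_verify(const uint8_t *fh, size_t len, const uint8_t key[16])
{
	if (len <= FH_MAC_LEN)	/* too short to carry a tag */
		return false;
	return siphash24(fh, len - FH_MAC_LEN, key) ==
	       load_le64(fh + len - FH_MAC_LEN);
}
```

In this sketch a handle that was never signed fails verification the
same way a tampered one does, which mirrors the behavior described
above for exports with "sign_fh" set.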
In order to signal that filehandles on this export should be signed, add a
"sign_fh" export option. Filehandle signing can help the server defend
against certain filehandle guessing attacks.
Setting the "sign_fh" export option sets NFSEXP_SIGN_FH. In a future patch
NFSD uses this signal to append a MAC onto filehandles for that export.
While we're in here, tidy a few stray expflags to more closely align to the
export flag order.
Link: https://lore.kernel.org/linux-nfs/cover.1772022373.git.bcodding@hammerspace.com
Signed-off-by: Benjamin Coddington <bcodding@hammerspace.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
A future patch will enable NFSD to sign filehandles by appending a Message
Authentication Code (MAC). To do this, NFSD requires a secret 128-bit key
that can persist across reboots. A persisted key allows the server to
accept filehandles after a restart. Enable NFSD to be configured with this
key via the netlink interface.
Link: https://lore.kernel.org/linux-nfs/cover.1772022373.git.bcodding@hammerspace.com
Signed-off-by: Benjamin Coddington <bcodding@hammerspace.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Commit 1e8e9913672a ("nfsd: fix heap overflow in NFSv4.0 LOCK
replay cache") capped the replay cache copy at NFSD4_REPLAY_ISIZE
to prevent a heap overflow, but set rp_buflen to zero when the
encoded response exceeded the inline buffer. A retransmitted LOCK
reaching the replay path then produced only a status code with no
operation body, resulting in a malformed XDR response.
When the encoded response exceeds the 112-byte inline rp_ibuf, a
buffer is kmalloc'd to hold it. If the allocation fails, rp_buflen
remains zero, preserving the behavior from the capped-copy fix.
The buffer is freed when the stateowner is released or when a
subsequent operation's response fits in the inline buffer.
Fixes: 1e8e9913672a ("nfsd: fix heap overflow in NFSv4.0 LOCK replay cache")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
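The inline-buffer-with-heap-fallback behavior described above can be
sketched in user-space C. The struct and function names below are
hypothetical and malloc() stands in for kmalloc(); only the 112-byte
inline size comes from the commit message. As a simplification, this
sketch drops any earlier heap buffer on every store, which may differ
from how the actual patch reuses allocations.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define REPLAY_ISIZE 112 /* inline size quoted in the commit message */

struct replay {
	size_t buflen;		/* 0 means "status only, no cached body" */
	unsigned char *buf;	/* points at ibuf or at a heap buffer */
	unsigned char ibuf[REPLAY_ISIZE];
};

static void replay_init(struct replay *rp)
{
	rp->buflen = 0;
	rp->buf = rp->ibuf;
}

/* Free a heap buffer, if any, and fall back to the inline buffer. */
static void replay_release(struct replay *rp)
{
	if (rp->buf != rp->ibuf)
		free(rp->buf);
	rp->buf = rp->ibuf;
	rp->buflen = 0;
}

/* Cache an encoded response for a possible retransmission. */
static void replay_store(struct replay *rp, const unsigned char *resp,
			 size_t len)
{
	assert(resp != NULL);
	replay_release(rp);
	if (len > REPLAY_ISIZE) {
		unsigned char *big = malloc(len);

		if (!big)
			return;	/* buflen stays 0: status-only replay */
		rp->buf = big;
	}
	memcpy(rp->buf, resp, len);
	rp->buflen = len;
}
```

A response that fits stays in the inline array, a larger one moves to
the heap, and a later small response returns the cache to the inline
array and frees the heap buffer.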
Replace the global state_lock spinlock with a per-nfsd_net deleg_lock.
The state_lock was only used to protect delegation lifecycle operations
(the del_recall_lru list and delegation hash/unhash), all of which are
scoped to a single network namespace. Making the lock per-net removes
a source of unnecessary contention between containers.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
When a layout conflict triggers a recall, enforcing a timeout is
necessary to prevent too many nfsd threads from blocking in
__break_lease, so that the server continues to service incoming
requests efficiently.
This patch introduces a new function to lease_manager_operations:
lm_breaker_timedout: Invoked when a lease recall times out and is
about to be disposed of. This function enables the lease manager
to inform the caller whether the file_lease should remain on the
flc_list or be disposed of.
For the NFSD lease manager, this function now handles layout recall
timeouts. If the layout type supports fencing and the client has not
been fenced, a fence operation is triggered to prevent the client
from accessing the block device.
While the fencing operation is in progress, the conflicting file_lease
remains on the flc_list until fencing is complete. This guarantees
that no other clients can access the file, and the client with
exclusive access is properly blocked before disposal.
Signed-off-by: Dai Ngo <dai.ngo@oracle.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
In nfsd4_add_rdaccess_to_wrdeleg, if fp->fi_fds[O_RDONLY] is already
set by another thread, __nfs4_file_get_access should not be called
to increment the nfs4_file access count since that was already done
by the thread that added READ access to the file. The extra fi_access
count in nfs4_file can prevent the corresponding nfsd_file from being
freed.
When the nfs-server service is stopped, these extra access counts
trigger a BUG in kmem_cache_destroy() that reports nfsd_file objects
remaining on __kmem_cache_shutdown.
This problem can be reproduced by running the Git project's test
suite over NFS.
Fixes: 8072e34e13 ("nfsd: fix nfsd_file reference leak in nfsd4_add_rdaccess_to_wrdeleg()")
Signed-off-by: Dai Ngo <dai.ngo@oracle.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
RPC_IFDEBUG() is used in only two places. In one the user of
the definition is guarded by ifdeffery, in the second one
it's implied due to dprintk() usage. Kill the macro and move
the ifdeffery to the regular condition with the variable defined
inside, while in the second case add the same conditional and
move the respective code there.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
The nlmsvc_unlock_all_by_sb() and nlmsvc_unlock_all_by_ip()
functions are part of lockd's external API, consumed by other
kernel subsystems. Their declarations currently reside in
linux/lockd/lockd.h alongside internal implementation details,
which blurs the boundary between lockd's public interface and
its private internals.
Moving these declarations to linux/lockd/bind.h groups them
with other external API functions and makes the separation
explicit. This clarifies which functions are intended for
external use and reduces the risk of internal implementation
details leaking into the public API surface.
Build-tested with allyesconfig; no functional changes.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
The nlm_fopen() function is part of the API between nfsd and lockd.
Currently its return value is an on-the-wire NLM status code. But
that forces NFSD to include NLM wire protocol definitions despite
having no other dependency on the NLM wire protocol.
In addition, a CONFIG_LOCKD_V4 Kconfig symbol appears in the middle
of NFSD source code.
Refactor: Let's not use on-the-wire values as part of a high-level
API between two Linux kernel modules. That's what we have errno for,
right?
And, instead of simply moving the CONFIG_LOCKD_V4 check, we can get
rid of it entirely and leave the decision of which NLM status code
actually goes on the wire to NLM version-specific code.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
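The idea of returning an errno from the shared API and choosing the
wire status in version-specific code can be sketched as follows. The
function names and the particular errno-to-status mapping are
assumptions made for illustration and are not taken from the patch;
only the numeric status values come from the NLM protocol definitions.

```c
#include <assert.h>
#include <errno.h>

/* Status values from the NLM wire protocols. */
enum {
	NLM_GRANTED = 0,
	NLM_LCK_DENIED_NOLOCKS = 2,
	NLM4_STALE_FH = 7,	/* NLMv4 only */
	NLM4_FAILED = 9		/* NLMv4 only */
};

/* Hypothetical NLMv4 mapping: v4 can express a stale filehandle. */
static int nlm4_status_from_errno(int err)
{
	assert(err <= 0);	/* kernel convention: 0 or negative errno */
	switch (err) {
	case 0:
		return NLM_GRANTED;
	case -ESTALE:
		return NLM4_STALE_FH;
	default:
		return NLM4_FAILED;
	}
}

/* Hypothetical NLMv1/v3 mapping: no stale or generic failure code. */
static int nlm3_status_from_errno(int err)
{
	assert(err <= 0);
	return err ? NLM_LCK_DENIED_NOLOCKS : NLM_GRANTED;
}
```

Under this arrangement the caller that opens the file only reports
what went wrong, and each protocol version picks the closest status
code it is able to put on the wire.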
The nlm_drop_reply status code is internal to the kernel's lockd
implementation and must never appear on the wire. Its previous
location in xdr.h grouped it with legitimate NLM protocol status
codes, obscuring this critical distinction.
Relocate the definition to lockd.h with a comment block for internal
status codes, and rename to nlm__int__drop_reply to make its
internal-only nature explicit. This prepares for adding additional
internal status codes in subsequent patches.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Now that there is a runtime debugfs switch, eliminate the compile-time
switch and always build in support for delegated timestamps.
Administrators who previously disabled this feature at compile time can
disable it at runtime via:
echo 0 > /sys/kernel/debug/nfsd/delegated_timestamps
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
The delegated timestamp code seems to be working well enough now that we
want to make it always be built in. In the event that there are problems
though, we still want to be able to disable them for debugging purposes.
Add a switch to debugfs to enable them at runtime.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
When a client holding pNFS SCSI layouts becomes unresponsive, the
server revokes access by preempting the client's SCSI persistent
reservation key. A layout recall is issued for each layout the
client holds; if the client fails to respond, each recall triggers
a fence operation. The first preempt for a given device succeeds
and removes the client's key registration. Subsequent preempts for
the same device fail because the key is no longer registered.
Update the NFS server to handle SCSI persistent reservation
fencing on a per-client and per-device basis by utilizing an
xarray associated with the nfs4_client structure.
Each xarray entry is indexed by the dev_t of a block device
registered by the client. The entry maintains a flag indicating
whether this device has already been fenced for the corresponding
client.
When the server issues a persistent reservation key to a client,
it creates a new xarray entry at the dev_t index with the fenced
flag initialized to 0.
Before performing a fence via nfsd4_scsi_fence_client, the server
checks the corresponding entry using the device's dev_t. If the
fenced flag is already set, the fence operation is skipped;
otherwise, the flag is set to 1 and fencing proceeds.
The xarray is destroyed when the nfs4_client is released in
__destroy_client.
Signed-off-by: Dai Ngo <dai.ngo@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
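The check-then-fence bookkeeping described above can be sketched in
user-space C. A small fixed-size table stands in for the kernel xarray,
and every name here (struct client, client_register_dev(),
client_begin_fence(), MAX_DEVS) is hypothetical. The sketch is
single-threaded, so it omits whatever locking the real code needs, and
it assumes that a device with no entry is simply not fenced.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_DEVS 8 /* hypothetical table size for this sketch */

struct dev_entry {
	uint32_t dev;	/* stand-in for dev_t, used as the lookup key */
	bool used;
	bool fenced;
};

struct client {
	struct dev_entry devs[MAX_DEVS];
};

static struct dev_entry *client_find(struct client *c, uint32_t dev)
{
	for (size_t i = 0; i < MAX_DEVS; i++)
		if (c->devs[i].used && c->devs[i].dev == dev)
			return &c->devs[i];
	return NULL;
}

/* Called when a key is issued: the entry starts out not fenced. */
static bool client_register_dev(struct client *c, uint32_t dev)
{
	struct dev_entry *e;

	assert(c != NULL);
	e = client_find(c, dev);
	for (size_t i = 0; !e && i < MAX_DEVS; i++)
		if (!c->devs[i].used)
			e = &c->devs[i];
	if (!e)
		return false;	/* table full */
	e->dev = dev;
	e->used = true;
	e->fenced = false;
	return true;
}

/* True exactly once per registration: the caller should then preempt. */
static bool client_begin_fence(struct client *c, uint32_t dev)
{
	struct dev_entry *e = client_find(c, dev);

	if (!e || e->fenced)
		return false;	/* unknown device or already fenced */
	e->fenced = true;
	return true;
}
```

Several recalls for the same client and device then result in a single
preempt attempt, while a different device registered by the same
client is still fenced independently.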
The svc_rqst->rq_cachetype field is only accessed by nfsd. Move it
into the nfsd_thread_local_info instead.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Benjamin Coddington <bcodding@hammerspace.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
rq_lease_breaker has always been a NFSv4 specific layering violation in
svc_rqst. The reason it's there though is that we need a place that is
thread-local, and accessible from the svc_rqst pointer.
Add a new rq_private pointer to struct svc_rqst. This is intended for
use by the threads that are handling the service. sunrpc code doesn't
touch it.
In nfsd, define a new struct nfsd_thread_local_info. nfsd declares one
of these on the stack and puts a pointer to it in rq_private.
Add a new ntli_lease_breaker field to the new struct and convert all of
the places that access rq_lease_breaker to use the new field instead.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Benjamin Coddington <bcodding@hammerspace.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Issues that need expedient stable backports:
- Fix cache_request leak in cache_release()
- Fix heap overflow in the NFSv4.0 LOCK replay cache
- Hold net reference for the lifetime of /proc/fs/nfs/exports fd
- Defer sub-object cleanup in export "put" callbacks
Merge tag 'nfsd-7.0-2' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux
Pull nfsd fixes from Chuck Lever:
- Fix cache_request leak in cache_release()
- Fix heap overflow in the NFSv4.0 LOCK replay cache
- Hold net reference for the lifetime of /proc/fs/nfs/exports fd
- Defer sub-object cleanup in export "put" callbacks
* tag 'nfsd-7.0-2' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux:
nfsd: fix heap overflow in NFSv4.0 LOCK replay cache
sunrpc: fix cache_request leak in cache_release
NFSD: Hold net reference for the lifetime of /proc/fs/nfs/exports fd
NFSD: Defer sub-object cleanup in export put callbacks
The NFSv4.0 replay cache uses a fixed 112-byte inline buffer
(rp_ibuf[NFSD4_REPLAY_ISIZE]) to store encoded operation responses.
This size was calculated based on OPEN responses and does not account
for LOCK denied responses, which include the conflicting lock owner as
a variable-length field up to 1024 bytes (NFS4_OPAQUE_LIMIT).
When a LOCK operation is denied due to a conflict with an existing lock
that has a large owner, nfsd4_encode_operation() copies the full encoded
response into the undersized replay buffer via read_bytes_from_xdr_buf()
with no bounds check. This results in a slab-out-of-bounds write of up
to 944 bytes past the end of the buffer, corrupting adjacent heap memory.
This can be triggered remotely by an unauthenticated attacker with two
cooperating NFSv4.0 clients: one sets a lock with a large owner string,
then the other requests a conflicting lock to provoke the denial.
We could fix this by increasing NFSD4_REPLAY_ISIZE to allow for a full
opaque, but that would increase the size of every stateowner, when most
lockowners are not that large.
Instead, fix this by checking the encoded response length against
NFSD4_REPLAY_ISIZE before copying into the replay buffer. If the
response is too large, set rp_buflen to 0 to skip caching the replay
payload. The status is still cached, and the client already received the
correct response on the original request.
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Cc: stable@kernel.org
Reported-by: Nicholas Carlini <npc@anthropic.com>
Tested-by: Nicholas Carlini <npc@anthropic.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
The /proc/fs/nfs/exports proc entry is created at module init
and persists for the module's lifetime. exports_proc_open()
captures the caller's current network namespace and stores
its svc_export_cache in seq->private, but takes no reference
on the namespace. If the namespace is subsequently torn down
(e.g. container destruction after the opener does setns() to a
different namespace), nfsd_net_exit() calls nfsd_export_shutdown()
which frees the cache. Subsequent reads on the still-open fd
dereference the freed cache_detail, walking a freed hash table.
Hold a reference on the struct net for the lifetime of the open
file descriptor. This prevents nfsd_net_exit() from running --
and thus prevents nfsd_export_shutdown() from freeing the cache
-- while any exports fd is open. cache_detail already stores
its net pointer (cd->net, set by cache_create_net()), so
exports_release() can retrieve it without additional per-file
storage.
Reported-by: Misbah Anjum N <misanjum@linux.ibm.com>
Closes: https://lore.kernel.org/linux-nfs/dcd371d3a95815a84ba7de52cef447b8@linux.ibm.com/
Fixes: 96d851c4d2 ("nfsd: use proper net while reading "exports" file")
Cc: stable@vger.kernel.org
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: NeilBrown <neil@brown.name>
Tested-by: Olga Kornievskaia <okorniev@redhat.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
svc_export_put() calls path_put() and auth_domain_put() immediately
when the last reference drops, before the RCU grace period. RCU
readers in e_show() and c_show() access both ex_path (via
seq_path/d_path) and ex_client->name (via seq_escape) without
holding a reference. If cache_clean removes the entry and drops the
last reference concurrently, the sub-objects are freed while still
in use, producing a NULL pointer dereference in d_path.
Commit 2530766492 ("nfsd: fix UAF when access ex_uuid or
ex_stats") moved kfree of ex_uuid and ex_stats into the
call_rcu callback, but left path_put() and auth_domain_put() running
before the grace period because both may sleep and call_rcu
callbacks execute in softirq context.
Replace call_rcu/kfree_rcu with queue_rcu_work(), which defers the
callback until after the RCU grace period and executes it in process
context where sleeping is permitted. This allows path_put() and
auth_domain_put() to be moved into the deferred callback alongside
the other resource releases. Apply the same fix to expkey_put(),
which has the identical pattern with ek_path and ek_client.
A dedicated workqueue scopes the shutdown drain to only NFSD
export release work items; flushing the shared
system_unbound_wq would stall on unrelated work from other
subsystems. nfsd_export_shutdown() uses rcu_barrier() followed
by flush_workqueue() to ensure all deferred release callbacks
complete before the export caches are destroyed.
Reported-by: Misbah Anjum N <misanjum@linux.ibm.com>
Closes: https://lore.kernel.org/linux-nfs/dcd371d3a95815a84ba7de52cef447b8@linux.ibm.com/
Fixes: c224edca7a ("nfsd: no need get cache ref when protected by rcu")
Fixes: 1b10f0b603 ("SUNRPC: no need get cache ref when protected by rcu")
Cc: stable@vger.kernel.org
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: NeilBrown <neil@brown.name>
Tested-by: Olga Kornievskaia <okorniev@redhat.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
On 32-bit architectures, unsigned long is only 32 bits wide, which
causes 64-bit inode numbers to be silently truncated. Several
filesystems (NFS, XFS, BTRFS, etc.) can generate inode numbers that
exceed 32 bits, and this truncation can lead to inode number collisions
and other subtle bugs on 32-bit systems.
Change the type of inode->i_ino from unsigned long to u64 to ensure that
inode numbers are always represented as 64-bit values regardless of
architecture. Update all format specifiers treewide from %lu/%lx to
%llu/%llx to match the new type, along with corresponding local variable
types.
This is the bulk treewide conversion. Earlier patches in this series
handled trace events separately to allow trace field reordering for
better struct packing on 32-bit.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Link: https://patch.msgid.link/20260304-iino-u64-v3-12-2257ad83d372@kernel.org
Acked-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
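The truncation problem described above, and the matching format
specifier change, can be shown with a short user-space sketch. The
helper names are hypothetical. Note that user-space code prints a
uint64_t with PRIu64, whereas the kernel's u64 is printed with %llu as
the commit message says.

```c
#include <assert.h>
#include <inttypes.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* What a 32-bit unsigned long would keep of a 64-bit inode number. */
static uint32_t ino_as_ulong32(uint64_t ino)
{
	return (uint32_t)ino;
}

/* Format the full 64-bit value; returns the number of characters. */
static int format_ino(char *buf, size_t len, uint64_t ino)
{
	assert(buf != NULL && len > 0);
	return snprintf(buf, len, "ino=%" PRIu64, ino);
}
```

Two distinct inode numbers such as 0x100000001 and 1 collapse to the
same value once only the low 32 bits are kept, which is the collision
risk that a 64-bit field avoids on every architecture.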
Rather than explicit locking, use the start_removing_noperm() and
end_removing() wrappers.
This was not done with other start_removing changes due to conflicting
in-flight patches.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: NeilBrown <neil@brown.name>
Link: https://patch.msgid.link/20260224222542.3458677-8-neilb@ownmail.net
Signed-off-by: Christian Brauner <brauner@kernel.org>
NFSD fixes that arrived too late for the 7.0 merge window.
Fixes for commits merged in 7.0:
- Restore previous nfsd thread count reporting behavior
Issues that need expedient stable backports:
- Fix credential reference leaks in the NFSD netlink admin protocol
Merge tag 'nfsd-7.0-1' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux
Pull nfsd fixes from Chuck Lever:
- Restore previous nfsd thread count reporting behavior
- Fix credential reference leaks in the NFSD netlink admin protocol
* tag 'nfsd-7.0-1' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux:
nfsd: report the requested maximum number of threads instead of number running
nfsd: Fix cred ref leak in nfsd_nl_listener_set_doit().
nfsd: Fix cred ref leak in nfsd_nl_threads_set_doit().
The current netlink and /proc interfaces deviate from their traditional
values when dynamic threading is enabled, and there is currently no way
to know what the current setting is. This patch brings the reporting
back in line with traditional behavior.
Make these interfaces report the requested maximum number of threads
instead of the number currently running. Also, update documentation and
comments to reflect that this value represents a maximum and not the
number currently running.
Fixes: d8316b837c ("nfsd: add controls to set the minimum number of threads per pool")
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
This converts some of the visually simpler cases that have been split
over multiple lines. I only did the ones where the resulting diff is
easy to verify, because just that final GFP_KERNEL argument sits on the
next line.
Somebody should probably do a proper coccinelle script for this, but for
me the trivial script actually resulted in an assertion failure in the
middle of the script. I probably had made it a bit _too_ trivial.
So after fighting that for a while I decided to just do some of the
syntactically simpler cases with variations of the previous 'sed'
scripts.
The more syntactically complex multi-line cases would mostly really want
whitespace cleanup anyway.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This is the exact same thing as the 'alloc_obj()' version, only much
smaller because there are a lot fewer users of the *alloc_flex()
interface.
As with the alloc_obj() version, this was done entirely with mindless brute
force, using the same script, except using 'flex' in the pattern rather
than 'objs*'.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This was done entirely with mindless brute force, using
    git grep -l '\<k[vmz]*alloc_objs*(.*, GFP_KERNEL)' |
        xargs sed -i 's/\(alloc_objs*(.*\), GFP_KERNEL)/\1)/'
to convert the new alloc_obj() users that had a simple GFP_KERNEL
argument to just drop that argument.
Note that due to the extreme simplicity of the scripting, any slightly
more complex cases spread over multiple lines would not be triggered:
they definitely exist, but this covers the vast bulk of the cases, and
the resulting diff is also then easier to check automatically.
For the same reason the 'flex' versions will be done as a separate
conversion.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This is the result of running the Coccinelle script from
scripts/coccinelle/api/kmalloc_objs.cocci. The script is designed to
avoid scalar types (which need careful case-by-case checking), and
instead replace kmalloc-family calls that allocate struct or union
object instances:
    Single allocations:      kmalloc(sizeof(TYPE), ...)
    are replaced with:       kmalloc_obj(TYPE, ...)
    Array allocations:       kmalloc_array(COUNT, sizeof(TYPE), ...)
    are replaced with:       kmalloc_objs(TYPE, COUNT, ...)
    Flex array allocations:  kmalloc(struct_size(PTR, FAM, COUNT), ...)
    are replaced with:       kmalloc_flex(*PTR, FAM, COUNT, ...)
    (where TYPE may also be *VAR)
The resulting allocations no longer return "void *", instead returning
"TYPE *".
Signed-off-by: Kees Cook <kees@kernel.org>
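A rough userspace analogue of the three replacement forms listed above, to show why the result can be "TYPE *" rather than "void *". The alloc_obj(), alloc_objs() and alloc_flex() macros and the two structs are hypothetical illustrations; the kernel macros take GFP flags, accept *VAR as well as a type name, and differ in detail (for example in overflow handling).

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical analogues; each expression has type "TYPE *". */
#define alloc_obj(TYPE)		((TYPE *)calloc(1, sizeof(TYPE)))
#define alloc_objs(TYPE, COUNT)	((TYPE *)calloc((COUNT), sizeof(TYPE)))
/* No overflow check here; this sketch only accepts a type name. */
#define alloc_flex(TYPE, FAM, COUNT)					\
	((TYPE *)calloc(1, offsetof(TYPE, FAM) +			\
			   (COUNT) * sizeof(((TYPE *)0)->FAM[0])))

struct item {
	int id;
};

struct packet {
	size_t len;
	unsigned char data[];	/* flexible array member */
};

/* Allocate a packet with room for len payload bytes. */
static struct packet *packet_new(size_t len)
{
	struct packet *p = alloc_flex(struct packet, data, len);

	if (p)
		p->len = len;
	return p;
}
```

Because the macro result is already typed, assigning it to a pointer of the wrong type draws a compiler diagnostic instead of passing silently through "void *".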
Please consider pulling these changes from the signed vfs-7.0-rc1.misc.2 tag.
Thanks!
Christian
Merge tag 'vfs-7.0-rc1.misc.2' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull more misc vfs updates from Christian Brauner:
"Features:
- Optimize close_range() from O(range size) to O(active FDs) by using
find_next_bit() on the open_fds bitmap instead of linearly scanning
the entire requested range. This is a significant improvement for
large-range close operations on sparse file descriptor tables.
- Add FS_XFLAG_VERITY file attribute for fs-verity files, retrievable
via FS_IOC_FSGETXATTR and file_getattr(). The flag is read-only.
Add tracepoints for fs-verity enable and verify operations,
replacing the previously removed debug printk's.
- Prevent nfsd from exporting special kernel filesystems like pidfs
and nsfs. These filesystems have custom ->open() and ->permission()
export methods that are designed for open_by_handle_at(2) only and
are incompatible with nfsd. Update the exportfs documentation
accordingly.
Fixes:
- Fix KMSAN uninit-value in ovl_fill_real() where strcmp() was used
on a non-null-terminated decrypted directory entry name from
fscrypt. This triggered on encrypted lower layers when the
decrypted name buffer contained uninitialized tail data.
The fix also adds VFS-level name_is_dot(), name_is_dotdot(), and
name_is_dot_dotdot() helpers, replacing various open-coded "." and
".." checks across the tree.
- Fix read-only fsflags not being reset together with xflags in
vfs_fileattr_set(). Currently harmless since no read-only xflags
overlap with flags, but this would cause inconsistencies for any
future shared read-only flag.
- Return -EREMOTE instead of -ESRCH from PIDFD_GET_INFO when the
target process is in a different pid namespace. This lets userspace
distinguish "process exited" from "process in another namespace",
matching glibc's pidfd_getpid() behavior.
Cleanups:
- Use C-string literals in the Rust seq_file bindings, replacing the
kernel::c_str!() macro (available since Rust 1.77)
- Fix typo in d_walk_ret enum comment, add porting notes for the
readlink_copy() calling convention change"
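The close_range() item above can be sketched in userspace as follows. This is only an illustration of walking set bits in a bitmap instead of visiting every number in the range; next_set_bit() is a simplified stand-in for find_next_bit(), and close_open_in_range() with its bitmap layout is a hypothetical helper, not the kernel implementation.

```c
#include <limits.h>
#include <stddef.h>

#define BITS_PER_WORD	(sizeof(unsigned long) * CHAR_BIT)

/* Return the index of the first set bit >= start, or nbits if none. */
static size_t next_set_bit(const unsigned long *map, size_t nbits,
			   size_t start)
{
	while (start < nbits) {
		unsigned long word = map[start / BITS_PER_WORD] >>
				     (start % BITS_PER_WORD);

		if (word == 0) {
			/* Nothing left in this word: jump to the next one. */
			start = (start / BITS_PER_WORD + 1) * BITS_PER_WORD;
			continue;
		}
		while (!(word & 1UL)) {
			word >>= 1;
			start++;
		}
		return start < nbits ? start : nbits;
	}
	return nbits;
}

/* Clear every set bit in [lo, hi]; return how many were cleared. */
static size_t close_open_in_range(unsigned long *open_fds, size_t nbits,
				  size_t lo, size_t hi)
{
	size_t closed = 0;

	for (size_t fd = next_set_bit(open_fds, nbits, lo);
	     fd < nbits && fd <= hi;
	     fd = next_set_bit(open_fds, nbits, fd + 1)) {
		open_fds[fd / BITS_PER_WORD] &= ~(1UL << (fd % BITS_PER_WORD));
		closed++;
	}
	return closed;
}
```

Empty words are skipped in one step, so the cost in this sketch follows the number of bitmap words plus the number of open descriptors rather than the width of the requested range.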
* tag 'vfs-7.0-rc1.misc.2' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
fs: add porting notes about readlink_copy()
pidfs: return -EREMOTE when PIDFD_GET_INFO is called on another ns
nfsd: do not allow exporting of special kernel filesystems
exportfs: clarify the documentation of open()/permission() expotrfs ops
fsverity: add tracepoints
fs: add FS_XFLAG_VERITY for fs-verity files
rust: seq_file: replace `kernel::c_str!` with C-Strings
fs: dcache: fix typo in enum d_walk_ret comment
ovl: use name_is_dot* helpers in readdir code
fs: add helpers name_is_dot{,dot,_dotdot}
ovl: Fix uninit-value in ovl_fill_real
fs: reset read-only fsflags together with xflags
fs/file: optimize close_range() complexity from O(N) to O(Sparse)
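For the ovl_fill_real() fix in the list above, the point of the name_is_dot*() helpers is that they compare by length rather than relying on a terminating NUL. A minimal sketch, assuming a (name, length) calling convention; the helper names come from the pull message, but the signatures shown here are an assumption, not the kernel prototypes.

```c
#include <stdbool.h>
#include <stddef.h>

/* True if the len-byte name is exactly ".". */
static bool name_is_dot(const char *name, size_t len)
{
	return len == 1 && name[0] == '.';
}

/* True if the len-byte name is exactly "..". */
static bool name_is_dotdot(const char *name, size_t len)
{
	return len == 2 && name[0] == '.' && name[1] == '.';
}

/* True if the len-byte name is "." or "..". */
static bool name_is_dot_dotdot(const char *name, size_t len)
{
	return name_is_dot(name, len) || name_is_dotdot(name, len);
}
```

Unlike strcmp(), none of these reads past name[len - 1], so uninitialized bytes after a decrypted name are never examined.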
nfsd_nl_listener_set_doit() uses get_current_cred() without
put_cred().
As we can see from other callers, svc_xprt_create_from_sa()
does not require the extra refcount.
nfsd_nl_listener_set_doit() always runs in process context, via
sendmsg(), so current->cred does not go away.
Let's use current_cred() in nfsd_nl_listener_set_doit().
Fixes: 16a4711774 ("NFSD: add listener-{set,get} netlink command")
Cc: stable@vger.kernel.org
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
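The leak pattern fixed here can be modelled with a plain counter. This is a userspace sketch under stated assumptions: cred_get()/cred_put() stand in for get_cred()/put_cred(), xprt_create() stands in for a callee that takes its own reference, and all names and structures are hypothetical simplifications of the kernel code.

```c
#include <stddef.h>

struct cred {
	int usage;	/* simplified reference count */
};

struct xprt {
	struct cred *cred;
};

static struct cred *cred_get(struct cred *c)
{
	c->usage++;
	return c;
}

static void cred_put(struct cred *c)
{
	c->usage--;
}

/* The callee takes its own reference for the lifetime of the xprt. */
static void xprt_create(struct xprt *x, struct cred *c)
{
	x->cred = cred_get(c);
}

static void xprt_destroy(struct xprt *x)
{
	cred_put(x->cred);
	x->cred = NULL;
}

/* Buggy shape: an extra reference is taken and never dropped. */
static void setup_leaky(struct xprt *x, struct cred *cur)
{
	struct cred *c = cred_get(cur);

	xprt_create(x, c);
	/* missing cred_put(c): this reference is leaked */
}

/* Fixed shape: borrow the caller's cred; the callee pins what it needs. */
static void setup_fixed(struct xprt *x, struct cred *cur)
{
	xprt_create(x, cur);
}
```

After setup_leaky() and xprt_destroy() the count stays one above its starting value; with setup_fixed() it returns to where it began.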
syzbot reported memory leak of struct cred. [0]
nfsd_nl_threads_set_doit() passes get_current_cred() to
nfsd_svc(), but put_cred() is not called after that.
The cred is finally passed down to _svc_xprt_create(),
which calls get_cred() with the cred for struct svc_xprt.
The ownership of the refcount taken by get_current_cred() is not
transferred anywhere and is simply leaked.
nfsd_svc() is also called from write_threads(), but it does
not bump file->f_cred there.
nfsd_nl_threads_set_doit() is called from sendmsg() and
current->cred does not go away.
Let's use current_cred() in nfsd_nl_threads_set_doit().
[0]:
BUG: memory leak
unreferenced object 0xffff888108b89480 (size 184):
comm "syz-executor", pid 5994, jiffies 4294943386
hex dump (first 32 bytes):
01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
backtrace (crc 369454a7):
kmemleak_alloc_recursive include/linux/kmemleak.h:44 [inline]
slab_post_alloc_hook mm/slub.c:4958 [inline]
slab_alloc_node mm/slub.c:5263 [inline]
kmem_cache_alloc_noprof+0x412/0x580 mm/slub.c:5270
prepare_creds+0x22/0x600 kernel/cred.c:185
copy_creds+0x44/0x290 kernel/cred.c:286
copy_process+0x7a7/0x2870 kernel/fork.c:2086
kernel_clone+0xac/0x6e0 kernel/fork.c:2651
__do_sys_clone+0x7f/0xb0 kernel/fork.c:2792
do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
do_syscall_64+0xa4/0xf80 arch/x86/entry/syscall_64.c:94
entry_SYSCALL_64_after_hwframe+0x77/0x7f
Fixes: 924f4fb003 ("NFSD: convert write_threads to netlink command")
Cc: stable@vger.kernel.org
Reported-by: syzbot+dd3b43aa0204089217ee@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/69744674.a00a0220.33ccc7.0000.GAE@google.com/
Tested-by: syzbot+dd3b43aa0204089217ee@syzkaller.appspotmail.com
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Total patches: 107
Reviews/patch: 1.07
Reviewed rate: 67%
- The 2 patch series "ocfs2: give ocfs2 the ability to reclaim
suballocator free bg" from Heming Zhao saves disk space by teaching
ocfs2 to reclaim suballocator block group space.
- The 4 patch series "Add ARRAY_END(), and use it to fix off-by-one
bugs" from Alejandro Colomar adds the ARRAY_END() macro and uses it in
various places.
- The 2 patch series "vmcoreinfo: support VMCOREINFO_BYTES larger than
PAGE_SIZE" from Pnina Feder makes the vmcore code future-safe, if
VMCOREINFO_BYTES ever exceeds the page size.
- The 7 patch series "kallsyms: Prevent invalid access when showing
module buildid" from Petr Mladek cleans up kallsyms code related to
module buildid and fixes an invalid access crash when printing
backtraces.
- The 3 patch series "Address page fault in
ima_restore_measurement_list()" from Harshit Mogalapalli fixes a
kexec-related crash that can occur when booting the second-stage kernel
on x86.
- The 6 patch series "kho: ABI headers and Documentation updates" from
Mike Rapoport updates the kexec handover ABI documentation.
- The 4 patch series "Align atomic storage" from Finn Thain adds the
__aligned attribute to atomic_t and atomic64_t definitions to get
natural alignment of both types on csky, m68k, microblaze, nios2,
openrisc and sh.
- The 2 patch series "kho: clean up page initialization logic" from
Pratyush Yadav simplifies the page initialization logic in
kho_restore_page().
- The 6 patch series "Unload linux/kernel.h" from Yury Norov moves
several things out of kernel.h and into more appropriate places.
- The 7 patch series "don't abuse task_struct.group_leader" from Oleg
Nesterov removes the usage of ->group_leader when it is "obviously
unnecessary".
- The 5 patch series "list private v2 & luo flb" from Pasha Tatashin
adds some infrastructure improvements to the live update orchestrator.
Merge tag 'mm-nonmm-stable-2026-02-12-10-48' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull non-MM updates from Andrew Morton:
- "ocfs2: give ocfs2 the ability to reclaim suballocator free bg" saves
disk space by teaching ocfs2 to reclaim suballocator block group
space (Heming Zhao)
- "Add ARRAY_END(), and use it to fix off-by-one bugs" adds the
ARRAY_END() macro and uses it in various places (Alejandro Colomar)
- "vmcoreinfo: support VMCOREINFO_BYTES larger than PAGE_SIZE" makes
the vmcore code future-safe, if VMCOREINFO_BYTES ever exceeds the
page size (Pnina Feder)
- "kallsyms: Prevent invalid access when showing module buildid" cleans
up kallsyms code related to module buildid and fixes an invalid
access crash when printing backtraces (Petr Mladek)
- "Address page fault in ima_restore_measurement_list()" fixes a
kexec-related crash that can occur when booting the second-stage
kernel on x86 (Harshit Mogalapalli)
- "kho: ABI headers and Documentation updates" updates the kexec
handover ABI documentation (Mike Rapoport)
- "Align atomic storage" adds the __aligned attribute to atomic_t and
atomic64_t definitions to get natural alignment of both types on
csky, m68k, microblaze, nios2, openrisc and sh (Finn Thain)
- "kho: clean up page initialization logic" simplifies the page
initialization logic in kho_restore_page() (Pratyush Yadav)
- "Unload linux/kernel.h" moves several things out of kernel.h and into
more appropriate places (Yury Norov)
- "don't abuse task_struct.group_leader" removes the usage of
->group_leader when it is "obviously unnecessary" (Oleg Nesterov)
- "list private v2 & luo flb" adds some infrastructure improvements to
the live update orchestrator (Pasha Tatashin)
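As an illustration of the ARRAY_END() item above, a one-past-the-end macro lets loops compare against the end pointer directly instead of recomputing "base + count" by hand. The definition below is an assumption made for this sketch; the macro actually merged may be written differently.

```c
#include <stddef.h>

#define ARRAY_SIZE(a)	(sizeof(a) / sizeof((a)[0]))
/* Assumed definition: pointer to one past the last element. */
#define ARRAY_END(a)	((a) + ARRAY_SIZE(a))

/* Sum a fixed table by walking from its start to ARRAY_END(). */
static int sum_table(void)
{
	static const int table[] = { 1, 2, 3, 4 };
	int sum = 0;

	for (const int *p = table; p < ARRAY_END(table); p++)
		sum += *p;
	return sum;
}
```

The usual off-by-one comes from writing "<=" against the last element or "<" against size - 1; comparing with "<" against the end pointer avoids both.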
* tag 'mm-nonmm-stable-2026-02-12-10-48' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (107 commits)
watchdog/hardlockup: simplify perf event probe and remove per-cpu dependency
procfs: fix missing RCU protection when reading real_parent in do_task_stat()
watchdog/softlockup: fix sample ring index wrap in need_counting_irqs()
kcsan, compiler_types: avoid duplicate type issues in BPF Type Format
kho: fix doc for kho_restore_pages()
tests/liveupdate: add in-kernel liveupdate test
liveupdate: luo_flb: introduce File-Lifecycle-Bound global state
liveupdate: luo_file: Use private list
list: add kunit test for private list primitives
list: add primitives for private list manipulations
delayacct: fix uapi timespec64 definition
panic: add panic_force_cpu= parameter to redirect panic to a specific CPU
netclassid: use thread_group_leader(p) in update_classid_task()
RDMA/umem: don't abuse current->group_leader
drm/pan*: don't abuse current->group_leader
drm/amd: kill the outdated "Only the pthreads threading model is supported" checks
drm/amdgpu: don't abuse current->group_leader
android/binder: use same_thread_group(proc->tsk, current) in binder_mmap()
android/binder: don't abuse current->group_leader
kho: skip memoryless NUMA nodes when reserving scratch areas
...
Neil Brown and Jeff Layton contributed a dynamic thread pool sizing
mechanism for NFSD. The sunrpc layer now tracks minimum and maximum
thread counts per pool, and NFSD adjusts running thread counts based
on workload: idle threads exit after a timeout when the pool exceeds
its minimum, and new threads spawn automatically when all threads
are busy. Administrators control this behavior via the nfsdctl
netlink interface.
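The sizing policy described above can be summarized as a small decision function. This is a sketch of the stated rules only; struct pool, pool_decide() and the enum are hypothetical names, and the real sunrpc code is structured differently.

```c
#include <stdbool.h>

enum pool_action {
	POOL_KEEP,	/* leave the thread count alone */
	POOL_SPAWN,	/* start one more thread */
	POOL_EXIT,	/* let the calling idle thread exit */
};

struct pool {
	unsigned int min_threads;
	unsigned int max_threads;
	unsigned int running;
	unsigned int busy;
};

/*
 * idle_timed_out: the calling thread waited for work for the full
 * timeout without receiving any.
 */
static enum pool_action pool_decide(const struct pool *p, bool idle_timed_out)
{
	if (idle_timed_out)
		return p->running > p->min_threads ? POOL_EXIT : POOL_KEEP;
	if (p->busy >= p->running && p->running < p->max_threads)
		return POOL_SPAWN;
	return POOL_KEEP;
}
```

The minimum and maximum act as hard bounds: the pool never shrinks below the first or grows beyond the second, whatever the load.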
Rick Macklem, FreeBSD NFS maintainer, generously contributed server-
side support for the POSIX ACL extension to NFSv4, as specified in
draft-ietf-nfsv4-posix-acls. This extension allows NFSv4 clients to
get and set POSIX access and default ACLs using native NFSv4
operations, eliminating the need for sideband protocols. The feature
is gated by a Kconfig option since the IETF draft has not yet been
ratified.
Chuck Lever delivered numerous improvements to the xdrgen tool.
Error reporting now covers parsing, AST transformation, and invalid
declarations. Generated enum decoders validate incoming values
against valid enumerator lists. New features include pass-through
line support for embedding C directives in XDR specifications,
16-bit integer types, and program number definitions. Several code
generation issues were also addressed.
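To illustrate the enum validation mentioned above: a decoder that checks the wire value against the list of enumerators rejects anything the specification does not define. The enum, its values and decode_color() are hypothetical and hand-written for this sketch; they are not output of xdrgen.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical XDR enum with a gap in its value space. */
enum color {
	COLOR_RED = 0,
	COLOR_GREEN = 1,
	COLOR_BLUE = 4,
};

/* Decode a big-endian 32-bit enum; fail on short input or unknown value. */
static bool decode_color(const unsigned char *buf, size_t len,
			 enum color *out)
{
	uint32_t raw;

	if (len < 4)
		return false;
	raw = ((uint32_t)buf[0] << 24) | ((uint32_t)buf[1] << 16) |
	      ((uint32_t)buf[2] << 8) | (uint32_t)buf[3];

	switch (raw) {
	case COLOR_RED:
	case COLOR_GREEN:
	case COLOR_BLUE:
		*out = (enum color)raw;
		return true;
	default:
		/* Not a defined enumerator: reject instead of casting. */
		return false;
	}
}
```

Without the switch, a value such as 2 would be stored in an enum variable that no later code expects to see.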
When an administrator revokes NFSv4 state for a filesystem via the
unlock_fs interface, ongoing async COPY operations referencing that
filesystem are now cancelled, with CB_OFFLOAD callbacks notifying
affected clients.
The remaining patches in this pull request are clean-ups and minor
optimizations. Sincere thanks to all contributors, reviewers,
testers, and bug reporters who participated in the v7.0 NFSD
development cycle.
Merge tag 'nfsd-7.0' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux
Pull nfsd updates from Chuck Lever:
* tag 'nfsd-7.0' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux: (45 commits)
NFSD: Add POSIX ACL file attributes to SUPPATTR bitmasks
NFSD: Add POSIX draft ACL support to the NFSv4 SETATTR operation
NFSD: Add support for POSIX draft ACLs for file creation
NFSD: Add support for XDR decoding POSIX draft ACLs
NFSD: Refactor nfsd_setattr()'s ACL error reporting
NFSD: Do not allow NFSv4 (N)VERIFY to check POSIX ACL attributes
NFSD: Add nfsd4_encode_fattr4_posix_access_acl
NFSD: Add nfsd4_encode_fattr4_posix_default_acl
NFSD: Add nfsd4_encode_fattr4_acl_trueform_scope
NFSD: Add nfsd4_encode_fattr4_acl_trueform
Add RPC language definition of NFSv4 POSIX ACL extension
NFSD: Add a Kconfig setting to enable support for NFSv4 POSIX ACLs
xdrgen: Implement pass-through lines in specifications
nfsd: cancel async COPY operations when admin revokes filesystem state
nfsd: add controls to set the minimum number of threads per pool
nfsd: adjust number of running nfsd threads based on activity
sunrpc: allow svc_recv() to return -ETIMEDOUT and -EBUSY
sunrpc: split new thread creation into a separate function
sunrpc: introduce the concept of a minimum number of threads per pool
sunrpc: track the max number of requested threads in a pool
...
Please consider pulling these changes from the signed vfs-7.0-rc1.atomic_open tag.
Thanks!
Christian
Merge tag 'vfs-7.0-rc1.atomic_open' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs atomic_open updates from Christian Brauner:
"Allow knfsd to use atomic_open()
While knfsd offers combined exclusive create and open results to
clients, on some filesystems those results are not atomic. The
separate vfs_create() + vfs_open() sequence in dentry_create() can
produce races and unexpected errors. For example, open O_CREAT with
mode 0 will succeed in creating the file but return -EACCES from
vfs_open(). Additionally, network filesystems benefit from reducing
remote round-trip operations by using a single atomic_open() call.
Teach dentry_create() -- whose sole caller is knfsd -- to use
atomic_open() for filesystems that support it"
* tag 'vfs-7.0-rc1.atomic_open' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
fs/namei: fix kernel-doc markup for dentry_create
VFS/knfsd: Teach dentry_create() to use atomic_open()
VFS: Prepare atomic_open() for dentry_create()
VFS: move dentry_create() from fs/open.c to fs/namei.c
pidfs and nsfs recently gained support for encode/decode of file handles
via name_to_handle_at(2)/open_by_handle_at(2).
These special kernel filesystems have custom ->open() and ->permission()
export methods, which nfsd does not respect, and these filesystems were
never meant to be exported by nfsd.
Therefore, do not allow nfsd to export filesystems with custom ->open()
or ->permission() methods.
Fixes: b3caba8f7a ("pidfs: implement file handle support")
Fixes: 5222470b2f ("nsfs: support file handles")
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Link: https://patch.msgid.link/20260129100212.49727-3-amir73il@gmail.com
Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
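The rule introduced by this patch amounts to a check on two method pointers. A minimal sketch, assuming a simplified operations table; struct export_ops, exportable_by_nfsd() and the sample callbacks are hypothetical and do not reproduce the kernel's struct export_operations.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, reduced operations table. */
struct export_ops {
	int (*open)(void);
	int (*permission)(void);
};

/* Sample custom methods, as a pidfs-like filesystem might provide. */
static int custom_open(void)
{
	return 0;
}

static int custom_permission(void)
{
	return 0;
}

/*
 * Refuse export when either custom method is present. Treating a
 * missing table as non-exportable is an assumption of this sketch.
 */
static bool exportable_by_nfsd(const struct export_ops *ops)
{
	if (!ops)
		return false;
	return !ops->open && !ops->permission;
}
```

A filesystem that sets either method is rejected, since nfsd would bypass the open and permission logic that open_by_handle_at(2) relies on.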
Now that infrastructure for NFSv4 POSIX draft ACL has been added
to NFSD, it should be safe to advertise support to NFS clients.
NFSD_SUPPATTR_EXCLCREAT_WORD2 includes NFSv4.2-only attributes,
but version filtering occurs via nfsd_suppattrs[] before this
mask is applied, ensuring pre-4.2 clients never see unsupported
attributes.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
The POSIX ACL extension to NFSv4 enables clients to set access
and default ACLs via FATTR4_POSIX_ACCESS_ACL and
FATTR4_POSIX_DEFAULT_ACL attributes. Integration of these
attributes into SETATTR processing requires wiring them through
the nfsd_attrs structure and ensuring proper cleanup on all
code paths.
This patch connects the na_pacl and na_dpacl fields in
nfsd_attrs to the decoded ACL pointers from the NFSv4 SETATTR
decoder. Ownership of these ACL references transfers to attrs
immediately after initialization, with the decoder's pointers
cleared to NULL. This transfer ensures nfsd_attrs_free()
releases the ACLs on normal completion, while new error paths
call posix_acl_release() directly when cleanup occurs before
nfsd_attrs_free() runs.
Early returns in nfsd4_setattr() are converted to goto statements
that branch to the proper cleanup handlers.
Error paths before fh_want_write() branch to out_err for ACL
release only; paths after fh_want_write() use the existing out
label for full cleanup via nfsd_attrs_free().
The patch adds mutual exclusion between NFSv4 ACLs (sa_acl) and
POSIX ACLs. Setting both types simultaneously returns
nfserr_inval because these ACL models cannot coexist on the
same file object.
Signed-off-by: Rick Macklem <rmacklem@uoguelph.ca>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
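The ownership transfer described in this patch follows a common move-and-clear pattern. The sketch below models it in userspace with malloc()/free(); the kernel uses reference-counted ACLs released via posix_acl_release(), and the struct layouts and helper names here are hypothetical, with only the na_pacl/na_dpacl field names taken from the patch description.

```c
#include <stdlib.h>

struct acl {
	int count;
};

/* Pointers produced by the decoder. */
struct decoded_args {
	struct acl *pacl;
	struct acl *dpacl;
};

/* Attributes handed to the setattr path. */
struct attrs {
	struct acl *na_pacl;
	struct acl *na_dpacl;
};

/* Move both ACLs into attrs and clear the decoder's copies. */
static void attrs_take_acls(struct attrs *a, struct decoded_args *d)
{
	a->na_pacl = d->pacl;
	a->na_dpacl = d->dpacl;
	d->pacl = NULL;
	d->dpacl = NULL;
}

/* Release whatever attrs currently owns. */
static void attrs_free(struct attrs *a)
{
	free(a->na_pacl);
	free(a->na_dpacl);
	a->na_pacl = NULL;
	a->na_dpacl = NULL;
}

/* Decoder cleanup; a no-op for ACLs that have been moved out. */
static void decoded_free(struct decoded_args *d)
{
	free(d->pacl);
	free(d->dpacl);
	d->pacl = NULL;
	d->dpacl = NULL;
}
```

Because the source pointers are NULL after the move, running both cleanup routines on any path releases each ACL exactly once.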
NFSv4.2 clients can specify POSIX draft ACLs when creating file
objects via OPEN(CREATE) and CREATE operations. The previous patch
added POSIX ACL support to the NFSv4 SETATTR operation for modifying
existing objects, but file creation follows different code paths
that also require POSIX ACL handling.
This patch integrates POSIX ACL support into nfsd4_create() and
nfsd4_create_file(). Ownership of the decoded ACL pointers
(op_dpacl, op_pacl, cr_dpacl, cr_pacl) transfers to the nfsd_attrs
structure immediately, with the original fields cleared to NULL.
This transfer ensures nfsd_attrs_free() releases the ACLs upon
completion while preventing double-free on error paths.
Mutual exclusion between NFSv4 ACLs and POSIX ACLs is enforced:
setting both op_acl and op_dpacl/op_pacl simultaneously returns
nfserr_inval. Errors during ACL application clear the corresponding
bits in the result bitmask (fattr->bmval), signaling partial
completion to the client.
Signed-off-by: Rick Macklem <rmacklem@uoguelph.ca>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
The POSIX ACL extension to NFSv4 defines FATTR4_POSIX_ACCESS_ACL
and FATTR4_POSIX_DEFAULT_ACL for setting access and default ACLs
via CREATE, OPEN, and SETATTR operations. This patch adds the XDR
decoders for those attributes.
The nfsd4_decode_fattr4() function gains two additional parameters
for receiving decoded POSIX ACLs. CREATE, OPEN, and SETATTR
decoders pass pointers to these new parameters, enabling clients
to set POSIX ACLs during object creation or modification.
Signed-off-by: Rick Macklem <rmacklem@uoguelph.ca>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Support for FATTR4_POSIX_ACCESS_ACL and FATTR4_POSIX_DEFAULT_ACL
attributes in subsequent patches allows clients to set both ACL
types simultaneously during SETATTR and file creation. Each ACL
type can succeed or fail independently, requiring the server to
clear individual attribute bits in the reply bitmap when one
fails while the other succeeds.
The existing na_aclerr field cannot distinguish which ACL type
encountered an error. Separate error fields (na_paclerr for
access ACLs, na_dpaclerr for default ACLs) enable the server to
report per-ACL-type failures accurately.
This refactoring also adds validation previously absent: default
ACL processing rejects non-directory targets with EINVAL and
passes NULL to set_posix_acl() when a_count is zero to delete
the ACL. Access ACL processing rejects zero a_count with EINVAL
for ACL_SCOPE_FILE_SYSTEM semantics (the only scope currently
supported).
The changes preserve compatibility with existing NFSv4 ACL code.
NFSv4 ACL conversion (nfs4_acl_nfsv4_to_posix()) never produces
POSIX ACLs with a_count == 0, so the new validation logic only
affects future POSIX ACL attribute handling.
Signed-off-by: Rick Macklem <rmacklem@uoguelph.ca>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
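A sketch of the two ideas in this refactoring: separate error fields per ACL type so that only the failed attribute's bit is cleared in the reply, and the a_count validation rules. The bit positions, struct acl_result, and both helper families are hypothetical; the real attribute numbers come from the draft and the real logic lives in nfsd_setattr().

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bit positions within one bitmap word. */
#define ATTR_POSIX_DEFAULT_ACL	(1u << 0)
#define ATTR_POSIX_ACCESS_ACL	(1u << 1)

/* Stand-ins for the na_paclerr and na_dpaclerr fields. */
struct acl_result {
	int paclerr;	/* access ACL error, 0 on success */
	int dpaclerr;	/* default ACL error, 0 on success */
};

/* Clear only the bits whose ACL type failed. */
static uint32_t reply_bitmap(uint32_t requested, const struct acl_result *r)
{
	if (r->paclerr)
		requested &= ~ATTR_POSIX_ACCESS_ACL;
	if (r->dpaclerr)
		requested &= ~ATTR_POSIX_DEFAULT_ACL;
	return requested;
}

enum acl_action {
	ACL_ACT_SET,	/* apply the ACL */
	ACL_ACT_DELETE,	/* pass NULL to remove the ACL */
	ACL_ACT_REJECT,	/* fail with EINVAL */
};

/* Default ACLs: directories only; an empty ACL means delete. */
static enum acl_action default_acl_action(bool is_dir, unsigned int a_count)
{
	if (!is_dir)
		return ACL_ACT_REJECT;
	return a_count ? ACL_ACT_SET : ACL_ACT_DELETE;
}

/* Access ACLs: an empty ACL is rejected. */
static enum acl_action access_acl_action(unsigned int a_count)
{
	return a_count ? ACL_ACT_SET : ACL_ACT_REJECT;
}
```

With a single shared error field, a failure of one ACL type could not be reported without also discarding the success of the other.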
Section 9.3 of draft-ietf-nfsv4-posix-acls-00 prohibits use of
the POSIX ACL attributes with VERIFY and NVERIFY operations: the
server MUST reply NFS4ERR_INVAL when a client attempts this.
Beyond the protocol requirement, comparison of POSIX draft ACLs
via (N)VERIFY presents an implementation challenge. Clients are
not required to order the ACEs within a POSIX ACL in any
particular way, making reliable attribute comparison impractical.
Return nfserr_inval when the client requests FATTR4_POSIX_ACCESS_ACL
or FATTR4_POSIX_DEFAULT_ACL in a VERIFY or NVERIFY operation.
Signed-off-by: Rick Macklem <rmacklem@uoguelph.ca>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
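The (N)VERIFY restriction reduces to a mask test on the requested attribute bitmap. A minimal sketch, assuming both attributes fall in the same bitmap word; the bit positions and verify_check_attrs() are hypothetical, and -EINVAL stands in for nfserr_inval.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical bit positions within one bitmap word. */
#define ATTR_POSIX_DEFAULT_ACL	(1u << 0)
#define ATTR_POSIX_ACCESS_ACL	(1u << 1)

/* Reject VERIFY/NVERIFY requests that name either POSIX ACL attribute. */
static int verify_check_attrs(uint32_t requested)
{
	if (requested & (ATTR_POSIX_ACCESS_ACL | ATTR_POSIX_DEFAULT_ACL))
		return -EINVAL;
	return 0;
}
```

Rejecting up front avoids any attempt to compare ACLs whose entry order the client is free to choose.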
The POSIX ACL extension to NFSv4 defines FATTR4_POSIX_ACCESS_ACL
for retrieving the access ACL of a file or directory. This patch
adds the XDR encoder for that attribute.
The access ACL is retrieved via get_inode_acl(). If the filesystem
provides no explicit access ACL, one is synthesized from the file
mode via posix_acl_from_mode(). Each entry is encoded as a
posixace4: tag type, permission bits, and principal name (empty
for structural entries, resolved via idmapping for USER/GROUP
entries).
Unlike the default ACL encoder which applies only to directories,
this encoder handles all inode types and ensures an access ACL is
always available through mode-based synthesis when needed.
Signed-off-by: Rick Macklem <rmacklem@uoguelph.ca>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
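The mode-based synthesis mentioned above yields the three structural entries of a minimal POSIX ACL. The sketch below shows that mapping in userspace; struct ace, the tag names and acl_from_mode() are hypothetical stand-ins, not the kernel's posix_acl_from_mode() or the posixace4 wire type.

```c
#include <stddef.h>

enum ace_tag {
	TAG_USER_OBJ,	/* file owner */
	TAG_GROUP_OBJ,	/* owning group */
	TAG_OTHER,	/* everyone else */
};

struct ace {
	enum ace_tag tag;
	unsigned int perm;	/* rwx bits, 0..7 */
};

/* Fill out[] with the three entries implied by the mode bits. */
static size_t acl_from_mode(unsigned int mode, struct ace out[3])
{
	out[0] = (struct ace){ TAG_USER_OBJ, (mode >> 6) & 7 };
	out[1] = (struct ace){ TAG_GROUP_OBJ, (mode >> 3) & 7 };
	out[2] = (struct ace){ TAG_OTHER, mode & 7 };
	return 3;
}
```

For mode 0640 this gives rw- for the owner, r-- for the group and --- for others; none of the three entries carries a principal name, matching the "empty for structural entries" rule above.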