Commit Graph

345 Commits

Author SHA1 Message Date
Chuck Lever
fc15110009 NFSD: Report whether fh_key was actually updated
The nfsd_ctl_fh_key_set tracepoint was introduced to capture
operator activity on the filehandle signing key. Earlier revisions
logged the key bytes verbatim; the version that landed hashes the
16 key bytes through crc32_le and stores the result.

CRC32 is a linear projection of its input rather than a one-way
function, and truncating any hash of fixed-size secret material
leaves the key recoverable under offline brute force when the
threat model includes an attacker with access to the trace ring.

The operational question the fingerprint was meant to answer is
whether a NFSD_CMD_THREADS_SET call that carries an
NFSD_A_SERVER_FH_KEY attribute actually replaced the active key or
re-installed the value already in place. Answer that question
directly: compare the incoming key bytes against the current
nn->fh_key inside nfsd_nl_fh_key_set() and surface a single bit to
the tracepoint. The event now prints "updated" when the stored
key changed and "unmodified" otherwise. A first set that fails
kmalloc reports "unmodified" because no key was installed.

Reported-by: jaeyeong <fin@spl.team>
Fixes: 62346217fd ("NFSD: Add a key for signing filehandles")
Cc: Benjamin Coddington <bcodding@hammerspace.com>
Reviewed-by: Benjamin Coddington <bcodding@hammerspace.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2026-05-21 17:08:47 -04:00
Dai Ngo
42cc139959 NFSD: convert callback RPC program to per-net namespace
The callback channel's rpc_program, rpc_version, rpc_stat,
and per-procedure counts are declared as file-scope statics in
nfs4callback.c, shared across all network namespaces.
Forechannel RPC statistics are already maintained per-netns
(via nfsd_svcstats in struct nfsd_net); the backchannel
has no such separation. When backchannel statistics are
eventually surfaced to userspace, the global counters would
expose cross-namespace data.

Allocate per-netns copies of these structures through a new
opaque struct nfsd_net_cb, managed by nfsd_net_cb_init()
and nfsd_net_cb_shutdown(). The struct definition is private
to nfs4callback.c; struct nfsd_net holds only a pointer.

Signed-off-by: Dai Ngo <dai.ngo@oracle.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2026-04-03 09:27:52 -04:00
Benjamin Coddington
62346217fd NFSD: Add a key for signing filehandles
A future patch will enable NFSD to sign filehandles by appending a Message
Authentication Code(MAC).  To do this, NFSD requires a secret 128-bit key
that can persist across reboots.  A persisted key allows the server to
accept filehandles after a restart.  Enable NFSD to be configured with this
key via the netlink interface.

Link: https://lore.kernel.org/linux-nfs/cover.1772022373.git.bcodding@hammerspace.com
Signed-off-by: Benjamin Coddington <bcodding@hammerspace.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2026-03-29 21:25:09 -04:00
Chuck Lever
efb5b15e3b lockd: Relocate nlmsvc_unlock API declarations
The nlmsvc_unlock_all_by_sb() and nlmsvc_unlock_all_by_ip()
functions are part of lockd's external API, consumed by other
kernel subsystems. Their declarations currently reside in
linux/lockd/lockd.h alongside internal implementation details,
which blurs the boundary between lockd's public interface and
its private internals.

Moving these declarations to linux/lockd/bind.h groups them
with other external API functions and makes the separation
explicit. This clarifies which functions are intended for
external use and reduces the risk of internal implementation
details leaking into the public API surface.

Build-tested with allyesconfig; no functional changes.

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2026-03-29 21:25:09 -04:00
Linus Torvalds
8a30aeb0d1 nfsd-7.0 fixes:
Issues that need expedient stable backports:
 - Fix cache_request leak in cache_release()
 - Fix heap overflow in the NFSv4.0 LOCK replay cache
 - Hold net reference for the lifetime of /proc/fs/nfs/exports fd
 - Defer sub-object cleanup in export "put" callbacks
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEKLLlsBKG3yQ88j7+M2qzM29mf5cFAmm7ELoACgkQM2qzM29m
 f5fbMQ/6AjYdEQh56X2G1Y899zsvT4jfOZSc8dYjxK6seNZLQBCOz54w4aRo0TmP
 keYIew8w2atCwWAlYT1xXqJVt90EG36fGodnw3EN+0g3nxPsIy1JeZwTUz1xagaI
 hDbFwo6bN4HxU457/XxPO4jNdvpztq8hbTdRkXsD/Ckh2Db1juKkTQ+kX0rCxL5s
 xZPDgKCsTQeFjfs+gdnbyEixc8vnQMAiUP15Df+HQdwCGD62meQ1S0BBVywRhCAK
 FoufgPRnCzB189PKYCpivCNSImeSasQ4cS3WYi1i9ZB3OvEzRnqaPAvvRWQTwWfs
 7IIekorKagCvXbqEt3dMQn7UaVyFLgV8OMR04JGqpI05GylNBQVONty/BKzQVTdH
 Hp2C9PCitoPC68UabQZ22rCH8zpMREk+sH785ztLyuKGgC09YLTkxrltHllzKWAQ
 k5DkeTmySVeobpif4urQKHyxhWZ//ah0MJOsSE4XcPMCWk7RPshj4tZyzvXdbuR1
 IZQbOSruUd9aaZ4Q9J8D66oVyBatq9RFP4yxxR7L3CLSXJUsWK0AriEY9EZAeUe7
 GeOaiUJ34F2oE4FfF9XaTmsXG9EuXtps6PlYDlHjlSyRJyg3detTJP4YeKJCrlQC
 x+x7DN5gN2ZUuR+vqlS1BWGm24usmeNBPqvZ2hi6d+NpPgcLoUk=
 =xX5n
 -----END PGP SIGNATURE-----

Merge tag 'nfsd-7.0-2' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux

Pull nfsd fixes from Chuck Lever:

 - Fix cache_request leak in cache_release()

 - Fix heap overflow in the NFSv4.0 LOCK replay cache

 - Hold net reference for the lifetime of /proc/fs/nfs/exports fd

 - Defer sub-object cleanup in export "put" callbacks

* tag 'nfsd-7.0-2' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux:
  nfsd: fix heap overflow in NFSv4.0 LOCK replay cache
  sunrpc: fix cache_request leak in cache_release
  NFSD: Hold net reference for the lifetime of /proc/fs/nfs/exports fd
  NFSD: Defer sub-object cleanup in export put callbacks
2026-03-18 14:27:11 -07:00
Chuck Lever
e7fcf179b8 NFSD: Hold net reference for the lifetime of /proc/fs/nfs/exports fd
The /proc/fs/nfs/exports proc entry is created at module init
and persists for the module's lifetime. exports_proc_open()
captures the caller's current network namespace and stores
its svc_export_cache in seq->private, but takes no reference
on the namespace. If the namespace is subsequently torn down
(e.g. container destruction after the opener does setns() to a
different namespace), nfsd_net_exit() calls nfsd_export_shutdown()
which frees the cache. Subsequent reads on the still-open fd
dereference the freed cache_detail, walking a freed hash table.

Hold a reference on the struct net for the lifetime of the open
file descriptor. This prevents nfsd_net_exit() from running --
and thus prevents nfsd_export_shutdown() from freeing the cache
-- while any exports fd is open. cache_detail already stores
its net pointer (cd->net, set by cache_create_net()), so
exports_release() can retrieve it without additional per-file
storage.

Reported-by: Misbah Anjum N <misanjum@linux.ibm.com>
Closes: https://lore.kernel.org/linux-nfs/dcd371d3a95815a84ba7de52cef447b8@linux.ibm.com/
Fixes: 96d851c4d2 ("nfsd: use proper net while reading "exports" file")
Cc: stable@vger.kernel.org
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: NeilBrown <neil@brown.name>
Tested-by: Olga Kornievskaia <okorniev@redhat.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2026-03-14 11:34:25 -04:00
Chuck Lever
48db892356 NFSD: Defer sub-object cleanup in export put callbacks
svc_export_put() calls path_put() and auth_domain_put() immediately
when the last reference drops, before the RCU grace period. RCU
readers in e_show() and c_show() access both ex_path (via
seq_path/d_path) and ex_client->name (via seq_escape) without
holding a reference. If cache_clean removes the entry and drops the
last reference concurrently, the sub-objects are freed while still
in use, producing a NULL pointer dereference in d_path.

Commit 2530766492 ("nfsd: fix UAF when access ex_uuid or
ex_stats") moved kfree of ex_uuid and ex_stats into the
call_rcu callback, but left path_put() and auth_domain_put() running
before the grace period because both may sleep and call_rcu
callbacks execute in softirq context.

Replace call_rcu/kfree_rcu with queue_rcu_work(), which defers the
callback until after the RCU grace period and executes it in process
context where sleeping is permitted. This allows path_put() and
auth_domain_put() to be moved into the deferred callback alongside
the other resource releases. Apply the same fix to expkey_put(),
which has the identical pattern with ek_path and ek_client.

A dedicated workqueue scopes the shutdown drain to only NFSD
export release work items; flushing the shared
system_unbound_wq would stall on unrelated work from other
subsystems. nfsd_export_shutdown() uses rcu_barrier() followed
by flush_workqueue() to ensure all deferred release callbacks
complete before the export caches are destroyed.

Reported-by: Misbah Anjum N <misanjum@linux.ibm.com>
Closes: https://lore.kernel.org/linux-nfs/dcd371d3a95815a84ba7de52cef447b8@linux.ibm.com/
Fixes: c224edca7a ("nfsd: no need get cache ref when protected by rcu")
Fixes: 1b10f0b603 ("SUNRPC: no need get cache ref when protected by rcu")
Cc: stable@vger.kernel.org
Reviwed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: NeilBrown <neil@brown.name>
Tested-by: Olga Kornievskaia <okorniev@redhat.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2026-03-14 11:34:25 -04:00
Linus Torvalds
1b37ac211a nfsd-7.0 fixes:
NFSD fixes that arrived too late for the 7.0 merge window.
 
 Fixes for commits merged in 7.0:
 - Restore previous nfsd thread count reporting behavior
 
 Issues that need expedient stable backports:
 - Fix credential reference leaks in the NFSD netlink admin protocol
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEKLLlsBKG3yQ88j7+M2qzM29mf5cFAmmlmfQACgkQM2qzM29m
 f5ckDA//eSSeZG+Ld2MX+DrYH7aYUSQzlUJ7ENKpt904tw6qy8o3dPFg5gcF2ZKQ
 2cK05l0G96eaYYZu/nto3RdS7lx3iDgQuWq9KorQMYXoCad/tNl8RYC8HiuH+aDu
 Fir7RApMknXe54Mz7uiPaZBUZHkb+hqe9wHOVJkZyMlRMYNAtijsI4wfaY9a5ACK
 dh03lCMOJyU3emBXizNsZ9lysuRbPVpHQEmcZsJUTnA7f6xcCTF/CyEtxjCHX9Z5
 KZ0Ltb/kG9V1VFyuGAm1S0dQmAKbl2WUo5k5eslRXmHxFx072BFOpwXlr4qd4yWt
 zjY9VY5q0anXWNgwz1U897R5xDfx43C+OdnRcMxWF7bRnNmNyCNeXnYUgSuh4HYF
 Y2IHBJk9HXSlxeiSZAq45lDgNOfg5ZBgGVVfcuKqUxgcCqG5r56FqGFkJiPvuDiI
 CEW6dIn7OQuUzDnSK0vXWFR1KGu39nKaunJHAq2BTLxbW42K5EPDw+Vhibym2LQG
 uSsBNHtviWKMONkb3jrkK5sIZryL07M/fLsYKYkSmF/B1XVwtvZHGG2k7qCCCM7B
 5IKjAFeFCRqiyYO8lm3dhLz/SbH5jpqUb3V7OpxAytk8FAEsUGX5y8fu43rfQCZD
 g2spjlrtoAhg3dSsAmrw9bDSs2TdAagWkSj1NfNgbsJs2irEgyU=
 =kHSn
 -----END PGP SIGNATURE-----

Merge tag 'nfsd-7.0-1' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux

Pull nfsd fixes from Chuck Lever:

 - Restore previous nfsd thread count reporting behavior

 - Fix credential reference leaks in the NFSD netlink admin protocol

* tag 'nfsd-7.0-1' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux:
  nfsd: report the requested maximum number of threads instead of number running
  nfsd: Fix cred ref leak in nfsd_nl_listener_set_doit().
  nfsd: Fix cred ref leak in nfsd_nl_threads_set_doit().
2026-03-02 09:05:20 -08:00
Jeff Layton
364410170a nfsd: report the requested maximum number of threads instead of number running
The current netlink and /proc interfaces deviate from their traditional
values when dynamic threading is enabled, and there is currently no way
to know what the current setting is. This patch brings the reporting
back in line with traditional behavior.

Make these interfaces report the requested maximum number of threads
instead of the number currently running. Also, update documentation and
comments to reflect that this value represents a maximum and not the
number currently running.

Fixes: d8316b837c ("nfsd: add controls to set the minimum number of threads per pool")
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2026-02-24 10:27:51 -05:00
Linus Torvalds
bf4afc53b7 Convert 'alloc_obj' family to use the new default GFP_KERNEL argument
This was done entirely with mindless brute force, using

    git grep -l '\<k[vmz]*alloc_objs*(.*, GFP_KERNEL)' |
        xargs sed -i 's/\(alloc_objs*(.*\), GFP_KERNEL)/\1)/'

to convert the new alloc_obj() users that had a simple GFP_KERNEL
argument to just drop that argument.

Note that due to the extreme simplicity of the scripting, any slightly
more complex cases spread over multiple lines would not be triggered:
they definitely exist, but this covers the vast bulk of the cases, and
the resulting diff is also then easier to check automatically.

For the same reason the 'flex' versions will be done as a separate
conversion.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2026-02-21 17:09:51 -08:00
Kees Cook
69050f8d6d treewide: Replace kmalloc with kmalloc_obj for non-scalar types
This is the result of running the Coccinelle script from
scripts/coccinelle/api/kmalloc_objs.cocci. The script is designed to
avoid scalar types (which need careful case-by-case checking), and
instead replace kmalloc-family calls that allocate struct or union
object instances:

Single allocations:	kmalloc(sizeof(TYPE), ...)
are replaced with:	kmalloc_obj(TYPE, ...)

Array allocations:	kmalloc_array(COUNT, sizeof(TYPE), ...)
are replaced with:	kmalloc_objs(TYPE, COUNT, ...)

Flex array allocations:	kmalloc(struct_size(PTR, FAM, COUNT), ...)
are replaced with:	kmalloc_flex(*PTR, FAM, COUNT, ...)

(where TYPE may also be *VAR)

The resulting allocations no longer return "void *", instead returning
"TYPE *".

Signed-off-by: Kees Cook <kees@kernel.org>
2026-02-21 01:02:28 -08:00
Kuniyuki Iwashima
92978c83bb nfsd: Fix cred ref leak in nfsd_nl_listener_set_doit().
nfsd_nl_listener_set_doit() uses get_current_cred() without
put_cred().

As we can see from other callers, svc_xprt_create_from_sa()
does not require the extra refcount.

nfsd_nl_listener_set_doit() is always in the process context,
sendmsg(), and current->cred does not go away.

Let's use current_cred() in nfsd_nl_listener_set_doit().

Fixes: 16a4711774 ("NFSD: add listener-{set,get} netlink command")
Cc: stable@vger.kernel.org
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2026-02-14 12:50:24 -05:00
Kuniyuki Iwashima
1cb968a201 nfsd: Fix cred ref leak in nfsd_nl_threads_set_doit().
syzbot reported memory leak of struct cred. [0]

nfsd_nl_threads_set_doit() passes get_current_cred() to
nfsd_svc(), but put_cred() is not called after that.

The cred is finally passed down to _svc_xprt_create(),
which calls get_cred() with the cred for struct svc_xprt.

The ownership of the refcount by get_current_cred() is not
transferred to anywhere and is just leaked.

nfsd_svc() is also called from write_threads(), but it does
not bump file->f_cred there.

nfsd_nl_threads_set_doit() is called from sendmsg() and
current->cred does not go away.

Let's use current_cred() in nfsd_nl_threads_set_doit().

[0]:
BUG: memory leak
unreferenced object 0xffff888108b89480 (size 184):
  comm "syz-executor", pid 5994, jiffies 4294943386
  hex dump (first 32 bytes):
    01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
  backtrace (crc 369454a7):
    kmemleak_alloc_recursive include/linux/kmemleak.h:44 [inline]
    slab_post_alloc_hook mm/slub.c:4958 [inline]
    slab_alloc_node mm/slub.c:5263 [inline]
    kmem_cache_alloc_noprof+0x412/0x580 mm/slub.c:5270
    prepare_creds+0x22/0x600 kernel/cred.c:185
    copy_creds+0x44/0x290 kernel/cred.c:286
    copy_process+0x7a7/0x2870 kernel/fork.c:2086
    kernel_clone+0xac/0x6e0 kernel/fork.c:2651
    __do_sys_clone+0x7f/0xb0 kernel/fork.c:2792
    do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
    do_syscall_64+0xa4/0xf80 arch/x86/entry/syscall_64.c:94
    entry_SYSCALL_64_after_hwframe+0x77/0x7f

Fixes: 924f4fb003 ("NFSD: convert write_threads to netlink command")
Cc: stable@vger.kernel.org
Reported-by: syzbot+dd3b43aa0204089217ee@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/69744674.a00a0220.33ccc7.0000.GAE@google.com/
Tested-by: syzbot+dd3b43aa0204089217ee@syzkaller.appspotmail.com
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2026-02-14 12:48:51 -05:00
Chuck Lever
3daab3112f nfsd: cancel async COPY operations when admin revokes filesystem state
Async COPY operations hold copy stateids that represent NFSv4 state.
Thus, when the NFS server administrator revokes all NFSv4 state for
a filesystem via the unlock_fs interface, ongoing async COPY
operations referencing that filesystem must also be canceled.

Each cancelled copy triggers a CB_OFFLOAD callback carrying the
NFS4ERR_ADMIN_REVOKED status to notify the client that the server
terminated the operation.

The static drop_client() function is renamed to nfsd4_put_client()
and exported. The function must be exported because both the new
nfsd4_cancel_copy_by_sb() and the CB_OFFLOAD release callback in
nfs4proc.c need to release client references.

Reviewed-by: NeilBrown <neil@brown.name>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2026-01-28 10:15:42 -05:00
Jeff Layton
d8316b837c nfsd: add controls to set the minimum number of threads per pool
Add a new "min_threads" variable to the nfsd_net, along with the
corresponding netlink interface, to set that value from userland.
Pass that value to svc_set_pool_threads() and svc_set_num_threads().

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2026-01-28 10:15:42 -05:00
Linus Torvalds
f0b9d8eb98 nfsd-6.19 fixes:
A set of NFSD fixes that arrived after the 6.19 merge window.
 
 Issues that need expedient stable backports:
 - Remove an invalid NFS status code
 - Fix an fstests failure when using pNFS
 - Fix a UAF in v4_end_grace()
 - Fix the administrative interface used to revoke NFSv4 state
 - Fix a memory leak reported by syzbot
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEKLLlsBKG3yQ88j7+M2qzM29mf5cFAmlb2WUACgkQM2qzM29m
 f5cqaA/+MbO1kop63/TiNE0tRc34yTBnApg1XVza4vSmcpSgpB8ZKGZ5xOjnRpwg
 yBw9+/puEJhyogPE6JKEGnLiFr+s3ApInFHaxnXnrGZz1RR1qkqfioKudIcpC0s1
 /pKx7y/fktltgo/5Dl0gp2QH3Oytg375ge+dcSQbopSTQYPbsAw7AmoHDPBQd8Nr
 Q/pIu1q/tAM8R2zyijU3eAiUMyYRCrxNVYnlsdYmj7Dn0ypybOyKufkpVCEaS3kO
 a7SV/QSVKdNbZOf8annwAhW+VN4urFmA9nnnr/yirrLJ0i2h18E0txrPFBszhftf
 xpOvaDR7okfEvzqwrHvVfRsqB4nYq9f0TSvvpPsS8vCtq34pWKZPa6iiSxeVL/jb
 EmFtiesUWClZzTIQSpUdbuU80cST6WEoNJJKDPZwF1XbA2navsDqgxKiYxsczjt6
 M5SStHcafK5LrXPruqOhfco/uKTmHNJJlvBWxUGCMQEDvdXdEJ4MIlg8VxxvoWPR
 FQDwU+iSdPOwlG7L3Tl9/PGSNe0MxJSgvzK6JNoKL3LvDx80FtMErWxPJdqdIL0+
 RpBsW7zaCyX9lwD866Frs4K2H1w2XFeQjOMI0Pz1SG9dZ8NoKJ+lzcwVY7GgHUvq
 NUNJLzL6MVCHytwTfqrSY7PGvUCrDqR102FQusQyplT4edcUv0M=
 =okQh
 -----END PGP SIGNATURE-----

Merge tag 'nfsd-6.19-3' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux

Pull nfsd fixes from Chuck Lever:
 "A set of NFSD fixes for stable that arrived after the merge window:

   - Remove an invalid NFS status code

   - Fix an fstests failure when using pNFS

   - Fix a UAF in v4_end_grace()

   - Fix the administrative interface used to revoke NFSv4 state

   - Fix a memory leak reported by syzbot"

* tag 'nfsd-6.19-3' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux:
  NFSD: net ref data still needs to be freed even if net hasn't startup
  nfsd: check that server is running in unlock_filesystem
  nfsd: use correct loop termination in nfsd4_revoke_states()
  nfsd: provide locking for v4_end_grace
  NFSD: Fix permission check for read access to executable-only files
  NFSD: Remove NFSERR_EAGAIN
2026-01-06 09:12:52 -08:00
Olga Kornievskaia
d0424066fc nfsd: check that server is running in unlock_filesystem
If we are trying to unlock the filesystem via an administrative
interface and nfsd isn't running, it crashes the server. This
happens currently because nfsd4_revoke_states() access state
structures (eg., conf_id_hashtbl) that has been freed as a part
of the server shutdown.

[   59.465072] Call trace:
[   59.465308]  nfsd4_revoke_states+0x1b4/0x898 [nfsd] (P)
[   59.465830]  write_unlock_fs+0x258/0x440 [nfsd]
[   59.466278]  nfsctl_transaction_write+0xb0/0x120 [nfsd]
[   59.466780]  vfs_write+0x1f0/0x938
[   59.467088]  ksys_write+0xfc/0x1f8
[   59.467395]  __arm64_sys_write+0x74/0xb8
[   59.467746]  invoke_syscall.constprop.0+0xdc/0x1e8
[   59.468177]  do_el0_svc+0x154/0x1d8
[   59.468489]  el0_svc+0x40/0xe0
[   59.468767]  el0t_64_sync_handler+0xa0/0xe8
[   59.469138]  el0t_64_sync+0x1ac/0x1b0

Ensure this can't happen by taking the nfsd_mutex and checking that
the server is still up, and then holding the mutex across the call to
nfsd4_revoke_states().

Reviewed-by: NeilBrown <neil@brown.name>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Fixes: 1ac3629bf0 ("nfsd: prepare for supporting admin-revocation of state")
Cc: stable@vger.kernel.org
Signed-off-by: Olga Kornievskaia <okorniev@redhat.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2026-01-02 13:49:55 -05:00
NeilBrown
2857bd59fe nfsd: provide locking for v4_end_grace
Writing to v4_end_grace can race with server shutdown and result in
memory being accessed after it was freed - reclaim_str_hashtbl in
particularly.

We cannot hold nfsd_mutex across the nfsd4_end_grace() call as that is
held while client_tracking_op->init() is called and that can wait for
an upcall to nfsdcltrack which can write to v4_end_grace, resulting in a
deadlock.

nfsd4_end_grace() is also called by the landromat work queue and this
doesn't require locking as server shutdown will stop the work and wait
for it before freeing anything that nfsd4_end_grace() might access.

However, we must be sure that writing to v4_end_grace doesn't restart
the work item after shutdown has already waited for it.  For this we
add a new flag protected with nn->client_lock.  It is set only while it
is safe to make client tracking calls, and v4_end_grace only schedules
work while the flag is set with the spinlock held.

So this patch adds a nfsd_net field "client_tracking_active" which is
set as described.  Another field "grace_end_forced", is set when
v4_end_grace is written.  After this is set, and providing
client_tracking_active is set, the laundromat is scheduled.
This "grace_end_forced" field bypasses other checks for whether the
grace period has finished.

This resolves a race which can result in use-after-free.

Reported-by: Li Lingfeng <lilingfeng3@huawei.com>
Closes: https://lore.kernel.org/linux-nfs/20250623030015.2353515-1-neil@brown.name/T/#t
Fixes: 7f5ef2e900 ("nfsd: add a v4_end_grace file to /proc/fs/nfsd")
Cc: stable@vger.kernel.org
Signed-off-by: NeilBrown <neil@brown.name>
Tested-by: Li Lingfeng <lilingfeng3@huawei.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2026-01-02 13:48:22 -05:00
Al Viro
0c9a266868 convert nfsctl
One instance per net-ns.  There's a fixed subset (several files in root,
an optional symlink in root + initially empty /clients/) + per-client
subdirectory in /clients/.  Clients can appear only after the filesystem
is there and they are all gone before it gets through ->kill_sb().

Fixed subset created in fill_super(), regular files by simple_fill_super(),
then a subdirectory and a symlink - manually.  It is removed by
kill_litter_super().

Per-client subdirectories are created by nfsd_client_mkdir() (populated
with client-supplied list of files in them).  Removed by nfsd_client_rmdir(),
which is simple_recursive_removal().

All dentries except for the ones from simple_fill_super() come from
	* nfsd_mkdir() (subdirectory, dentry from simple_start_creating()).
	  Called from fill_super() (creates initially empty /clients)
	  and from nfsd_client_mkdir (creates a per-client subdirectory
	  in /clients).
	* _nfsd_symlink() (symlink, dentry from simple_start_creating()), called
	  from fill_super().
	* nfsdfs_create_files() (regulars, dentry from simple_start_creating()),
	  called only from nfsd_client_mkdir().

Turn d_instatiate() + inode_unlock() into d_make_persistent() + simple_done_creating()
in nfsd_mkdir(), _nfsd_symlink() and nfsdfs_create_files() and we are done.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-11-17 23:59:27 -05:00
Linus Torvalds
070a542f08 NFS Client Updates for Linux 6.18
New Features:
  * Add a Kconfig option to redirect dfprintk() to the trace buffer
  * Enable use of the RWF_DONTCACHE flag on the NFS client
  * Add striped layout handling to pNFS flexfiles
  * Add proper localio handling for READ and WRITE O_DIRECT
 
 Bugfixes:
  * Handle NFS4ERR_GRACE errors during delegation recall
  * Fix NFSv4.1 backchannel max_resp_sz verification check
  * Fix mount hang after CREATE_SESSION failure
  * Fix d_parent->d_inode locking in nfs4_setup_readdir()
 
 Other Cleanups and Improvements:
  * Improvements to write handling tracepoints
  * Fix a few trivial spelling mistakes
  * Cleanups to the rpcbind cleanup call sites
  * Convert the SUNRPC xdr_buf to use a scratch folio instead of scratch page
  * Remove unused NFS_WBACK_BUSY() macro
  * Remove __GFP_NOWARN flags
  * Unexport rpc_malloc() and rpc_free()
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEnZ5MQTpR7cLU7KEp18tUv7ClQOsFAmjgOEkACgkQ18tUv7Cl
 QOuX7RAA33AUq4NBxzDOgz4u4eNU/a8z2AazRgAtfmPVLTitrx/kqcfVEtHAdHFi
 cWkN2SO+TVzxIGOrudqNyjV2cjfUJV4ZJBkNY6lJvxPNAH27Dk9P2iMF12QYtHOq
 qyrwqoUQcyBkmtpgFUyHzydA4J17JDl5A7I/tOkro3ZfV4gmYAUwVdS+VtJoosLp
 7FnXv+W5FBWkfKrIT+vPyiBqxl0gZXmzUkJK2lG9m9NvE2Jk2MbPFyhdUEA5JybJ
 akNLdBnFwNWw2rLulSqs68ZbCGz6NY634q1Z+ZsRJ907ZdBqJ7zIBFv4yc/bMpZm
 Q9kh1M0OyvK0MlLRFe3efOLxRoN9nJPd+kuaw9eYw5V57Jrwj6QGV4nud2C8nzs8
 iB+LuJli+FRCeD84SY8NnMFKpXphHCeMXcBMRMsLTOSotJZFithO95+w1pKlK64A
 lxY1JXOQYelwJZxfhGPovwac4t1arpDjsumRlTmq12KaQnM3Z1gR2PUgeLxEPHQM
 f6gEiN9KDOhW/gZrFQxNs2hVAH68RDKpWxeR2XeVJlJYf37Hgh8bNGEiURi3G57n
 ED7tFbK9lzHVFR07hiP3Cvzop4z2mzadgHo+1vzdXmZZQA4gc4MSFfszFLCnQopw
 LEb7RqpVVXtQb+f7A+LuD+a2rLEnW+gTf6iLqCR8hAB5k1xmcYQ=
 =8wnU
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-6.18-1' of git://git.linux-nfs.org/projects/anna/linux-nfs

Pull NFS client updates from Anna Schumaker:
 "New Features:
   - Add a Kconfig option to redirect dfprintk() to the trace buffer
   - Enable use of the RWF_DONTCACHE flag on the NFS client
   - Add striped layout handling to pNFS flexfiles
   - Add proper localio handling for READ and WRITE O_DIRECT

  Bugfixes:
   - Handle NFS4ERR_GRACE errors during delegation recall
   - Fix NFSv4.1 backchannel max_resp_sz verification check
   - Fix mount hang after CREATE_SESSION failure
   - Fix d_parent->d_inode locking in nfs4_setup_readdir()

  Other Cleanups and Improvements:
   - Improvements to write handling tracepoints
   - Fix a few trivial spelling mistakes
   - Cleanups to the rpcbind cleanup call sites
   - Convert the SUNRPC xdr_buf to use a scratch folio instead of
     scratch page
   - Remove unused NFS_WBACK_BUSY() macro
   - Remove __GFP_NOWARN flags
   - Unexport rpc_malloc() and rpc_free()"

* tag 'nfs-for-6.18-1' of git://git.linux-nfs.org/projects/anna/linux-nfs: (46 commits)
  NFS: add basic STATX_DIOALIGN and STATX_DIO_READ_ALIGN support
  nfs/localio: add tracepoints for misaligned DIO READ and WRITE support
  nfs/localio: add proper O_DIRECT support for READ and WRITE
  nfs/localio: refactor iocb initialization
  nfs/localio: refactor iocb and iov_iter_bvec initialization
  nfs/localio: avoid issuing misaligned IO using O_DIRECT
  nfs/localio: make trace_nfs_local_open_fh more useful
  NFSD: filecache: add STATX_DIOALIGN and STATX_DIO_READ_ALIGN support
  sunrpc: unexport rpc_malloc() and rpc_free()
  NFSv4/flexfiles: Add support for striped layouts
  NFSv4/flexfiles: Update layout stats & error paths for striped layouts
  NFSv4/flexfiles: Write path updates for striped layouts
  NFSv4/flexfiles: Commit path updates for striped layouts
  NFSv4/flexfiles: Read path updates for striped layouts
  NFSv4/flexfiles: Update low level helper functions to be DS stripe aware.
  NFSv4/flexfiles: Add data structure support for striped layouts
  NFSv4/flexfiles: Use ds_commit_idx when marking a write commit
  NFSv4/flexfiles: Remove cred local variable dependency
  nfs4_setup_readdir(): insufficient locking for ->d_parent->d_inode dereferencing
  NFS: Enable use of the RWF_DONTCACHE flag on the NFS client
  ...
2025-10-03 14:20:40 -07:00
Chuck Lever
62c0c0e749 SUNRPC: Move the svc_rpcb_cleanup() call sites
Clean up: because svc_rpcb_cleanup() and svc_xprt_destroy_all()
are always invoked in pairs, we can deduplicate code by moving
the svc_rpcb_cleanup() call sites into svc_xprt_destroy_all().

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Olga Kornievskaia <okorniev@redhat.com>
Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
2025-09-23 13:28:19 -04:00
Al Viro
92003056e5 nfsd_get_inode(): lift setting ->i_{,f}op to callers.
Reviewed-by: NeilBrown <neil@brown.name>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-09-17 19:40:40 -04:00
Al Viro
e59f06bb5d nfsdfs_create_files(): switch to simple_start_creating()
Reviewed-by: NeilBrown <neil@brown.name>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-09-17 19:40:40 -04:00
Al Viro
a6281a7043 _nfsd_symlink(): switch to simple_start_creating()
Reviewed-by: NeilBrown <neil@brown.name>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-09-17 19:40:40 -04:00
Al Viro
6b614c56cd nfsd_mkdir(): switch to simple_start_creating()
... and fold __nfsd_mkdir() into it.

Reviewed-by: NeilBrown <neil@brown.name>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-09-17 19:40:40 -04:00
Al Viro
7e6135fe7b nfsctl: symlink has no business bumping link count of parent directory
mkdir should incrment the parent's refcount; symlink should not.

Reviewed-by: NeilBrown <neil@brown.name>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-09-17 19:40:40 -04:00
Linus Torvalds
8be4d31cb8 Networking changes for 6.17.
Core & protocols
 ----------------
 
  - Wrap datapath globals into net_aligned_data, to avoid false sharing.
 
  - Preserve MSG_ZEROCOPY in forwarding (e.g. out of a container).
 
  - Add SO_INQ and SCM_INQ support to AF_UNIX.
 
  - Add SIOCINQ support to AF_VSOCK.
 
  - Add TCP_MAXSEG sockopt to MPTCP.
 
  - Add IPv6 force_forwarding sysctl to enable forwarding per interface.
 
  - Make TCP validation of whether packet fully fits in the receive
    window and the rcv_buf more strict. With increased use of HW
    aggregation a single "packet" can be multiple 100s of kB.
 
  - Add MSG_MORE flag to optimize large TCP transmissions via sockmap,
    improves latency up to 33% for sockmap users.
 
  - Convert TCP send queue handling from tasklet to BH workque.
 
  - Improve BPF iteration over TCP sockets to see each socket exactly once.
 
  - Remove obsolete and unused TCP RFC3517/RFC6675 loss recovery code.
 
  - Support enabling kernel threads for NAPI processing on per-NAPI
    instance basis rather than a whole device. Fully stop the kernel NAPI
    thread when threaded NAPI gets disabled. Previously thread would stick
    around until ifdown due to tricky synchronization.
 
  - Allow multicast routing to take effect on locally-generated packets.
 
  - Add output interface argument for End.X in segment routing.
 
  - MCTP: add support for gateway routing, improve bind() handling.
 
  - Don't require rtnl_lock when fetching an IPv6 neighbor over Netlink.
 
  - Add a new neighbor flag ("extern_valid"), which cedes refresh
    responsibilities to userspace. This is needed for EVPN multi-homing
    where a neighbor entry for a multi-homed host needs to be synced
    across all the VTEPs among which the host is multi-homed.
 
  - Support NUD_PERMANENT for proxy neighbor entries.
 
  - Add a new queuing discipline for IETF RFC9332 DualQ Coupled AQM.
 
  - Add sequence numbers to netconsole messages. Unregister netconsole's
    console when all net targets are removed. Code refactoring.
    Add a number of selftests.
 
  - Align IPSec inbound SA lookup to RFC 4301. Only SPI and protocol
    should be used for an inbound SA lookup.
 
  - Support inspecting ref_tracker state via DebugFS.
 
  - Don't force bonding advertisement frames tx to ~333 ms boundaries.
    Add broadcast_neighbor option to send ARP/ND on all bonded links.
 
  - Allow providing upcall pid for the 'execute' command in openvswitch.
 
  - Remove DCCP support from Netfilter's conntrack.
 
  - Disallow multiple packet duplications in the queuing layer.
 
  - Prevent use of deprecated iptables code on PREEMPT_RT.
 
 Driver API
 ----------
 
  - Support RSS and hashing configuration over ethtool Netlink.
 
  - Add dedicated ethtool callbacks for getting and setting hashing fields.
 
  - Add support for power budget evaluation strategy in PSE /
    Power-over-Ethernet. Generate Netlink events for overcurrent etc.
 
  - Support DPLL phase offset monitoring across all device inputs.
    Support providing clock reference and SYNC over separate DPLL
    inputs.
 
  - Support traffic classes in devlink rate API for bandwidth management.
 
  - Remove rtnl_lock dependency from UDP tunnel port configuration.
 
 Device drivers
 --------------
 
  - Add a new Broadcom driver for 800G Ethernet (bnge).
 
  - Add a standalone driver for Microchip ZL3073x DPLL.
 
  - Remove IBM's NETIUCV device driver.
 
  - Ethernet high-speed NICs:
    - Broadcom (bnxt):
     - support zero-copy Tx of DMABUF memory
     - take page size into account for page pool recycling rings
    - Intel (100G, ice, idpf):
      - idpf: XDP and AF_XDP support preparations
      - idpf: add flow steering
      - add link_down_events statistic
      - clean up the TSPLL code
      - preparations for live VM migration
    - nVidia/Mellanox:
     - support zero-copy Rx/Tx interfaces (DMABUF and io_uring)
     - optimize context memory usage for matchers
     - expose serial numbers in devlink info
     - support PCIe congestion metrics
    - Meta (fbnic):
      - add 25G, 50G, and 100G link modes to phylink
      - support dumping FW logs
    - Marvell/Cavium:
      - support for CN20K generation of the Octeon chips
    - Amazon:
      - add HW clock (without timestamping, just hypervisor time access)
 
  - Ethernet virtual:
    - VirtIO net:
      - support segmentation of UDP-tunnel-encapsulated packets
    - Google (gve):
      - support packet timestamping and clock synchronization
    - Microsoft vNIC:
      - add handler for device-originated servicing events
      - allow dynamic MSI-X vector allocation
      - support Tx bandwidth clamping
 
  - Ethernet NICs consumer, and embedded:
    - AMD:
      - amd-xgbe: hardware timestamping and PTP clock support
    - Broadcom integrated MACs (bcmgenet, bcmasp):
      - use napi_complete_done() return value to support NAPI polling
      - add support for re-starting auto-negotiation
    - Broadcom switches (b53):
      - support BCM5325 switches
      - add bcm63xx EPHY power control
    - Synopsys (stmmac):
      - lots of code refactoring and cleanups
    - TI:
      - icssg-prueth: read firmware-names from device tree
      - icssg: PRP offload support
    - Microchip:
      - lan78xx: convert to PHYLINK for improved PHY and MAC management
      - ksz: add KSZ8463 switch support
    - Intel:
      - support similar queue priority scheme in multi-queue and
        time-sensitive networking (taprio)
      - support packet pre-emption in both
    - RealTek (r8169):
      - enable EEE at 5Gbps on RTL8126
    - Airoha:
      - add PPPoE offload support
      - MDIO bus controller for Airoha AN7583
 
  - Ethernet PHYs:
    - support for the IPQ5018 internal GE PHY
    - micrel KSZ9477 switch-integrated PHYs:
      - add MDI/MDI-X control support
      - add RX error counters
      - add cable test support
      - add Signal Quality Indicator (SQI) reporting
    - dp83tg720: improve reset handling and reduce link recovery time
    - support bcm54811 (and its MII-Lite interface type)
    - air_en8811h: support resume/suspend
    - support PHY counters for QCA807x and QCA808x
    - support WoL for QCA807x
 
  - CAN drivers:
    - rcar_canfd: support for Transceiver Delay Compensation
    - kvaser: report FW versions via devlink dev info
 
  - WiFi:
    - extended regulatory info support (6 GHz)
    - add statistics and beacon monitor for Multi-Link Operation (MLO)
    - support S1G aggregation, improve S1G support
    - add Radio Measurement action fields
    - support per-radio RTS threshold
    - some work around how FIPS affects wifi, which was wrong (RC4 is used
      by TKIP, not only WEP)
    - improvements for unsolicited probe response handling
 
  - WiFi drivers:
    - RealTek (rtw88):
      - IBSS mode for SDIO devices
    - RealTek (rtw89):
      - BT coexistence for MLO/WiFi7
      - concurrent station + P2P support
      - support for USB devices RTL8851BU/RTL8852BU
    - Intel (iwlwifi):
      - use embedded PNVM in (to be released) FW images to fix
        compatibility issues
      - many cleanups (unused FW APIs, PCIe code, WoWLAN)
      - some FIPS interoperability
    - MediaTek (mt76):
      - firmware recovery improvements
      - more MLO work
    - Qualcomm/Atheros (ath12k):
      - fix scan on multi-radio devices
      - more EHT/Wi-Fi 7 features
      - encapsulation/decapsulation offload
    - Broadcom (brcm80211):
      - support SDIO 43751 device
 
  - Bluetooth:
    - hci_event: add support for handling LE BIG Sync Lost event
    - ISO: add socket option to report packet seqnum via CMSG
    - ISO: support SCM_TIMESTAMPING for ISO TS
 
  - Bluetooth drivers:
    - intel_pcie: support Function Level Reset
    - nxpuart: add support for 4M baudrate
    - nxpuart: implement powerup sequence, reset, FW dump, and FW loading
 
 Signed-off-by: Jakub Kicinski <kuba@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmiFgLgACgkQMUZtbf5S
 IrvafxAAnQRwYBoIG+piCILx6z5pRvBGHkmEQ4AQgSCFuq2eO3ubwMFIqEybfma1
 5+QFjUZAV3OgGgKRBS2KGWxtSzdiF+/JGV1VOIN67sX3Mm0a2QgjA4n5CgKL0FPr
 o6BEzjX5XwG1zvGcBNQ5BZ19xUUKjoZQgTtnea8sZ57Fsp5RtRgmYRqoewNvNk/n
 uImh0NFsDVb0UeOpSzC34VD9l1dJvLGdui4zJAjno/vpvmT1DkXjoK419J/r52SS
 X+5WgsfJ6DkjHqVN1tIhhK34yWqBOcwGFZJgEnWHMkFIl2FqRfFKMHyqtfLlVnLA
 mnIpSyz8Sq2AHtx0TlgZ3At/Ri8p5+yYJgHOXcDKyABa8y8Zf4wrycmr6cV9JLuL
 z54nLEVnJuvfDVDVJjsLYdJXyhMpZFq6+uAItdxKaw8Ugp/QqG4QtoRj+XIHz4ZW
 z6OohkCiCzTwEISFK+pSTxPS30eOxq43kCspcvuLiwCCStJBRkRb5GdZA4dm7LA+
 1Od4ADAkHjyrFtBqTyyC2scX8UJ33DlAIpAYyIeS6w9Cj9EXxtp1z33IAAAZ03MW
 jJwIaJuc8bK2fWKMmiG7ucIXjPo4t//KiWlpkwwqLhPbjZgfDAcxq1AC2TLoqHBL
 y4EOgKpHDCMAghSyiFIAn2JprGcEt8dp+11B0JRXIn4Pm/eYDH8=
 =lqbe
 -----END PGP SIGNATURE-----

Merge tag 'net-next-6.17' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next

Pull networking updates from Jakub Kicinski:
 "Core & protocols:

   - Wrap datapath globals into net_aligned_data, to avoid false sharing

   - Preserve MSG_ZEROCOPY in forwarding (e.g. out of a container)

   - Add SO_INQ and SCM_INQ support to AF_UNIX

   - Add SIOCINQ support to AF_VSOCK

   - Add TCP_MAXSEG sockopt to MPTCP

   - Add IPv6 force_forwarding sysctl to enable forwarding per interface

   - Make TCP validation of whether packet fully fits in the receive
     window and the rcv_buf more strict. With increased use of HW
     aggregation a single "packet" can be multiple 100s of kB

   - Add MSG_MORE flag to optimize large TCP transmissions via sockmap,
     improves latency up to 33% for sockmap users

   - Convert TCP send queue handling from tasklet to BH workque

   - Improve BPF iteration over TCP sockets to see each socket exactly
     once

   - Remove obsolete and unused TCP RFC3517/RFC6675 loss recovery code

   - Support enabling kernel threads for NAPI processing on per-NAPI
     instance basis rather than a whole device. Fully stop the kernel
     NAPI thread when threaded NAPI gets disabled. Previously thread
     would stick around until ifdown due to tricky synchronization

   - Allow multicast routing to take effect on locally-generated packets

   - Add output interface argument for End.X in segment routing

   - MCTP: add support for gateway routing, improve bind() handling

   - Don't require rtnl_lock when fetching an IPv6 neighbor over Netlink

   - Add a new neighbor flag ("extern_valid"), which cedes refresh
     responsibilities to userspace. This is needed for EVPN multi-homing
     where a neighbor entry for a multi-homed host needs to be synced
     across all the VTEPs among which the host is multi-homed

   - Support NUD_PERMANENT for proxy neighbor entries

   - Add a new queuing discipline for IETF RFC9332 DualQ Coupled AQM

   - Add sequence numbers to netconsole messages. Unregister
     netconsole's console when all net targets are removed. Code
     refactoring. Add a number of selftests

   - Align IPSec inbound SA lookup to RFC 4301. Only SPI and protocol
     should be used for an inbound SA lookup

   - Support inspecting ref_tracker state via DebugFS

   - Don't force bonding advertisement frames tx to ~333 ms boundaries.
     Add broadcast_neighbor option to send ARP/ND on all bonded links

   - Allow providing upcall pid for the 'execute' command in openvswitch

   - Remove DCCP support from Netfilter's conntrack

   - Disallow multiple packet duplications in the queuing layer

   - Prevent use of deprecated iptables code on PREEMPT_RT

  Driver API:

   - Support RSS and hashing configuration over ethtool Netlink

   - Add dedicated ethtool callbacks for getting and setting hashing
     fields

   - Add support for power budget evaluation strategy in PSE /
     Power-over-Ethernet. Generate Netlink events for overcurrent etc

   - Support DPLL phase offset monitoring across all device inputs.
     Support providing clock reference and SYNC over separate DPLL
     inputs

   - Support traffic classes in devlink rate API for bandwidth
     management

   - Remove rtnl_lock dependency from UDP tunnel port configuration

  Device drivers:

   - Add a new Broadcom driver for 800G Ethernet (bnge)

   - Add a standalone driver for Microchip ZL3073x DPLL

   - Remove IBM's NETIUCV device driver

   - Ethernet high-speed NICs:
      - Broadcom (bnxt):
         - support zero-copy Tx of DMABUF memory
         - take page size into account for page pool recycling rings
      - Intel (100G, ice, idpf):
         - idpf: XDP and AF_XDP support preparations
         - idpf: add flow steering
         - add link_down_events statistic
         - clean up the TSPLL code
         - preparations for live VM migration
      - nVidia/Mellanox:
         - support zero-copy Rx/Tx interfaces (DMABUF and io_uring)
         - optimize context memory usage for matchers
         - expose serial numbers in devlink info
         - support PCIe congestion metrics
      - Meta (fbnic):
         - add 25G, 50G, and 100G link modes to phylink
         - support dumping FW logs
      - Marvell/Cavium:
         - support for CN20K generation of the Octeon chips
      - Amazon:
         - add HW clock (without timestamping, just hypervisor time access)

   - Ethernet virtual:
      - VirtIO net:
         - support segmentation of UDP-tunnel-encapsulated packets
      - Google (gve):
         - support packet timestamping and clock synchronization
      - Microsoft vNIC:
         - add handler for device-originated servicing events
         - allow dynamic MSI-X vector allocation
         - support Tx bandwidth clamping

   - Ethernet NICs consumer, and embedded:
      - AMD:
         - amd-xgbe: hardware timestamping and PTP clock support
      - Broadcom integrated MACs (bcmgenet, bcmasp):
         - use napi_complete_done() return value to support NAPI polling
         - add support for re-starting auto-negotiation
      - Broadcom switches (b53):
         - support BCM5325 switches
         - add bcm63xx EPHY power control
      - Synopsys (stmmac):
         - lots of code refactoring and cleanups
      - TI:
         - icssg-prueth: read firmware-names from device tree
         - icssg: PRP offload support
      - Microchip:
         - lan78xx: convert to PHYLINK for improved PHY and MAC management
         - ksz: add KSZ8463 switch support
      - Intel:
         - support similar queue priority scheme in multi-queue and
           time-sensitive networking (taprio)
         - support packet pre-emption in both
      - RealTek (r8169):
         - enable EEE at 5Gbps on RTL8126
      - Airoha:
         - add PPPoE offload support
         - MDIO bus controller for Airoha AN7583

   - Ethernet PHYs:
      - support for the IPQ5018 internal GE PHY
      - micrel KSZ9477 switch-integrated PHYs:
         - add MDI/MDI-X control support
         - add RX error counters
         - add cable test support
         - add Signal Quality Indicator (SQI) reporting
      - dp83tg720: improve reset handling and reduce link recovery time
      - support bcm54811 (and its MII-Lite interface type)
      - air_en8811h: support resume/suspend
      - support PHY counters for QCA807x and QCA808x
      - support WoL for QCA807x

   - CAN drivers:
      - rcar_canfd: support for Transceiver Delay Compensation
      - kvaser: report FW versions via devlink dev info

   - WiFi:
      - extended regulatory info support (6 GHz)
      - add statistics and beacon monitor for Multi-Link Operation (MLO)
      - support S1G aggregation, improve S1G support
      - add Radio Measurement action fields
      - support per-radio RTS threshold
      - some work around how FIPS affects wifi, which was wrong (RC4 is
        used by TKIP, not only WEP)
      - improvements for unsolicited probe response handling

   - WiFi drivers:
      - RealTek (rtw88):
         - IBSS mode for SDIO devices
      - RealTek (rtw89):
         - BT coexistence for MLO/WiFi7
         - concurrent station + P2P support
         - support for USB devices RTL8851BU/RTL8852BU
      - Intel (iwlwifi):
         - use embedded PNVM in (to be released) FW images to fix
           compatibility issues
         - many cleanups (unused FW APIs, PCIe code, WoWLAN)
         - some FIPS interoperability
      - MediaTek (mt76):
         - firmware recovery improvements
         - more MLO work
      - Qualcomm/Atheros (ath12k):
         - fix scan on multi-radio devices
         - more EHT/Wi-Fi 7 features
         - encapsulation/decapsulation offload
      - Broadcom (brcm80211):
         - support SDIO 43751 device

   - Bluetooth:
      - hci_event: add support for handling LE BIG Sync Lost event
      - ISO: add socket option to report packet seqnum via CMSG
      - ISO: support SCM_TIMESTAMPING for ISO TS

   - Bluetooth drivers:
      - intel_pcie: support Function Level Reset
      - nxpuart: add support for 4M baudrate
      - nxpuart: implement powerup sequence, reset, FW dump, and FW loading"

* tag 'net-next-6.17' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1742 commits)
  dpll: zl3073x: Fix build failure
  selftests: bpf: fix legacy netfilter options
  ipv6: annotate data-races around rt->fib6_nsiblings
  ipv6: fix possible infinite loop in fib6_info_uses_dev()
  ipv6: prevent infinite loop in rt6_nlmsg_size()
  ipv6: add a retry logic in net6_rt_notify()
  vrf: Drop existing dst reference in vrf_ip6_input_dst
  net/sched: taprio: align entry index attr validation with mqprio
  net: fsl_pq_mdio: use dev_err_probe
  selftests: rtnetlink.sh: remove esp4_offload after test
  vsock: remove unnecessary null check in vsock_getname()
  igb: xsk: solve negative overflow of nb_pkts in zerocopy mode
  stmmac: xsk: fix negative overflow of budget in zerocopy mode
  dt-bindings: ieee802154: Convert at86rf230.txt yaml format
  net: dsa: microchip: Disable PTP function of KSZ8463
  net: dsa: microchip: Setup fiber ports for KSZ8463
  net: dsa: microchip: Write switch MAC address differently for KSZ8463
  net: dsa: microchip: Use different registers for KSZ8463
  net: dsa: microchip: Add KSZ8463 switch support to KSZ DSA driver
  dt-bindings: net: dsa: microchip: Add KSZ8463 switch support
  ...
2025-07-30 08:58:55 -07:00
Chuck Lever
a2d61427fb NFSD: Make nfsd_genl_rqstp::rq_ops array best-effort
To enable NFSD to handle NFSv4 COMPOUNDs of unrestricted size,
resize the array in struct nfsd_genl_rqstp so it saves only up to
16 operations per COMPOUND.

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-07-14 12:46:41 -04:00
Chuck Lever
71bc2c6c41 NFSD: Rename a function parameter
Clean up: A function parameter called "rqstp" typically refers to an
object of type "struct svc_rqst", so it's confusing when such an
parameter refers to a different struct type with field names that are
very similar to svc_rqst.

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-07-14 12:46:40 -04:00
Carolina Jubran
42401c4238 netlink: introduce type-checking attribute iteration for nlmsg
Add the nlmsg_for_each_attr_type() macro to simplify iteration over
attributes of a specific type in a Netlink message.

Convert existing users in vxlan and nfsd to use the new macro.

Suggested-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Carolina Jubran <cjubran@nvidia.com>
Signed-off-by: Mark Bloch <mbloch@nvidia.com>
Link: https://patch.msgid.link/20250629142138.361537-2-mbloch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-02 15:39:04 -07:00
Jeff Layton
8ea688a337 nfsd: use threads array as-is in netlink interface
The old nfsdfs interface for starting a server with multiple pools
handles the special case of a single entry array passed down from
userland by distributing the threads over every NUMA node.

The netlink control interface however constructs an array of length
nfsd_nrpools() and fills any unprovided slots with 0's. This behavior
defeats the special casing that the old interface relies on.

Change nfsd_nl_threads_set_doit() to pass down the array from userland
as-is.

Fixes: 7f5c330b26 ("nfsd: allow passing in array of thread counts via netlink")
Cc: stable@vger.kernel.org
Reported-by: Mike Snitzer <snitzer@kernel.org>
Closes: https://lore.kernel.org/linux-nfs/aDC-ftnzhJAlwqwh@kernel.org/
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Tested-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-06-19 09:35:38 -04:00
Chuck Lever
9fe5ea760e NFSD: Add /sys/kernel/debug/nfsd
Create a small sandbox under /sys/kernel/debug for experimental NFS
server feature settings. There is no API/ABI compatibility guarantee
for these settings.

The only documentation for such settings, if any documentation exists,
is in the kernel source code.

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-05-11 19:48:23 -04:00
Maninder Singh
f7fb730cac NFSD: fix race between nfsd registration and exports_proc
As of now nfsd calls create_proc_exports_entry() at start of init_nfsd
and cleanup by remove_proc_entry() at last of exit_nfsd.

Which causes kernel OOPs if there is race between below 2 operations:
(i) exportfs -r
(ii) mount -t nfsd none /proc/fs/nfsd

for 5.4 kernel ARM64:

CPU 1:
el1_irq+0xbc/0x180
arch_counter_get_cntvct+0x14/0x18
running_clock+0xc/0x18
preempt_count_add+0x88/0x110
prep_new_page+0xb0/0x220
get_page_from_freelist+0x2d8/0x1778
__alloc_pages_nodemask+0x15c/0xef0
__vmalloc_node_range+0x28c/0x478
__vmalloc_node_flags_caller+0x8c/0xb0
kvmalloc_node+0x88/0xe0
nfsd_init_net+0x6c/0x108 [nfsd]
ops_init+0x44/0x170
register_pernet_operations+0x114/0x270
register_pernet_subsys+0x34/0x50
init_nfsd+0xa8/0x718 [nfsd]
do_one_initcall+0x54/0x2e0

CPU 2 :
Unable to handle kernel NULL pointer dereference at virtual address 0000000000000010

PC is at : exports_net_open+0x50/0x68 [nfsd]

Call trace:
exports_net_open+0x50/0x68 [nfsd]
exports_proc_open+0x2c/0x38 [nfsd]
proc_reg_open+0xb8/0x198
do_dentry_open+0x1c4/0x418
vfs_open+0x38/0x48
path_openat+0x28c/0xf18
do_filp_open+0x70/0xe8
do_sys_open+0x154/0x248

Sometimes it crashes at exports_net_open() and sometimes cache_seq_next_rcu().

and same is happening on latest 6.14 kernel as well:

[    0.000000] Linux version 6.14.0-rc5-next-20250304-dirty
...
[  285.455918] Unable to handle kernel paging request at virtual address 00001f4800001f48
...
[  285.464902] pc : cache_seq_next_rcu+0x78/0xa4
...
[  285.469695] Call trace:
[  285.470083]  cache_seq_next_rcu+0x78/0xa4 (P)
[  285.470488]  seq_read+0xe0/0x11c
[  285.470675]  proc_reg_read+0x9c/0xf0
[  285.470874]  vfs_read+0xc4/0x2fc
[  285.471057]  ksys_read+0x6c/0xf4
[  285.471231]  __arm64_sys_read+0x1c/0x28
[  285.471428]  invoke_syscall+0x44/0x100
[  285.471633]  el0_svc_common.constprop.0+0x40/0xe0
[  285.471870]  do_el0_svc_compat+0x1c/0x34
[  285.472073]  el0_svc_compat+0x2c/0x80
[  285.472265]  el0t_32_sync_handler+0x90/0x140
[  285.472473]  el0t_32_sync+0x19c/0x1a0
[  285.472887] Code: f9400885 93407c23 937d7c27 11000421 (f86378a3)
[  285.473422] ---[ end trace 0000000000000000 ]---

It reproduced simply with below script:
while [ 1 ]
do
/exportfs -r
done &

while [ 1 ]
do
insmod /nfsd.ko
mount -t nfsd none /proc/fs/nfsd
umount /proc/fs/nfsd
rmmod nfsd
done &

So exporting interfaces to user space shall be done at last and
cleanup at first place.

With change there is no Kernel OOPs.

Co-developed-by: Shubham Rana <s9.rana@samsung.com>
Signed-off-by: Shubham Rana <s9.rana@samsung.com>
Signed-off-by: Maninder Singh <maninder1.s@samsung.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Cc: stable@vger.kernel.org
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-05-11 19:48:23 -04:00
Maninder Singh
ff12eb3795 NFSD: unregister filesystem in case genl_register_family() fails
With rpc_status netlink support, unregister of register_filesystem()
was missed in case of genl_register_family() fails.

Correcting it by making new label.

Fixes: bd9d6a3efa ("NFSD: add rpc_status netlink support")
Cc: stable@vger.kernel.org
Signed-off-by: Maninder Singh <maninder1.s@samsung.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-05-11 19:48:22 -04:00
Jeff Layton
930b64ca0c nfsd: don't ignore the return code of svc_proc_register()
Currently, nfsd_proc_stat_init() ignores the return value of
svc_proc_register(). If the procfile creation fails, then the kernel
will WARN when it tries to remove the entry later.

Fix nfsd_proc_stat_init() to return the same type of pointer as
svc_proc_register(), and fix up nfsd_net_init() to check that and fail
the nfsd_net construction if it occurs.

svc_proc_register() can fail if the dentry can't be allocated, or if an
identical dentry already exists. The second case is pretty unlikely in
the nfsd_net construction codepath, so if this happens, return -ENOMEM.

Reported-by: syzbot+e34ad04f27991521104c@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/linux-nfs/67a47501.050a0220.19061f.05f9.GAE@google.com/
Cc: stable@vger.kernel.org # v6.9
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-03-10 09:10:59 -04:00
Olga Kornievskaia
d093c90892 nfsd: fix management of listener transports
Currently, when no active threads are running, a root user using nfsdctl
command can try to remove a particular listener from the list of previously
added ones, then start the server by increasing the number of threads,
it leads to the following problem:

[  158.835354] refcount_t: addition on 0; use-after-free.
[  158.835603] WARNING: CPU: 2 PID: 9145 at lib/refcount.c:25 refcount_warn_saturate+0x160/0x1a0
[  158.836017] Modules linked in: rpcrdma rdma_cm iw_cm ib_cm ib_core nfsd auth_rpcgss nfs_acl lockd grace overlay isofs uinput snd_seq_dummy snd_hrtimer nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 rfkill ip_set nf_tables qrtr sunrpc vfat fat uvcvideo videobuf2_vmalloc videobuf2_memops uvc videobuf2_v4l2 videodev videobuf2_common snd_hda_codec_generic mc e1000e snd_hda_intel snd_intel_dspcfg snd_hda_codec snd_hda_core snd_hwdep snd_seq snd_seq_device snd_pcm snd_timer snd soundcore sg loop dm_multipath dm_mod nfnetlink vsock_loopback vmw_vsock_virtio_transport_common vmw_vsock_vmci_transport vmw_vmci vsock xfs libcrc32c crct10dif_ce ghash_ce vmwgfx sha2_ce sha256_arm64 sr_mod sha1_ce cdrom nvme drm_client_lib drm_ttm_helper ttm nvme_core drm_kms_helper nvme_auth drm fuse
[  158.840093] CPU: 2 UID: 0 PID: 9145 Comm: nfsd Kdump: loaded Tainted: G    B   W          6.13.0-rc6+ #7
[  158.840624] Tainted: [B]=BAD_PAGE, [W]=WARN
[  158.840802] Hardware name: VMware, Inc. VMware20,1/VBSA, BIOS VMW201.00V.24006586.BA64.2406042154 06/04/2024
[  158.841220] pstate: 61400005 (nZCv daif +PAN -UAO -TCO +DIT -SSBS BTYPE=--)
[  158.841563] pc : refcount_warn_saturate+0x160/0x1a0
[  158.841780] lr : refcount_warn_saturate+0x160/0x1a0
[  158.842000] sp : ffff800089be7d80
[  158.842147] x29: ffff800089be7d80 x28: ffff00008e68c148 x27: ffff00008e68c148
[  158.842492] x26: ffff0002e3b5c000 x25: ffff600011cd1829 x24: ffff00008653c010
[  158.842832] x23: ffff00008653c000 x22: 1fffe00011cd1829 x21: ffff00008653c028
[  158.843175] x20: 0000000000000002 x19: ffff00008653c010 x18: 0000000000000000
[  158.843505] x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000000
[  158.843836] x14: 0000000000000000 x13: 0000000000000001 x12: ffff600050a26493
[  158.844143] x11: 1fffe00050a26492 x10: ffff600050a26492 x9 : dfff800000000000
[  158.844475] x8 : 00009fffaf5d9b6e x7 : ffff000285132493 x6 : 0000000000000001
[  158.844823] x5 : ffff000285132490 x4 : ffff600050a26493 x3 : ffff8000805e72bc
[  158.845174] x2 : 0000000000000000 x1 : 0000000000000000 x0 : ffff000098588000
[  158.845528] Call trace:
[  158.845658]  refcount_warn_saturate+0x160/0x1a0 (P)
[  158.845894]  svc_recv+0x58c/0x680 [sunrpc]
[  158.846183]  nfsd+0x1fc/0x348 [nfsd]
[  158.846390]  kthread+0x274/0x2f8
[  158.846546]  ret_from_fork+0x10/0x20
[  158.846714] ---[ end trace 0000000000000000 ]---

nfsd_nl_listener_set_doit() would manipulate the list of transports of
server's sv_permsocks and close the specified listener but the other
list of transports (server's sp_xprts list) would not be changed leading
to the problem above.

Instead, determined if the nfsdctl is trying to remove a listener, in
which case, delete all the existing listener transports and re-create
all-but-the-removed ones.

Fixes: 16a4711774 ("NFSD: add listener-{set,get} netlink command")
Signed-off-by: Olga Kornievskaia <okorniev@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Cc: stable@vger.kernel.org
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-03-10 09:10:54 -04:00
Linus Torvalds
b88fe2b5dd NFS Client Updates for Linux 6.14
New Features:
   * Enable using direct IO with localio
   * Added localio related tracepoints
 
 Bugfixes:
   * Sunrpc fixes for working with a very large cl_tasks list
   * Fix a possible buffer overflow in nfs_sysfs_link_rpc_client()
   * Fixes for handling reconnections with localio
   * Fix how the NFS_FSCACHE kconfig option interacts with NETFS_SUPPORT
   * Fix COPY_NOTIFY xdr_buf size calculations
   * pNFS/Flexfiles fix for retrying requesting a layout segment for reads
   * Sunrpc fix for retrying on EKEYEXPIRED error when the TGT is expired
 
 Cleanups:
   * Various other nfs & nfsd localio cleanups
   * Prepratory patches for async copy improvements that are under development
   * Make OFFLOAD_CANCEL, LAYOUTSTATS, and LAYOUTERR moveable to other xprts
   * Add netns inum and srcaddr to debugfs rpc_xprt info
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEnZ5MQTpR7cLU7KEp18tUv7ClQOsFAmeZUzUACgkQ18tUv7Cl
 QOvArw/9HltIlcJHbi7tApGJ4dFpuJCa/fHbA1n5bHvKrCR5aElmFZoiFDdsM1JX
 kFAlMED9n1dW9VmzLJcepxmrLo/t7KXueiZNharHynWTxcszSl6jS+tOFBW6OflG
 Rrrjq/SrsWI2Fu8X4e/7ZV7pqRLGGn5SSMwgbuMbcyzBvVgN8mZM/BneIp1J59AI
 5NOsif5KWetVhQc43zlRlbVWR5cvNGcUK4i58LIaPFzPMt0xq/XJI+QWffj6kv4g
 cHabCNYTdQYMkhiPQC+LLYkw6sMbw2NatajTTYNMWfR/I+7wz9k5ej6CHKPIFCSr
 xjmscypySTLfMFQjrDFZkpX2CwSp/VIbV6go36DJwAlcCRzqz+I7cajlrRK4zvyr
 DyrcaZHvClEczP9QqdPj2wqRXbmIOsDMksOu4ACTUImd4o3f2v1K6DcwRj9oUIhV
 AGR31OEMt2A+RaVvVZYR4PpixJ01vH9LcmsaOu5KkHX8X4q2osQ7eMy+FV4kV09S
 pMnxDMAyszJU8IuzUG1/HfkonNlDMivIbqpgG4ZaVW08Nq4mCxJll1vTAa9FTLz2
 z+9eocqKwf724q1RAgOB7vj4AwOwL4Ul6d18UBtyUitZz3ndLRZ8Yy6r/AhrpCsC
 3co0Y3znZbKeRjmReNl0GLG4qiKE+E7Xh23Lf3IqXg8GE2Mu+Ls=
 =srvH
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-6.14-1' of git://git.linux-nfs.org/projects/anna/linux-nfs

Pull NFS client updates from Anna Schumaker:
 "New Features:
   - Enable using direct IO with localio
   - Added localio related tracepoints

  Bugfixes:
   - Sunrpc fixes for working with a very large cl_tasks list
   - Fix a possible buffer overflow in nfs_sysfs_link_rpc_client()
   - Fixes for handling reconnections with localio
   - Fix how the NFS_FSCACHE kconfig option interacts with NETFS_SUPPORT
   - Fix COPY_NOTIFY xdr_buf size calculations
   - pNFS/Flexfiles fix for retrying requesting a layout segment for
     reads
   - Sunrpc fix for retrying on EKEYEXPIRED error when the TGT is
     expired

  Cleanups:
   - Various other nfs & nfsd localio cleanups
   - Prepratory patches for async copy improvements that are under
     development
   - Make OFFLOAD_CANCEL, LAYOUTSTATS, and LAYOUTERR moveable to other
     xprts
   - Add netns inum and srcaddr to debugfs rpc_xprt info"

* tag 'nfs-for-6.14-1' of git://git.linux-nfs.org/projects/anna/linux-nfs: (28 commits)
  SUNRPC: do not retry on EKEYEXPIRED when user TGT ticket expired
  sunrpc: add netns inum and srcaddr to debugfs rpc_xprt info
  pnfs/flexfiles: retry getting layout segment for reads
  NFSv4.2: make LAYOUTSTATS and LAYOUTERROR MOVEABLE
  NFSv4.2: mark OFFLOAD_CANCEL MOVEABLE
  NFSv4.2: fix COPY_NOTIFY xdr buf size calculation
  NFS: Rename struct nfs4_offloadcancel_data
  NFS: Fix typo in OFFLOAD_CANCEL comment
  NFS: CB_OFFLOAD can return NFS4ERR_DELAY
  nfs: Make NFS_FSCACHE select NETFS_SUPPORT instead of depending on it
  nfs: fix incorrect error handling in LOCALIO
  nfs: probe for LOCALIO when v3 client reconnects to server
  nfs: probe for LOCALIO when v4 client reconnects to server
  nfs/localio: remove redundant code and simplify LOCALIO enablement
  nfs_common: add nfs_localio trace events
  nfs_common: track all open nfsd_files per LOCALIO nfs_client
  nfs_common: rename nfslocalio nfs_uuid_lock to nfs_uuids_lock
  nfsd: nfsd_file_acquire_local no longer returns GC'd nfsd_file
  nfsd: rename nfsd_serv_ prefixed methods and variables with nfsd_net_
  nfsd: update percpu_ref to manage references on nfsd_net
  ...
2025-01-28 14:23:46 -08:00
Mike Snitzer
085804110a nfs_common: track all open nfsd_files per LOCALIO nfs_client
This tracking enables __nfsd_file_cache_purge() to call
nfs_localio_invalidate_clients(), upon shutdown or export change, to
nfs_close_local_fh() all open nfsd_files that are still cached by the
LOCALIO nfs clients associated with nfsd_net that is being shutdown.

Now that the client must track all open nfsd_files there was more work
than necessary being done with the global nfs_uuids_lock contended.
This manifested in various RCU issues, e.g.:
 hrtimer: interrupt took 47969440 ns
 rcu: INFO: rcu_sched detected stalls on CPUs/tasks:

Use nfs_uuid->lock to protect all nfs_uuid_t members, instead of
nfs_uuids_lock, once nfs_uuid_is_local() adds the client to
nn->local_clients.

Also add 'local_clients_lock' to 'struct nfsd_net' to protect
nn->local_clients.  And store a pointer to spinlock in the 'list_lock'
member of nfs_uuid_t so nfs_localio_disable_client() can use it to
avoid taking the global nfs_uuids_lock.

In combination, these split out locks eliminate the use of the single
nfslocalio.c global nfs_uuids_lock in the IO paths (open and close).

Also refactored associated fs/nfs_common/nfslocalio.c methods' locking
to reduce work performed with spinlocks held in general.

Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
2025-01-14 17:05:10 -05:00
Mike Snitzer
b49f049a22 nfs_common: rename functions that invalidate LOCALIO nfs_clients
Rename nfs_uuid_invalidate_one_client to nfs_localio_disable_client.
Rename nfs_uuid_invalidate_clients to nfs_localio_invalidate_clients.

Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Reviewed-by: NeilBrown <neilb@suse.de>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
2025-01-14 17:04:05 -05:00
NeilBrown
a4b853f183 sunrpc: remove all connection limit configuration
Now that the connection limit only apply to unconfirmed connections,
there is no need to configure it.  So remove all the configuration and
fix the number of unconfirmed connections as always 64 - which is
now given a name: XPT_MAX_TMP_CONN

Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-01-06 09:37:36 -05:00
Weston Andros Adamson
fa4983862e nfsd: add LOCALIO support
Add server support for bypassing NFS for localhost reads, writes, and
commits. This is only useful when both the client and server are
running on the same host.

If nfsd_open_local_fh() fails then the NFS client will both retry and
fallback to normal network-based read, write and commit operations if
localio is no longer supported.

Care is taken to ensure the same NFS security mechanisms are used
(authentication, etc) regardless of whether localio or regular NFS
access is used.  The auth_domain established as part of the traditional
NFS client access to the NFS server is also used for localio.  Store
auth_domain for localio in nfsd_uuid_t and transfer it to the client
if it is local to the server.

Relative to containers, localio gives the client access to the network
namespace the server has.  This is required to allow the client to
access the server's per-namespace nfsd_net struct.

This commit also introduces the use of NFSD's percpu_ref to interlock
nfsd_destroy_serv and nfsd_open_local_fh, to ensure nn->nfsd_serv is
not destroyed while in use by nfsd_open_local_fh and other LOCALIO
client code.

CONFIG_NFS_LOCALIO enables NFS server support for LOCALIO.

Signed-off-by: Weston Andros Adamson <dros@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Co-developed-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Co-developed-by: NeilBrown <neilb@suse.de>
Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Acked-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
2024-09-23 15:03:30 -04:00
NeilBrown
86ab08beb3 SUNRPC: replace program list with program array
A service created with svc_create_pooled() can be given a linked list of
programs and all of these will be served.

Using a linked list makes it cumbersome when there are several programs
that can be optionally selected with CONFIG settings.

After this patch is applied, API consumers must use only
svc_create_pooled() when creating an RPC service that listens for more
than one RPC program.

Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Acked-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
2024-09-23 15:03:30 -04:00
NeilBrown
60749cbe3d sunrpc: change sp_nrthreads from atomic_t to unsigned int.
sp_nrthreads is only ever accessed under the service mutex
  nlmsvc_mutex nfs_callback_mutex nfsd_mutex
so these is no need for it to be an atomic_t.

The fact that all code using it is single-threaded means that we can
simplify svc_pool_victim and remove the temporary elevation of
sp_nrthreads.

Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-09-20 19:31:03 -04:00
NeilBrown
73598a0cfb nfsd: don't allocate the versions array.
Instead of using kmalloc to allocate an array for storing active version
info, just declare an array to the max size - it is only 5 or so.

Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-09-20 19:29:23 -04:00
NeilBrown
c9f10f811c nfsd: move nfsd_pool_stats_open into nfsctl.c
nfsd_pool_stats_open() is used in nfsctl.c, so move it there.

Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-09-01 10:04:57 -04:00
Jeff Layton
91da337e5d nfsd: don't set SVC_SOCK_ANONYMOUS when creating nfsd sockets
When creating nfsd sockets via the netlink interface, we do want to
register with the portmapper. Don't set SVC_SOCK_ANONYMOUS.

Reported-by: Steve Dickson <steved@redhat.com>
Fixes: 16a4711774 ("NFSD: add listener-{set,get} netlink command")
Cc: Lorenzo Bianconi <lorenzo@kernel.org>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-07-22 09:47:39 -04:00
Jeff Layton
00506072d7 nfsd: new netlink ops to get/set server pool_mode
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-07-08 14:10:05 -04:00
Jeff Layton
7f5c330b26 nfsd: allow passing in array of thread counts via netlink
Now that nfsd_svc can handle an array of thread counts, fix up the
netlink threads interface to construct one from the netlink call
and pass it through so we can start a pooled server the same way we
would start a normal one.

Note that any unspecified values in the array are considered zeroes,
so it's possible to shut down a pooled server by passing in a short
array that has only zeros, or even an empty array.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-07-08 14:10:05 -04:00
Jeff Layton
b4d8f22891 nfsd: make nfsd_svc take an array of thread counts
Now that the refcounting is fixed, rework nfsd_svc to use the same
thread setup as the pool_threads interface. Have it take an array of
thread counts instead of just a single value, and pass that from the
netlink threads set interface.  Since the new netlink interface doesn't
have the same restriction as pool_threads, move the guard against
shutting down all threads to write_pool_threads.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-07-08 14:10:04 -04:00
NeilBrown
e0011bca60 nfsd: initialise nfsd_info.mutex early.
nfsd_info.mutex can be dereferenced by svc_pool_stats_start()
immediately after the new netns is created.  Currently this can
trigger an oops.

Move the initialisation earlier before it can possibly be dereferenced.

Fixes: 7b207ccd98 ("svc: don't hold reference for poolstats, only mutex.")
Reported-by: Sourabh Jain <sourabhjain@linux.ibm.com>
Closes: https://lore.kernel.org/all/c2e9f6de-1ec4-4d3a-b18d-d5a6ec0814a0@linux.ibm.com/
Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-06-25 10:15:39 -04:00