Currently, cached directory contents were not reused across subsequent
'ls' operations because the cache validity check relied on comparing
the ctx pointer, which changes with each readdir invocation. As a
result, the cached dir entries was not marked as valid and the cache was
not utilized for subsequent 'ls' operations.
This change uses the file pointer, which remains consistent across all
readdir calls for a given directory instance, to associate and validate
the cache. As a result, cached directory contents can now be
correctly reused, improving performance for repeated directory listings.
Performance gains with local windows SMB server:
Without the patch and default actimeo=1:
1000 directory enumeration operations on dir with 10k files took 135.0s
With this patch and actimeo=0:
1000 directory enumeration operations on dir with 10k files took just 5.1s
Signed-off-by: Bharath SM <bharathsm@microsoft.com>
Reviewed-by: Shyam Prasad N <sprasad@microsoft.com>
Cc: stable@vger.kernel.org
Signed-off-by: Steve French <stfrench@microsoft.com>
Customer reported that one of their applications started failing to
open files with STATUS_INSUFFICIENT_RESOURCES due to NetApp server
hitting the maximum number of opens to same file that it would allow
for a single client connection.
It turned out the client was failing to reuse open handles with
deferred closes because matching ->f_flags directly without masking
off O_CREAT|O_EXCL|O_TRUNC bits first broke the comparision and then
client ended up with thousands of deferred closes to same file. Those
bits are already satisfied on the original open, so no need to check
them against existing open handles.
Reproducer:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <fcntl.h>
#include <pthread.h>
#define NR_THREADS 4
#define NR_ITERATIONS 2500
#define TEST_FILE "/mnt/1/test/dir/foo"
static char buf[64];
static void *worker(void *arg)
{
int i, j;
int fd;
for (i = 0; i < NR_ITERATIONS; i++) {
fd = open(TEST_FILE, O_WRONLY|O_CREAT|O_APPEND, 0666);
for (j = 0; j < 16; j++)
write(fd, buf, sizeof(buf));
close(fd);
}
}
int main(int argc, char *argv[])
{
pthread_t t[NR_THREADS];
int fd;
int i;
fd = open(TEST_FILE, O_WRONLY|O_CREAT|O_TRUNC, 0666);
close(fd);
memset(buf, 'a', sizeof(buf));
for (i = 0; i < NR_THREADS; i++)
pthread_create(&t[i], NULL, worker, NULL);
for (i = 0; i < NR_THREADS; i++)
pthread_join(t[i], NULL);
return 0;
}
Before patch:
$ mount.cifs //srv/share /mnt/1 -o ...
$ mkdir -p /mnt/1/test/dir
$ gcc repro.c && ./a.out
...
number of opens: 1391
After patch:
$ mount.cifs //srv/share /mnt/1 -o ...
$ mkdir -p /mnt/1/test/dir
$ gcc repro.c && ./a.out
...
number of opens: 1
Cc: linux-cifs@vger.kernel.org
Cc: David Howells <dhowells@redhat.com>
Cc: Jay Shin <jaeshin@redhat.com>
Cc: Pierguido Lambri <plambri@redhat.com>
Fixes: b8ea3b1ff5 ("smb: enable reuse of deferred file handles for write operations")
Acked-by: Shyam Prasad N <sprasad@microsoft.com>
Signed-off-by: Paulo Alcantara (Red Hat) <pc@manguebit.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
We normally can't create a new directory with the case-insensitive
option already set - except when we're creating a snapshot.
And if casefolding is enabled filesystem wide, we should still set it
even though not strictly required, for consistency.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We have to be able to print superblock sections even if they fail to
validate (for debugging), so we have to calculate the number of entries
from the field size.
Reported-by: syzbot+5138f00559ffb3cb3610@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
It seems btree node scan picked up a partially overwritten btree node,
and corrected the "bset version older than sb version_min" error -
resulting in an invalid superblock with a bad version_min field.
Don't run this check at all when we're in btree node scan, and when we
do run it, do something saner if the bset version is totally crazy.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Multiple ioctl handlers individually use a lot of stack space, and clang chooses
to inline them into the bch2_fs_ioctl() function, blowing through the warning
limit:
fs/bcachefs/chardev.c:655:6: error: stack frame size (1032) exceeds limit (1024) in 'bch2_fs_ioctl' [-Werror,-Wframe-larger-than]
655 | long bch2_fs_ioctl(struct bch_fs *c, unsigned cmd, void __user *arg)
By marking the largest two of them as noinline_for_stack, no indidual code path
ends up using this much, which avoids the warning and reduces the possible
total stack usage in the ioctl handler.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
fsck_err() can return a transaction restart if passed a transaction
object - this has always been true when it has to drop locks to prompt
for user input, but we're seeing this more now that we're logging the
error being corrected in the journal.
gc_accounting_done() doesn't call fsck_err() from an actual commit loop,
and it doesn't need to be holding btree locks when it calls fsck_err(),
so the easy fix here for the unhandled transaction restart is to just
not pass it the transaction object. We'll miss out on the fancy new
logging, but that's ok.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
PREEMPT_RT redefines how standard spinlocks work, so local_irq_save() +
spin_lock() is no longer equivalent to spin_lock_irqsave(). Fortunately,
we don't strictly need to do it that way.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Fix a UAF: we were calling darray_make_room() and retaining a pointer to
the old buffer.
And fix an UBSAN warning: struct bch_sb_field_downgrade_entry uses
__counted_by, so set dst->nr_errors before assigning to the array entry.
Reported-by: syzbot+14c52d86ddbd89bea13e@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Object debugging generally needs special provisions for putting said
objects on the stack, which rhashtable does not have.
Reported-by: syzbot+bcc38a9556d0324c2ec2@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
If we think we're read-only but the VFS doesn't, fun will ensue.
And now that we know we have to be able to do this safely, just make
nochanges imply ro.
Reported-by: syzbot+a7d6ceaba099cc21dee4@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
bch2_btree_lost_data() gets called on btree node read error, but the
error might be transient.
btree_node_scan is expensive, and there's no need to run it persistently
(marking it in the superblock as required to run) - check_topology
will run it if required, via bch2_get_scanned_nodes().
Running it non-persistently is fine, to avoid check_topology having to
rewind recovery to run it.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
bch2_btree_increase_depth() was originally for disaster recovery, to get
some data back from the journal when a btree root was bad.
We don't need it for that purpose anymore; on bad btree root we'll
launch btree node scan and reconstruct all the interior nodes.
If there's a key in the journal for a depth that doesn't exists, and
it's not from check_topology/btree node scan, we should just ignore it.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
It seems excessive forced btree node rewrites can cause interior btree
updates to become wedged during recovery, before we're using the write
buffer for backpointer updates.
Add more flags so we can determine where these are coming from.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We had a deadlock during recovery where interior btree updates became
wedged and all open_buckets were consumed; start adding more
introspection.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Log the specific error being corrected in the journal when we're
repairing, this helps greatly with 'bcachefs list_journal' analysis.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
The next patch will add logging of the specific error being corrected in
repair paths to the journal; this means __bch2_fsck_err() can return
transaction restarts in places that previously weren't expecting them.
check_topology() is old code that doesn't use btree iterators for btree
node locking - it'll have to be rewritten in the future to work online.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
If SMB 3.1.1 POSIX Extensions are available and negotiated, the client
should be able to use all characters and not remap anything. Currently, the
user has to explicitly request this behavior by specifying the "nomapposix"
mount option.
Link: https://lore.kernel.org/4195bb677b33d680e77549890a4f4dd3b474ceaf.camel@rx2.rx-server.de
Signed-off-by: Philipp Kerling <pkerling@casix.org>
Reviewed-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
The renaming to the timer_*() namespace was delayed due massive conflicts
against Linux-next. Now that everything is upstream finish the conversion.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmhFOY0THHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYoRw8D/9ii/hq8jKguupde3UNVsdqICggO7bY
8PIY8FjZB2z3ALGOML9Pf1yystwnz1wbda9UhgGkKGj2iWvG0wWiN56J6FpksuIn
08poxMXUsLu7Wu6DaQkQrDwJ2Wu4EMefsxf6YtY/dGLLe553Bh5FHBLr75PO3d1j
AZNjGXysowzBBr//oSQuP8/MTVXd9KWvPSFPMn9oJZlkFVUbB0a6imjy10tDFC5s
uLUXwyPhsJvU6lj+B41H1hTNIoTBZexJgRgl1PhuNrN/5FLcUAPUVKbyLo+cCrqt
iB8WRw7fJu2CaKnSfRIWmi4kSeUP2d4H8oC/W4xymQtzvKNW6l0RIETg40FYqDAs
wucMBc5FmLzOBnUyoWDpn34NxOND5sHWd42yPHpxowmLWIZ2wAbSR/AHGA9vkmXa
Ksh8elyTR3swO9PRalrSsg3vM8KhH2RBXDotVFKBGmkay4WkW0TzTyjVDVZd1+bH
XxGO4PZWOXYcoQ840ocb1UMHdfEZivuaWrY4j5HWzsK/3No5f9ECJ9Dd5p/u6Ju7
FDmhrhovqKgLGnqo3MBmOeI1zSBsQuqPpRxUG1/gHVl4CYFwhcOU8pk0064ZSN9Q
RasjoJEghSlwKf4FEHJN9Z3+izoMntZGB3aUG+MXfxbBNJkHmO8/Tb4AwVQ6HyhT
+xF2fwKHwHyIbw==
=jJaU
-----END PGP SIGNATURE-----
Merge tag 'timers-cleanups-2025-06-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer cleanup from Thomas Gleixner:
"The delayed from_timer() API cleanup:
The renaming to the timer_*() namespace was delayed due massive
conflicts against Linux-next. Now that everything is upstream finish
the conversion"
* tag 'timers-cleanups-2025-06-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
treewide, timers: Rename from_timer() to timer_container_of()
- Cure IO bitmap inconsistencies
A failed fork cleans up all resources of the newly created thread via
exit_thread(). exit_thread() invokes io_bitmap_exit() which does the IO
bitmap cleanups, which unfortunately assume that the cleanup is related
to the current task, which is obviously bogus. Make it work correctly
- A lockdep fix in the resctrl code removed the clearing of the command
buffer in two places, which keeps stale error messages around. Bring
them back.
- Remove unused trace events
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmhFO0MTHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYoZYqEADAB+3WsgT9WZlGfksOgcL05uD0sGKP
mrGkrVuyGLQaMK+Vv7F3Vll0khe3nMJ1DzOOdmejerzEi2yb7CaZHIT1esujVHAk
kieR/D7t5zUvPislBSUxgsmNjqwoCDAwe5vKTSM0p20T7CWkenAzCcA4cUFQfGWo
2TqJWLDgdlvOk6qnaUTm0tzGjeYD9PegoUEyz5wgrf+YjsiKIWJLRbDMb3tXsOpy
La2vh2APV6Tt/PMalQiedQdi3VAf5yD+3HpvlXJ6b1BhJy3iPO6iMQPp1kwsKFjm
dzFT0IqM7lOERjMbhhnntLWe9z0vyezwIEj6dPlCf6MG8lTJ3io0bWJ2LwQEzt3q
RkFJO7wc7QqJoIGsZARGREA99Zkh6/qTh8QovbjpHWoQBrMIZ9mxw9gJ6QR/vjNa
dVhIEi4mOBNw+bq4BmYExC8j60qpGBrEi+lJrGWZoEBWHqhUY3I38DU1oH97mn/s
qL1iZo1TOGyxGuHIN79fKWduEpY+tyvhLxB2zW5m9+B26XRXZBec1lZ+qfIpPo/f
ZuWIbNYsKKPgpHZLSb21RJ/ost+sTr4iEb3XuguP+YAxfjOVJR7CAhsfWRTDNhPS
wGiyPAGmdbQrx/2Wn4vZnznOm/OH4aQX5cxbNR8TPknHJzYRmIY3VTDjDr8Xqxx+
R4/IRQOmcJ0Q/Q==
=x7j0
-----END PGP SIGNATURE-----
Merge tag 'x86-urgent-2025-06-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Thomas Gleixner:
"A small set of x86 fixes:
- Cure IO bitmap inconsistencies
A failed fork cleans up all resources of the newly created thread
via exit_thread(). exit_thread() invokes io_bitmap_exit() which
does the IO bitmap cleanups, which unfortunately assume that the
cleanup is related to the current task, which is obviously bogus.
Make it work correctly
- A lockdep fix in the resctrl code removed the clearing of the
command buffer in two places, which keeps stale error messages
around. Bring them back.
- Remove unused trace events"
* tag 'x86-urgent-2025-06-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
fs/resctrl: Restore the rdt_last_cmd_clear() calls after acquiring rdtgroup_mutex
x86/iopl: Cure TIF_IO_BITMAP inconsistencies
x86/fpu: Remove unused trace events
this cycle regression (well, bugfix for this cycle bugfix for v6.15-rc1 regression)
do_move_mount(): split the checks in subtree-of-our-ns and entire-anon cases
selftests/mount_setattr: adapt detached mount propagation test
v6.15 fs: allow clone_private_mount() for a path on real rootfs
v6.11 fs/fhandle.c: fix a race in call of has_locked_children()
v5.15 fix propagation graph breakage by MOVE_MOUNT_SET_GROUP move_mount(2)
v5.15 clone_private_mnt(): make sure that caller has CAP_SYS_ADMIN in the right userns
v5.7 path_overmount(): avoid false negatives
v3.12 finish_automount(): don't leak MNT_LOCKED from parent to child
v2.6.15 do_change_type(): refuse to operate on unmounted/not ours mounts
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQQqUNBr3gm4hGXdBJlZ7Krx/gZQ6wUCaEXGDwAKCRBZ7Krx/gZQ
61WQAPwNBpcwum3F5fqT8rcKymqAUFpc0+rluJoBi+qfCQA9ywEAwn+Kh5qqtz++
cdVnUYQxBrh0u5IOzMEFITlgfYFJZA4=
=BIeU
-----END PGP SIGNATURE-----
Merge tag 'pull-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull mount fixes from Al Viro:
"Various mount-related bugfixes:
- split the do_move_mount() checks in subtree-of-our-ns and
entire-anon cases and adapt detached mount propagation selftest for
mount_setattr
- allow clone_private_mount() for a path on real rootfs
- fix a race in call of has_locked_children()
- fix move_mount propagation graph breakage by MOVE_MOUNT_SET_GROUP
- make sure clone_private_mnt() caller has CAP_SYS_ADMIN in the right
userns
- avoid false negatives in path_overmount()
- don't leak MNT_LOCKED from parent to child in finish_automount()
- do_change_type(): refuse to operate on unmounted/not ours mounts"
* tag 'pull-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
do_change_type(): refuse to operate on unmounted/not ours mounts
clone_private_mnt(): make sure that caller has CAP_SYS_ADMIN in the right userns
selftests/mount_setattr: adapt detached mount propagation test
do_move_mount(): split the checks in subtree-of-our-ns and entire-anon cases
fs: allow clone_private_mount() for a path on real rootfs
fix propagation graph breakage by MOVE_MOUNT_SET_GROUP move_mount(2)
finish_automount(): don't leak MNT_LOCKED from parent to child
path_overmount(): avoid false negatives
fs/fhandle.c: fix a race in call of has_locked_children()
-----BEGIN PGP SIGNATURE-----
iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAmhEsJkACgkQiiy9cAdy
T1Gjlgv+P1I4uD8iDbkJvyrh7mVcnIDnjttbgN1pTZZnpcX7JIyMPR9YPUn+QTKf
3GStGgLHqKtnXYYCdrkpPipsH/fjt7kjVLUC/YMhUGEr5dSFKzygOVaRTo4LDgNO
sSlOxTPyTaPgxdvIXriU5CeJGtm95GqowLDG29We9ubX1P5GLjIjGlMbAisAk4/Q
ek478D0qBSO5u6FkMYUFuSbjGwOYWAlabMxeOVHpjgEH/36tVtDKOc/bORnVZtHB
6bRjhymn9CZQOE2P+gDSqF6pZwHWhU7uYJYW+kbDbmDIgWDk9xl5/kbbqZ6RR3t5
joPWzZkO1I58avmJ31aauhvRpYIOXfOhx2dHB6VM/7bSHTvE6CW60RIlpXo1xbO6
d30pTy7ICqP/RWNZlt5ZgoPTDB1hPmVDdMXoELp7uZzChhrAz4qc/2BSD0Y5fCjE
e/U0bPfsyjscB7NtpcrQYVMbOsrweQ4qKUgolEk1klLolsKxDGE176wbPmLNAzo4
C9SzyxU3
=RvlW
-----END PGP SIGNATURE-----
Merge tag '6.16-rc-part2-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6
Pull more smb client updates from Steve French:
- multichannel/reconnect fixes
- move smbdirect (smb over RDMA) defines to fs/smb/common so they will
be able to be used in the future more broadly, and a documentation
update explaining setting up smbdirect mounts
- update email address for Paulo
* tag '6.16-rc-part2-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6:
cifs: update internal version number
MAINTAINERS, mailmap: Update Paulo Alcantara's email address
cifs: add documentation for smbdirect setup
cifs: do not disable interface polling on failure
cifs: serialize other channels when query server interfaces is pending
cifs: deal with the channel loading lag while picking channels
smb: client: make use of common smbdirect_socket_parameters
smb: smbdirect: introduce smbdirect_socket_parameters
smb: client: make use of common smbdirect_socket
smb: smbdirect: add smbdirect_socket.h
smb: client: make use of common smbdirect.h
smb: smbdirect: add smbdirect.h with public structures
smb: client: make use of common smbdirect_pdu.h
smb: smbdirect: add smbdirect_pdu.h with protocol definitions
Move this API to the canonical timer_*() namespace.
[ tglx: Redone against pre rc1 ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/aB2X0jCKQO56WdMt@gmail.com
JFFS2:
- Correctly check return code of jffs2_prealloc_raw_node_refs()
UBIFS:
- Spelling fixes
-----BEGIN PGP SIGNATURE-----
iQJKBAABCAA0FiEEdgfidid8lnn52cLTZvlZhesYu8EFAmhD0qQWHHJpY2hhcmRA
c2lnbWEtc3Rhci5hdAAKCRBm+VmF6xi7wclID/9UaXAATiKVBDJwZvpgYoLMO8Gm
xBJkVQCAOl7IqBliA4/VN7YZEDhKWZh9s1U1jD0UvNJGsSewpObmamI4kkdPTg2K
pEyosATBzQT6IDQ6J6h1GCfv0l4YzNK7wKbumHc1jxGMoDY/m6JMEvZHUuOIvGBy
YGppNOL7kPtxKWGJtq1KX2/8ivg5BiUIodjtgLrb/pO7BHDxqB0YabP3DXfsRfmG
kaACZRBSLxw+AWlAarYdE/0eCvMCSHcxVYS9xhpE2zJa8SLESxV1EQxcZwf3XXsb
rKlC8jM31nMWGgEDBHDYHLebz2Hynv+WgwYX+Um7Rt0Dx2EhYg/waCfVg8hb8owU
GzAxQrVNucmCldUoqfirt05g2HVegD/fePCGRqpyqevlMOVQRHKO5QXh04bUH/ly
718aRaL+j4vBFnvYJ59oaBBBNBCuAH0IDg64P7ijhgMAFTibRcj0YCvtBIWPrzLE
30vAk8bjvxLXOxy/VuHjfhbSV2YfTyLKJ1XQ6Mvsl+lGiTNIZSPCbvnO3npgqNxf
IaHjWQTKlrJwRpv30u4ZrNIaSRw4ZDIdHkoJkuJoFRekmb0NkBQnIJhpB7da+uP7
VxB2dfHBFKgU50t3MbPl9KAjt4/cciWn4cE7uDJ7jSuvzot4Mr7IQ5ZURrLesZFf
tlDYg/MSp0mHecAfZw==
=tbHc
-----END PGP SIGNATURE-----
Merge tag 'ubifs-for-linus-6.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/ubifs
Pull JFFS2 and UBIFS fixes from Richard Weinberger:
"JFFS2:
- Correctly check return code of jffs2_prealloc_raw_node_refs()
UBIFS:
- Spelling fixes"
* tag 'ubifs-for-linus-6.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/ubifs:
jffs2: check jffs2_prealloc_raw_node_refs() result in few other places
jffs2: check that raw node were preallocated before writing summary
ubifs: Fix grammar in error message
Ensure that propagation settings can only be changed for mounts located
in the caller's mount namespace. This change aligns permission checking
with the rest of mount(2).
Reviewed-by: Christian Brauner <brauner@kernel.org>
Fixes: 07b20889e3 ("beginning of the shared-subtree proper")
Reported-by: "Orlando, Noah" <Noah.Orlando@deshaw.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
What we want is to verify there is that clone won't expose something
hidden by a mount we wouldn't be able to undo. "Wouldn't be able to undo"
may be a result of MNT_LOCKED on a child, but it may also come from
lacking admin rights in the userns of the namespace mount belongs to.
clone_private_mnt() checks the former, but not the latter.
There's a number of rather confusing CAP_SYS_ADMIN checks in various
userns during the mount, especially with the new mount API; they serve
different purposes and in case of clone_private_mnt() they usually,
but not always end up covering the missing check mentioned above.
Reviewed-by: Christian Brauner <brauner@kernel.org>
Reported-by: "Orlando, Noah" <Noah.Orlando@deshaw.com>
Fixes: 427215d85e ("ovl: prevent private clone if bind mount is not allowed")
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
... and fix the breakage in anon-to-anon case. There are two cases
acceptable for do_move_mount() and mixing checks for those is making
things hard to follow.
One case is move of a subtree in caller's namespace.
* source and destination must be in caller's namespace
* source must be detachable from parent
Another is moving the entire anon namespace elsewhere
* source must be the root of anon namespace
* target must either in caller's namespace or in a suitable
anon namespace (see may_use_mount() for details).
* target must not be in the same namespace as source.
It's really easier to follow if tests are *not* mixed together...
Reviewed-by: Christian Brauner <brauner@kernel.org>
Fixes: 3b5260d12b ("Don't propagate mounts into detached trees")
Reported-by: Allison Karlitskaya <lis@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Mounting overlayfs with a directory on real rootfs (initramfs)
as upperdir has failed with following message since commit
db04662e2f ("fs: allow detached mounts in clone_private_mount()").
[ 4.080134] overlayfs: failed to clone upperpath
Overlayfs mount uses clone_private_mount() to create internal mount
for the underlying layers.
The commit made clone_private_mount() reject real rootfs because
it does not have a parent mount and is in the initial mount namespace,
that is not an anonymous mount namespace.
This issue can be fixed by modifying the permission check
of clone_private_mount() following [1].
Reviewed-by: Christian Brauner <brauner@kernel.org>
Fixes: db04662e2f ("fs: allow detached mounts in clone_private_mount()")
Link: https://lore.kernel.org/all/20250514190252.GQ2023217@ZenIV/ [1]
Link: https://lore.kernel.org/all/20250506194849.GT2023217@ZenIV/
Suggested-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Kazuma Kondo <kazuma-kondo@nec.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
9ffb14ef61 "move_mount: allow to add a mount into an existing group"
breaks assertions on ->mnt_share/->mnt_slave. For once, the data structures
in question are actually documented.
Documentation/filesystem/sharedsubtree.rst:
All vfsmounts in a peer group have the same ->mnt_master. If it is
non-NULL, they form a contiguous (ordered) segment of slave list.
do_set_group() puts a mount into the same place in propagation graph
as the old one. As the result, if old mount gets events from somewhere
and is not a pure event sink, new one needs to be placed next to the
old one in the slave list the old one's on. If it is a pure event
sink, we only need to make sure the new one doesn't end up in the
middle of some peer group.
"move_mount: allow to add a mount into an existing group" ends up putting
the new one in the beginning of list; that's definitely not going to be
in the middle of anything, so that's fine for case when old is not marked
shared. In case when old one _is_ marked shared (i.e. is not a pure event
sink), that breaks the assumptions of propagation graph iterators.
Put the new mount next to the old one on the list - that does the right thing
in "old is marked shared" case and is just as correct as the current behaviour
if old is not marked shared (kudos to Pavel for pointing that out - my original
suggested fix changed behaviour in the "nor marked" case, which complicated
things for no good reason).
Reviewed-by: Christian Brauner <brauner@kernel.org>
Fixes: 9ffb14ef61 ("move_mount: allow to add a mount into an existing group")
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Holding namespace_sem is enough to make sure that result remains valid.
It is *not* enough to avoid false negatives from __lookup_mnt(). Mounts
can be unhashed outside of namespace_sem (stuck children getting detached
on final mntput() of lazy-umounted mount) and having an unrelated mount
removed from the hash chain while we traverse it may end up with false
negative from __lookup_mnt(). We need to sample and recheck the seqlock
component of mount_lock...
Bug predates the introduction of path_overmount() - it had come from
the code in finish_automount() that got abstracted into that helper.
Reviewed-by: Christian Brauner <brauner@kernel.org>
Fixes: 26df6034fd ("fix automount/automount race properly")
Fixes: 6ac3928156 ("fs: allow to mount beneath top mount")
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
may_decode_fh() is calling has_locked_children() while holding no locks.
That's an oopsable race...
The rest of the callers are safe since they are holding namespace_sem and
are guaranteed a positive refcount on the mount in question.
Rename the current has_locked_children() to __has_locked_children(), make
it static and switch the fs/namespace.c users to it.
Make has_locked_children() a wrapper for __has_locked_children(), calling
the latter under read_seqlock_excl(&mount_lock).
Reviewed-by: Christian Brauner <brauner@kernel.org>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Fixes: 620c266f39 ("fhandle: relax open_by_handle_at() permission checks")
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
performance improvement in cases where an IMA policy with rules that
are based on fsmagic matching is enforced, an encryption-related fixup
that addresses generic/397 and other fstest failures and a couple of
cleanups in CephFS.
-----BEGIN PGP SIGNATURE-----
iQFHBAABCAAxFiEEydHwtzie9C7TfviiSn/eOAIR84sFAmhDKSQTHGlkcnlvbW92
QGdtYWlsLmNvbQAKCRBKf944AhHzi5ofCACeaTBV/Dwr90/P+j9sYOfoirVi9rps
onoce9aMPxEgVHDuhmVSwVnta6Fw7XtD1xxpe3rc1km/zi2/pH/FjBNNWYPk5bjX
0+LPSqdf4HGBVmAFahLeYgB1beUznEd0saddoiq4tzY7DRpoD+yhfU8kHUzus8Ai
77yEU5kh3SoTUPPoat0ePIR3KdUIFZy7wKHF3oo6urp8RmOjFw+knZd44uPTpvd5
ApLzAwsGqJm7gwCSuB67TL3Wd0/6I+wnyNXKX8bYe090ZcPd6fy3WwIacTlo0CFu
+tCenZBbVFXWVkv4LQtvIz+YlfmTZ31/HzgvLqb/33yjK+jWSzskDsBy
=lFOg
-----END PGP SIGNATURE-----
Merge tag 'ceph-for-6.16-rc1' of https://github.com/ceph/ceph-client
Pull ceph updates from Ilya Dryomov:
- a one-liner that leads to a startling (but also very much rational)
performance improvement in cases where an IMA policy with rules that
are based on fsmagic matching is enforced
- an encryption-related fixup that addresses generic/397 and other
fstest failures
- a couple of cleanups in CephFS
* tag 'ceph-for-6.16-rc1' of https://github.com/ceph/ceph-client:
ceph: fix variable dereferenced before check in ceph_umount_begin()
ceph: set superblock s_magic for IMA fsmagic matching
ceph: cleanup hardcoded constants of file handle size
ceph: fix possible integer overflow in ceph_zero_objects()
ceph: avoid kernel BUG for encrypted inode with unaligned file size
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQSQHSd0lITzzeNWNm3h3BK/laaZPAUCaEKzdwAKCRDh3BK/laaZ
PLpuAQCK2B/LsbyLslWVN6lWbQNwiPHF7l49+GjS2BaWVDxnTwEAwdpaktgg7tRI
wsMp9CEc0lbp8lMDjHDOEqhc/Qvejg4=
=Y7HA
-----END PGP SIGNATURE-----
Merge tag 'ovl-update-v2-6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/overlayfs/vfs
Pull overlayfs update from Miklos Szeredi:
- Fix a regression in getting the path of an open file (e.g. in
/proc/PID/maps) for a nested overlayfs setup (André Almeida)
- Support data-only layers and verity in a user namespace (unprivileged
composefs use case)
- Fix a gcc warning (Kees)
- Cleanups
* tag 'ovl-update-v2-6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/overlayfs/vfs:
ovl: Annotate struct ovl_entry with __counted_by()
ovl: Replace offsetof() with struct_size() in ovl_stack_free()
ovl: Replace offsetof() with struct_size() in ovl_cache_entry_new()
ovl: Check for NULL d_inode() in ovl_dentry_upper()
ovl: Use str_on_off() helper in ovl_show_options()
ovl: don't require "metacopy=on" for "verity"
ovl: relax redirect/metacopy requirements for lower -> data redirect
ovl: make redirect/metacopy rejection consistent
ovl: Fix nested backing file paths
smatch warnings:
fs/ceph/super.c:1042 ceph_umount_begin() warn: variable dereferenced before check 'fsc' (see line 1041)
vim +/fsc +1042 fs/ceph/super.c
void ceph_umount_begin(struct super_block *sb)
{
struct ceph_fs_client *fsc = ceph_sb_to_fs_client(sb);
doutc(fsc->client, "starting forced umount\n");
^^^^^^^^^^^
Dereferenced
if (!fsc)
^^^^
Checked too late.
return;
fsc->mount_state = CEPH_MOUNT_SHUTDOWN;
__ceph_umount_begin(fsc);
}
The VFS guarantees that the superblock is still
alive when it calls into ceph via ->umount_begin().
Finally, we don't need to check the fsc and
it should be valid. This patch simply removes
the fsc check.
Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Closes: https://lore.kernel.org/r/202503280852.YDB3pxUY-lkp@intel.com/
Signed-off-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com>
Reviewed by: Alex Markuze <amarkuze@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
-----BEGIN PGP SIGNATURE-----
iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAmhAvJsACgkQiiy9cAdy
T1F0rwv7BYKtOEGcRQSToncJOoqSxRZKPA9ulWgv5a2MVt2L91M0DcITxmQBh/Tg
OmO1E7i2pYZr+fhDAQZeRp6g0RCfPBaUDPjslXyCwK9dJjGhADRYcg/ci7eYOFsv
RXiG+3nsR08k03hRqerI76LhuI1OZyH32e14Sa7B35cuT4czxtz8JiTcpvFxfr3W
rz33svuNJFNhPoFfckyxnIkQAgTvgcXbrTt/zWj2KzhoFWY91UvOdOeh6mU/OG+a
Zeq+sx1cfG7nib8XXxAtHJkes0uO05HBkzcYcXD2Os1ZYh1tyS19IQejfev0Cqfo
7BLQpTuoOS7j8eSJsq6bpDUgIP0PTLdleewx6jgeUHCDzs6YeOETlmDeh1ic0Iyi
LRrMwI4MH8rtHnJnGPStsqus47AlHFd8fSXgi2Y+G0fAMJ5sziMfsbfrleHpqogK
+3+zISyLbhjADuBAsLwIgQkjvHJp5k1GRDS3qo3Hs3knOjHKWmW3yqfTuaGf0uIq
SQ9VBTt5
=DKRi
-----END PGP SIGNATURE-----
Merge tag '6.16-rc-ksmbd-server-fixes' of git://git.samba.org/ksmbd
Pull smb server updates from Steve French:
"Four smb3 server fixes:
- Fix for special character handling when mounting with "posix"
- Fix for mounts from Mac for fs that don't provide unique inode
numbers
- Two cleanup patches (e.g. for crypto calls)"
* tag '6.16-rc-ksmbd-server-fixes' of git://git.samba.org/ksmbd:
ksmbd: allow a filename to contain special characters on SMB3.1.1 posix extension
ksmbd: provide zero as a unique ID to the Mac client
ksmbd: remove unnecessary softdep on crc32
ksmbd: use SHA-256 library API instead of crypto_shash API
- More stack usage improvements (~600 bytes).
- Define CLASS()es for some commonly used types, and convert most
rcu_read_lock() uses to the new lock guards
- New introspection:
- Superblock error counters are now available in sysfs: previously,
they were only visible with 'show-super', which doesn't provide a
live view
- New tracepoint, error_throw(), which is called any time we return an
error and start to unwind
- Repair
- check_fix_ptrs() can now repair btree node roots
- We can now repair when we've somehow ended up with the journal using
a superblock bucket
- Revert some leftovers from the aborted directory i_size feature, and
add repair code: some userspace programs (e.g. sshfs) were getting
confused.
It seems in 6.15 there's a bug where i_nlink on the vfs inode has been
getting incorrectly set to 0, with some unfortunate results;
list_journal analysis showed bch2_inode_rm() being called (by
bch2_evict_inode()) when it clearly should not have been.
- bch2_inode_rm() now runs "should we be deleting this inode?" checks
that were previously only run when deleting unlinked inodes in
recovery.
- check_subvol() was treating a dangling subvol (pointing to a missing
root inode) like a dangling dirent, and deleting it. This was the
really unfortunate one: check_subvol() will now recreate the root
inode if necessary.
This took longer to debug than it should have, and we lost several
filesystems unnecessarily, becuase users have been ignoring the release
notes and blindly running 'fsck -y'. Debugging required reconstructing
what happened through analyzing the journal, when ideally someone would
have noticed 'hey, fsck is asking me if I want to repair this: it
usually doesn't, maybe I should run this in dry run mode and check
what's going on?'.
As a reminder, fsck errors are being marked as autofix once we've
verified, in real world usage, that they're working correctly; blindly
running 'fsck -y' on an experimental filesystem is playing with fire.
Up to this incident we've had an excellent track record of not losing
data, so let's try to learn from this one.
This is a community effort, I wouldn't be able to get this done without
the help of all the people QAing and providing excellent bug reports and
feedback based on real world usage. But please don't ignore advice and
expect me to pick up the pieces.
If an error isn't marked as autofix, and it /is/ happening in the wild,
that's also something I need to know about so we can check it out and
add it to the autofix list if repair looks good. I haven't been getting
those reports, and I should be; since we don't have any sort of
telemetry yet I am absolutely dependent on user reports.
Now I'll be spending the weekend working on new repair code to see if I
can get a filesystem back for a user who didn't have backups.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEKnAFLkS8Qha+jvQrE6szbY3KbnYFAmhAuL0ACgkQE6szbY3K
bnZlCg/+Pu2TgWBbkwrmHgKH9v4K3pwQRREXSj0TlbWQp9bK00zEBrmdEfTZKgUC
q5nAAa6zCs0w/A9TFA7t1W/3+JY28ENhoArKFWemLhFZ2qEEXTZlVHvqyHOyuPBf
Loe+hQO8qgWJm6KO9VMCT1pEupslQLRlhI8GhbPPcxPvYXVjmTne7KCanhjeSEx5
TLaOiMn7jr+qPeLZ7xSMaaUTbH2SASjwl2E9/4kG6VqaTTF2MnPNwrdJI0exjyvs
QRaUvYbwBBTe/ru5ddmJuWj+61awKS87ANg+rkO2FWpOrai2HfgHd6o+zge/IR2Z
/Cfarv1SSd1+0caVaGUAzhnoVhOpY1FU4emJwVvcwnBXeXdGIb/kpaw+Lxm7fr+U
J6EnqgUoBsBWBCWgvUxlNHVeJ6wBdVNtDlTHabaH8RSCJZjgjg2JaSQM/v9VPLNa
6jTy3rhkPo50BJBb/F/AZmrobWXR2MkgID3iPEMcpjEyLaRZvW9FPqMFIxKQrUfB
XGDU4dAu3C+Q9i1KDkFIvIG3e7z9nSmv6np4O57CgrmrmmCpRUz7Yy0yhqNs36/H
WhLh/Pjb9gupdFK0TwFiEEG3wfnmXlde2c8IfrXXzKSKPIZ0T/RnLZapS7i94c2E
DumhLYjNjSCiciQZh4eLK0bKx0NETUG79eLUTz5Gi3Pc02E0pU8=
=ZGDn
-----END PGP SIGNATURE-----
Merge tag 'bcachefs-2025-06-04' of git://evilpiepirate.org/bcachefs
Pull more bcachefs updates from Kent Overstreet:
"More bcachefs updates:
- More stack usage improvements (~600 bytes)
- Define CLASS()es for some commonly used types, and convert most
rcu_read_lock() uses to the new lock guards
- New introspection:
- Superblock error counters are now available in sysfs:
previously, they were only visible with 'show-super', which
doesn't provide a live view
- New tracepoint, error_throw(), which is called any time we
return an error and start to unwind
- Repair
- check_fix_ptrs() can now repair btree node roots
- We can now repair when we've somehow ended up with the journal
using a superblock bucket
- Revert some leftovers from the aborted directory i_size feature,
and add repair code: some userspace programs (e.g. sshfs) were
getting confused
It seems in 6.15 there's a bug where i_nlink on the vfs inode has been
getting incorrectly set to 0, with some unfortunate results;
list_journal analysis showed bch2_inode_rm() being called (by
bch2_evict_inode()) when it clearly should not have been.
- bch2_inode_rm() now runs "should we be deleting this inode?" checks
that were previously only run when deleting unlinked inodes in
recovery
- check_subvol() was treating a dangling subvol (pointing to a
missing root inode) like a dangling dirent, and deleting it. This
was the really unfortunate one: check_subvol() will now recreate
the root inode if necessary
This took longer to debug than it should have, and we lost several
filesystems unnecessarily, because users have been ignoring the
release notes and blindly running 'fsck -y'. Debugging required
reconstructing what happened through analyzing the journal, when
ideally someone would have noticed 'hey, fsck is asking me if I want
to repair this: it usually doesn't, maybe I should run this in dry run
mode and check what's going on?'
As a reminder, fsck errors are being marked as autofix once we've
verified, in real world usage, that they're working correctly; blindly
running 'fsck -y' on an experimental filesystem is playing with fire
Up to this incident we've had an excellent track record of not losing
data, so let's try to learn from this one
This is a community effort, I wouldn't be able to get this done
without the help of all the people QAing and providing excellent bug
reports and feedback based on real world usage. But please don't
ignore advice and expect me to pick up the pieces
If an error isn't marked as autofix, and it /is/ happening in the
wild, that's also something I need to know about so we can check it
out and add it to the autofix list if repair looks good. I haven't
been getting those reports, and I should be; since we don't have any
sort of telemetry yet I am absolutely dependent on user reports
Now I'll be spending the weekend working on new repair code to see if
I can get a filesystem back for a user who didn't have backups"
* tag 'bcachefs-2025-06-04' of git://evilpiepirate.org/bcachefs: (69 commits)
bcachefs: add cond_resched() to handle_overwrites()
bcachefs: Make journal read log message a bit quieter
bcachefs: Fix subvol to missing root repair
bcachefs: Run may_delete_deleted_inode() checks in bch2_inode_rm()
bcachefs: delete dead code from may_delete_deleted_inode()
bcachefs: Add flags to subvolume_to_text()
bcachefs: Fix oops in btree_node_seq_matches()
bcachefs: Fix dirent_casefold_mismatch repair
bcachefs: Fix bch2_fsck_rename_dirent() for casefold
bcachefs: Redo bch2_dirent_init_name()
bcachefs: Fix -Wc23-extensions in bch2_check_dirents()
bcachefs: Run check_dirents second time if required
bcachefs: Run snapshot deletion out of system_long_wq
bcachefs: Make check_key_has_snapshot safer
bcachefs: BCH_RECOVERY_PASS_NO_RATELIMIT
bcachefs: bch2_require_recovery_pass()
bcachefs: bch_err_throw()
bcachefs: Repair code for directory i_size
bcachefs: Kill un-reverted directory i_size code
bcachefs: Delete redundant fsck_err()
...
Users seem to be assuming that the 'dropped unflushed entries' message
at the end of journal read indicates some sort of problem, when it does
not - we expect there to be entries in the journal that weren't
commited, it's purely informational so that we can correlate journal
sequence numbers elsewhere when debugging.
Shorten the log message a bit to hopefully make this clearer.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We had a bug where the root inode of a subvolume was erronously deleted:
bch2_evict_inode() called bch2_inode_rm(), meaning the VFS inode's
i_nlink was somehow set to 0 when it shouldn't have - the inode in the
btree indicated it clearly was not unlinked.
This has been addressed with additional safety checks in
bch2_inode_rm() - pulling in the safety checks we already were doing
when deleting unlinked inodes in recovery - but the really disastrous
bug was in check_subvols(), which on finding a dangling subvol (subvol
with a missing root inode) would delete the subvolume.
I assume this bug dates from early check_directory_structure() code,
which originally handled subvolumes and normal paths - the idea being
that still live contents of the subvolume would get reattached
somewhere.
But that's incorrect, and disastrously so; deleting a subvolume triggers
deleting the snapshot ID it points to, deleting the entire contents.
The correct way to repair is to recreate the root inode if it's missing;
then any contents will get reattached under that subvolume's lost+found.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We had a bug where bch2_evict_inode() incorrectly called bch2_inode_rm()
- the journal clearly showed the inode was not unlinked.
We've got checks that we use in recovery when cleaning up deleted
inodes, lift them to bch2_inode_rm() as well.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>