linux

mirror of https://github.com/torvalds/linux.git synced 2026-05-22 22:22:08 +02:00

Author	SHA1	Message	Date
Linus Torvalds	3ba310f2a3	lsm/stable-7.1 PR 20260410 -----BEGIN PGP SIGNATURE----- iQJIBAABCgAyFiEES0KozwfymdVUl37v6iDy2pc3iXMFAmnZeioUHHBhdWxAcGF1 bC1tb29yZS5jb20ACgkQ6iDy2pc3iXN15w//e6Tgou0wgffKb+vZP+9xC53bQzmL Z0en5gdfifbqeIvj6LzSdUlIlsDUw65S42eGhqIYwk5oYNd2lwFMXd16fggakc/n TDdF1/7WTwcwMFKJxtew5tcE3pjwC96F6bqF9YmDdcNycjuQ5cbsJ/56hQsWZYxo g8y5y3fmWrkQ28gst2NJiR6XQx7acFc3S2FRKZc8mldkDjrmb9gN9WdwWJ6/nw+A xnNm6BdVjZ/gnrQliO4eL4J5T1ijvy3gddW3rXdytcIoH8js/pcZh3BpfTVWlzs+ 5KLPy9Tm39g3BwNx5cHrmzz1Ug6fCqIJyJzPK0M3q3vr+7w1kEWZnM5IUQwdQPsg dVmLBvhrvnKNBnMXd53seQJm33UkcKPpfWbaYQFpUC1ZocUiQDvE3eCH+q4SIwjo kwB6Ycc27O3PjXzMtwQv5a1oKeOHuNKr0YCOAoOs1bshcJ4lTzVwqe4StuLeoN8z 7G+sfoT4JpM/izPTQjF+tNRcDECvX7j8b42BJcGx8Zf+JiP4HC1xBmLOg4egLc6m hxwaT5ipLhBL95eNCp16bqVrw5vVzZ+HbtDCkXJmU+grdsuTsp0bUGmm1DjpsFVk l/PyMDCvMNzi3uuNI9v9usQblv57XK6oNVBqcuoqejEDkEuX3MBSkov7DZr9nLLO JO1EvsWAjQdyXeQ= =giHS -----END PGP SIGNATURE----- Merge tag 'lsm-pr-20260410' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/lsm Pull LSM updates from Paul Moore: "We only have five patches in the LSM tree, but three of the five are for an important bugfix relating to overlayfs and the mmap() and mprotect() access controls for LSMs. Highlights below: - Fix problems with the mmap() and mprotect() LSM hooks on overlayfs As we are dealing with problems both in mmap() and mprotect() there are essentially two components to this fix, spread across three patches with all marked for stable. The simplest portion of the fix is the creation of a new LSM hook, security_mmap_backing_file(), that is used to enforce LSM mmap() access controls on backing files in the stacked/overlayfs case. The existing security_mmap_file() does not have visibility past the user file. You can see from the associated SELinux hook callback the code is fairly straightforward. The mprotect() fix is a bit more complicated as there is no way in the mprotect() code path to inspect both the user and backing files, and bolting on a second file reference to vm_area_struct wasn't really an option. The solution taken here adds a LSM security blob and associated hooks to the backing_file struct that LSMs can use to capture and store relevant information from the user file. While the necessary SELinux information is relatively small, a single u32, I expect other LSMs to require more than that, and a dedicated backing_file LSM blob provides a storage mechanism without negatively impacting other filesystems. I want to note that other LSMs beyond SELinux have been involved in the discussion of the fixes presented here and they are working on their own related changes using these new hooks, but due to other issues those patches will be coming at a later date. - Use kstrdup_const()/kfree_const() for securityfs symlink targets - Resolve a handful of kernel-doc warnings in cred.h" * tag 'lsm-pr-20260410' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/lsm: selinux: fix overlayfs mmap() and mprotect() access checks lsm: add backing_file LSM hooks fs: prepare for adding LSM blob to backing_file securityfs: use kstrdup_const() to manage symlink targets cred: fix kernel-doc warnings in cred.h	2026-04-13 15:17:28 -07:00
Paul Moore	6af36aeb14	lsm: add backing_file LSM hooks Stacked filesystems such as overlayfs do not currently provide the necessary mechanisms for LSMs to properly enforce access controls on the mmap() and mprotect() operations. In order to resolve this gap, a LSM security blob is being added to the backing_file struct and the following new LSM hooks are being created: security_backing_file_alloc() security_backing_file_free() security_mmap_backing_file() The first two hooks are to manage the lifecycle of the LSM security blob in the backing_file struct, while the third provides a new mmap() access control point for the underlying backing file. It is also expected that LSMs will likely want to update their security_file_mprotect() callback to address issues with their mprotect() controls, but that does not require a change to the security_file_mprotect() LSM hook. There are a three other small changes to support these new LSM hooks: * Pass the user file associated with a backing file down to alloc_empty_backing_file() so it can be included in the security_backing_file_alloc() hook. * Add getter and setter functions for the backing_file struct LSM blob as the backing_file struct remains private to fs/file_table.c. * Constify the file struct field in the LSM common_audit_data struct to better support LSMs that need to pass a const file struct pointer into the common LSM audit code. Thanks to Arnd Bergmann for identifying the missing EXPORT_SYMBOL_GPL() and supplying a fixup. Cc: stable@vger.kernel.org Cc: linux-fsdevel@vger.kernel.org Cc: linux-unionfs@vger.kernel.org Cc: linux-erofs@lists.ozlabs.org Reviewed-by: Amir Goldstein <amir73il@gmail.com> Reviewed-by: Serge Hallyn <serge@hallyn.com> Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Paul Moore <paul@paul-moore.com>	2026-04-03 16:53:50 -04:00
Christoph Hellwig	0924f6b80d	fs: pass on FTRUNCATE_* flags to do_truncate Pass the flags one level down to replace the somewhat confusing small argument, and clean up do_truncate as a result. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://patch.msgid.link/20260323070205.2939118-3-hch@lst.de Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>	2026-03-23 12:41:57 +01:00
Christoph Hellwig	e43dce8a0b	fs: fix archiecture-specific compat_ftruncate64 The "small" argument to do_sys_ftruncate indicates if > 32-bit size should be reject, but all the arch-specific compat ftruncate64 implementations get this wrong. Merge do_sys_ftruncate and ksys_ftruncate, replace the integer as boolean small flag with a descriptive one about LFS semantics, and use it correctly in the architecture-specific ftruncate64 implementations. Fixes: `1da177e4c3` ("Linux-2.6.12-rc2") Fixes: `3dd681d944` ("arm64: 32-bit (compat) applications support") Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://patch.msgid.link/20260323070205.2939118-2-hch@lst.de Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>	2026-03-23 12:41:57 +01:00
Linus Torvalds	26c9342bb7	struct filename series [mostly] sanitize struct filename hanling Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQQqUNBr3gm4hGXdBJlZ7Krx/gZQ6wUCaYlcJgAKCRBZ7Krx/gZQ 6xlKAP9c9J13sJ/mcobsj1Ov7nSHISNbnYqvRRCu09Wq3UQvJgEApNQYOEdLtpff zUnWOAQ0nOKY7w9VMLkRRustXpuGjAc= =Fld4 -----END PGP SIGNATURE----- Merge tag 'pull-filename' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs 'struct filename' updates from Al Viro: "[Mostly] sanitize struct filename handling" * tag 'pull-filename' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (68 commits) sysfs(2): fs_index() argument is _not_ a pathname alpha: switch osf_mount() to strndup_user() ksmbd: use CLASS(filename_kernel) mqueue: switch to CLASS(filename) user_statfs(): switch to CLASS(filename) statx: switch to CLASS(filename_maybe_null) quotactl_block(): switch to CLASS(filename) chroot(2): switch to CLASS(filename) move_mount(2): switch to CLASS(filename_maybe_null) namei.c: switch user pathname imports to CLASS(filename{,_flags}) namei.c: convert getname_kernel() callers to CLASS(filename_kernel) do_f{chmod,chown,access}at(): use CLASS(filename_uflags) do_readlinkat(): switch to CLASS(filename_flags) do_sys_truncate(): switch to CLASS(filename) do_utimes_path(): switch to CLASS(filename_uflags) chdir(2): unspaghettify a bit... do_fchownat(): unspaghettify a bit... fspick(2): use CLASS(filename_flags) name_to_handle_at(): use CLASS(filename_uflags) vfs_open_tree(): use CLASS(filename_uflags) ...	2026-02-09 16:58:28 -08:00
Linus Torvalds	157d3d6efd	vfs-7.0-rc1.namespace Please consider pulling these changes from the signed vfs-7.0-rc1.namespace tag. Thanks! Christian -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCaYX49gAKCRCRxhvAZXjc ovzgAP9BpqMQhMy2VCurru8/T5VAd6eJdgXzEfXqMksL5BNm8gEAsLx666KJNKgm Sh/yVA2KBjf51gvcLZ4gHOISaMU8bAI= =RGLf -----END PGP SIGNATURE----- Merge tag 'vfs-7.0-rc1.namespace' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs mount updates from Christian Brauner: - statmount: accept fd as a parameter Extend struct mnt_id_req with a file descriptor field and a new STATMOUNT_BY_FD flag. When set, statmount() returns mount information for the mount the fd resides on — including detached mounts (unmounted via umount2(MNT_DETACH)). For detached mounts the STATMOUNT_MNT_POINT and STATMOUNT_MNT_NS_ID mask bits are cleared since neither is meaningful. The capability check is skipped for STATMOUNT_BY_FD since holding an fd already implies prior access to the mount and equivalent information is available through fstatfs() and /proc/pid/mountinfo without privilege. Includes comprehensive selftests covering both attached and detached mount cases. - fs: Remove internal old mount API code (1 patch) Now that every in-tree filesystem has been converted to the new mount API, remove all the legacy shim code in fs_context.c that handled unconverted filesystems. This deletes ~280 lines including legacy_init_fs_context(), the legacy_fs_context struct, and associated wrappers. The mount(2) syscall path for userspace remains untouched. Documentation references to the legacy callbacks are cleaned up. - mount: add OPEN_TREE_NAMESPACE to open_tree() Container runtimes currently use CLONE_NEWNS to copy the caller's entire mount namespace — only to then pivot_root() and recursively unmount everything they just copied. With large mount tables and thousands of parallel container launches this creates significant contention on the namespace semaphore. OPEN_TREE_NAMESPACE copies only the specified mount tree (like OPEN_TREE_CLONE) but returns a mount namespace fd instead of a detached mount fd. The new namespace contains the copied tree mounted on top of a clone of the real rootfs. This functions as a combined unshare(CLONE_NEWNS) + pivot_root() in a single syscall. Works with user namespaces: an unshare(CLONE_NEWUSER) followed by OPEN_TREE_NAMESPACE creates a mount namespace owned by the new user namespace. Mount namespace file mounts are excluded from the copy to prevent cycles. Includes ~1000 lines of selftests" * tag 'vfs-7.0-rc1.namespace' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: selftests/open_tree: add OPEN_TREE_NAMESPACE tests mount: add OPEN_TREE_NAMESPACE fs: Remove internal old mount API code selftests: statmount: tests for STATMOUNT_BY_FD statmount: accept fd as a parameter statmount: permission check should return EPERM	2026-02-09 14:43:47 -08:00
Linus Torvalds	c84bb79f70	vfs-7.0-rc1.nullfs Please consider pulling these changes from the signed vfs-7.0-rc1.nullfs tag. Thanks! Christian -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCaYX49gAKCRCRxhvAZXjc olG7AQD9TywOR0HC9PMT8jrhC1TKODnZ4H1aLNlYVltzfJ09xwEAwFSGO4rQmGAF aZdD0RQw4bkf7IC1PIZHEGUqmVXJCQ8= =NvyI -----END PGP SIGNATURE----- Merge tag 'vfs-7.0-rc1.nullfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs nullfs update from Christian Brauner: "Add a completely catatonic minimal pseudo filesystem called "nullfs" and make pivot_root() work in the initramfs. Currently pivot_root() does not work on the real rootfs because it cannot be unmounted. Userspace has to recursively delete initramfs contents manually before continuing boot, using the fragile switch_root sequence (overmount + chroot). Add nullfs, a minimal immutable filesystem that serves as the true root of the mount hierarchy. The mutable rootfs (tmpfs/ramfs) is mounted on top of it. This allows userspace to simply: chdir(new_root); pivot_root(".", "."); umount2(".", MNT_DETACH); without the traditional switch_root workarounds. systemd already handles this correctly. It tries pivot_root() first and falls back to MS_MOVE only when that fails. This also means rootfs mounts in unprivileged namespaces no longer need MNT_LOCKED, since the immutable nullfs guarantees nothing can be revealed by unmounting the covering mount. nullfs is a single-instance filesystem (get_tree_single()) marked SB_NOUSER \| SB_I_NOEXEC \| SB_I_NODEV with an immutable empty root directory. This means sooner or later it can be used to overmount other directories to hide their contents without any additional protection needed. We enable it unconditionally. If we see any real regression we'll hide it behind a boot option. nullfs has extensions beyond this in the future. It will serve as a concept to support the creation of completely empty mount namespaces - which is work coming up in the next cycle" * tag 'vfs-7.0-rc1.nullfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: fs: use nullfs unconditionally as the real rootfs docs: mention nullfs fs: add immutable rootfs fs: add init_pivot_root() fs: ensure that internal tmpfs mount gets mount id zero	2026-02-09 13:41:34 -08:00
Christian Brauner	9b8a0ba682	mount: add OPEN_TREE_NAMESPACE When creating containers the setup usually involves using CLONE_NEWNS via clone3() or unshare(). This copies the caller's complete mount namespace. The runtime will also assemble a new rootfs and then use pivot_root() to switch the old mount tree with the new rootfs. Afterward it will recursively umount the old mount tree thereby getting rid of all mounts. On a basic system here where the mount table isn't particularly large this still copies about 30 mounts. Copying all of these mounts only to get rid of them later is pretty wasteful. This is exacerbated if intermediary mount namespaces are used that only exist for a very short amount of time and are immediately destroyed again causing a ton of mounts to be copied and destroyed needlessly. With a large mount table and a system where thousands or ten-thousands of containers are spawned in parallel this quickly becomes a bottleneck increasing contention on the semaphore. Extend open_tree() with a new OPEN_TREE_NAMESPACE flag. Similar to OPEN_TREE_CLONE only the indicated mount tree is copied. Instead of returning a file descriptor referring to that mount tree OPEN_TREE_NAMESPACE will cause open_tree() to return a file descriptor to a new mount namespace. In that new mount namespace the copied mount tree has been mounted on top of a copy of the real rootfs. The caller can setns() into that mount namespace and perform any additionally required setup such as move_mount() detached mounts in there. This allows OPEN_TREE_NAMESPACE to function as a combined unshare(CLONE_NEWNS) and pivot_root(). A caller may for example choose to create an extremely minimal rootfs: fd_mntns = open_tree(-EBADF, "/var/lib/containers/wootwoot", OPEN_TREE_NAMESPACE); This will create a mount namespace where "wootwoot" has become the rootfs mounted on top of the real rootfs. The caller can now setns() into this new mount namespace and assemble additional mounts. This also works with user namespaces: unshare(CLONE_NEWUSER); fd_mntns = open_tree(-EBADF, "/var/lib/containers/wootwoot", OPEN_TREE_NAMESPACE); which creates a new mount namespace owned by the earlier created user namespace with "wootwoot" as the rootfs mounted on top of the real rootfs. Link: https://patch.msgid.link/20251229-work-empty-namespace-v1-1-bfb24c7b061f@kernel.org Tested-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Aleksa Sarai <cyphar@cyphar.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Suggested-by: Christian Brauner <brauner@kernel.org> Suggested-by: Aleksa Sarai <cyphar@cyphar.com> Signed-off-by: Christian Brauner <brauner@kernel.org>	2026-01-16 19:21:40 +01:00
Al Viro	e50aae1d39	non-consuming variants of do_{unlinkat,rmdir}() similar to previous commit; replacements are filename_{unlinkat,rmdir}() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2026-01-16 12:51:50 -05:00
Al Viro	88fdc27617	non-consuming variant of do_mknodat() similar to previous commit; replacement is filename_mknodat() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2026-01-16 12:49:26 -05:00
Al Viro	dc912db15a	non-consuming variant of do_mkdirat() similar to previous commit; replacement is filename_mkdirat() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2026-01-16 12:48:49 -05:00
Al Viro	da72b76aae	non-consuming variant of do_symlinkat() similar to previous commit; replacement is filename_symlinkat() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2026-01-16 12:48:16 -05:00
Al Viro	037193b0ae	non-consuming variant of do_linkat() similar to previous commit; replacement is filename_linkat() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2026-01-16 12:47:42 -05:00
Al Viro	e6d50234cc	non-consuming variant of do_renameat2() filename_renameat2() replaces do_renameat2(); unlike the latter, it does not drop filename references - these days it can be just as easily arranged in the caller. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2026-01-16 12:46:57 -05:00
Al Viro	541003b576	rename do_filp_open() to do_file_open() "filp" thing never made sense; seeing that there are exactly 4 callers in the entire tree (and it's neither exported nor even declared in linux//.h), there's no point keeping that ugliness. FWIW, the 'filp' thing did originate in OSD&I; for some reason Tanenbaum decided to call the object representing an opened file 'struct filp', the last letter standing for 'position'. In all Unices, Linux included, the corresponding object had always been 'struct file'... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2026-01-13 15:18:07 -05:00
Al Viro	c3a3577cdb	struct filename: use names_cachep only for getname() and friends Instances of struct filename come from names_cachep (via __getname()). That is done by getname_flags() and getname_kernel() and these two are the main callers of __getname(). However, there are other callers that simply want to allocate PATH_MAX bytes for uses that have nothing to do with struct filename. We want saner allocation rules for long pathnames, so that struct filename would always come from names_cachep, with the out-of-line pathname getting kmalloc'ed. For that we need to be able to change the size of objects allocated by getname_flags()/getname_kernel(). That requires the rest of __getname() users to stop using names_cachep; we could explicitly switch all of those to kmalloc(), but that would cause quite a bit of noise. So the plan is to switch getname_...() to new helpers and turn __getname() into a wrapper for kmalloc(). Remaining __getname() users could be converted to explicit kmalloc() at leisure, hopefully along with figuring out what size do they really want - PATH_MAX is an overkill for some of them, used out of laziness ("we have a convenient helper that does 4K allocations and that's large enough, let's use it"). As a side benefit, names_cachep is no longer used outside of fs/namei.c, so we can move it there and be done with that. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2026-01-13 15:16:44 -05:00
Al Viro	4bfe0692d6	init_mknod(): turn into a trivial wrapper for do_mknodat() Same as init_unlink() and init_rmdir() already are; the only obstacle is do_mknodat() being static. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2026-01-13 15:01:32 -05:00
Christian Brauner	3c1b73fc6a	fs: add init_pivot_root() We will soon be able to pivot_root() with the introduction of the immutable rootfs. Add a wrapper for kernel internal usage. Link: https://patch.msgid.link/20260112-work-immutable-rootfs-v2-2-88dd1c34a204@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>	2026-01-12 16:52:09 +01:00
Christoph Hellwig	188344c8ac	fs: factor out a sync_lazytime helper Centralize how we synchronize a lazytime update into the actual on-disk timestamp into a single helper. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://patch.msgid.link/20260108141934.2052404-7-hch@lst.de Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>	2026-01-12 14:01:33 +01:00
Eric Sandeen	51a146e059	fs: Remove internal old mount API code Now that the last in-tree filesystem has been converted to the new mount API, remove all legacy mount API code designed to handle un-converted filesystems, and remove associated documentation as well. (The code to handle the legacy mount(2) syscall from userspace is still in place, of course.) Tested with an allmodconfig build on x86_64, and a sanity check of an old mount(2) syscall mount. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Link: https://patch.msgid.link/20251212174403.2882183-1-sandeen@redhat.com Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-12-15 14:48:33 +01:00
Linus Torvalds	7cd122b552	Some filesystems use a kinda-sorta controlled dentry refcount leak to pin dentries of created objects in dcache (and undo it when removing those). Reference is grabbed and not released, but it's not actually _stored_ anywhere. That works, but it's hard to follow and verify; among other things, we have no way to tell _which_ of the increments is intended to be an unpaired one. Worse, on removal we need to decide whether the reference had already been dropped, which can be non-trivial if that removal is on umount and we need to figure out if this dentry is pinned due to e.g. unlink() not done. Usually that is handled by using kill_litter_super() as ->kill_sb(), but there are open-coded special cases of the same (consider e.g. /proc/self). Things get simpler if we introduce a new dentry flag (DCACHE_PERSISTENT) marking those "leaked" dentries. Having it set claims responsibility for +1 in refcount. The end result this series is aiming for: * get these unbalanced dget() and dput() replaced with new primitives that would, in addition to adjusting refcount, set and clear persistency flag. * instead of having kill_litter_super() mess with removing the remaining "leaked" references (e.g. for all tmpfs files that hadn't been removed prior to umount), have the regular shrink_dcache_for_umount() strip DCACHE_PERSISTENT of all dentries, dropping the corresponding reference if it had been set. After that kill_litter_super() becomes an equivalent of kill_anon_super(). Doing that in a single step is not feasible - it would affect too many places in too many filesystems. It has to be split into a series. This work has really started early in 2024; quite a few preliminary pieces have already gone into mainline. This chunk is finally getting to the meat of that stuff - infrastructure and most of the conversions to it. Some pieces are still sitting in the local branches, but the bulk of that stuff is here. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQQqUNBr3gm4hGXdBJlZ7Krx/gZQ6wUCaTEq1wAKCRBZ7Krx/gZQ 643uAQC1rRslhw5l7OjxEpIYbGG4M+QaadN4Nf5Sr2SuTRaPJQD/W4oj/u4C2eCw Dd3q071tqyvm/PXNgN2EEnIaxlFUlwc= =rKq+ -----END PGP SIGNATURE----- Merge tag 'pull-persistency' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull persistent dentry infrastructure and conversion from Al Viro: "Some filesystems use a kinda-sorta controlled dentry refcount leak to pin dentries of created objects in dcache (and undo it when removing those). A reference is grabbed and not released, but it's not actually _stored_ anywhere. That works, but it's hard to follow and verify; among other things, we have no way to tell _which_ of the increments is intended to be an unpaired one. Worse, on removal we need to decide whether the reference had already been dropped, which can be non-trivial if that removal is on umount and we need to figure out if this dentry is pinned due to e.g. unlink() not done. Usually that is handled by using kill_litter_super() as ->kill_sb(), but there are open-coded special cases of the same (consider e.g. /proc/self). Things get simpler if we introduce a new dentry flag (DCACHE_PERSISTENT) marking those "leaked" dentries. Having it set claims responsibility for +1 in refcount. The end result this series is aiming for: - get these unbalanced dget() and dput() replaced with new primitives that would, in addition to adjusting refcount, set and clear persistency flag. - instead of having kill_litter_super() mess with removing the remaining "leaked" references (e.g. for all tmpfs files that hadn't been removed prior to umount), have the regular shrink_dcache_for_umount() strip DCACHE_PERSISTENT of all dentries, dropping the corresponding reference if it had been set. After that kill_litter_super() becomes an equivalent of kill_anon_super(). Doing that in a single step is not feasible - it would affect too many places in too many filesystems. It has to be split into a series. This work has really started early in 2024; quite a few preliminary pieces have already gone into mainline. This chunk is finally getting to the meat of that stuff - infrastructure and most of the conversions to it. Some pieces are still sitting in the local branches, but the bulk of that stuff is here" * tag 'pull-persistency' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (54 commits) d_make_discardable(): warn if given a non-persistent dentry kill securityfs_recursive_remove() convert securityfs get rid of kill_litter_super() convert rust_binderfs convert nfsctl convert rpc_pipefs convert hypfs hypfs: swich hypfs_create_u64() to returning int hypfs: switch hypfs_create_str() to returning int hypfs: don't pin dentries twice convert gadgetfs gadgetfs: switch to simple_remove_by_name() convert functionfs functionfs: switch to simple_remove_by_name() functionfs: fix the open/removal races functionfs: need to cancel ->reset_work in ->kill_sb() functionfs: don't bother with ffs->ref in ffs_data_{opened,closed}() functionfs: don't abuse ffs_data_closed() on fs shutdown convert selinuxfs ...	2025-12-05 14:36:21 -08:00
Al Viro	fc45aee662	get rid of kill_litter_super() Not used anymore. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2025-11-17 23:59:27 -05:00
NeilBrown	4037d966f0	VFS: introduce start_dirop() and end_dirop() The fact that directory operations (create,remove,rename) are protected by a lock on the parent is known widely throughout the kernel. In order to change this - to instead lock the target dentry - it is best to centralise this knowledge so it can be changed in one place. This patch introduces start_dirop() which is local to VFS code. It performs the required locking for create and remove. Rename will be handled separately. Various functions with names like start_creating() or start_removing_path(), some of which already exist, will export this functionality beyond the VFS. end_dirop() is the partner of start_dirop(). It drops the lock and releases the reference on the dentry. It is exported so that various end_creating etc functions can be inline. As vfs_mkdir() drops the dentry on error we cannot use end_dirop() as that won't unlock when the dentry IS_ERR(). For now we need an explicit unlock when dentry IS_ERR(). I hope to change vfs_mkdir() to unlock when it drops a dentry so that explicit unlock can go away. end_dirop() can always be called on the result of start_dirop(), but not after vfs_mkdir(). After a vfs_mkdir() we still may need the explicit unlock as seen in end_creating_path(). As well as adding start_dirop() and end_dirop() this patch uses them in: - simple_start_creating (which requires sharing lookup_noperm_common() with libfs.c) - start_removing_path / start_removing_user_path_at - filename_create / end_creating_path() - do_rmdir(), do_unlinkat() Reviewed-by: Amir Goldstein <amir73il@gmail.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: NeilBrown <neil@brown.name> Link: https://patch.msgid.link/20251113002050.676694-3-neilb@ownmail.net Tested-by: syzbot@syzkaller.appspotmail.com Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-11-14 13:15:56 +01:00
Linus Torvalds	50647a1176	file->f_path constification Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQQqUNBr3gm4hGXdBJlZ7Krx/gZQ6wUCaN3daAAKCRBZ7Krx/gZQ 6zNWAP9kD6rOJRNqDgea4pibDPa47Tps/WM5tsDv3dsLliY29gEA6sveOWZ3guAj 4oY3ts/NtHLWXvhI7Vd/1mr2aTKEZQk= =YNK+ -----END PGP SIGNATURE----- Merge tag 'pull-f_path' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull file->f_path constification from Al Viro: "Only one thing was modifying ->f_path of an opened file - acct(2). Massaging that away and constifying a bunch of struct path * arguments in functions that might be given &file->f_path ends up with the situation where we can turn ->f_path into an anon union of const struct path f_path and struct path __f_path, the latter modified only in a few places in fs/{file_table,open,namei}.c, all for struct file instances that are yet to be opened" * tag 'pull-f_path' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (23 commits) Have cc(1) catch attempts to modify ->f_path kernel/acct.c: saner struct file treatment configfs:get_target() - release path as soon as we grab configfs_item reference apparmor/af_unix: constify struct path * arguments ovl_is_real_file: constify realpath argument ovl_sync_file(): constify path argument ovl_lower_dir(): constify path argument ovl_get_verity_digest(): constify path argument ovl_validate_verity(): constify {meta,data}path arguments ovl_ensure_verity_loaded(): constify datapath argument ksmbd_vfs_set_init_posix_acl(): constify path argument ksmbd_vfs_inherit_posix_acl(): constify path argument ksmbd_vfs_kern_path_unlock(): constify path argument ksmbd_vfs_path_lookup_locked(): root_share_path can be const struct path * check_export(): constify path argument export_operations->open(): constify path argument rqst_exp_get_by_name(): constify path argument nfs: constify path argument of __vfs_getattr() bpf...d_path(): constify path argument done_path_create(): constify path argument ...	2025-10-03 16:32:36 -07:00
Linus Torvalds	e64aeecbbb	mount-related stuff for this cycle * saner handling of guards in fs/namespace.c, getting rid of needlessly strong locking in some of the users. * lock_mount() calling conventions change - have it set the environment for attaching to given location, storing the results in caller-supplied object, without altering the passed struct path. Make unlock_mount() called as __cleanup for those objects. It's not exactly guard(), but similar to it. * MNT_WRITE_HOLD done right - mnt_hold_writers() does not mess with ->mnt_flags anymore, so insertion of a new mount into ->s_mounts of underlying superblock does not, in itself, expose ->mnt_flags of that mount to concurrent modifications. * getting rid of pathological cases when umount() spends quadratic time removing the victims from propagation graph - part of that had been dealt with last cycle, this should finish it. * a bunch of stuff constified. * assorted cleanups. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQQqUNBr3gm4hGXdBJlZ7Krx/gZQ6wUCaNhzLAAKCRBZ7Krx/gZQ 63/IAP4yxJ6e3Pt66Uw0MeuSNmeLsQwb7mYo72lsYHpxjYANZAEAspMaLDU9NHxM Dy6WDVoJnf7+aDlD6E443YMfPX8XRQM= =5T+t -----END PGP SIGNATURE----- Merge tag 'pull-mount' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs mount updates from Al Viro: "Several piles this cycle, this mount-related one being the largest and trickiest: - saner handling of guards in fs/namespace.c, getting rid of needlessly strong locking in some of the users - lock_mount() calling conventions change - have it set the environment for attaching to given location, storing the results in caller-supplied object, without altering the passed struct path. Make unlock_mount() called as __cleanup for those objects. It's not exactly guard(), but similar to it - MNT_WRITE_HOLD done right. mnt_hold_writers() does not mess with ->mnt_flags anymore, so insertion of a new mount into ->s_mounts of underlying superblock does not, in itself, expose ->mnt_flags of that mount to concurrent modifications - getting rid of pathological cases when umount() spends quadratic time removing the victims from propagation graph - part of that had been dealt with last cycle, this should finish it - a bunch of stuff constified - assorted cleanups * tag 'pull-mount' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (64 commits) constify {__,}mnt_is_readonly() WRITE_HOLD machinery: no need for to bump mount_lock seqcount struct mount: relocate MNT_WRITE_HOLD bit preparations to taking MNT_WRITE_HOLD out of ->mnt_flags setup_mnt(): primitive for connecting a mount to filesystem simplify the callers of mnt_unhold_writers() copy_mnt_ns(): use guards copy_mnt_ns(): use the regular mechanism for freeing empty mnt_ns on failure open_detached_copy(): separate creation of namespace into helper open_detached_copy(): don't bother with mount_lock_hash() path_has_submounts(): use guard(mount_locked_reader) fs/namespace.c: sanitize descriptions for {__,}lookup_mnt() ecryptfs: get rid of pointless mount references in ecryptfs dentries umount_tree(): take all victims out of propagation graph at once do_mount(): use __free(path_put) do_move_mount_old(): use __free(path_put) constify can_move_mount_beneath() arguments path_umount(): constify struct path argument may_copy_tree(), __do_loopback(): constify struct path argument path_mount(): constify struct path argument ...	2025-10-03 10:19:44 -07:00
Al Viro	ae8425014d	Merge branches 'work.path' and 'work.mount' into work.f_path	2025-09-27 20:18:21 -04:00
Christian Brauner	e83f0b5d10	nsfs: support exhaustive file handles Pidfd file handles are exhaustive meaning they don't require a handle on another pidfd to pass to open_by_handle_at() so it can derive the filesystem to decode in. Instead it can be derived from the file handle itself. The same is possible for namespace file handles. Reviewed-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-09-19 14:26:16 +02:00
Al Viro	f91c433a5c	path_umount(): constify struct path argument Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2025-09-15 21:26:44 -04:00
Al Viro	8ec7ee2e0b	path_mount(): constify struct path argument now it finally can be done. Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2025-09-15 21:26:44 -04:00
Al Viro	7b129f2e70	filename_lookup(): constify root argument Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2025-09-15 21:17:07 -04:00
Linus Torvalds	672dcda246	vfs-6.17-rc1.pidfs -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCaINCiQAKCRCRxhvAZXjc orltAQDq3y1anYETz5/FD6P2gXY1W5hXdSm3EHHeacQ1JjTXvgEA2g1lWO7J4anf oOVE8aSvMow/FOjivLZBYmI65pkYJAE= =oDKB -----END PGP SIGNATURE----- Merge tag 'vfs-6.17-rc1.pidfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull pidfs updates from Christian Brauner: - persistent info Persist exit and coredump information independent of whether anyone currently holds a pidfd for the struct pid. The current scheme allocated pidfs dentries on-demand repeatedly. This scheme is reaching it's limits as it makes it impossible to pin information that needs to be available after the task has exited or coredumped and that should not be lost simply because the pidfd got closed temporarily. The next opener should still see the stashed information. This is also a prerequisite for supporting extended attributes on pidfds to allow attaching meta information to them. If someone opens a pidfd for a struct pid a pidfs dentry is allocated and stashed in pid->stashed. Once the last pidfd for the struct pid is closed the pidfs dentry is released and removed from pid->stashed. So if 10 callers create a pidfs dentry for the same struct pid sequentially, i.e., each closing the pidfd before the other creates a new one then a new pidfs dentry is allocated every time. Because multiple tasks acquiring and releasing a pidfd for the same struct pid can race with each another a task may still find a valid pidfs entry from the previous task in pid->stashed and reuse it. Or it might find a dead dentry in there and fail to reuse it and so stashes a new pidfs dentry. Multiple tasks may race to stash a new pidfs dentry but only one will succeed, the other ones will put their dentry. The current scheme aims to ensure that a pidfs dentry for a struct pid can only be created if the task is still alive or if a pidfs dentry already existed before the task was reaped and so exit information has been was stashed in the pidfs inode. That's great except that it's buggy. If a pidfs dentry is stashed in pid->stashed after pidfs_exit() but before __unhash_process() is called we will return a pidfd for a reaped task without exit information being available. The pidfds_pid_valid() check does not guard against this race as it doens't sync at all with pidfs_exit(). The pid_has_task() check might be successful simply because we're before __unhash_process() but after pidfs_exit(). Introduce a new scheme where the lifetime of information associated with a pidfs entry (coredump and exit information) isn't bound to the lifetime of the pidfs inode but the struct pid itself. The first time a pidfs dentry is allocated for a struct pid a struct pidfs_attr will be allocated which will be used to store exit and coredump information. If all pidfs for the pidfs dentry are closed the dentry and inode can be cleaned up but the struct pidfs_attr will stick until the struct pid itself is freed. This will ensure minimal memory usage while persisting relevant information. The new scheme has various advantages. First, it allows to close the race where we end up handing out a pidfd for a reaped task for which no exit information is available. Second, it minimizes memory usage. Third, it allows to remove complex lifetime tracking via dentries when registering a struct pid with pidfs. There's no need to get or put a reference. Instead, the lifetime of exit and coredump information associated with a struct pid is bound to the lifetime of struct pid itself. - extended attributes Now that we have a way to persist information for pidfs dentries we can start supporting extended attributes on pidfds. This will allow userspace to attach meta information to tasks. One natural extension would be to introduce a custom pidfs.* extended attribute space and allow for the inheritance of extended attributes across fork() and exec(). The first simple scheme will allow privileged userspace to set trusted extended attributes on pidfs inodes. - Allow autonomous pidfs file handles Various filesystems such as pidfs and drm support opening file handles without having to require a file descriptor to identify the filesystem. The filesystem are global single instances and can be trivially identified solely on the information encoded in the file handle. This makes it possible to not have to keep or acquire a sentinal file descriptor just to pass it to open_by_handle_at() to identify the filesystem. That's especially useful when such sentinel file descriptor cannot or should not be acquired. For pidfs this means a file handle can function as full replacement for storing a pid in a file. Instead a file handle can be stored and reopened purely based on the file handle. Such autonomous file handles can be opened with or without specifying a a file descriptor. If no proper file descriptor is used the FD_PIDFS_ROOT sentinel must be passed. This allows us to define further special negative fd sentinels in the future. Userspace can trivially test for support by trying to open the file handle with an invalid file descriptor. - Allow pidfds for reaped tasks with SCM_PIDFD messages This is a logical continuation of the earlier work to create pidfds for reaped tasks through the SO_PEERPIDFD socket option merged in `923ea4d448` ("Merge patch series "net, pidfs: enable handing out pidfds for reaped sk->sk_peer_pid""). - Two minor fixes: * Fold fs_struct->{lock,seq} into a seqlock * Don't bother with path_{get,put}() in unix_open_file() * tag 'vfs-6.17-rc1.pidfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (37 commits) don't bother with path_get()/path_put() in unix_open_file() fold fs_struct->{lock,seq} into a seqlock selftests: net: extend SCM_PIDFD test to cover stale pidfds af_unix: enable handing out pidfds for reaped tasks in SCM_PIDFD af_unix: stash pidfs dentry when needed af_unix/scm: fix whitespace errors af_unix: introduce and use scm_replace_pid() helper af_unix: introduce unix_skb_to_scm helper af_unix: rework unix_maybe_add_creds() to allow sleep selftests/pidfd: decode pidfd file handles withou having to specify an fd fhandle, pidfs: support open_by_handle_at() purely based on file handle uapi/fcntl: add FD_PIDFS_ROOT uapi/fcntl: add FD_INVALID fcntl/pidfd: redefine PIDFD_SELF_THREAD_GROUP uapi/fcntl: mark range as reserved fhandle: reflow get_path_anchor() pidfs: add pidfs_root_path() helper fhandle: rename to get_path_anchor() fhandle: hoist copy_from_user() above get_path_from_fd() fhandle: raise FILEID_IS_DIR in handle_type ...	2025-07-28 14:10:15 -07:00
Amir Goldstein	4e301d858a	fs: constify file ptr in backing_file accessor helpers Add internal helper backing_file_set_user_path() for the only two cases that need to modify backing_file fields. Signed-off-by: Amir Goldstein <amir73il@gmail.com> Link: https://lore.kernel.org/20250607115304.2521155-2-amir73il@gmail.com Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-07-18 11:09:29 +02:00
Christian Brauner	a0d8051cfd	pidfs: add pidfs_root_path() helper Allow to return the root of the global pidfs filesystem. Link: https://lore.kernel.org/20250624-work-pidfs-fhandle-v2-4-d02a04858fe3@kernel.org Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-06-24 13:00:09 +02:00
Christian Brauner	bda3f1608d	libfs: massage path_from_stashed() to allow custom stashing behavior * Add a callback to struct stashed_operations so it's possible to implement custom behavior for pidfs and allow for it to return errors. * Teach stashed_dentry_get() to handle error pointers. Link: https://lore.kernel.org/20250618-work-pidfs-persistent-v2-2-98f3456fd552@kernel.org Reviewed-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com> Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-06-19 14:28:24 +02:00
Linus Torvalds	8dd53535f1	vfs-6.16-rc1.super -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCaDBPTwAKCRCRxhvAZXjc oi3BAQD/IBxTbAZIe7vEAsuLlBoKbWrzPGvxzd4UeMGo6OY18wEAvvyJM+arQy51 jS0ZErDOJnPNe7jps+Gh+WDx6d3NMAY= =lqAG -----END PGP SIGNATURE----- Merge tag 'vfs-6.16-rc1.super' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs freezing updates from Christian Brauner: "This contains various filesystem freezing related work for this cycle: - Allow the power subsystem to support filesystem freeze for suspend and hibernate. Now all the pieces are in place to actually allow the power subsystem to freeze/thaw filesystems during suspend/resume. Filesystems are only frozen and thawed if the power subsystem does actually own the freeze. If the filesystem is already frozen by the time we've frozen all userspace processes we don't care to freeze it again. That's userspace's job once the process resumes. We only actually freeze filesystems if we absolutely have to and we ignore other failures to freeze. We could bubble up errors and fail suspend/resume if the error isn't EBUSY (aka it's already frozen) but I don't think that this is worth it. Filesystem freezing during suspend/resume is best-effort. If the user has 500 ext4 filesystems mounted and 4 fail to freeze for whatever reason then we simply skip them. What we have now is already a big improvement and let's see how we fare with it before making our lives even harder (and uglier) than we have to. - Allow efivars to support freeze and thaw Allow efivarfs to partake to resync variable state during system hibernation and suspend. Add freeze/thaw support. This is a pretty straightforward implementation. We simply add regular freeze/thaw support for both userspace and the kernel. efivars is the first pseudofilesystem that adds support for filesystem freezing and thawing. The simplicity comes from the fact that we simply always resync variable state after efivarfs has been frozen. It doesn't matter whether that's because of suspend, userspace initiated freeze or hibernation. Efivars is simple enough that it doesn't matter that we walk all dentries. There are no directories and there aren't insane amounts of entries and both freeze/thaw are already heavy-handed operations. If userspace initiated a freeze/thaw cycle they would need CAP_SYS_ADMIN in the initial user namespace (as that's where efivarfs is mounted) so it can't be triggered by random userspace. IOW, we really really don't care" * tag 'vfs-6.16-rc1.super' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: f2fs: fix freezing filesystem during resize kernfs: add warning about implementing freeze/thaw efivarfs: support freeze/thaw power: freeze filesystems during suspend/resume libfs: export find_next_child() super: add filesystem freezing helpers for suspend and hibernate gfs2: pass through holder from the VFS for freeze/thaw super: use common iterator (Part 2) super: use a common iterator (Part 1) super: skip dying superblocks early super: simplify user_get_super() super: remove pointless s_root checks fs: allow all writers to be frozen locking/percpu-rwsem: add freezable alternative to down_read	2025-05-26 09:33:44 -07:00
Linus Torvalds	181d8e399f	vfs-6.16-rc1.misc -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCaDBPTwAKCRCRxhvAZXjc om0+AQDMxKLweJXplqQQ7jxuvW2dEa60YpE2EalEKWGg9YA3KgEA3nI4kyKMKn7Y PRFXgIcKvhs62oJLKsq8SGQUqExqvAE= =atEw -----END PGP SIGNATURE----- Merge tag 'vfs-6.16-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull misc vfs updates from Christian Brauner: "This contains the usual selections of misc updates for this cycle. Features: - Use folios for symlinks in the page cache FUSE already uses folios for its symlinks. Mirror that conversion in the generic code and the NFS code. That lets us get rid of a few folio->page->folio conversions in this path, and some of the few remaining users of read_cache_page() / read_mapping_page() - Try and make a few filesystem operations killable on the VFS inode->i_mutex level - Add sysctl vfs_cache_pressure_denom for bulk file operations Some workloads need to preserve more dentries than we currently allow through out sysctl interface A HDFS servers with 12 HDDs per server, on a HDFS datanode startup involves scanning all files and caching their metadata (including dentries and inodes) in memory. Each HDD contains approximately 2 million files, resulting in a total of ~20 million cached dentries after initialization To minimize dentry reclamation, they set vfs_cache_pressure to 1. Despite this configuration, memory pressure conditions can still trigger reclamation of up to 50% of cached dentries, reducing the cache from 20 million to approximately 10 million entries. During the subsequent cache rebuild period, any HDFS datanode restart operation incurs substantial latency penalties until full cache recovery completes To maintain service stability, more dentries need to be preserved during memory reclamation. The current minimum reclaim ratio (1/100 of total dentries) remains too aggressive for such workload. This patch introduces vfs_cache_pressure_denom for more granular cache pressure control The configuration [vfs_cache_pressure=1, vfs_cache_pressure_denom=10000] effectively maintains the full 20 million dentry cache under memory pressure, preventing datanode restart performance degradation - Avoid some jumps in inode_permission() using likely()/unlikely() - Avid a memory access which is most likely a cache miss when descending into devcgroup_inode_permission() - Add fastpath predicts for stat() and fdput() - Anonymous inodes currently don't come with a proper mode causing issues in the kernel when we want to add useful VFS debug assert. Fix that by giving them a proper mode and masking it off when we report it to userspace which relies on them not having any mode - Anonymous inodes currently allow to change inode attributes because the VFS falls back to simple_setattr() if i_op->setattr isn't implemented. This means the ownership and mode for every single user of anon_inode_inode can be changed. Block that as it's either useless or actively harmful. If specific ownership is needed the respective subsystem should allocate anonymous inodes from their own private superblock - Raise SB_I_NODEV and SB_I_NOEXEC on the anonymous inode superblock - Add proper tests for anonymous inode behavior - Make it easy to detect proper anonymous inodes and to ensure that we can detect them in codepaths such as readahead() Cleanups: - Port pidfs to the new anon_inode_{g,s}etattr() helpers - Try to remove the uselib() system call - Add unlikely branch hint return path for poll - Add unlikely branch hint on return path for core_sys_select - Don't allow signals to interrupt getdents copying for fuse - Provide a size hint to dir_context for during readdir() - Use writeback_iter directly in mpage_writepages - Update compression and mtime descriptions in initramfs documentation - Update main netfs API document - Remove useless plus one in super_cache_scan() - Remove unnecessary NULL-check guards during setns() - Add separate separate {get,put}_cgroup_ns no-op cases Fixes: - Fix typo in root= kernel parameter description - Use KERN_INFO for infof()\|info_plog()\|infofc() - Correct comments of fs_validate_description() - Mark an unlikely if condition with unlikely() in vfs_parse_monolithic_sep() - Delete macro fsparam_u32hex() - Remove unused and problematic validate_constant_table() - Fix potential unsigned integer underflow in fs_name() - Make file-nr output the total allocated file handles" * tag 'vfs-6.16-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (43 commits) fs: Pass a folio to page_put_link() nfs: Use a folio in nfs_get_link() fs: Convert __page_get_link() to use a folio fs/read_write: make default_llseek() killable fs/open: make do_truncate() killable fs/open: make chmod_common() and chown_common() killable include/linux/fs.h: add inode_lock_killable() readdir: supply dir_context.count as readdir buffer size hint vfs: Add sysctl vfs_cache_pressure_denom for bulk file operations fuse: don't allow signals to interrupt getdents copying Documentation: fix typo in root= kernel parameter description include/cgroup: separate {get,put}_cgroup_ns no-op case kernel/nsproxy: remove unnecessary guards fs: use writeback_iter directly in mpage_writepages fs: remove useless plus one in super_cache_scan() fs: add S_ANON_INODE fs: remove uselib() system call device_cgroup: avoid access to ->i_rdev in the common case in devcgroup_inode_permission() fs/fs_parse: Remove unused and problematic validate_constant_table() fs: touch up predicts in inode_permission() ...	2025-05-26 09:02:39 -07:00
Christian Brauner	33445d6fc5	libfs: export find_next_child() Export find_next_child() so it can be used by efivarfs. Keep it internal for now. There's no reason to advertise this kernel-wide. Link: https://lore.kernel.org/r/20250331-work-freeze-v1-1-6dfbe8253b9f@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-05-09 12:41:23 +02:00
NeilBrown	06c567403a	Use try_lookup_noperm() instead of d_hash_and_lookup() outside of VFS try_lookup_noperm() and d_hash_and_lookup() are nearly identical. The former does some validation of the name where the latter doesn't. Outside of the VFS that validation is likely valuable, and having only one exported function for this task is certainly a good idea. So make d_hash_and_lookup() local to VFS files and change all other callers to try_lookup_noperm(). Note that the arguments are swapped. Signed-off-by: NeilBrown <neilb@suse.de> Link: https://lore.kernel.org/r/20250319031545.2999807-6-neil@brown.name Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-04-08 11:24:41 +02:00
Christian Brauner	22bdf3d658	anon_inode: explicitly block ->setattr() It is currently possible to change the mode and owner of the single anonymous inode in the kernel: int main(int argc, char *argv[]) { int ret, sfd; sigset_t mask; struct signalfd_siginfo fdsi; sigemptyset(&mask); sigaddset(&mask, SIGINT); sigaddset(&mask, SIGQUIT); ret = sigprocmask(SIG_BLOCK, &mask, NULL); if (ret < 0) _exit(1); sfd = signalfd(-1, &mask, 0); if (sfd < 0) _exit(2); ret = fchown(sfd, 5555, 5555); if (ret < 0) _exit(3); ret = fchmod(sfd, 0777); if (ret < 0) _exit(3); _exit(4); } This is a bug. It's not really a meaningful one because anonymous inodes don't really figure into path lookup and they cannot be reopened via /proc/<pid>/fd/<nr> and can't be used for lookup itself. So they can only ever serve as direct references. But it is still completely bogus to allow the mode and ownership or any of the properties of the anonymous inode to be changed. Block this! Link: https://lore.kernel.org/20250407-work-anon_inode-v1-3-53a44c20d44e@kernel.org Reviewed-by: Jeff Layton <jlayton@kernel.org> Cc: stable@vger.kernel.org # all LTS kernels Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-04-07 16:18:59 +02:00
Christian Brauner	cfd86ef7e8	anon_inode: use a proper mode internally This allows the VFS to not trip over anonymous inodes and we can add asserts based on the mode into the vfs. When we report it to userspace we can simply hide the mode to avoid regressions. I've audited all direct callers of alloc_anon_inode() and only secretmen overrides i_mode and i_op inode operations but it already uses a regular file. Link: https://lore.kernel.org/20250407-work-anon_inode-v1-1-53a44c20d44e@kernel.org Fixes: `af153bb63a` ("vfs: catch invalid modes in may_open()") Reviewed-by: Jeff Layton <jlayton@kernel.org> Cc: stable@vger.kernel.org # all LTS kernels Reported-by: syzbot+5d8e79d323a13aa0b248@syzkaller.appspotmail.com Closes: https://lore.kernel.org/all/67ed3fb3.050a0220.14623d.0009.GAE@google.com Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-04-07 16:18:46 +02:00
Linus Torvalds	912b82dc0b	vfs-6.15-rc1.file -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZ90sLAAKCRCRxhvAZXjc ooi3AQDZhUV94xStEOzoV/R96mbUmsJDHKCnVTtqLCKcu3/IvwEAgUTDPptXgH/K JAUM7jCwhVkngvL3YRL1yEsY90raagg= =cjIy -----END PGP SIGNATURE----- Merge tag 'vfs-6.15-rc1.file' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs file handling updates from Christian Brauner: "This contains performance improvements for struct file's new refcount mechanism and various other performance work: - The stock kernel transitioning the file to no refs held penalizes the caller with an extra atomic to block any increments. For cases where the file is highly likely to be going away this is easily avoidable. Add file_ref_put_close() to better handle the common case where closing a file descriptor also operates on the last reference and build fput_close_sync() and fput_close() on top of it. This brings about 1% performance improvement by eliding one atomic in the common case. - Predict no error in close() since the vast majority of the time system call returns 0. - Reduce the work done in fdget_pos() by predicting that the file was found and by explicitly comparing the reference count to one and ignoring the dead zone" * tag 'vfs-6.15-rc1.file' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: fs: reduce work in fdget_pos() fs: use fput_close() in path_openat() fs: use fput_close() in filp_close() fs: use fput_close_sync() in close() file: add fput and file_ref_put routines optimized for use when closing a fd fs: predict no error in close()	2025-03-24 13:19:17 -07:00
Linus Torvalds	df00ded23a	vfs-6.15-rc1.pidfs -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZ90pqgAKCRCRxhvAZXjc oqVsAP9Aq/fMCI14HeXehPCezKQZPu1HTrPPo2clLHXoSnafawEAsA3YfWTT4Heb iexzqvAEUOMYOVN66QEc+6AAwtMLrwc= =0eYo -----END PGP SIGNATURE----- Merge tag 'vfs-6.15-rc1.pidfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs pidfs updates from Christian Brauner: - Allow retrieving exit information after a process has been reaped through pidfds via the new PIDFD_INTO_EXIT extension for the PIDFD_GET_INFO ioctl. Various tools need access to information about a process/task even after it has already been reaped. Pidfd polling allows waiting on either task exit or for a task to have been reaped. The contract for PIDFD_INFO_EXIT is simply that EPOLLHUP must be observed before exit information can be retrieved, i.e., exit information is only provided once the task has been reaped and then can be retrieved as long as the pidfd is open. - Add PIDFD_SELF_{THREAD,THREAD_GROUP} sentinels allowing userspace to forgo allocating a file descriptor for their own process. This is useful in scenarios where users want to act on their own process through pidfds and is akin to AT_FDCWD. - Improve premature thread-group leader and subthread exec behavior when polling on pidfds: (1) During a multi-threaded exec by a subthread, i.e., non-thread-group leader thread, all other threads in the thread-group including the thread-group leader are killed and the struct pid of the thread-group leader will be taken over by the subthread that called exec. IOW, two tasks change their TIDs. (2) A premature thread-group leader exit means that the thread-group leader exited before all of the other subthreads in the thread-group have exited. Both cases lead to inconsistencies for pidfd polling with PIDFD_THREAD. Any caller that holds a PIDFD_THREAD pidfd to the current thread-group leader may or may not see an exit notification on the file descriptor depending on when poll is performed. If the poll is performed before the exec of the subthread has concluded an exit notification is generated for the old thread-group leader. If the poll is performed after the exec of the subthread has concluded no exit notification is generated for the old thread-group leader. The correct behavior is to simply not generate an exit notification on the struct pid of a subhthread exec because the struct pid is taken over by the subthread and thus remains alive. But this is difficult to handle because a thread-group may exit premature as mentioned in (2). In that case an exit notification is reliably generated but the subthreads may continue to run for an indeterminate amount of time and thus also may exec at some point. After this pull no exit notifications will be generated for a PIDFD_THREAD pidfd for a thread-group leader until all subthreads have been reaped. If a subthread should exec before no exit notification will be generated until that task exits or it creates subthreads and repeates the cycle. This means an exit notification indicates the ability for the father to reap the child. * tag 'vfs-6.15-rc1.pidfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (25 commits) selftests/pidfd: third test for multi-threaded exec polling selftests/pidfd: second test for multi-threaded exec polling selftests/pidfd: first test for multi-threaded exec polling pidfs: improve multi-threaded exec and premature thread-group leader exit polling pidfs: ensure that PIDFS_INFO_EXIT is available selftests/pidfd: add seventh PIDFD_INFO_EXIT selftest selftests/pidfd: add sixth PIDFD_INFO_EXIT selftest selftests/pidfd: add fifth PIDFD_INFO_EXIT selftest selftests/pidfd: add fourth PIDFD_INFO_EXIT selftest selftests/pidfd: add third PIDFD_INFO_EXIT selftest selftests/pidfd: add second PIDFD_INFO_EXIT selftest selftests/pidfd: add first PIDFD_INFO_EXIT selftest selftests/pidfd: expand common pidfd header pidfs/selftests: ensure correct headers for ioctl handling selftests/pidfd: fix header inclusion pidfs: allow to retrieve exit information pidfs: record exit code and cgroupid at exit pidfs: use private inode slab cache pidfs: move setting flags into pidfs_alloc_file() pidfd: rely on automatic cleanup in __pidfd_prepare() ...	2025-03-24 10:16:37 -07:00
Linus Torvalds	fd101da676	vfs-6.15-rc1.mount -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZ90qAwAKCRCRxhvAZXjc on7lAP0akpIsJMWREg9tLwTNTySI1b82uKec0EAgM6T7n/PYhAD/T4zoY8UYU0Pr qCxwTXHUVT6bkNhjREBkfqq9OkPP8w8= =GxeN -----END PGP SIGNATURE----- Merge tag 'vfs-6.15-rc1.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs mount updates from Christian Brauner: - Mount notifications The day has come where we finally provide a new api to listen for mount topology changes outside of /proc/<pid>/mountinfo. A mount namespace file descriptor can be supplied and registered with fanotify to listen for mount topology changes. Currently notifications for mount, umount and moving mounts are generated. The generated notification record contains the unique mount id of the mount. The listmount() and statmount() api can be used to query detailed information about the mount using the received unique mount id. This allows userspace to figure out exactly how the mount topology changed without having to generating diffs of /proc/<pid>/mountinfo in userspace. - Support O_PATH file descriptors with FSCONFIG_SET_FD in the new mount api - Support detached mounts in overlayfs Since last cycle we support specifying overlayfs layers via file descriptors. However, we don't allow detached mounts which means userspace cannot user file descriptors received via open_tree(OPEN_TREE_CLONE) and fsmount() directly. They have to attach them to a mount namespace via move_mount() first. This is cumbersome and means they have to undo mounts via umount(). Allow them to directly use detached mounts. - Allow to retrieve idmappings with statmount Currently it isn't possible to figure out what idmapping has been attached to an idmapped mount. Add an extension to statmount() which allows to read the idmapping from the mount. - Allow creating idmapped mounts from mounts that are already idmapped So far it isn't possible to allow the creation of idmapped mounts from already idmapped mounts as this has significant lifetime implications. Make the creation of idmapped mounts atomic by allow to pass struct mount_attr together with the open_tree_attr() system call allowing to solve these issues without complicating VFS lookup in any way. The system call has in general the benefit that creating a detached mount and applying mount attributes to it becomes an atomic operation for userspace. - Add a way to query statmount() for supported options Allow userspace to query which mount information can be retrieved through statmount(). - Allow superblock owners to force unmount * tag 'vfs-6.15-rc1.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (21 commits) umount: Allow superblock owners to force umount selftests: add tests for mount notification selinux: add FILE__WATCH_MOUNTNS samples/vfs: fix printf format string for size_t fs: allow changing idmappings fs: add kflags member to struct mount_kattr fs: add open_tree_attr() fs: add copy_mount_setattr() helper fs: add vfs_open_tree() helper statmount: add a new supported_mask field samples/vfs: add STATMOUNT_MNT_{G,U}IDMAP selftests: add tests for using detached mount with overlayfs samples/vfs: check whether flag was raised statmount: allow to retrieve idmappings uidgid: add map_id_range_up() fs: allow detached mounts in clone_private_mount() selftests/overlayfs: test specifying layers as O_PATH file descriptors fs: support O_PATH fds with FSCONFIG_SET_FD vfs: add notifications for mount attach and detach fanotify: notify on mount attach and detach ...	2025-03-24 09:34:10 -07:00
Jan Kara	93fd0d46cb	vfs: Remove invalidate_inodes() The function can be replaced by evict_inodes. The only difference is that evict_inodes() skips the inodes with positive refcount without touching ->i_lock, but they are equivalent as evict_inodes() repeats the refcount check after having grabbed ->i_lock. Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20250307144318.28120-2-jack@suse.cz Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-03-08 12:19:22 +01:00
Mateusz Guzik	e83588458f	file: add fput and file_ref_put routines optimized for use when closing a fd Vast majority of the time closing a file descriptor also operates on the last reference, where a regular fput usage will result in 2 atomics. This can be changed to only suffer 1. See commentary above file_ref_put_close() for more information. Signed-off-by: Mateusz Guzik <mjguzik@gmail.com> Link: https://lore.kernel.org/r/20250305123644.554845-2-mjguzik@gmail.com Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-03-05 18:30:00 +01:00
Christian Brauner	4513522984	pidfs: record exit code and cgroupid at exit Record the exit code and cgroupid in release_task() and stash in struct pidfs_exit_info so it can be retrieved even after the task has been reaped. Link: https://lore.kernel.org/r/20250305-work-pidfs-kill_on_last_close-v3-5-c8c3d8361705@kernel.org Reviewed-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-03-05 13:26:12 +01:00
Christian Brauner	da06e3c517	fs: don't needlessly acquire f_lock Before 2011 there was no meaningful synchronization between read/readdir/write/seek. Only in commit `ef3d0fd27e` ("vfs: do (nearly) lockless generic_file_llseek") synchronization was added for SEEK_CUR by taking f_lock around vfs_setpos(). Then in 2014 full synchronization between read/readdir/write/seek was added in commit `9c225f2655` ("vfs: atomic f_pos accesses as per POSIX") by introducing f_pos_lock for regular files with FMODE_ATOMIC_POS and for directories. At that point taking f_lock became unnecessary for such files. So only acquire f_lock for SEEK_CUR if this isn't a file that would have acquired f_pos_lock if necessary. Link: https://lore.kernel.org/r/20250207-daten-mahlzeit-99d2079864fb@brauner Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-02-21 10:25:32 +01:00
Yuichiro Tsuji	29d80d506b	open: Fix return type of several functions from long to int Fix the return type of several functions from long to int to match its actu al behavior. These functions only return int values. This change improves type consistency across the filesystem code and aligns the function signatu re with its existing implementation and usage. Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Yuichiro Tsuji <yuichtsu@amazon.com> Link: https://lore.kernel.org/r/20250121070844.4413-2-yuichtsu@amazon.com Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-02-21 10:25:32 +01:00
Christian Brauner	37c4a9590e	statmount: allow to retrieve idmappings This adds the STATMOUNT_MNT_UIDMAP and STATMOUNT_MNT_GIDMAP options. It allows the retrieval of idmappings via statmount(). Currently it isn't possible to figure out what idmappings are applied to an idmapped mount. This information is often crucial. Before statmount() the only realistic options for an interface like this would have been to add it to /proc/<pid>/fdinfo/<nr> or to expose it in /proc/<pid>/mountinfo. Both solution would have been pretty ugly and would've shown information that is of strong interest to some application but not all. statmount() is perfect for this. The idmappings applied to an idmapped mount are shown relative to the caller's user namespace. This is the most useful solution that doesn't risk leaking information or confuse the caller. For example, an idmapped mount might have been created with the following idmappings: mount --bind -o X-mount.idmap="0:10000:1000 2000:2000:1 3000:3000:1" /srv /opt Listing the idmappings through statmount() in the same context shows: mnt_id: 2147485088 mnt_parent_id: 2147484816 fs_type: btrfs mnt_root: /srv mnt_point: /opt mnt_opts: ssd,discard=async,space_cache=v2,subvolid=5,subvol=/ mnt_uidmap[0]: 0 10000 1000 mnt_uidmap[1]: 2000 2000 1 mnt_uidmap[2]: 3000 3000 1 mnt_gidmap[0]: 0 10000 1000 mnt_gidmap[1]: 2000 2000 1 mnt_gidmap[2]: 3000 3000 1 But the idmappings might not always be resolvable in the caller's user namespace. For example: unshare --user --map-root In this case statmount() will skip any mappings that fil to resolve in the caller's idmapping: mnt_id: 2147485087 mnt_parent_id: 2147484016 fs_type: btrfs mnt_root: /srv mnt_point: /opt mnt_opts: ssd,discard=async,space_cache=v2,subvolid=5,subvol=/ The caller can differentiate between a mount not being idmapped and a mount that is idmapped but where all mappings fail to resolve in the caller's idmapping by check for the STATMOUNT_MNT_{G,U}IDMAP flag being raised but the number of mappings in ->mnt_{g,u}idmap_num being zero. Note that statmount() requires that the whole range must be resolvable in the caller's user namespace. If a subrange fails to map it will still list the map as not resolvable. This is a practical compromise to avoid having to find which subranges are resovable and wich aren't. Idmappings are listed as a string array with each mapping separated by zero bytes. This allows to retrieve the idmappings and immediately use them for writing to e.g., /proc/<pid>/{g,u}id_map and it also allow for simple iteration like: if (stmnt->mask & STATMOUNT_MNT_UIDMAP) { const char *idmap = stmnt->str + stmnt->mnt_uidmap; for (size_t idx = 0; idx < stmnt->mnt_uidmap_nr; idx++) { printf("mnt_uidmap[%lu]: %s\n", idx, idmap); idmap += strlen(idmap) + 1; } } Link: https://lore.kernel.org/r/20250204-work-mnt_idmap-statmount-v2-2-007720f39f2e@kernel.org Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-02-12 12:12:27 +01:00
Linus Torvalds	c6d64479d6	sanitize struct filename and lookup flags handling in statx and friends Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> -----BEGIN PGP SIGNATURE----- iHUEABYIAB0WIQQqUNBr3gm4hGXdBJlZ7Krx/gZQ6wUCZzdpZQAKCRBZ7Krx/gZQ 6whMAQDhlGFV+nGRetwe4t60mVRpxIoc71GLC7b6V8FmyfTI5AEAkAigkJ8KCZDP mfGsN/3PtzoxnIkIqdk7Y7q4/fowyAw= =4DWZ -----END PGP SIGNATURE----- Merge tag 'pull-statx' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull statx updates from Al Viro: "Sanitize struct filename and lookup flags handling in statx and friends" * tag 'pull-statx' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: libfs: kill empty_dir_getattr() fs: Simplify getattr interface function checking AT_GETATTR_NOSEC flag fs/stat.c: switch to CLASS(fd_raw) kill getname_statx_lookup_flags() io_statx_prep(): use getname_uflags()	2024-11-18 14:54:10 -08:00

1 2 3 4 5 ...

332 Commits