linux

mirror of https://github.com/torvalds/linux.git synced 2026-07-27 09:36:22 +02:00

Author	SHA1	Message	Date
Linus Torvalds	4235cb24ec	vfs-7.2-rc5.fixes Please consider pulling these changes from the signed vfs-7.2-rc5.fixes tag. Thanks! Christian -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCamYljAAKCRCRxhvAZXjc ogWyAPwORZDIIsRiAQbnPs+OkYszuWCY29OeUrTf+m3z+tBTLwD+NUFtpH5XIcYJ 3jHtGXPoHjEaOVsNyIdxwnOxWfo+6ws= =VNq0 -----END PGP SIGNATURE----- Merge tag 'vfs-7.2-rc5.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs fixes from Christian Brauner: - vfs: Preserve the ACL_DONT_CACHE state in forget_cached_acl(). ACL_DONT_CACHE is meant to be a permanent opt-out from ACL caching which FUSE relies on for servers that don't negotiate FUSE_POSIX_ACL. The helper replaced it with ACL_NOT_CACHED, silently re-enabling the cache, and as fuse doesn't invalidate the cache for such servers a properly timed get_acl() returned stale ACLs. Comes with a fuse selftest reproducing this. - pidfs: - Preserve PIDFD_THREAD when a thread pidfd is reopened via open_by_handle_at(). PIDFD_THREAD shares the O_EXCL bit which do_dentry_open() strips after the flags have been validated, so the reopened pidfd silently became a process pidfd. Comes with a selftest. - Add a pidfs_dentry_open() helper so the regular pidfd allocation path and the file handle path share the code that forces O_RDWR and reapplies the pidfd flags that do_dentry_open() strips. - Handle FS_IOC32_GETVERSION in the compat ioctl path. - Make pidfs_ino_lock static. - iomap: - Fix the block range calculation in ifs_clear_range_dirty() so a partial clear doesn't drop the dirty state of blocks the range only partially covers. - Support invalidating partial folios so a partial truncate or hole punch with blocksize < foliosize doesn't leave stale dirty bits behind. - Only set did_zero when iomap_zero_iter() actually zeroed something. 
- Guard ifs_set_range_dirty() and ifs_set_range_uptodate() against zero-length ranges where the unsigned last-block calculation underflows and bitmap_set() writes far beyond the ifs->state allocation. - Don't merge ioends with different io_private values as the merge could leak or corrupt the private data of the individual ioends. - exec: - Raise bprm->have_execfd only once the binfmt_misc interpreter has actually been opened. The flag was set as soon as a matching 'O' or 'C' entry was found. If the interpreter open failed with ENOEXEC the exec fell through to the next binary format with have_execfd raised but no executable staged and begin_new_exec() NULL derefed past the point of no return. - Fix an unsigned loop counter wrap in transfer_args_to_stack() on nommu. An overlong argument or environment string pushes bprm->p below PAGE_SIZE, the stop index becomes zero, and the loop never terminates, wrapping its counter and copying garbage from in front of the page array into the new process stack. - Make binfmt_elf_fdpic only honour the first PT_INTERP like binfmt_elf does. Each additional PT_INTERP overwrote the previous interpreter, leaking the name allocation and the interpreter file reference together with the write denial open_exec() took, leaving the file unwritable for as long as the system runs. - overlayfs: - Compare the full escaped xattr prefix including the trailing dot. An xattr like "trusted.overlay.overlayfoo" was misclassified as an escaped overlay xattr. - Check read access to the copy_file_range() source with the source's mounter credentials. - super: Thawing a filesystem whose block device was frozen with bdev_freeze() deadlocked. Dropping the last block layer freeze reference from under s_umount ends up in fs_bdev_thaw() which reacquires s_umount on the same task. Pin the superblock with an active reference instead and call bdev_thaw() without holding s_umount. 
- procfs: Return EACCES instead of success when the ptrace access check for namespace links fails. - afs: Use afs_dir_get_block() rather than afs_dir_find_block() for block 0 in afs_edit_dir_remove(), matching afs_edit_dir_add(). - Push the memcg gating of ->nr_cached_objects() down into the btrfs and shmem callbacks instead of skipping every callback during non-root memcg reclaim. The blanket check short-circuited XFS whose inode reclaim hook is intentionally driven from per-memcg contexts to free memcg-charged slab. - eventpoll: Pin files while checking reverse paths. Since struct file became SLAB_TYPESAFE_BY_RCU a concurrent close could free and recycle the file under the check which then took and dropped the f_lock of whatever live file now occupies that slot. * tag 'vfs-7.2-rc5.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (24 commits) super: fix emergency thaw deadlock on frozen block devices pidfs: make pidfs_ino_lock static eventpoll: pin files while checking reverse paths fs: push nr_cached_objects memcg gating into individual filesystems afs: Fix afs_edit_dir_remove() to get, not find, block 0 iomap: prevent ioend merge when io_private differs iomap: add comments for ifs_clear/set_range_dirty() iomap: fix out-of-bounds bitmap_set() with zero-length range iomap: fix incorrect did_zero setting in iomap_zero_iter() iomap: support invalidating partial folios iomap: correct the range of a partial dirty clear fs/super: fix emergency thaw double-unlock of s_umount pidfs: handle FS_IOC32_GETVERSION in compat ioctl ovl: check access to copy_file_range source with src mounter creds proc: Fix broken error paths for namespace links pidfs: add pidfs_dentry_open() helper selftests/pidfd: check PIDFD_THREAD survives open_by_handle_at() pidfs: preserve thread pidfds reopened by file handle ovl: fix trusted xattr escape prefix matching selftests/fuse: add ACL_DONT_CACHE regression test ...	2026-07-26 12:22:57 -07:00
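The zero-length guard from the iomap items in the merge above can be illustrated in plain userspace C. This is a minimal sketch, not the kernel code: `struct blk_range`, `blk_range_unguarded()` and `blk_range_guarded()` are hypothetical names, and the first/last-block arithmetic is only assumed to resemble what the ifs helpers compute.

```c
#include <stddef.h>

/* Hypothetical stand-in for "which blocks of a folio does [off, off+len) touch". */
struct blk_range {
	size_t first;	/* index of the first block touched */
	size_t count;	/* number of blocks touched */
};

/*
 * Unguarded variant: with off == 0 and len == 0 the expression
 * off + len - 1 wraps to SIZE_MAX, so the computed count becomes huge.
 */
static struct blk_range blk_range_unguarded(size_t off, size_t len,
					    unsigned int blkbits)
{
	size_t first = off >> blkbits;
	size_t last = (off + len - 1) >> blkbits;
	struct blk_range r = { first, last - first + 1 };

	return r;
}

/* Guarded variant: an empty range touches no blocks, so bail out early. */
static struct blk_range blk_range_guarded(size_t off, size_t len,
					  unsigned int blkbits)
{
	if (len == 0) {
		struct blk_range empty = { off >> blkbits, 0 };

		return empty;
	}
	return blk_range_unguarded(off, len, blkbits);
}
```

Passing the unguarded count to a bitmap setter is what would write far past the allocation; the early return keeps the count at zero for empty ranges and leaves non-empty ranges unchanged.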
Christian Brauner	749d7aa037	super: fix emergency thaw deadlock on frozen block devices do_thaw_all_callback() calls bdev_thaw() while holding sb->s_umount exclusively. If the block device was frozen via bdev_freeze() dropping the last block layer freeze reference calls fs_bdev_thaw() which reacquires s_umount: do_thaw_all_callback(sb) super_lock_excl(sb) # holds sb->s_umount bdev_thaw(sb->s_bdev) mutex_lock(&bdev->bd_fsfreeze_mutex) # bd_fsfreeze_count drops 1 -> 0 bd_holder_ops->thaw == fs_bdev_thaw get_bdev_super(bdev) bdev_super_lock(bdev, true) super_lock(sb, true) down_write(&sb->s_umount) # same task: deadlock The emergency thaw worker deadlocks against itself holding both s_umount and bd_fsfreeze_mutex. That fscks any subsequent unmount, freeze, or thaw of that filesystem and block device. [ 81.878470] sysrq: Show Blocked State [ 81.880140] task:kworker/0:1 state:D stack:0 pid:11 tgid:11 ppid:2 task_flags:0x4208060 flags:0x00080000 [ 81.884876] Workqueue: events do_thaw_all [ 81.886656] Call Trace: [ 81.887759] <TASK> [ 81.888763] __schedule+0x579/0x1420 [ 81.890372] schedule+0x3a/0x100 [ 81.891794] schedule_preempt_disabled+0x15/0x30 [ 81.893848] rwsem_down_write_slowpath+0x1ea/0x900 [ 81.895191] ? __pfx_do_thaw_all_callback+0x10/0x10 [ 81.896528] down_write+0xbd/0xc0 [ 81.897505] super_lock+0x91/0x180 [ 81.898457] ? __mutex_lock+0xa99/0x1140 [ 81.900748] ? __mutex_unlock_slowpath+0x1f/0x400 [ 81.902069] bdev_super_lock+0x5b/0x150 [ 81.903132] get_bdev_super+0x10/0x60 [ 81.904042] fs_bdev_thaw+0x23/0xf0 [ 81.904755] bdev_thaw+0x82/0x100 [ 81.905484] do_thaw_all_callback+0x2c/0x50 [ 81.906298] __iterate_supers+0x5d/0x130 [ 81.907067] do_thaw_all+0x20/0x40 [ 81.907739] process_one_work+0x206/0x5e0 [ 81.908545] worker_thread+0x1e2/0x3c0 [ 81.909339] ? __pfx_worker_thread+0x10/0x10 [ 81.910171] kthread+0xf4/0x130 [ 81.910799] ? __pfx_kthread+0x10/0x10 [ 81.911528] ret_from_fork+0x2e2/0x3b0 [ 81.912259] ? 
__pfx_kthread+0x10/0x10 [ 81.913010] ret_from_fork_asm+0x1a/0x30 [ 81.913806] </TASK> bdev_super_lock() even documents the violated requirement with lockdep_assert_not_held(&sb->s_umount). Acquiring bd_fsfreeze_mutex under s_umount also inverts the bd_fsfreeze_mutex vs. s_umount ordering established by bdev_{freeze,thaw}() and can thus ABBA against a concurrent block-layer freeze even when the recursive path isn't hit. Fix this by not holding s_umount around the bdev_thaw() loop at all. Pin the superblock with an active reference instead as filesystems_freeze_callback() does. The active reference keeps the superblock from being shut down and so ->s_bdev stays valid without holding s_umount. The block-layer-held freeze is dropped by fs_bdev_thaw() with FREEZE_MAY_NEST | FREEZE_HOLDER_USERSPACE exactly as a regular unfreeze would and thaw_super_locked() handles filesystem-level freezes as before. The emergency thaw path has deadlocked like this in one form or another for a long long time but the current exclusively-held shape dates back to commit [1] where thaw_bdev() already ended in thaw_super() with s_umount held by do_thaw_all_callback(). Fixes: `08fdc8a013` ("buffer.c: call thaw_super during emergency thaw") [1] Cc: stable@vger.kernel.org Link: https://patch.msgid.link/20260723-work-super-emergency_thaw-v1-1-7c315c600245@kernel.org Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>	2026-07-26 17:08:52 +02:00
Linus Torvalds	8e371eff3f	eight ksmbd server fixes -----BEGIN PGP SIGNATURE----- iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAmpj6jgACgkQiiy9cAdy T1F7Agv/Q0LNuCs18A0dR8SlNLwbo6lE6RVL3A8TfFKqiqpyigPBf0Ax2VqR/uuH 6rfuwZYLMNOp9P+jgFYPug8+fx2Ya58oLlVtCY+JooASUbwEg7PG/R3oHuZe4tOS m6m8+DVO82JwGH88dkLXBvyrlkxCWZJtPWlS3s+os/2mGAX8M4hvIiCKGz3PAFmb Qul3AGlxlSgi7eKuZGMU2OslgRVmMVOrNUgW4LfkDP/c8JjwgdDAuhebJOV2sAKi qL8GlIk/ZsXbZ9qK2ogQ03dt3WxcikWcqbj4GX/d0191sPRx0XuRRXwhOetTUzR3 7PtcDSk6x94+ut4vpuheEaz4Mx8MiNV73pJzNqPQ5WLbA493PUJdEVPbQE2z+y5j lPFRRYqNtTDAMKv5ZFCUSKs7gO52/jIgdn+RSJ29tJ2Di4IzmguTGv3Cwmq/yL7s lhfoo2YM55AVkgeOfSsHXNGB89XQjolGb0S5j5fHCkmtvKt5zoWVTp8eoHgG7jih l7h1AR1X =7btg -----END PGP SIGNATURE----- Merge tag 'v7.2-rc4-smb3-server-fixes' of git://git.samba.org/ksmbd Pull smb server fixes from Steve French: "This contains eight ksmbd fixes covering POSIX ACL handling, SMB signing enforcement, DACL parsing and construction hardening, session lifetime handling, and validation of malformed transform and compressed SMB2 requests: - preserve inherited POSIX ACL mask when creating objects. - enforce the session signing requirement for plaintext SMB requests. - harden DACL/ACE processing against size overflows, incomplete ACE copies, and undersized SIDs. - defer teardown of a previous session until NTLM authentication succeeds. - reject undersized encryption-transform and decompressed SMB2 requests before they can reach normal SMB2 request processing" * tag 'v7.2-rc4-smb3-server-fixes' of git://git.samba.org/ksmbd: ksmbd: reject undersized decompressed SMB2 requests ksmbd: validate minimum PDU size for transform requests ksmbd: defer destroy_previous_session() until after NTLM authentication ksmbd: validate ACE size against SID sub-authorities ksmbd: restore DACL size on check_add_overflow() to avoid malformed ACL ksmbd: bound DACL dedup walk to copied ACEs ksmbd: enforce signing required by the session ksmbd: preserve VFS inherited POSIX ACL mask	2026-07-24 19:50:05 -07:00
Linus Torvalds	dad0a87d79	A bunch of assorted fixes with the majority being hardening against malformed input and invalid data scenarios that don't happen in real deployments but can be utilized to trigger use-after-free and similar issues, some error path leak fixups and two patches from Max to avoid a potential hang in __ceph_get_caps() and unintended nesting of current->journal_info while handling replies from the MDS. All marked for stable. -----BEGIN PGP SIGNATURE----- iQFHBAABCgAxFiEEydHwtzie9C7TfviiSn/eOAIR84sFAmpjx6wTHGlkcnlvbW92 QGdtYWlsLmNvbQAKCRBKf944AhHzixDmB/4/J+dqhhDKg69t0ADnAPLgxe3AoXwi 7GRd2Uk5/AU9o+1fexyfWt2u+3ZtAVGLVRZPF1ARFQQDGUoj99X06NlTyVAFRNMb p78Sun5iuiDJb72UOD3WzW9lmpjSeCVUbTuTadtmF4y34KIvZ7AYltjZBpycE2Yj lyiUr4CSkXMAa/wWDKg+8SAw2tBI61WdyJeyu8ESCUmm9Q2XLq1+2N/jhfifk0oC Dc7EHUZmg7YNbTr0DKyLXdAoZHE54zKC2qDrvfTPmpTsOhMI2v5LY6JtXzm8/l6c wAGgIj5hRePGDDPb1V5Mbz3rkY/VIGdKmkqEV4PZn+IHafZGp45lnS7V =q2wM -----END PGP SIGNATURE----- Merge tag 'ceph-for-7.2-rc5' of https://github.com/ceph/ceph-client Pull ceph fixes from Ilya Dryomov: "A bunch of assorted fixes with the majority being hardening against malformed input and invalid data scenarios that don't happen in real deployments but can be utilized to trigger use-after-free and similar issues, some error path leak fixups and two patches from Max to avoid a potential hang in __ceph_get_caps() and unintended nesting of current->journal_info while handling replies from the MDS. 
All marked for stable" * tag 'ceph-for-7.2-rc5' of https://github.com/ceph/ceph-client: ceph: avoid fs reclaim while using current->journal_info ceph: add owner/capability checks for CEPH_IOC_SET_LAYOUT* ceph: fix hanging __ceph_get_caps() with stale mds_wanted rbd: Reset positive result codes to zero in object map update path libceph: bound pg_{temp,upmap,upmap_items} length to CEPH_PG_MAX_SIZE libceph: refresh auth->authorizer_buf{,_len} after authorizer update ceph: fix refcount leak in ceph_readdir() libceph: guard missing CRUSH type name lookup libceph: remove debugfs files before client teardown libceph: bound get_version reply decode to front len ceph: fix writeback_count leak in write_folio_nounlock() libceph: fix two unsafe bare decodes in decode_lockers() ceph: fix pre-auth out-of-bounds read on snaptrace in ceph_handle_caps() libceph: Reject monmaps advertising zero monitors libceph: reject zero bucket types in crush_decode libceph: Fix multiplication overflow in decode_new_up_state_weight()	2026-07-24 13:22:41 -07:00
Linus Torvalds	981f4a2baa	fscrypt fixes for v7.2-rc5 A couple fixes for AI-detected bugs. -----BEGIN PGP SIGNATURE----- iIkEABYIADIWIQSacvsUNc7UX4ntmEPzXCl4vpKOKwUCamPDWBQcZWJpZ2dlcnNA a2VybmVsLm9yZwAKCRDzXCl4vpKOK4C8APjY2sD4jMa1oX05SiLL7iUulkTXivOb n14nFabjqF25AP9MZImv0oS5eQWt/QPLHfNlB6olRd9Q0GqVOYYdKQ/6DA== =pStx -----END PGP SIGNATURE----- Merge tag 'fscrypt-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/linux Pull fscrypt fixes from Eric Biggers: "A couple fixes for AI-detected bugs" * tag 'fscrypt-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/linux: fscrypt: Avoid dynamic allocation in fscrypt_get_devices() fscrypt: Add missing superblock check in find_or_insert_direct_key()	2026-07-24 13:12:43 -07:00
Mateusz Guzik	c1d04c1bce	pidfs: make pidfs_ino_lock static Fixes: `87caaeef79` ("pidfs: implement ino allocation without the pidmap lock") Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202607231547.ehCQxi0L-lkp@intel.com/ Signed-off-by: Mateusz Guzik <mjguzik@gmail.com> Link: https://patch.msgid.link/20260723160114.291515-1-mjguzik@gmail.com Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>	2026-07-24 10:30:58 +02:00
Linus Torvalds	48a5a7ab8d	six client fixes -----BEGIN PGP SIGNATURE----- iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAmpiexAACgkQiiy9cAdy T1GDPwv9HWIwZSpoOs1ZPneXk4Sj9S1HkkEE+xT4edgKl63O8Wly4DQexZzrWaYg QjCJ6gvYDjzL1NyGOfhtBxz695kGf2Cliq3IgUxWrSyB9S8o2qahJrvaF6kjckWB cu4+U650GxLXygvgx+YXDWrDp03GcP8Dk/iAwq5BdcdaMpPMv7FFSEtf0U4UWERG 1LgzBBVdBHibGIsrZGc6zBxBt20cLdJSc4X7UycSMvfudsc3PDnCvgQVorSROeIE Kgk9gUj4Yv6x+qRHCiFCrPRpNz8nqrEQ+jv3lOhsFEIrvflxX5IVl+NYjufuovgY ERTQ3GwLYmjX4WNg0774jdifUOWducqVVSrp9eQx8ueKgy4FakCm6/ko/jDzzGfn WZ5rCZs0GFtZ8pjlhxINOzkq1wjsZxueLHCQUxp1Ty0I5Sy8q0e8epv0J66iXAwa /doMyvw5COvWEo/ypY3LQ0HcmYwojc3VrGrcabTvgnBHygjFM4UQnqljZejGWy3R tWBjCPxe =ujbz -----END PGP SIGNATURE----- Merge tag 'v7.2-rc4-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6 Pull smb client fixes from Steve French: - Fix leak in cifs_close_deferred_file() - Fix resolving MacOS symlinks - Fix stale file size in readdir - Update git branches in MAINTAINERS file - Fix bounds check in cifs_filldir - Fix checks in parse_dfs_referrals() - Fix DFS referral checks for malformed packet * tag 'v7.2-rc4-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6: cifs: fix cifsFileInfo leak on kmalloc failure in deferred close drain paths cifs: prevent readdir from changing file size due to stale directory metadata smb: client: handle STATUS_STOPPED_ON_SYMLINK responses without a symlink target Add missing git branch info for cifs and ksmbd to MAINTAINERS file smb: client: bound dirent name against end of SMB response in cifs_filldir smb: client: validate DFS referral PathConsumed	2026-07-23 13:49:54 -07:00
Max Kellermann	5b602344a4	ceph: avoid fs reclaim while using current->journal_info handle_reply() stores a `ceph_mds_request` pointer in `current->journal_info` while filling the inode and dentry cache from an MDS reply. An allocation in this section can enter direct reclaim and prune dentries from another filesystem. If this dirties an ext4 inode, ext4 starts a JBD2 transaction. JBD2 interprets the Ceph request in `current->journal_info` as a journal handle and dereferences the request's `r_tid` as `h_transaction`, causing a kernel crash, e.g.: Unable to handle kernel paging request at virtual address 00000000077b4818 [...] Internal error: Oops: 0000000096000004 [#1] SMP Modules linked in: CPU: 6 UID: 0 PID: 2699135 Comm: kworker/6:3 Tainted: G W 6.18.38-i3 #1113 NONE [...] Workqueue: ceph-msgr ceph_con_workfn pstate: 80400009 (Nzcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--) pc : jbd2__journal_start+0x2c/0x208 lr : __ext4_journal_start_sb+0x100/0x178 [...] Call trace: jbd2__journal_start+0x2c/0x208 (P) __ext4_journal_start_sb+0x100/0x178 ext4_dirty_inode+0x3c/0x90 __mark_inode_dirty+0x58/0x400 iput.part.0+0x2b0/0x370 iput+0x18/0x30 dentry_unlink_inode+0xc0/0x158 __dentry_kill+0x80/0x250 shrink_dentry_list+0x90/0x130 prune_dcache_sb+0x60/0x98 super_cache_scan+0xe8/0x190 do_shrink_slab+0x174/0x388 shrink_slab+0xd8/0x4c0 shrink_node+0x31c/0x908 do_try_to_free_pages+0xd0/0x508 try_to_free_pages+0x11c/0x238 __alloc_frozen_pages_noprof+0x4d0/0xdd0 __folio_alloc_noprof+0x18/0x70 __filemap_get_folio+0x248/0x440 ceph_readdir_prepopulate+0x570/0x9e8 mds_dispatch+0x1424/0x1ba0 ceph_con_process_message+0x74/0xa0 ceph_con_v1_try_read+0x3a0/0x1510 ceph_con_workfn+0x260/0x460 Enter a scoped NOFS allocation context and leave it after clearing `journal_info`. This prevents filesystem reclaim from recursing into another filesystem while the field contains Ceph-private data. 
Cc: stable@vger.kernel.org Fixes: `315f240880` ("ceph: fix security xattr deadlock") Signed-off-by: Max Kellermann <max.kellermann@ionos.com> Reviewed-by: Viacheslav Dubeyko <slava@dubeyko.com> Reviewed-by: Xiubo Li <xiubo.li@clyso.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2026-07-23 20:29:42 +02:00
Max Kellermann	cee38bbf55	ceph: add owner/capability checks for CEPH_IOC_SET_LAYOUT* These permission checks were already missing in the initial implementation of these ioctls. This allows any user who owns a file descriptor to manipulate the layout of any file, even if they don't have write permissions. It might be a good idea to guard other ioctls with permission checks as well or even disallow regular users (even if they own the file) to manipulate layout settings completely, as this may be abused to DoS the Ceph servers, but right now, I find it most urgent to have setter checks at all. Cc: stable@vger.kernel.org Fixes: `8f4e91dee2` ("ceph: ioctls") Signed-off-by: Max Kellermann <max.kellermann@ionos.com> Reviewed-by: Xiubo Li <xiubo.li@clyso.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2026-07-23 20:29:42 +02:00
Max Kellermann	50958bb928	ceph: fix hanging __ceph_get_caps() with stale mds_wanted A reader can hang forever in __ceph_get_caps() when the client no longer holds `FILE_RD`, but local cap state still says that the capability is already wanted (via `mds_wanted`). One way to trigger this is through MDS cap revocation. If another client performs a conflicting operation, the MDS can revoke `FILE_RD` from the reader; the next read then has to reacquire `FILE_RD`. If the cap update that should request `FILE_RD` never reaches the MDS after `cap->mds_wanted` was raised, the reader is left holding only non-file caps while local `mds_wanted` still includes the file read caps. In that state, try_get_cap_refs() sees `need <= mds_wanted` and returns 0, so __ceph_get_caps() just waits on `i_cap_wq`. If the cap update that was supposed to request `FILE_RD` never reaches the MDS after `cap->mds_wanted` was raised, no further request is sent and the waiter can sleep indefinitely until unrelated cap traffic happens to wake it up. The ordering issue is that `cap->mds_wanted` is updated in __prep_cap() before the `CEPH_MSG_CLIENT_CAPS` message is actually queued for send. That makes one field serve two different meanings at once: what this client wants, and what the client believes the MDS already knows it wants. A proper fix would be to split those states and track whether a cap update is actually in flight or has been observed by the MDS. However, simply moving the `cap->mds_wanted` assignment later would not be sufficient: queueing the message in the messenger does not guarantee that the MDS processed that specific wanted set, and reconnect or message loss can still invalidate that assumption. Fixing that properly would require a larger rework of the cap state machine. 
To allow simpler backports to stable kernels, this patch implements a simpler workaround: - stop waiting forever in __ceph_get_caps(); after a bounded wait, fall back to the renew path - make ceph_renew_caps() issue a synchronous `OPEN` request whenever the inode still does not actually hold the wanted caps, instead of only calling ceph_check_caps() The extra issued-vs-wanted check in ceph_renew_caps() is necessary because the previous test only checked whether the inode still had any real caps at all. That is not enough after revocation: the client can still hold something like `pLs` and yet be missing `FILE_RD` completely. In that case, falling back to ceph_check_caps() is not sufficient, because it still trusts `cap->mds_wanted` and may resend nothing. By requiring `(issued & wanted) == wanted` before taking the asynchronous path, the code only uses ceph_check_caps() when the wanted caps are already actually issued. Otherwise, it sends the synchronous `OPEN` renew. This preserves the existing asynchronous fast path when the wanted caps are already issued, avoids changing cap-state semantics, and fixes the hang by guaranteeing that a stalled waiter eventually retries through a path that does not rely on the stale `mds_wanted` state. [ idryomov: move CEPH_GET_CAPS_WAIT_TIMEOUT from libceph.h to mds_client.h, formatting ] Cc: stable@vger.kernel.org Fixes: `0a454bdd50` ("ceph: reorganize __send_cap for less spinlock abuse") Signed-off-by: Max Kellermann <max.kellermann@ionos.com> Reviewed-by: Alex Markuze <amarkuze@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2026-07-23 20:29:42 +02:00
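The difference between the old "holds any caps at all" test and the `(issued & wanted) == wanted` test from the commit above can be shown with a few bit masks. This is a rough sketch only: the `CAP_*` bit values and both helper names are made up for illustration and are not the Ceph client's definitions.

```c
#include <stdbool.h>

/* Hypothetical capability bits, chosen only for this illustration. */
#define CAP_PIN		0x01u
#define CAP_LINK_SHARED	0x02u
#define CAP_FILE_RD	0x04u

/* Old-style test: true as soon as the client holds any cap whatsoever. */
static bool holds_any_caps(unsigned int issued)
{
	return issued != 0;
}

/* Stricter test: true only if every wanted bit is actually issued. */
static bool wanted_caps_issued(unsigned int issued, unsigned int wanted)
{
	return (issued & wanted) == wanted;
}
```

With `issued = CAP_PIN | CAP_LINK_SHARED` and `wanted = CAP_FILE_RD`, the first helper returns true while the second returns false, which is the post-revocation state the commit message describes: the client still holds something, yet not the cap it is waiting for.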
WenTao Liang	c3e64079d8	ceph: fix refcount leak in ceph_readdir() The ceph_readdir() function allocates a ceph_mds_request via ceph_mdsc_create_request() and stores it in dfi->last_readdir. In the directory entry processing loop, if the entry's offset is less than ctx->pos or if the inode pointer is unexpectedly NULL, the function returns -EIO without releasing the reference held by dfi->last_readdir, causing a refcount leak. Fix this by adding ceph_mdsc_put_request(dfi->last_readdir) before returning on these error paths. Also set dfi->last_readdir to NULL for safety, matching the cleanup done at the normal exit. Cc: stable@vger.kernel.org Fixes: `af9ffa6df7` ("ceph: add support to readdir for encrypted names") Signed-off-by: WenTao Liang <vulab@iscas.ac.cn> Reviewed-by: Viacheslav Dubeyko <slava@dubeyko.com> Reviewed-by: Alex Markuze <amarkuze@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2026-07-23 20:29:41 +02:00
Wentao Liang	cbf59617cd	ceph: fix writeback_count leak in write_folio_nounlock() write_folio_nounlock() increments fsc->writeback_count to track in-flight writeback operations. On several error paths where the function returns early (folio lookup failure, snapshot context allocation failure, and writepages submission failure), the function returns without calling atomic_long_dec_return() to decrement the counter. Each leaked increment keeps the counter above zero, which can prevent the filesystem from cleanly unmounting or suspending writes. Add atomic_long_dec_return() calls on all error paths that currently return without decrementing the counter. Cc: stable@vger.kernel.org Fixes: `d55207717d` ("ceph: add encryption support to writepage and writepages") Signed-off-by: Wentao Liang <vulab@iscas.ac.cn> Reviewed-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2026-07-23 20:29:41 +02:00
Bryam Vargas	4dbc71bcaf	ceph: fix pre-auth out-of-bounds read on snaptrace in ceph_handle_caps() ceph_handle_caps() reads snap_trace_len from the wire-format ceph_mds_caps header and uses it unconditionally to build a fake end pointer (snaptrace + snaptrace_len) that is later handed to ceph_update_snap_trace() in the CEPH_CAP_OP_IMPORT case: snaptrace = h + 1; snaptrace_len = le32_to_cpu(h->snap_trace_len); p = snaptrace + snaptrace_len; ... case CEPH_CAP_OP_IMPORT: if (snaptrace_len) { ... if (ceph_update_snap_trace(mdsc, snaptrace, snaptrace + snaptrace_len, false, &realm)) { ... } ceph_update_snap_trace() then decodes a struct ceph_mds_snap_realm from snaptrace using ceph_decode_need(&p, e, sizeof(ri), bad) with the attacker-supplied fake end e == snaptrace + snaptrace_len. With snaptrace_len == 0xFFFFFFFF the bound check is trivially satisfied, ri = p reads sizeof(struct ceph_mds_snap_realm) past the legitimate msg->front buffer, and ri->num_snaps / ri->num_prior_parent_snaps then drive further out-of-bounds reads of the encoded snap arrays. The eleven msg_version >= 2 .. msg_version >= 12 decoder blocks above the op switch each catch this OOB through their ceph_decode_*_safe() / ceph_decode_need() helpers, but they sit behind a hdr.version-gated if, so a malicious or compromised MDS that sets msg->hdr.version = 1 reaches the IMPORT path with no version-gated decoder having validated snap_trace_len. The shape has been present since ceph_handle_caps() was introduced. Validate snap_trace_len against the message front buffer before consuming it, using the canonical ceph_decode_need() / ceph_has_room() helper. The helper bounds the length with subtraction (n <= end - p, guarded by end >= p) rather than pointer addition, so it is wrap-safe for the attacker-controlled u32 length on 32-bit builds where p + snap_trace_len could overflow the address space. This matches the rest of the ceph decode path (e.g. 
the pool_ns_len check a few lines below), and the existing goto bad cleanup already covers this exit path. Cc: stable@vger.kernel.org Fixes: `a8599bd821` ("ceph: capability management") Signed-off-by: Bryam Vargas <hexlabsecurity@proton.me> Reviewed-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2026-07-23 20:29:41 +02:00
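The wrap-safety argument in the commit above (bounding with `n <= end - p`, guarded by `end >= p`, instead of adding the length to the pointer) can be sketched as follows. Addresses are modelled as `uint32_t` values to mimic a 32-bit build deterministically; both function names are hypothetical and this is not the libceph helper itself.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Addition-based check: p + n can wrap past the top of a 32-bit address
 * space, so a huge n may wrongly appear to fit.
 */
static bool has_room_by_addition(uint32_t p, uint32_t end, uint32_t n)
{
	return (uint32_t)(p + n) <= end;
}

/*
 * Subtraction-based check: end - p cannot wrap once end >= p is known,
 * so an oversized n is always rejected.
 */
static bool has_room_by_subtraction(uint32_t p, uint32_t end, uint32_t n)
{
	return end >= p && n <= end - p;
}
```

For a 0x100-byte buffer starting at 0xFFFF0000 and a length of 0xFFFFFFFF, the addition form wraps and accepts the length, while the subtraction form rejects it; for lengths that really fit, both forms agree.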
Guidong Han	8b7e8245e2	eventpoll: pin files while checking reverse paths Commit `319c151747` ("epoll: take epitem list out of struct file") intentionally removed temporary file references from the reverse path check list. At the time, both epitems and their files were freed after an RCU grace period, so unlist_file() could obtain file->f_lock through an epitem while clear_tfile_check_list() held rcu_read_lock(). Commit `0ede61d858` ("file: convert to SLAB_TYPESAFE_BY_RCU") made struct file SLAB_TYPESAFE_BY_RCU and removed its RCU-delayed freeing. RCU still protects the epitem, but no longer keeps the referenced file from being freed and reused. A concurrent close can therefore make unlist_file() lock or unlock f_lock in a recycled file object. This violates the documented SLAB_TYPESAFE_BY_RCU rule requiring a reference before acquiring an object's lock. The race was reproduced, causing a wild unlock of f_lock in a recycled file and breaking its mutual exclusion. Add ->file to epitems_head to remember the pinned file independently of ->epitems. A concurrent EPOLL_CTL_DEL can empty ->epitems before the head is unlisted, leaving no epi->ffd.file from which to drop the reference. In list_file(), acquire the reference before adding the head to the check list. The caller either owns a reference or holds the ep->mtx for the epitem leading to the file. In the latter case, file_ref_get() can fail after the last reference is dropped, but eventpoll_release_file() must acquire the same mutex before the file can be freed. The dying leaf can be skipped because removing links cannot increase the reverse path count. In unlist_file(), epnested_mutex excludes another list_file() or unlist_file(), while head->next prevents a concurrent EPOLL_CTL_DEL from freeing the head. Save head->file locally, clear it with head->next under f_lock, and drop the reference after the RCU-protected operation. 
Christian Brauner <brauner@kernel.org> quotes: > SLAB_TYPESAFE_BY_RCU allows a slab slot to be reused while an RCU reader > still holds its old address. Once that address contains a new live > struct file, KASAN sees valid, unpoisoned memory and cannot distinguish > the stale object identity. CONFIG_DEBUG_SPINLOCK exposes the failure > instead. > > The failing interleaving is: > > CPU0: nested EPOLL_CTL_ADD CPU1: close/open churn > ------------------------------------ --------------------------------- > p = hlist_first_rcu(&head->epitems) > epi = container_of(p, ...) > close(victim) > __fput() > eventpoll_release_file() > file_free(victim) > // the slot is free; f_lock remains > spin_lock(&epi->ffd.file->f_lock) > open() reuses the slot as new_file > spin_lock_init(&new_file->f_lock) > spin_unlock(&epi->ffd.file->f_lock) // wild unlock of new_file's lock > > CONFIG_DEBUG_SPINLOCK reports: > > BUG: spinlock already unlocked on CPU#0, poc_unlist/150 > lock: 0xffff8880067fb200, .magic: dead4ead, .owner: <none>/-1, .owner_cpu: -1 > CPU: 0 UID: 1000 PID: 150 Comm: poc_unlist Not tainted 7.2.0-rc3-dirty #22 PREEMPTLAZY > Hardware name: QEMU Ubuntu 24.04 PC v2 (i440FX + PIIX, arch_caps fix, 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014 > Call Trace: > <TASK> > dump_stack_lvl+0x64/0x80 > do_raw_spin_unlock+0x75/0xb0 > _raw_spin_unlock+0xe/0x30 > clear_tfile_check_list+0x88/0xe0 > do_epoll_ctl_file+0x519/0xcf0 > ? __pfx_ep_ptable_queue_proc+0x10/0x10 > do_epoll_ctl+0x8f/0x100 > __x64_sys_epoll_ctl+0x6f/0xa0 > do_syscall_64+0xdc/0x520 > ? 
srso_alias_return_thunk+0x5/0xfbef5 > entry_SYSCALL_64_after_hwframe+0x76/0x7e > RIP: 0033:0x42034e > Code: 48 c7 c1 b8 ff ff ff f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 49 89 ca b8 e9 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89 01 48 > RSP: 002b:00007a657ff3c198 EFLAGS: 00000202 ORIG_RAX: 00000000000000e9 > RAX: ffffffffffffffda RBX: 00007a657ff3ccdc RCX: 000000000042034e > RDX: 0000000000000003 RSI: 0000000000000001 RDI: 0000000000000004 > RBP: 00007a657ff3c2f0 R08: 0000000000000000 R09: 00007a657ff3c6c0 > R10: 00007a657ff3c1a4 R11: 0000000000000202 R12: 00007a657ff3c6c0 > R13: ffffffffffffffb8 R14: 000000000000000d R15: 00007fffb7de0210 > </TASK> > ------------[ cut here ]------------ > > unlist_file() does not appear as a separate frame because it was inlined > into clear_tfile_check_list(). This report was obtained with mdelay() > instrumentation immediately before spin_lock() and spin_unlock() in > unlist_file() to widen the two race windows. > > More importantly, this is a wild unlock. The stale unlock can target > f_lock of a different live file and invalidate mutual exclusion for > state protected by that lock. Turning this into a reliable exploit > would require precise scheduling and same-slot reuse and is likely > difficult, but the primitive is potentially exploitable. Reported-by: Qi Tang <tpluszz77@gmail.com> Reported-by: Junxi Qian <qjx1298677004@gmail.com> Fixes: `0ede61d858` ("file: convert to SLAB_TYPESAFE_BY_RCU") Cc: stable@vger.kernel.org Signed-off-by: Guidong Han <2045gemini@gmail.com> Link: https://patch.msgid.link/20260718104406.27897-1-2045gemini@gmail.com Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>	2026-07-23 12:48:22 +02:00
Usama Arif	0ef8faff49	fs: push nr_cached_objects memcg gating into individual filesystems Commit 0baad6f9b997 ("fs/super: skip non-memcg-aware nr_cached_objects in memcg slab shrink") added a check in fs/super.c that skipped every ->nr_cached_objects() hook whenever the shrinker was invoked for a non-root memcg, on the assumption that none of them honour sc->memcg. That assumption is wrong for XFS, whose inode-reclaim hook is intentionally driven from per-memcg contexts to free memcg-charged slab. Encoding a blanket "never memcg-aware" policy in fs/super.c short-circuits that path. Push the check down into the callbacks whose counters really are irrelevant to per-memcg reclaim - btrfs_nr_cached_objects() and shmem_unused_huge_count() - and drop the fs/super.c gate. Each filesystem can now lift the restriction independently if its counter later grows memcg awareness, without touching fs/super.c. Introduce mem_cgroup_shrink_is_root() in <linux/memcontrol.h> so the callbacks don't open-code "sc->memcg is NULL or root". Fixes: 0baad6f9b997 ("fs/super: skip non-memcg-aware nr_cached_objects in memcg slab shrink") Acked-by: Qi Zheng <qi.zheng@linux.dev> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev> Signed-off-by: Usama Arif <usama.arif@linux.dev> Link: https://patch.msgid.link/20260715103516.2410175-1-usama.arif@linux.dev Acked-by: David Sterba <dsterba@suse.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>	2026-07-23 11:35:02 +02:00
David Howells	62d9853aa4	afs: Fix afs_edit_dir_remove() to get, not find, block 0 Fix afs_edit_dir_remove() to use afs_dir_get_block() to get block 0 rather than afs_dir_find_block() as the latter caches the found block in the afs_dir_iter and may[*] switch out the page it's on if another afs_dir_find_block() is done. This parallels what afs_edit_dir_add() does. [*] There's more than one block per page. Fixes: `a5b5beebcf` ("afs: Use the contained hashtable to search a directory") Closes: https://sashiko.dev/#/patchset/20260706153408.1231650-1-dhowells%40redhat.com Signed-off-by: David Howells <dhowells@redhat.com> Link: https://patch.msgid.link/2380759.1783956175@warthog.procyon.org.uk cc: Marc Dionne <marc.dionne@auristor.com> cc: linux-afs@lists.infradead.org cc: linux-fsdevel@vger.kernel.org cc: stable@vger.kernel.org Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>	2026-07-23 11:32:40 +02:00
Zhang Yi	c97cd6f447	iomap: prevent ioend merge when io_private differs Different io_private values indicate distinct completion contexts that must not be merged together, as this could leak or corrupt the private data associated with each ioend. Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Link: https://patch.msgid.link/20260713074206.1768006-1-yi.zhang@huaweicloud.com Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>	2026-07-23 11:28:47 +02:00
Christian Brauner	1c9cb1826b	Merge patch series "iomap: trivial fixes for ext4 conversion" Zhang Yi <yi.zhang@huaweicloud.com> says: This patch series contains a few trivial iomap-related fixes in preparation for converting ext4 buffered I/O to use iomap. The first three patches are taken from my ext4 conversion series [1], as suggested by Christoph. The fourth patch fixes a bug originally reported by Sashiko during review of my series; although unrelated to the ext4 conversion, it is worth fixing on its own. Please see the following patches for details. The fifth patch adds comments for ifs_clear/set_range_dirty(), and the last patch avoids merging ioends that have different private data. [1] https://lore.kernel.org/linux-ext4/20260511072344.191271-1-yi.zhang@huaweicloud.com/ * patches from https://patch.msgid.link/20260714082325.325163-1-yi.zhang@huaweicloud.com: iomap: add comments for ifs_clear/set_range_dirty() iomap: fix out-of-bounds bitmap_set() with zero-length range iomap: fix incorrect did_zero setting in iomap_zero_iter() iomap: support invalidating partial folios iomap: correct the range of a partial dirty clear Link: https://patch.msgid.link/20260714082325.325163-1-yi.zhang@huaweicloud.com Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>	2026-07-23 11:26:28 +02:00
Zhang Yi	09b53b0787	iomap: add comments for ifs_clear/set_range_dirty() The range alignment strategy differs between ifs_clear_range_dirty() and ifs_set_range_dirty(). The former rounds inwards to clear only fully-covered blocks, while the latter rounds outwards to mark any partially-touched block as dirty. Add comments to document this asymmetry in block range calculation. Suggested-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Link: https://patch.msgid.link/20260714082325.325163-6-yi.zhang@huaweicloud.com Reviewed-by: Joanne Koong <joannelkoong@gmail.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>	2026-07-23 11:26:28 +02:00
Zhang Yi	9c7d8f7c89	iomap: fix out-of-bounds bitmap_set() with zero-length range ifs_set_range_dirty() and ifs_set_range_uptodate() compute last_blk as (off + len - 1) >> i_blkbits. When off is 0 and len is 0, the unsigned subtraction underflows to SIZE_MAX, producing a huge last_blk and nr_blks value that causes bitmap_set() to write far beyond the ifs->state allocation. Regarding ifs_set_range_uptodate(), it is temporarily safe because len cannot be passed in as 0. However, for ifs_set_range_dirty() this is reachable from __iomap_write_end(): when copy_folio_from_iter_atomic() returns 0 (e.g. user buffer fault) and the folio is already uptodate, the guard at the top of __iomap_write_end() does not trigger because !folio_test_uptodate() is false, and iomap_set_range_dirty() is called with copied == 0. Add a !len guard to both functions before the computation, so that a zero-length range is a no-op. Fixes: `4ce02c6797` ("iomap: Add per-block dirty state tracking to improve performance") Cc: stable@vger.kernel.org # v6.6 Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Link: https://patch.msgid.link/20260714082325.325163-5-yi.zhang@huaweicloud.com Reviewed-by: Joanne Koong <joannelkoong@gmail.com> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>	2026-07-23 11:26:28 +02:00
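The underflow described in the commit above can be reproduced in plain userspace C. This is a minimal sketch, not the iomap code: the function names and the `blkbits` parameter are hypothetical, and only the expression `(off + len - 1) >> blkbits` and the `!len` guard are taken from the commit message.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the unguarded last-block computation. */
static inline size_t last_block_unguarded(size_t off, size_t len,
					  unsigned int blkbits)
{
	/* With off == 0 and len == 0 the subtraction wraps to SIZE_MAX. */
	return (off + len - 1) >> blkbits;
}

/*
 * Guarded variant: a zero-length range touches no blocks, so return 0
 * before the wrapping subtraction can happen.
 */
static inline size_t nr_blocks_guarded(size_t off, size_t len,
				       unsigned int blkbits)
{
	size_t first_blk, last_blk;

	if (!len)
		return 0;

	first_blk = off >> blkbits;
	last_blk = (off + len - 1) >> blkbits;
	return last_blk - first_blk + 1;
}
```

With 4096-byte blocks (`blkbits == 12`), the unguarded helper yields `SIZE_MAX >> 12` for an empty range at offset 0, which is the huge block index that would be handed to a bitmap setter; the guarded helper reports zero blocks instead.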
Zhang Yi	7a6fd6b21d	iomap: fix incorrect did_zero setting in iomap_zero_iter() The did_zero output parameter was unconditionally set after the loop, which is incorrect. It should only be set when the zeroing operation actually completes, not when IOMAP_F_STALE is set or when IOMAP_F_FOLIO_BATCH is set but !folio causes the loop to break early, or when iomap_iter_advance() returns an error. This causes did_zero to be incorrectly set when zeroing a clean unwritten extent because the loop exits early without actually zeroing any data. Fix it by using a local variable to track whether any folio was actually zeroed, and only set did_zero after the loop if zeroing happened. Fixes: `98eb8d9502` ("iomap: set did_zero to true when zeroing successfully") Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Link: https://patch.msgid.link/20260714082325.325163-4-yi.zhang@huaweicloud.com Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>	2026-07-23 11:26:28 +02:00
Zhang Yi	562d192c43	iomap: support invalidating partial folios Current iomap_invalidate_folio() can only invalidate an entire folio. If we truncate a partial folio on a filesystem where the block size is smaller than the folio size, it will leave behind dirty bits for the truncated or punched blocks. During the write-back process, it will attempt to map the invalid hole range. Fortunately, this has not caused any real problems so far because the ->writeback_range() function corrects the length. However, the implementation of FALLOC_FL_ZERO_RANGE in ext4 depends on the support for invalidating partial folios. When ext4 partially zeroes out a dirty and unwritten folio, it does not perform a flush first like XFS. Therefore, if the dirty bits of the corresponding area cannot be cleared, the zeroed area after writeback remains in the written state rather than reverting to the unwritten state. Fix this by supporting invalidation of partial folios. Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Link: https://patch.msgid.link/20260714082325.325163-3-yi.zhang@huaweicloud.com Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Joanne Koong <joannelkoong@gmail.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>	2026-07-23 11:26:27 +02:00
Zhang Yi	88c2651531	iomap: correct the range of a partial dirty clear The block range calculation in ifs_clear_range_dirty() is incorrect when partially clearing a range in a folio. We cannot clear the dirty bit of the first block or the last block if the start or end offset is not blocksize-aligned. This has not yet caused any issues since we always clear a whole folio in iomap_writeback_folio(). Fix this by rounding up the first block to blocksize alignment, and calculate the last block by rounding down (using truncation). Correct the nr_blks calculation accordingly. Fixes: `4ce02c6797` ("iomap: Add per-block dirty state tracking to improve performance") Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Link: https://patch.msgid.link/20260714082325.325163-2-yi.zhang@huaweicloud.com Reviewed-by: Joanne Koong <joannelkoong@gmail.com> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>	2026-07-23 11:26:27 +02:00
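The inward versus outward rounding that this series distinguishes can be sketched as two small helpers. This is an illustration under assumptions: `struct blk_range` and both function names are made up for the example, and the arithmetic follows the wording of the commit messages (round the first block up and the last block down when clearing, cover every partially touched block when setting) rather than the actual iomap source.

```c
#include <stddef.h>

/* Hypothetical result type: first block index and number of blocks. */
struct blk_range {
	size_t first;
	size_t nr;
};

/* Setting dirty rounds outwards: any partially touched block counts. */
static inline struct blk_range dirty_set_range(size_t off, size_t len,
					       unsigned int blkbits)
{
	struct blk_range r = { 0, 0 };

	if (!len)
		return r;

	r.first = off >> blkbits;
	r.nr = ((off + len - 1) >> blkbits) - r.first + 1;
	return r;
}

/* Clearing dirty rounds inwards: only fully covered blocks count. */
static inline struct blk_range dirty_clear_range(size_t off, size_t len,
						 unsigned int blkbits)
{
	size_t blksize = (size_t)1 << blkbits;
	size_t first = (off + blksize - 1) >> blkbits;	/* round up */
	size_t end = (off + len) >> blkbits;		/* round down, exclusive */
	struct blk_range r = { first, end > first ? end - first : 0 };

	return r;
}
```

For 1024-byte blocks, the byte range starting at 512 with length 2048 touches blocks 0 to 2, so the set helper returns three blocks, while the clear helper returns only block 1, the single block the range fully covers.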
Chen Changcheng	503d67fbae	fs/super: fix emergency thaw double-unlock of s_umount do_thaw_all() iterates over all superblocks via __iterate_supers() with SUPER_ITER_EXCL, which acquires s_umount exclusively before calling the callback and releases it afterwards. However, the callback do_thaw_all_callback() calls thaw_super_locked() which unconditionally releases s_umount on every code path. This results in a second unlock attempt in __iterate_supers() that corrupts the rwsem state, triggering a DEBUG_RWSEMS warning: [ 182.601148] sysrq: Emergency Thaw of all frozen filesystems [ 182.601865] ------------[ cut here ]------------ [ 182.602375] DEBUG_RWSEMS_WARN_ON((rwsem_owner(sem) != current) && !rwsem_test_oflags(sem, RWSEM_NONSPINNABLE)): count = 0x0, magic = 0xffff99b1011e5870, owner = 0x0, curr 0xffff99b101b06c80, list not empty [ 182.603817] WARNING: kernel/locking/rwsem.c:1412 at up_write+0xa3/0x170, CPU#2: kworker/2:1/53 [ 182.604578] Modules linked in: [ 182.604864] CPU: 2 UID: 0 PID: 53 Comm: kworker/2:1 Not tainted 7.2.0-rc4-00001-gbd3bd93ea98a-dirty #4 PREEMPT(lazy) [ 182.605711] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1kylin1 04/01/2014 [ 182.606417] Workqueue: events do_thaw_all [ 182.606750] RIP: 0010:up_write+0xaf/0x170 [ 182.607076] Code: 19 3a 92 48 0f 44 c2 48 8b 55 08 48 8b 55 00 4c 8b 45 08 48 8b 55 00 48 8d 3d ad 91 e0 01 48 8b 4d 20 50 48 c7 c6 f0 8c 26 92 <67> 48 0f b9 3a e8 d7 93 4e 00 58 eb 81 48 83 7f 18 00 48 c7 c2 8d [ 182.608563] RSP: 0018:ffffb670001d7e08 EFLAGS: 00010246 [ 182.609007] RAX: ffffffff92349e8d RBX: 0000000000000000 RCX: ffff99b1011e5870 [ 182.609595] RDX: 0000000000000000 RSI: ffffffff92268cf0 RDI: ffffffff92914d10 [ 182.610283] RBP: ffff99b1011e5870 R08: 0000000000000000 R09: ffff99b101b06c80 [ 182.610847] R10: ffff99b10139a808 R11: fefefefefefefeff R12: 0000000000000000 [ 182.611414] R13: ffffffff90cf74d0 R14: 0000000000000000 R15: ffff99b1011e5800 [ 182.612009] FS: 0000000000000000(0000) 
GS:ffff99b1eaaee000(0000) knlGS:0000000000000000 [ 182.612670] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 182.613146] CR2: 00000000005c631c CR3: 00000000013ee000 CR4: 00000000000006f0 [ 182.613722] Call Trace: [ 182.613946] <TASK> [ 182.614130] __iterate_supers+0x128/0x150 [ 182.614463] do_thaw_all+0x1b/0x30 [ 182.614759] process_scheduled_works+0xbb/0x3f0 [ 182.615150] ? __pfx_worker_thread+0x10/0x10 [ 182.615499] worker_thread+0x129/0x270 [ 182.615816] ? __pfx_worker_thread+0x10/0x10 [ 182.616201] kthread+0xe2/0x120 [ 182.616469] ? __pfx_kthread+0x10/0x10 [ 182.616792] ret_from_fork+0x15b/0x240 [ 182.617115] ? __pfx_kthread+0x10/0x10 [ 182.617426] ret_from_fork_asm+0x1a/0x30 [ 182.617761] </TASK> [ 182.617968] ---[ end trace 0000000000000000 ]--- [ 182.618412] Emergency Thaw complete Fix this by switching to SUPER_ITER_UNLOCKED and acquiring s_umount in the callback via super_lock_excl() before calling thaw_super_locked(). This matches the locking pattern expected by thaw_super_locked() and eliminates the double unlock. While at it, remove the dead 'return;' at the end of do_thaw_all_callback(). Fixes: `2992476528` ("super: use a common iterator (Part 1)") Cc: stable@vger.kernel.org Signed-off-by: Chen Changcheng <chenchangcheng@kylinos.cn> Link: https://patch.msgid.link/20260721064140.152305-1-chenchangcheng@kylinos.cn Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>	2026-07-23 10:10:24 +02:00
Namjae Jeon	5e1b924808	ksmbd: reject undersized decompressed SMB2 requests ksmbd_decompress_request() bounds the decompressed size only against the maximum request size. A compression transform can therefore produce a buffer smaller than an SMB2 PDU and install it as conn->request_buf. The receive path subsequently calls ksmbd_smb_request(), which reads the protocol ID before the normal SMB2 minimum-size check. If the decompressed output is too short, that read can access beyond the request allocation. Require the decompressed output to contain at least a complete minimum SMB2 PDU before allocating and installing the replacement request buffer. Fixes: `a08de24c2b` ("ksmbd: negotiate and decode SMB2 compression") Cc: stable@vger.kernel.org Signed-off-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>	2026-07-22 09:54:10 -05:00
Namjae Jeon	cfc0b8e508	ksmbd: validate minimum PDU size for transform requests The receive path applies the minimum SMB2 PDU size check only when ProtocolId is SMB2_PROTO_NUMBER. A packet carrying SMB2_TRANSFORM_PROTO_NUM bypasses the check even when the negotiated dialect does not provide transform handling. On an SMB 2.1 connection, a short transform packet therefore reaches init_smb2_rsp_hdr(), which interprets the request as a full SMB2 header and reads beyond the request allocation. The copied fields can then be returned to the unauthenticated client. Compression transforms are converted to ordinary SMB2 messages before protocol validation. After that conversion, validate ordinary SMB2 requests against SMB2_MIN_SUPPORTED_PDU_SIZE and require encryption transform requests to contain both a transform header and an SMB2 header. This rejects truncated requests before work allocation. Fixes: `368ba06881` ("ksmbd: check the validation of pdu_size in ksmbd_conn_handler_loop") Cc: stable@vger.kernel.org Reported-by: zdi-disclosures@trendmicro.com # ZDI-CAN-31063 Signed-off-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>	2026-07-22 09:54:10 -05:00
James Montgomery	c74801ee52	ksmbd: defer destroy_previous_session() until after NTLM authentication In ntlm_authenticate(), destroy_previous_session() is called using a user pointer resolved from the client-supplied NTLM blob username field before the NTLMv2 response is validated. An authenticated attacker can set the NTLM blob username to match a victim account and set PreviousSessionId to the victim's session ID; destroy_previous_session() destroys the victim's session while ksmbd_decode_ntlmssp_auth_blob() subsequently rejects the request with -EPERM. Move destroy_previous_session() and the prev_id assignment to after ksmbd_decode_ntlmssp_auth_blob() returns success and use sess->user rather than the pre-authentication lookup result. This matches the ordering already used by krb5_authenticate(), where destroy_previous_session() is called only after ksmbd_krb5_authenticate() returns success. Fixes: `e2f34481b2` ("cifsd: add server-side procedures for SMB3") Cc: stable@vger.kernel.org Link: https://lore.kernel.org/linux-cifs/20260702155449.3639773-1-james_montgomery@disroot.org/ Signed-off-by: James Montgomery <james_montgomery@disroot.org> Acked-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>	2026-07-22 09:54:10 -05:00
Namjae Jeon	5152c6d49e	ksmbd: validate ACE size against SID sub-authorities set_ntacl_dacl() validates sid.num_subauth before copying an ACE, but does not verify that the declared ACE size contains all sub-authorities described by that field. An undersized ACE can therefore be copied and later make the POSIX ACL deduplication walk inspect data beyond the copied ACE boundary. The existing initial bound check is also too small. It only ensures that the ACE size field is accessible before set_ntacl_dacl() reads sid.num_subauth farther into the input buffer. Require enough input for the fixed SID header before accessing num_subauth, reject ACEs smaller than that header, and skip ACEs whose declared size cannot contain the complete SID. This makes the validation consistent with the other ACE walk paths. Reported-by: LocalHost <localhost.detect@gmail.com> Signed-off-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>	2026-07-22 09:54:10 -05:00
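The validation order described above (fixed SID header first, then the declared ACE size against the sub-authority count) might look roughly like the sketch below. It is not the ksmbd code: the function name, the byte offsets and the constants are assumptions based on the usual on-wire ACE layout (8-byte ACE header with a little-endian 16-bit size at offset 2, followed by a SID with an 8-byte fixed part and 4 bytes per sub-authority).

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed layout constants, for illustration only. */
#define ACE_HDR_SIZE		8u	/* type, flags, size, access mask */
#define SID_FIXED_SIZE		8u	/* revision, num_subauth, authority[6] */
#define SID_MAX_SUBAUTH		15u

/* Return true if the ACE at buf can hold the SID it declares. */
static inline bool ace_size_ok(const uint8_t *buf, size_t avail)
{
	uint16_t ace_size;
	uint8_t num_subauth;

	/* The fixed SID header must be readable before num_subauth is. */
	if (avail < ACE_HDR_SIZE + SID_FIXED_SIZE)
		return false;

	ace_size = (uint16_t)(buf[2] | (buf[3] << 8));
	if (ace_size < ACE_HDR_SIZE + SID_FIXED_SIZE || ace_size > avail)
		return false;

	num_subauth = buf[ACE_HDR_SIZE + 1];
	if (num_subauth > SID_MAX_SUBAUTH)
		return false;

	/* The declared size must contain every sub-authority. */
	return ace_size >= ACE_HDR_SIZE + SID_FIXED_SIZE + 4u * num_subauth;
}
```

An ACE declaring 20 bytes is accepted with one sub-authority and rejected with two, since two sub-authorities need 24 bytes.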
Wentao Guan	bbf0a8e931	ksmbd: restore DACL size on check_add_overflow() to avoid malformed ACL check_add_overflow() unconditionally writes the truncated sum into d even on overflow, per its contract in include/linux/overflow.h. The four check_add_overflow() guards in set_posix_acl_entries_dacl() and set_ntacl_dacl() break out of the ACE-building loops on overflow, but the truncated size is then consumed downstream at the end of set_ntacl_dacl(): pndacl->size = cpu_to_le16(le16_to_cpu(pndacl->size) + size); This produces an on-wire NT ACL whose pndacl->size under-reports the bytes actually written by the preceding fill_ace_for_sid()/memcpy() calls, yielding a malformed ACL that can trigger out-of-bounds reads when re-parsed by clients or ksmbd itself. Restore size to its pre-addition value on each overflow branch (via `size -= ace_sz` / `size -= nt_ace_size`) so that after the break, *size once again holds the cumulative size of the successfully-written ACEs. The committed ACL is then truncated-but-self-consistent rather than malformed. The ksmbd DACL builders are the only check_add_overflow() sites found where an overflow path breaks out of a loop and the destination value is consumed afterward. The other nearby break-style cases either return -EINVAL on overflow (transport_ipc.c) or break without consuming the overflowed destination value afterward (buildid.c). Fixes: `299f962c0b` ("ksmbd: use check_add_overflow() to prevent u16 DACL size overflow") Assisted-by: atomcode:glm-5.2 Assisted-by: Codex:gpt-5.5 Cc: stable@vger.kernel.org Signed-off-by: Wentao Guan <guanwentao@uniontech.com> Acked-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>	2026-07-22 09:54:10 -05:00
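The effect of the restore can be shown outside the kernel with the GCC/Clang builtin `__builtin_add_overflow()`, which, like the `check_add_overflow()` contract quoted above, stores the truncated sum even when it reports overflow. The function below is a hypothetical sketch, not the ksmbd DACL builder; the `restore` flag exists only to compare the two behaviours.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Sum ACE sizes into a 16-bit total and stop at the first overflow.
 * With restore == true the wrapped addition is undone before breaking,
 * so the total again matches the ACEs that were accepted.
 */
static inline uint16_t total_ace_size(const uint16_t *ace_sz, size_t n,
				      bool restore)
{
	uint16_t size = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		if (__builtin_add_overflow(size, ace_sz[i], &size)) {
			if (restore)
				size -= ace_sz[i];	/* undo the wrapped add */
			break;
		}
	}
	return size;
}
```

For sizes 40000, 20000 and 10000 the third addition overflows: without the restore the total is the wrapped value 4464, with it the total stays at 60000, the size of the two entries that fit.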
Namjae Jeon	58d97fcd0b	ksmbd: bound DACL dedup walk to copied ACEs set_ntacl_dacl() can stop copying ACEs before consuming the full input DACL when size accounting overflows. When that happens, num_aces reflects only the ACEs that were actually copied into the output DACL, but set_posix_acl_entries_dacl() still receives nt_num_aces and uses it to walk the existing ACE array during dedup. That makes the dedup walk scan past the copied ACE array and inspect buffer tail that does not contain valid ACEs. Split the two meanings currently carried by the NT ACE count. Pass the number of copied NT ACEs to bound the dedup walk, and preserve the original "input DACL had NT ACEs" state separately for the Everyone/default ACL fallback. This keeps the dedup walk aligned with the ACEs that are actually present in the rebuilt DACL. Signed-off-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>	2026-07-22 09:54:10 -05:00
Namjae Jeon	2bebf2470a	ksmbd: enforce signing required by the session SMB2_FLAGS_SIGNED is controlled by the incoming request and only indicates that a signature accompanies that request. Do not use it to decide whether a signing-required session must authenticate the request. Reject an unsigned plaintext request before dispatch when the session requires signing. Continue to validate signatures on signed requests, including when signing is optional. Encrypted requests have already been authenticated during decryption. An OPLOCK_BREAK acknowledgment is a session request and is subject to the same signing rule, so do not exclude it from signed-request detection. Reported-by: Charles Vosburgh <trilobyte777@gmail.com> Tested-by: ChenXiaoSong <chenxiaosong@kylinos.cn> Reviewed-by: ChenXiaoSong <chenxiaosong@kylinos.cn> Signed-off-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>	2026-07-22 09:54:03 -05:00
Namjae Jeon	e148e567a9	ksmbd: preserve VFS inherited POSIX ACL mask The VFS initializes a child's POSIX ACL from the parent's default ACL and the requested creation mode. Do not mutate the parent ACL or overwrite the child's VFS-computed access and default ACLs afterwards. This preserves restrictive ACL_MASK entries and prevents SMB object creation from widening effective permissions. Reported-by: Charles Vosburgh <trilobyte777@gmail.com> Signed-off-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>	2026-07-22 09:54:01 -05:00
Li Chen	1d0cff74d8	pidfs: handle FS_IOC32_GETVERSION in compat ioctl FS_IOC32_GETVERSION has a distinct compat command encoding. Passing it through compat_ptr_ioctl() leaves pidfd_ioctl() unable to recognize the otherwise architecture-independent inode generation query. Translate the compat command to FS_IOC_GETVERSION before dispatching it through the native pidfd ioctl implementation. Signed-off-by: Li Chen <me@linux.beauty> Link: https://patch.msgid.link/20260716052822.1034228-1-me@linux.beauty Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>	2026-07-22 16:45:57 +02:00
Amir Goldstein	a1e0eb8f55	ovl: check access to copy_file_range source with src mounter creds Commit `5dae222a5f` ("vfs: allow copy_file_range to copy across devices") allowed filesystems that implement the copy_file_range() f_op to decide if they want to access cross-sb copy from/to the same fs type. The same commit added checks to verify same sb copy for filesystems that implement ->copy_file_range() and do not support cross-sb copy at the time, namely, to ceph, fuse and nfs. The two remaining fs which implement ->copy_file_range(), cifs and overlayfs, started to support cross-sb copy from this time. While overlayfs does support cross-sb copy when the two underlying files are on the same base fs, the copy operation on the two real files from two different overlayfs filesystems is performed with the mounter creds of the destination overlayfs and the read permission access hook for the source file was called with the wrong creds. This could cause either denial of access to copy which would otherwise be allowed (e.g. with splice) or allow read access to a file which would otherwise be denied. Fix the latter case by explicitly verifying read access to the source file with the source overlayfs mounter creds. The former case remains a quirk of cross-sb overlayfs copy, but userspace could fall back to regular copy so no harm done. Fixes: `5dae222a5f` ("vfs: allow copy_file_range to copy across devices") Signed-off-by: Amir Goldstein <amir73il@gmail.com> Link: https://patch.msgid.link/20260712122421.203113-1-amir73il@gmail.com Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>	2026-07-22 16:42:28 +02:00
Jann Horn	425224c2d7	proc: Fix broken error paths for namespace links Don't return the return value of down_read_killable() (0) when a ptrace access check fails, return -EACCES as intended. Reported-by: Magnus Lindholm <linmag7@gmail.com> Closes: https://lore.kernel.org/r/20260706170735.2941493-1-linmag7@gmail.com Fixes: `6650527444` ("proc: protect ptrace_may_access() with exec_update_lock (part 1)") Cc: stable@vger.kernel.org Signed-off-by: Jann Horn <jannh@google.com> Link: https://patch.msgid.link/20260706-procfs-ns-eacces-fix-v1-1-a69ab14c02e6@google.com Tested-by: Magnus Lindholm <linmag7@gmail.com> Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>	2026-07-22 15:37:24 +02:00
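The class of bug fixed here, reusing a variable that still holds a successful return value on a failure path, can be sketched in a few lines. Everything below is hypothetical: `lock_killable_stub()` merely stands in for a lock helper that returns 0 on success, and the two functions are not the procfs code.

```c
#include <errno.h>
#include <stdbool.h>

/* Stand-in for a killable lock helper: 0 means the lock was taken. */
static inline int lock_killable_stub(void)
{
	return 0;
}

/* Buggy shape: the access failure path returns the lock's 0. */
static inline int ns_link_buggy(bool may_access)
{
	int err = lock_killable_stub();

	if (err)
		return err;
	if (!may_access)
		return err;	/* err is still 0 here: failure looks like success */
	return 0;
}

/* Fixed shape: the access failure path returns -EACCES explicitly. */
static inline int ns_link_fixed(bool may_access)
{
	int err = lock_killable_stub();

	if (err)
		return err;
	if (!may_access)
		return -EACCES;
	return 0;
}
```

A caller that denies access gets 0 from the buggy shape and `-EACCES` from the fixed one.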
Christian Brauner	927b89700e	pidfs: add pidfs_dentry_open() helper Both pidfs_alloc_file() and pidfs_export_open() need to force O_RDWR and reapply the pidfd flags that do_dentry_open() strips. Move the common logic into a helper. PIDFD_AUTOKILL is now part of the restore mask in the file handle path as well, but pidfs_export_permission() rejects O_TRUNC, so this is a no-op there. But warn nonetheless. Link: https://patch.msgid.link/20260722-esszimmer-umsetzen-nennt-ed5fc604300a@brauner Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>	2026-07-22 14:05:15 +02:00
Christian Brauner	bb6bc13c53	pidfs: preserve thread pidfds reopened by file handle PIDFD_THREAD shares O_EXCL. do_dentry_open() clears O_EXCL after pidfs_export_open() validates the flags, so open_by_handle_at() silently turns a thread pidfd into a process pidfd. Restore PIDFD_THREAD on the opened file, matching pidfs_alloc_file(). Signed-off-by: Li Chen <me@linux.beauty> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20260716052726.1032092-1-me@linux.beauty Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>	2026-07-22 13:55:20 +02:00
Yichong Chen	a8e72879cd	ovl: fix trusted xattr escape prefix matching In the trusted.* xattr namespace, ovl_is_escaped_xattr() compares one byte less than the escaped overlay xattr prefix length. This makes it match "trusted.overlay.overlay" without requiring the trailing dot. As a result, an xattr such as "trusted.overlay.overlayfoo" is incorrectly treated as an escaped overlay xattr. This can be reproduced by setting "trusted.overlay.overlayfoo" on a lower file and listing xattrs through an overlay mount. listxattr() then exposes it as "trusted.overlay.oo", and a following getxattr() on that listed name fails with ENODATA. Compare the full escaped prefix, including the trailing dot, so similarly-prefixed private xattrs are not misclassified. Fixes: `dad02fad84` ("ovl: Support escaped overlay.* xattrs") Signed-off-by: Yichong Chen <chenyichong@uniontech.com> Link: https://patch.msgid.link/20260708082221.633602-1-chenyichong@uniontech.com Reviewed-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>	2026-07-22 13:48:00 +02:00
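The off-by-one prefix match is easy to reproduce with `strncmp()`. In this sketch the macro `ESCAPED_PREFIX` and both function names are assumptions chosen for illustration; only the prefix string and the "one byte less than the prefix length" comparison come from the commit message.

```c
#include <stdbool.h>
#include <string.h>

/* Assumed spelling of the escaped prefix, including the trailing dot. */
#define ESCAPED_PREFIX "trusted.overlay.overlay."

/* Buggy: compares one byte less than the prefix, so the dot is ignored. */
static inline bool is_escaped_buggy(const char *name)
{
	return strncmp(name, ESCAPED_PREFIX, sizeof(ESCAPED_PREFIX) - 2) == 0;
}

/* Fixed: compares the full prefix (sizeof minus the terminating NUL). */
static inline bool is_escaped_fixed(const char *name)
{
	return strncmp(name, ESCAPED_PREFIX, sizeof(ESCAPED_PREFIX) - 1) == 0;
}
```

The buggy comparison accepts "trusted.overlay.overlayfoo", while the fixed one accepts only names that continue after the dot, such as "trusted.overlay.overlay.foo".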
Frank Sorenson	c2f2e83e3b	cifs: fix cifsFileInfo leak on kmalloc failure in deferred close drain paths In cifs_close_deferred_file(), cifs_close_all_deferred_files(), and cifs_close_deferred_file_under_dentry(), when a pending deferred close is cancelled via cancel_delayed_work(), the subsequent kmalloc_obj() to add the file to the local processing list may fail under memory pressure. The loop breaks immediately, but the cancelled work is no longer pending (it would have called _cifsFileInfo_put()), and the cfile is never added to file_head for processing. The cifsFileInfo reference and the open server handle both leak. Fix by saving the cfile that failed allocation in a local variable, breaking as before, and calling _cifsFileInfo_put() on it after releasing the lock. Any files later in the iteration are unaffected since their deferred work is still pending and will fire normally. Fixes: `e3fc065682` ("cifs: Deferred close performance improvements") Signed-off-by: Frank Sorenson <sorenson@redhat.com> Signed-off-by: Steve French <stfrench@microsoft.com>	2026-07-21 20:19:11 -05:00
Frank Sorenson	e8a8d54c2d	cifs: prevent readdir from changing file size due to stale directory metadata Windows Server's directory enumeration metadata lags behind the actual file size after a write+close or rename. A concurrent readdir() in the window between close() returning to userspace and stat() being called overwrites the correct cached i_size with the stale server value, causing stat() to return the wrong size. Once _cifsFileInfo_put() removes the last writable handle from openFileList, is_size_safe_to_change() permits readdir to overwrite i_size. smb2_close_getattr() then stamps cifs_i->time = jiffies, making the corrupt cached value appear fresh to the next stat(). The existing check (see Fixes:) only blocked stale size updates while an active RW lease was held, not after the last writable handle closes. Add cifsInodeInfo->time_last_write, written via smp_store_release() at writable close and on setattr/truncate. is_size_safe_to_change() checks is_inode_writable() first (acquiring open_file_lock), then rejects a readdir size update if time_last_write falls within acregmax jiffies. The spinlock release in _cifsFileInfo_put() forms a store-release barrier that pairs with the spin_lock() (load-acquire) in is_inode_writable(), ensuring the subsequent smp_load_acquire() on time_last_write observes any update from a concurrent close(). When a size update is rejected and the server value differs from the cached one, cifs_i->time is cleared to force a fresh QUERY_INFO on the next stat(). readdir is also blocked from changing i_size while writable handles are open or an RW lease is held, even on direct-IO mounts. For deferred close (closetimeo > 0), time_last_write is refreshed at the actual server close in smb2_deferred_work_close() and in the cifs_close_deferred_file*() drain paths invoked by lease/oplock breaks and tcon teardown, anchoring the protection window to the real close time rather than the earlier userspace close. 
time_last_write == 0 skips the time_before() check to avoid false positives near boot on 32-bit systems where jiffies starts close to INITIAL_JIFFIES. Does not reproduce against Samba or with actimeo=0. Fixes: `e4b61f3b1c` ("cifs: prevent updating file size from server if we have a read/write lease") Signed-off-by: Frank Sorenson <sorenson@redhat.com> Signed-off-by: Steve French <stfrench@microsoft.com>	2026-07-21 20:19:03 -05:00
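The window check with a zero sentinel can be sketched in userspace as follows. This is an assumption-laden illustration, not the cifs implementation: `ticks_before()` mirrors the wrap-safe comparison idea behind the kernel's `time_before()`, and `recently_written()` and its parameters are hypothetical names for the "time_last_write falls within acregmax jiffies" test.

```c
#include <stdbool.h>

/* Wrap-safe "a is before b" for a free-running tick counter. */
static inline bool ticks_before(unsigned long a, unsigned long b)
{
	return (long)(a - b) < 0;
}

/*
 * Return true if a write was recorded less than `window` ticks ago.
 * last_write == 0 means "never recorded" and skips the comparison, so a
 * counter value close to zero cannot produce a false positive.
 */
static inline bool recently_written(unsigned long now,
				    unsigned long last_write,
				    unsigned long window)
{
	if (last_write == 0)
		return false;
	return ticks_before(now, last_write + window);
}
```

The comparison stays correct when the counter wraps between the write and the check, and an unset timestamp never blocks a size update even when `now` is smaller than the window.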
Carl Johnson	2eb74eef4b	smb: client: handle STATUS_STOPPED_ON_SYMLINK responses without a symlink target The macOS built-in SMB server returns STATUS_STOPPED_ON_SYMLINK for a CREATE on a path whose final component is a symlink, but it does not include a Symbolic Link Error Response in the error data: both ErrorContextCount and ByteCount are zero, so the symlink target is not present in the response at all. Per [MS-SMB2] section 2.2.2 such a response should carry a valid Symbolic Link Error Response, so this is a server bug, but the target can still be retrieved with FSCTL_GET_REPARSE_POINT. Frame from a capture against macOS 26.5.2 (build 25F84): SMB2 hdr : Status=0x8000002d STATUS_STOPPED_ON_SYMLINK, Cmd=Create Error Rsp: StructureSize=0x0009 Error Context Count: 0 Byte Count: 0 Error Data: 00 symlink_data() cannot find a struct smb2_symlink_err_rsp in such a response and returns -EINVAL, which parse_create_response() propagates, so smb2_query_path_info() bails out at if (rc || !data->reparse_point) goto out; before it can retry with SMB2_OP_GET_REPARSE. stat(), readlink() and ls of any server-side symlink then fail with -EINVAL: $ ls -la Config l????????? ? ? ? ? ? Config.json $ stat Config/Config.json stat: cannot statx 'Config/Config.json': Invalid argument A 5.10 client resolves these symlinks correctly against the same server and share, so this is a regression for Apple SMB servers. Handle it in several places: - symlink_data() detects the empty response (ErrorContextCount and ByteCount both zero) and returns a distinct -ENODATA, so that "server did not send the target" can be told apart from a genuinely malformed response and only this case is worked around. - parse_create_response() treats -ENODATA like STATUS_IO_REPARSE_TAG_NOT_HANDLED, which does not carry the target either: leave the reparse tag unset and clear rc, so the existing SMB2_OP_GET_REPARSE path retrieves the target. 
- smb2_query_path_info() only fixes up the symlink target type when the target is already known. SMB2_OP_GET_REPARSE sets data->reparse.tag but does not parse the target out of the reparse buffer; that happens later, in reparse_info_to_fattr(). Without this check smb2_fix_symlink_target_type() is called with a NULL target and returns -EIO. This could not happen with servers that send the target inline and therefore skip SMB2_OP_GET_REPARSE. - smb2_open_file() maps -ENODATA to -EIO, matching STATUS_IO_REPARSE_TAG_NOT_HANDLED, so its callers retrieve the target with SMB2_OP_GET_REPARSE as well. Tested on Debian 13, kernel 6.18.38 (armv7), against macOS 26.5.2: symlinks now resolve, including relative, parent-traversing and directory symlinks, and reads through symlinks succeed. Cc: stable@vger.kernel.org Co-developed-by: Pali Rohár <pali@kernel.org> Signed-off-by: Pali Rohár <pali@kernel.org> Signed-off-by: Carl Johnson <carl@jpartners.org> Signed-off-by: Steve French <stfrench@microsoft.com>	2026-07-21 15:11:07 -05:00
Amir Goldstein	4b9a5458d0	fs: preserve ACL_DONT_CACHE state in forget_cached_acl() The ACL_DONT_CACHE state is meant to be a constant state for the inode for filesystems that want to opt out of posix acl caching. Commit `facd61053c` ("fuse: fixes after adapting to new posix acl api") used this facility to opt out of posix acl caching for fuse inodes with a fuse server that does not negotiate FUSE_POSIX_ACL (fc->posix_acl). The commit also takes care to gate the forget_all_cached_acls() call in fuse_set_acl() on fc->posix_acl because there is no need for it, but there are other places in fuse code which call forget_all_cached_acls() regardless of fc->posix_acl and those cause the loss of the ACL_DONT_CACHE state. This is not only a functional bug. Properly timed, a get_acl() from this fuse filesystem can return a stale cached value, as was observed in tests, because set_acl() does not invalidate the unintentional acl cache. We could fix this in fuse, but it actually makes no sense for the vfs helper forget_cached_acl() to invalidate the ACL_DONT_CACHE state, so let it not do that to fix fuse and future users of ACL_DONT_CACHE. Fixes: `facd61053c` ("fuse: fixes after adapting to new posix acl api") Cc: stable@vger.kernel.org Signed-off-by: Amir Goldstein <amir73il@gmail.com> Link: https://patch.msgid.link/20260713220932.413004-2-amir73il@gmail.com Reviewed-by: Luis Henriques <luis@igalia.com> Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>	2026-07-21 21:52:51 +02:00
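The effect described in the commit above can be illustrated with a small user-space sketch. It is not the kernel helper: the names `sk_acl`, `SK_NOT_CACHED`, `SK_DONT_CACHE`, `forget_buggy()` and `forget_fixed()` are hypothetical, the sentinel values are chosen for illustration only, and the sketch is single-threaded and ignores the atomics and reference counting a real cache slot would need.

```c
#include <stdint.h>

/* Hypothetical stand-in for a cached ACL object. */
struct sk_acl { int refs; };

/* Illustrative sentinels stored in the cache slot instead of a real pointer. */
#define SK_NOT_CACHED ((struct sk_acl *)(intptr_t)-1)	/* nothing cached yet */
#define SK_DONT_CACHE ((struct sk_acl *)(intptr_t)-3)	/* permanent opt-out */

/* Buggy shape: every state, including the opt-out, becomes "not cached". */
static void forget_buggy(struct sk_acl **slot)
{
	*slot = SK_NOT_CACHED;
}

/* Fixed shape: leave the opt-out sentinel alone, reset everything else. */
static void forget_fixed(struct sk_acl **slot)
{
	if (*slot == SK_DONT_CACHE)
		return;
	*slot = SK_NOT_CACHED;
}

/* Returns 1 if a slot that opted out is still opted out after forget(). */
static int opt_out_survives(void (*forget)(struct sk_acl **))
{
	struct sk_acl *slot = SK_DONT_CACHE;

	forget(&slot);
	return slot == SK_DONT_CACHE;
}
```

With these definitions, `opt_out_survives(forget_buggy)` reports that the opt-out was lost, which is the state in which a later lookup may start populating the cache, while `opt_out_survives(forget_fixed)` reports that it was kept.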
Linus Torvalds	5a52217525	nfsd-7.2 fixes: Issues reported with v7.2-rc: - Fix issue with NLMv3 GRANTED_MSG introduced in v7.2 -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEKLLlsBKG3yQ88j7+M2qzM29mf5cFAmpftS0ACgkQM2qzM29m f5epbA/+OdiOdpnmxMClc7tKcvNlbPzDQxM8yi1LCKWD5IYbDzdtZatnfHsjGkIw pYtoDTrvlJC9nC8LyNu2eaj8BQXM8IYNNDlQ/m2zLWQ+5E59JUwBPzZyLCMBRHb+ l7h7SCsCb/6S6+sn0/iAi7DFvlreDPNOMCSlArmulmRZRq99JbIOfbHg02qXK3bx huPNMeU/3y9GZccqHRz6l8joxCGYn2iXlFIs8wex+MecifmhfKhyAhjA5QP6ZbjL vCkluLrlPy+nM/ZjESx7UgKZQYkwnq9zitYNakIr7I53wrPci7z/6ZzCucEEtwSN q+Zl+8fmLfr1MyicNn30hVFuZ7Q04+K3GjaV4beuSj5YGe5lI1rlNultQBGSK3WZ 1GmdOo7XiKQ2x4k37HXf38DxKFYYCJp//HYXKRHnjOhKIaXDaMqj1Qq8H0fyElPG txIFm3ZNLgUyoULihpA8sx/j/mCcub3GIu/yplXFCBOwMKf0fKFPdY9jTWU93cDN krIhucg6RGWMoHQ8WnWDY1/+B5NrAska4/RHzqLhSESiLAUOtgLCQHPsJSufqf3h +Qbf6XTM4G/ABEXkOB4P515fcNYxHx6v0/qkyUdESHGLFrAuP2leSAWG1oKmeBDn HUYKaJvjZcx+Tvd5xhXm9jPBBkwfzDFhURm4IZ0rLKpDu489BIE= =dW6d -----END PGP SIGNATURE----- Merge tag 'nfsd-7.2-2' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux Pull nfsd fix from Chuck Lever: - Fix issue with NLMv3 GRANTED_MSG introduced in v7.2 * tag 'nfsd-7.2-2' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux: lockd: fix NLMv3 GRANTED_MSG handling	2026-07-21 11:46:15 -07:00
Linus Torvalds	51f247c4b2	for-7.2-rc4-tag -----BEGIN PGP SIGNATURE----- iQJPBAABCgA5FiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmpdukAbFIAAAAAABAAO bWFudTIsMi41KzEuMTIsMiwyAAoJEMVl1fnXbVg70hwQAJ78AUVnvTSwMi2w6TEx KFYZSMPdBi9TwGC3lDCcTA/U5PqGTkKACWBLf5z+9YXAcb3By7OiSAQeko7sib0X k0jz5SmJjMHrHe1qGQvaztLin9/Ow9K8/QUQW0qhunwF6FbmgkfWdXw4au5+zV5R rwdyDt74nlhf3pCOpJb1JL2BFy23PXbfQIty9I+2qKUMPSgqhC6l7CBYsWZoG5ub QC2+Go5Ygz61iKDpAZ3TKjULI58+7K3FBq+76Jr9UajPVD75637Wze3zsrgcb2LV THxITsWTxx5CwiQF1Sf0JGo/rU+V0wU5oJqwmn4FrGeZXhnGXXcnXmG70VzvAJYX 4so+p0aejefqQnwcFbQE0+dFjrEuD+Dl8PVzlDSMygz/y1xpSlxVz+83kQFH+MV6 1bxrOJNS6sLwvqtwRa8INjqvMfk443Ub5sFUS+bav0ZUlOg3BmitkEP0H0NlQ+5p CAMCXFSfb/1UDXMrMHAcZn3LZtDLNaVVkneCz97A72D5C6Ae4rL3/FeHvmDk1KUh StO3ReF4JXeN3FYTzWpMIw07jGxR9ZuFVn78DAW9Gwjk658wsXBqcGPgg/GAsy6R bvrFRkWiJGSSdUeipNwxjfTeM8NwkA7fv+DnLbp6HzpOLDKIC8pID1Eoh1nwTlas R073SeoNHO0sC6Lz+2NcvVMb =1jNx -----END PGP SIGNATURE----- Merge tag 'for-7.2-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: "I'm catching up with the fix backlog in the development branch, so here's a number of them and will probably send one more for this or the next rc: - relocation fixes: - skip attempting compression on reloc inodes - exclude inline extents from file extent offset checks - fix minor memory leak after error when adding reloc root - fix root cleanup after inserting and merging - fix clearing folio tags after writeback - clear logging flag of extent map before splitting - fix unsigned 32/64 type conversions when accounting dirty metadata, leading to continually exceeding threshold - fix regression in 32bit compat ioctl for subvolume info - fix type of SEARCH_TREE ioctl buffer in UAPI header - fix expression in ASSERT expression which can be unconditionally evaluated on some compilers - only account delalloc bytes for regular inodes" * tag 'for-7.2-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: fix GET_SUBVOL_INFO after 
compat refactor btrfs: free mapping node on duplicate reloc root insert btrfs: fix a regression where PAGECACHE_TAG_DIRTY is never cleared btrfs: don't propagate EXTENT_FLAG_LOGGING to split extent maps btrfs: fix u32 to s64 type conversion in dirty_metadata_bytes accounting btrfs: fix NULL pointer deref during assertion in btrfs_backref_free_node() btrfs: only account delalloc bytes for regular file inodes in btrfs_getattr() btrfs: reject inline file extents item in get_new_location() btrfs: do not try compression for data reloc inodes btrfs: declare btrfs_ioctl_search_args_v2::buf as __u8 btrfs: fix reloc root cleanup in merge_reloc_roots() btrfs: fix use-after-free on reloc root after error in insert_dirty_subvol()	2026-07-21 08:06:24 -07:00
Christian Brauner	3349ef6a36	binfmt_elf_fdpic: only honour the first PT_INTERP The program header scan handles PT_INTERP from a switch nested in the scan loop, so its break leaves the switch and not the loop. A binary carrying more than one PT_INTERP runs the case again and overwrites both interpreter_name and interpreter. The previous name allocation leaks and so does the previous interpreter reference, along with the write denial open_exec() took on it. The denial is never released, so the file stays unwritable for as long as the system runs. An unprivileged caller reaches this with a crafted binary and repeats it at will. binfmt_elf stops at the first PT_INTERP. Do the same here. The flaw dates back to the driver's introduction in the pre-git history tree introduced in v2.6.11 by 91808d6ebe39 ("[PATCH] FRV: Add FDPIC ELF binary format driver"). Link: https://patch.msgid.link/20260721-gezittert-medium-kreide-b41fc1f0277e@brauner Fixes: `1da177e4c3` ("Linux-2.6.12-rc2") Cc: stable@vger.kernel.org Reviewed-by: Jori Koolstra <jkoolstra@xs4all.nl> Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>	2026-07-21 13:22:34 +02:00
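The control-flow point in the commit above, that a `break` inside a `switch` nested in a loop leaves only the switch, can be sketched in isolation. The segment type values, `sample_types`, `scan_all()` and `scan_first_only()` are hypothetical and stand in for the program header scan; this shows one way to stop at the first match, not necessarily how the actual patch does it.

```c
#include <stddef.h>

enum { SEG_LOAD = 1, SEG_INTERP = 3 };	/* hypothetical segment types */

/* Hypothetical header table carrying two interpreter entries. */
static const int sample_types[] = { SEG_LOAD, SEG_INTERP, SEG_LOAD, SEG_INTERP };

/* Buggy shape: returns how many times the interpreter case body ran. */
static int scan_all(const int *types, size_t n)
{
	int opened = 0;

	for (size_t i = 0; i < n; i++) {
		switch (types[i]) {
		case SEG_INTERP:
			opened++;	/* a second hit would overwrite the first */
			break;		/* leaves the switch, not the for loop */
		default:
			break;
		}
	}
	return opened;
}

/* Fixed shape: without an enclosing switch, break ends the scan itself. */
static int scan_first_only(const int *types, size_t n)
{
	int opened = 0;

	for (size_t i = 0; i < n; i++) {
		if (types[i] != SEG_INTERP)
			continue;
		opened++;
		break;
	}
	return opened;
}
```

Called on `sample_types`, `scan_all()` runs the interpreter case once per matching entry, whereas `scan_first_only()` runs it once and stops.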
Christian Brauner	16cc4f5c1c	exec: fix unsigned loop counter wrap in transfer_args_to_stack() The stop value is derived from bprm->p >> PAGE_SHIFT. The index variable is an unsigned long. If bprm->p drops below PAGE_SIZE and stop becomes zero the loop condition index >= stop is always true. After the index == 0 iteration the decrement wraps to ULONG_MAX and bprm->page[ULONG_MAX] reads sizeof(void *) bytes in front of the array. The pointer has wrapped to -1. That garbage pointer is then passed to kmap_local_page() and PAGE_SIZE bytes are copied from wherever that lands into the stack of the process being created. And the loop doesn't terminate either... Getting there only requires bprm->p < PAGE_SIZE. On !MMU bprm_set_stack_limit() and bprm_hit_stack_limit() are empty. So the only constraint on how far bprm->p is pushed down is valid_arg_len(), i.e. that each individual string still fits in what is left. bprm->p starts at PAGE_SIZE * MAX_ARG_PAGES - sizeof(void *) so a single argument or environment string of a little over 31 pages leaves it in the first page: Oops - load access fault [#1] CPU: 0 UID: 0 PID: 1 Comm: victim Not tainted 7.2.0-rc4 #1 epc : __memcpy+0xd4/0xf8 ra : transfer_args_to_stack+0xaa/0xae s4 : ffffffffffffffff s2 : 0000000000000000 a1 : ffffffdc98000000 a2 : 0000000000001000 status: 0000000a00001880 badaddr: ffffffdc98000000 cause: 0000000000000005 [<801a5324>] __memcpy+0xd4/0xf8 [<800d5f6a>] load_flat_binary+0x43a/0x65e [<800a2de4>] bprm_execve+0x1d4/0x316 [<800a351a>] do_execveat_common+0x12e/0x138 [<800a3d44>] __riscv_sys_execve+0x38/0x4e Kernel panic - not syncing: Fatal exception in interrupt This is an arcane bug but we should still fix it. Count down from MAX_ARG_PAGES so the loop ends when index reaches stop, stop == 0 included. The iterations performed are unchanged for every other value of stop. Only CONFIG_MMU=n builds are affected, transfer_args_to_stack() is used by binfmt_flat and binfmt_elf_fdpic on nommu only. 
The loop predates git history. commit `7e7ec6a934` ("elf_fdpic_transfer_args_to_stack(): make it generic") only moved it from binfmt_elf_fdpic.c into fs/exec.c and narrowed the copy to the used part of the first page. The condition and the decrement are unchanged from 2.6.12-rc2. Link: https://patch.msgid.link/20260721-hochachtung-staumauer-pigmente-15d71f7d7d04@brauner Fixes: `1da177e4c3` ("Linux-2.6.12-rc2") Cc: stable@vger.kernel.org Reviewed-by: David Hildenbrand (Arm) <david@kernel.org> Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>	2026-07-21 13:00:49 +02:00
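The wrap described in the commit above can be reproduced in user space with a minimal sketch. `NPAGES`, `buggy_loop_wraps()` and `fixed_loop_count()` are hypothetical names, `NPAGES` is only a stand-in for MAX_ARG_PAGES, and the loop bodies copy nothing; the sketch assumes nothing about the real function beyond the loop shapes the commit message describes.

```c
#include <stddef.h>

#define NPAGES 32UL	/* hypothetical stand-in for MAX_ARG_PAGES */

/*
 * Buggy shape: with an unsigned index, "index >= stop" is always true
 * when stop == 0, so the decrement after index == 0 wraps to ULONG_MAX.
 * Returns 1 as soon as the index leaves the valid range, 0 otherwise.
 */
static int buggy_loop_wraps(unsigned long stop)
{
	unsigned long index;

	for (index = NPAGES - 1; index >= stop; index--) {
		if (index >= NPAGES)
			return 1;	/* wrapped: would index out of bounds */
	}
	return 0;
}

/*
 * Fixed shape: start at NPAGES and decrement before use, so the loop
 * ends when index reaches stop, stop == 0 included.
 * Returns the number of pages visited.
 */
static unsigned long fixed_loop_count(unsigned long stop)
{
	unsigned long index = NPAGES;
	unsigned long visited = 0;

	while (index > stop) {
		index--;
		visited++;	/* a real loop would copy page[index] here */
	}
	return visited;
}
```

For a non-zero stop both shapes visit the same pages, from `NPAGES - 1` down to `stop`; only the `stop == 0` case differs, where the buggy shape runs past index zero and the fixed shape stops after visiting all `NPAGES` pages.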
Christian Brauner	bbf5f63991	binfmt_misc: set have_execfd only once the interpreter is opened load_misc_binary() raises bprm->have_execfd as soon as it sees the 'O' (or 'C') flag. This happens well before it opens the interpreter. If that open fails the flag stays set on the bprm. binfmt_misc is at the head of the format list so an interpreter open failure that returns -ENOEXEC lets the search fall through to a later format. This means it runs the matched binary directly having never staged an interpreter. So bprm->executable is NULL while have_execfd falsely claims a descriptor is present. Consequently, begin_new_exec() dereferences the missing executable: would_dump(bprm, bprm->executable); and NULL derefs. Had it not, the hand-off later in the same function would have failed anyway. FD_ADD(0, bprm->executable) rejects a NULL file with -ENOMEM. Both sites are past the point of no return so the exec cannot be unwound either way. This can be reached by unprivileged users as binfmt_misc can be mounted in user namespaces. So a user can register an 'O' entry whose interpreter lives on a FUSE mount, have the FUSE server fail the open with -ENOEXEC and execute a native ELF file that matches the entry. have_execfd only means anything alongside the executable it describes which is not set until the interpreter has been opened and staged. So let's raise it there, next to execfd_creds, which is already set at that point. An open failure now leaves it clear, so the fallback format derives credentials from the binary and emits no AT_EXECFD, as it would for any native exec. The argv rewrite load_misc_binary() performs before the open is still not undone. This means the binary sees the interpreter path in argv[0] and its own path in argv[1] but that predates this change and only became observable once the exec stopped faulting. 
Link: https://patch.msgid.link/20260720-beglichen-kognitiv-organismus-5e1e55326c56@brauner Fixes: `bc2bf338d5` ("exec: Remove recursion from search_binary_handler") Cc: stable@vger.kernel.org Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>	2026-07-20 23:34:15 +02:00
Linus Torvalds	b95f03f04d	12 hotfixes. 8 are cc:stable and the remainder address post-7.1 issues or aren't considered appropriate for backporting. 10 are for MM. All are singletons - please see the relevant changelogs for details. -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCal5rBQAKCRDdBJ7gKXxA jtgtAQCuWNUTCR7u+MzAuO3Nh46DxHXeb27OTHZL8JcazQTEQgD+NZfwqVYnNNX/ 4CVqqZvrXJQDg0aiWtIP4VdLirNh/Ac= =Gslp -----END PGP SIGNATURE----- Merge tag 'mm-hotfixes-stable-2026-07-20-11-37' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull misc fixes from Andrew Morton: "12 hotfixes. 8 are cc:stable and the remainder address post-7.1 issues or aren't considered appropriate for backporting. 10 are for MM. All are singletons - please see the relevant changelogs for details" * tag 'mm-hotfixes-stable-2026-07-20-11-37' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: mm/memory-failure: trace: change memory_failure_event to ras subsystem mm: page_reporting: allow driver to set batch capacity mm/kmemleak: fix checksum computation for per-cpu objects mm/damon/core: disallow overlapping input ranges for damon_set_regions() MAINTAINERS: add Usama as a THP reviewer fat: avoid stack overflow warning mm/damon/core: validate ranges in damon_set_regions() m68k: avoid -Wunused-but-set-parameter in clear_user_page() mm/huge_memory: set PG_has_hwpoisoned only after new folio head is established mm/page_vma_mapped: fix device-private PMD handling MAINTAINERS: s/SeongJae/SJ/ userfaultfd: prevent registration of special VMAs	2026-07-20 13:04:47 -07:00
Eric Biggers	6fe4e4b825	fscrypt: Avoid dynamic allocation in fscrypt_get_devices() When a blk_crypto_key starts being used or is evicted, fs/crypto/ calls fscrypt_get_devices() to get the filesystem's list of block devices, then iterates over them and calls blk_crypto_config_supported(), blk_crypto_start_using_key(), or blk_crypto_evict_key() on each one. Currently, the block device pointers are placed in a dynamically allocated array. This dynamic allocation is problematic because: - It can fail, especially at the fscrypt_destroy_inline_crypt_key() call site when it's invoked for inode eviction under direct reclaim. - fscrypt_destroy_inline_crypt_key() doesn't handle the failure. It just zeroizes and frees the blk_crypto_key without calling blk_crypto_evict_key(). That causes a use-after-free. For now, let's fix this in the straightforward and easily-backportable way by switching to an on-stack array. Currently the fscrypt multi-device functionality is used only by f2fs, which has a hardcoded limit of 8 block devices. An on-stack array works fine for that. (Of course, this solution won't scale up to large number of block devices. For that we'd need a different solution, like moving the block device iteration into the filesystem. Or in the case of btrfs, which will only support blk-crypto-fallback, we should make it just call blk-crypto-fallback directly, so the block devices won't be needed.) Fixes: `22e9947a4b` ("fscrypt: stop holding extra request_queue references") Cc: stable@vger.kernel.org Reported-by: Sashiko <sashiko-bot@kernel.org> Closes: https://sashiko.dev/#/patchset/20260713023708.9245-1-ebiggers%40kernel.org Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://patch.msgid.link/20260719055602.78828-1-ebiggers@kernel.org Signed-off-by: Eric Biggers <ebiggers@kernel.org>	2026-07-20 10:39:24 -07:00
Eric Biggers	b5fa40226e	fscrypt: Add missing superblock check in find_or_insert_direct_key() The legacy 'fscrypt_direct_keys' table caches master keys that are used by v1 encryption policies that have FSCRYPT_POLICY_FLAG_DIRECT_KEY. It's just a global table for all filesystems (since the keys can be provided by the legacy process-subscribed keyrings mechanism, which makes it difficult to reuse super_block::s_master_keys). The entries in it ('struct fscrypt_direct_key') do contain a super_block pointer, though, for passing to fscrypt_destroy_inline_crypt_key() when the last inode that references the key is evicted. However, when finding the fscrypt_direct_key for an inode, we weren't actually comparing the super_block pointer. As a result, inodes with different super_blocks could point to the same fscrypt_direct_key. That could extend the lifetime of a fscrypt_direct_key beyond the super_block it points to, causing a use-after-free later. Fix this by creating distinct fscrypt_direct_key structs for distinct super_block structs. Note that this problem doesn't exist in the v2 policy equivalent ("per-mode keys"), since the data structures there are per super_block. Fixes: `22e9947a4b` ("fscrypt: stop holding extra request_queue references") Cc: stable@vger.kernel.org Reported-by: Sashiko <sashiko-bot@kernel.org> Closes: https://sashiko.dev/#/patchset/20260717044303.425265-1-ebiggers%40kernel.org Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://patch.msgid.link/20260719033120.122120-1-ebiggers@kernel.org Signed-off-by: Eric Biggers <ebiggers@kernel.org>	2026-07-20 10:39:24 -07:00


106985 Commits