linux

mirror of https://github.com/torvalds/linux.git synced 2026-05-30 18:13:41 +02:00

Author	SHA1	Message	Date
Christoph Hellwig	c2257d9f63	xfs: add a separate tracepoint for stealing an open zone for GC The case where we have to reuse an already open zone warrants a different trace point vs the normal opening of a GC zone. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hans Holmberg <hans.holmberg@wdc.com> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2026-04-07 13:28:47 +02:00
Christoph Hellwig	c658488886	xfs: handle too many open zones when mounting When running on conventional zones or devices, the zoned allocator does not have a real write pointer, but instead fakes it up at mount time based on the last block recorded in the rmap. This can create spurious "open" zones when the last written blocks in a conventional zone are invalidated. Add a loop to the mount code to find the conventional zone with the highest used block in the rmap tree and "finish" it until we are below the open zones limit. While we're at it, also error out if there are too many open sequential zones, which can only happen when the user overrode the max open zones limit (or with really buggy hardware reducing the limit, but not much we can do about that). Fixes: `4e4d520755` ("xfs: add the zoned space allocator") Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hans Holmberg <hans.holmberg@wdc.com> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2026-04-07 13:16:59 +02:00
Christoph Hellwig	d02ee47bbe	xfs: use a lockref for the buffer reference count The lockref structure allows incrementing/decrementing counters like an atomic_t for the fast path, while still allowing complex slow path operations as if the counter was protected by a lock. The only slow path operations that actually need to take the lock are the final put, LRU evictions and marking a buffer stale. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2026-03-30 16:34:05 +02:00
Darrick J. Wong	e31c53a806	xfs: remove file_path tracepoint data The xfile/xmbuf shmem file descriptions are no longer as detailed as they were when online fsck was first merged, because moving to static strings in commit `60382993a2` ("xfs: get rid of the xchk_xfile_*_descr calls") removed a memory allocation and hence a source of failure. However this makes encoding the description in the tracepoints sort of a waste of memory. David Laight also points out that file_path doesn't zero the whole buffer which causes exposure of stale trace bytes, and Steven Rostedt wonders why we're not using a dynamic array for the file path. I don't think this is worth fixing, so let's just rip it out. Cc: rostedt@goodmis.org Cc: david.laight.linux@gmail.com Link: https://lore.kernel.org/linux-xfs/20260323172204.work.979-kees@kernel.org/ Cc: stable@vger.kernel.org # v6.11 Fixes: `19ebc8f84e` ("xfs: fix file_path handling in tracepoints") Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2026-03-26 14:25:23 +01:00
Yuto Ohnuki	79ef34ec05	xfs: avoid dereferencing log items after push callbacks After xfsaild_push_item() calls iop_push(), the log item may have been freed if the AIL lock was dropped during the push. Background inode reclaim or the dquot shrinker can free the log item while the AIL lock is not held, and the tracepoints in the switch statement dereference the log item after iop_push() returns. Fix this by capturing the log item type, flags, and LSN before calling xfsaild_push_item(), and introducing a new xfs_ail_push_class trace event class that takes these pre-captured values and the ailp pointer instead of the log item pointer. Reported-by: syzbot+652af2b3c5569c4ab63c@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=652af2b3c5569c4ab63c Fixes: `90c60e1640` ("xfs: xfs_iflush() is no longer necessary") Cc: stable@vger.kernel.org # v5.9 Signed-off-by: Yuto Ohnuki <ytohnuki@amazon.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2026-03-18 09:40:31 +01:00
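The fix described in 79ef34ec05 (capture what the tracepoint needs before calling a callback that may free the object) can be sketched outside the kernel. This is only an illustration under stated assumptions: `LogItem`, `push_and_trace`, and the `push`/`trace` callbacks are hypothetical names, not kernel APIs, and because Python has no use-after-free the `push` callback below just scrubs the item to stand in for it being freed.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class LogItem:
    """Hypothetical stand-in for a log item; not the kernel structure."""
    item_type: Optional[str]
    flags: Optional[int]
    lsn: Optional[int]


def push_and_trace(item: LogItem,
                   push: Callable[[LogItem], str],
                   trace: Callable[[str, str, int, int], None]) -> str:
    # Snapshot every field the trace needs while the item is known to be valid.
    item_type, flags, lsn = item.item_type, item.flags, item.lsn

    # The callback may invalidate the item (in the kernel: free it).
    result = push(item)

    # Trace using only the pre-captured values, never the item itself.
    trace(result, item_type, flags, lsn)
    return result


def scrubbing_push(item: LogItem) -> str:
    # Simulates the item being freed during the push.
    item.item_type = item.flags = item.lsn = None
    return "pushed"


traced = []
push_and_trace(LogItem("inode", 0x1, 42), scrubbing_push,
               lambda *args: traced.append(args))
```

The design point is that the trace call takes plain values rather than the item, so it stays correct regardless of what the callback does to the item.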
Carlos Maiolino	c04ed39d85	xfs: improve shortform attr performance [2/3] Improve performance of the xattr (and parent pointer) code when the attr structure is in short format and we can therefore perform all updates in a single transaction. Avoiding the attr intent code brings a very nice speedup in those operations. With a bit of luck, this should all go splendidly. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQQ2qTKExjcn+O1o2YRKO3ySh0YRpgUCaXbguQAKCRBKO3ySh0YR pkGhAP4q0606NWz+XcF+5f3KlehLBOnpmnozVvudVMCd1rCmpgD9HecarQThh0VI ZHo7LrKQpl+jrg0fhuKcbocQzxGNpgI= =2NAV -----END PGP SIGNATURE----- Merge tag 'attr-pptr-speedup-7.0_2026-01-25' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into xfs-7.0-merge Signed-off-by: Carlos Maiolino <cem@kernel.org>	2026-01-28 10:16:12 +01:00
Darrick J. Wong	eaec8aeff3	xfs: add a method to replace shortform attrs If we're trying to replace an xattr in a shortform attr structure and the old entry fits the new entry, we can just memcpy and exit without having to delete, compact, and re-add the entry (or worse use the attr intent machinery). For parent pointers this only advantages renaming where the filename length stays the same (e.g. mv autoexec.bat scandisk.exe) but for regular xattrs it might be useful for updating security labels and the like. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2026-01-23 09:27:36 -08:00
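The in-place replacement idea in eaec8aeff3 can be illustrated with a flat byte buffer. This is a minimal sketch, assuming that "fits" means the old and new values have the same length; `replace_in_place` and the buffer layout are hypothetical and do not reflect the on-disk shortform attr format.

```python
def replace_in_place(buf: bytearray, offset: int, old_len: int,
                     new_value: bytes) -> bool:
    """Overwrite buf[offset:offset + old_len] only if new_value is the same size.

    Returns False without touching buf when the sizes differ, so the caller
    can fall back to the slower delete, compact, and re-add path.
    """
    if len(new_value) != old_len:
        return False
    # Same-length slice assignment: the buffer is not resized and no other
    # entry moves, which is the analogue of a plain memcpy.
    buf[offset:offset + old_len] = new_value
    return True


# The commit's own example: both names are 12 bytes long.
entries = bytearray(b"<<autoexec.bat>>")
ok = replace_in_place(entries, 2, 12, b"scandisk.exe")
```

A rename to a name of a different length returns False here, which corresponds to the case the commit message says still needs the regular path.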
Darrick J. Wong	b8accfd65d	xfs: add media verification ioctl Add a new privileged ioctl so that xfs_scrub can ask the kernel to verify the media of the devices backing an xfs filesystem, and have any resulting media errors reported to fsnotify and xfs_healer. To accomplish this, the kernel allocates a folio between the base page size and 1MB, and issues read IOs to a gradually incrementing range of one of the storage devices underlying an xfs filesystem. If any error occurs, that raw error is reported to the calling process. If the error happens to be one of the ones that the kernel considers indicative of data loss, then it will also be reported to xfs_healthmon and fsnotify. Driving the verification from the kernel enables xfs (and by extension xfs_scrub) to have precise control over the size and error handling of IOs that are issued to the underlying block device, and to emit notifications about problems to other relevant kernel subsystems immediately. Note that the caller is also allowed to reduce the size of the IO and to ask for a relaxation period after each IO. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2026-01-20 18:06:52 -08:00
Darrick J. Wong	dfa8bad3a8	xfs: convey file I/O errors to the health monitor Connect the fserror reporting to the health monitor so that xfs can send events about file I/O errors to the xfs_healer daemon. These events are entirely informational because xfs cannot regenerate user data, so hopefully the fsnotify I/O error event gets noticed by the relevant management systems. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2026-01-20 18:06:50 -08:00
Darrick J. Wong	e76e0e3fc9	xfs: convey externally discovered fsdax media errors to the health monitor Connect the fsdax media failure notification code to the health monitor so that xfs can send events about that to the xfs_healer daemon. Later on we'll add the ability for the xfs_scrub media scan (phase 6) to report the errors that it finds to the kernel so that those are also logged by xfs_healer. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2026-01-20 18:06:49 -08:00
Darrick J. Wong	74c4795e50	xfs: convey filesystem shutdown events to the health monitor Connect the filesystem shutdown code to the health monitor so that xfs can send events about that to the xfs_healer daemon. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2026-01-20 18:06:48 -08:00
Darrick J. Wong	5eb4cb18e4	xfs: convey metadata health events to the health monitor Connect the filesystem metadata health event collection system to the health monitor so that xfs can send events to xfs_healer as it collects information. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2026-01-20 18:06:47 -08:00
Darrick J. Wong	25ca57fa36	xfs: convey filesystem unmount events to the health monitor In xfs_healthmon_unmount, send events to xfs_healer so that it knows that nothing further can be done for the filesystem. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2026-01-20 18:06:47 -08:00
Darrick J. Wong	b3a289a2a9	xfs: create event queuing, formatting, and discovery infrastructure Create the basic infrastructure that we need to report health events to userspace. We need a compact form for recording critical information about an event and queueing them; a means to notice that we've lost some events; and a means to format the events into something that userspace can handle. Make the kernel export C structures via read(). In a previous iteration of this new subsystem, I wanted to explore data exchange formats that are more flexible and easier for humans to read than C structures. The thought being that when we want to rev (or worse, enlarge) the event format, it ought to be trivially easy to do that in a way that doesn't break old userspace. I looked at formats such as protobufs and capnproto. These look really nice in that extending the wire format is fairly easy, you can give it a data schema and it generates the serialization code for you, handles endianness problems, etc. The huge downside is that neither support C all that well. Too hard, and didn't want to port either of those huge sprawling libraries first to the kernel and then again to xfsprogs. Then I thought, how about JSON? Javascript objects are human readable, the kernel can emit json without much fuss (it's all just strings!) and there are plenty of interpreters for python/rust/c/etc. There's a proposed schema format for json, which means that xfs can publish a description of the events that kernel will emit. Userspace consumers (e.g. xfsprogs/xfs_healer) can embed the same schema document and use it to validate the incoming events from the kernel, which means it can discard events that it doesn't understand, or garbage being emitted due to bugs. However, json has a huge crutch -- javascript is well known for its vague definitions of what are numbers. This makes expressing a large number rather fraught, because the runtime is free to represent a number in nearly any way it wants. 
Stupider ones will truncate values to word size, others will roll out doubles for uint52_t (yes, fifty-two) with the resulting loss of precision. Not good when you're dealing with discrete units. It just so happens that python's json library is smart enough to see a sequence of digits and put them in a u64 (at least on x86_64/aarch64) but an actual javascript interpreter (pasting into Firefox) isn't necessarily so clever. It turns out that none of the proposed json schemas were ever ratified even in an open-consensus way, so json blobs are still just loosely structured blobs. The parsing in userspace was also noticeably slow and memory-consumptive. Hence only the C interface survives. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2026-01-20 18:06:46 -08:00
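The number-precision concern in b3a289a2a9 is easy to reproduce. The sketch below is only an illustration: the field name `offset` is hypothetical, and `parse_int=float` is used to emulate a parser that stores every number as an IEEE 754 double, which represents integers exactly only up to 2**53.

```python
import json

# A 64-bit quantity just past the range a double can represent exactly.
offset = 2**53 + 1

text = json.dumps({"offset": offset})

# Python's json module turns a run of digits into an arbitrary-precision int.
exact = json.loads(text)["offset"]

# Emulate a double-only parser by routing integer literals through float().
lossy = json.loads(text, parse_int=float)["offset"]

print(exact == offset)       # the int round-trips unchanged
print(int(lossy) == offset)  # the double has been rounded to a neighbour
```

With a C structure the width of each field is fixed by its type, so this class of silent rounding does not arise.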
Christoph Hellwig	be665a4e27	xfs: don't use xlog_in_core_2_t in struct xlog_in_core Most accesses to the on-disk log record header are for the original xlog_rec_header. Make that the main structure, and cast for the single remaining place using other union legs. This prepares for removing xlog_in_core_2_t entirely. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2025-11-12 11:09:25 +01:00
Christoph Hellwig	d0f93c0d7c	xfs: xfs_qm_dqattach_one is never called with a non-NULL *IO_idqpp The caller already checks that, so replace the handling of this case with an assert that it does not happen. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2025-11-11 11:45:58 +01:00
Christoph Hellwig	6b6e6e7521	xfs: remove xfs_qm_dqput and optimize dropping dquot references With the new lockref-based dquot reference counting, there is no need to hold q_qlock for dropping the reference. Make xfs_qm_dqrele the main function to drop dquot references without taking q_qlock and convert all callers of xfs_qm_dqput to unlock q_qlock and call xfs_qm_dqrele instead. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2025-11-11 11:45:58 +01:00
Christoph Hellwig	0c5e80bd57	xfs: use a lockref for the xfs_dquot reference count The xfs_dquot structure currently uses the anti-pattern of using the in-object lock that protects the content to also serialize reference count updates for the structure, leading to a cumbersome free path. This is partially papered over by the fact that we never free the dquot directly but always through the LRU. Switch to use a lockref instead and move the reference counter manipulations out of q_qlock. To make this work, xfs_qm_flush_one and xfs_qm_flush_one are converted to acquire a dquot reference while flushing to integrate with the lockref "get if not dead" scheme. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2025-11-11 11:45:57 +01:00
Linus Torvalds	56e7b31071	vfs-6.18-rc1.inode -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCaNZQQgAKCRCRxhvAZXjc oud9AQD5IG4sNnzCjsvcTDpQkbX5eZW+LFIiAiiN+nztZ+OcRQEAvC2N7YovfqM3 TWpVoNDKvEPdtDc9ttFMUKqBZYvxvgE= =sEaL -----END PGP SIGNATURE----- Merge tag 'vfs-6.18-rc1.inode' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs inode updates from Christian Brauner: "This contains a series I originally wrote and that Eric brought over the finish line. It moves the i_crypt_info and i_verity_info pointers out of 'struct inode' and into the fs-specific part of the inode. So now the few filesystems that actually make use of this pay the price in their own private inode storage instead of forcing it upon every user of struct inode. The pointer for the crypt and verity info is simply found by storing an offset to its address in struct fsverity_operations and struct fscrypt_operations. This shrinks struct inode by 16 bytes. I hope to move a lot more out of it in the future so that struct inode becomes really just about very core stuff that we need, much like struct dentry and struct file, instead of the dumping ground it has become over the years. On top of this are various changes associated with the ongoing inode lifetime handling rework that multiple people are pushing forward: - Stop accessing inode->i_count directly in f2fs and gfs2. They simply should use the __iget() and iput() helpers - Make the i_state flags an enum - Rework the iput() logic Currently, if we are the last iput, and we have the I_DIRTY_TIME bit set, we will grab a reference on the inode again and then mark it dirty and then redo the put. 
This is to make sure we delay the time update for as long as possible We can rework this logic to simply dec i_count if it is not 1, and if it is do the time update while still holding the i_count reference Then we can replace the atomic_dec_and_lock with locking the ->i_lock and doing atomic_dec_and_test, since we did the atomic_add_unless above - Add an icount_read() helper and convert everyone that accesses inode->i_count directly for this purpose to use the helper - Expand dump_inode() to dump more information about an inode helping in debugging - Add some might_sleep() annotations to iput() and associated helpers" * tag 'vfs-6.18-rc1.inode' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: fs: add might_sleep() annotation to iput() and more fs: expand dump_inode() inode: fix whitespace issues fs: add an icount_read helper fs: rework iput logic fs: make the i_state flags an enum fs: stop accessing ->i_count directly in f2fs and gfs2 fsverity: check IS_VERITY() in fsverity_cleanup_inode() fs: remove inode::i_verity_info btrfs: move verity info pointer to fs-specific part of inode f2fs: move verity info pointer to fs-specific part of inode ext4: move verity info pointer to fs-specific part of inode fsverity: add support for info in fs-specific part of inode fs: remove inode::i_crypt_info ceph: move crypt info pointer to fs-specific part of inode ubifs: move crypt info pointer to fs-specific part of inode f2fs: move crypt info pointer to fs-specific part of inode ext4: move crypt info pointer to fs-specific part of inode fscrypt: add support for info in fs-specific part of inode fscrypt: replace raw loads of info pointer with helper function	2025-09-29 09:42:30 -07:00
Josef Bacik	37b27bd5d6	fs: add an icount_read helper Instead of doing direct access to ->i_count, add a helper to handle this. This will make it easier to convert i_count to a refcount later. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Link: https://lore.kernel.org/9bc62a84c6b9d6337781203f60837bd98fbc4a96.1756222464.git.josef@toxicpanda.com Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-09-01 12:41:09 +02:00
Christoph Hellwig	f76823e3b2	xfs: split xfs_zone_record_blocks xfs_zone_record_blocks not only records successfully written blocks that now back file data, but is also used for blocks speculatively written by garbage collection that were never linked to an inode and instantly become invalid. Split the latter functionality out to be easier to understand. This also makes it clear that we don't need to attach the rmap inode to a transaction for the skipped blocks case as we never dirty any persistent data structure. Also make the argument order to xfs_zone_record_blocks a bit more natural. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2025-08-11 14:04:20 +02:00
Steven Rostedt	75fe259ff7	xfs: remove unused trace event xfs_reflink_cow_enospc The call to the event xfs_reflink_cow_enospc was removed when the COW handling was merged into xfs_file_iomap_begin_delay, but the trace event itself was not. Remove it. Fixes: `db46e604ad` ("xfs: merge COW handling into xfs_file_iomap_begin_delay") Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2025-07-24 17:30:14 +02:00
Steven Rostedt	2b74404188	xfs: remove unused trace event xfs_discard_rtrelax The trace event xfs_discard_rtrelax was added but never used. Remove it. Fixes: `a330cae8a7` ("xfs: Remove header files which are included more than once") Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2025-07-24 17:30:14 +02:00
Steven Rostedt	3c4052cb9f	xfs: remove unused trace event xfs_log_cil_return The trace event xfs_log_cil_return was added but never used. Remove it. Fixes: `c1220522ef` ("xfs: grant heads track byte counts, not LSNs") Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2025-07-24 17:30:14 +02:00
Steven Rostedt	b9adb86b90	xfs: remove unused trace event xfs_dqreclaim_dirty The tracepoint trace_xfs_dqreclaim_dirty was removed with other code removed from xfs_qm_dquot_isolate() but the defined tracepoint was not. Fixes: `d62016b1a2` ("xfs: avoid dquot buffer pin deadlock") Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2025-07-24 17:30:14 +02:00
Christoph Hellwig	329b996d92	xfs: rename oz_write_pointer to oz_allocated This member just tracks how much space we handed out for sequential write required zones. Only for conventional space is it actually the pointer where things are written; otherwise zone append manages that. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2025-07-24 17:30:14 +02:00
Steven Rostedt	31b98ef240	xfs: only create event xfs_file_compat_ioctl when CONFIG_COMPAT is configured The trace event xfs_file_compat_ioctl is only used when CONFIG_COMPAT is configured in the build. As trace events can take up to 5K in memory for text and meta data regardless of whether they are used, they should not be created when unused. Add #ifdef CONFIG_COMPAT around the event so that it is only created when that is configured. Fixes: `cca28fb83d` ("xfs: split xfs_itrace_entry") Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2025-07-24 17:30:12 +02:00
Steven Rostedt	9a8a536fe5	xfs: remove unused xfs_end_io_direct events When the use of iomap_dio_rw was added, the calls to the trace events xfs_end_io_direct_unwritten and xfs_end_io_direct_append were removed but those trace events were not. As trace events can take up to 5K in memory for text and meta data regardless of whether they are used or not, they should not be created when not used. Remove the unused events. Fixes: `acdda3aae1` ("xfs: use iomap_dio_rw") Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2025-07-24 17:30:12 +02:00
Steven Rostedt	88fd451594	xfs: remove unused event xfs_pagecache_inval When the function xfs_flushinval_pages() was removed, it removed the only caller to the trace event xfs_pagecache_inval. As trace events can take up to 5K of memory in text and meta data each regardless if they are used or not, they should not be created when unused. Remove the unused event. Fixes: `fb59581404` ("xfs: remove xfs_flushinval_pages") Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2025-07-24 17:30:12 +02:00
Steven Rostedt	f110060559	xfs: remove unused event xfs_alloc_near_nominleft When the function xfs_alloc_space_available() was restructured, it removed the only calls to the trace event xfs_alloc_near_nominleft. As trace events take up to 5K of memory for text and meta data for each event, they should not be created when not used. Remove this unused event. Fixes: `54fee133ad` ("xfs: adjust allocation length in xfs_alloc_space_available") Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2025-07-24 17:30:12 +02:00
Steven Rostedt	237f8e8851	xfs: remove unused event xfs_alloc_near_error Trace events take up to 5K of memory in text and meta data regardless if they are used or not. The call to the event xfs_alloc_near_error was removed when the cursor data structure allocation was introduced. Remove it as it is no longer used and is just wasting memory. Fixes: `f5e7dbea1e` ("xfs: introduce allocation cursor data structure") Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2025-07-24 17:30:12 +02:00
Steven Rostedt	ea26bbc779	xfs: remove unused event xfs_attr_node_removename When xfs_attri_remove_iter() was removed, so was the call to the trace event xfs_attr_node_removename. As trace events can take up to 5K in memory for text and meta data regardless if they are used or not, they should not be created when unused. Remove the unused event. Fixes: `59782a236b` ("xfs: remove xfs_attri_remove_iter") Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2025-07-24 17:30:12 +02:00
Steven Rostedt	b54480c3b1	xfs: remove unused xfs_attr events Trace events can take up to 5K in memory for text and meta data per event regardless if they are used or not, so they should not be defined when not used. The events xfs_attr_fillstate and xfs_attr_refillstate are only called in code that is #ifdef out and exists only for future reference. Remove these unused events. If the code is needed again, then git history can recover what the events were. Suggested-by: Christoph Hellwig <hch@lst.de> Fixes: `59782a236b` ("xfs: remove xfs_attri_remove_iter") Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2025-07-24 17:30:12 +02:00
Steven Rostedt	b3b5015d34	xfs: remove unused trace event xfs_attr_rmtval_set When the function xfs_attr_rmtval_set() was removed, the call to the corresponding trace event was also removed but the trace event itself was not. As trace events can take up to 5K of memory in text and meta data regardless if they are used or not they should not be created when not used. Remove the unused trace event. Fixes: `0e6acf29db` ("xfs: Remove xfs_attr_rmtval_set") Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2025-07-24 17:30:12 +02:00
Steven Rostedt	8c54845c3a	xfs: remove unused xfs_reflink_compare_extents events When the clone/dedupe_file_range common functions were refactored, it removed the calls to the xfs_reflink_compare_extents and xfs_reflink_compare_extents_error events. As each event can take up to 5K in memory for text and meta data regardless of whether they are used or not, they should not be created if they are not used. Remove these unused events. Fixes: `876bec6f9b` ("vfs: refactor clone/dedupe_file_range common functions") Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2025-07-24 17:30:12 +02:00
Steven Rostedt	6f7080bd93	xfs: remove unused event xfs_ioctl_clone The trace event xfs_ioctl_clone was added but never used. As trace events can take up to 5K of memory in text and meta data regardless if they are used or not, remove the unused trace event. Fixes: `53aa1c34f4` ("xfs: define tracepoints for reflink activities") Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2025-07-24 17:30:12 +02:00
Steven Rostedt	32177ab8ba	xfs: remove unused event xlog_iclog_want_sync The trace event xlog_iclog_want_sync was added but never used. As trace events can take up around 5K of memory in text and meta data regardless if they are used or not, remove this unused event. Fixes: `956f6daa84` ("xfs: add iclog state trace events") Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2025-07-24 17:30:12 +02:00
Steven Rostedt	091e9451d0	xfs: remove unused trace event xfs_attr_remove_iter_return When the function xfs_attri_remove_iter was removed, it did not remove the trace event that it called. As a trace event can take up to 5K of memory for text and meta data regardless of if it is used or not, remove this unused trace event. Fixes: `59782a236b` ("xfs: remove xfs_attri_remove_iter") Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2025-07-24 17:30:12 +02:00
Christoph Hellwig	e4a7a3f9b2	xfs: refactor xfs_calc_atomic_write_unit_max This function and the helpers used by it duplicate the same logic for AGs and RTGs. Use the xfs_group_type enum to unify both variants. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: John Garry <john.g.garry@oracle.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2025-07-08 13:30:26 +02:00
Dave Chinner	fc48627b9c	xfs: add tracepoints for stale pinned inode state debug I needed more insight into how stale inodes were getting stuck on the AIL after a forced shutdown when running fsstress. These are the tracepoints I added for that purpose. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2025-06-27 14:14:37 +02:00
Dave Chinner	d62016b1a2	xfs: avoid dquot buffer pin deadlock On shutdown when quotas are enabled, the shutdown can deadlock trying to unpin the dquot buffer buf_log_item like so: [ 3319.483590] task:kworker/20:0H state:D stack:14360 pid:1962230 tgid:1962230 ppid:2 task_flags:0x4208060 flags:0x00004000 [ 3319.493966] Workqueue: xfs-log/dm-6 xlog_ioend_work [ 3319.498458] Call Trace: [ 3319.500800] <TASK> [ 3319.502809] __schedule+0x699/0xb70 [ 3319.512672] schedule+0x64/0xd0 [ 3319.515573] schedule_timeout+0x30/0xf0 [ 3319.528125] __down_common+0xc3/0x200 [ 3319.531488] __down+0x1d/0x30 [ 3319.534186] down+0x48/0x50 [ 3319.540501] xfs_buf_lock+0x3d/0xe0 [ 3319.543609] xfs_buf_item_unpin+0x85/0x1b0 [ 3319.547248] xlog_cil_committed+0x289/0x570 [ 3319.571411] xlog_cil_process_committed+0x6d/0x90 [ 3319.575590] xlog_state_shutdown_callbacks+0x52/0x110 [ 3319.580017] xlog_force_shutdown+0x169/0x1a0 [ 3319.583780] xlog_ioend_work+0x7c/0xb0 [ 3319.587049] process_scheduled_works+0x1d6/0x400 [ 3319.591127] worker_thread+0x202/0x2e0 [ 3319.594452] kthread+0x20c/0x240 The CIL push has seen the deadlock, so it has aborted the push and is running CIL checkpoint completion to abort all the items in the checkpoint. This calls ->iop_unpin(remove = true) to clean up the log items in the checkpoint. When a buffer log item is unpined like this, it needs to lock the buffer to run io completion to correctly fail the buffer and run all the required completions to fail attached log items as well. In this case, the attempt to lock the buffer on unpin is hanging because the buffer is already locked. I suspected a leaked XFS_BLI_HOLD state because of XFS_BLI_STALE handling changes I was testing, so I went looking for pin events on HOLD buffers and unpin events on locked buffer. That isolated this one buffer with these two events: xfs_buf_item_pin: dev 251:6 daddr 0xa910 bbcount 0x2 hold 2 pincount 0 lock 0 flags DONE\|KMEM recur 0 refcount 1 bliflags HOLD\|DIRTY\|LOGGED liflags DIRTY .... 
xfs_buf_item_unpin: dev 251:6 daddr 0xa910 bbcount 0x2 hold 4 pincount 1 lock 0 flags DONE\|KMEM recur 0 refcount 1 bliflags DIRTY liflags ABORTED Firstly, bbcount = 0x2, which means it is not a single sector structure. That rules out every xfs_trans_bhold() case except one: dquot buffers. Then hung task dumping gave this trace: [ 3197.312078] task:fsync-tester state:D stack:12080 pid:2051125 tgid:2051125 ppid:1643233 task_flags:0x400000 flags:0x00004002 [ 3197.323007] Call Trace: [ 3197.325581] <TASK> [ 3197.327727] __schedule+0x699/0xb70 [ 3197.334582] schedule+0x64/0xd0 [ 3197.337672] schedule_timeout+0x30/0xf0 [ 3197.350139] wait_for_completion+0xbd/0x180 [ 3197.354235] __flush_workqueue+0xef/0x4e0 [ 3197.362229] xlog_cil_force_seq+0xa0/0x300 [ 3197.374447] xfs_log_force+0x77/0x230 [ 3197.378015] xfs_qm_dqunpin_wait+0x49/0xf0 [ 3197.382010] xfs_qm_dqflush+0x55/0x460 [ 3197.385663] xfs_qm_dquot_isolate+0x29e/0x4d0 [ 3197.389977] __list_lru_walk_one+0x141/0x220 [ 3197.398867] list_lru_walk_one+0x10/0x20 [ 3197.402713] xfs_qm_shrink_scan+0x6a/0x100 [ 3197.406699] do_shrink_slab+0x18a/0x350 [ 3197.410512] shrink_slab+0xf7/0x430 [ 3197.413967] drop_slab+0x97/0xf0 [ 3197.417121] drop_caches_sysctl_handler+0x59/0xc0 [ 3197.421654] proc_sys_call_handler+0x18b/0x280 [ 3197.426050] proc_sys_write+0x13/0x20 [ 3197.429750] vfs_write+0x2b8/0x3e0 [ 3197.438532] ksys_write+0x7e/0xf0 [ 3197.441742] __x64_sys_write+0x1b/0x30 [ 3197.445363] x64_sys_call+0x2c72/0x2f60 [ 3197.449044] do_syscall_64+0x6c/0x140 [ 3197.456341] entry_SYSCALL_64_after_hwframe+0x76/0x7e Yup, another test run by check-parallel is running drop_caches concurrently and the dquot shrinker for the hung filesystem is running. That's trying to flush a dirty dquot from reclaim context, and it waiting on a log force to complete. xfs_qm_dqflush is called with the dquot buffer held locked, and so we've called xfs_log_force() with that buffer locked. 
Now the log force is waiting for a workqueue flush to complete, and that workqueue flush is waiting for CIL checkpoint processing to finish. The CIL checkpoint processing is aborting all the log items it has, and that requires locking aborted buffers to cancel them. Now, normally this isn't a problem if we are issuing a log force to unpin an object, because the ->iop_unpin() method wakes pin waiters first. That results in the pin waiter finishing off whatever it was doing, dropping the lock and then xfs_buf_item_unpin() can lock the buffer and fail it. However, xfs_qm_dqflush() is waiting on the -dquot- unpin event, not the dquot buffer unpin event, and so it never gets woken and so does not drop the buffer lock. Inodes do not have this problem, as they can only be written from one spot (->iop_push) whilst dquots can be written from multiple places (memory reclaim, ->iop_push, xfs_qm_dqpurge, and quotacheck). The reason that the dquot buffer has an attached buffer log item is that it has been recently allocated. Initialisation of the dquot buffer logs the buffer directly, thereby pinning it in memory. We then modify the dquot in a separate operation, and have memory reclaim racing with a shutdown and we trigger this deadlock. check-parallel reproduces this reliably on 1kB FSB filesystems with quota enabled because it does all of these things concurrently without having to explicitly write tests to exercise these corner case conditions. xfs_qm_dquot_logitem_push() doesn't have this deadlock because it checks if the dquot is pinned before locking the dquot buffer and skips it if it is pinned. This means the xfs_qm_dqunpin_wait() log force in xfs_qm_dqflush() never triggers and we unlock the buffer safely allowing a concurrent shutdown to fail the buffer appropriately. xfs_qm_dqpurge() could have this problem as it is called from quotacheck and we might have allocated dquot buffers when recording the quota updates. 
This can be fixed by calling xfs_qm_dqunpin_wait() before we lock the dquot buffer. Because we hold the dquot locked, nothing will be able to add to the pin count between the unpin_wait and the dqflush callout, so this now makes xfs_qm_dqpurge() safe against this race. xfs_qm_dquot_isolate() can also be fixed this same way but, quite frankly, we shouldn't be doing IO in memory reclaim context. If the dquot is pinned or dirty, simply rotate it and let memory reclaim come back to it later, same as we do for inodes. This then gets rid of the nasty issue in xfs_qm_flush_one() where quotacheck writeback races with memory reclaim flushing the dquots. We can lift xfs_qm_dqunpin_wait() up into this code, then get rid of the "can't get the dqflush lock" buffer write to cycle the dqflush lock and enable it to be flushed again, checking if the dquot is pinned and returning -EAGAIN so that the dquot walk will revisit the dquot again later. Finally, with xfs_qm_dqunpin_wait() lifted into all the callers, we can remove it from the xfs_qm_dqflush() code. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2025-06-27 14:14:37 +02:00
Darrick J. Wong	4528b90527	xfs: allow sysadmins to specify a maximum atomic write limit at mount time Introduce a mount option to allow sysadmins to specify the maximum size of an atomic write. If the filesystem can work with the supplied value, that becomes the new guaranteed maximum. The value mustn't be too big for the existing filesystem geometry (max write size, max AG/rtgroup size). We dynamically recompute the tr_atomic_write transaction reservation based on the given block size, check that the current log size isn't less than the new minimum log size constraints, and set a new maximum. The actual software atomic write max is still computed based off of tr_atomic_ioend the same way it has for the past few commits. Note also that xfs_calc_atomic_write_log_geometry is non-static because mkfs will need that. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: John Garry <john.g.garry@oracle.com> Reviewed-by: John Garry <john.g.garry@oracle.com>	2025-05-07 14:25:33 -07:00
John Garry	0c438dcc31	xfs: add xfs_calc_atomic_write_unit_max() Now that CoW-based atomic writes are supported, update the max size of an atomic write for the data device. The limit of a CoW-based atomic write will be the limit of the number of logitems which can fit into a single transaction. In addition, the max atomic write size needs to be aligned to the agsize. Limit the size of atomic writes to the greatest power-of-two factor of the agsize so that allocations for an atomic write will always be aligned compatibly with the alignment requirements of the storage. Function xfs_atomic_write_logitems() is added to find the limit on the number of log items which can fit in a single transaction. Amend the max atomic write computation to create a new transaction reservation type, and compute the maximum size of an atomic write completion (in fsblocks) based on this new transaction reservation. Initially, tr_atomic_write is a clone of tr_itruncate, which provides a reasonable level of parallelism. In the next patch, we'll add a mount option so that sysadmins can configure their own limits. [djwong: use a new reservation type for atomic write ioends, refactor group limit calculations] Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org> [jpg: rounddown power-of-2 always] Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: John Garry <john.g.garry@oracle.com>	2025-05-07 14:25:32 -07:00
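The "greatest power-of-two factor of the agsize" mentioned in the xfs_calc_atomic_write_unit_max() commit message can be illustrated with a small standalone sketch. This is not the kernel's code: the helper name `greatest_pow2_factor` and the 32-bit block count are assumptions made purely for illustration.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper (not taken from the kernel source): return the
 * largest power of two that divides agsize evenly. An atomic write no
 * larger than this value can always be placed on a boundary that is a
 * multiple of its own size within an allocation group of agsize blocks.
 */
static uint32_t greatest_pow2_factor(uint32_t agsize)
{
	/* Zero has no well-defined largest power-of-two factor. */
	assert(agsize != 0);

	/*
	 * Two's-complement negation flips every bit above the lowest set
	 * bit, so ANDing with the original isolates that lowest set bit,
	 * which is exactly the largest 2^k dividing agsize.
	 */
	return agsize & (~agsize + 1u);
}
```

For example, an agsize of 96 blocks (32 * 3) would cap the aligned atomic write size at 32 blocks under this sketch, while a power-of-two agsize is returned unchanged.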
John Garry	bd1d2c21d5	xfs: add xfs_atomic_write_cow_iomap_begin() For CoW-based atomic writes, reuse the infrastructure for reflink CoW fork support. Add ->iomap_begin() callback xfs_atomic_write_cow_iomap_begin() to create staging mappings in the CoW fork for atomic write updates. The general steps in the function are as follows: - find extent mapping in the CoW fork for the FS block range being written - if part or full extent is found, proceed to process found extent - if no extent found, map in new blocks to the CoW fork - convert unwritten blocks in extent if required - update iomap extent mapping and return The bulk of this function is quite similar to the processing in xfs_reflink_allocate_cow(), where we try to find an extent mapping; if none exists, then allocate a new extent in the CoW fork, convert unwritten blocks, and return a mapping. Performance testing has shown the XFS_ILOCK_EXCL locking to be quite a bottleneck, so this is an area which could be optimised in future. Christoph Hellwig contributed almost all of the code in xfs_atomic_write_cow_iomap_begin(). Reviewed-by: Darrick J. Wong <djwong@kernel.org> [djwong: add a new xfs_can_sw_atomic_write to convey intent better] Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: John Garry <john.g.garry@oracle.com>	2025-05-07 14:25:31 -07:00
Christoph Hellwig	89ce287c83	xfs: trace what memory backs a buffer Add three trace points for the different backing memory allocators for buffers. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2025-03-10 14:29:44 +01:00
Christoph Hellwig	058dd70c65	xfs: implement buffered writes to zoned RT devices Implement buffered writes including page faults and block zeroing for zoned RT devices. Buffered writes to zoned RT devices are split into three phases: 1) a reservation for the worst case data block usage is taken before acquiring the iolock. When there are enough free blocks but not enough available ones, garbage collection is kicked off to free the space before continuing with the write. If there isn't enough freeable space, the block reservation is reduced and a short write will happen as expected by normal Linux write semantics. 2) with the iolock held, the generic iomap buffered write code is called, which through the iomap_begin operation usually just inserts delalloc extents for the range in a single iteration. Only for overwrites of existing data that are not block aligned, or for zeroing operations, is the existing extent mapping read to fill out the srcmap and to figure out if zeroing is required. 3) the ->map_blocks callback to the generic iomap writeback code calls into the zoned space allocator to actually allocate on-disk space for the range before kicking off the writeback. Note that because all writes are out of place, truncate or hole punches that are not aligned to block size boundaries need to allocate space. For block zeroing from truncate, ->setattr is called with the iolock (aka i_rwsem) already held, so a hacky deviation from the above scheme is needed. In this case the space reservation is taken with the iolock held, but is required not to block and can dip into the reserved block pool. This can lead to -ENOSPC when truncating a file, which is unfortunate. But fixing the calling conventions in the VFS is probably much easier with code requiring it already in mainline. Similarly because all writes are out of place, the zoned allocator can't support unwritten extents and thus the FALLOC_FL_ALLOCATE_RANGE range mode of fallocate. 
Other fallocate modes that would reserve space but don't need to in order to provide proper semantics do work but do not reserve space. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>	2025-03-03 08:17:07 -07:00
Christoph Hellwig	080d01c41d	xfs: implement zoned garbage collection RT groups on a zoned file system need to be completely empty before their space can be reused. This means that partially empty groups need to be emptied entirely to free up space if no entirely free groups are available. Add a garbage collection thread that moves all data out of the least used zone when not enough free zones are available, and which resets all zones that have been emptied. To find a zone to empty, a simple set of 10 buckets based on the amount of space used in the zone is used. To empty zones, the rmap is walked to find the owners and the data is read and then written to the new place. To automatically defragment files the rmap records are sorted by inode and logical offset. This means defragmentation of parallel writes into a single zone happens automatically when performing garbage collection. Because holding the iolock over the entire GC cycle would inject very noticeable latency for other accesses to the inodes, the iolock is not taken while performing I/O. Instead the I/O completion handler checks that the mapping hasn't changed over the one recorded at the start of the GC cycle and doesn't update the mapping if it changed. Co-developed-by: Hans Holmberg <hans.holmberg@wdc.com> Signed-off-by: Hans Holmberg <hans.holmberg@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>	2025-03-03 08:17:07 -07:00
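The bucket scheme described in the zoned garbage collection commit message can be sketched roughly as follows. Only the count of 10 buckets and the "least used zone first" policy come from the message; the names `used_bucket` and `pick_victim_bucket`, the even 10% split per bucket, and the per-bucket zone counters are assumptions for illustration and may differ from what the kernel actually does.

```c
#include <assert.h>
#include <stdint.h>

#define NR_BUCKETS 10	/* the commit message mentions 10 buckets */

/*
 * Hypothetical mapping (an assumption, not the kernel's code): bucket i
 * covers zones whose used space is in [i * 10%, (i + 1) * 10%) of the
 * zone capacity. A completely full zone is clamped into the last bucket.
 */
static unsigned int used_bucket(uint64_t used_blocks, uint64_t zone_blocks)
{
	unsigned int bucket;

	assert(zone_blocks > 0 && used_blocks <= zone_blocks);

	bucket = (unsigned int)(used_blocks * NR_BUCKETS / zone_blocks);
	return bucket < NR_BUCKETS ? bucket : NR_BUCKETS - 1;
}

/*
 * Return the index of the lowest non-empty bucket, i.e. the bucket that
 * holds the least used zones and therefore the cheapest GC victims, or
 * -1 if no zone is tracked at all.
 */
static int pick_victim_bucket(const unsigned int zones_in_bucket[NR_BUCKETS])
{
	for (int i = 0; i < NR_BUCKETS; i++) {
		if (zones_in_bucket[i] > 0)
			return i;
	}
	return -1;
}
```

Scanning a fixed number of buckets keeps victim selection cheap regardless of the zone count, at the cost of only approximating the least used zone within a 10% band.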
Christoph Hellwig	0bb2193056	xfs: add support for zoned space reservations For zoned file systems garbage collection (GC) has to take the iolock and mmaplock after moving data to a new place to synchronize with readers. This means waiting for garbage collection with the iolock held can deadlock. To avoid this, the worst case required blocks have to be reserved before taking the iolock, which is done using a new RTAVAILABLE counter that tracks blocks that are free to write into and don't require garbage collection. The new helpers try to take these available blocks, and if there aren't enough available they wake and wait for GC. This is done using a list of on-stack reservations to ensure fairness. Co-developed-by: Hans Holmberg <hans.holmberg@wdc.com> Signed-off-by: Hans Holmberg <hans.holmberg@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>	2025-03-03 08:17:07 -07:00
Christoph Hellwig	4e4d520755	xfs: add the zoned space allocator For zoned RT devices space is always allocated at the write pointer, that is right after the last written block and only recorded on I/O completion. Because of this, the actual allocation algorithm is very simple and just involves picking a good zone, preferably the one used for the last write to the inode. As the number of zones that can be written at the same time is usually limited by the hardware, selecting a zone is done as late as possible from the iomap dio and buffered writeback bio submission helpers just before submitting the bio. Given that the writers already took a reservation before acquiring the iolock, space will always be readily available if an open zone slot is available. A new structure is used to track these open zones, and pointed to by the xfs_rtgroup. Because zoned file systems don't have a rsum cache the space for that pointer can be reused. Allocations are only recorded at I/O completion time. The scheme used for that is very similar to the reflink COW end I/O path. Co-developed-by: Hans Holmberg <hans.holmberg@wdc.com> Signed-off-by: Hans Holmberg <hans.holmberg@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>	2025-03-03 08:16:56 -07:00
Christoph Hellwig	1df8d75030	xfs: make metabtree reservations global Currently each metabtree inode has its own space reservation to ensure it can be expanded to the maximum size, mirroring what is done for the AG-based btrees. But unlike the AG-based btrees the metabtree inodes aren't restricted to allocate from a single AG but can use free space from the entire file system. And unlike AG-based btrees where the required reservation shrinks with the available free space due to this, the metabtree reservations for the rtrmap and rtreflink trees are not bound in any way by the data device free space as they track RT extent allocations. This is not very efficient as it requires a large number of blocks to be set aside that can't be used at all by other btrees. Switch to a model that uses a global pool instead in preparation for reducing the amount of reserved space, which now also removes the overloading of the i_nblocks field for metabtree inodes, which would create problems if metabtree inodes ever had a big enough xattr fork to require xattr blocks outside the inode. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>	2025-03-03 08:16:43 -07:00

442 Commits