There is currently no XFS ioctl that allows userspace to retrieve the
write pointer for a specific realtime group block for zoned XFS. On zoned
block devices, userspace can obtain this information via zone reports from
the underlying device. However, for zoned XFS operating on regular block
devices, no equivalent mechanism exists.
Access to the realtime group write pointer is useful to userspace
development and analysis tools such as Zonar [1]. So extend the existing
struct xfs_rtgroup_geometry to add a new rg_writepointer field. This field
is valid if XFS_RTGROUP_GEOM_WRITEPOINTER flag is set. The rg_writepointer
field specifies the location of the current writepointer as a block offset
into the respective rtgroup.
[1] https://lwn.net/Articles/1059364/
Signed-off-by: Wilfred Mallawa <wilfred.mallawa@wdc.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
This was done entirely with mindless brute force, using
git grep -l '\<k[vmz]*alloc_objs*(.*, GFP_KERNEL)' |
xargs sed -i 's/\(alloc_objs*(.*\), GFP_KERNEL)/\1)/'
to convert the new alloc_obj() users that had a simple GFP_KERNEL
argument to just drop that argument.
Note that due to the extreme simplicity of the scripting, any slightly
more complex cases spread over multiple lines would not be triggered:
they definitely exist, but this covers the vast bulk of the cases, and
the resulting diff is also then easier to check automatically.
For the same reason the 'flex' versions will be done as a separate
conversion.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This is the result of running the Coccinelle script from
scripts/coccinelle/api/kmalloc_objs.cocci. The script is designed to
avoid scalar types (which need careful case-by-case checking), and
instead replace kmalloc-family calls that allocate struct or union
object instances:
Single allocations: kmalloc(sizeof(TYPE), ...)
are replaced with: kmalloc_obj(TYPE, ...)
Array allocations: kmalloc_array(COUNT, sizeof(TYPE), ...)
are replaced with: kmalloc_objs(TYPE, COUNT, ...)
Flex array allocations: kmalloc(struct_size(PTR, FAM, COUNT), ...)
are replaced with: kmalloc_flex(*PTR, FAM, COUNT, ...)
(where TYPE may also be *VAR)
The resulting allocations no longer return "void *", instead returning
"TYPE *".
Signed-off-by: Kees Cook <kees@kernel.org>
This patchset builds new functionality to deliver live information about
filesystem health events to userspace. This is done by creating an
anonymous file that can be read() for events by userspace programs.
Events are captured by hooking various parts of XFS and iomap so that
metadata health failures, file I/O errors, and major changes in
filesystem state (unmounts, shutdowns, etc.) can be observed by
programs.
When an event occurs, the hook functions queue an event object to each
event anonfd for later processing. Programs must have CAP_SYS_ADMIN
to open the anonfd and there's a maximum event lag to prevent resource
overconsumption. The events themselves can be read() from the anonfd
as C structs for the xfs_healer daemon.
In userspace, we create a new daemon program that will read the event
objects and initiate repairs automatically. This daemon is managed
entirely by systemd and will not block unmounting of the filesystem
unless repairs are ongoing. They are auto-started by a starter
service that uses fanotify.
This patchset depends on the new fserror code that Christian Brauner
has tentatively accepted for Linux 7.0:
https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git/log/?h=vfs-7.0.fserror
v7: more cleanups of the media verification ioctl, improve comments, and
reuse the bio
v6: fix pi-breaking bugs, make verify failures trigger health reports
and filter bio status flags better
v5: add verify-media ioctl, collapse small helper funcs with only
one caller
v4: drop multiple client support so we can make direct calls into
healthmon instead of chasing pointers and doing indirect calls
v3: drag out of rfc status
With a bit of luck, this should all go splendidly.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQQ2qTKExjcn+O1o2YRKO3ySh0YRpgUCaXB3wQAKCRBKO3ySh0YR
pkyzAQD/6Yuzlbc/NDUeyHOeSYYB8zAtrbw1Pdky6dtR16FR3QD/Yb5/M9E4MFz7
IX7KeL00cF/fDFl6c3h9qaEx+w23KgA=
=ADAl
-----END PGP SIGNATURE-----
Merge tag 'health-monitoring-7.0_2026-01-20' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into xfs-7.0-merge
xfs: autonomous self healing of filesystems [v7]
This patchset builds new functionality to deliver live information about
filesystem health events to userspace. This is done by creating an
anonymous file that can be read() for events by userspace programs.
Events are captured by hooking various parts of XFS and iomap so that
metadata health failures, file I/O errors, and major changes in
filesystem state (unmounts, shutdowns, etc.) can be observed by
programs.
When an event occurs, the hook functions queue an event object to each
event anonfd for later processing. Programs must have CAP_SYS_ADMIN
to open the anonfd and there's a maximum event lag to prevent resource
overconsumption. The events themselves can be read() from the anonfd
as C structs for the xfs_healer daemon.
In userspace, we create a new daemon program that will read the event
objects and initiate repairs automatically. This daemon is managed
entirely by systemd and will not block unmounting of the filesystem
unless repairs are ongoing. They are auto-started by a starter
service that uses fanotify.
This patchset depends on the new fserror code that Christian Brauner
has tentatively accepted for Linux 7.0:
https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git/log/?h=vfs-7.0.fserror
v7: more cleanups of the media verification ioctl, improve comments, and
reuse the bio
v6: fix pi-breaking bugs, make verify failures trigger health reports
and filter bio status flags better
v5: add verify-media ioctl, collapse small helper funcs with only
one caller
v4: drop multiple client support so we can make direct calls into
healthmon instead of chasing pointers and doing indirect calls
v3: drag out of rfc status
With a bit of luck, this should all go splendidly.
Conflicts:
This merge required an update on files:
- fs/xfs/xfs_healthmon.c
- fs/xfs/xfs_verify_media.c
Such change was required because a parallel developement changed
XFS header file xfs.h naming to xfs_platform.h, so the merge
required to update those includes in both files above
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
The xfs.h header conflicts with the public xfs.h in xfsprogs, leading
to a spurious difference in all shared libxfs files that have to
include libxfs_priv.h in userspace. Directly include xfs_platform.h so
that we can add a header of the same name to xfsprogs and remove this
major annoyance for the shared code.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
Add a new privileged ioctl so that xfs_scrub can ask the kernel to
verify the media of the devices backing an xfs filesystem, and have any
resulting media errors reported to fsnotify and xfs_healer.
To accomplish this, the kernel allocates a folio between the base page
size and 1MB, and issues read IOs to a gradually incrementing range of
one of the storage devices underlying an xfs filesystem. If any error
occurs, that raw error is reported to the calling process. If the error
happens to be one of the ones that the kernel considers indicative of
data loss, then it will also be reported to xfs_healthmon and fsnotify.
Driving the verification from the kernel enables xfs (and by extension
xfs_scrub) to have precise control over the size and error handling of
IOs that are issued to the underlying block device, and to emit
notifications about problems to other relevant kernel subsystems
immediately.
Note that the caller is also allowed to reduce the size of the IO and
to ask for a relaxation period after each IO.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Start creating helper functions and infrastructure to pass filesystem
health events to a health monitoring file. Since this is an
administrative interface, we only support a single health monitor
process per filesystem, so we don't need to use anything fancy such as
notifier chains (== tons of indirect calls).
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Use the direct I/O alignment reporting from ->getattr instead of
reimplementing it. This exposes the relaxation of the memory
alignment in the XFS_IOC_DIOINFO info and ensure the information will
stay in sync. Note that randholes.c in xfstests has a bug where it
incorrectly fails when the required memory alignment is smaller than the
pointer size. Round up the reported value as there is a fair chance that
this code got copied into various applications.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
XFS does't have file attributes manipulation ioctls for special
files. Changing or reading file attributes is rejected for them in
xfs_fileattr_*et().
In XFS, this is necessary to work for project quota directories.
When project is set up, xfs_quota opens and calls FS_IOC_SETFSXATTR on
every inode in the directory. However, special files are skipped due to
open() returning a special inode for them. So, they don't even get to
this check.
The recently added file_getattr/file_setattr will call xfs_fileattr_*et,
on special files. This patch allows reading/changing file attributes on
special files.
Signed-off-by: Andrey Albershteyn <aalbersh@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
Add a new field to struct xfs_ibulk to directly pass XFS_IWALK* flags,
and thus remove the need to indirect the SAME_AG flag through
XFS_IBULK*.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCaINCpgAKCRCRxhvAZXjc
oqfFAQDcy3rROUF3W34KcSi7rDmaKVSX53d1tUoqH+1zDRpSlwEAriKDNC1ybudp
YAnxVzkRHjHs1296WIuwKq5lfhJ60Q4=
=geAl
-----END PGP SIGNATURE-----
Merge tag 'vfs-6.17-rc1.fileattr' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull fileattr updates from Christian Brauner:
"This introduces the new file_getattr() and file_setattr() system calls
after lengthy discussions.
Both system calls serve as successors and extensible companions to
the FS_IOC_FSGETXATTR and FS_IOC_FSSETXATTR system calls which have
started to show their age in addition to being named in a way that
makes it easy to conflate them with extended attribute related
operations.
These syscalls allow userspace to set filesystem inode attributes on
special files. One of the usage examples is the XFS quota projects.
XFS has project quotas which could be attached to a directory. All new
inodes in these directories inherit project ID set on parent
directory.
The project is created from userspace by opening and calling
FS_IOC_FSSETXATTR on each inode. This is not possible for special
files such as FIFO, SOCK, BLK etc. Therefore, some inodes are left
with empty project ID. Those inodes then are not shown in the quota
accounting but still exist in the directory. This is not critical but
in the case when special files are created in the directory with
already existing project quota, these new inodes inherit extended
attributes. This creates a mix of special files with and without
attributes. Moreover, special files with attributes don't have a
possibility to become clear or change the attributes. This, in turn,
prevents userspace from re-creating quota project on these existing
files.
In addition, these new system calls allow the implementation of
additional attributes that we couldn't or didn't want to fit into the
legacy ioctls anymore"
* tag 'vfs-6.17-rc1.fileattr' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
fs: tighten a sanity check in file_attr_to_fileattr()
tree-wide: s/struct fileattr/struct file_kattr/g
fs: introduce file_getattr and file_setattr syscalls
fs: prepare for extending file_get/setattr()
fs: make vfs_fileattr_[get|set] return -EOPNOTSUPP
selinux: implement inode_file_[g|s]etattr hooks
lsm: introduce new hooks for setting/getting inode fsxattr
fs: split fileattr related helpers into separate file
Replace the deprecated strncpy() with memtostr_pad(). This also avoids
the need for separate zeroing using memset(). Mark sb_fname buffer with
__nonstring as its size is XFSLABEL_MAX and so no terminating NULL for
sb_fname.
Signed-off-by: Pranav Tyagi <pranav.tyagi03@gmail.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
Now that we expose struct file_attr as our uapi struct rename all the
internal struct to struct file_kattr to clearly communicate that it is a
kernel internal struct. This is similar to struct mount_{k}attr and
others.
Link: https://lore.kernel.org/20250703-restlaufzeit-baurecht-9ed44552b481@brauner
Signed-off-by: Christian Brauner <brauner@kernel.org>
The zoned space allocator will need reserved RT extents for garbage
collection and zeroing of partial blocks. Move the resblks related
fields into the freecounter array so that they can be used for all
counters.
Co-developed-by: Hans Holmberg <hans.holmberg@wdc.com>
Signed-off-by: Hans Holmberg <hans.holmberg@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
xfs_{add,dec}_freecounter already handles the block and RT extent
percpu counters, but it currently hardcodes the passed in counter.
Add a freecounter abstraction that uses an enum to designate the counter
and add wrappers that hide the actual percpu_counters. This will allow
expanding the reserved block handling to the RT extent counter in the
next step, and also prepares for adding yet another such counter that
can share the code. Both these additions will be needed for the zoned
allocator.
Also switch the flooring of the frextents counter to 0 in statfs for the
rthinherit case to a manual min_t call to match the handling of the
fdblocks counter for normal file systems.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
* Implement reflink support for the realtime device
* Implement reverse-mapping support for the realtime device
* Several bug fixes and cleanups
Signed-off-by: Carlos Maiolino <cem@kernel.org>
-----BEGIN PGP SIGNATURE-----
iJUEABMJAB0WIQSmtYVZ/MfVMGUq1GNcsMJ8RxYuYwUCZ490YgAKCRBcsMJ8RxYu
Y0ERAX9ClltZD861L93lA30CnQwqlWNTdZDu/YB0LfPw2hM6jVIp6WLJIW7kz08q
jvJMD8UBf0jRKI0I5OlD2Vw60kB1GmliUPo4QlkAzy9YNLtAg8OO2mGbEIJA21oS
y3rAdzPBFg==
=K947
-----END PGP SIGNATURE-----
Merge tag 'xfs-merge-6.14' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull XFS updates from Carlos Maiolino:
"This is mostly focused on the implementation of reflink and
reverse-mapping support for XFS's real-time devices.
It also includes several bugfixes.
- Implement reflink support for the realtime device
- Implement reverse-mapping support for the realtime device
- Several bug fixes and cleanups"
* tag 'xfs-merge-6.14' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (121 commits)
xfs: fix buffer lookup vs release race
xfs: check for dead buffers in xfs_buf_find_insert
xfs: add a b_iodone callback to struct xfs_buf
xfs: move b_li_list based retry handling to common code
xfs: simplify xfsaild_resubmit_item
xfs: always complete the buffer inline in xfs_buf_submit
xfs: remove the extra buffer reference in xfs_buf_submit
xfs: move invalidate_kernel_vmap_range to xfs_buf_ioend
xfs: simplify buffer I/O submission
xfs: move in-memory buftarg handling out of _xfs_buf_ioapply
xfs: move write verification out of _xfs_buf_ioapply
xfs: remove xfs_buf_delwri_submit_buffers
xfs: simplify xfs_buf_delwri_pushbuf
xfs: move xfs_buf_iowait out of (__)xfs_buf_submit
xfs: remove the incorrect comment about the b_pag field
xfs: remove the incorrect comment above xfs_buf_free_maps
xfs: fix a double completion for buffers on in-memory targets
xfs/libxfs: replace kmalloc() and memcpy() with kmemdup()
xfs: constify feature checks
xfs: refactor xfs_fs_statfs
...
For I/O to reflinked blocks we always need to write an entire new file
system block, and the code enforces the file system block alignment for
the entire file if it has any reflinked blocks. Mirror the larger
value reported in the statx in the dio_offset_align in the xfs-specific
XFS_IOC_DIOINFO ioctl for the same reason.
Don't bother adding a new field for the read alignment to this legacy
ioctl as all new users should use statx instead.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20250109083109.1441561-6-hch@lst.de
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
The copy-on-write extent size hint is subject to the same alignment
constraints as the regular extent size hint. Since we're in the process
of adding reflink (and therefore CoW) to the realtime device, we must
apply the same scattered rextsize alignment validation strategies to
both hints to deal with the possibility of rextsize changing.
Therefore, fix the inode validator to perform rextsize alignment checks
on regular realtime files, and to remove misaligned directory hints.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Now that we can share blocks between realtime files, allow this
combination.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
* convert perag to use xarrays
* create a new generic allocation group structure
* Add metadata inode dir trees
* Create in-core rt allocation groups
* Shard the RT section into allocation groups
* Persist quota options with the enw metadata dir tree
* Enable quota for RT volumes
* Enable metadata directory trees
* Some bugfixes
Signed-off-by: Carlos Maiolino <cem@kernel.org>
-----BEGIN PGP SIGNATURE-----
iJUEABMJAB0WIQQMHYkcUKcy4GgPe2RGdaER5QtfpgUCZzyNwAAKCRBGdaER5Qtf
psV3AYCncK/pVhFfKQSFbnCvgPSoAe7N9n0Wt5gmjy0Ill2mbQXVl9ADXkH6a015
gcGM3t4BgIHLJQndL/Uz+3a0L5IriEb9QkAfzmx8t3vjiRBzBe3WfywEx9Yt7kZe
xbxEJ2HQpA==
=3ngC
-----END PGP SIGNATURE-----
Merge tag 'xfs-6.13-merge-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull xfs updates from Carlos Maiolino:
"The bulk of this pull request is a major rework that Darrick and
Christoph have been doing on XFS's real-time volume, coupled with a
few features to support this rework. It does also includes some bug
fixes.
- convert perag to use xarrays
- create a new generic allocation group structure
- add metadata inode dir trees
- create in-core rt allocation groups
- shard the RT section into allocation groups
- persist quota options with the enw metadata dir tree
- enable quota for RT volumes
- enable metadata directory trees
- some bugfixes"
* tag 'xfs-6.13-merge-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (146 commits)
xfs: port ondisk structure checks from xfs/122 to the kernel
xfs: separate space btree structures in xfs_ondisk.h
xfs: convert struct typedefs in xfs_ondisk.h
xfs: enable metadata directory feature
xfs: enable realtime quota again
xfs: update sb field checks when metadir is turned on
xfs: reserve quota for realtime files correctly
xfs: create quota preallocation watermarks for realtime quota
xfs: report realtime block quota limits on realtime directories
xfs: persist quota flags with metadir
xfs: advertise realtime quota support in the xqm stat files
xfs: scrub quota file metapaths
xfs: fix chown with rt quota
xfs: use metadir for quota inodes
xfs: refactor xfs_qm_destroy_quotainos
xfs: use rtgroup busy extent list for FITRIM
xfs: implement busy extent tracking for rtgroups
xfs: port the perag discard code to handle generic groups
xfs: move the min and max group block numbers to xfs_group
xfs: adjust min_block usage in xfs_verify_agbno
...
Making sure that struct fd instances are destroyed in the same
scope where they'd been created, getting rid of reassignments
and passing them by reference, converting to CLASS(fd{,_pos,_raw}).
We are getting very close to having the memory safety of that stuff
trivial to verify.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQQqUNBr3gm4hGXdBJlZ7Krx/gZQ6wUCZzdikAAKCRBZ7Krx/gZQ
69nJAQCmbQHK3TGUbQhOw6MJXOK9ezpyEDN3FZb4jsu38vTIdgEA6OxAYDO2m2g9
CN18glYmD3wRyU6Bwl4vGODouSJvDgA=
=gVH3
-----END PGP SIGNATURE-----
Merge tag 'pull-fd' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull 'struct fd' class updates from Al Viro:
"The bulk of struct fd memory safety stuff
Making sure that struct fd instances are destroyed in the same scope
where they'd been created, getting rid of reassignments and passing
them by reference, converting to CLASS(fd{,_pos,_raw}).
We are getting very close to having the memory safety of that stuff
trivial to verify"
* tag 'pull-fd' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (28 commits)
deal with the last remaing boolean uses of fd_file()
css_set_fork(): switch to CLASS(fd_raw, ...)
memcg_write_event_control(): switch to CLASS(fd)
assorted variants of irqfd setup: convert to CLASS(fd)
do_pollfd(): convert to CLASS(fd)
convert do_select()
convert vfs_dedupe_file_range().
convert cifs_ioctl_copychunk()
convert media_request_get_by_fd()
convert spu_run(2)
switch spufs_calls_{get,put}() to CLASS() use
convert cachestat(2)
convert do_preadv()/do_pwritev()
fdget(), more trivial conversions
fdget(), trivial conversions
privcmd_ioeventfd_assign(): don't open-code eventfd_ctx_fdget()
o2hb_region_dev_store(): avoid goto around fdget()/fdput()
introduce "fd_pos" class, convert fdget_pos() users to it.
fdget_raw() users: switch to CLASS(fd_raw)
convert vmsplice() to CLASS(fd)
...
Create an ioctl so that the kernel can report the status of realtime
groups to userspace.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Every time we update parts of the primary filesystem superblock that are
echoed in the rt superblock, we must update the rt super. Avoid
changing the log to support logging to the rt device by using ordered
buffers.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Allow the V5 bulkstat ioctl to return information about metadata
directory files so that xfs_scrub can find and scrub them, since they
are otherwise ordinary directories.
(Metadata files of course require per-file scrub code and hence do not
need exposure.)
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
all failure exits prior to fdget() leave the scope, all matching fdput()
are immediately followed by leaving the scope.
[xfs_ioc_commit_range() chunk moved here as well]
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Extsize should only be allowed to be set on files with no data in it.
For this, we check if the files have extents but miss to check if
delayed extents are present. This patch adds that check.
While we are at it, also refactor this check into a helper since
it's used in some other places as well like xfs_inactive() or
xfs_ioctl_setattr_xflags()
**Without the patch (SUCCEEDS)**
$ xfs_io -c 'open -f testfile' -c 'pwrite 0 1024' -c 'extsize 65536'
wrote 1024/1024 bytes at offset 0
1 KiB, 1 ops; 0.0002 sec (4.628 MiB/sec and 4739.3365 ops/sec)
**With the patch (FAILS as expected)**
$ xfs_io -c 'open -f testfile' -c 'pwrite 0 1024' -c 'extsize 65536'
wrote 1024/1024 bytes at offset 0
1 KiB, 1 ops; 0.0002 sec (4.628 MiB/sec and 4739.3365 ops/sec)
xfs_io: FS_IOC_FSSETXATTR testfile: Invalid argument
Fixes: e94af02a9c ("[XFS] fix old xfs_setattr mis-merge from irix; mostly harmless esp if not using xfs rt")
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: John Garry <john.g.garry@oracle.com>
Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQQqUNBr3gm4hGXdBJlZ7Krx/gZQ6wUCZvDNmgAKCRBZ7Krx/gZQ
63zrAP9vI0rf55v27twiabe9LnI7aSx5ckoqXxFIFxyT3dOYpQD/bPmoApnWDD3d
592+iDgLsema/H/0/CqfqlaNtDNY8Q0=
=HUl5
-----END PGP SIGNATURE-----
Merge tag 'pull-stable-struct_fd' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull 'struct fd' updates from Al Viro:
"Just the 'struct fd' layout change, with conversion to accessor
helpers"
* tag 'pull-stable-struct_fd' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
add struct fd constructors, get rid of __to_fd()
struct fd: representation change
introduce fd_file(), convert all accessors to it.
Move this function out of xfs_ioctl.c to reduce the clutter in there,
and make the entire getfsmap implementation self-contained in a single
file.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
This patch introduces two more new ioctls to manage atomic updates to
file contents -- XFS_IOC_START_COMMIT and XFS_IOC_COMMIT_RANGE. The
commit mechanism here is exactly the same as what XFS_IOC_EXCHANGE_RANGE
does, but with the additional requirement that file2 cannot have changed
since some sampling point. The start-commit ioctl performs the sampling
of file attributes.
Note: This patch currently samples i_ctime during START_COMMIT and
checks that it hasn't changed during COMMIT_RANGE. This isn't entirely
safe in kernels prior to 6.12 because ctime only had coarse grained
granularity and very fast updates could collide with a COMMIT_RANGE.
With the multi-granularity ctime introduced by Jeff Layton, it's now
possible to update ctime such that this does not happen.
It is critical, then, that this patch must not be backported to any
kernel that does not support fine-grained file change timestamps.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Acked-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
If a file has the S_DAX flag (aka fsdax access mode) set, we cannot
allow users to change the realtime flag unless the datadev and rtdev
both support fsdax access modes. Even if there are no extents allocated
to the file, the setattr thread could be racing with another thread
that has already started down the write code paths.
Fixes: ba23cba9b3 ("fs: allow per-device dax status checking for filesystems")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
For any changes of struct fd representation we need to
turn existing accesses to fields into calls of wrappers.
Accesses to struct fd::flags are very few (3 in linux/file.h,
1 in net/socket.c, 3 in fs/overlayfs/file.c and 3 more in
explicit initializers).
Those can be dealt with in the commit converting to
new layout; accesses to struct fd::file are too many for that.
This commit converts (almost) all of f.file to
fd_file(f). It's not entirely mechanical ('file' is used as
a member name more than just in struct fd) and it does not
even attempt to distinguish the uses in pointer context from
those in boolean context; the latter will be eventually turned
into a separate helper (fd_empty()).
NOTE: mass conversion to fd_empty(), tempting as it
might be, is a bad idea; better do that piecewise in commit
that convert from fdget...() to CLASS(...).
[conflicts in fs/fhandle.c, kernel/bpf/syscall.c, mm/memcontrol.c
caught by git; fs/stat.c one got caught by git grep]
[fs/xattr.c conflict]
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Hoist the inode flag conversion functions into libxfs so that we can
keep them in sync. Do this by creating a new xfs_inode_util.c file in
libxfs.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Introduce a variant on XFS_SCRUB_METADATA that allows for a vectored
mode. The caller specifies the principal metadata object that they want
to scrub (allocation group, inode, etc.) once, followed by an array of
scrub types they want called on that object. The kernel runs the scrub
operations and writes the output flags and errno code to the
corresponding array element.
A new pseudo scrub type BARRIER is introduced to force the kernel to
return to userspace if any corruptions have been found when scrubbing
the previous scrub types in the array. This enables userspace to
schedule, for example, the sequence:
1. data fork
2. barrier
3. directory
If the data fork scrub is clean, then the kernel will perform the
directory scrub. If not, the barrier in 2 will exit back to userspace.
The alternative would have been an interface where userspace passes a
pointer to an empty buffer, and the kernel formats that with
xfs_scrub_vecs that tell userspace what it scrubbed and what the outcome
was. With that the kernel would have to communicate that the buffer
needed to have been at least X size, even though for our cases
XFS_SCRUB_TYPE_NR + 2 would always be enough.
Compared to that, this design keeps all the dependency policy and
ordering logic in userspace where it already resides instead of
duplicating it in the kernel. The downside of that is that it needs the
barrier logic.
When running fstests in "rebuild all metadata after each test" mode, I
observed a 10% reduction in runtime due to fewer transitions across the
system call boundary.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Move the scrub ioctl handler to scrub.c to keep the code together and to
reduce unnecessary code when CONFIG_XFS_ONLINE_SCRUB=n.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
This patch adds a pair of new file ioctls to retrieve the parent pointer
of a given inode. They both return the same results, but one operates
on the file descriptor passed to ioctl() whereas the other allows the
caller to specify a file handle for which the caller wants results.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Move the handle managemnet code (and the attrmulti code that uses it) to
xfs_handle.c.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Pass the attr value to put_listent when we have local xattrs or
shortform xattrs. This will enable the GETPARENTS ioctl to use
xfs_attr_list as its backend.
Signed-off-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Parent pointers match attrs on name+value, unlike everything else which
matches on only the name. Therefore, we cannot keep using the heuristic
that !value means remove. Make this an explicit operation code.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
This field only ever contains XATTR_{CREATE,REPLACE}, and it only goes
as deep as xfs_attr_set. Remove the field from the structure and
replace it with an enum specifying exactly what kind of change we want
to make to the xattr structure. Upsert is the name that we'll give to
the flags==0 operation, because we're either updating an existing value
or inserting it, and the caller doesn't care.
Note: The "UPSERTR" name created here is to make userspace porting
easier. It will be removed in the next patch.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Introduce a new ioctl to handle exchanging ranges of bytes
between files. The goal here is to perform the exchange atomically with
respect to applications -- either they see the file contents before the
exchange or they see that A-B is now B-A, even if the kernel crashes.
My original goal with all this code was to make it so that online repair
can build a replacement directory or xattr structure in a temporary file
and commit the repair by atomically exchanging all the data blocks
between the two files. However, I needed a way to test this mechanism
thoroughly, so I've been evolving an ioctl interface since then.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Move the two public symbols in xfs_file.c to xfs_file.h. We're about to
add more public symbols in that source file, so let's finally create the
header file.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Another incorrect conversion to kfree() instead of kvfree().
Fixes: 4929257613 ("xfs: convert kmem_free() for kvmalloc users to kvfree()")
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
Wrongly converted from kmem_free() to kfree().
Reported-by: Matthew Wilcox <willy@infradead.org>
Fixes: 4929257613 ("xfs: convert kmem_free() for kvmalloc users to kvfree()")
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
Without this the kernel crashes in kfree for files with a sufficiently
large number of extents.
Fixes: d4c75a1b40 ("xfs: convert remaining kmem_free() to kfree()")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
The remaining callers of kmem_free() are freeing heap memory, so
we can convert them directly to kfree() and get rid of kmem_free()
altogether.
This conversion was done with:
$ for f in `git grep -l kmem_free fs/xfs`; do
> sed -i s/kmem_free/kfree/ $f
> done
$
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
Start getting rid of kmem_free() by converting all the cases where
memory can come from vmalloc interfaces to calling kvfree()
directly.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
xfs_reserve_blocks has a very odd interface that can only be explained
by it directly deriving from the IRIX fcntl handler back in the day.
Split reporting out the reserved blocks out of xfs_reserve_blocks into
the only caller that cares. This means that the value reported from
XFS_IOC_SET_RESBLKS isn't atomically sampled in the same critical
section as when it was set anymore, but as the values could change
right after setting them anyway that does not matter. It does
provide atomic sampling of both values for XFS_IOC_GET_RESBLKS now,
though.
Also pass a normal scalar integer value for the requested value instead
of the pointless pointer.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
Split XFS_IOC_FSCOUNTS out of the main xfs_file_ioctl function, and
merge the xfs_fs_counts helper into the ioctl handler.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
The XFS_IOC_GET_RESBLKS and XFS_IOC_SET_RESBLKS already share a fair
amount of code, and will share even more soon. Move the logic for both
of them out of the main xfs_file_ioctl function into a
xfs_ioctl_getset_resblocks helper to share the code and prepare for
additional changes.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>