Commit Graph

406 Commits

Author SHA1 Message Date
Linus Torvalds
4793dae01f Driver core changes for 7.1-rc1
- debugfs:
   - Fix NULL pointer dereference in debugfs_create_str()
   - Fix misplaced EXPORT_SYMBOL_GPL for debugfs_create_str()
   - Fix soundwire debugfs NULL pointer dereference from uninitialized
     firmware_file
 
 - device property:
   - Make fwnode flags modifications thread safe; widen the field to
     unsigned long and use set_bit() / clear_bit() based accessors
   - Document how to check for the property presence
 
 - devres:
   - Separate struct devres_node from its "subclasses" (struct devres,
     struct devres_group); give struct devres_node its own release and
     free callbacks for per-type dispatch
   - Introduce struct devres_action for devres actions, avoiding the
     ARCH_DMA_MINALIGN alignment overhead of struct devres
   - Export struct devres_node and its init/add/remove/dbginfo
     primitives for use by Rust Devres<T>
   - Fix missing node debug info in devm_krealloc()
   - Use guard(spinlock_irqsave) where applicable; consolidate unlock
     paths in devres_release_group()
 
 - driver_override:
   - Convert PCI, WMI, vdpa, s390/cio, s390/ap, and fsl-mc to the
     generic driver_override infrastructure, replacing per-bus
     driver_override strings, sysfs attributes, and match logic; fixes
     a potential UAF from unsynchronized access to driver_override in
     bus match() callbacks
   - Simplify __device_set_driver_override() logic
 
 - kernfs:
   - Send IN_DELETE_SELF and IN_IGNORED inotify events on kernfs
     file and directory removal
   - Add corresponding selftests for memcg
 
 - platform:
   - Allow attaching software nodes when creating platform devices via
     a new 'swnode' field in struct platform_device_info
   - Add kerneldoc for struct platform_device_info
 
 - software node:
   - Move software node initialization from postcore_initcall() to
     driver_init(), making it available early in the boot process
   - Move kernel_kobj initialization (ksysfs_init) earlier to support
     the above
   - Remove software_node_exit(); dead code in a built-in unit
 
 - SoC:
   - Introduce of_machine_read_compatible() and of_machine_read_model()
     OF helpers and export soc_attr_read_machine() to replace direct
     accesses to of_root from SoC drivers; also enables
     CONFIG_COMPILE_TEST coverage for these drivers
 
 - sysfs:
   - Constify attribute group array pointers to
     'const struct attribute_group *const *' in sysfs functions,
     device_add_groups() / device_remove_groups(), and struct class
 
 - Rust:
   - Devres:
     - Embed struct devres_node directly in Devres<T> instead of going
       through devm_add_action(), avoiding the extra allocation and
       the unnecessary ARCH_DMA_MINALIGN alignment
 
   - I/O:
     - Turn IoCapable from a marker trait into a functional trait
       carrying the raw I/O accessor implementation (io_read /
       io_write), providing working defaults for the per-type Io
       methods
     - Add RelaxedMmio wrapper type, making relaxed accessors usable
       in code generic over the Io trait
     - Remove overloaded per-type Io methods and per-backend macros
       from Mmio and PCI ConfigSpace
 
   - I/O (Register):
     - Add IoLoc trait and generic read/write/update methods to the Io
       trait, making I/O operations parameterizable by typed locations
     - Add register! macro for defining hardware register types with
       typed bitfield accessors backed by Bounded values; supports
       direct, relative, and array register addressing
     - Add write_reg() / try_write_reg() and LocatedRegister trait
     - Update PCI sample driver to demonstrate the register! macro
 
         Example:
 
         ```
             register! {
                 /// UART control register.
                 CTRL(u32) @ 0x18 {
                     /// Receiver enable.
                     19:19   rx_enable => bool;
                     /// Parity configuration.
                     14:13   parity ?=> Parity;
                 }
 
                 /// FIFO watermark and counter register.
                 WATER(u32) @ 0x2c {
                     /// Number of datawords in the receive FIFO.
                     26:24   rx_count;
                     /// RX interrupt threshold.
                     17:16   rx_water;
                 }
             }
 
             impl WATER {
                 fn rx_above_watermark(&self) -> bool {
                     self.rx_count() > self.rx_water()
                 }
             }
 
             fn init(bar: &pci::Bar<BAR0_SIZE>) {
                 let water = WATER::zeroed()
                     .with_const_rx_water::<1>(); // > 3 would not compile
                 bar.write_reg(water);
 
                 let ctrl = CTRL::zeroed()
                     .with_parity(Parity::Even)
                     .with_rx_enable(true);
                 bar.write_reg(ctrl);
             }
 
             fn handle_rx(bar: &pci::Bar<BAR0_SIZE>) {
                 if bar.read(WATER).rx_above_watermark() {
                     // drain the FIFO
                 }
             }
 
             fn set_parity(bar: &pci::Bar<BAR0_SIZE>, parity: Parity) {
                 bar.update(CTRL, |r| r.with_parity(parity));
             }
         ```
 
   - IRQ:
     - Move 'static bounds from where clauses to trait declarations
       for IRQ handler traits
 
   - Misc:
     - Enable the generic_arg_infer Rust feature
     - Extend Bounded with shift operations, single-bit bool conversion,
       and const get()
 
 - Misc:
   - Make deferred_probe_timeout default a Kconfig option
   - Drop auxiliary_dev_pm_ops; the PM core falls back to driver PM
     callbacks when no bus type PM ops are set
   - Add conditional guard support for device_lock()
   - Add ksysfs.c to the DRIVER CORE MAINTAINERS entry
   - Fix kernel-doc warnings in base.h
   - Fix stale reference to memory_block_add_nid() in documentation
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQS2q/xV6QjXAdC7k+1FlHeO1qrKLgUCadl5SwAKCRBFlHeO1qrK
 LpjDAQCSG3vYznwrngfpmRU5bCB9sdUy/pZiX5px1357+amJkwEA9LgIVQvtHAZW
 ZXcQ7Jr+mR3mJEdlatbkWHp3w1VHqAQ=
 =y1DV
 -----END PGP SIGNATURE-----

Merge tag 'driver-core-7.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/driver-core/driver-core

Pull driver core updates from Danilo Krummrich:
 "debugfs:
   - Fix NULL pointer dereference in debugfs_create_str()
   - Fix misplaced EXPORT_SYMBOL_GPL for debugfs_create_str()
   - Fix soundwire debugfs NULL pointer dereference from uninitialized
     firmware_file

  device property:
   - Make fwnode flags modifications thread safe; widen the field to
     unsigned long and use set_bit() / clear_bit() based accessors
   - Document how to check for the property presence

  devres:
   - Separate struct devres_node from its "subclasses" (struct devres,
     struct devres_group); give struct devres_node its own release and
     free callbacks for per-type dispatch
   - Introduce struct devres_action for devres actions, avoiding the
     ARCH_DMA_MINALIGN alignment overhead of struct devres
   - Export struct devres_node and its init/add/remove/dbginfo
     primitives for use by Rust Devres<T>
   - Fix missing node debug info in devm_krealloc()
   - Use guard(spinlock_irqsave) where applicable; consolidate unlock
     paths in devres_release_group()

  driver_override:
   - Convert PCI, WMI, vdpa, s390/cio, s390/ap, and fsl-mc to the
     generic driver_override infrastructure, replacing per-bus
     driver_override strings, sysfs attributes, and match logic; fixes a
     potential UAF from unsynchronized access to driver_override in bus
     match() callbacks
   - Simplify __device_set_driver_override() logic

  kernfs:
   - Send IN_DELETE_SELF and IN_IGNORED inotify events on kernfs file
     and directory removal
   - Add corresponding selftests for memcg

  platform:
   - Allow attaching software nodes when creating platform devices via a
     new 'swnode' field in struct platform_device_info
   - Add kerneldoc for struct platform_device_info

  software node:
   - Move software node initialization from postcore_initcall() to
     driver_init(), making it available early in the boot process
   - Move kernel_kobj initialization (ksysfs_init) earlier to support
     the above
   - Remove software_node_exit(); dead code in a built-in unit

  SoC:
   - Introduce of_machine_read_compatible() and of_machine_read_model()
     OF helpers and export soc_attr_read_machine() to replace direct
     accesses to of_root from SoC drivers; also enables
     CONFIG_COMPILE_TEST coverage for these drivers

  sysfs:
   - Constify attribute group array pointers to
     'const struct attribute_group *const *' in sysfs functions,
     device_add_groups() / device_remove_groups(), and struct class

  Rust:
   - Devres:
      - Embed struct devres_node directly in Devres<T> instead of going
        through devm_add_action(), avoiding the extra allocation and the
        unnecessary ARCH_DMA_MINALIGN alignment

   - I/O:
      - Turn IoCapable from a marker trait into a functional trait
        carrying the raw I/O accessor implementation (io_read /
        io_write), providing working defaults for the per-type Io
        methods
      - Add RelaxedMmio wrapper type, making relaxed accessors usable in
        code generic over the Io trait
      - Remove overloaded per-type Io methods and per-backend macros
        from Mmio and PCI ConfigSpace

   - I/O (Register):
      - Add IoLoc trait and generic read/write/update methods to the Io
        trait, making I/O operations parameterizable by typed locations
      - Add register! macro for defining hardware register types with
        typed bitfield accessors backed by Bounded values; supports
        direct, relative, and array register addressing
      - Add write_reg() / try_write_reg() and LocatedRegister trait
      - Update PCI sample driver to demonstrate the register! macro

         Example:

         ```
             register! {
                 /// UART control register.
                 CTRL(u32) @ 0x18 {
                     /// Receiver enable.
                     19:19   rx_enable => bool;
                     /// Parity configuration.
                     14:13   parity ?=> Parity;
                 }

                 /// FIFO watermark and counter register.
                 WATER(u32) @ 0x2c {
                     /// Number of datawords in the receive FIFO.
                     26:24   rx_count;
                     /// RX interrupt threshold.
                     17:16   rx_water;
                 }
             }

             impl WATER {
                 fn rx_above_watermark(&self) -> bool {
                     self.rx_count() > self.rx_water()
                 }
             }

             fn init(bar: &pci::Bar<BAR0_SIZE>) {
                 let water = WATER::zeroed()
                     .with_const_rx_water::<1>(); // > 3 would not compile
                 bar.write_reg(water);

                 let ctrl = CTRL::zeroed()
                     .with_parity(Parity::Even)
                     .with_rx_enable(true);
                 bar.write_reg(ctrl);
             }

             fn handle_rx(bar: &pci::Bar<BAR0_SIZE>) {
                 if bar.read(WATER).rx_above_watermark() {
                     // drain the FIFO
                 }
             }

             fn set_parity(bar: &pci::Bar<BAR0_SIZE>, parity: Parity) {
                 bar.update(CTRL, |r| r.with_parity(parity));
             }
         ```

   - IRQ:
      - Move 'static bounds from where clauses to trait declarations for
        IRQ handler traits

   - Misc:
      - Enable the generic_arg_infer Rust feature
      - Extend Bounded with shift operations, single-bit bool
        conversion, and const get()

  Misc:
   - Make deferred_probe_timeout default a Kconfig option
   - Drop auxiliary_dev_pm_ops; the PM core falls back to driver PM
     callbacks when no bus type PM ops are set
   - Add conditional guard support for device_lock()
   - Add ksysfs.c to the DRIVER CORE MAINTAINERS entry
   - Fix kernel-doc warnings in base.h
   - Fix stale reference to memory_block_add_nid() in documentation"

* tag 'driver-core-7.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/driver-core/driver-core: (67 commits)
  bus: fsl-mc: use generic driver_override infrastructure
  s390/ap: use generic driver_override infrastructure
  s390/cio: use generic driver_override infrastructure
  vdpa: use generic driver_override infrastructure
  platform/wmi: use generic driver_override infrastructure
  PCI: use generic driver_override infrastructure
  driver core: make software nodes available earlier
  software node: remove software_node_exit()
  kernel: ksysfs: initialize kernel_kobj earlier
  MAINTAINERS: add ksysfs.c to the DRIVER CORE entry
  drivers/base/memory: fix stale reference to memory_block_add_nid()
  device property: Document how to check for the property presence
  soundwire: debugfs: initialize firmware_file to empty string
  debugfs: fix placement of EXPORT_SYMBOL_GPL for debugfs_create_str()
  debugfs: check for NULL pointer in debugfs_create_str()
  driver core: Make deferred_probe_timeout default a Kconfig option
  driver core: simplify __device_set_driver_override() clearing logic
  driver core: auxiliary bus: Drop auxiliary_dev_pm_ops
  device property: Make modifications of fwnode "flags" thread safe
  rust: devres: embed struct devres_node directly
  ...
2026-04-13 19:03:11 -07:00
Linus Torvalds
c8db08110c vfs-7.1-rc1.xattr
Please consider pulling these changes from the signed vfs-7.1-rc1.xattr tag.
 
 Thanks!
 Christian
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCadjZCgAKCRCRxhvAZXjc
 olWpAQCf//do9MtHmqTjgVfMIv18vw5woI8Bdx/X3m5S9e2D9gD/QddDnkFiVMmu
 txsT+N8BGuW+jbAkYwax5LnUFGUTYgY=
 =a4gm
 -----END PGP SIGNATURE-----

Merge tag 'vfs-7.1-rc1.xattr' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull vfs xattr updates from Christian Brauner:
 "This reworks the simple_xattr infrastructure and adds support for
  user.* extended attributes on sockets.

  The simple_xattr subsystem currently uses an rbtree protected by a
  reader-writer spinlock. This series replaces the rbtree with an
  rhashtable giving O(1) average-case lookup with RCU-based lockless
  reads. This sped up concurrent access patterns on tmpfs quite a bit
  and it's an overall easy enough conversion to do and gets rid or
  rwlock_t.

  The conversion is done incrementally: a new rhashtable path is added
  alongside the existing rbtree, consumers are migrated one at a time
  (shmem, kernfs, pidfs), and then the rbtree code is removed. All three
  consumers switch from embedded structs to pointer-based lazy
  allocation so the rhashtable overhead is only paid for inodes that
  actually use xattrs.

  With this infrastructure in place the series adds support for user.*
  xattrs on sockets. Path-based AF_UNIX sockets inherit xattr support
  from the underlying filesystem (e.g. tmpfs) but sockets in sockfs -
  that is everything created via socket() including abstract namespace
  AF_UNIX sockets - had no xattr support at all.

  The xattr_permission() checks are reworked to allow user.* xattrs on
  S_IFSOCK inodes. Sockfs sockets get per-inode limits of 128 xattrs and
  128KB total value size matching the limits already in use for kernfs.

  The practical motivation comes from several directions. systemd and
  GNOME are expanding their use of Varlink as an IPC mechanism.

  For D-Bus there are tools like dbus-monitor that can observe IPC
  traffic across the system but this only works because D-Bus has a
  central broker.

  For Varlink there is no broker and there is currently no way to
  identify which sockets speak Varlink. With user.* xattrs on sockets a
  service can label its socket with the IPC protocol it speaks (e.g.,
  user.varlink=1) and an eBPF program can then selectively capture
  traffic on those sockets. Enumerating bound sockets via netlink
  combined with these xattr labels gives a way to discover all Varlink
  IPC entrypoints for debugging and introspection.

  Similarly, systemd-journald wants to use xattrs on the /dev/log socket
  for protocol negotiation to indicate whether RFC 5424 structured
  syslog is supported or whether only the legacy RFC 3164 format should
  be used.

  In containers these labels are particularly useful as high-privilege
  or more complicated solutions for socket identification aren't
  available.

  The series comes with comprehensive selftests covering path-based
  AF_UNIX sockets, sockfs socket operations, per-inode limit
  enforcement, and xattr operations across multiple address families
  (AF_INET, AF_INET6, AF_NETLINK, AF_PACKET)"

* tag 'vfs-7.1-rc1.xattr' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  selftests/xattr: test xattrs on various socket families
  selftests/xattr: sockfs socket xattr tests
  selftests/xattr: path-based AF_UNIX socket xattr tests
  xattr: support extended attributes on sockets
  xattr,net: support limited amount of extended attributes on sockfs sockets
  xattr: move user limits for xattrs to generic infra
  xattr: switch xattr_permission() to switch statement
  xattr: add xattr_permission_error()
  xattr: remove rbtree-based simple_xattr infrastructure
  pidfs: adapt to rhashtable-based simple_xattrs
  kernfs: adapt to rhashtable-based simple_xattrs with lazy allocation
  shmem: adapt to rhashtable-based simple_xattrs with lazy allocation
  xattr: add rhashtable-based simple_xattr infrastructure
  xattr: add rcu_head and rhash_head to struct simple_xattr
2026-04-13 10:10:28 -07:00
Christian Brauner
cb76a81c7c kernfs: make directory seek namespace-aware
The rbtree backing kernfs directories is ordered by (hash, ns_id, name)
but kernfs_dir_pos() only searches by hash when seeking to a position
during readdir. When two nodes from different namespaces share the same
hash value, the binary search can land on a node in the wrong namespace.
The subsequent skip-forward loop walks rb_next() and may overshoot the
correct node, silently dropping an entry from the readdir results.

With the recent switch from raw namespace pointers to public namespace
ids as hash seeds, computing hash collisions became an offline operation.
An unprivileged user could unshare into a new network namespace, create
a single interface whose name-hash collides with a target entry in
init_net, and cause a victim's seekdir/readdir on /sys/class/net to miss
that entry.

Fix this by extending the rbtree search in kernfs_dir_pos() to also
compare namespace ids when hashes match. Since the rbtree is already
ordered by (hash, ns_id, name), this makes the seek land directly in the
correct namespace's range, eliminating the wrong-namespace overshoot.

Signed-off-by: Christian Brauner <brauner@kernel.org>
2026-04-09 14:36:52 +02:00
Christian Brauner
1fe989e1c4 kernfs: use namespace id instead of pointer for hashing and comparison
kernfs uses the namespace tag as both a hash seed (via init_name_hash())
and a comparison key in the rbtree. The resulting hash values are exposed
to userspace through directory seek positions (ctx->pos), and the raw
pointer comparisons in kernfs_name_compare() encode kernel pointer
ordering into the rbtree layout.

This constitutes a KASLR information leak since the hash and ordering
derived from kernel pointers can be observed from userspace.

Fix this by using the 64-bit namespace id (ns_common::ns_id) instead of
the raw pointer value for both hashing and comparison. The namespace id
is a stable, non-secret identifier that is already exposed to userspace
through other interfaces (e.g., /proc/pid/ns/, ioctl NS_GET_NSID).

Introduce kernfs_ns_id() as a helper that extracts the namespace id from
a potentially-NULL ns_common pointer, returning 0 for the no-namespace
case.

All namespace equality checks in the directory iteration and dentry
revalidation paths are also switched from pointer comparison to ns_id
comparison for consistency.

Signed-off-by: Christian Brauner <brauner@kernel.org>
2026-04-09 14:36:52 +02:00
Christian Brauner
e3b2cf6e5d kernfs: pass struct ns_common instead of const void * for namespace tags
kernfs has historically used const void * to pass around namespace tags
used for directory-level namespace filtering. The only current user of
this is sysfs network namespace tagging where struct net pointers are
cast to void *.

Replace all const void * namespace parameters with const struct
ns_common * throughout the kernfs, sysfs, and kobject namespace layers.
This includes the kobj_ns_type_operations callbacks, kobject_namespace(),
and all sysfs/kernfs APIs that accept or return namespace tags.

Passing struct ns_common is needed because various codepaths require
access to the underlying namespace. A struct ns_common can always be
converted back to the concrete namespace type (e.g., struct net) via
container_of() or to_ns_common() in the reverse direction.

This is a preparatory change for switching to ns_id-based directory
iteration to prevent a KASLR pointer leak through the current use of
raw namespace pointers as hash seeds and comparison keys.

Signed-off-by: Christian Brauner <brauner@kernel.org>
2026-04-09 14:36:52 +02:00
T.J. Mercier
16de94a1b0 kernfs: Add missing documentation for kernfs_put_active's drop_supers argument
The drop_supers argument was added to kernfs_put_active to control
whether the kernfs_supers_rwsem is temporarily dropped along with the
kernfs_rwsem, but no documentation was added for it.

Fixes: eea5d2bb34 ("kernfs: Send IN_DELETE_SELF and IN_IGNORED")
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202603130112.2FcCzv1g-lkp@intel.com/
Signed-off-by: T.J. Mercier <tjmercier@google.com>
Link: https://patch.msgid.link/20260313175153.235681-1-tjmercier@google.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-03-14 12:12:07 +01:00
T.J. Mercier
eea5d2bb34 kernfs: Send IN_DELETE_SELF and IN_IGNORED
Currently some kernfs files (e.g. cgroup.events, memory.events) support
inotify watches for IN_MODIFY, but unlike with regular filesystems, they
do not receive IN_DELETE_SELF or IN_IGNORED events when they are
removed. This means inotify watches persist after file deletion until
the process exits and the inotify file descriptor is cleaned up, or
until inotify_rm_watch is called manually.

This creates a problem for processes monitoring cgroups. For example, a
service monitoring memory.events for memory.high breaches needs to know
when a cgroup is removed to clean up its state. Where it's known that a
cgroup is removed when all processes die, without IN_DELETE_SELF the
service must resort to inefficient workarounds such as:
  1) Periodically scanning procfs to detect process death (wastes CPU
     and is susceptible to PID reuse).
  2) Holding a pidfd for every monitored cgroup (can exhaust file
     descriptors).

This patch enables IN_DELETE_SELF and IN_IGNORED events for kernfs files
and directories by clearing inode i_nlink values during removal. This
allows VFS to make the necessary fsnotify calls so that userspace
receives the inotify events.

As a result, applications can rely on a single existing watch on a file
of interest (e.g. memory.events) to receive notifications for both
modifications and the eventual removal of the file, as well as automatic
watch descriptor cleanup, simplifying userspace logic and improving
efficiency.

There is gap in this implementation for certain file removals due their
unique nature in kernfs. Directory removals that trigger file removals
occur through vfs_rmdir, which shrinks the dcache and emits fsnotify
events after the rmdir operation; there is no issue here. However kernfs
writes to particular files (e.g. cgroup.subtree_control) can also cause
file removal, but vfs_write does not attempt to emit fsnotify events
after the write operation, even if i_nlink counts are 0. As a usecase
for monitoring this category of file removals is not known, they are
left without having IN_DELETE or IN_DELETE_SELF events generated.
Fanotify recursive monitoring also does not work for kernfs nodes that
do not have inodes attached, as they are created on-demand in kernfs.

Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: T.J. Mercier <tjmercier@google.com>
Tested-by: syzbot@syzkaller.appspotmail.com
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://patch.msgid.link/20260225223404.783173-3-tjmercier@google.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-03-12 15:51:03 +01:00
T.J. Mercier
507d8ce13f kernfs: Don't set_nlink for directories being removed
If a directory is already in the process of removal its i_nlink count
becomes irrelevant because its contents are also about to be removed and
any pending filesystem operations on it or its contents will soon start
to fail. So we can avoid setting it for directories already flagged for
removal.

This avoids a race in the next patch, which adds clearing of the i_nlink
count for kernfs nodes being removed to support inotify delete events.

Use protection from the kernfs_iattr_rwsem to avoid adding more
contention to the kernfs_rwsem for calls to kernfs_refresh_inode.

Signed-off-by: T.J. Mercier <tjmercier@google.com>
Tested-by: syzbot@syzkaller.appspotmail.com
Link: https://patch.msgid.link/20260225223404.783173-2-tjmercier@google.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-03-12 15:51:03 +01:00
Christian Brauner
4fbe9e78bb
xattr: move user limits for xattrs to generic infra
Link: https://patch.msgid.link/20260216-work-xattr-socket-v1-9-c2efa4f74cb7@kernel.org
Acked-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2026-03-02 11:06:42 +01:00
Christian Brauner
5bd97f5c5f
kernfs: adapt to rhashtable-based simple_xattrs with lazy allocation
Adapt kernfs to use the rhashtable-based xattr path and switch from an
embedded struct to pointer-based lazy allocation.

Change kernfs_iattrs.xattrs from embedded 'struct simple_xattrs' to a
pointer 'struct simple_xattrs *', initialized to NULL (zeroed by
kmem_cache_zalloc). Since kernfs_iattrs is already lazily allocated
itself, this adds a second level of lazy allocation specifically for
the xattr store.

The xattr store is allocated on first setxattr. Read paths
check for NULL and return -ENODATA or empty list.

Replaced xattr entries are freed via simple_xattr_free_rcu() to allow
concurrent RCU readers to finish.

The cleanup paths in kernfs_free_rcu() and __kernfs_new_node() error
handling conditionally free the xattr store only when allocated.

As Jan noted in [1]:

> This is a slight change in the lifetime rules because previously kernfs
> xattrs could be safely accessed only under RCU but after this change you
> have to hold inode reference *and* RCU to safely access them. I don't think
> anybody would be accessing xattrs without holding inode reference so this
> should be safe [...].

Link: https://patch.msgid.link/20260216-work-xattr-socket-v1-4-c2efa4f74cb7@kernel.org
Link:
https://lore.kernel.org/3cnmtqmakpbb2uwhenrj7kdqu3uefykiykjllgfbtpkiwhaa4s@sghkevv7jned [1]
Acked-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2026-03-02 11:05:50 +01:00
Linus Torvalds
bf4afc53b7 Convert 'alloc_obj' family to use the new default GFP_KERNEL argument
This was done entirely with mindless brute force, using

    git grep -l '\<k[vmz]*alloc_objs*(.*, GFP_KERNEL)' |
        xargs sed -i 's/\(alloc_objs*(.*\), GFP_KERNEL)/\1)/'

to convert the new alloc_obj() users that had a simple GFP_KERNEL
argument to just drop that argument.

Note that due to the extreme simplicity of the scripting, any slightly
more complex cases spread over multiple lines would not be triggered:
they definitely exist, but this covers the vast bulk of the cases, and
the resulting diff is also then easier to check automatically.

For the same reason the 'flex' versions will be done as a separate
conversion.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2026-02-21 17:09:51 -08:00
Kees Cook
69050f8d6d treewide: Replace kmalloc with kmalloc_obj for non-scalar types
This is the result of running the Coccinelle script from
scripts/coccinelle/api/kmalloc_objs.cocci. The script is designed to
avoid scalar types (which need careful case-by-case checking), and
instead replace kmalloc-family calls that allocate struct or union
object instances:

Single allocations:	kmalloc(sizeof(TYPE), ...)
are replaced with:	kmalloc_obj(TYPE, ...)

Array allocations:	kmalloc_array(COUNT, sizeof(TYPE), ...)
are replaced with:	kmalloc_objs(TYPE, COUNT, ...)

Flex array allocations:	kmalloc(struct_size(PTR, FAM, COUNT), ...)
are replaced with:	kmalloc_flex(*PTR, FAM, COUNT, ...)

(where TYPE may also be *VAR)

The resulting allocations no longer return "void *", instead returning
"TYPE *".

Signed-off-by: Kees Cook <kees@kernel.org>
2026-02-21 01:02:28 -08:00
Will Rosenberg
2b74209458 fs/kernfs: null-ptr deref in simple_xattrs_free()
There exists a null pointer dereference in simple_xattrs_free() as
part of the __kernfs_new_node() routine. Within __kernfs_new_node(),
err_out4 calls simple_xattr_free(), but kn->iattr may be NULL if
__kernfs_setattr() was never called. As a result, the first argument to
simple_xattrs_free() may be NULL + 0x38, and no NULL check is done
internally, causing an incorrect pointer dereference.

Add a check to ensure kn->iattr is not NULL, meaning __kernfs_setattr()
has been called and kn->iattr is allocated. Note that struct kernfs_node
kn is allocated with kmem_cache_zalloc, so we can assume kn->iattr will
be NULL if not allocated.

An alternative fix could be to not call simple_xattrs_free() at all. As
was previously discussed during the initial patch, simple_xattrs_free()
is not strictly needed and is included to be consistent with
kernfs_free_rcu(), which also helps the function maintain correctness if
changes are made in __kernfs_new_node().

Reported-by: syzbot+6aaf7f48ae034ab0ea97@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=6aaf7f48ae034ab0ea97
Fixes: 382b1e8f30 ("kernfs: fix memory leak of kernfs_iattrs in __kernfs_new_node")
Signed-off-by: Will Rosenberg <whrosenb@asu.edu>
Link: https://patch.msgid.link/20251217060107.4171558-1-whrosenb@asu.edu
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-12-23 16:14:43 +01:00
Linus Torvalds
416f99c3b1 Driver core changes for 6.19-rc1
- Arch Topology:
   - Move parse_acpi_topology() from arm64 to common code for reuse in RISC-V
 
 - CPU:
   - Expose housekeeping CPUs through /sys/devices/system/cpu/housekeeping
   - Print a newline (or 0x0A) instead of '(null)' reading
     /sys/devices/system/cpu/nohz_full when nohz_full= is not set
 
 - debugfs
   - Remove (broken) 'no-mount' mode
   - Remove redundant access mode checks in debugfs_get_tree() and
     debugfs_create_*() functions
 
 - Devres:
   - Remove unused devm_free_percpu() helper
   - Move devm_alloc_percpu() from device.h to devres.h
 
 - Firmware Loader:
   - Replace simple_strtol() with kstrtoint()
   - Do not call cancel_store() when no upload is in progress
 
 - kernfs:
   - Increase struct super_block::maxbytes to MAX_LFS_FILESIZE
   - Fix a missing unwind path in __kernfs_new_node()
 
 - Misc:
   - Increase the name size in struct auxiliary_device_id to 40 characters
   - Replace system_unbound_wq with system_dfl_wq and add WQ_PERCPU to
     alloc_workqueue()
 
 - Platform:
   - Replace ERR_PTR() with IOMEM_ERR_PTR() in platform ioremap functions
 
 - Rust:
   - Auxiliary:
     - Unregister auxiliary device on parent device unbind
     - Move parent() to impl Device; implement device context aware parent() for
       Device<Bound>
     - Illustrate how to safely obtain a driver's device private data when
       calling from an auxiliary driver into the parant device driver
 
   - DebugFs:
     - Implement support for binary large objects
 
   - Device:
     - Let probe() return the driver's device private data as pinned initializer,
       i.e. impl PinInit<Self, Error>
     - Implement safe accessor for a driver's device private data for
       Device<Bound> (returned reference can't out-live driver binding and
       guarantees the correct private data type)
     - Implement AsBusDevice trait, to be used by class device abstractions to
       derive the bus device type of the parent device
 
   - DMA:
     - Store raw pointer of allocation as NonNull
     - Use start_ptr() and start_ptr_mut() to inherit correct mutability of self
 
   - FS:
     - Add file::Offset type alias
 
   - I2C:
     - Add abstractions for I2C device / driver infrastructure
     - Implement abstractions for manual I2C device registrations
 
   - I/O:
     - Use "kernel vertical" style for imports
     - Define ResourceSize as resource_size_t
     - Move ResourceSize to top-level I/O module
     - Add type alias for phys_addr_t
     - Implement Rust version of read_poll_timeout_atomic()
 
   - PCI:
     - Use "kernel vertical" style for imports
     - Move I/O and IRQ infrastructure to separate files
     - Add support for PCI interrupt vectors
     - Implement TryInto<IrqRequest<'a>> for IrqVector<'a> to convert an
       IrqVector bound to specific pci::Device into an IrqRequest bound to the
       same pci::Device's parent Device
     - Leverage pin_init_scope() to get rid of redundant Result in IRQ methods
 
   - PinInit:
     - Add {pin_}init_scope() to execute code before creating an initializer
 
   - Platform:
     - Leverage pin_init_scope() to get rid of redundant Result in IRQ methods
 
   - Timekeeping:
     - Implement abstraction of udelay()
 
   - Uaccess:
     - Implement read_slice_partial() and read_slice_file() for UserSliceReader
     - Implement write_slice_partial() and write_slice_file() for UserSliceWriter
 
 - sysfs
   - Prepare the constification of struct attribute
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQS2q/xV6QjXAdC7k+1FlHeO1qrKLgUCaTAehwAKCRBFlHeO1qrK
 LmzvAP0TWwKUGIduccknIa1AFvBM92lWVZptSysotv3SLFZq3wD9GBLIENt1DkEk
 s+GBqbobPgoyaodaysqLQ/SNqF9TcAM=
 =Wutw
 -----END PGP SIGNATURE-----

Merge tag 'driver-core-6.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/driver-core/driver-core

Pull driver core updates from Danilo Krummrich:
 "Arch Topology:
   - Move parse_acpi_topology() from arm64 to common code for reuse in
     RISC-V

  CPU:
   - Expose housekeeping CPUs through /sys/devices/system/cpu/housekeeping
   - Print a newline (or 0x0A) instead of '(null)' reading
     /sys/devices/system/cpu/nohz_full when nohz_full= is not set

  debugfs
   - Remove (broken) 'no-mount' mode
   - Remove redundant access mode checks in debugfs_get_tree() and
     debugfs_create_*() functions

  Devres:
   - Remove unused devm_free_percpu() helper
   - Move devm_alloc_percpu() from device.h to devres.h

  Firmware Loader:
   - Replace simple_strtol() with kstrtoint()
   - Do not call cancel_store() when no upload is in progress

  kernfs:
   - Increase struct super_block::maxbytes to MAX_LFS_FILESIZE
   - Fix a missing unwind path in __kernfs_new_node()

  Misc:
   - Increase the name size in struct auxiliary_device_id to 40
     characters
   - Replace system_unbound_wq with system_dfl_wq and add WQ_PERCPU to
     alloc_workqueue()

  Platform:
   - Replace ERR_PTR() with IOMEM_ERR_PTR() in platform ioremap
     functions

  Rust:
   - Auxiliary:
      - Unregister auxiliary device on parent device unbind
      - Move parent() to impl Device; implement device context aware
        parent() for Device<Bound>
      - Illustrate how to safely obtain a driver's device private data
        when calling from an auxiliary driver into the parant device
        driver

   - DebugFs:
      - Implement support for binary large objects

   - Device:
      - Let probe() return the driver's device private data as pinned
        initializer, i.e. impl PinInit<Self, Error>
      - Implement safe accessor for a driver's device private data for
        Device<Bound> (returned reference can't out-live driver binding
        and guarantees the correct private data type)
      - Implement AsBusDevice trait, to be used by class device
        abstractions to derive the bus device type of the parent device

   - DMA:
      - Store raw pointer of allocation as NonNull
      - Use start_ptr() and start_ptr_mut() to inherit correct
        mutability of self

   - FS:
      - Add file::Offset type alias

   - I2C:
      - Add abstractions for I2C device / driver infrastructure
      - Implement abstractions for manual I2C device registrations

   - I/O:
      - Use "kernel vertical" style for imports
      - Define ResourceSize as resource_size_t
      - Move ResourceSize to top-level I/O module
      - Add type alias for phys_addr_t
      - Implement Rust version of read_poll_timeout_atomic()

   - PCI:
      - Use "kernel vertical" style for imports
      - Move I/O and IRQ infrastructure to separate files
      - Add support for PCI interrupt vectors
      - Implement TryInto<IrqRequest<'a>> for IrqVector<'a> to convert
        an IrqVector bound to specific pci::Device into an IrqRequest
        bound to the same pci::Device's parent Device
      - Leverage pin_init_scope() to get rid of redundant Result in IRQ
        methods

   - PinInit:
      - Add {pin_}init_scope() to execute code before creating an
        initializer

   - Platform:
      - Leverage pin_init_scope() to get rid of redundant Result in IRQ
        methods

   - Timekeeping:
      - Implement abstraction of udelay()

   - Uaccess:
      - Implement read_slice_partial() and read_slice_file() for
        UserSliceReader
      - Implement write_slice_partial() and write_slice_file() for
        UserSliceWriter

  sysfs:
   - Prepare the constification of struct attribute"

* tag 'driver-core-6.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/driver-core/driver-core: (75 commits)
  rust: pci: fix build failure when CONFIG_PCI_MSI is disabled
  debugfs: Fix default access mode config check
  debugfs: Remove broken no-mount mode
  debugfs: Remove redundant access mode checks
  driver core: Check drivers_autoprobe for all added devices
  driver core: WQ_PERCPU added to alloc_workqueue users
  driver core: replace use of system_unbound_wq with system_dfl_wq
  tick/nohz: Expose housekeeping CPUs in sysfs
  tick/nohz: avoid showing '(null)' if nohz_full= not set
  sysfs/cpu: Use DEVICE_ATTR_RO for nohz_full attribute
  kernfs: fix memory leak of kernfs_iattrs in __kernfs_new_node
  fs/kernfs: raise sb->maxbytes to MAX_LFS_FILESIZE
  mod_devicetable: Bump auxiliary_device_id name size
  sysfs: simplify attribute definition macros
  samples/kobject: constify 'struct foo_attribute'
  samples/kobject: add is_visible() callback to attribute group
  sysfs: attribute_group: enable const variants of is_visible()
  sysfs: introduce __SYSFS_FUNCTION_ALTERNATIVE()
  sysfs: transparently handle const pointers in ATTRIBUTE_GROUPS()
  sysfs: attribute_group: allow registration of const attribute
  ...
2025-12-05 21:29:02 -08:00
Will Rosenberg
382b1e8f30 kernfs: fix memory leak of kernfs_iattrs in __kernfs_new_node
There exists a memory leak of kernfs_iattrs contained as an element
of kernfs_node allocated in __kernfs_new_node(). __kernfs_setattr()
allocates kernfs_iattrs as a sub-object, and the LSM security check
incorrectly errors out and does not free the kernfs_iattrs sub-object.

Make an additional error out case that properly frees kernfs_iattrs if
security_kernfs_init_security() fails.

Fixes: e19dfdc83b ("kernfs: initialize security of newly created nodes")
Co-developed-by: Oliver Rosenberg <olrose55@gmail.com>
Signed-off-by: Oliver Rosenberg <olrose55@gmail.com>
Signed-off-by: Will Rosenberg <whrosenb@asu.edu>
Link: https://patch.msgid.link/20251125151332.2010687-1-whrosenb@asu.edu
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-11-26 15:19:11 +01:00
Jane Chu
2467f9928c fs/kernfs: raise sb->maxbytes to MAX_LFS_FILESIZE
On an ARM64 A1 system, it's possible to have physical memory span
up to the 64T boundary, like below

$ lsmem -b -r -n -o range,size
0x0000000080000000-0x00000000bfffffff 1073741824
0x0000080000000000-0x000008007fffffff 2147483648
0x00000800c0000000-0x0000087fffffffff 546534588416
0x0000400000000000-0x00004000bfffffff 3221225472
0x0000400100000000-0x0000407fffffffff 545460846592

So it's time to extend /sys/kernel/mm/page_idle/bitmap to be able
to account for >2G number of pages, by raising the kernfs file size
limit.

Signed-off-by: Jane Chu <jane.chu@oracle.com>
Link: https://patch.msgid.link/20251111202606.1505437-1-jane.chu@oracle.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-11-26 15:18:13 +01:00
Mateusz Guzik
b4dbfd8653
Coccinelle-based conversion to use ->i_state accessors
All places were patched by coccinelle with the default expecting that
->i_lock is held, afterwards entries got fixed up by hand to use
unlocked variants as needed.

The script:
@@
expression inode, flags;
@@

- inode->i_state & flags
+ inode_state_read(inode) & flags

@@
expression inode, flags;
@@

- inode->i_state &= ~flags
+ inode_state_clear(inode, flags)

@@
expression inode, flag1, flag2;
@@

- inode->i_state &= ~flag1 & ~flag2
+ inode_state_clear(inode, flag1 | flag2)

@@
expression inode, flags;
@@

- inode->i_state |= flags
+ inode_state_set(inode, flags)

@@
expression inode, flags;
@@

- inode->i_state = flags
+ inode_state_assign(inode, flags)

@@
expression inode, flags;
@@

- flags = inode->i_state
+ flags = inode_state_read(inode)

@@
expression inode, flags;
@@

- READ_ONCE(inode->i_state) & flags
+ inode_state_read(inode) & flags

Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-10-20 20:22:26 +02:00
Linus Torvalds
b7ce6fa90f vfs-6.18-rc1.misc
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCaNZQMQAKCRCRxhvAZXjc
 omNLAQCgrwzd9sa1JTlixweu3OAxQlSEbLuMpEv7Ztm+B7Wz0AD9HtwPC44Kev03
 GbMcB2DCFLC4evqYECj6IG7NBmoKsAs=
 =1ICf
 -----END PGP SIGNATURE-----

Merge tag 'vfs-6.18-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull misc vfs updates from Christian Brauner:
 "This contains the usual selections of misc updates for this cycle.

  Features:

   - Add "initramfs_options" parameter to set initramfs mount options.
     This allows to add specific mount options to the rootfs to e.g.,
     limit the memory size

   - Add RWF_NOSIGNAL flag for pwritev2()

     Add RWF_NOSIGNAL flag for pwritev2. This flag prevents the SIGPIPE
     signal from being raised when writing on disconnected pipes or
     sockets. The flag is handled directly by the pipe filesystem and
     converted to the existing MSG_NOSIGNAL flag for sockets

   - Allow to pass pid namespace as procfs mount option

     Ever since the introduction of pid namespaces, procfs has had very
     implicit behaviour surrounding them (the pidns used by a procfs
     mount is auto-selected based on the mounting process's active
     pidns, and the pidns itself is basically hidden once the mount has
     been constructed)

     This implicit behaviour has historically meant that userspace was
     required to do some special dances in order to configure the pidns
     of a procfs mount as desired. Examples include:

     * In order to bypass the mnt_too_revealing() check, Kubernetes
       creates a procfs mount from an empty pidns so that user
       namespaced containers can be nested (without this, the nested
       containers would fail to mount procfs)

       But this requires forking off a helper process because you cannot
       just one-shot this using mount(2)

     * Container runtimes in general need to fork into a container
       before configuring its mounts, which can lead to security issues
       in the case of shared-pidns containers (a privileged process in
       the pidns can interact with your container runtime process)

       While SUID_DUMP_DISABLE and user namespaces make this less of an
       issue, the strict need for this due to a minor uAPI wart is kind
       of unfortunate

       Things would be much easier if there was a way for userspace to
       just specify the pidns they want. So this pull request contains
       changes to implement a new "pidns" argument which can be set
       using fsconfig(2):

           fsconfig(procfd, FSCONFIG_SET_FD, "pidns", NULL, nsfd);
           fsconfig(procfd, FSCONFIG_SET_STRING, "pidns", "/proc/self/ns/pid", 0);

       or classic mount(2) / mount(8):

           // mount -t proc -o pidns=/proc/self/ns/pid proc /tmp/proc
           mount("proc", "/tmp/proc", "proc", MS_..., "pidns=/proc/self/ns/pid");

  Cleanups:

   - Remove the last references to EXPORT_OP_ASYNC_LOCK

   - Make file_remove_privs_flags() static

   - Remove redundant __GFP_NOWARN when GFP_NOWAIT is used

   - Use try_cmpxchg() in start_dir_add()

   - Use try_cmpxchg() in sb_init_done_wq()

   - Replace offsetof() with struct_size() in ioctl_file_dedupe_range()

   - Remove vfs_ioctl() export

   - Replace rwlock() with spinlock in epoll code as rwlock causes
     priority inversion on preempt rt kernels

   - Make ns_entries in fs/proc/namespaces const

   - Use a switch() statement() in init_special_inode() just like we do
     in may_open()

   - Use struct_size() in dir_add() in the initramfs code

   - Use str_plural() in rd_load_image()

   - Replace strcpy() with strscpy() in find_link()

   - Rename generic_delete_inode() to inode_just_drop() and
     generic_drop_inode() to inode_generic_drop()

   - Remove unused arguments from fcntl_{g,s}et_rw_hint()

  Fixes:

   - Document @name parameter for name_contains_dotdot() helper

   - Fix spelling mistake

   - Always return zero from replace_fd() instead of the file descriptor
     number

   - Limit the size for copy_file_range() in compat mode to prevent a
     signed overflow

   - Fix debugfs mount options not being applied

   - Verify the inode mode when loading it from disk in minixfs

   - Verify the inode mode when loading it from disk in cramfs

   - Don't trigger automounts with RESOLVE_NO_XDEV

     If openat2() was called with RESOLVE_NO_XDEV it didn't traverse
     through automounts, but could still trigger them

   - Add FL_RECLAIM flag to show_fl_flags() macro so it appears in
     tracepoints

   - Fix unused variable warning in rd_load_image() on s390

   - Make INITRAMFS_PRESERVE_MTIME depend on BLK_DEV_INITRD

   - Use ns_capable_noaudit() when determining net sysctl permissions

   - Don't call path_put() under namespace semaphore in listmount() and
     statmount()"

* tag 'vfs-6.18-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (38 commits)
  fcntl: trim arguments
  listmount: don't call path_put() under namespace semaphore
  statmount: don't call path_put() under namespace semaphore
  pid: use ns_capable_noaudit() when determining net sysctl permissions
  fs: rename generic_delete_inode() and generic_drop_inode()
  init: INITRAMFS_PRESERVE_MTIME should depend on BLK_DEV_INITRD
  initramfs: Replace strcpy() with strscpy() in find_link()
  initrd: Use str_plural() in rd_load_image()
  initramfs: Use struct_size() helper to improve dir_add()
  initrd: Fix unused variable warning in rd_load_image() on s390
  fs: use the switch statement in init_special_inode()
  fs/proc/namespaces: make ns_entries const
  filelock: add FL_RECLAIM to show_fl_flags() macro
  eventpoll: Replace rwlock with spinlock
  selftests/proc: add tests for new pidns APIs
  procfs: add "pidns" mount option
  pidns: move is-ancestor logic to helper
  openat2: don't trigger automounts with RESOLVE_NO_XDEV
  namei: move cross-device check to __traverse_mounts
  namei: remove LOOKUP_NO_XDEV check from handle_mounts
  ...
2025-09-29 09:03:07 -07:00
Mateusz Guzik
f99b391778
fs: rename generic_delete_inode() and generic_drop_inode()
generic_delete_inode() is rather misleading for what the routine is
doing. inode_just_drop() should be much clearer.

The new naming is inconsistent with generic_drop_inode(), so rename that
one as well with inode_ as the suffix.

No functional changes.

Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-09-15 16:09:42 +02:00
Chen Ridong
3c9ba2777d kernfs: Fix UAF in polling when open file is released
A use-after-free (UAF) vulnerability was identified in the PSI (Pressure
Stall Information) monitoring mechanism:

BUG: KASAN: slab-use-after-free in psi_trigger_poll+0x3c/0x140
Read of size 8 at addr ffff3de3d50bd308 by task systemd/1

psi_trigger_poll+0x3c/0x140
cgroup_pressure_poll+0x70/0xa0
cgroup_file_poll+0x8c/0x100
kernfs_fop_poll+0x11c/0x1c0
ep_item_poll.isra.0+0x188/0x2c0

Allocated by task 1:
cgroup_file_open+0x88/0x388
kernfs_fop_open+0x73c/0xaf0
do_dentry_open+0x5fc/0x1200
vfs_open+0xa0/0x3f0
do_open+0x7e8/0xd08
path_openat+0x2fc/0x6b0
do_filp_open+0x174/0x368

Freed by task 8462:
cgroup_file_release+0x130/0x1f8
kernfs_drain_open_files+0x17c/0x440
kernfs_drain+0x2dc/0x360
kernfs_show+0x1b8/0x288
cgroup_file_show+0x150/0x268
cgroup_pressure_write+0x1dc/0x340
cgroup_file_write+0x274/0x548

Reproduction Steps:
1. Open test/cpu.pressure and establish epoll monitoring
2. Disable monitoring: echo 0 > test/cgroup.pressure
3. Re-enable monitoring: echo 1 > test/cgroup.pressure

The race condition occurs because:
1. When cgroup.pressure is disabled (echo 0 > cgroup.pressure), it:
   - Releases PSI triggers via cgroup_file_release()
   - Frees of->priv through kernfs_drain_open_files()
2. While epoll still holds reference to the file and continues polling
3. Re-enabling (echo 1 > cgroup.pressure) accesses freed of->priv

epolling			disable/enable cgroup.pressure
fd=open(cpu.pressure)
while(1)
...
epoll_wait
kernfs_fop_poll
kernfs_get_active = true	echo 0 > cgroup.pressure
...				cgroup_file_show
				kernfs_show
				// inactive kn
				kernfs_drain_open_files
				cft->release(of);
				kfree(ctx);
				...
kernfs_get_active = false
				echo 1 > cgroup.pressure
				kernfs_show
				kernfs_activate_one(kn);
kernfs_fop_poll
kernfs_get_active = true
cgroup_file_poll
psi_trigger_poll
// UAF
...
end: close(fd)

To address this issue, introduce kernfs_get_active_of() for kernfs open
files to obtain active references. This function will fail if the open file
has been released. Replace kernfs_get_active() with kernfs_get_active_of()
to prevent further operations on released file descriptors.

Fixes: 34f26a1561 ("sched/psi: Per-cgroup PSI accounting disable/re-enable interface")
Cc: stable <stable@kernel.org>
Reported-by: Zhang Zhaotian <zhangzhaotian@huawei.com>
Signed-off-by: Chen Ridong <chenridong@huawei.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20250822070715.1565236-2-chenridong@huaweicloud.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-09-06 20:11:27 +02:00
Christian Brauner
c237aa9884
kernfs: don't fail listing extended attributes
Userspace doesn't expect a failure to list extended attributes:

  $ ls -lA /sys/
  ls: /sys/: No data available
  ls: /sys/kernel: No data available
  ls: /sys/power: No data available
  ls: /sys/class: No data available
  ls: /sys/devices: No data available
  ls: /sys/dev: No data available
  ls: /sys/hypervisor: No data available
  ls: /sys/fs: No data available
  ls: /sys/bus: No data available
  ls: /sys/firmware: No data available
  ls: /sys/block: No data available
  ls: /sys/module: No data available
  total 0
  drwxr-xr-x   2 root root 0 Jan  1  1970 block
  drwxr-xr-x  52 root root 0 Jan  1  1970 bus
  drwxr-xr-x  88 root root 0 Jan  1  1970 class
  drwxr-xr-x   4 root root 0 Jan  1  1970 dev
  drwxr-xr-x  11 root root 0 Jan  1  1970 devices
  drwxr-xr-x   3 root root 0 Jan  1  1970 firmware
  drwxr-xr-x  10 root root 0 Jan  1  1970 fs
  drwxr-xr-x   2 root root 0 Jul  2 09:43 hypervisor
  drwxr-xr-x  14 root root 0 Jan  1  1970 kernel
  drwxr-xr-x 251 root root 0 Jan  1  1970 module
  drwxr-xr-x   3 root root 0 Jul  2 09:43 power

Fix it by simply reporting success when no extended attributes are
available instead of reporting ENODATA.

Link: https://lore.kernel.org/78b13bcdae82ade95e88f315682966051f461dde.camel@linaro.org
Fixes: d1f4e90260 ("kernfs: remove iattr_mutex") # mainline only
Reported-by: André Draszik <andre.draszik@linaro.org>
Link: https://lore.kernel.org/20250819-ahndung-abgaben-524a535f8101@brauner
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-08-19 13:51:28 +02:00
Linus Torvalds
d9104cec3e bpf-next-6.17
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE+soXsSLHKoYyzcli6rmadz2vbToFAmiINnEACgkQ6rmadz2v
 bToBnA/9F+A3R6rTwGk4HK3xpfc/nm2Tanl3oRN7S2ub/mskDOtWSIyG6cVFZ0UG
 1fK6IkByyRIpAF/5qhdlw8drRXHkQtGLA0lP2L9llm4X1mHLofB18y9OeLrDE1WN
 KwNP06+IGX9W802lCGSIXOY+VmRscVfXSMokyQt2ilHplKjOnDqJcYkWupi3T2rC
 mz79FY9aEl2YrIcpj9RXz+8nwP49pZBuW2P0IM5PAIj4BJBXShrUp8T1nz94okNe
 NFsnAyRxjWpUT0McEgtA9WvpD9lZqujYD8Qp0KlGZWmI3vNpV5d9S1+dBcEb1n7q
 dyNMkTF3oRrJhhg4VqoHc6fVpzSEoZ9ZxV5Hx4cs+ganH75D4YbdGqx/7mR3DUgH
 MZh6rHF1pGnK7TAm7h5gl3ZRAOkZOaahbe1i01NKo9CEe5fSh3AqMyzJYoyGHRKi
 xDN39eQdWBNA+hm1VkbK2Bv93Rbjrka2Kj+D3sSSO9Bo/u3ntcknr7LW39idKz62
 Q8dkKHcCEtun7gjk0YXPF013y81nEohj1C+52BmJ2l5JitM57xfr6YOaQpu7DPDE
 AJbHx6ASxKdyEETecd0b+cXUPQ349zmRXy0+CDMAGKpBicC0H0mHhL14cwOY1Hfu
 EIpIjmIJGI3JNF6T5kybcQGSBOYebdV0FFgwSllzPvuYt7YsHCs=
 =/O3j
 -----END PGP SIGNATURE-----

Merge tag 'bpf-next-6.17' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next

Pull bpf updates from Alexei Starovoitov:

 - Remove usermode driver (UMD) framework (Thomas Weißschuh)

 - Introduce Strongly Connected Component (SCC) in the verifier to
   detect loops and refine register liveness (Eduard Zingerman)

 - Allow 'void *' cast using bpf_rdonly_cast() and corresponding
   '__arg_untrusted' for global function parameters (Eduard Zingerman)

 - Improve precision for BPF_ADD and BPF_SUB operations in the verifier
   (Harishankar Vishwanathan)

 - Teach the verifier that constant pointer to a map cannot be NULL
   (Ihor Solodrai)

 - Introduce BPF streams for error reporting of various conditions
   detected by BPF runtime (Kumar Kartikeya Dwivedi)

 - Teach the verifier to insert runtime speculation barrier (lfence on
   x86) to mitigate speculative execution instead of rejecting the
   programs (Luis Gerhorst)

 - Various improvements for 'veristat' (Mykyta Yatsenko)

 - For CONFIG_DEBUG_KERNEL config warn on internal verifier errors to
   improve bug detection by syzbot (Paul Chaignon)

 - Support BPF private stack on arm64 (Puranjay Mohan)

 - Introduce bpf_cgroup_read_xattr() kfunc to read xattr of cgroup's
   node (Song Liu)

 - Introduce kfuncs for read-only string opreations (Viktor Malik)

 - Implement show_fdinfo() for bpf_links (Tao Chen)

 - Reduce verifier's stack consumption (Yonghong Song)

 - Implement mprog API for cgroup-bpf programs (Yonghong Song)

* tag 'bpf-next-6.17' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (192 commits)
  selftests/bpf: Migrate fexit_noreturns case into tracing_failure test suite
  selftests/bpf: Add selftest for attaching tracing programs to functions in deny list
  bpf: Add log for attaching tracing programs to functions in deny list
  bpf: Show precise rejected function when attaching fexit/fmod_ret to __noreturn functions
  bpf: Fix various typos in verifier.c comments
  bpf: Add third round of bounds deduction
  selftests/bpf: Test invariants on JSLT crossing sign
  selftests/bpf: Test cross-sign 64bits range refinement
  selftests/bpf: Update reg_bound range refinement logic
  bpf: Improve bounds when s64 crosses sign boundary
  bpf: Simplify bounds refinement from s32
  selftests/bpf: Enable private stack tests for arm64
  bpf, arm64: JIT support for private stack
  bpf: Move bpf_jit_get_prog_name() to core.c
  bpf, arm64: Fix fp initialization for exception boundary
  umd: Remove usermode driver framework
  bpf/preload: Don't select USERMODE_DRIVER
  selftests/bpf: Fix test dynptr/test_dynptr_memset_xdp_chunks failure
  selftests/bpf: Fix test dynptr/test_dynptr_copy_xdp failure
  selftests/bpf: Increase xdp data size for arm64 64K page size
  ...
2025-07-30 09:58:50 -07:00
Linus Torvalds
7e7bc8335b vfs-6.17-rc1.bpf
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCaINCjwAKCRCRxhvAZXjc
 osnVAQCv4rM7sF4yJvGlm1myIJcJy5Sabk2q31qMdI1VHmkcOwD+Mxs7d1aByTS8
 /6djhVleq6lcT2LpP9j8YI3Rb+x30QY=
 =PF3o
 -----END PGP SIGNATURE-----

Merge tag 'vfs-6.17-rc1.bpf' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull vfs bpf updates from Christian Brauner:
 "These changes allow bpf to read extended attributes from cgroupfs.

  This is useful in redirecting AF_UNIX socket connections based on
  cgroup membership of the socket. One use-case is the ability to
  implement log namespaces in systemd so services and containers are
  redirected to different journals"

* tag 'vfs-6.17-rc1.bpf' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  selftests/kernfs: test xattr retrieval
  selftests/bpf: Add tests for bpf_cgroup_read_xattr
  bpf: Mark cgroup_subsys_state->cgroup RCU safe
  bpf: Introduce bpf_cgroup_read_xattr to read xattr of cgroup's node
  kernfs: remove iattr_mutex
2025-07-28 14:42:31 -07:00
Christian Brauner
fb7b30cb0e
kernfs: remove iattr_mutex
All allocations of struct kernfs_iattrs are serialized through a global
mutex. Simply do a racy allocation and let the first one win. I bet most
callers are under inode->i_rwsem anyway and it wouldn't be needed but
let's not require that.

Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/20250623063854.1896364-2-song@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-07-02 14:18:20 +02:00
Christian Brauner
d1f4e90260
kernfs: remove iattr_mutex
All allocations of struct kernfs_iattrs are serialized through a global
mutex. Simply do a racy allocation and let the first one win. I bet most
callers are under inode->i_rwsem anyway and it wouldn't be needed but
let's not require that.

Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/20250623063854.1896364-2-song@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-06-23 13:03:12 +02:00
Al Viro
05fb0e6664 new helper: set_default_d_op()
... to be used instead of manually assigning to ->s_d_op.
All in-tree filesystem converted (and field itself is renamed,
so any out-of-tree ones in need of conversion will be caught
by compiler).

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-06-10 22:21:16 -04:00
Linus Torvalds
9d230d500b Driver core changes for 6.16-rc1
Here are the driver core / kernfs changes for 6.16-rc1.
 
 Not a huge number of changes this development cycle, here's the summary
 of what is included in here:
   - kernfs locking tweaks, pushing some global locks down into a per-fs
     image lock
   - rust driver core and pci device bindings added for new features.
   - sysfs const work for bin_attributes.  This churn should now be
     completed for those types of attributes
   - auxbus device creation helpers added
   - fauxbus fix for creating sysfs files after the probe completed
     properly
   - other tiny updates for driver core things.
 
 All of these have been in linux-next for over a week with no reported
 issues.
 
 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
 -----BEGIN PGP SIGNATURE-----
 
 iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCaDbe+g8cZ3JlZ0Brcm9h
 aC5jb20ACgkQMUfUDdst+ylYbACgl/MngU9pRnx5jZIQh6bWveFSeo8AnRE4U5x0
 X+lgTPjGKL1RrV3C5HJp
 =+0BA
 -----END PGP SIGNATURE-----

Merge tag 'driver-core-6.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/driver-core/driver-core

Pull driver core updates from Greg KH:
 "Here are the driver core / kernfs changes for 6.16-rc1.

  Not a huge number of changes this development cycle, here's the
  summary of what is included in here:

   - kernfs locking tweaks, pushing some global locks down into a per-fs
     image lock

   - rust driver core and pci device bindings added for new features.

   - sysfs const work for bin_attributes.

     The final churn of switching away from and removing the
     transitional struct members, "read_new", "write_new" and
     "bin_attrs_new" will come after the merge window to avoid
     unnecesary merge conflicts.

   - auxbus device creation helpers added

   - fauxbus fix for creating sysfs files after the probe completed
     properly

   - other tiny updates for driver core things.

  All of these have been in linux-next for over a week with no reported
  issues"

* tag 'driver-core-6.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/driver-core/driver-core:
  kernfs: Relax constraint in draining guard
  Documentation: embargoed-hardware-issues.rst: Remove myself
  drivers: hv: fix up const issue with vmbus_chan_bin_attrs
  firmware_loader: use SHA-256 library API instead of crypto_shash API
  docs: debugfs: do not recommend debugfs_remove_recursive
  PM: wakeup: Do not expose 4 device wakeup source APIs
  kernfs: switch global kernfs_rename_lock to per-fs lock
  kernfs: switch global kernfs_idr_lock to per-fs lock
  driver core: auxiliary bus: Fix IS_ERR() vs NULL mixup in __devm_auxiliary_device_create()
  sysfs: constify attribute_group::bin_attrs
  sysfs: constify bin_attribute argument of bin_attribute::read/write()
  software node: Correct a OOB check in software_node_get_reference_args()
  devres: simplify devm_kstrdup() using devm_kmemdup()
  platform: replace magic number with macro PLATFORM_DEVID_NONE
  component: do not try to unbind unbound components
  driver core: auxiliary bus: add device creation helpers
  driver core: faux: Add sysfs groups after probing
2025-05-29 09:11:39 -07:00
Linus Torvalds
8dd53535f1 vfs-6.16-rc1.super
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCaDBPTwAKCRCRxhvAZXjc
 oi3BAQD/IBxTbAZIe7vEAsuLlBoKbWrzPGvxzd4UeMGo6OY18wEAvvyJM+arQy51
 jS0ZErDOJnPNe7jps+Gh+WDx6d3NMAY=
 =lqAG
 -----END PGP SIGNATURE-----

Merge tag 'vfs-6.16-rc1.super' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull vfs freezing updates from Christian Brauner:
 "This contains various filesystem freezing related work for this cycle:

   - Allow the power subsystem to support filesystem freeze for suspend
     and hibernate.

     Now all the pieces are in place to actually allow the power
     subsystem to freeze/thaw filesystems during suspend/resume.
     Filesystems are only frozen and thawed if the power subsystem does
     actually own the freeze.

     If the filesystem is already frozen by the time we've frozen all
     userspace processes we don't care to freeze it again. That's
     userspace's job once the process resumes. We only actually freeze
     filesystems if we absolutely have to and we ignore other failures
     to freeze.

     We could bubble up errors and fail suspend/resume if the error
     isn't EBUSY (aka it's already frozen) but I don't think that this
     is worth it. Filesystem freezing during suspend/resume is
     best-effort. If the user has 500 ext4 filesystems mounted and 4
     fail to freeze for whatever reason then we simply skip them.

     What we have now is already a big improvement and let's see how we
     fare with it before making our lives even harder (and uglier) than
     we have to.

   - Allow efivars to support freeze and thaw

     Allow efivarfs to partake to resync variable state during system
     hibernation and suspend. Add freeze/thaw support.

     This is a pretty straightforward implementation. We simply add
     regular freeze/thaw support for both userspace and the kernel.
     efivars is the first pseudofilesystem that adds support for
     filesystem freezing and thawing.

     The simplicity comes from the fact that we simply always resync
     variable state after efivarfs has been frozen. It doesn't matter
     whether that's because of suspend, userspace initiated freeze or
     hibernation. Efivars is simple enough that it doesn't matter that
     we walk all dentries. There are no directories and there aren't
     insane amounts of entries and both freeze/thaw are already
     heavy-handed operations. If userspace initiated a freeze/thaw cycle
     they would need CAP_SYS_ADMIN in the initial user namespace (as
     that's where efivarfs is mounted) so it can't be triggered by
     random userspace. IOW, we really really don't care"

* tag 'vfs-6.16-rc1.super' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  f2fs: fix freezing filesystem during resize
  kernfs: add warning about implementing freeze/thaw
  efivarfs: support freeze/thaw
  power: freeze filesystems during suspend/resume
  libfs: export find_next_child()
  super: add filesystem freezing helpers for suspend and hibernate
  gfs2: pass through holder from the VFS for freeze/thaw
  super: use common iterator (Part 2)
  super: use a common iterator (Part 1)
  super: skip dying superblocks early
  super: simplify user_get_super()
  super: remove pointless s_root checks
  fs: allow all writers to be frozen
  locking/percpu-rwsem: add freezable alternative to down_read
2025-05-26 09:33:44 -07:00
Michal Koutný
071d8e4c2a kernfs: Relax constraint in draining guard
The active reference lifecycle provides the break/unbreak mechanism but
the active reference is not truly active after unbreak -- callers don't
use it afterwards but it's important for proper pairing of kn->active
counting. Assuming this mechanism is in place, the WARN check in
kernfs_should_drain_open_files() is too sensitive -- it may transiently
catch those (rightful) callers between
kernfs_unbreak_active_protection() and kernfs_put_active() as found out by Chen
Ridong:

	kernfs_remove_by_name_ns	kernfs_get_active // active=1
	__kernfs_remove					  // active=0x80000002
	kernfs_drain			...
	wait_event
	//waiting (active == 0x80000001)
					kernfs_break_active_protection
					// active = 0x80000001
	// continue
					kernfs_unbreak_active_protection
					// active = 0x80000002
	...
	kernfs_should_drain_open_files
	// warning occurs
					kernfs_put_active

To avoid the false positives (mind panic_on_warn) remove the check altogether.
(This is meant as quick fix, I think active reference break/unbreak may be
simplified with larger rework.)

Fixes: bdb2fd7fc5 ("kernfs: Skip kernfs_drain_open_files() more aggressively")
Link: https://lore.kernel.org/r/kmmrseckjctb4gxcx2rdminrjnq2b4ipf7562nvfd432ld5v5m@2byj5eedkb2o/

Cc: Chen Ridong <chenridong@huawei.com>
Signed-off-by: Michal Koutný <mkoutny@suse.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20250505121201.879823-1-mkoutny@suse.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-05-21 14:23:13 +02:00
Christian Brauner
ef2ed04eba
kernfs: add warning about implementing freeze/thaw
Sysfs is built on top of kernfs and sysfs provides the power management
infrastructure to support suspend/hibernate by writing to various files
in /sys/power/. As filesystems may be automatically frozen during
suspend/hibernate implementing freeze/thaw support for kernfs
generically will cause deadlocks as the suspending/hibernation
initiating task will hold a VFS lock that it will then wait upon to be
released. If freeze/thaw for kernfs is needed talk to the VFS.

Link: https://lore.kernel.org/r/20250402-work-freeze-v2-4-6719a97b52ac@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-05-09 12:41:23 +02:00
Jinliang Zheng
93b27a845e kernfs: switch global kernfs_rename_lock to per-fs lock
The kernfs implementation has big lock granularity(kernfs_rename_lock) so
every kernfs-based(e.g., sysfs, cgroup) fs are able to compete the lock.

This patch switches the global kernfs_rename_lock to per-fs lock, which
put the rwlock into kernfs_root.

Signed-off-by: Jinliang Zheng <alexjlzheng@tencent.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20250415153659.14950-3-alexjlzheng@tencent.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-04-25 16:01:57 +02:00
Jinliang Zheng
cec59c440a kernfs: switch global kernfs_idr_lock to per-fs lock
The kernfs implementation has big lock granularity(kernfs_idr_lock) so
every kernfs-based(e.g., sysfs, cgroup) fs are able to compete the lock.

This patch switches the global kernfs_idr_lock to per-fs lock, which
put the spinlock into kernfs_root.

Signed-off-by: Jinliang Zheng <alexjlzheng@tencent.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20250415153659.14950-2-alexjlzheng@tencent.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-04-25 16:01:56 +02:00
NeilBrown
fa6fe07d15
VFS: rename lookup_one_len family to lookup_noperm and remove permission check
The lookup_one_len family of functions is (now) only used internally by
a filesystem on itself either
- in a context where permission checking is irrelevant such as by a
  virtual filesystem populating itself, or xfs accessing its ORPHANAGE
  or dquota accessing the quota file; or
- in a context where a permission check (MAY_EXEC on the parent) has just
  been performed such as a network filesystem finding in "silly-rename"
  file in the same directory.  This is also the context after the
  _parentat() functions where currently lookup_one_qstr_excl() is used.

So the permission check is pointless.

The name "one_len" is unhelpful in understanding the purpose of these
functions and should be changed.  Most of the callers pass the len as
"strlen()" so using a qstr and QSTR() can simplify the code.

This patch renames these functions (include lookup_positive_unlocked()
which is part of the family despite the name) to have a name based on
"lookup_noperm".  They are changed to receive a 'struct qstr' instead
of separate name and len.  In a few cases the use of QSTR() results in a
new call to strlen().

try_lookup_noperm() takes a pointer to a qstr instead of the whole
qstr.  This is consistent with d_hash_and_lookup() (which is nearly
identical) and useful for lookup_noperm_unlocked().

The new lookup_noperm_common() doesn't take a qstr yet.  That will be
tidied up in a subsequent patch.

Signed-off-by: NeilBrown <neil@brown.name>
Link: https://lore.kernel.org/r/20250319031545.2999807-5-neil@brown.name
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-04-08 11:24:36 +02:00
Linus Torvalds
2cd5769fb0 Driver core updates for 6.15-rc1
Here is the big set of driver core updates for 6.15-rc1.  Lots of stuff
 happened this development cycle, including:
   - kernfs scaling changes to make it even faster thanks to rcu
   - bin_attribute constify work in many subsystems
   - faux bus minor tweaks for the rust bindings
   - rust binding updates for driver core, pci, and platform busses,
     making more functionaliy available to rust drivers.  These are all
     due to people actually trying to use the bindings that were in 6.14.
   - make Rafael and Danilo full co-maintainers of the driver core
     codebase
   - other minor fixes and updates.
 
 This has been in linux-next for a while now, with the only reported
 issue being some merge conflicts with the rust tree.  Depending on which
 tree you pull first, you will have conflicts in one of them.  The merge
 resolution has been in linux-next as an example of what to do, or can be
 found here:
 	https://lore.kernel.org/r/CANiq72n3Xe8JcnEjirDhCwQgvWoE65dddWecXnfdnbrmuah-RQ@mail.gmail.com
 
 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
 -----BEGIN PGP SIGNATURE-----
 
 iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCZ+mMrg8cZ3JlZ0Brcm9h
 aC5jb20ACgkQMUfUDdst+ylRgwCdH58OE3BgL0uoFY5vFImStpmPtqUAoL5HpVWI
 jtbJ+UuXGsnmO+JVNBEv
 =gy6W
 -----END PGP SIGNATURE-----

Merge tag 'driver-core-6.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core

Pull driver core updatesk from Greg KH:
 "Here is the big set of driver core updates for 6.15-rc1. Lots of stuff
  happened this development cycle, including:

   - kernfs scaling changes to make it even faster thanks to rcu

   - bin_attribute constify work in many subsystems

   - faux bus minor tweaks for the rust bindings

   - rust binding updates for driver core, pci, and platform busses,
     making more functionaliy available to rust drivers. These are all
     due to people actually trying to use the bindings that were in
     6.14.

   - make Rafael and Danilo full co-maintainers of the driver core
     codebase

   - other minor fixes and updates"

* tag 'driver-core-6.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (52 commits)
  rust: platform: require Send for Driver trait implementers
  rust: pci: require Send for Driver trait implementers
  rust: platform: impl Send + Sync for platform::Device
  rust: pci: impl Send + Sync for pci::Device
  rust: platform: fix unrestricted &mut platform::Device
  rust: pci: fix unrestricted &mut pci::Device
  rust: device: implement device context marker
  rust: pci: use to_result() in enable_device_mem()
  MAINTAINERS: driver core: mark Rafael and Danilo as co-maintainers
  rust/kernel/faux: mark Registration methods inline
  driver core: faux: only create the device if probe() succeeds
  rust/faux: Add missing parent argument to Registration::new()
  rust/faux: Drop #[repr(transparent)] from faux::Registration
  rust: io: fix devres test with new io accessor functions
  rust: io: rename `io::Io` accessors
  kernfs: Move dput() outside of the RCU section.
  efi: rci2: mark bin_attribute as __ro_after_init
  rapidio: constify 'struct bin_attribute'
  firmware: qemu_fw_cfg: constify 'struct bin_attribute'
  powerpc/perf/hv-24x7: Constify 'struct bin_attribute'
  ...
2025-04-01 11:02:03 -07:00
NeilBrown
88d5baf690
Change inode_operations.mkdir to return struct dentry *
Some filesystems, such as NFS, cifs, ceph, and fuse, do not have
complete control of sequencing on the actual filesystem (e.g.  on a
different server) and may find that the inode created for a mkdir
request already exists in the icache and dcache by the time the mkdir
request returns.  For example, if the filesystem is mounted twice the
directory could be visible on the other mount before it is on the
original mount, and a pair of name_to_handle_at(), open_by_handle_at()
calls could instantiate the directory inode with an IS_ROOT() dentry
before the first mkdir returns.

This means that the dentry passed to ->mkdir() may not be the one that
is associated with the inode after the ->mkdir() completes.  Some
callers need to interact with the inode after the ->mkdir completes and
they currently need to perform a lookup in the (rare) case that the
dentry is no longer hashed.

This lookup-after-mkdir requires that the directory remains locked to
avoid races.  Planned future patches to lock the dentry rather than the
directory will mean that this lookup cannot be performed atomically with
the mkdir.

To remove this barrier, this patch changes ->mkdir to return the
resulting dentry if it is different from the one passed in.
Possible returns are:
  NULL - the directory was created and no other dentry was used
  ERR_PTR() - an error occurred
  non-NULL - this other dentry was spliced in

This patch only changes file-systems to return "ERR_PTR(err)" instead of
"err" or equivalent transformations.  Subsequent patches will make
further changes to some file-systems to return a correct dentry.

Not all filesystems reliably result in a positive hashed dentry:

- NFS, cifs, hostfs will sometimes need to perform a lookup of
  the name to get inode information.  Races could result in this
  returning something different. Note that this lookup is
  non-atomic which is what we are trying to avoid.  Placing the
  lookup in filesystem code means it only happens when the filesystem
  has no other option.
- kernfs and tracefs leave the dentry negative and the ->revalidate
  operation ensures that lookup will be called to correctly populate
  the dentry.  This could be fixed but I don't think it is important
  to any of the users of vfs_mkdir() which look at the dentry.

The recommendation to use
    d_drop();d_splice_alias()
is ugly but fits with current practice.  A planned future patch will
change this.

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: NeilBrown <neilb@suse.de>
Link: https://lore.kernel.org/r/20250227013949.536172-2-neilb@suse.de
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-02-27 20:00:17 +01:00
Sebastian Andrzej Siewior
c5020c5be9 kernfs: Move dput() outside of the RCU section.
Al Viro pointed out that dput() might sleep and must not be invoked
within an RCU section.

Keep only find_next_ancestor() winthin the RCU section.
Correct the wording in the comment.

Fixes: 6ef5b6fae3 ("kernfs: Drop kernfs_rwsem while invoking lookup_positive_unlocked().")
Reported-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Link: https://lore.kernel.org/r/20250221084232.xksA_IQ4@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-02-21 11:55:15 +01:00
Sebastian Andrzej Siewior
6ef5b6fae3 kernfs: Drop kernfs_rwsem while invoking lookup_positive_unlocked().
syzbot reported two warnings:
- kernfs_node::name was accessed outside of a RCU section so it created
  warning. The kernfs_rwsem was held so it was okay but it wasn't seen.

- While kernfs_rwsem was held invoked lookup_positive_unlocked()->
  kernfs_dop_revalidate() which acquired kernfs_rwsem.

kernfs_rwsem was both acquired as a read lock so it can be acquired
twice. However if a writer acquires the lock after the first reader then
neither the writer nor the second reader can obtain the lock so it
deadlocks.

The reason for the lock is to ensure that kernfs_node::name remain
stable during lookup_positive_unlocked()'s invocation. The function can
not be invoked within a RCU section because it may sleep.

Make a temporary copy of the kernfs_node::name under the lock so
GFP_KERNEL can be used and use this instead.

Reported-by: syzbot+ecccecbc636b455f9084@syzkaller.appspotmail.com
Fixes: 5b2fabf7fe ("kernfs: Acquire kernfs_rwsem in kernfs_node_dentry().")
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20250218163938.xmvjlJ0K@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-02-19 17:07:41 +01:00
Sebastian Andrzej Siewior
741c10b096 kernfs: Use RCU to access kernfs_node::name.
Using RCU lifetime rules to access kernfs_node::name can avoid the
trouble with kernfs_rename_lock in kernfs_name() and kernfs_path_from_node()
if the fs was created with KERNFS_ROOT_INVARIANT_PARENT. This is usefull
as it allows to implement kernfs_path_from_node() only with RCU
protection and avoiding kernfs_rename_lock. The lock is only required if
the __parent node can be changed and the function requires an unchanged
hierarchy while it iterates from the node to its parent.
The change is needed to allow the lookup of the node's path
(kernfs_path_from_node()) from context which runs always with disabled
preemption and or interrutps even on PREEMPT_RT. The problem is that
kernfs_rename_lock becomes a sleeping lock on PREEMPT_RT.

I went through all ::name users and added the required access for the lookup
with a few extensions:
- rdtgroup_pseudo_lock_create() drops all locks and then uses the name
  later on. resctrl supports rename with different parents. Here I made
  a temporal copy of the name while it is used outside of the lock.

- kernfs_rename_ns() accepts NULL as new_parent. This simplifies
  sysfs_move_dir_ns() where it can set NULL in order to reuse the current
  name.

- kernfs_rename_ns() is only using kernfs_rename_lock if the parents are
  different. All users use either kernfs_rwsem (for stable path view) or
  just RCU for the lookup. The ::name uses always RCU free.

Use RCU lifetime guarantees to access kernfs_node::name.

Suggested-by: Tejun Heo <tj@kernel.org>
Acked-by: Tejun Heo <tj@kernel.org>
Reported-by: syzbot+6ea37e2e6ffccf41a7e6@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/lkml/67251dc6.050a0220.529b6.015e.GAE@google.com/
Reported-by: Hillf Danton <hdanton@sina.com>
Closes: https://lore.kernel.org/20241102001224.2789-1-hdanton@sina.com
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Link: https://lore.kernel.org/r/20250213145023.2820193-7-bigeasy@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-02-15 17:46:32 +01:00
Sebastian Andrzej Siewior
633488947e kernfs: Use RCU to access kernfs_node::parent.
kernfs_rename_lock is used to obtain stable kernfs_node::{name|parent}
pointer. This is a preparation to access kernfs_node::parent under RCU
and ensure that the pointer remains stable under the RCU lifetime
guarantees.

For a complete path, as it is done in kernfs_path_from_node(), the
kernfs_rename_lock is still required in order to obtain a stable parent
relationship while computing the relevant node depth. This must not
change while the nodes are inspected in order to build the path.
If the kernfs user never moves the nodes (changes the parent) then the
kernfs_rename_lock is not required and the RCU guarantees are
sufficient. This "restriction" can be set with
KERNFS_ROOT_INVARIANT_PARENT. Otherwise the lock is required.

Rename kernfs_node::parent to kernfs_node::__parent to denote the RCU
access and use RCU accessor while accessing the node.
Make cgroup use KERNFS_ROOT_INVARIANT_PARENT since the parent here can
not change.

Acked-by: Tejun Heo <tj@kernel.org>
Cc: Yonghong Song <yonghong.song@linux.dev>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Link: https://lore.kernel.org/r/20250213145023.2820193-6-bigeasy@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-02-15 17:46:32 +01:00
Sebastian Andrzej Siewior
9aab10a024 kernfs: Don't re-lock kernfs_root::kernfs_rwsem in kernfs_fop_readdir().
The readdir operation iterates over all entries and invokes dir_emit()
for every entry passing kernfs_node::name as argument.
Since the name argument can change, and become invalid, the
kernfs_root::kernfs_rwsem lock should not be dropped to prevent renames
during the operation.

The lock drop around dir_emit() has been initially introduced in commit
   1e5289c97b ("sysfs: Cache the last sysfs_dirent to improve readdir scalability v2")

to avoid holding a global lock during a page fault. The lock drop is
wrong since the support of renames and not a big burden since the lock
is no longer global.

Don't re-acquire kernfs_root::kernfs_rwsem while copying the name to the
userpace buffer.

Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Link: https://lore.kernel.org/r/20250213145023.2820193-5-bigeasy@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-02-15 17:46:32 +01:00
Sebastian Andrzej Siewior
5b2fabf7fe kernfs: Acquire kernfs_rwsem in kernfs_node_dentry().
kernfs_node_dentry() passes kernfs_node::name to
lookup_positive_unlocked().

Acquire kernfs_root::kernfs_rwsem to ensure the node is not renamed
during the operation.

Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Link: https://lore.kernel.org/r/20250213145023.2820193-4-bigeasy@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-02-15 17:46:32 +01:00
Sebastian Andrzej Siewior
122ab92dee kernfs: Acquire kernfs_rwsem in kernfs_get_parent_dentry().
kernfs_get_parent_dentry() passes kernfs_node::parent to
kernfs_get_inode().

Acquire kernfs_root::kernfs_rwsem to ensure kernfs_node::parent isn't
replaced during the operation.

Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Link: https://lore.kernel.org/r/20250213145023.2820193-3-bigeasy@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-02-15 17:46:32 +01:00
Sebastian Andrzej Siewior
400188ae36 kernfs: Acquire kernfs_rwsem in kernfs_notify_workfn().
kernfs_notify_workfn() dereferences kernfs_node::name and passes it
later to fsnotify(). If the node is renamed then the previously observed
name pointer becomes invalid.

Acquire kernfs_root::kernfs_rwsem to block renames of the node.

Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Link: https://lore.kernel.org/r/20250213145023.2820193-2-bigeasy@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-02-15 17:46:32 +01:00
Linus Torvalds
a86bf2283d assorted stuff for this merge window
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQQqUNBr3gm4hGXdBJlZ7Krx/gZQ6wUCZ5yJdgAKCRBZ7Krx/gZQ
 69W4AQDwgxceiQ6icx3rFhCWQigne4jdMO84kd8tNaa+xHGe1AD/WnkeChc5DqjQ
 wZWZxAAzml9SS01IcSiHWaF5fgrjlA0=
 =rXOq
 -----END PGP SIGNATURE-----

Merge tag 'pull-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs

Pull misc vfs cleanups from Al Viro:
 "Two unrelated patches - one is a removal of long-obsolete include in
  overlayfs (it used to need fs/internal.h, but the extern it wanted has
  been moved back to include/linux/namei.h) and another introduces
  convenience helper constructing struct qstr by a NUL-terminated
  string"

* tag 'pull-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  add a string-to-qstr constructor
  fs/overlayfs/namei.c: get rid of include ../internal.h
2025-02-01 15:07:56 -08:00
Al Viro
c1feab95e0 add a string-to-qstr constructor
Quite a few places want to build a struct qstr by given string;
it would be convenient to have a primitive doing that, rather
than open-coding it via QSTR_INIT().

The closest approximation was in bcachefs, but that expands to
initializer list - {.len = strlen(string), .name = string}.
It would be more useful to have it as compound literal -
(struct qstr){.len = strlen(string), .name = string}.

Unlike initializer list it's a valid expression.  What's more,
it's a valid lvalue - it's an equivalent of anonymous local
variable with such initializer, so the things like
	path->dentry = d_alloc_pseudo(mnt->mnt_sb, &QSTR(name));
are valid.  It can also be used as initializer, with identical
effect -
	struct qstr x = (struct qstr){.name = s, .len = strlen(s)};
is equivalent to
	struct qstr anon_variable = {.name = s, .len = strlen(s)};
	struct qstr x = anon_variable;
	// anon_variable is never used after that point
and any even remotely sane compiler will manage to collapse that
into
	struct qstr x = {.name = s, .len = strlen(s)};

What compound literals can't be used for is initialization of
global variables, but those are covered by QSTR_INIT().

This commit lifts definition(s) of QSTR() into linux/dcache.h,
converts it to compound literal (all bcachefs users are fine
with that) and converts assorted open-coded instances to using
that.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-01-27 19:25:45 -05:00
Al Viro
5be1fa8abd Pass parent directory inode and expected name to ->d_revalidate()
->d_revalidate() often needs to access dentry parent and name; that has
to be done carefully, since the locking environment varies from caller
to caller.  We are not guaranteed that dentry in question will not be
moved right under us - not unless the filesystem is such that nothing
on it ever gets renamed.

It can be dealt with, but that results in boilerplate code that isn't
even needed - the callers normally have just found the dentry via dcache
lookup and want to verify that it's in the right place; they already
have the values of ->d_parent and ->d_name stable.  There is a couple
of exceptions (overlayfs and, to less extent, ecryptfs), but for the
majority of calls that song and dance is not needed at all.

It's easier to make ecryptfs and overlayfs find and pass those values if
there's a ->d_revalidate() instance to be called, rather than doing that
in the instances.

This commit only changes the calling conventions; making use of supplied
values is left to followups.

NOTE: some instances need more than just the parent - things like CIFS
may need to build an entire path from filesystem root, so they need
more precautions than the usual boilerplate.  This series doesn't
do anything to that need - these filesystems have to keep their locking
mechanisms (rename_lock loops, use of dentry_path_raw(), private rwsem
a-la v9fs).

One thing to keep in mind when using name is that name->name will normally
point into the pathname being resolved; the filename in question occupies
name->len bytes starting at name->name, and there is NUL somewhere after it,
but it the next byte might very well be '/' rather than '\0'.  Do not
ignore name->len.

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Gabriel Krisman Bertazi <gabriel@krisman.be>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-01-27 19:25:23 -05:00
Li zeming
75cde4e37a kernfs: mount: Remove unnecessary ‘NULL’ values from knparent
knparent is assigned first, so it does not need to initialize the
assignment.

Signed-off-by: Li zeming <zeming@nfschina.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20240415102009.9926-1-zeming@nfschina.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-05-04 19:02:39 +02:00
Amir Goldstein
16b52bbee4 kernfs: annotate different lockdep class for of->mutex of writable files
The writable file /sys/power/resume may call vfs lookup helpers for
arbitrary paths and readonly files can be read by overlayfs from vfs
helpers when sysfs is a lower layer of overalyfs.

To avoid a lockdep warning of circular dependency between overlayfs
inode lock and kernfs of->mutex, use a different lockdep class for
writable and readonly kernfs files.

Reported-by: syzbot+9a5b0ced8b1bfb238b56@syzkaller.appspotmail.com
Fixes: 0fedefd4c4 ("kernfs: sysfs: support custom llseek method for sysfs entries")
Suggested-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-04-14 06:55:46 -04:00
Linus Torvalds
241590e5a1 Driver core changes for 6.9-rc1
Here is the "big" set of driver core and kernfs changes for 6.9-rc1.
 
 Nothing all that crazy here, just some good updates that include:
   - automatic attribute group hiding from Dan Williams (he fixed up my
     horrible attempt at doing this.)
   - kobject lock contention fixes from Eric Dumazet
   - driver core cleanups from Andy
   - kernfs rcu work from Tejun
   - fw_devlink changes to resolve some reported issues
   - other minor changes, all details in the shortlog
 
 All of these have been in linux-next for a long time with no reported
 issues.
 
 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
 -----BEGIN PGP SIGNATURE-----
 
 iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCZfwsHg8cZ3JlZ0Brcm9h
 aC5jb20ACgkQMUfUDdst+ynT4ACePcNRAsYrINlOPPKPHimJtyP01yEAn0pZYnj2
 0/UpqIqf3HVPu7zsLKTa
 =vR9S
 -----END PGP SIGNATURE-----

Merge tag 'driver-core-6.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core

Pull driver core updates from Greg KH:
 "Here is the "big" set of driver core and kernfs changes for 6.9-rc1.

  Nothing all that crazy here, just some good updates that include:

   - automatic attribute group hiding from Dan Williams (he fixed up my
     horrible attempt at doing this.)

   - kobject lock contention fixes from Eric Dumazet

   - driver core cleanups from Andy

   - kernfs rcu work from Tejun

   - fw_devlink changes to resolve some reported issues

   - other minor changes, all details in the shortlog

  All of these have been in linux-next for a long time with no reported
  issues"

* tag 'driver-core-6.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (28 commits)
  device: core: Log warning for devices pending deferred probe on timeout
  driver: core: Use dev_* instead of pr_* so device metadata is added
  driver: core: Log probe failure as error and with device metadata
  of: property: fw_devlink: Add support for "post-init-providers" property
  driver core: Add FWLINK_FLAG_IGNORE to completely ignore a fwnode link
  driver core: Adds flags param to fwnode_link_add()
  debugfs: fix wait/cancellation handling during remove
  device property: Don't use "proxy" headers
  device property: Move enum dev_dma_attr to fwnode.h
  driver core: Move fw_devlink stuff to where it belongs
  driver core: Drop unneeded 'extern' keyword in fwnode.h
  firmware_loader: Suppress warning on FW_OPT_NO_WARN flag
  sysfs:Addresses documentation in sysfs_merge_group and sysfs_unmerge_group.
  firmware_loader: introduce __free() cleanup hanler
  platform-msi: Remove usage of the deprecated ida_simple_xx() API
  sysfs: Introduce DEFINE_SIMPLE_SYSFS_GROUP_VISIBLE()
  sysfs: Document new "group visible" helpers
  sysfs: Fix crash on empty group attributes array
  sysfs: Introduce a mechanism to hide static attribute_groups
  sysfs: Introduce a mechanism to hide static attribute_groups
  ...
2024-03-21 13:34:15 -07:00
Kent Overstreet
a4af51ce22
fs: super_set_uuid()
Some weird old filesytems have UUID-like things that we wish to expose
as UUIDs, but are smaller; add a length field so that the new
FS_IOC_(GET|SET)UUID ioctls can handle them in generic code.

And add a helper super_set_uuid(), for setting nonstandard length uuids.

Helper is now required for the new FS_IOC_GETUUID ioctl; if
super_set_uuid() hasn't been called, the ioctl won't be supported.

Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Link: https://lore.kernel.org/r/20240207025624.1019754-2-kent.overstreet@linux.dev
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-02-08 21:19:59 +01:00