Commit Graph

389 Commits

Author SHA1 Message Date
Hristo Venev
544576f0f0 ceph: put folios not suitable for writeback
The batch holds references to the folios (see `filemap_get_folios`,
`folio_batch_release`), so we need to `folio_put` the folios we remove.

Tested on v6.18.

Cc: stable@vger.kernel.org
Link: https://tracker.ceph.com/issues/74156
Signed-off-by: Hristo Venev <hristo@venev.name>
Reviewed-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2026-05-11 10:39:22 +02:00
Linus Torvalds
ac2dc6d574 We have a series from Alex which extends CephFS client metrics with
support for per-subvolume data I/O performance and latency tracking
 (metadata operations aren't included) and a good variety of fixes and
 cleanups across RBD and CephFS.
 -----BEGIN PGP SIGNATURE-----
 
 iQFHBAABCgAxFiEEydHwtzie9C7TfviiSn/eOAIR84sFAmnrq1YTHGlkcnlvbW92
 QGdtYWlsLmNvbQAKCRBKf944AhHzi+WFCACA2Yc6oj6W4yXX2LSGCFCN3FOanSb3
 6ZvPeSrmAALzwD9ZXdef6j50An6w05P7kXKmAyKTgmW2tpiRciJs6uT6y7By/aph
 uGZCaPoJWPDvTlo8d05MAVuyfoKH5eU8pwx2YiEMN5W6kfo7VJQze6BgLbvQt7yH
 ToIzzBLifYONH4vF3nfsHj/uCr38Cbpr6GWY8LIPo8QtInWKJTcwF7HWVVicCaMs
 yqf1t+/CWzlIsnnIQtp+aSxWlpoA5lAqWxGt3jSfd3eVTCAL8eDzw5fkbGMRJYgM
 paH3kZ+LuJWkRXe2ts/RMrXWLJF3ZWOVD6sWU6sfnXf+vBe4SkiwwcUt
 =Ooc5
 -----END PGP SIGNATURE-----

Merge tag 'ceph-for-7.1-rc1' of https://github.com/ceph/ceph-client

Pull ceph updates from Ilya Dryomov:
 "We have a series from Alex which extends CephFS client metrics with
  support for per-subvolume data I/O performance and latency tracking
  (metadata operations aren't included) and a good variety of fixes and
  cleanups across RBD and CephFS"

* tag 'ceph-for-7.1-rc1' of https://github.com/ceph/ceph-client:
  ceph: add subvolume metrics collection and reporting
  ceph: parse subvolume_id from InodeStat v9 and store in inode
  ceph: handle InodeStat v8 versioned field in reply parsing
  libceph: Fix slab-out-of-bounds access in auth message processing
  rbd: fix null-ptr-deref when device_add_disk() fails
  crush: cleanup in crush_do_rule() method
  ceph: clear s_cap_reconnect when ceph_pagelist_encode_32() fails
  ceph: only d_add() negative dentries when they are unhashed
  libceph: update outdated comment in ceph_sock_write_space()
  libceph: Remove obsolete session key alignment logic
  ceph: fix num_ops off-by-one when crypto allocation fails
  libceph: Prevent potential null-ptr-deref in ceph_handle_auth_reply()
2026-04-24 13:47:19 -07:00
Alex Markuze
b1137e0b3d ceph: add subvolume metrics collection and reporting
Add complete infrastructure for per-subvolume I/O metrics collection
and reporting to the MDS. This enables administrators to monitor I/O
patterns at the subvolume granularity, which is useful for multi-tenant
CephFS deployments.

This patch adds:
- CEPHFS_FEATURE_SUBVOLUME_METRICS feature flag for MDS negotiation
- CEPH_SUBVOLUME_ID_NONE constant (0) for unknown/unset state
- Red-black tree based metrics tracker for efficient per-subvolume
  aggregation with kmem_cache for entry allocations
- Wire format encoding matching the MDS C++ AggregatedIOMetrics struct
- Integration with the existing CLIENT_METRICS message
- Recording of I/O operations from file read/write and writeback paths
- Debugfs interfaces for monitoring (metrics/subvolumes, metrics/metric_features)

Metrics tracked per subvolume include:
- Read/write operation counts
- Read/write byte counts
- Read/write latency sums (for average calculation)

The metrics are periodically sent to the MDS as part of the existing
metrics reporting infrastructure when the MDS advertises support for
the SUBVOLUME_METRICS feature.

CEPH_SUBVOLUME_ID_NONE enforces subvolume_id immutability. Following
the FUSE client convention, 0 means unknown/unset. Once an inode has
a valid (non-zero) subvolume_id, it should not change during the
inode's lifetime.

Signed-off-by: Alex Markuze <amarkuze@redhat.com>
Reviewed-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2026-04-22 01:40:23 +02:00
Sam Edwards
a0d9555bf9 ceph: fix num_ops off-by-one when crypto allocation fails
move_dirty_folio_in_page_array() may fail if the file is encrypted, the
dirty folio is not the first in the batch, and it fails to allocate a
bounce buffer to hold the ciphertext. When that happens,
ceph_process_folio_batch() simply redirties the folio and flushes the
current batch -- it can retry that folio in a future batch.

However, if this failed folio is not contiguous with the last folio that
did make it into the batch, then ceph_process_folio_batch() has already
incremented `ceph_wbc->num_ops`; because it doesn't follow through and
add the discontiguous folio to the array, ceph_submit_write() -- which
expects that `ceph_wbc->num_ops` accurately reflects the number of
contiguous ranges (and therefore the required number of "write extent"
ops) in the writeback -- will panic the kernel:

    BUG_ON(ceph_wbc->op_idx + 1 != req->r_num_ops);

This issue can be reproduced on affected kernels by writing to
fscrypt-enabled CephFS file(s) with a 4KiB-written/4KiB-skipped/repeat
pattern (total filesize should not matter) and gradually increasing the
system's memory pressure until a bounce buffer allocation fails.

Fix this crash by decrementing `ceph_wbc->num_ops` back to the correct
value when move_dirty_folio_in_page_array() fails, but the folio already
started counting a new (i.e. still-empty) extent.

The defect corrected by this patch has existed since 2022 (see first
`Fixes:`), but another bug blocked multi-folio encrypted writeback until
recently (see second `Fixes:`). The second commit made it into 6.18.16,
6.19.6, and 7.0-rc1, unmasking the panic in those versions. This patch
therefore fixes a regression (panic) introduced by cac190c767.

Cc: stable@vger.kernel.org
Fixes: d55207717d ("ceph: add encryption support to writepage and writepages")
Fixes: cac190c767 ("ceph: fix write storm on fscrypted files")
Signed-off-by: Sam Edwards <CFSworks@gmail.com>
Reviewed-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2026-04-22 01:40:22 +02:00
Tal Zussman
4e1d77a8f3 folio_batch: rename pagevec.h to folio_batch.h
struct pagevec was removed in commit 1e0877d58b ("mm: remove struct
pagevec").  Rename include/linux/pagevec.h to reflect reality and update
includes tree-wide.  Add the new filename to MAINTAINERS explicitly, as it
no longer matches the "include/linux/page[-_]*" pattern in MEMORY
MANAGEMENT - CORE.

Link: https://lkml.kernel.org/r/20260225-pagevec_cleanup-v2-3-716868cc2d11@columbia.edu
Signed-off-by: Tal Zussman <tz2294@columbia.edu>
Acked-by: David Hildenbrand (Arm) <david@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Zi Yan <ziy@nvidia.com>
Reviewed-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
Cc: Chris Li <chrisl@kernel.org>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-05 13:53:07 -07:00
Hristo Venev
081a0b78ef ceph: do not skip the first folio of the next object in writeback
When `ceph_process_folio_batch` encounters a folio past the end of the
current object, it should leave it in the batch so that it is picked up
in the next iteration.

Removing the folio from the batch means that it does not get written
back and remains dirty instead. This makes `fsync()` silently skip some
of the data, delays capability release, and breaks coherence with
`O_DIRECT`.

The link below contains instructions for reproducing the bug.

Cc: stable@vger.kernel.org
Fixes: ce80b76dd3 ("ceph: introduce ceph_process_folio_batch() method")
Link: https://tracker.ceph.com/issues/75156
Signed-off-by: Hristo Venev <hristo@venev.name>
Reviewed-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2026-03-09 12:34:40 +01:00
Kees Cook
69050f8d6d treewide: Replace kmalloc with kmalloc_obj for non-scalar types
This is the result of running the Coccinelle script from
scripts/coccinelle/api/kmalloc_objs.cocci. The script is designed to
avoid scalar types (which need careful case-by-case checking), and
instead replace kmalloc-family calls that allocate struct or union
object instances:

Single allocations:	kmalloc(sizeof(TYPE), ...)
are replaced with:	kmalloc_obj(TYPE, ...)

Array allocations:	kmalloc_array(COUNT, sizeof(TYPE), ...)
are replaced with:	kmalloc_objs(TYPE, COUNT, ...)

Flex array allocations:	kmalloc(struct_size(PTR, FAM, COUNT), ...)
are replaced with:	kmalloc_flex(*PTR, FAM, COUNT, ...)

(where TYPE may also be *VAR)

The resulting allocations no longer return "void *", instead returning
"TYPE *".

Signed-off-by: Kees Cook <kees@kernel.org>
2026-02-21 01:02:28 -08:00
Sam Edwards
cfdde144ae ceph: assert loop invariants in ceph_writepages_start()
If `locked_pages` is zero, the page array must not be allocated:
ceph_process_folio_batch() uses `locked_pages` to decide when to
allocate `pages`, and redundant allocations trigger
ceph_allocate_page_array()'s BUG_ON(), resulting in a worker oops (and
writeback stall) or even a kernel panic. Consequently, the main loop in
ceph_writepages_start() assumes that the lifetime of `pages` is confined
to a single iteration.

This expectation is currently not clear enough, as evidenced by the
recent patch which fixed an oops caused by `pages` persisting into
the next loop iteration:
- "ceph: do not propagate page array emplacement errors as batch errors"

Use an explicit BUG_ON() at the top of the loop to assert the loop's
preexisting expectation that `pages` is cleaned up by the previous
iteration. Because this is closely tied to `locked_pages`, also make it
the previous iteration's responsibility to guarantee its reset, and
verify with a second new BUG_ON() instead of handling (and masking)
failures to do so.

This patch does not change invariants, behavior, or failure modes.
The added BUG_ON() lines catch conditions that would already trigger oops,
but do so earlier for easier debugging and programmer clarity.

Signed-off-by: Sam Edwards <CFSworks@gmail.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2026-02-11 19:14:32 +01:00
Sam Edwards
fa589acaac ceph: remove error return from ceph_process_folio_batch()
Following an earlier commit, ceph_process_folio_batch() no longer
returns errors because the writeback loop cannot handle them.

Since this function already indicates failure to lock any pages by
leaving `ceph_wbc.locked_pages == 0`, and the writeback loop has no way
to handle abandonment of a locked batch, change the return type of
ceph_process_folio_batch() to `void` and remove the pathological goto in
the writeback loop. The lack of a return code emphasizes that
ceph_process_folio_batch() is designed to be abort-free: that is, once
it commits a folio for writeback, it will not later abandon it or
propagate an error for that folio. Any future changes requiring "abort"
logic should follow this invariant by cleaning up its array and
resetting ceph_wbc.locked_pages appropriately.

Signed-off-by: Sam Edwards <CFSworks@gmail.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2026-02-11 17:52:50 +01:00
Sam Edwards
cac190c767 ceph: fix write storm on fscrypted files
CephFS stores file data across multiple RADOS objects. An object is the
atomic unit of storage, so the writeback code must clean only folios
that belong to the same object with each OSD request.

CephFS also supports RAID0-style striping of file contents: if enabled,
each object stores multiple unbroken "stripe units" covering different
portions of the file; if disabled, a "stripe unit" is simply the whole
object. The stripe unit is (usually) reported as the inode's block size.

Though the writeback logic could, in principle, lock all dirty folios
belonging to the same object, its current design is to lock only a
single stripe unit at a time. Ever since this code was first written,
it has determined this size by checking the inode's block size.
However, the relatively-new fscrypt support needed to reduce the block
size for encrypted inodes to the crypto block size (see 'fixes' commit),
which causes an unnecessarily high number of write operations (~1024x as
many, with 4MiB objects) and correspondingly degraded performance.

Fix this (and clarify intent) by using i_layout.stripe_unit directly in
ceph_define_write_size() so that encrypted inodes are written back with
the same number of operations as if they were unencrypted.

This patch depends on the preceding commit ("ceph: do not propagate page
array emplacement errors as batch errors") for correctness. While it
applies cleanly on its own, applying it alone will introduce a
regression. This dependency is only relevant for kernels where
ce80b76dd3 ("ceph: introduce ceph_process_folio_batch() method") has
been applied; stable kernels without that commit are unaffected.

Cc: stable@vger.kernel.org
Fixes: 94af047092 ("ceph: add some fscrypt guardrails")
Signed-off-by: Sam Edwards <CFSworks@gmail.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2026-02-11 17:52:50 +01:00
Sam Edwards
707104682e ceph: do not propagate page array emplacement errors as batch errors
When fscrypt is enabled, move_dirty_folio_in_page_array() may fail
because it needs to allocate bounce buffers to store the encrypted
versions of each folio. Each folio beyond the first allocates its bounce
buffer with GFP_NOWAIT. Failures are common (and expected) under this
allocation mode; they should flush (not abort) the batch.

However, ceph_process_folio_batch() uses the same `rc` variable for its
own return code and for capturing the return codes of its routine calls;
failing to reset `rc` back to 0 results in the error being propagated
out to the main writeback loop, which cannot actually tolerate any
errors here: once `ceph_wbc.pages` is allocated, it must be passed to
ceph_submit_write() to be freed. If it survives until the next iteration
(e.g. due to the goto being followed), ceph_allocate_page_array()'s
BUG_ON() will oops the worker.

Note that this failure mode is currently masked due to another bug
(addressed next in this series) that prevents multiple encrypted folios
from being selected for the same write.

For now, just reset `rc` when redirtying the folio to prevent errors in
move_dirty_folio_in_page_array() from propagating. Note that
move_dirty_folio_in_page_array() is careful never to return errors on
the first folio, so there is no need to check for that. After this
change, ceph_process_folio_batch() no longer returns errors; its only
remaining failure indicator is `locked_pages == 0`, which the caller
already handles correctly.

Cc: stable@vger.kernel.org
Fixes: ce80b76dd3 ("ceph: introduce ceph_process_folio_batch() method")
Signed-off-by: Sam Edwards <CFSworks@gmail.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2026-02-11 17:52:50 +01:00
ethanwu
305ff6b3a0 ceph: supply snapshot context in ceph_uninline_data()
The ceph_uninline_data function was missing proper snapshot context
handling for its OSD write operations. Both CEPH_OSD_OP_CREATE and
CEPH_OSD_OP_WRITE requests were passing NULL instead of the appropriate
snapshot context, which could lead to unnecessary object clone.

Reproducer:
../src/vstart.sh --new -x --localhost --bluestore
// turn on cephfs inline data
./bin/ceph fs set a inline_data true --yes-i-really-really-mean-it
// allow fs_a client to take snapshot
./bin/ceph auth caps client.fs_a mds 'allow rwps fsname=a' mon 'allow r fsname=a' osd 'allow rw tag cephfs data=a'
// mount cephfs with fuse, since kernel cephfs doesn't support inline write
ceph-fuse --id fs_a -m 127.0.0.1:40318 --conf ceph.conf -d /mnt/mycephfs/
// bump snapshot seq
mkdir /mnt/mycephfs/.snap/snap1
echo "foo" > /mnt/mycephfs/test
// umount and mount it again using kernel cephfs client
umount /mnt/mycephfs
mount -t ceph fs_a@.a=/ /mnt/mycephfs/ -o conf=./ceph.conf
echo "bar" >> /mnt/mycephfs/test
./bin/rados listsnaps -p cephfs.a.data $(printf "%x\n" $(stat -c %i /mnt/mycephfs/test)).00000000

will see this object does unnecessary clone
1000000000a.00000000 (seq:2):
cloneid snaps   size    overlap
2       2       4       []
head    -       8

but it's expected to see
10000000000.00000000 (seq:2):
cloneid snaps   size    overlap
head    -       8

since there's no snapshot between these 2 writes

clone happened because the first osd request CEPH_OSD_OP_CREATE doesn't
pass snap context so object is created with snap seq 0, but later data
writeback is equipped with snapshot context.
snap.seq(1) > object snap seq(0), so osd does object clone.

This fix properly acquiring the snapshot context before performing
write operations.

Signed-off-by: ethanwu <ethanwu@synology.com>
Reviewed-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com>
Tested-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2026-02-09 13:23:40 +01:00
Julian Sun
4952f35f05
fs: Make wbc_to_tag() inline and use it in fs.
The logic in wbc_to_tag() is widely used in file systems, so modify this
function to be inline and use it in file systems.

This patch has only passed compilation tests, but it should be fine.

Signed-off-by: Julian Sun <sunjunchao@bytedance.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-10-29 23:33:48 +01:00
Max Kellermann
249e0a47cd ceph: fix crash after fscrypt_encrypt_pagecache_blocks() error
The function move_dirty_folio_in_page_array() was created by commit
ce80b76dd3 ("ceph: introduce ceph_process_folio_batch() method") by
moving code from ceph_writepages_start() to this function.

This new function is supposed to return an error code which is checked
by the caller (now ceph_process_folio_batch()), and on error, the
caller invokes redirty_page_for_writepage() and then breaks from the
loop.

However, the refactoring commit has gone wrong, and it by accident, it
always returns 0 (= success) because it first NULLs the pointer and
then returns PTR_ERR(NULL) which is always 0.  This means errors are
silently ignored, leaving NULL entries in the page array, which may
later crash the kernel.

The simple solution is to call PTR_ERR() before clearing the pointer.

Cc: stable@vger.kernel.org
Fixes: ce80b76dd3 ("ceph: introduce ceph_process_folio_batch() method")
Link: https://lore.kernel.org/ceph-devel/aK4v548CId5GIKG1@swift.blarg.de/
Signed-off-by: Max Kellermann <max.kellermann@ionos.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2025-09-09 12:57:03 +02:00
Max Kellermann
cce7c15faa ceph: always call ceph_shift_unused_folios_left()
The function ceph_process_folio_batch() sets folio_batch entries to
NULL, which is an illegal state.  Before folio_batch_release() crashes
due to this API violation, the function ceph_shift_unused_folios_left()
is supposed to remove those NULLs from the array.

However, since commit ce80b76dd3 ("ceph: introduce
ceph_process_folio_batch() method"), this shifting doesn't happen
anymore because the "for" loop got moved to ceph_process_folio_batch(),
and now the `i` variable that remains in ceph_writepages_start()
doesn't get incremented anymore, making the shifting effectively
unreachable much of the time.

Later, commit 1551ec61dc ("ceph: introduce ceph_submit_write()
method") added more preconditions for doing the shift, replacing the
`i` check (with something that is still just as broken):

- if ceph_process_folio_batch() fails, shifting never happens

- if ceph_move_dirty_page_in_page_array() was never called (because
  ceph_process_folio_batch() has returned early for some of various
  reasons), shifting never happens

- if `processed_in_fbatch` is zero (because ceph_process_folio_batch()
  has returned early for some of the reasons mentioned above or
  because ceph_move_dirty_page_in_page_array() has failed), shifting
  never happens

Since those two commits, any problem in ceph_process_folio_batch()
could crash the kernel, e.g. this way:

 BUG: kernel NULL pointer dereference, address: 0000000000000034
 #PF: supervisor write access in kernel mode
 #PF: error_code(0x0002) - not-present page
 PGD 0 P4D 0
 Oops: Oops: 0002 [#1] SMP NOPTI
 CPU: 172 UID: 0 PID: 2342707 Comm: kworker/u778:8 Not tainted 6.15.10-cm4all1-es #714 NONE
 Hardware name: Dell Inc. PowerEdge R7615/0G9DHV, BIOS 1.6.10 12/08/2023
 Workqueue: writeback wb_workfn (flush-ceph-1)
 RIP: 0010:folios_put_refs+0x85/0x140
 Code: 83 c5 01 39 e8 7e 76 48 63 c5 49 8b 5c c4 08 b8 01 00 00 00 4d 85 ed 74 05 41 8b 44 ad 00 48 8b 15 b0 >
 RSP: 0018:ffffb880af8db778 EFLAGS: 00010207
 RAX: 0000000000000001 RBX: 0000000000000000 RCX: 0000000000000003
 RDX: ffffe377cc3b0000 RSI: 0000000000000000 RDI: ffffb880af8db8c0
 RBP: 0000000000000000 R08: 000000000000007d R09: 000000000102b86f
 R10: 0000000000000001 R11: 00000000000000ac R12: ffffb880af8db8c0
 R13: 0000000000000000 R14: 0000000000000000 R15: ffff9bd262c97000
 FS:  0000000000000000(0000) GS:ffff9c8efc303000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 0000000000000034 CR3: 0000000160958004 CR4: 0000000000770ef0
 PKRU: 55555554
 Call Trace:
  <TASK>
  ceph_writepages_start+0xeb9/0x1410

The crash can be reproduced easily by changing the
ceph_check_page_before_write() return value to `-E2BIG`.

(Interestingly, the crash happens only if `huge_zero_folio` has
already been allocated; without `huge_zero_folio`,
is_huge_zero_folio(NULL) returns true and folios_put_refs() skips NULL
entries instead of dereferencing them.  That makes reproducing the bug
somewhat unreliable.  See
https://lore.kernel.org/20250826231626.218675-1-max.kellermann@ionos.com
for a discussion of this detail.)

My suggestion is to move the ceph_shift_unused_folios_left() to right
after ceph_process_folio_batch() to ensure it always gets called to
fix up the illegal folio_batch state.

Cc: stable@vger.kernel.org
Fixes: ce80b76dd3 ("ceph: introduce ceph_process_folio_batch() method")
Link: https://lore.kernel.org/ceph-devel/aK4v548CId5GIKG1@swift.blarg.de/
Signed-off-by: Max Kellermann <max.kellermann@ionos.com>
Reviewed-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2025-09-09 12:57:02 +02:00
Linus Torvalds
7031769e10 vfs-6.17-rc1.mmap_prepare
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCaINCgQAKCRCRxhvAZXjc
 os+nAP9LFHUwWO6EBzHJJGEVjJvvzsbzqeYrRFamYiMc5ulPJwD+KW4RIgJa/MWO
 pcYE40CacaekD8rFWwYUyszpgmv6ewc=
 =wCwp
 -----END PGP SIGNATURE-----

Merge tag 'vfs-6.17-rc1.mmap_prepare' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull mmap_prepare updates from Christian Brauner:
 "Last cycle we introduce f_op->mmap_prepare() in c84bf6dd2b ("mm:
  introduce new .mmap_prepare() file callback").

  This is preferred to the existing f_op->mmap() hook as it does require
  a VMA to be established yet, thus allowing the mmap logic to invoke
  this hook far, far earlier, prior to inserting a VMA into the virtual
  address space, or performing any other heavy handed operations.

  This allows for much simpler unwinding on error, and for there to be a
  single attempt at merging a VMA rather than having to possibly
  reattempt a merge based on potentially altered VMA state.

  Far more importantly, it prevents inappropriate manipulation of
  incompletely initialised VMA state, which is something that has been
  the cause of bugs and complexity in the past.

  The intent is to gradually deprecate f_op->mmap, and in that vein this
  series coverts the majority of file systems to using f_op->mmap_prepare.

  Prerequisite steps are taken - firstly ensuring all checks for mmap
  capabilities use the file_has_valid_mmap_hooks() helper rather than
  directly checking for f_op->mmap (which is now not a valid check) and
  secondly updating daxdev_mapping_supported() to not require a VMA
  parameter to allow ext4 and xfs to be converted.

  Commit bb666b7c27 ("mm: add mmap_prepare() compatibility layer for
  nested file systems") handles the nasty edge-case of nested file
  systems like overlayfs, which introduces a compatibility shim to allow
  f_op->mmap_prepare() to be invoked from an f_op->mmap() callback.

  This allows for nested filesystems to continue to function correctly
  with all file systems regardless of which callback is used. Once we
  finally convert all file systems, this shim can be removed.

  As a result, ecryptfs, fuse, and overlayfs remain unaltered so they
  can nest all other file systems.

  We additionally do not update resctl - as this requires an update to
  remap_pfn_range() (or an alternative to it) which we defer to a later
  series, equally we do not update cramfs which needs a mixed mapping
  insertion with the same issue, nor do we update procfs, hugetlbfs,
  syfs or kernfs all of which require VMAs for internal state and hooks.
  We shall return to all of these later"

* tag 'vfs-6.17-rc1.mmap_prepare' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  doc: update porting, vfs documentation to describe mmap_prepare()
  fs: replace mmap hook with .mmap_prepare for simple mappings
  fs: convert most other generic_file_*mmap() users to .mmap_prepare()
  fs: convert simple use of generic_file_*_mmap() to .mmap_prepare()
  mm/filemap: introduce generic_file_*_mmap_prepare() helpers
  fs/xfs: transition from deprecated .mmap hook to .mmap_prepare
  fs/ext4: transition from deprecated .mmap hook to .mmap_prepare
  fs/dax: make it possible to check dev dax support without a VMA
  fs: consistently use can_mmap_file() helper
  mm/nommu: use file_has_valid_mmap_hooks() helper
  mm: rename call_mmap/mmap_prepare to vfs_mmap/mmap_prepare
2025-07-28 13:43:25 -07:00
Taotao Chen
e9d8e2bf23
fs: change write_begin/write_end interface to take struct kiocb *
Change the address_space_operations callbacks write_begin() and
write_end() to take struct kiocb * as the first argument instead of
struct file *.

Update all affected function prototypes, implementations, call sites,
and related documentation across VFS, filesystems, and block layer.

Part of a series refactoring address_space_operations write_begin and
write_end callbacks to use struct kiocb for passing write context and
flags.

Signed-off-by: Taotao Chen <chentaotao@didiglobal.com>
Link: https://lore.kernel.org/20250716093559.217344-4-chentaotao@didiglobal.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-07-16 14:48:18 +02:00
Lorenzo Stoakes
2e3b37a7e4
fs: replace mmap hook with .mmap_prepare for simple mappings
Since commit c84bf6dd2b ("mm: introduce new .mmap_prepare() file
callback"), the f_op->mmap() hook has been deprecated in favour of
f_op->mmap_prepare().

This callback is invoked in the mmap() logic far earlier, so error handling
can be performed more safely without complicated and bug-prone state
unwinding required should an error arise.

This hook also avoids passing a pointer to a not-yet-correctly-established
VMA avoiding any issues with referencing this data structure.

It rather provides a pointer to the new struct vm_area_desc descriptor type
which contains all required state and allows easy setting of required
parameters without any consideration needing to be paid to locking or
reference counts.

Note that nested filesystems like overlayfs are compatible with an
.mmap_prepare() callback since commit bb666b7c27 ("mm: add mmap_prepare()
compatibility layer for nested file systems").

In this patch we apply this change to file systems with relatively simple
mmap() hook logic - exfat, ceph, f2fs, bcachefs, zonefs, btrfs, ocfs2,
orangefs, nilfs2, romfs, ramfs and aio.

Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Link: https://lore.kernel.org/f528ac4f35b9378931bd800920fee53fc0c5c74d.1750099179.git.lorenzo.stoakes@oracle.com
Acked-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-06-19 13:56:59 +02:00
Linus Torvalds
a3fb8a61e4 A one-liner that leads to a startling (but also very much rational)
performance improvement in cases where an IMA policy with rules that
 are based on fsmagic matching is enforced, an encryption-related fixup
 that addresses generic/397 and other fstest failures and a couple of
 cleanups in CephFS.
 -----BEGIN PGP SIGNATURE-----
 
 iQFHBAABCAAxFiEEydHwtzie9C7TfviiSn/eOAIR84sFAmhDKSQTHGlkcnlvbW92
 QGdtYWlsLmNvbQAKCRBKf944AhHzi5ofCACeaTBV/Dwr90/P+j9sYOfoirVi9rps
 onoce9aMPxEgVHDuhmVSwVnta6Fw7XtD1xxpe3rc1km/zi2/pH/FjBNNWYPk5bjX
 0+LPSqdf4HGBVmAFahLeYgB1beUznEd0saddoiq4tzY7DRpoD+yhfU8kHUzus8Ai
 77yEU5kh3SoTUPPoat0ePIR3KdUIFZy7wKHF3oo6urp8RmOjFw+knZd44uPTpvd5
 ApLzAwsGqJm7gwCSuB67TL3Wd0/6I+wnyNXKX8bYe090ZcPd6fy3WwIacTlo0CFu
 +tCenZBbVFXWVkv4LQtvIz+YlfmTZ31/HzgvLqb/33yjK+jWSzskDsBy
 =lFOg
 -----END PGP SIGNATURE-----

Merge tag 'ceph-for-6.16-rc1' of https://github.com/ceph/ceph-client

Pull ceph updates from Ilya Dryomov:

 - a one-liner that leads to a startling (but also very much rational)
   performance improvement in cases where an IMA policy with rules that
   are based on fsmagic matching is enforced

 - an encryption-related fixup that addresses generic/397 and other
   fstest failures

 - a couple of cleanups in CephFS

* tag 'ceph-for-6.16-rc1' of https://github.com/ceph/ceph-client:
  ceph: fix variable dereferenced before check in ceph_umount_begin()
  ceph: set superblock s_magic for IMA fsmagic matching
  ceph: cleanup hardcoded constants of file handle size
  ceph: fix possible integer overflow in ceph_zero_objects()
  ceph: avoid kernel BUG for encrypted inode with unaligned file size
2025-06-06 17:56:19 -07:00
Viacheslav Dubeyko
060909278c ceph: avoid kernel BUG for encrypted inode with unaligned file size
The generic/397 test hits a BUG_ON for the case of encrypted inode with
unaligned file size (for example, 33K or 1K):

[ 877.737811] run fstests generic/397 at 2025-01-03 12:34:40
[ 877.875761] libceph: mon0 (2)127.0.0.1:40674 session established
[ 877.876130] libceph: client4614 fsid 19b90bca-f1ae-47a6-93dd-0b03ee637949
[ 877.991965] libceph: mon0 (2)127.0.0.1:40674 session established
[ 877.992334] libceph: client4617 fsid 19b90bca-f1ae-47a6-93dd-0b03ee637949
[ 878.017234] libceph: mon0 (2)127.0.0.1:40674 session established
[ 878.017594] libceph: client4620 fsid 19b90bca-f1ae-47a6-93dd-0b03ee637949
[ 878.031394] xfs_io (pid 18988) is setting deprecated v1 encryption policy; recommend upgrading to v2.
[ 878.054528] libceph: mon0 (2)127.0.0.1:40674 session established
[ 878.054892] libceph: client4623 fsid 19b90bca-f1ae-47a6-93dd-0b03ee637949
[ 878.070287] libceph: mon0 (2)127.0.0.1:40674 session established
[ 878.070704] libceph: client4626 fsid 19b90bca-f1ae-47a6-93dd-0b03ee637949
[ 878.264586] libceph: mon0 (2)127.0.0.1:40674 session established
[ 878.265258] libceph: client4629 fsid 19b90bca-f1ae-47a6-93dd-0b03ee637949
[ 878.374578] -----------[ cut here ]------------
[ 878.374586] kernel BUG at net/ceph/messenger.c:1070!
[ 878.375150] Oops: invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
[ 878.378145] CPU: 2 UID: 0 PID: 4759 Comm: kworker/2:9 Not tainted 6.13.0-rc5+ #1
[ 878.378969] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
[ 878.380167] Workqueue: ceph-msgr ceph_con_workfn
[ 878.381639] RIP: 0010:ceph_msg_data_cursor_init+0x42/0x50
[ 878.382152] Code: 89 17 48 8b 46 70 55 48 89 47 08 c7 47 18 00 00 00 00 48 89 e5 e8 de cc ff ff 5d 31 c0 31 d2 31 f6 31 ff c3 cc cc cc cc 0f 0b <0f> 0b 0f 0b 66 2e 0f 1f 84 00 00 00 00 00 90 90 90 90 90 90 90 90
[ 878.383928] RSP: 0018:ffffb4ffc7cbbd28 EFLAGS: 00010287
[ 878.384447] RAX: ffffffff82bb9ac0 RBX: ffff981390c2f1f8 RCX: 0000000000000000
[ 878.385129] RDX: 0000000000009000 RSI: ffff981288232b58 RDI: ffff981390c2f378
[ 878.385839] RBP: ffffb4ffc7cbbe18 R08: 0000000000000000 R09: 0000000000000000
[ 878.386539] R10: 0000000000000000 R11: 0000000000000000 R12: ffff981390c2f030
[ 878.387203] R13: ffff981288232b58 R14: 0000000000000029 R15: 0000000000000001
[ 878.387877] FS: 0000000000000000(0000) GS:ffff9814b7900000(0000) knlGS:0000000000000000
[ 878.388663] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 878.389212] CR2: 00005e106a0554e0 CR3: 0000000112bf0001 CR4: 0000000000772ef0
[ 878.389921] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 878.390620] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 878.391307] PKRU: 55555554
[ 878.391567] Call Trace:
[ 878.391807] <TASK>
[ 878.392021] ? show_regs+0x71/0x90
[ 878.392391] ? die+0x38/0xa0
[ 878.392667] ? do_trap+0xdb/0x100
[ 878.392981] ? do_error_trap+0x75/0xb0
[ 878.393372] ? ceph_msg_data_cursor_init+0x42/0x50
[ 878.393842] ? exc_invalid_op+0x53/0x80
[ 878.394232] ? ceph_msg_data_cursor_init+0x42/0x50
[ 878.394694] ? asm_exc_invalid_op+0x1b/0x20
[ 878.395099] ? ceph_msg_data_cursor_init+0x42/0x50
[ 878.395583] ? ceph_con_v2_try_read+0xd16/0x2220
[ 878.396027] ? _raw_spin_unlock+0xe/0x40
[ 878.396428] ? raw_spin_rq_unlock+0x10/0x40
[ 878.396842] ? finish_task_switch.isra.0+0x97/0x310
[ 878.397338] ? __schedule+0x44b/0x16b0
[ 878.397738] ceph_con_workfn+0x326/0x750
[ 878.398121] process_one_work+0x188/0x3d0
[ 878.398522] ? __pfx_worker_thread+0x10/0x10
[ 878.398929] worker_thread+0x2b5/0x3c0
[ 878.399310] ? __pfx_worker_thread+0x10/0x10
[ 878.399727] kthread+0xe1/0x120
[ 878.400031] ? __pfx_kthread+0x10/0x10
[ 878.400431] ret_from_fork+0x43/0x70
[ 878.400771] ? __pfx_kthread+0x10/0x10
[ 878.401127] ret_from_fork_asm+0x1a/0x30
[ 878.401543] </TASK>
[ 878.401760] Modules linked in: hctr2 nhpoly1305_avx2 nhpoly1305_sse2 nhpoly1305 chacha_generic chacha_x86_64 libchacha adiantum libpoly1305 essiv authenc mptcp_diag xsk_diag tcp_diag udp_diag raw_diag inet_diag unix_diag af_packet_diag netlink_diag intel_rapl_msr intel_rapl_common intel_uncore_frequency_common skx_edac_common nfit kvm_intel kvm crct10dif_pclmul crc32_pclmul polyval_clmulni polyval_generic ghash_clmulni_intel sha256_ssse3 sha1_ssse3 aesni_intel joydev crypto_simd cryptd rapl input_leds psmouse sch_fq_codel serio_raw bochs i2c_piix4 floppy qemu_fw_cfg i2c_smbus mac_hid pata_acpi msr parport_pc ppdev lp parport efi_pstore ip_tables x_tables
[ 878.407319] ---[ end trace 0000000000000000 ]---
[ 878.407775] RIP: 0010:ceph_msg_data_cursor_init+0x42/0x50
[ 878.408317] Code: 89 17 48 8b 46 70 55 48 89 47 08 c7 47 18 00 00 00 00 48 89 e5 e8 de cc ff ff 5d 31 c0 31 d2 31 f6 31 ff c3 cc cc cc cc 0f 0b <0f> 0b 0f 0b 66 2e 0f 1f 84 00 00 00 00 00 90 90 90 90 90 90 90 90
[ 878.410087] RSP: 0018:ffffb4ffc7cbbd28 EFLAGS: 00010287
[ 878.410609] RAX: ffffffff82bb9ac0 RBX: ffff981390c2f1f8 RCX: 0000000000000000
[ 878.411318] RDX: 0000000000009000 RSI: ffff981288232b58 RDI: ffff981390c2f378
[ 878.412014] RBP: ffffb4ffc7cbbe18 R08: 0000000000000000 R09: 0000000000000000
[ 878.412735] R10: 0000000000000000 R11: 0000000000000000 R12: ffff981390c2f030
[ 878.413438] R13: ffff981288232b58 R14: 0000000000000029 R15: 0000000000000001
[ 878.414121] FS: 0000000000000000(0000) GS:ffff9814b7900000(0000) knlGS:0000000000000000
[ 878.414935] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 878.415516] CR2: 00005e106a0554e0 CR3: 0000000112bf0001 CR4: 0000000000772ef0
[ 878.416211] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 878.416907] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 878.417630] PKRU: 55555554

(gdb) l *ceph_msg_data_cursor_init+0x42
0xffffffff823b45a2 is in ceph_msg_data_cursor_init (net/ceph/messenger.c:1070).
1065
1066 void ceph_msg_data_cursor_init(struct ceph_msg_data_cursor *cursor,
1067                                struct ceph_msg *msg, size_t length)
1068 {
1069        BUG_ON(!length);
1070        BUG_ON(length > msg->data_length);
1071        BUG_ON(!msg->num_data_items);
1072
1073        cursor->total_resid = length;
1074        cursor->data = msg->data;

The issue takes place because of this:

[ 202.628853] libceph: net/ceph/messenger_v2.c:2034 prepare_sparse_read_data(): msg->data_length 33792, msg->sparse_read_total 36864

1070        BUG_ON(length > msg->data_length);

The generic/397 test (xfstests) executes such steps:
(1) create encrypted files and directories;
(2) access the created files and folders with encryption key;
(3) access the created files and folders without encryption key.

The issue takes place in this portion of code:

    if (IS_ENCRYPTED(inode)) {
            struct page **pages;
            size_t page_off;

            err = iov_iter_get_pages_alloc2(&subreq->io_iter, &pages, len,
                                            &page_off);
            if (err < 0) {
                    doutc(cl, "%llx.%llx failed to allocate pages, %d\n",
                          ceph_vinop(inode), err);
                    goto out;
            }

            /* should always give us a page-aligned read */
            WARN_ON_ONCE(page_off);
            len = err;
            err = 0;

            osd_req_op_extent_osd_data_pages(req, 0, pages, len, 0, false,
                                             false);

The reason of the issue is that subreq->io_iter.count keeps unaligned
value of length:

[  347.751182] lib/iov_iter.c:1185 __iov_iter_get_pages_alloc(): maxsize 36864, maxpages 4294967295, start 18446659367320516064
[  347.752808] lib/iov_iter.c:1196 __iov_iter_get_pages_alloc(): maxsize 33792, maxpages 4294967295, start 18446659367320516064
[  347.754394] lib/iov_iter.c:1015 iter_folioq_get_pages(): maxsize 33792, maxpages 4294967295, extracted 0, _start_offset 18446659367320516064

This patch simply assigns the aligned value to subreq->io_iter.count
before calling iov_iter_get_pages_alloc2().

[ idryomov: tag the comment with FIXME to make it clear that it's only
            a workaround for netfslib not coexisting with fscrypt nicely
            (this is also noted in another pre-existing comment) ]

Cc: David Howells <dhowells@redhat.com>
Cc: stable@vger.kernel.org
Fixes: ee4cdf7ba8 ("netfs: Speed up buffered reading")
Signed-off-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2025-06-01 17:53:08 +02:00
David Howells
db26d62d79
netfs: Fix undifferentiation of DIO reads from unbuffered reads
On cifs, "DIO reads" (specified by O_DIRECT) need to be differentiated from
"unbuffered reads" (specified by cache=none in the mount parameters).  The
difference is flagged in the protocol and the server may behave
differently: Windows Server will, for example, mandate that DIO reads are
block aligned.

Fix this by adding a NETFS_UNBUFFERED_READ to differentiate this from
NETFS_DIO_READ, parallelling the write differentiation that already exists.
cifs will then do the right thing.

Fixes: 016dc8516a ("netfs: Implement unbuffered/DIO read support")
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/3444961.1747987072@warthog.procyon.org.uk
Reviewed-by: "Paulo Alcantara (Red Hat)" <pc@manguebit.com>
Reviewed-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com>
cc: Steve French <sfrench@samba.org>
cc: netfs@lists.linux.dev
cc: v9fs@lists.linux.dev
cc: linux-afs@lists.infradead.org
cc: linux-cifs@vger.kernel.org
cc: ceph-devel@vger.kernel.org
cc: linux-nfs@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-05-23 10:35:03 +02:00
David Howells
20d72b00ca
netfs: Fix the request's work item to not require a ref
When the netfs_io_request struct's work item is queued, it must be supplied
with a ref to the work item struct to prevent it being deallocated whilst
on the queue or whilst it is being processed.  This is tricky to manage as
we have to get a ref before we try and queue it and then we may find it's
already queued and is thus already holding a ref - in which case we have to
try and get rid of the ref again.

The problem comes if we're in BH or IRQ context and need to drop the ref:
if netfs_put_request() reduces the count to 0, we have to do the cleanup -
but the cleanup may need to wait.

Fix this by adding a new work item to the request, ->cleanup_work, and
dispatching that when the refcount hits zero.  That can then synchronously
cancel any outstanding work on the main work item before doing the cleanup.

Adding a new work item also deals with another problem upstream where it's
sometimes changing the work func in the put function and requeuing it -
which has occasionally in the past caused the cleanup to happen
incorrectly.

As a bonus, this allows us to get rid of the 'was_async' parameter from a
bunch of functions.  This indicated whether the put function might not be
permitted to sleep.

Fixes: 3d3c950467 ("netfs: Provide readahead and readpage netfs helpers")
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/20250519090707.2848510-4-dhowells@redhat.com
cc: Paulo Alcantara <pc@manguebit.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: Steve French <stfrench@microsoft.com>
cc: linux-cifs@vger.kernel.org
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-05-21 14:35:20 +02:00
Matthew Wilcox (Oracle)
59b59a9431
fscrypt: Change fscrypt_encrypt_pagecache_blocks() to take a folio
ext4 and ceph already have a folio to pass; f2fs needs to be properly
converted but this will do for now.  This removes a reference
to page->index and page->mapping as well as removing a call to
compound_head().

Signed-off-by: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Link: https://lore.kernel.org/r/20250304170224.523141-1-willy@infradead.org
Acked-by: Eric Biggers <ebiggers@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-03-05 12:57:15 +01:00
Matthew Wilcox (Oracle)
d1b452673a
ceph: Pass a folio to ceph_allocate_page_array()
Remove two accesses to page->index.

Signed-off-by: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Link: https://lore.kernel.org/r/20250217185119.430193-10-willy@infradead.org
Tested-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-02-28 11:21:31 +01:00
Matthew Wilcox (Oracle)
ad49fe2b3d
ceph: Convert ceph_move_dirty_page_in_page_array() to move_dirty_folio_in_page_array()
Shorten the name of this internal function by dropping the 'ceph_'
prefix and pass in a folio instead of a page.

Signed-off-by: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Link: https://lore.kernel.org/r/20250217185119.430193-9-willy@infradead.org
Tested-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-02-28 11:21:31 +01:00
Matthew Wilcox (Oracle)
a55cf4fd8f
ceph: Remove uses of page from ceph_process_folio_batch()
Remove uses of page->index and deprecated page APIs.  Saves a lot of
hidden calls to compound_head().

Also convert is_page_index_contiguous() to is_folio_index_contiguous()
and make its arguments const.

Signed-off-by: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Link: https://lore.kernel.org/r/20250217185119.430193-8-willy@infradead.org
Tested-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-02-28 11:21:30 +01:00
Matthew Wilcox (Oracle)
15fdaf2fd6
ceph: Convert ceph_check_page_before_write() to use a folio
Remove the conversion back to a struct page and just use the folio
passed in.

Signed-off-by: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Link: https://lore.kernel.org/r/20250217185119.430193-7-willy@infradead.org
Tested-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-02-28 11:21:30 +01:00
Matthew Wilcox (Oracle)
62171c16da
ceph: Convert writepage_nounlock() to write_folio_nounlock()
Remove references to page->index, page->mapping, thp_size(),
page_offset() and other page APIs in favour of their more efficient
folio replacements.

Signed-off-by: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Link: https://lore.kernel.org/r/20250217185119.430193-6-willy@infradead.org
Tested-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com>
Reviewed-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-02-28 11:21:30 +01:00
Matthew Wilcox (Oracle)
f9707a8b5b
ceph: Convert ceph_find_incompatible() to take a folio
Both callers already have the folio.  Pass it in and use it throughout.
Removes some hidden calls to compound_head() and a reference to
page->mapping.

Signed-off-by: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Link: https://lore.kernel.org/r/20250217185119.430193-4-willy@infradead.org
Tested-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com>
Reviewed-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-02-28 11:21:30 +01:00
Matthew Wilcox (Oracle)
88a59bda3f
ceph: Use a folio in ceph_page_mkwrite()
Convert the passed page to a folio and use it
throughout ceph_page_mkwrite().  Removes the last call to
page_mkwrite_check_truncate(), the last call to offset_in_thp() and one
of the last calls to thp_size().  Saves a few calls to compound_head().

Signed-off-by: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Link: https://lore.kernel.org/r/20250217185119.430193-3-willy@infradead.org
Tested-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com>
Reviewed-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-02-28 11:21:30 +01:00
Matthew Wilcox (Oracle)
19a2881104
ceph: Remove ceph_writepage()
Ceph already has a writepages operation which is preferred over writepage
in all situations except for page migration.  By adding a migrate_folio
operation, there will be no situations in which ->writepage should
be called.  filemap_migrate_folio() is an appropriate operation to use
because the ceph data stored in folio->private does not contain any
reference to the memory address of the folio.

Signed-off-by: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Link: https://lore.kernel.org/r/20250217185119.430193-2-willy@infradead.org
Tested-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com>
Reviewed-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-02-28 11:21:30 +01:00
Viacheslav Dubeyko
fd7449d937
ceph: fix generic/421 test failure
The generic/421 fails to finish because of the issue:

Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.894678] INFO: task kworker/u48:0:11 blocked for more than 122 seconds.
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.895403] Not tainted 6.13.0-rc5+ #1
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.895867] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.896633] task:kworker/u48:0 state:D stack:0 pid:11 tgid:11 ppid:2 flags:0x00004000
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.896641] Workqueue: writeback wb_workfn (flush-ceph-24)
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897614] Call Trace:
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897620] <TASK>
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897629] __schedule+0x443/0x16b0
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897637] schedule+0x2b/0x140
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897640] io_schedule+0x4c/0x80
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897643] folio_wait_bit_common+0x11b/0x310
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897646] ? _raw_spin_unlock_irq+0xe/0x50
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897652] ? __pfx_wake_page_function+0x10/0x10
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897655] __folio_lock+0x17/0x30
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897658] ceph_writepages_start+0xca9/0x1fb0
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897663] ? fsnotify_remove_queued_event+0x2f/0x40
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897668] do_writepages+0xd2/0x240
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897672] __writeback_single_inode+0x44/0x350
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897675] writeback_sb_inodes+0x25c/0x550
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897680] wb_writeback+0x89/0x310
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897683] ? finish_task_switch.isra.0+0x97/0x310
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897687] wb_workfn+0xb5/0x410
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897689] process_one_work+0x188/0x3d0
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897692] worker_thread+0x2b5/0x3c0
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897694] ? __pfx_worker_thread+0x10/0x10
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897696] kthread+0xe1/0x120
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897699] ? __pfx_kthread+0x10/0x10
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897701] ret_from_fork+0x43/0x70
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897705] ? __pfx_kthread+0x10/0x10
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897707] ret_from_fork_asm+0x1a/0x30
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897711] </TASK>

There are several issues here:
(1) ceph_kill_sb() doesn't wait ending of flushing
all dirty folios/pages because of racy nature of
mdsc->stopping_blockers. As a result, mdsc->stopping
becomes CEPH_MDSC_STOPPING_FLUSHED too early.
(2) The ceph_inc_osd_stopping_blocker(fsc->mdsc) fails
to increment mdsc->stopping_blockers. Finally,
already locked folios/pages are never been unlocked
and the logic tries to lock the same page second time.
(3) The folio_batch with found dirty pages by
filemap_get_folios_tag() is not processed properly.
And this is why some number of dirty pages simply never
processed and we have dirty folios/pages after unmount
anyway.

This patch fixes the issues by means of:
(1) introducing dirty_folios counter and flush_end_wq
waiting queue in struct ceph_mds_client;
(2) ceph_dirty_folio() increments the dirty_folios
counter;
(3) writepages_finish() decrements the dirty_folios
counter and wake up all waiters on the queue
if dirty_folios counter is equal or lesser than zero;
(4) adding in ceph_kill_sb() method the logic of
checking the value of dirty_folios counter and
waiting if it is bigger than zero;
(5) adding ceph_inc_osd_stopping_blocker() call in the
beginning of the ceph_writepages_start() and
ceph_dec_osd_stopping_blocker() at the end of
the ceph_writepages_start() with the goal to resolve
the racy nature of mdsc->stopping_blockers.

sudo ./check generic/421
FSTYP         -- ceph
PLATFORM      -- Linux/x86_64 ceph-testing-0001 6.13.0+ #137 SMP PREEMPT_DYNAMIC Mon Feb  3 20:30:08 UTC 2025
MKFS_OPTIONS  -- 127.0.0.1:40551:/scratch
MOUNT_OPTIONS -- -o name=fs,secret=<secret>,ms_mode=crc,nowsync,copyfrom 127.0.0.1:40551:/scratch /mnt/scratch

generic/421 7s ...  4s
Ran: generic/421
Passed all 1 tests

Signed-off-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com>
Link: https://lore.kernel.org/r/20250205000249.123054-5-slava@dubeyko.com
Tested-by: David Howells <dhowells@redhat.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-02-28 11:20:16 +01:00
Viacheslav Dubeyko
1551ec61dc
ceph: introduce ceph_submit_write() method
Final responsibility of ceph_writepages_start() is
to submit write requests for processed dirty folios/pages.
The ceph_submit_write() summarize all this logic in
one method.

The generic/421 fails to finish because of the issue:

Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.894678] INFO: task kworker/u48:0:11 blocked for more than 122 seconds.
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.895403] Not tainted 6.13.0-rc5+ #1
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.895867] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.896633] task:kworker/u48:0 state:D stack:0 pid:11 tgid:11 ppid:2 flags:0x00004000
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.896641] Workqueue: writeback wb_workfn (flush-ceph-24)
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897614] Call Trace:
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897620] <TASK>
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897629] __schedule+0x443/0x16b0
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897637] schedule+0x2b/0x140
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897640] io_schedule+0x4c/0x80
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897643] folio_wait_bit_common+0x11b/0x310
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897646] ? _raw_spin_unlock_irq+0xe/0x50
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897652] ? __pfx_wake_page_function+0x10/0x10
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897655] __folio_lock+0x17/0x30
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897658] ceph_writepages_start+0xca9/0x1fb0
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897663] ? fsnotify_remove_queued_event+0x2f/0x40
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897668] do_writepages+0xd2/0x240
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897672] __writeback_single_inode+0x44/0x350
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897675] writeback_sb_inodes+0x25c/0x550
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897680] wb_writeback+0x89/0x310
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897683] ? finish_task_switch.isra.0+0x97/0x310
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897687] wb_workfn+0xb5/0x410
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897689] process_one_work+0x188/0x3d0
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897692] worker_thread+0x2b5/0x3c0
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897694] ? __pfx_worker_thread+0x10/0x10
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897696] kthread+0xe1/0x120
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897699] ? __pfx_kthread+0x10/0x10
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897701] ret_from_fork+0x43/0x70
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897705] ? __pfx_kthread+0x10/0x10
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897707] ret_from_fork_asm+0x1a/0x30
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897711] </TASK>

There are two problems here:

if (!ceph_inc_osd_stopping_blocker(fsc->mdsc)) {
     rc = -EIO;
     goto release_folios;
}

(1) ceph_kill_sb() doesn't wait ending of flushing
all dirty folios/pages because of racy nature of
mdsc->stopping_blockers. As a result, mdsc->stopping
becomes CEPH_MDSC_STOPPING_FLUSHED too early.
(2) The ceph_inc_osd_stopping_blocker(fsc->mdsc) fails
to increment mdsc->stopping_blockers. Finally,
already locked folios/pages are never been unlocked
and the logic tries to lock the same page second time.

This patch implements refactoring of ceph_submit_write()
and also it solves the second issue.

Signed-off-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com>
Link: https://lore.kernel.org/r/20250205000249.123054-4-slava@dubeyko.com
Tested-by: David Howells <dhowells@redhat.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-02-28 11:20:16 +01:00
Viacheslav Dubeyko
ce80b76dd3
ceph: introduce ceph_process_folio_batch() method
First step of ceph_writepages_start() logic is
of finding the dirty memory folios and processing it.
This patch introduces ceph_process_folio_batch()
method that moves this logic into dedicated method.

The ceph_writepages_start() has this logic:

if (ceph_wbc.locked_pages == 0)
    lock_page(page);  /* first page */
else if (!trylock_page(page))
    break;

<skipped>

if (folio_test_writeback(folio) ||
    folio_test_private_2(folio) /* [DEPRECATED] */) {
      if (wbc->sync_mode == WB_SYNC_NONE) {
          doutc(cl, "%p under writeback\n", folio);
          folio_unlock(folio);
          continue;
      }
      doutc(cl, "waiting on writeback %p\n", folio);
      folio_wait_writeback(folio);
      folio_wait_private_2(folio); /* [DEPRECATED] */
}

The problem here that folio/page is locked here at first
and it is by set_page_writeback(page) later before
submitting the write request. The folio/page is unlocked
by writepages_finish() after finishing the write
request. It means that logic of checking folio_test_writeback()
and folio_wait_writeback() never works because page is locked
and it cannot be locked again until write request completion.
However, for majority of folios/pages the trylock_page()
is used. As a result, multiple threads can try to lock the same
folios/pages multiple times even if they are under writeback
already. It makes this logic more compute intensive than
it is necessary.

This patch changes this logic:

if (folio_test_writeback(folio) ||
    folio_test_private_2(folio) /* [DEPRECATED] */) {
      if (wbc->sync_mode == WB_SYNC_NONE) {
          doutc(cl, "%p under writeback\n", folio);
          folio_unlock(folio);
          continue;
      }
      doutc(cl, "waiting on writeback %p\n", folio);
      folio_wait_writeback(folio);
      folio_wait_private_2(folio); /* [DEPRECATED] */
}

if (ceph_wbc.locked_pages == 0)
    lock_page(page);  /* first page */
else if (!trylock_page(page))
    break;

This logic should exclude the ignoring of writeback
state of folios/pages.

Signed-off-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com>
Link: https://lore.kernel.org/r/20250205000249.123054-3-slava@dubeyko.com
Tested-by: David Howells <dhowells@redhat.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-02-28 11:20:15 +01:00
Viacheslav Dubeyko
f08068df4a
ceph: extend ceph_writeback_ctl for ceph_writepages_start() refactoring
The ceph_writepages_start() has unreasonably huge size and
complex logic that makes this method hard to understand.
Current state of the method's logic makes bug fix really
hard task. This patch extends the struct ceph_writeback_ctl
with the goal to make ceph_writepages_start() method
more compact and easy to understand by means of deep refactoring.

Signed-off-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com>
Link: https://lore.kernel.org/r/20250205000249.123054-2-slava@dubeyko.com
Tested-by: David Howells <dhowells@redhat.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-02-28 11:20:15 +01:00
David Howells
e2d46f2ec3
netfs: Change the read result collector to only use one work item
Change the way netfslib collects read results to do all the collection for
a particular read request using a single work item that walks along the
subrequest queue as subrequests make progress or complete, unlocking folios
progressively rather than doing the unlock in parallel as parallel requests
come in.

The code is remodelled to be more like the write-side code, though only
using a single stream.  This makes it more directly comparable and thus
easier to duplicate fixes between the two sides.

This has a number of advantages:

 (1) It's simpler.  There doesn't need to be a complex donation mechanism
     to handle mismatches between the size and alignment of subrequests and
     folios.  The collector unlocks folios as the subrequests covering each
     complete.

 (2) It should cause less scheduler overhead as there's a single work item
     in play unlocking pages in parallel when a read gets split up into a
     lot of subrequests instead of one per subrequest.

     Whilst the parallellism is nice in theory, in practice, the vast
     majority of loads are sequential reads of the whole file, so
     committing a bunch of threads to unlocking folios out of order doesn't
     help in those cases.

 (3) It should make it easier to implement content decryption.  A folio
     cannot be decrypted until all the requests that contribute to it have
     completed - and, again, most loads are sequential and so, most of the
     time, we want to begin decryption sequentially (though it's great if
     the decryption can happen in parallel).

There is a disadvantage in that we're losing the ability to decrypt and
unlock things on an as-things-arrive basis which may affect some
applications.

Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/r/20241216204124.3752367-28-dhowells@redhat.com
cc: Jeff Layton <jlayton@kernel.org>
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-12-20 22:34:08 +01:00
David Howells
31fc366aa7
netfs: Drop the was_async arg from netfs_read_subreq_terminated()
Drop the was_async argument from netfs_read_subreq_terminated().  Almost
every caller is either in process context and passes false.  Some
filesystems delegate the call to a workqueue to avoid doing the work in
their network message queue parsing thread.

The only exception is netfs_cache_read_terminated() which handles
completion in the cache - which is usually a callback from the backing
filesystem in softirq context, though it can be from process context if an
error occurred.  In this case, delegate to a workqueue.

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/CAHk-=wiVC5Cgyz6QKXFu6fTaA6h4CjexDR-OV9kL6Vo5x9v8=A@mail.gmail.com/
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/r/20241216204124.3752367-10-dhowells@redhat.com
cc: Jeff Layton <jlayton@kernel.org>
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-12-20 22:34:03 +01:00
David Howells
360157829e
netfs: Drop the error arg from netfs_read_subreq_terminated()
Drop the error argument from netfs_read_subreq_terminated() in favour of
passing the value in subreq->error.

Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/r/20241216204124.3752367-9-dhowells@redhat.com
cc: Jeff Layton <jlayton@kernel.org>
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-12-20 22:34:03 +01:00
Linus Torvalds
9d0ad04553 A fix for the mount "device" string parser from Patrick and two cred
reference counting fixups from Max, marked for stable.  Also included
 a number of cleanups and a tweak to MAINTAINERS to avoid unnecessarily
 CCing netdev list.
 -----BEGIN PGP SIGNATURE-----
 
 iQFHBAABCAAxFiEEydHwtzie9C7TfviiSn/eOAIR84sFAmdJ/sMTHGlkcnlvbW92
 QGdtYWlsLmNvbQAKCRBKf944AhHzi8EOB/9Jhq1nOe0dN7aAWN1owZH85TXmOLuX
 eSS79AJp63lmJgx+mF0CbLLN6Vwjvm1vqz10Uhe5VCmqtxKy1/F4QxEOwk+zEMwT
 iGkM+6nUtMMqnxqItJpFC19YQONwgidsNcbi7v8nDEHqH8FXEC4R0pi0990bUQSj
 E8zVzq44TNFQrhWjDJHPnXsxbH9SijRuwu1O4KEZ2HK0QQ8LfpPptozJawH0p2Hs
 Wc6V6ppt7o9F9MHW137I4BOG9xm18aAQa9Ztd5GHhip63SuLpdQNwoM9JhC8J/bt
 bcci5CoCa34P57g9/1kh9ov/NXf9XgjSCFlDzk9zZH0IxaX5CYgTqYL5
 =xf5D
 -----END PGP SIGNATURE-----

Merge tag 'ceph-for-6.13-rc1' of https://github.com/ceph/ceph-client

Pull ceph updates from Ilya Dryomov:
 "A fix for the mount "device" string parser from Patrick and two cred
  reference counting fixups from Max, marked for stable.

  Also included a number of cleanups and a tweak to MAINTAINERS to avoid
  unnecessarily CCing netdev list"

* tag 'ceph-for-6.13-rc1' of https://github.com/ceph/ceph-client:
  ceph: fix cred leak in ceph_mds_check_access()
  ceph: pass cred pointer to ceph_mds_auth_match()
  ceph: improve caps debugging output
  ceph: correct ceph_mds_cap_peer field name
  ceph: correct ceph_mds_cap_item field name
  ceph: miscellaneous spelling fixes
  ceph: Use strscpy() instead of strcpy() in __get_snap_name()
  ceph: Use str_true_false() helper in status_show()
  ceph: requalify some char pointers as const
  ceph: extract entity name from device id
  MAINTAINERS: exclude net/ceph from networking
  ceph: Remove fs/ceph deadcode
  libceph: Remove unused ceph_crypto_key_encode
  libceph: Remove unused ceph_osdc_watch_check
  libceph: Remove unused pagevec functions
  libceph: Remove unused ceph_pagelist functions
2024-11-30 10:22:38 -08:00
Linus Torvalds
56be9aaf98 vfs-6.13.pagecache
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZzcUQAAKCRCRxhvAZXjc
 onEpAQCUdwIBHpwmSIFvJFA9aNGpbLzi0dDSEIxuWYtp5qVuogD+ImccwqpG3kEi
 Zq9vokdPpB1zbahxKl1mkvBG4G0GFQE=
 =LbP6
 -----END PGP SIGNATURE-----

Merge tag 'vfs-6.13.pagecache' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull vfs pagecache updates from Christian Brauner:
 "Cleanup filesystem page flag usage: This continues the work to make
  the mappedtodisk/owner_2 flag available to filesystems which don't use
  buffer heads. Further patches remove uses of Private2. This brings us
  very close to being rid of it entirely"

* tag 'vfs-6.13.pagecache' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  migrate: Remove references to Private2
  ceph: Remove call to PagePrivate2()
  btrfs: Switch from using the private_2 flag to owner_2
  mm: Remove PageMappedToDisk
  nilfs2: Convert nilfs_copy_buffer() to use folios
  fs: Move clearing of mappedtodisk to buffer.c
2024-11-18 09:54:32 -08:00
Dmitry Antipov
3500000bb1 ceph: miscellaneous spelling fixes
Correct spelling here and there as suggested by codespell.

Signed-off-by: Dmitry Antipov <dmantipov@yandex.ru>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2024-11-18 17:34:36 +01:00
Matthew Wilcox (Oracle)
fd15ba4cb0
ceph: Remove call to PagePrivate2()
Use the folio that we already have to call folio_test_private_2()
instead.  This is the last call to PagePrivate2(), so replace its
PAGEFLAG() definition with FOLIO_FLAG().

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Link: https://lore.kernel.org/r/20241002040111.1023018-6-willy@infradead.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-10-04 09:24:25 +02:00
Patrick Donnelly
ccda9910d8 ceph: fix cap ref leak via netfs init_request
Log recovered from a user's cluster:

    <7>[ 5413.970692] ceph:  get_cap_refs 00000000958c114b ret 1 got Fr
    <7>[ 5413.970695] ceph:  start_read 00000000958c114b, no cache cap
    ...
    <7>[ 5473.934609] ceph:   my wanted = Fr, used = Fr, dirty -
    <7>[ 5473.934616] ceph:  revocation: pAsLsXsFr -> pAsLsXs (revoking Fr)
    <7>[ 5473.934632] ceph:  __ceph_caps_issued 00000000958c114b cap 00000000f7784259 issued pAsLsXs
    <7>[ 5473.934638] ceph:  check_caps 10000000e68.fffffffffffffffe file_want - used Fr dirty - flushing - issued pAsLsXs revoking Fr retain pAsLsXsFsr  AUTHONLY NOINVAL FLUSH_FORCE

The MDS subsequently complains that the kernel client is late releasing
caps.

Approximately, a series of changes to this code by commits 4987005600
("ceph: convert ceph_readpages to ceph_readahead"), 2de1604173
("netfs: Change ->init_request() to return an error code") and
a5c9dc4451 ("ceph: Make ceph_init_request() check caps on readahead")
resulted in subtle resource cleanup to be missed. The main culprit is
the change in error handling in 2de1604173 which meant that a failure
in init_request() would no longer cause cleanup to be called. That
would prevent the ceph_put_cap_refs() call which would cleanup the
leaked cap ref.

Cc: stable@vger.kernel.org
Fixes: a5c9dc4451 ("ceph: Make ceph_init_request() check caps on readahead")
Link: https://tracker.ceph.com/issues/67008
Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2024-10-03 09:31:08 +02:00
Thorsten Blum
7264745d55 ceph: use struct_size() helper in __ceph_pool_perm_get()
Use struct_size() to calculate the number of bytes to be allocated.

Signed-off-by: Thorsten Blum <thorsten.blum@toblux.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2024-10-03 09:31:08 +02:00
Linus Torvalds
894b3c35d1 Three CephFS fixes from Xiubo and Luis and a bunch of assorted
cleanups.
 -----BEGIN PGP SIGNATURE-----
 
 iQFHBAABCAAxFiEEydHwtzie9C7TfviiSn/eOAIR84sFAmb3JroTHGlkcnlvbW92
 QGdtYWlsLmNvbQAKCRBKf944AhHzizDiB/0elHQQaFxXMjuJRY1IzohozAHi0cHK
 gwgE4nEbECE8vRYK/QvyvZ3S+ep+N+r6jOIiIDyqhjtlY3//oSyyxL7RjMJlVFBq
 Ie37w8r4q1aL1mn9QDQ4iQxcRYyU+JxcUcPR1UUUvLiKgWaRixmq27zby/WQSrkA
 ke2ScBRDtEAYVtdxvxmUJK/DrPr3skwJAGY52KesjwgVhXSL8KG9X1zMUbWdJYDV
 THbQzLZsu4NVh7LlAsS/mh+z0EIZsXxQYU5IY3dIVEYcuLK93lXRGZb+7whtmUef
 wsDtYIe/w30QVxFdrN28qAQp8daUJhp+3t0EZSyecRcq5OPey6ICx1P4
 =+bdB
 -----END PGP SIGNATURE-----

Merge tag 'ceph-for-6.12-rc1' of https://github.com/ceph/ceph-client

Pull ceph updates from Ilya Dryomov:
 "Three CephFS fixes from Xiubo and Luis and a bunch of assorted
  cleanups"

* tag 'ceph-for-6.12-rc1' of https://github.com/ceph/ceph-client:
  ceph: remove the incorrect Fw reference check when dirtying pages
  ceph: Remove empty definition in header file
  ceph: Fix typo in the comment
  ceph: fix a memory leak on cap_auths in MDS client
  ceph: flush all caps releases when syncing the whole filesystem
  ceph: rename ceph_flush_cap_releases() to ceph_flush_session_cap_releases()
  libceph: use min() to simplify code in ceph_dns_resolve_name()
  ceph: Convert to use jiffies macro
  ceph: Remove unused declarations
2024-09-28 08:40:36 -07:00
Xiubo Li
c08dfb1b49 ceph: remove the incorrect Fw reference check when dirtying pages
When doing the direct-io reads it will also try to mark pages dirty,
but for the read path it won't hold the Fw caps and there is case
will it get the Fw reference.

Fixes: 5dda377cf0 ("ceph: set i_head_snapc when getting CEPH_CAP_FILE_WR reference")
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Patrick Donnelly <pdonnell@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2024-09-24 22:51:33 +02:00
Linus Torvalds
35219bc5c7 vfs-6.12.netfs
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZuQEvgAKCRCRxhvAZXjc
 onQWAQD6IxAKPU0zom2FoWNilvSzPs7WglTtvddX9pu/lT1RNAD/YC/wOLW8mvAv
 9oTAmigQDQQhEWdJA9RgLZBiw7k+DAw=
 =zWFb
 -----END PGP SIGNATURE-----

Merge tag 'vfs-6.12.netfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull netfs updates from Christian Brauner:
 "This contains the work to improve read/write performance for the new
  netfs library.

  The main performance enhancing changes are:

   - Define a structure, struct folio_queue, and a new iterator type,
     ITER_FOLIOQ, to hold a buffer as a replacement for ITER_XARRAY. See
     that patch for questions about naming and form.

     ITER_FOLIOQ is provided as a replacement for ITER_XARRAY. The
     problem with an xarray is that accessing it requires the use of a
     lock (typically the RCU read lock) - and this means that we can't
     supply iterate_and_advance() with a step function that might sleep
     (crypto for example) without having to drop the lock between pages.
     ITER_FOLIOQ is the iterator for a chain of folio_queue structs,
     where each folio_queue holds a small list of folios. A folio_queue
     struct is a simpler structure than xarray and is not subject to
     concurrent manipulation by the VM. folio_queue is used rather than
     a bvec[] as it can form lists of indefinite size, adding to one end
     and removing from the other on the fly.

   - Provide a copy_folio_from_iter() wrapper.

   - Make cifs RDMA support ITER_FOLIOQ.

   - Use folio queues in the write-side helpers instead of xarrays.

   - Add a function to reset the iterator in a subrequest.

   - Simplify the write-side helpers to use sheaves to skip gaps rather
     than trying to work out where gaps are.

   - In afs, make the read subrequests asynchronous, putting them into
     work items to allow the next patch to do progressive
     unlocking/reading.

   - Overhaul the read-side helpers to improve performance.

   - Fix the caching of a partial block at the end of a file.

   - Allow a store to be cancelled.

  Then some changes for cifs to make it use folio queues instead of
  xarrays for crypto bufferage:

   - Use raw iteration functions rather than manually coding iteration
     when hashing data.

   - Switch to using folio_queue for crypto buffers.

   - Remove the xarray bits.

  Make some adjustments to the /proc/fs/netfs/stats file such that:

   - All the netfs stats lines begin 'Netfs:' but change this to
     something a bit more useful.

   - Add a couple of stats counters to track the numbers of skips and
     waits on the per-inode writeback serialisation lock to make it
     easier to check for this as a source of performance loss.

  Miscellaneous work:

   - Ensure that the sb_writers lock is taken around
     vfs_{set,remove}xattr() in the cachefiles code.

   - Reduce the number of conditional branches in netfs_perform_write().

   - Move the CIFS_INO_MODIFIED_ATTR flag to the netfs_inode struct and
     remove cifs_post_modify().

   - Move the max_len/max_nr_segs members from netfs_io_subrequest to
     netfs_io_request as they're only needed for one subreq at a time.

   - Add an 'unknown' source value for tracing purposes.

   - Remove NETFS_COPY_TO_CACHE as it's no longer used.

   - Set the request work function up front at allocation time.

   - Use bh-disabling spinlocks for rreq->lock as cachefiles completion
     may be run from block-filesystem DIO completion in softirq context.

   - Remove fs/netfs/io.c"

* tag 'vfs-6.12.netfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (25 commits)
  docs: filesystems: corrected grammar of netfs page
  cifs: Don't support ITER_XARRAY
  cifs: Switch crypto buffer to use a folio_queue rather than an xarray
  cifs: Use iterate_and_advance*() routines directly for hashing
  netfs: Cancel dirty folios that have no storage destination
  cachefiles, netfs: Fix write to partial block at EOF
  netfs: Remove fs/netfs/io.c
  netfs: Speed up buffered reading
  afs: Make read subreqs async
  netfs: Simplify the writeback code
  netfs: Provide an iterator-reset function
  netfs: Use new folio_queue data type and iterator instead of xarray iter
  cifs: Provide the capability to extract from ITER_FOLIOQ to RDMA SGEs
  iov_iter: Provide copy_folio_from_iter()
  mm: Define struct folio_queue and ITER_FOLIOQ to handle a sequence of folios
  netfs: Use bh-disabling spinlocks for rreq->lock
  netfs: Set the request work function upon allocation
  netfs: Remove NETFS_COPY_TO_CACHE
  netfs: Reserve netfs_sreq_source 0 as unset/unknown
  netfs: Move max_len/max_nr_segs from netfs_io_subrequest to netfs_io_stream
  ...
2024-09-16 12:13:31 +02:00
Linus Torvalds
2775df6e5e vfs-6.12.folio
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZuQEvgAKCRCRxhvAZXjc
 ou77AQD3U1KjbdgzbUi6kaUmiiWOPhfYTlm8mho8dBjqvTCB+AD/XTWSFCWWhHB4
 KyQZTbjRD81xmVNbKjASazp0EA6Ahwc=
 =gIsD
 -----END PGP SIGNATURE-----

Merge tag 'vfs-6.12.folio' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs

Pull vfs folio updates from Christian Brauner:
 "This contains work to port write_begin and write_end to rely on folios
  for various filesystems.

  This converts ocfs2, vboxfs, orangefs, jffs2, hostfs, fuse, f2fs,
  ecryptfs, ntfs3, nilfs2, reiserfs, minixfs, qnx6, sysv, ufs, and
  squashfs.

  After this series lands a bunch of the filesystems in this list do not
  mention struct page anymore"

* tag 'vfs-6.12.folio' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs: (61 commits)
  Squashfs: Ensure all readahead pages have been used
  Squashfs: Rewrite and update squashfs_readahead_fragment() to not use page->index
  Squashfs: Update squashfs_readpage_block() to not use page->index
  Squashfs: Update squashfs_readahead() to not use page->index
  Squashfs: Update page_actor to not use page->index
  jffs2: Use a folio in jffs2_garbage_collect_dnode()
  jffs2: Convert jffs2_do_readpage_nolock to take a folio
  buffer: Convert __block_write_begin() to take a folio
  ocfs2: Convert ocfs2_write_zero_page to use a folio
  fs: Convert aops->write_begin to take a folio
  fs: Convert aops->write_end to take a folio
  vboxsf: Use a folio in vboxsf_write_end()
  orangefs: Convert orangefs_write_begin() to use a folio
  orangefs: Convert orangefs_write_end() to use a folio
  jffs2: Convert jffs2_write_begin() to use a folio
  jffs2: Convert jffs2_write_end() to use a folio
  hostfs: Convert hostfs_write_end() to use a folio
  fuse: Convert fuse_write_begin() to use a folio
  fuse: Convert fuse_write_end() to use a folio
  f2fs: Convert f2fs_write_begin() to use a folio
  ...
2024-09-16 08:54:30 +02:00
David Howells
ee4cdf7ba8
netfs: Speed up buffered reading
Improve the efficiency of buffered reads in a number of ways:

 (1) Overhaul the algorithm in general so that it's a lot more compact and
     split the read submission code between buffered and unbuffered
     versions.  The unbuffered version can be vastly simplified.

 (2) Read-result collection is handed off to a work queue rather than being
     done in the I/O thread.  Multiple subrequests can be processes
     simultaneously.

 (3) When a subrequest is collected, any folios it fully spans are
     collected and "spare" data on either side is donated to either the
     previous or the next subrequest in the sequence.

Notes:

 (*) Readahead expansion is massively slows down fio, presumably because it
     causes a load of extra allocations, both folio and xarray, up front
     before RPC requests can be transmitted.

 (*) RDMA with cifs does appear to work, both with SIW and RXE.

 (*) PG_private_2-based reading and copy-to-cache is split out into its own
     file and altered to use folio_queue.  Note that the copy to the cache
     now creates a new write transaction against the cache and adds the
     folios to be copied into it.  This allows it to use part of the
     writeback I/O code.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Link: https://lore.kernel.org/r/20240814203850.2240469-20-dhowells@redhat.com/ # v2
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-09-12 12:20:41 +02:00
Dominique Martinet
e3786b29c5
9p: Fix DIO read through netfs
If a program is watching a file on a 9p mount, it won't see any change in
size if the file being exported by the server is changed directly in the
source filesystem, presumably because 9p doesn't have change notifications,
and because netfs skips the reads if the file is empty.

Fix this by attempting to read the full size specified when a DIO read is
requested (such as when 9p is operating in unbuffered mode) and dealing
with a short read if the EOF was less than the expected read.

To make this work, filesystems using netfslib must not set
NETFS_SREQ_CLEAR_TAIL if performing a DIO read where that read hit the EOF.
I don't want to mandatorily clear this flag in netfslib for DIO because,
say, ceph might make a read from an object that is not completely filled,
but does not reside at the end of file - and so we need to clear the
excess.

This can be tested by watching an empty file over 9p within a VM (such as
in the ktest framework):

        while true; do read content; if [ -n "$content" ]; then echo $content; break; fi; done < /host/tmp/foo

then writing something into the empty file.  The watcher should immediately
display the file content and break out of the loop.  Without this fix, it
remains in the loop indefinitely.

Fixes: 80105ed2fd ("9p: Use netfslib read/write_iter")
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=218916
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/r/1229195.1723211769@warthog.procyon.org.uk
cc: Eric Van Hensbergen <ericvh@kernel.org>
cc: Latchesar Ionkov <lucho@ionkov.net>
cc: Christian Schoenebeck <linux_oss@crudebyte.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: Ilya Dryomov <idryomov@gmail.com>
cc: Steve French <sfrench@samba.org>
cc: Paulo Alcantara <pc@manguebit.com>
cc: Trond Myklebust <trond.myklebust@hammerspace.com>
cc: v9fs@lists.linux.dev
cc: linux-afs@lists.infradead.org
cc: ceph-devel@vger.kernel.org
cc: linux-cifs@vger.kernel.org
cc: linux-nfs@vger.kernel.org
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-08-13 13:53:09 +02:00