Under memory pressure, direct reclaim can kick in during compressed
readahead. This puts the associated task into D-state. Then shrink_lruvec()
disables interrupts when acquiring the LRU lock. Under heavy pressure,
we've observed reclaim can run long enough that the CPU becomes prone to
CSD lock stalls since it cannot service incoming IPIs. Although the CSD
lock stalls are the worst case scenario, we have found many more subtle
occurrences of this latency on the order of seconds, over a minute in some
cases.
Prevent direct reclaim during compressed readahead. This is achieved by
using different GFP flags at key points when the bio is marked for
readahead.
There are two functions that allocate during compressed readahead:
btrfs_alloc_compr_folio() and add_ra_bio_pages(). Both currently use
GFP_NOFS which includes __GFP_DIRECT_RECLAIM.
For the internal API call btrfs_alloc_compr_folio(), the signature changes
to accept an additional gfp_t parameter. At the readahead call site, it
gets flags similar to GFP_NOFS but stripped of __GFP_DIRECT_RECLAIM.
__GFP_NOWARN is added since these allocations are allowed to fail. Demand
reads still use full GFP_NOFS and will enter reclaim if needed. All other
existing call sites of btrfs_alloc_compr_folio() now explicitly pass
GFP_NOFS to retain their current behavior.
add_ra_bio_pages() gains a bool parameter which allows callers to specify
if they want to allow direct reclaim or not. In either case, the
__GFP_NOWARN flag was added unconditionally since the allocations are
speculative.
There has been some previous work done on calling add_ra_bio_pages() [0].
This patch is complementary: where that patch reduces call frequency, this
patch reduces the latency associated with those calls.
[0] https://lore.kernel.org/linux-btrfs/656838ec1232314a2657716e59f4f15a8eadba64.1751492111.git.boris@bur.io/
Reviewed-by: Mark Harmstone <mark@harmstone.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: JP Kobryn (Meta) <jp.kobryn@linux.dev>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We have two locations using open-coded 512K size, as the async chunk
size.
For compression we have not only the max size a compressed extent can
represent (128K), but also how large an async chunk can be (512K).
Although we have a macro for the maximum compressed extent size, we do
not have any macro for the async chunk size.
Add such a macro and replace the two open-coded SZ_512K.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Now there is no one utilizing that member, we can safely remove it along
with compressed_bio::nr_folios member. The size is reduced from 352 to
336 bytes on x86_64.
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently only encoded writes utilized btrfs_submit_compressed_write(),
which utilized compressed_bio::compressed_folios[] array.
Change the only call site to call the new helper,
btrfs_alloc_compressed_write(), to allocate a compressed bio, then queue
needed folios into that bio, and finally call
btrfs_submit_compressed_write() to submit the compressed bio.
This change has one hidden benefit, previously we used
btrfs_alloc_folio_array() for the folios of
btrfs_submit_compressed_read(), which doesn't utilize the compression
page pool for bs == ps cases.
Now we call btrfs_alloc_compr_folio() which will benefit from the page pool.
The other obvious benefit is that we no longer need to allocate an array
to hold all those folios, thus one less error path.
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Since it's been replaced by btrfs_compress_bio(), remove all involved
functions.
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The helper will allocate a new compressed_bio, do the compression, and
return it to the caller.
This greatly simplifies the compression path, as we no longer need to
allocate a folio array thus no extra error path, furthermore the
compressed bio structure can be utilized for submission with very minor
modifications (like rounding up the bi_size and populate the bi_sector).
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The new helper has the following enhancements against the existing
zlib_compress_folios()
- Much smaller parameter list
No more shared IN/OUT members, no need to pre-allocate a
compressed_folios[] array.
Just a workspace and compressed_bio pointer, everything we need can be
extracted from that @cb pointer.
- Ready-to-be-submitted compressed bio
Although the caller still needs to do some common works like
rounding up and zeroing the tailing part of the last fs block.
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The new helper has the following enhancements against the existing
zstd_compress_folios()
- Much smaller parameter list
No more shared IN/OUT members, no need to pre-allocate a
compressed_folios[] array.
Just a workspace and compressed_bio pointer, everything we need can be
extracted from that @cb pointer.
- Ready-to-be-submitted compressed bio
Although the caller still needs to do some common works like
rounding up and zeroing the tailing part of the last fs block.
Overall the workflow is the same as zstd_compress_folios(), but with
some minor changes:
- @start/@len is now constant
For the current input file offset, use @start + @tot_in instead.
The original change of @start and @len makes it pretty hard to know
what value we're really comparing to.
- No more @cur_len
It's only utilized when switching input buffer.
Directly use btrfs_calc_input_length() instead.
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The new helper has the following enhancements against the existing
lzo_compress_folios()
- Much smaller parameter list
No more shared IN/OUT members, no need to pre-allocate a
compressed_folios[] array.
Just a workspace list header and a compressed_bio pointer.
Everything else can be fetched from that @cb pointer.
- Read-to-be-submitted compressed bio
Although the caller still needs to do some common works like
rounding up and zeroing the tailing part of the last fs block.
Some workloads are specific to lZO that is not needed with other
multi-run compression interfaces:
- Need to write a LZO header or segment header
Use the new write_and_queue_folio() helper to do the bio_add_folio()
call and folio switching.
- Need to update the LZO header after compression is done
Use bio_first_folio_all() to grab the first folio and update the header.
- Extra corner case of error handling
This can happen when we have queued part of a folio and hit an error.
In that case those folios will be released by the bio.
Thus we can only release the folio that has no queued part.
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmko/DUACgkQxWXV+ddt
WDtyCw//UaFOTX/k72HgA1n2MWfegWbWyD+OGNbGosoZljKOrAe/mRnjXTyF9lyW
8GDzGvJzF4Tkl5lyuGyiequlrO2F3veTpwHo94xnBTOYCeiTpMqTN/e/5SkasBpN
4YlWq7OGYR4hwghRvZpaW7nsmVCKDLIlZVkH77x9Bmvx0NLO24EJlEZusQT4zYew
ntC/i9x3DW0ZxYyfRhFIFvk9JUUdgXfxJ6dNexz0zi3dKUSUIR9hI0J9Nwl++1cF
SgjAzbtO064htWoCvsKykgA6YGbJCZjw8XO8D2eJonkN24VbqSMaY44TPXmCMLVs
ZXw871jV2E/urfWhRNdxv/kJdCFudPk0qXG5ZtfHO4UUwS/nZ+qAig+LHawgAOCJ
9CgWy4zrfiYCqULRuqF1wzWu/z22++zIlZC552VAZd1RQ+JjqJY/aje4xhY5nUF4
n1uVBReZaI9sH3jJOsMWpwLMptbhpH9RZp3QPgqZlUHo6GtPJJmNKfw8KgMAhZ7L
wf7iy6v9yo+7VZ2ACwu2qJ+lZRxbZ0yvCnFatN3O5G1O0kkIrZFUM3MwdKtufZ0u
LHWkGfoaq7zR6E6DhIaxIhiTTXMlOfLTikNKgBUO3NEdrRZwrDhr7K07S25jFxSx
ZCNV6OdSCeziShPqT0ntcwecnJ41/kOcm13732NHF+QgzMK5LrI=
=rO4x
-----END PGP SIGNATURE-----
Merge tag 'for-6.19-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs updates from David Sterba:
"Features:
- shutdown ioctl support (needs CONFIG_BTRFS_EXPERIMENTAL for now):
- set filesystem state as being shut down (also named going down
in other filesystems), where all active operations return EIO
and this cannot be changed until unmount
- pending operations are attempted to be finished but error
messages may still show up depending on where exactly the
shutdown happened
- scrub (and device replace) vs suspend/hibernate:
- a running scrub will prevent suspend, which can be annoying as
suspend is an immediate request and scrub is not critical
- filesystem freezing before suspend was not sufficient as the
problem was in process freezing
- behaviour change: on suspend scrub and device replace are
cancelled, where scrub can record the last state and continue
from there; the device replace has to be restarted from the
beginning
- zone stats exported in sysfs, from the perspective of the
filesystem this includes active, reclaimable, relocation etc zones
Performance:
- improvements when processing space reservation tickets by
optimizing locking and shrinking critical sections, cumulative
improvements in lockstat numbers show +15%
Notable fixes:
- use vmalloc fallback when allocating bios as high order allocations
can happen with wide checksums (like sha256)
- scrub will always track the last position of progress so it's not
starting from zero after an error
Core:
- under experimental config, checksum calculations are offloaded to
process context, simplifies locking and allows to remove
compression write worker kthread(s):
- speed improvement in direct IO throughput with buffered IO
fallback is +15% when not offloaded but this is more related to
internal crypto subsystem improvements
- this will be probably default in the future removing the sysfs
tunable
- (experimental) block size > page size updates:
- support more operations when not using large folios (encoded
read/write and send)
- raid56
- more preparations for fscrypt support
Other:
- more conversions to auto-cleaned variables
- parameter cleanups and removals
- extended warning fixes
- improved printing of structured values like keys
- lots of other cleanups and refactoring"
* tag 'for-6.19-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (147 commits)
btrfs: remove unnecessary inode key in btrfs_log_all_parents()
btrfs: remove redundant zero/NULL initializations in btrfs_alloc_root()
btrfs: remaining BTRFS_PATH_AUTO_FREE conversions
btrfs: send: do not allocate memory for xattr data when checking it exists
btrfs: send: add unlikely to all unexpected overflow checks
btrfs: reduce arguments to btrfs_del_inode_ref_in_log()
btrfs: remove root argument from btrfs_del_dir_entries_in_log()
btrfs: use test_and_set_bit() in btrfs_delayed_delete_inode_ref()
btrfs: don't search back for dir inode item in INO_LOOKUP_USER
btrfs: don't rewrite ret from inode_permission
btrfs: add orig_logical to btrfs_bio for encryption
btrfs: disable verity on encrypted inodes
btrfs: disable various operations on encrypted inodes
btrfs: remove redundant level reset in btrfs_del_items()
btrfs: simplify leaf traversal after path release in btrfs_next_old_leaf()
btrfs: optimize balance_level() path reference handling
btrfs: factor out root promotion logic into promote_child_to_root()
btrfs: raid56: remove the "_step" infix
btrfs: raid56: enable bs > ps support
btrfs: raid56: prepare finish_parity_scrub() to support bs > ps cases
...
Commit 2c25716dcc ("btrfs: zlib: fix and simplify the inline extent
decompression") renamed the 'start_byte' parameter to 'dest_pgoff' in
the btrfs_decompress(). The remaining 'start_byte' references are
inconsistent with the actual implementation and may cause confusion for
developers.
Ensure consistency between function declaration and implementation.
Signed-off-by: Zhen Ni <zhen.ni@easystack.cn>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The reason why end_bbio_compressed_write() queues a work into
compressed_write_workers wq is for end_compressed_writeback() call, as
it will grab all the involved folios and clear the writeback flags,
which may sleep.
However now we always run btrfs_bio::end_io() in task context, there is
no need to queue the work anymore.
Just remove btrfs_fs_info::compressed_write_workers and
compressed_bio::write_end_work.
There is a comment about the works queued into
compressed_write_workers, now change to flush endio wq instead, which is
responsible to handle all data endio functions.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently there is only one caller which doesn't populate
btrfs_bio::inode, and that's scrub.
The idea is scrub doesn't want any automatic csum verification nor
read-repair, as everything will be handled by scrub itself.
However that behavior is really no different than metadata inode, thus
we can reuse btree_inode as btrfs_bio::inode for scrub.
The only exception is in btrfs_submit_chunk() where if a bbio is from
scrub or data reloc inode, we set rst_search_commit_root to true.
This means we still need a way to distinguish scrub from metadata, but
that can be done by a new flag inside btrfs_bio.
Now btrfs_bio::inode is a mandatory parameter, we can extract fs_info
from that inode thus can remove btrfs_bio::fs_info to save 8 bytes from
btrfs_bio structure.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
When I tried to remove btrfs_bio::fs_info and use btrfs_bio::inode to
grab the fs_info, the header "btrfs_inode.h" is needed to access the
full btrfs_inode structure.
Then btrfs will fail to compile.
[CAUSE]
There is a recursive including chain:
"bio.h" -> "btrfs_inode.h" -> "extent_map.h" -> "compression.h" ->
"bio.h"
That recursive including is causing problems for btrfs.
[ENHANCEMENT]
To reduce the risk of recursive including:
- Remove unnecessary local includes from btrfs headers
Either the included header is pulled in by other headers, or is
completely unnecessary.
- Remove btrfs local includes if the header only requires a pointer
In that case let the implementing C file to pull the required header.
This is especially important for headers like "btrfs_inode.h" which
pulls in a lot of other btrfs headers, thus it's a mine field of
recursive including.
- Remove unnecessary temporary structure definition
Either if we have included the header defining the structure, or
completely unused.
Now including "btrfs_inode.h" inside "bio.h" is completely fine,
although "btrfs_inode.h" still includes "extent_map.h", but that header
only includes "fs.h", no more chain back to "bio.h".
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs defined its own variant of folio_next_pos() called folio_end().
This is an ambiguous name as 'end' might be exclusive or inclusive.
Switch to the new folio_next_pos().
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Link: https://patch.msgid.link/20251024170822.1427218-3-willy@infradead.org
Acked-by: David Sterba <dsterba@suse.com>
Cc: Chris Mason <clm@fb.com>
Cc: David Sterba <dsterba@suse.com>
Cc: linux-btrfs@vger.kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
This includes the following preparation for bs > ps cases:
- Always alloc/free the folio directly if bs > ps
This adds a new @fs_info parameter for btrfs_alloc_compr_folio(), thus
affecting all compression algorithms.
For btrfs_free_compr_folio() it needs no parameter for now, as we can
use the folio size to skip the caching part.
For now the change is just to passing a @fs_info into the function,
all the folio size assumption is still based on page size.
- Properly zero the last folio in compress_file_range()
Since the compressed folios can be larger than a page, we need to
properly zero the whole folio.
- Use correct folio size for btrfs_add_compressed_bio_folios()
Instead of page size, use the correct folio size.
- Use correct folio size/shift for btrfs_compress_filemap_get_folio()
As we are not only using simple page sized folios anymore.
- Use correct folio size for btrfs_decompress()
There is an ASSERT() making sure the decompressed range is no larger
than a page, which will be triggered for bs > ps cases.
- Skip readahead for compressed pages
Similar to subpage cases.
- Make btrfs_alloc_folio_array() to accept a new @order parameter
- Add a helper to calculate the minimal folio size
All those changes should not affect the existing bs <= ps handling.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Since all workspace managers are per-fs, there is no need nor no way to
store them inside btrfs_compress_op::wsm anymore.
With that said, we can do the following modifications:
- Remove zstd_workspace_mananger::ops
Zstd always grab the global btrfs_compress_op[].
- Remove btrfs_compress_op::wsm member
- Rename btrfs_compress_op to btrfs_compress_levels
This should make it more clear that btrfs_compress_levels structures are
only to indicate the levels of each compress algorithm.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Since all workspaces are handled by the per-fs workspace managers, we
can safely remove the old per-module managers.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This involves:
- Add zstd_alloc_workspace_manager() and zstd_free_workspace_manager()
Those two functions will accept an fs_info pointer, and alloc/free
fs_info->compr_wsm[BTRFS_COMPRESS_ZSTD] pointer.
- Add btrfs_alloc_compress_wsm() and btrfs_free_compress_wsm()
Those are helpers allocating the workspace managers for all
algorithms.
For now only zstd is supported, and the timing is a little unusual,
the btrfs_alloc_compress_wsm() should only be called after the
sectorsize being initialized.
Meanwhile btrfs_free_fs_info_compress() is called in
btrfs_free_fs_info().
- Move the definition of btrfs_compression_type to "fs.h"
The reason is that "compression.h" has already included "fs.h", thus
we can not just include "compression.h" to get the definition of
BTRFS_NR_COMPRESS_TYPES to define fs_info::compr_wsm[].
For now the per-fs zstd workspace manager won't really have any effect,
and all compression is still going through the global workspace manager.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BACKGROUND]
Currently btrfs shares workspaces and their managers for all filesystems,
this is mostly fine as all those workspaces are using page size based
buffers, and btrfs only support block size (bs) <= page size (ps).
This means even if bs < ps, we at most waste some buffer space in the
workspace, but everything will still work fine.
The problem here is that is limiting our support for bs > ps cases.
As now a workspace now may need larger buffer to handle bs > ps cases,
but since the pool has no way to distinguish different workspaces, a
regular workspace (which is still using buffer size based on ps) can be
passed to a btrfs whose bs > ps.
In that case the buffer is not large enough, and will cause various
problems.
[ENHANCEMENT]
To prepare for the per-fs workspace migration, add an fs_info parameter
to all workspace related functions.
For now this new fs_info parameter is not yet utilized.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
For the 3 supported compression algorithms, two of them (zstd and zlib)
are already grabbing the btrfs inode for error messages.
It's more common to pass btrfs_inode and grab the address space from it.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Inspired by recent changes to compression level parsing in
6db1df415d ("btrfs: accept and ignore compression level for lzo")
it turns out that we do not do any extra validation for compression
level input string, thus allowing things like "compress=lzo:invalid" to
be accepted without warnings.
Although we accept levels that are beyond the supported algorithm
ranges, accepting completely invalid level specification is not correct.
Fix the too loose checks for compression level, by doing proper error
handling of kstrtoint(), so that we will reject not only too large
values (beyond int range) but also completely wrong levels like
"lzo:invalid".
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently the defrag ioctl cannot rewrite the extents without
compression. Add a new flag for that, as setting compression to 0 (or
"no compression") means to do no changes to compression so take what is
the current default, like mount options or properties.
The defrag setting overrides mount or properties. The compression
BTRFS_DEFRAG_DONT_COMPRESS is only used for in-memory operations and
does not need to have a fixed value.
Mount with zstd:9, copy test file from /usr/bin/ (about 260KB):
$ mount -o compress=zstd:9 /dev/vda /mnt
$ filefrag -vsb testfile
filefrag: -b needs a blocksize option, assuming 1024-byte blocks.
Filesystem type is: 9123683e
File size of testfile is 297704 (292 blocks of 1024 bytes)
ext: logical_offset: physical_offset: length: expected: flags:
0: 0.. 127: 13312.. 13439: 128: encoded
1: 128.. 255: 13364.. 13491: 128: 13440: encoded
2: 256.. 291: 13424.. 13459: 36: 13492: last,encoded,eof
testfile: 3 extents found
$ compsize testfile
Processed 1 file, 3 regular extents (3 refs), 0 inline, 1 fragments.
Type Perc Disk Usage Uncompressed Referenced
TOTAL 42% 124K 292K 292K
zstd 42% 124K 292K 292K
Defrag to uncompressed:
$ btrfs fi defrag --nocomp testfile
$ filefrag -vsb testfile
filefrag: -b needs a blocksize option, assuming 1024-byte blocks.
Filesystem type is: 9123683e
File size of testfile is 297704 (292 blocks of 1024 bytes)
ext: logical_offset: physical_offset: length: expected: flags:
0: 0.. 291: 291840.. 292131: 292: last,eof
testfile: 1 extent found
$ compsize testfile
Processed 1 file, 1 regular extents (1 refs), 0 inline, 1 fragments.
Type Perc Disk Usage Uncompressed Referenced
TOTAL 100% 292K 292K 292K
none 100% 292K 292K 292K
Compress again with LZO:
$ btrfs fi defrag -clzo testfile
$ filefrag -vsb testfile
filefrag: -b needs a blocksize option, assuming 1024-byte blocks.
Filesystem type is: 9123683e
File size of testfile is 297704 (292 blocks of 1024 bytes)
ext: logical_offset: physical_offset: length: expected: flags:
0: 0.. 127: 13312.. 13439: 128: encoded
1: 128.. 255: 13392.. 13519: 128: 13440: encoded
2: 256.. 291: 13480.. 13515: 36: 13520: last,encoded,eof
testfile: 3 extents found
$ compsize testfile
Processed 1 file, 3 regular extents (3 refs), 0 inline, 1 fragments.
Type Perc Disk Usage Uncompressed Referenced
TOTAL 64% 188K 292K 292K
lzo 64% 188K 292K 292K
Signed-off-by: David Sterba <dsterba@suse.com>
Simplify folio_pos() + folio_size() and use the new helper.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
All compression algorithms inside btrfs are not supporting large folios
due to the following points:
- btrfs_calc_input_length() is assuming page sized folio
- kmap_local_folio() usages are using offset_in_page()
Prepare them to support large data folios by:
- Add a folio parameter to btrfs_calc_input_length()
And use that folio parameter to calculate the correct length.
Since we're here, also add extra ASSERT()s to make sure the parameter
@cur is inside the folio range.
This affects only zlib and zstd. Lzo compresses at most one block at a
time, thus not affected.
- Use offset_in_folio() to calculate the kmap_local_folio() offset
This affects all 3 algorithms.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The zstd and zlib compression types support setting compression levels.
Enhance the defrag interface to specify the levels as well. For zstd the
negative (realtime) levels are also accepted.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Daniel Vacek <neelx@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The LZO compression has only one level, we don't need to pass the
parameter.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The old page API is being gradually replaced and converted to use folio
to improve code readability and avoid repeated conversion between page
and folio. Based on the previous patch, the compression path can be
directly used in folio without converting to page.
Signed-off-by: Li Zetao <lizetao1@huawei.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The old page API is being gradually replaced and converted to use folio
to improve code readability and avoid repeated conversion between page
and folio. And memcpy_to_page() can be replaced with memcpy_to_folio().
But there is no memzero_folio(), but it can be replaced equivalently by
folio_zero_range().
Signed-off-by: Li Zetao <lizetao1@huawei.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The old page API is being gradually replaced and converted to use folio
to improve code readability and avoid repeated conversion between page
and folio. And memcpy_to_page() can be replaced with memcpy_to_folio().
But there is no memzero_folio(), but it can be replaced equivalently by
folio_zero_range().
Signed-off-by: Li Zetao <lizetao1@huawei.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The old page API is being gradually replaced and converted to use folio
to improve code readability and avoid repeated conversion between page
and folio. And memcpy_to_page() can be replaced with memcpy_to_folio().
But there is no memzero_folio(), but it can be replaced equivalently by
folio_zero_range().
Signed-off-by: Li Zetao <lizetao1@huawei.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently btrfs compression path is not really subpage compatible, every
thing is still done in page unit.
That's fine for regular sector size and subpage routine. As even for
subpage routine compression is only enabled if the whole range is page
aligned, so reading the page cache in page unit is totally fine.
However in preparation for the future subpage perfect compression
support, we need to change the compression routine to properly handle a
subpage range.
This patch would prepare both zlib and zstd to only read the subpage
range for compression.
Lzo is already doing subpage aware read, as lzo's on-disk format is
already sectorsize dependent.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Pass a struct btrfs_inode to btrfs_compress_heuristic() as it's an
internal interface, allowing to remove some use of BTRFS_I.
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
For both compression and decompression paths, we always require a
"struct page **pages" and "unsigned long nr_pages", this involves quite
some part of the btrfs compression paths:
- All the compression entry points
- compressed_bio structure
This affects both compression and decompression.
- async_extent structure
Unfortunately with all those involved parts, there is no good way to
split the conversion into smaller patches while still passing compiling.
So do this in one big conversion in one go.
Please note this is direct page->folio conversion, no change on the page
sized folio requirement yet.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ minor style fixups ]
Signed-off-by: David Sterba <dsterba@suse.com>
Currently we have two wrappers to allocate and free a page for
compression usage:
- btrfs_alloc_compr_page()
- btrfs_free_compr_page()
The allocator would try to grab a page from the pool, and only allocate
a new page if the pool is empty.
The reclaimer would check if the pool is full, and if not full it would
put the page into the pool.
This patch converts both helpers to use folio interfaces, and allowing
further conversion of compression path to folios.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
For all the supported compression algorithms, the compression path would
always need to grab the page cache, then do the compression.
Normally we would get a page reference without any problem, since the
write path should have already locked the pages in the write range.
For the sake of error handling, we should handle the page cache miss
case.
Adds a common wrapper, btrfs_compress_find_get_page(), which calls
find_get_page(), and do the error handling along with an error message.
Callers inside compression path would only need to call
btrfs_compress_find_get_page(), and error out if it returned any error.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Do a cleanup in more headers:
- add forward declarations for types referenced by pointers
- add includes when types need them
This fixes potential compilation problems if the headers are reordered
or the missing includes are not provided indirectly.
Signed-off-by: David Sterba <dsterba@suse.com>
Note: this is a fixed version that was previously reverted as
e01a83e126 ("Revert "btrfs: zstd: fix and simplify the inline extent
decompression""), with fixed parameters to memzero_page().
[BUG]
If we have a filesystem with 4k sectorsize, and an inlined compressed
extent created like this:
item 4 key (257 INODE_ITEM 0) itemoff 15863 itemsize 160
generation 8 transid 8 size 4096 nbytes 4096
block group 0 mode 100600 links 1 uid 0 gid 0 rdev 0
sequence 1 flags 0x0(none)
item 5 key (257 INODE_REF 256) itemoff 15839 itemsize 24
index 2 namelen 14 name: source_inlined
item 6 key (257 EXTENT_DATA 0) itemoff 15770 itemsize 69
generation 8 type 0 (inline)
inline extent data size 48 ram_bytes 4096 compression 3 (zstd)
Then trying to reflink that extent in an aarch64 system with 64K page
size, the reflink would just fail:
# xfs_io -f -c "reflink $mnt/source_inlined 0 60k 4k" $mnt/dest
XFS_IOC_CLONE_RANGE: Input/output error
[CAUSE]
In zstd_decompress(), we didn't treat @start_byte as just a page offset,
but also use it as an indicator on whether we should error out, without
any proper explanation (this is copied from other decompression code).
In reality, for subpage cases, although @start_byte can be non-zero,
we should never switch input/output buffer nor error out, since the whole
input/output buffer should never exceed one sector, thus we should not
need to do any buffer switch.
Thus the current code using @start_byte as a condition to switch
input/output buffer or finish the decompression is completely incorrect.
[FIX]
The fix involves several modification:
- Rename @start_byte to @dest_pgoff to properly express its meaning
- Use @sectorsize other than PAGE_SIZE to properly initialize the
output buffer size
- Use correct destination offset inside the destination page
- Simplify the main loop
Since the input/output buffer should never switch, we only need one
zstd_decompress_stream() call.
- Consider early end as an error
After the fix, even on 64K page sized aarch64, above reflink now
works as expected:
# xfs_io -f -c "reflink $mnt/source_inlined 0 60k 4k" $mnt/dest
linked 4096/4096 bytes at offset 61440
And results the correct file layout:
item 9 key (258 INODE_ITEM 0) itemoff 15542 itemsize 160
generation 10 transid 10 size 65536 nbytes 4096
block group 0 mode 100600 links 1 uid 0 gid 0 rdev 0
sequence 1 flags 0x0(none)
item 10 key (258 INODE_REF 256) itemoff 15528 itemsize 14
index 3 namelen 4 name: dest
item 11 key (258 XATTR_ITEM 3817753667) itemoff 15445 itemsize 83
location key (0 UNKNOWN.0 0) type XATTR
transid 10 data_len 37 name_len 16
name: security.selinux
data unconfined_u:object_r:unlabeled_t:s0
item 12 key (258 EXTENT_DATA 61440) itemoff 15392 itemsize 53
generation 10 type 1 (regular)
extent data disk byte 13631488 nr 4096
extent data offset 0 nr 4096 ram 4096
extent compression 0 (none)
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This reverts commit 1e7f6def8b.
It causes my machine to not even boot, and Klara Modin reports that the
cause is that small zstd-compressed files return garbage when read.
Reported-by: Klara Modin <klarasmodin@gmail.com>
Link: https://lore.kernel.org/linux-btrfs/CABq1_vj4GpUeZpVG49OHCo-3sdbe2-2ROcu_xDvUG-6-5zPRXg@mail.gmail.com/
Reported-and-bisected-by: Linus Torvalds <torvalds@linux-foundation.org>
Acked-by: David Sterba <dsterba@suse.com>
Cc: Qu Wenruo <wqu@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
[BUG]
If we have a filesystem with 4k sectorsize, and an inlined compressed
extent created like this:
item 4 key (257 INODE_ITEM 0) itemoff 15863 itemsize 160
generation 8 transid 8 size 4096 nbytes 4096
block group 0 mode 100600 links 1 uid 0 gid 0 rdev 0
sequence 1 flags 0x0(none)
item 5 key (257 INODE_REF 256) itemoff 15839 itemsize 24
index 2 namelen 14 name: source_inlined
item 6 key (257 EXTENT_DATA 0) itemoff 15770 itemsize 69
generation 8 type 0 (inline)
inline extent data size 48 ram_bytes 4096 compression 3 (zstd)
Then trying to reflink that extent in an aarch64 system with 64K page
size, the reflink would just fail:
# xfs_io -f -c "reflink $mnt/source_inlined 0 60k 4k" $mnt/dest
XFS_IOC_CLONE_RANGE: Input/output error
[CAUSE]
In zstd_decompress(), we didn't treat @start_byte as just a page offset,
but also use it as an indicator on whether we should error out, without
any proper explanation (this is copied from other decompression code).
In reality, for subpage cases, although @start_byte can be non-zero,
we should never switch input/output buffer nor error out, since the whole
input/output buffer should never exceed one sector, thus we should not
need to do any buffer switch.
Thus the current code using @start_byte as a condition to switch
input/output buffer or finish the decompression is completely incorrect.
[FIX]
The fix involves several modification:
- Rename @start_byte to @dest_pgoff to properly express its meaning
- Use @sectorsize other than PAGE_SIZE to properly initialize the
output buffer size
- Use correct destination offset inside the destination page
- Simplify the main loop
Since the input/output buffer should never switch, we only need one
zstd_decompress_stream() call.
- Consider early end as an error
After the fix, even on 64K page sized aarch64, above reflink now
works as expected:
# xfs_io -f -c "reflink $mnt/source_inlined 0 60k 4k" $mnt/dest
linked 4096/4096 bytes at offset 61440
And results the correct file layout:
item 9 key (258 INODE_ITEM 0) itemoff 15542 itemsize 160
generation 10 transid 10 size 65536 nbytes 4096
block group 0 mode 100600 links 1 uid 0 gid 0 rdev 0
sequence 1 flags 0x0(none)
item 10 key (258 INODE_REF 256) itemoff 15528 itemsize 14
index 3 namelen 4 name: dest
item 11 key (258 XATTR_ITEM 3817753667) itemoff 15445 itemsize 83
location key (0 UNKNOWN.0 0) type XATTR
transid 10 data_len 37 name_len 16
name: security.selinux
data unconfined_u:object_r:unlabeled_t:s0
item 12 key (258 EXTENT_DATA 61440) itemoff 15392 itemsize 53
generation 10 type 1 (regular)
extent data disk byte 13631488 nr 4096
extent data offset 0 nr 4096 ram 4096
extent compression 0 (none)
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
If we have a filesystem with 4k sectorsize, and an inlined compressed
extent created like this:
item 4 key (257 INODE_ITEM 0) itemoff 15863 itemsize 160
generation 8 transid 8 size 4096 nbytes 4096
block group 0 mode 100600 links 1 uid 0 gid 0 rdev 0
sequence 1 flags 0x0(none)
item 5 key (257 INODE_REF 256) itemoff 15839 itemsize 24
index 2 namelen 14 name: source_inlined
item 6 key (257 EXTENT_DATA 0) itemoff 15770 itemsize 69
generation 8 type 0 (inline)
inline extent data size 48 ram_bytes 4096 compression 2 (lzo)
Then trying to reflink that extent in an aarch64 system with 64K page
size, the reflink would just fail:
# xfs_io -f -c "reflink $mnt/source_inlined 0 60k 4k" $mnt/dest
XFS_IOC_CLONE_RANGE: Input/output error
[CAUSE]
In zlib_decompress(), we didn't treat @start_byte as just a page offset,
but also use it as an indicator on whether we should error out, without
any proper explanation (this is from the very beginning of btrfs).
In reality, for subpage cases, although @start_byte can be non-zero,
we should never switch input/output buffer nor error out, since the whole
input/output buffer should never exceed one sector.
Note: The above assumption is only not true if we're going to support
multi-page sectorsize.
Thus the current code using @start_byte as a condition to switch
input/output buffer or finish the decompression is completely incorrect.
[FIX]
The fix involves several modifications:
- Rename @start_byte to @dest_pgoff to properly express its meaning
- Use @sectorsize other than PAGE_SIZE to properly initialize the
output buffer size
- Use correct destination offset inside the destination page
- Use memcpy_to_page() to copy the contents to the destination page
- Use memzero_page() to zero out the tailing part
- Consider early end as an error
After the fix, even on 64K page sized aarch64, above reflink now
works as expected:
# xfs_io -f -c "reflink $mnt/source_inlined 0 60k 4k" $mnt/dest
linked 4096/4096 bytes at offset 61440
And results the correct file layout:
item 9 key (258 INODE_ITEM 0) itemoff 15542 itemsize 160
generation 10 transid 10 size 65536 nbytes 4096
block group 0 mode 100600 links 1 uid 0 gid 0 rdev 0
sequence 1 flags 0x0(none)
item 10 key (258 INODE_REF 256) itemoff 15528 itemsize 14
index 3 namelen 4 name: dest
item 11 key (258 XATTR_ITEM 3817753667) itemoff 15445 itemsize 83
location key (0 UNKNOWN.0 0) type XATTR
transid 10 data_len 37 name_len 16
name: security.selinux
data unconfined_u:object_r:unlabeled_t:s0
item 12 key (258 EXTENT_DATA 61440) itemoff 15392 itemsize 53
generation 10 type 1 (regular)
extent data disk byte 13631488 nr 4096
extent data offset 0 nr 4096 ram 4096
extent compression 0 (none)
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
If we have a filesystem with 4k sectorsize, and an inlined compressed
extent created like this:
item 4 key (257 INODE_ITEM 0) itemoff 15863 itemsize 160
generation 8 transid 8 size 4096 nbytes 4096
block group 0 mode 100600 links 1 uid 0 gid 0 rdev 0
sequence 1 flags 0x0(none)
item 5 key (257 INODE_REF 256) itemoff 15839 itemsize 24
index 2 namelen 14 name: source_inlined
item 6 key (257 EXTENT_DATA 0) itemoff 15770 itemsize 69
generation 8 type 0 (inline)
inline extent data size 48 ram_bytes 4096 compression 1 (zlib)
Which has an inline compressed extent at file offset 0, and its
decompressed size is 4K, allowing us to reflink that 4K range to another
location (which will not be compressed).
If we do such reflink on a subpage system, it would fail like this:
# xfs_io -f -c "reflink $mnt/source_inlined 0 60k 4k" $mnt/dest
XFS_IOC_CLONE_RANGE: Input/output error
[CAUSE]
In zlib_decompress(), we didn't treat @start_byte as just a page offset,
but also use it as an indicator on whether we should switch our output
buffer.
In reality, for subpage cases, although @start_byte can be non-zero,
we should never switch input/output buffer, since the whole input/output
buffer should never exceed one sector.
Note: The above assumption is only not true if we're going to support
multi-page sectorsize.
Thus the current code using @start_byte as a condition to switch
input/output buffer or finish the decompression is completely incorrect.
[FIX]
The fix involves several modifications:
- Rename @start_byte to @dest_pgoff to properly express its meaning
- Add an extra ASSERT() inside btrfs_decompress() to make sure the
input/output size never exceeds one sector.
- Use Z_FINISH flag to make sure the decompression happens in one go
- Remove the loop needed to switch input/output buffers
- Use correct destination offset inside the destination page
- Consider early end as an error
After the fix, even on 64K page sized aarch64, above reflink now
works as expected:
# xfs_io -f -c "reflink $mnt/source_inlined 0 60k 4k" $mnt/dest
linked 4096/4096 bytes at offset 61440
And resulted a correct file layout:
item 9 key (258 INODE_ITEM 0) itemoff 15542 itemsize 160
generation 10 transid 10 size 65536 nbytes 4096
block group 0 mode 100600 links 1 uid 0 gid 0 rdev 0
sequence 1 flags 0x0(none)
item 10 key (258 INODE_REF 256) itemoff 15528 itemsize 14
index 3 namelen 4 name: dest
item 11 key (258 XATTR_ITEM 3817753667) itemoff 15445 itemsize 83
location key (0 UNKNOWN.0 0) type XATTR
transid 10 data_len 37 name_len 16
name: security.selinux
data unconfined_u:object_r:unlabeled_t:s0
item 12 key (258 EXTENT_DATA 61440) itemoff 15392 itemsize 53
generation 10 type 1 (regular)
extent data disk byte 13631488 nr 4096
extent data offset 0 nr 4096 ram 4096
extent compression 0 (none)
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This is a preparation for managing compression pages in a cache-like
manner, instead of asking the allocator each time. The common allocation
and free wrappers are introduced and are functionally equivalent to the
current code.
The freeing helpers need to be carefully placed where the last reference
is dropped. This is either after directly allocating (error handling)
or when there are no other users of the pages (after copying the contents).
It's safe to not use the helper and use put_page() that will handle the
reference count. Not using the helper means there's lower number of
pages that could be reused without passing them back to allocator.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_submit_compressed_write always operates on a single ordered_extent.
Make that explicit by using btrfs_alloc_ordered_extent in the callers
and passing the ordered_extent to btrfs_submit_compressed_write. This
will help with storing and ordered_extent pointer in the btrfs_bio in
subsequent patches.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Given that read recovery for data I/O is handled in the storage layer,
the mirror_num argument to btrfs_submit_compressed_read is always 0,
so remove it.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_submit_compressed_write should not have to care if it is called
from a helper thread or not. Move the kthread_associate_blkcg handling
into submit_one_async_extent, as that is the one caller that needs it.
Also move the assignment of REQ_CGROUP_PUNT into cow_file_range_async,
as that is the routine that sets up the helper thread offload.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The original bio must be a btrfs_bio, so store a pointer to the
btrfs_bio for better type checking.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_submit_compressed_read expects the bio passed to it to be embedded
into a btrfs_bio structure. Pass the btrfs_bio directly to increase type
safety and make the code self-documenting.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>