- Remove EROFS_MAP_ENCODED since it was always set together with
EROFS_MAP_MAPPED for compressed extents and checked redundantly;
- Replace the EROFS_MAP_FULL_MAPPED flag with the opposite
EROFS_MAP_PARTIAL_MAPPED flag so that extents are implicitly
fully mapped initially to simplify the logic;
- Make fragment extents independent of EROFS_MAP_MAPPED since
they are not directly allocated on disk; thus fragment extents
are no longer twisted with mapped extents.
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Inode with identical data but different @aops cannot be mixed
because the page cache is managed by different subsystems (e.g.,
@aops for compressed on-disk inodes cannot handle plain on-disk
inodes).
In this patch, we never allow inodes to share the page cache
among plain, compressed, and fileio cases. When a shared inode
is created, we initialize @aops that is the same as the initial
real inode, and subsequent inodes cannot share the page cache
if the inferred @aops differ from the corresponding shared inode.
This is reasonable as a first step because, in typical use cases,
if an inode is compressible, it will fall into compressed
inodes across different filesystem images unless users use plain
filesystems. However, in that cases, users will use plain
filesystems all the time.
Fixes: 5ef3208e3b ("erofs: introduce the page cache share feature")
Signed-off-by: Hongbo Li <lihongbo22@huawei.com>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
- Support inode page cache sharing among filesystems
- Formally separate optional encoded (aka compressed) inode layouts
(and the implementations) from the EROFS core on-disk aligned plain
format for future zero-trust security usage
- Improve performance by caching the fact that an inode does not have
a POSIX ACL
- Improve LZ4 decompression error reporting
- Enable LZMA by default and promote DEFLATE and Zstandard algorithms
out of EXPERIMENTAL status
- Switch to inode_set_cached_link() to cache symlink lengths
- random bugfixes and minor cleanups
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEQ0A6bDUS9Y+83NPFUXZn5Zlu5qoFAmmJWA8RHHhpYW5nQGtl
cm5lbC5vcmcACgkQUXZn5Zlu5qpKRhAAmmkeLT5vwxpdk9l5uAzz9rvpJgZzorl2
grD6jn0whzSi3BY7MiSDwcY2wl5xPuZjHRnqrcwQzsxua/Y6YJe9mIZTKhviYzuD
6A90OxO4cIseXlGL+AK+OgiFSUBvC+0AttE9napOxQmkTrBkYPDYX2IoMOxr+1DA
vtsPAWmmYOeyjV+2nYT3qVYKk5LaHu+wjXsH6U7RDi1Cut3xu3FIRqtWKatdfhWs
0NSRVc9IcWyBvMRPjGwlEhGY+XW+tXa62NWNTDDTyXCMVVx4TKXMueJkHvo+ysYg
i7uypDAI+JfnasrlsEuRjjvvqg+bKm+6wd1y9FIU8AefPf2kp1P5QmqmhhPv0PyI
WMm6ZwQX4DTZPo6P4goxw4/SvxY8UMPHYb8/APCI7NfzG8DHCXH/OxW5yamCxL/a
6ZREjpkBtMH4lT9adCNsuKK5HQepsECCXr1BWHQDWarFFoRn0mGYIxZiHspMY2wQ
SaqSkMre59S/ZstYjtYhjwyQPscxq3mejh9Cj7R37U0nhziY54EfwytvlFrTyDZ5
gg9g+/pzEdgfjJ/sVHYMo8lHhglgzFa9hTD41qeu7AeuRmJq4GAlMhnN2bmbuoDs
mgBQam4+m74UyF1yk1L9ks8Ucepkgb/rdLr7u90nCg8PfhtQjyK46BnaCXwmktCz
0d7u6QZXNZ8=
=REdF
-----END PGP SIGNATURE-----
Merge tag 'erofs-for-7.0-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs
Pull erofs updates from Gao Xiang:
"In this cycle, inode page cache sharing among filesystems on the same
machine is now supported, which is particularly useful for
high-density hosts running tens of thousands of containers.
In addition, we fully isolate the EROFS core on-disk format from other
optional encoded layouts since the core on-disk part is designed to be
simple, effective, and secure. Users can use the core format to build
unique golden immutable images and import their filesystem trees
directly from raw block devices via DMA, page-mapped DAX devices,
and/or file-backed mounts without having to worry about unnecessary
intrinsic consistency issues found in other generic filesystems by
design. However, the full vision is still working in progress and will
spend more time to achieve final goals.
There are other improvements and bug fixes as usual, as listed below:
- Support inode page cache sharing among filesystems
- Formally separate optional encoded (aka compressed) inode layouts
(and the implementations) from the EROFS core on-disk aligned plain
format for future zero-trust security usage
- Improve performance by caching the fact that an inode does not have
a POSIX ACL
- Improve LZ4 decompression error reporting
- Enable LZMA by default and promote DEFLATE and Zstandard algorithms
out of EXPERIMENTAL status
- Switch to inode_set_cached_link() to cache symlink lengths
- random bugfixes and minor cleanups"
* tag 'erofs-for-7.0-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs: (31 commits)
erofs: fix UAF issue for file-backed mounts w/ directio option
erofs: update compression algorithm status
erofs: fix inline data read failure for ztailpacking pclusters
erofs: avoid some unnecessary #ifdefs
erofs: handle end of filesystem properly for file-backed mounts
erofs: separate plain and compressed filesystems formally
erofs: use inode_set_cached_link()
erofs: mark inodes without acls in erofs_read_inode()
erofs: implement .fadvise for page cache share
erofs: support compressed inodes for page cache share
erofs: support unencoded inodes for page cache share
erofs: pass inode to trace_erofs_read_folio
erofs: introduce the page cache share feature
erofs: using domain_id in the safer way
erofs: add erofs_inode_set_aops helper to set the aops
erofs: support user-defined fingerprint name
erofs: decouple `struct erofs_anon_fs_type`
fs: Export alloc_empty_backing_file
erofs: tidy up erofs_init_inode_xattrs()
erofs: add missing documentation about `directio` mount option
...
The EROFS on-disk format uses a tiny, plain metadata design that
prioritizes performance and minimizes complex inconsistencies against
common writable disk filesystems (almost all serious metadata
inconsistency cannot happen in well-designed immutable filesystems like
EROFS). EROFS deliberately avoids artificial design flaws to eliminate
serious security risks from untrusted remote sources by design,
although human-made implementation bugs can still happen sometimes.
Currently, there is no strict check to prevent compressed inodes,
especially LZ4-compressed inodes, from being read in plain filesystems.
Starting with erofs-utils 1.0 and Linux 5.3, LZ4_0PADDING sb feature
is automatically enabled for LZ4-compressed EROFS images to support
in-place decompression. Furthermore, since Linux 5.4 LTS is no longer
supported, we no longer need to handle ancient LZ4-compressed EROFS
images generated by erofs-utils prior to 1.0.
To formally distinguish different filesystem types for improved
security:
- Use the presence of LZ4_0PADDING or a non-zero
`dsb->u1.lz4_max_distance` as a marker for compressed filesystems
containing LZ4-compressed inodes only;
- For other algorithms, use `dsb->u1.available_compr_algs` bitmap.
Note: LZ4_0PADDING has been supported since Linux 5.4 (the first formal
kernel version), so exposing it via sysfs is no longer necessary and is
now deprecated (but remain it for five more years until 2031):
`dsb->u1` has been strictly non-zero for all EROFS images containing
compressed inodes starting with erofs-utils v1.3 and it is actually
a much better marker for compressed filesystems.
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Currently, reading files with different paths (or names) but the same
content will consume multiple copies of the page cache, even if the
content of these page caches is the same. For example, reading
identical files (e.g., *.so files) from two different minor versions of
container images will cost multiple copies of the same page cache,
since different containers have different mount points. Therefore,
sharing the page cache for files with the same content can save memory.
This introduces the page cache share feature in erofs. It allocate a
shared inode and use its page cache as shared. Reads for files
with identical content will ultimately be routed to the page cache of
the shared inode. In this way, a single page cache satisfies
multiple read requests for different files with the same contents.
We introduce new mount option `inode_share` to enable the page
sharing mode during mounting. This option is used in conjunction
with `domain_id` to share the page cache within the same trusted
domain.
Signed-off-by: Hongzhen Luo <hongzhen@linux.alibaba.com>
Signed-off-by: Hongbo Li <lihongbo22@huawei.com>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Add erofs_inode_set_aops helper to set the inode->i_mapping->a_ops
and use IS_ENABLED to make it cleaner.
Signed-off-by: Hongbo Li <lihongbo22@huawei.com>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
When creating the EROFS image, users can specify the fingerprint name.
This is to prepare for the upcoming inode page cache share.
Signed-off-by: Hongzhen Luo <hongzhen@linux.alibaba.com>
Signed-off-by: Hongbo Li <lihongbo22@huawei.com>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
- Move the `struct erofs_anon_fs_type` to super.c and expose it
in preparation for the upcoming page cache share feature;
- Remove the `.owner` field, as they are all internal mounts and
fully managed by EROFS. Retaining `.owner` would unnecessarily
increment module reference counts, preventing the EROFS kernel
module from being unloaded.
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
- Get rid of `sbi->opt.max_sync_decompress_pages` since it's fixed as
3 all the time;
- Add Z_EROFS_MAX_SYNC_DECOMPRESS_BYTES in bytes instead of in pages,
since for non-4K pages, 3-page limitation makes no sense;
- Move `sync_decompress` to sbi to avoid unexpected remount impact;
- Fold z_erofs_is_sync_decompress() into its caller;
- Better description of sysfs entry `sync_decompress`.
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Stop definining these privately and instead move them to the uapi
errno.h so that they become canonical instead of copy pasta.
Cc: linux-api@vger.kernel.org
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Link: https://patch.msgid.link/176826402587.3490369.17659117524205214600.stgit@frogsfrogsfrogs
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Add support for reading to the erofs volume label from the
FS_IOC_GETFSLABEL ioctls.
Signed-off-by: Bo Liu (OpenAnolis) <liubo03@inspur.com>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Reviewed-by: Hongbo Li <lihongbo22@huawei.com>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Currently, xattr name prefixes are forcibly placed into the packed
inode if the fragments feature is enabled, and users have no option
to put them in plain form directly on disk.
This is inflexible. First, as mentioned above, users should be able
to store unwrapped long xattr name prefixes unconditionally
(COMPAT_PLAIN_XATTR_PFX). Second, since we now have the new metabox
inode to store metadata, it should be used when available instead
of the packed inode.
Fixes: 414091322c ("erofs: implement metadata compression")
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
This patch supports to readahead more blocks in erofs_readdir(), it can
enhance readdir performance in large direcotry.
readdir test in a large directory which contains 12000 sub-files.
files_per_second
Before: 926385.54
After: 2380435.562
Meanwhile, let's introduces a new sysfs entry to control readahead
bytes to provide more flexible policy for readahead of readdir().
- location: /sys/fs/erofs/<disk>/dir_ra_bytes
- default value: 16384
- disable readahead: set the value to 0
Signed-off-by: Chao Yu <chao@kernel.org>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20250721021352.2495371-1-chao@kernel.org
[ Gao Xiang: minor styling adjustment. ]
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Thanks to the meta buffer infrastructure, metadata-compressed inodes are
just read from the metabox inode instead of the blockdevice (or backing
file) inode.
The same is true for shared extended attributes.
When metadata compression is enabled, inode numbers are divided from
on-disk NIDs because of non-LTS 32-bit application compatibility.
Co-developed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Signed-off-by: Bo Liu (OpenAnolis) <liubo03@inspur.com>
Acked-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20250722003229.2121752-1-hsiangkao@linux.alibaba.com
Filesystem metadata has a high degree of redundancy, so it should
compress well in the general case.
Although metadata compression can increase overall I/O latency, many
users care more about minimized image sizes than extreme runtime
performance. Let's implement metadata compression in response to user
requests [1].
Actually, it's quite simple to implement metadata compression: since
EROFS already supports per-inode compression, we can simply treat a
special inode (called `the metabox inode`) as a container for compressed
inode metadata. Since EROFS supports multiple algorithms, users can
even specify LZ4 for metadata and LZMA for data.
To better support incremental builds, the MSB of NIDs indicates where
the inode metadata is located: if bit 63 is set, the inode itself should
be read from `the metabox inode`.
Optionally, shared xattrs can also be kept in `the metabox inode` if
COMPAT_SHARED_EA_IN_METABOX is set.
[1] https://issues.redhat.com/browse/RHEL-75783
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Acked-by: Chao Yu <chao@kernel.org>
Link: https://lore.kernel.org/r/20250717070804.1446345-2-hsiangkao@linux.alibaba.com
- need_kmap is always true except for a ztailpacking case; thus, just
open-code that one;
- The upcoming metadata compression will add a new boolean, so simplify
this first.
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Link: https://lore.kernel.org/r/20250714090907.4095645-1-hsiangkao@linux.alibaba.com
Fragments aren't limited by Z_EROFS_PCLUSTER_MAX_DSIZE. However, if
a fragment's logical length is larger than Z_EROFS_PCLUSTER_MAX_DSIZE
but the fragment is not the whole inode, it currently returns
-EOPNOTSUPP because m_flags has the wrong EROFS_MAP_ENCODED flag set.
It is not intended by design but should be rare, as it can only be
reproduced by mkfs with `-Eall-fragments` in a specific case.
Let's normalize fragment m_flags using the new EROFS_MAP_FRAGMENT.
Reported-by: Axel Fontaine <axel@axelfontaine.com>
Closes: https://github.com/erofs/erofs-utils/issues/23
Fixes: 7c3ca1838a ("erofs: restrict pcluster size limitations")
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20250711195826.3601157-1-hsiangkao@linux.alibaba.com
When attempting to use an archive file, such as APEX on android,
as a file-backed mount source, it fails because EROFS image within
the archive file does not start at offset 0. As a result, a loop
or a dm device is still needed to attach the image file at an
appropriate offset first. Similarly, if an EROFS image within a
block device does not start at offset 0, it cannot be mounted
directly either.
To address this issue, this patch adds a new mount option `fsoffset=x'
to accept a start offset for the primary device. The offset should be
aligned to the block size. EROFS will add this offset before performing
read requests.
Signed-off-by: Sheng Yong <shengyong1@xiaomi.com>
Signed-off-by: Wang Shuai <wangshuai12@xiaomi.com>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20250517090544.2687651-1-shengyong1@xiaomi.com
[ Gao Xiang: minor update on documentation and the error message. ]
Reviewed-by: Hongbo Li <lihongbo22@huawei.com>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Implement the extent metadata parsing described in the previous commit.
For 16-byte and 32-byte extent records, currently it is just a trivial
binary search without considering the last access footprint, but it can
be optimized for better sequential performance later.
Tail fragments are supported, but ztailpacking feature is not
for simplicity.
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Acked-by: Chao Yu <chao@kernel.org>
Link: https://lore.kernel.org/r/20250310095459.2620647-9-hsiangkao@linux.alibaba.com
Previously, EROFS provided both (non-)compact compressed indexes to
keep necessary hints for each logical block, enabling O(1) random
indexing. This approach was originally designed for small compression
units (e.g., 4KiB), where compressed data is strictly block-aligned via
fixed-sized output compression.
However, EROFS now supports big pclusters up to 1MiB and many users use
large configurations to minimize image sizes. For such configurations,
the total number of extents decreases significantly (e.g., only 1,024
extents for a 1GiB file using 1MiB pclusters), then runtime metadata
overhead becomes negligible compared to data I/O and decoding costs.
Additionally, some popular compression algorithm (mainly Zstd) still
lacks native fixed-sized output compression support (although it's
planned by their authors). Instead of just waiting for compressor
improvements, let's adopt byte-oriented extents, allowing these
compressors to retain their current methods.
For example, it speeds up Zstd compression a lot:
Processor: Intel(R) Xeon(R) Platinum 8163 CPU @ 2.50GHz * 96
Dataset: enwik9
Build time Size Type Command Line
3m52.339s 266653696 FO -C524288 -zzstd,22
3m48.549s 266174464 FO -E48bit -C524288 -zzstd,22
0m12.821s 272134144 FI -E48bit -C1048576 --max-extent-bytes=1048576 -zzstd,22
0m14.528s 248987648 FO -C1048576 -zlzma,9
0m14.605s 248504320 FO -E48bit -C1048576 -zlzma,9
Encoded extents are structured as an array of `struct z_erofs_extent`,
sorted by logical address in ascending order:
__le32 plen // encoded length, algorithm id and flags
__le32 pstart_lo // physical offset LSB
__le32 pstart_hi // physical offset MSB
__le32 lstart_lo // logical offset
__le32 lstart_hi // logical offset MSB
..
Note that prefixed reduced records can be used to minimize metadata for
specific cases (e.g. lstart less than 32 bits, then 32 to 16 bytes).
If the logical lengths of all encoded extents are the same, 4-byte
(plen) and 8-byte (plen, pstart_lo) records can be used. Or, 16-byte
(plen .. lstart_lo) and 32-byte full records have to be used instead.
If 16-byte and 32-byte records are used, the total number of extents
is kept in `struct z_erofs_map_header`, and binary search can be
applied on them. Note that `eytzinger order` is not considerd because
data sequential access is important.
If 4-byte records are used, 8-byte start physical offset is between
`struct z_erofs_map_header` and the `plen` array.
In addition, 64-bit physical offsets can be applied with new encoded
extent format to match full 48-bit block addressing.
Remove redundant comments around `struct z_erofs_lcluster_index` too.
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Acked-by: Chao Yu <chao@kernel.org>
Link: https://lore.kernel.org/r/20250310095459.2620647-8-hsiangkao@linux.alibaba.com
- Rename erofs_init_managed_cache() to z_erofs_init_super();
- Move the initialization of managed_pslots into z_erofs_init_super() too;
- Move z_erofs_init_super() and packed inode preparation upwards, before
the root inode initialization.
Therefore, the root directory can also be compressible.
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Acked-by: Chao Yu <chao@kernel.org>
Link: https://lore.kernel.org/r/20250317054840.3483000-1-hsiangkao@linux.alibaba.com
There's no need to record "." dirents in the directory data (while
they could be used for sanity checks, they aren't very useful.)
Omitting "." dirents also improves directory data deduplication.
Use a per-inode (instead of per-sb) flag to indicate if the "." dirent
is omitted or not, ensuring compatibility with incremental builds. It
also reuses EROFS_I_NLINK_1_BIT, as it has very limited use cases for
directories with `nlink = 1`.
Emit the "." entry as the last virtual dirent in the directory because
it is _much_ less frequently used than the ".." dirent. It also keeps
`f_pos` meaningful, as it strictly follows the directory data when it's
less than i_size.
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Acked-by: Chao Yu <chao@kernel.org>
Link: https://lore.kernel.org/r/20250310095459.2620647-6-hsiangkao@linux.alibaba.com
It adapts the on-disk changes from the previous commit. It also
supports EROFS_NULL_ADDR (all 1's) for EROFS_INODE_FLAT_PLAIN inodes
to indicate 0-filled inodes, as it's common for composefs use cases.
As a result, EROFS_INODE_CHUNK_BASED is no longer needed.
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Acked-by: Chao Yu <chao@kernel.org>
Link: https://lore.kernel.org/r/20250310095459.2620647-5-hsiangkao@linux.alibaba.com
The current 32-bit block addressing limits EROFS to a 16TiB maximum
volume size with 4KiB blocks. However, several new use cases now
require larger capacity support:
- Massive datasets for model training in order to boost random
sampling performance for each epoch;
- Object storage clients using EROFS direct passthrough.
This extends core on-disk structures to support 48-bit block addressing,
such as inodes, device slots, and inode chunks.
Additionally:
- Expand superblock root NID to 8-byte `rootnid_8b` to enable full
out-of-place update incremental builds;
- Introduce `epoch` field in the superblock as well as add `mtime`
field to 32-byte compact inodes for basic timestamp support.
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Acked-by: Chao Yu <chao@kernel.org>
Link: https://lore.kernel.org/r/20250310095459.2620647-4-hsiangkao@linux.alibaba.com
Use `z_idata_size != 0` to indicate that ztailpacking is enabled.
`Z_EROFS_ADVISE_INLINE_PCLUSTER` cannot be ignored, as `h_idata_size`
could be non-zero prior to erofs-utils 1.6 [1].
Additionally, merge `z_idataoff` and `z_fragmentoff` since these two
features are mutually exclusive for a given inode.
[1] https://git.kernel.org/xiang/erofs-utils/c/547bea3cb71a
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20250225114038.3259726-1-hsiangkao@linux.alibaba.com
Actually, volume name doesn't need to include the NIL terminator if
the string length matches the on-disk field size as mentioned in [1].
I tend to relax it together with the upcoming 48-bit block addressing
(or stable kernels which backport this fix) so that we could have a
chance to record a 16-byte volume name like ext4.
Since in-memory `volume_name` has no user, just get rid of the unneeded
check for now. `sbi->uuid` is useless and avoid it too.
[1] https://lore.kernel.org/r/96efe46b-dcce-4490-bba1-a0b00932d1cc@linux.alibaba.com
Fixes: a64d9493f5 ("staging: erofs: refuse to mount images with malformed volume name")
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20250225033934.2542635-1-hsiangkao@linux.alibaba.com
Since EROFS_KMAP_ATOMIC is no longer valid, get rid of erofs_kmap_type too.
Signed-off-by: Bo Liu <liubo03@inspur.com>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Link: https://lore.kernel.org/r/20250217093141.2659-1-liubo03@inspur.com
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
For many use cases (e.g. container images are just fetched from remote),
performance will be impacted if underlay page cache is up-to-date but
direct i/o flushes dirty pages first.
Instead, let's use buffered I/O by default to keep in sync with loop
devices and add a (re)mount option to explicitly give a try to use
direct I/O if supported by the underlying files.
The container startup time is improved as below:
[workload] docker.io/library/workpress:latest
unpack 1st run non-1st runs
EROFS snapshotter buffered I/O file 4.586404265s 0.308s 0.198s
EROFS snapshotter direct I/O file 4.581742849s 2.238s 0.222s
EROFS snapshotter loop 4.596023152s 0.346s 0.201s
Overlayfs snapshotter 5.382851037s 0.206s 0.214s
Fixes: fb17675026 ("erofs: add file-backed mount support")
Cc: Derek McGowan <derek@mcg.dev>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20241212134336.2059899-1-hsiangkao@linux.alibaba.com
Record `m_sb` and `m_dif` to replace `m_fscache`, `m_daxdev`, `m_fp`
and `m_dax_part_off` in order to simplify the codebase.
Note that `m_bdev` is still left since it can be assigned from
`sb->s_bdev` directly.
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20241212235401.2857246-1-hsiangkao@linux.alibaba.com
Instead of just listing each one directly in `struct erofs_sb_info`
except that we still use `sb->s_bdev` for the primary block device.
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20241216125310.930933-2-hsiangkao@linux.alibaba.com
After commit 927e5010ff ("erofs: use kmap_local_page() only for
erofs_bread()"), `buf->kmap_type` actually has no use at all.
Let's get rid of `buf->kmap_type` now.
Suggested-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20241114095813.839866-1-hsiangkao@linux.alibaba.com
Add a sysfs node to drop compression-related caches, currently used to
drop in-memory pclusters and cached compressed folios.
Signed-off-by: Chunhai Guo <guochunhai@vivo.com>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20241113041148.749129-1-guochunhai@vivo.com
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
`struct erofs_workgroup` was introduced to provide a unique header
for all physically indexed objects. However, after big pclusters and
shared pclusters are implemented upstream, it seems that all EROFS
encoded data (which requires transformation) can be represented with
`struct z_erofs_pcluster` directly.
Move all members into `struct z_erofs_pcluster` for simplicity.
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20241021035323.3280682-3-hsiangkao@linux.alibaba.com
Use pseudo bios just like the previous fscache approach since
merged bio_vecs can be filled properly with unique interfaces.
Reviewed-by: Sandeep Dhavale <dhavale@google.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20240830032840.3783206-3-hsiangkao@linux.alibaba.com
Since EROFS only needs to handle read requests in simple contexts,
Just directly use vfs_iocb_iter_read() for data I/Os.
Reviewed-by: Sandeep Dhavale <dhavale@google.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20240905093031.2745929-1-hsiangkao@linux.alibaba.com
It actually has been around for years: For containers and other sandbox
use cases, there will be thousands (and even more) of authenticated
(sub)images running on the same host, unlike OS images.
Of course, all scenarios can use the same EROFS on-disk format, but
bdev-backed mounts just work well for OS images since golden data is
dumped into real block devices. However, it's somewhat hard for
container runtimes to manage and isolate so many unnecessary virtual
block devices safely and efficiently [1]: they just look like a burden
to orchestrators and file-backed mounts are preferred indeed. There
were already enough attempts such as Incremental FS, the original
ComposeFS and PuzzleFS acting in the same way for immutable fses. As
for current EROFS users, ComposeFS, containerd and Android APEXs will
be directly benefited from it.
On the other hand, previous experimental feature "erofs over fscache"
was once also intended to provide a similar solution (inspired by
Incremental FS discussion [2]), but the following facts show file-backed
mounts will be a better approach:
- Fscache infrastructure has recently been moved into new Netfslib
which is an unexpected dependency to EROFS really, although it
originally claims "it could be used for caching other things such as
ISO9660 filesystems too." [3]
- It takes an unexpectedly long time to upstream Fscache/Cachefiles
enhancements. For example, the failover feature took more than
one year, and the deamonless feature is still far behind now;
- Ongoing HSM "fanotify pre-content hooks" [4] together with this will
perfectly supersede "erofs over fscache" in a simpler way since
developers (mainly containerd folks) could leverage their existing
caching mechanism entirely in userspace instead of strictly following
the predefined in-kernel caching tree hierarchy.
After "fanotify pre-content hooks" lands upstream to provide the same
functionality, "erofs over fscache" will be removed then (as an EROFS
internal improvement and EROFS will not have to bother with on-demand
fetching and/or caching improvements anymore.)
[1] https://github.com/containers/storage/pull/2039
[2] https://lore.kernel.org/r/CAOQ4uxjbVxnubaPjVaGYiSwoGDTdpWbB=w_AeM6YM=zVixsUfQ@mail.gmail.com
[3] https://docs.kernel.org/filesystems/caching/fscache.html
[4] https://lore.kernel.org/r/cover.1723670362.git.josef@toxicpanda.com
Closes: https://github.com/containers/composefs/issues/144
Reviewed-by: Sandeep Dhavale <dhavale@google.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20240830032840.3783206-1-hsiangkao@linux.alibaba.com
- Use i_size instead of i_size_read() due to immutable fses;
- Get rid of an unneeded goto since erofs_fill_dentries() also works;
- Remove unnecessary lines.
Signed-off-by: Hongzhen Luo <hongzhen@linux.alibaba.com>
Link: https://lore.kernel.org/r/20240801112622.2164029-1-hongzhen@linux.alibaba.com
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
just lift the call of erofs_pos() into the callers; it will
collapse in most of them, but that's better done caller-by-caller.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Link: https://lore.kernel.org/r/20240425195846.GC1031757@ZenIV
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Al Viro has a series of "->bd_inode elimination" which touches several
subsystems, but he also has EROFS-specific further cleanup patches
which I tend to go with EROFS tree for more testing.
Let's merge "#misc.erofs" as Al suggested in the previous email [1]:
"#misc.erofs (the first two commits) is put into never-rebased mode;
you pull it into your tree and do whatever's convenient with the rest.
I merge the same branch into block_device work; that way it doesn't
cause conflicts whatever else happens in our trees."
[1] https://lore.kernel.org/r/20240503041542.GV2118490@ZenIV
Signed-off-by: Gao Xiang <xiang@kernel.org>
Add Zstandard compression as the 4th supported algorithm since it
becomes more popular now and some end users have asked this for
quite a while [1][2].
Each EROFS physical cluster contains only one valid standard
Zstandard frame as described in [3] so that decompression can be
performed on a per-pcluster basis independently.
Currently, it just leverages multi-call stream decompression APIs with
internal sliding window buffers. One-shot or bufferless decompression
could be implemented later for even better performance if needed.
[1] https://github.com/erofs/erofs-utils/issues/6
[2] https://lore.kernel.org/r/Y08h+z6CZdnS1XBm@B-P7TQMD6M-0146.lan
[3] https://www.rfc-editor.org/rfc/rfc8478.txt
Acked-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20240508234453.17896-1-xiang@kernel.org
This adds a special global buffer pool (in the end) for reserved pages.
Using a reserved pool for LZ4 decompression significantly reduces the
time spent on extra temporary page allocation for the extreme cases in
low memory scenarios.
The table below shows the reduction in time spent on page allocation for
LZ4 decompression when using a reserved pool. The results were obtained
from multi-app launch benchmarks on ARM64 Android devices running the
5.15 kernel with an 8-core CPU and 8GB of memory. In the benchmark, we
launched 16 frequently-used apps, and the camera app was the last one in
each round. The data in the table is the average time of camera app for
each round.
After using the reserved pool, there was an average improvement of 150ms
in the overall launch time of our camera app, which was obtained from
the systrace log.
+--------------+---------------+--------------+---------+
| | w/o page pool | w/ page pool | diff |
+--------------+---------------+--------------+---------+
| Average (ms) | 3434 | 21 | -99.38% |
+--------------+---------------+--------------+---------+
Based on the benchmark logs, 64 pages are sufficient for 95% of
scenarios. This value can be adjusted with a module parameter
`reserved_pages`. The default value is 0.
This pool is currently only used for the LZ4 decompressor, but it can be
applied to more decompressors if needed.
Signed-off-by: Chunhai Guo <guochunhai@vivo.com>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20240402131523.2703948-1-guochunhai@vivo.com
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>