Commit Graph

1002 Commits

Author SHA1 Message Date
Linus Torvalds
d46dd0d883 f2fs-for-7.1-rc1
In this round, the changes primarily focus on resolving race conditions,
 memory safety issues (UAF), and improving the robustness of garbage
 collection (GC), and folio management.
 
 Enhancement:
  - add page-order information for large folio reads in iostat
  - add defrag_blocks sysfs node
 
 Bug fix:
  - fix uninitialized kobject put in f2fs_init_sysfs()
  - disallow setting an extension to both cold and hot
  - fix node_cnt race between extent node destroy and writeback
  - fix to preserve previous reserve_{blocks,node} value when remount
  - fix to freeze GC and discard threads quickly
  - fix false alarm of lockdep on cp_global_sem lock
  - fix data loss caused by incorrect use of nat_entry flag
  - fix to skip empty sections in f2fs_get_victim
  - fix inline data not being written to disk in writeback path
  - fix fsck inconsistency caused by FGGC of node block
  - fix fsck inconsistency caused by incorrect nat_entry flag usage
  - call f2fs_handle_critical_error() to set cp_error flag
  - fix fiemap boundary handling when read extent cache is incomplete
  - fix use-after-free of sbi in f2fs_compress_write_end_io()
  - fix UAF caused by decrementing sbi->nr_pages[] in f2fs_write_end_io()
  - fix incorrect file address mapping when inline inode is unwritten
  - fix incomplete search range in f2fs_get_victim when f2fs_need_rand_seg is enabled
  - fix to avoid memory leak in f2fs_rename()
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE00UqedjCtOrGVvQiQBSofoJIUNIFAmnn7+kACgkQQBSofoJI
 UNIknA//ScYLuOhOmJJNBfmkEoUe5es04YRRq1OOBAvOCGw+Z/qg9unel9Qpneqg
 0xQ35rLKL6q7Y592ZOgWyipFTGhDBEbdJNP6eI9avBURoj9sFjDhFlmkVuUhjsns
 IgOSVgWSWqijWZOcBQbJGEm+N/W81Ktee1RUIDkcti66/uYIS+roTLDLbIyEhvkT
 DhsmUnYwoMy9cB5ag9rZuSWvEa8TI7UbelH78Oi/TqRYJu6ax+D99s6PzOFBH1EE
 FwNGoEMn3r1+2gqPVzDmtrz7A/cYtHVigaUT9d8/n2yygZhGaQ8whd0QoIlikgcW
 9n7Ymo3sns/yLEJURFqkB6Q5yFcZ30jRJZJb5CMNeqtuHQFoLjtcpEWqiQKGzzKY
 uUATMoG7F3QSn8AOVt6GaxnpvNb/NiVZ1Fsvt1Cgq8hUjxf1v2AhHZnvcK0EDAqa
 PvEYSriB56Qtnt1UfbNqydxSiviDDjtaHDprFIvAyEavDCs2F7gzrHEW7IHzG2XR
 Io9hnaBNUJs065zU8qWHyetIZCjPySnPOkZ42eaMEsDMhDtlC3WDOB3ZkmFnh9u2
 2K/SaIpQInGyP2LGLzNB/khWhDcZ4aGciCd7b5Ul9WkrfZTzrN9XI/F2w7dr0R6q
 tE6xJThraGk7NjO67xUq/M2KnVAHN5gTPRY9OmEboEdTO+6pC5w=
 =0oeQ
 -----END PGP SIGNATURE-----

Merge tag 'f2fs-for-7.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs

Pull f2fs updates from Jaegeuk Kim:
 "In this round, the changes primarily focus on resolving race
  conditions, memory safety issues (UAF), and improving the robustness
  of garbage collection (GC), and folio management.

  Enhancements:
   - add page-order information for large folio reads in iostat
   - add defrag_blocks sysfs node

  Bug fixes:
   - fix uninitialized kobject put in f2fs_init_sysfs()
   - disallow setting an extension to both cold and hot
   - fix node_cnt race between extent node destroy and writeback
   - preserve previous reserve_{blocks,node} value when remount
   - freeze GC and discard threads quickly
   - fix false alarm of lockdep on cp_global_sem lock
   - fix data loss caused by incorrect use of nat_entry flag
   - skip empty sections in f2fs_get_victim
   - fix inline data not being written to disk in writeback path
   - fix fsck inconsistency caused by FGGC of node block
   - fix fsck inconsistency caused by incorrect nat_entry flag usage
   - call f2fs_handle_critical_error() to set cp_error flag
   - fix fiemap boundary handling when read extent cache is incomplete
   - fix use-after-free of sbi in f2fs_compress_write_end_io()
   - fix UAF caused by decrementing sbi->nr_pages[] in f2fs_write_end_io()
   - fix incorrect file address mapping when inline inode is unwritten
   - fix incomplete search range in f2fs_get_victim when f2fs_need_rand_seg is enabled
   - avoid memory leak in f2fs_rename()"

* tag 'f2fs-for-7.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (35 commits)
  f2fs: add page-order information for large folio reads in iostat
  f2fs: do not support mmap write for large folio
  f2fs: fix uninitialized kobject put in f2fs_init_sysfs()
  f2fs: protect extension_list reading with sb_lock in f2fs_sbi_show()
  f2fs: disallow setting an extension to both cold and hot
  f2fs: fix node_cnt race between extent node destroy and writeback
  f2fs: allow empty mount string for Opt_usr|grp|projjquota
  f2fs: fix to preserve previous reserve_{blocks,node} value when remount
  f2fs: invalidate block device page cache on umount
  f2fs: fix to freeze GC and discard threads quickly
  f2fs: fix to avoid uninit-value access in f2fs_sanity_check_node_footer
  f2fs: fix false alarm of lockdep on cp_global_sem lock
  f2fs: fix data loss caused by incorrect use of nat_entry flag
  f2fs: fix to skip empty sections in f2fs_get_victim
  f2fs: fix inline data not being written to disk in writeback path
  f2fs: fix fsck inconsistency caused by FGGC of node block
  f2fs: fix fsck inconsistency caused by incorrect nat_entry flag usage
  f2fs: fix to do sanity check on dcc->discard_cmd_cnt conditionally
  f2fs: refactor node footer flag setting related code
  f2fs: refactor f2fs_move_node_folio function
  ...
2026-04-21 14:50:04 -07:00
Daniel Lee
cb8ff3ead9 f2fs: add page-order information for large folio reads in iostat
Track read folio counts by order in F2FS iostat sysfs and tracepoints.

Signed-off-by: Daniel Lee <chullee@google.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2026-04-18 22:44:42 +00:00
Linus Torvalds
334fbe734e mm.git review status for linus..mm-stable
Everything:
 
 Total patches:       368
 Reviews/patch:       1.56
 Reviewed rate:       74%
 
 Excluding DAMON:
 
 Total patches:       316
 Reviews/patch:       1.77
 Reviewed rate:       81%
 
 Excluding DAMON and zram:
 
 Total patches:       306
 Reviews/patch:       1.81
 Reviewed rate:       82%
 
 Excluding DAMON, zram and maple_tree:
 
 Total patches:       276
 Reviews/patch:       2.01
 Reviewed rate:       91%
 
 Significant patch series in this merge:
 
 - The 30 patch series "maple_tree: Replace big node with maple copy"
   from Liam Howlett is mainly prepararatory work for ongoing development
   but it does reduce stack usage and is an improvement.
 
 - The 12 patch series "mm, swap: swap table phase III: remove swap_map"
   from Kairui Song offers memory savings by removing the static swap_map.
   It also yields some CPU savings and implements several cleanups.
 
 - The 2 patch series "mm: memfd_luo: preserve file seals" from Pratyush
   Yadav adds file seal preservation to LUO's memfd code.
 
 - The 2 patch series "mm: zswap: add per-memcg stat for incompressible
   pages" from Jiayuan Chen adds additional userspace stats reportng to
   zswap.
 
 - The 4 patch series "arch, mm: consolidate empty_zero_page" from Mike
   Rapoport implements some cleanups for our handling of ZERO_PAGE() and
   zero_pfn.
 
 - The 2 patch series "mm/kmemleak: Improve scan_should_stop()
   implementation" from Zhongqiu Han provides an robustness improvement and
   some cleanups in the kmemleak code.
 
 - The 4 patch series "Improve khugepaged scan logic" from Vernon Yang
   "improves the khugepaged scan logic and reduces CPU consumption by
   prioritizing scanning tasks that access memory frequently".
 
 - The 2 patch series "Make KHO Stateless" from Jason Miu simplifies
   Kexec Handover by "transitioning KHO from an xarray-based metadata
   tracking system with serialization to a radix tree data structure that
   can be passed directly to the next kernel"
 
 - The 3 patch series "mm: vmscan: add PID and cgroup ID to vmscan
   tracepoints" from Thomas Ballasi and Steven Rostedt enhances vmscan's
   tracepointing.
 
 - The 5 patch series "mm: arch/shstk: Common shadow stack mapping helper
   and VM_NOHUGEPAGE" from Catalin Marinas is a cleanup for the shadow
   stack code: remove per-arch code in favour of a generic implementation.
 
 - The 2 patch series "Fix KASAN support for KHO restored vmalloc
   regions" from Pasha Tatashin fixes a WARN() which can be emitted the KHO
   restores a vmalloc area.
 
 - The 4 patch series "mm: Remove stray references to pagevec" from Tal
   Zussman provides several cleanups, mainly udpating references to "struct
   pagevec", which became folio_batch three years ago.
 
 - The 17 patch series "mm: Eliminate fake head pages from vmemmap
   optimization" from Kiryl Shutsemau simplifies the HugeTLB vmemmap
   optimization (HVO) by changing how tail pages encode their relationship
   to the head page.
 
 - The 2 patch series "mm/damon/core: improve DAMOS quota efficiency for
   core layer filters" from SeongJae Park improves two problematic
   behaviors of DAMOS that makes it less efficient when core layer filters
   are used.
 
 - The 3 patch series "mm/damon: strictly respect min_nr_regions" from
   SeongJae Park improves DAMON usability by extending the treatment of the
   min_nr_regions user-settable parameter.
 
 - The 3 patch series "mm/page_alloc: pcp locking cleanup" from Vlastimil
   Babka is a proper fix for a previously hotfixed SMP=n issue.  Code
   simplifications and cleanups ennsed.
 
 - The 16 patch series "mm: cleanups around unmapping / zapping" from
   David Hildenbrand implements "a bunch of cleanups around unmapping and
   zapping.  Mostly simplifications, code movements, documentation and
   renaming of zapping functions".
 
 - The 6 patch series "support batched checking of the young flag for
   MGLRU" from Baolin Wang supports batched checking of the young flag for
   MGLRU.  It's part cleanups; one benchmark shows large performance
   benefits for arm64.
 
 - The 5 patch series "memcg: obj stock and slab stat caching cleanups"
   from Johannes Weiner provides memcg cleanup and robustness improvements.
 
 - The 5 patch series "Allow order zero pages in page reporting" from
   Yuvraj Sakshith enhances page_reporting's free page reporting - it is
   presently and undesirably order-0 pages when reporting free memory.
 
 - The 6 patch series "mm: vma flag tweaks" from Lorenzo Stoakes is
   cleanup work following from the recent conversion of the VMA flags to a
   bitmap.
 
 - The 10 patch series "mm/damon: add optional debugging-purpose sanity
   checks" from SeongJae Park adds some more developer-facing debug checks
   into DAMON core.
 
 - The 2 patch series "mm/damon: test and document power-of-2
   min_region_sz requirement" from SeongJae Park adds an additional DAMON
   kunit test and makes some adjustments to the addr_unit parameter
   handling.
 
 - The 3 patch series "mm/damon/core: make passed_sample_intervals
   comparisons overflow-safe" from SeongJae Park fixes a hard-to-hit time
   overflow issue in DAMON core.
 
 - The 7 patch series "mm/damon: improve/fixup/update ratio calculation,
   test and documentation" from SeongJae Park is a "batch of misc/minor
   improvements and fixups" for DAMON.
 
 - The 4 patch series "mm: move vma_(kernel|mmu)_pagesize() out of
   hugetlb.c" from David Hildenbrand fixes a possible issue with dax-device
   when CONFIG_HUGETLB=n.  Some code movement was required.
 
 - The 6 patch series "zram: recompression cleanups and tweaks" from
   Sergey Senozhatsky provides "a somewhat random mix of fixups,
   recompression cleanups and improvements" in the zram code.
 
 - The 11 patch series "mm/damon: support multiple goal-based quota
   tuning algorithms" from SeongJae Park extend DAMOS quotas goal
   auto-tuning to support multiple tuning algorithms that users can select.
 
 - The 4 patch series "mm: thp: reduce unnecessary
   start_stop_khugepaged()" from Breno Leitao fixes the khugpaged sysfs
   handling so we no longer spam the logs with reams of junk when
   starting/stopping khugepaged.
 
 - The 3 patch series "mm: improve map count checks" from Lorenzo Stoakes
   provides some cleanups and slight fixes in the mremap, mmap and vma
   code.
 
 - The 5 patch series "mm/damon: support addr_unit on default monitoring
   targets for modules" from SeongJae Park extends the use of DAMON core's
   addr_unit tunable.
 
 - The 5 patch series "mm: khugepaged cleanups and mTHP prerequisites"
   from Nico Pache provides cleanups in the khugepaged and is a base for
   Nico's planned khugepaged mTHP support.
 
 - The 15 patch series "mm: memory hot(un)plug and SPARSEMEM cleanups"
   from David Hildenbrand implements code movement and cleanups in the
   memhotplug and sparsemem code.
 
 - The 2 patch series "mm: remove CONFIG_ARCH_ENABLE_MEMORY_HOTREMOVE and
   cleanup CONFIG_MIGRATION" from David Hildenbrand rationalizes some
   memhotplug Kconfig support.
 
 - The 6 patch series "change young flag check functions to return bool"
   from Baolin Wang is "a cleanup patchset to change all young flag check
   functions to return bool".
 
 - The 3 patch series "mm/damon/sysfs: fix memory leak and NULL
   dereference issues" from Josh Law and SeongJae Park fixes a few
   potential DAMON bugs.
 
 - The 25 patch series "mm/vma: convert vm_flags_t to vma_flags_t in vma
   code" from "converts a lot of the existing use of the legacy vm_flags_t
   data type to the new vma_flags_t type which replaces it".  Mainly in the
   vma code.
 
 - The 21 patch series "mm: expand mmap_prepare functionality and usage"
   from Lorenzo Stoakes "expands the mmap_prepare functionality, which is
   intended to replace the deprecated f_op->mmap hook which has been the
   source of bugs and security issues for some time".  Cleanups,
   documentation, extension of mmap_prepare into filesystem drivers.
 
 - The 13 patch series "mm/huge_memory: refactor zap_huge_pmd()" from
   Lorenzo Stoakes simplifies and cleans up zap_huge_pmd().  Additional
   cleanups around vm_normal_folio_pmd() and the softleaf functionality are
   performed.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCad3HDQAKCRDdBJ7gKXxA
 jrUQAPwNhPk5nPSxnyxjAeQtOBHqgCdnICeEismLajPKd9aYRgEA0s2XAu3tSUYi
 GrBnWImHG3s4ePQxVcPCegWTsOUrXgQ=
 =1Q7o
 -----END PGP SIGNATURE-----

Merge tag 'mm-stable-2026-04-13-21-45' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull MM updates from Andrew Morton:

 - "maple_tree: Replace big node with maple copy" (Liam Howlett)

   Mainly prepararatory work for ongoing development but it does reduce
   stack usage and is an improvement.

 - "mm, swap: swap table phase III: remove swap_map" (Kairui Song)

   Offers memory savings by removing the static swap_map. It also yields
   some CPU savings and implements several cleanups.

 - "mm: memfd_luo: preserve file seals" (Pratyush Yadav)

   File seal preservation to LUO's memfd code

 - "mm: zswap: add per-memcg stat for incompressible pages" (Jiayuan
   Chen)

   Additional userspace stats reportng to zswap

 - "arch, mm: consolidate empty_zero_page" (Mike Rapoport)

   Some cleanups for our handling of ZERO_PAGE() and zero_pfn

 - "mm/kmemleak: Improve scan_should_stop() implementation" (Zhongqiu
   Han)

   A robustness improvement and some cleanups in the kmemleak code

 - "Improve khugepaged scan logic" (Vernon Yang)

   Improve khugepaged scan logic and reduce CPU consumption by
   prioritizing scanning tasks that access memory frequently

 - "Make KHO Stateless" (Jason Miu)

   Simplify Kexec Handover by transitioning KHO from an xarray-based
   metadata tracking system with serialization to a radix tree data
   structure that can be passed directly to the next kernel

 - "mm: vmscan: add PID and cgroup ID to vmscan tracepoints" (Thomas
   Ballasi and Steven Rostedt)

   Enhance vmscan's tracepointing

 - "mm: arch/shstk: Common shadow stack mapping helper and
   VM_NOHUGEPAGE" (Catalin Marinas)

   Cleanup for the shadow stack code: remove per-arch code in favour of
   a generic implementation

 - "Fix KASAN support for KHO restored vmalloc regions" (Pasha Tatashin)

   Fix a WARN() which can be emitted the KHO restores a vmalloc area

 - "mm: Remove stray references to pagevec" (Tal Zussman)

   Several cleanups, mainly udpating references to "struct pagevec",
   which became folio_batch three years ago

 - "mm: Eliminate fake head pages from vmemmap optimization" (Kiryl
   Shutsemau)

   Simplify the HugeTLB vmemmap optimization (HVO) by changing how tail
   pages encode their relationship to the head page

 - "mm/damon/core: improve DAMOS quota efficiency for core layer
   filters" (SeongJae Park)

   Improve two problematic behaviors of DAMOS that makes it less
   efficient when core layer filters are used

 - "mm/damon: strictly respect min_nr_regions" (SeongJae Park)

   Improve DAMON usability by extending the treatment of the
   min_nr_regions user-settable parameter

 - "mm/page_alloc: pcp locking cleanup" (Vlastimil Babka)

   The proper fix for a previously hotfixed SMP=n issue. Code
   simplifications and cleanups ensued

 - "mm: cleanups around unmapping / zapping" (David Hildenbrand)

   A bunch of cleanups around unmapping and zapping. Mostly
   simplifications, code movements, documentation and renaming of
   zapping functions

 - "support batched checking of the young flag for MGLRU" (Baolin Wang)

   Batched checking of the young flag for MGLRU. It's part cleanups; one
   benchmark shows large performance benefits for arm64

 - "memcg: obj stock and slab stat caching cleanups" (Johannes Weiner)

   memcg cleanup and robustness improvements

 - "Allow order zero pages in page reporting" (Yuvraj Sakshith)

   Enhance free page reporting - it is presently and undesirably order-0
   pages when reporting free memory.

 - "mm: vma flag tweaks" (Lorenzo Stoakes)

   Cleanup work following from the recent conversion of the VMA flags to
   a bitmap

 - "mm/damon: add optional debugging-purpose sanity checks" (SeongJae
   Park)

   Add some more developer-facing debug checks into DAMON core

 - "mm/damon: test and document power-of-2 min_region_sz requirement"
   (SeongJae Park)

   An additional DAMON kunit test and makes some adjustments to the
   addr_unit parameter handling

 - "mm/damon/core: make passed_sample_intervals comparisons
   overflow-safe" (SeongJae Park)

   Fix a hard-to-hit time overflow issue in DAMON core

 - "mm/damon: improve/fixup/update ratio calculation, test and
   documentation" (SeongJae Park)

   A batch of misc/minor improvements and fixups for DAMON

 - "mm: move vma_(kernel|mmu)_pagesize() out of hugetlb.c" (David
   Hildenbrand)

   Fix a possible issue with dax-device when CONFIG_HUGETLB=n. Some code
   movement was required.

 - "zram: recompression cleanups and tweaks" (Sergey Senozhatsky)

   A somewhat random mix of fixups, recompression cleanups and
   improvements in the zram code

 - "mm/damon: support multiple goal-based quota tuning algorithms"
   (SeongJae Park)

   Extend DAMOS quotas goal auto-tuning to support multiple tuning
   algorithms that users can select

 - "mm: thp: reduce unnecessary start_stop_khugepaged()" (Breno Leitao)

   Fix the khugpaged sysfs handling so we no longer spam the logs with
   reams of junk when starting/stopping khugepaged

 - "mm: improve map count checks" (Lorenzo Stoakes)

   Provide some cleanups and slight fixes in the mremap, mmap and vma
   code

 - "mm/damon: support addr_unit on default monitoring targets for
   modules" (SeongJae Park)

   Extend the use of DAMON core's addr_unit tunable

 - "mm: khugepaged cleanups and mTHP prerequisites" (Nico Pache)

   Cleanups to khugepaged and is a base for Nico's planned khugepaged
   mTHP support

 - "mm: memory hot(un)plug and SPARSEMEM cleanups" (David Hildenbrand)

   Code movement and cleanups in the memhotplug and sparsemem code

 - "mm: remove CONFIG_ARCH_ENABLE_MEMORY_HOTREMOVE and cleanup
   CONFIG_MIGRATION" (David Hildenbrand)

   Rationalize some memhotplug Kconfig support

 - "change young flag check functions to return bool" (Baolin Wang)

   Cleanups to change all young flag check functions to return bool

 - "mm/damon/sysfs: fix memory leak and NULL dereference issues" (Josh
   Law and SeongJae Park)

   Fix a few potential DAMON bugs

 - "mm/vma: convert vm_flags_t to vma_flags_t in vma code" (Lorenzo
   Stoakes)

   Convert a lot of the existing use of the legacy vm_flags_t data type
   to the new vma_flags_t type which replaces it. Mainly in the vma
   code.

 - "mm: expand mmap_prepare functionality and usage" (Lorenzo Stoakes)

   Expand the mmap_prepare functionality, which is intended to replace
   the deprecated f_op->mmap hook which has been the source of bugs and
   security issues for some time. Cleanups, documentation, extension of
   mmap_prepare into filesystem drivers

 - "mm/huge_memory: refactor zap_huge_pmd()" (Lorenzo Stoakes)

   Simplify and clean up zap_huge_pmd(). Additional cleanups around
   vm_normal_folio_pmd() and the softleaf functionality are performed.

* tag 'mm-stable-2026-04-13-21-45' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (369 commits)
  mm: fix deferred split queue races during migration
  mm/khugepaged: fix issue with tracking lock
  mm/huge_memory: add and use has_deposited_pgtable()
  mm/huge_memory: add and use normal_or_softleaf_folio_pmd()
  mm: add softleaf_is_valid_pmd_entry(), pmd_to_softleaf_folio()
  mm/huge_memory: separate out the folio part of zap_huge_pmd()
  mm/huge_memory: use mm instead of tlb->mm
  mm/huge_memory: remove unnecessary sanity checks
  mm/huge_memory: deduplicate zap deposited table call
  mm/huge_memory: remove unnecessary VM_BUG_ON_PAGE()
  mm/huge_memory: add a common exit path to zap_huge_pmd()
  mm/huge_memory: handle buggy PMD entry in zap_huge_pmd()
  mm/huge_memory: have zap_huge_pmd return a boolean, add kdoc
  mm/huge: avoid big else branch in zap_huge_pmd()
  mm/huge_memory: simplify vma_is_specal_huge()
  mm: on remap assert that input range within the proposed VMA
  mm: add mmap_action_map_kernel_pages[_full]()
  uio: replace deprecated mmap hook with mmap_prepare in uio_info
  drivers: hv: vmbus: replace deprecated mmap hook with mmap_prepare
  mm: allow handling of stacked mmap_prepare hooks in more drivers
  ...
2026-04-15 12:59:16 -07:00
Tal Zussman
4e1d77a8f3 folio_batch: rename pagevec.h to folio_batch.h
struct pagevec was removed in commit 1e0877d58b ("mm: remove struct
pagevec").  Rename include/linux/pagevec.h to reflect reality and update
includes tree-wide.  Add the new filename to MAINTAINERS explicitly, as it
no longer matches the "include/linux/page[-_]*" pattern in MEMORY
MANAGEMENT - CORE.

Link: https://lkml.kernel.org/r/20260225-pagevec_cleanup-v2-3-716868cc2d11@columbia.edu
Signed-off-by: Tal Zussman <tz2294@columbia.edu>
Acked-by: David Hildenbrand (Arm) <david@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Zi Yan <ziy@nvidia.com>
Reviewed-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
Cc: Chris Li <chrisl@kernel.org>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-05 13:53:07 -07:00
Chao Yu
7b9161a605 f2fs: fix to avoid uninit-value access in f2fs_sanity_check_node_footer
syzbot reported a f2fs bug as below:

BUG: KMSAN: uninit-value in f2fs_sanity_check_node_footer+0x374/0xa20 fs/f2fs/node.c:1520
 f2fs_sanity_check_node_footer+0x374/0xa20 fs/f2fs/node.c:1520
 f2fs_finish_read_bio+0xe1e/0x1d60 fs/f2fs/data.c:177
 f2fs_read_end_io+0x6ab/0x2220 fs/f2fs/data.c:-1
 bio_endio+0x1006/0x1160 block/bio.c:1792
 submit_bio_noacct+0x533/0x2960 block/blk-core.c:891
 submit_bio+0x57a/0x620 block/blk-core.c:926
 blk_crypto_submit_bio include/linux/blk-crypto.h:203 [inline]
 f2fs_submit_read_bio+0x12c/0x360 fs/f2fs/data.c:557
 f2fs_submit_page_bio+0xee2/0x1450 fs/f2fs/data.c:775
 read_node_folio+0x384/0x4b0 fs/f2fs/node.c:1481
 __get_node_folio+0x5db/0x15d0 fs/f2fs/node.c:1576
 f2fs_get_inode_folio+0x40/0x50 fs/f2fs/node.c:1623
 do_read_inode fs/f2fs/inode.c:425 [inline]
 f2fs_iget+0x1209/0x9380 fs/f2fs/inode.c:596
 f2fs_fill_super+0x8f5a/0xb2e0 fs/f2fs/super.c:5184
 get_tree_bdev_flags+0x6e6/0x920 fs/super.c:1694
 get_tree_bdev+0x38/0x50 fs/super.c:1717
 f2fs_get_tree+0x35/0x40 fs/f2fs/super.c:5436
 vfs_get_tree+0xb3/0x5d0 fs/super.c:1754
 fc_mount fs/namespace.c:1193 [inline]
 do_new_mount_fc fs/namespace.c:3763 [inline]
 do_new_mount+0x885/0x1dd0 fs/namespace.c:3839
 path_mount+0x7a2/0x20b0 fs/namespace.c:4159
 do_mount fs/namespace.c:4172 [inline]
 __do_sys_mount fs/namespace.c:4361 [inline]
 __se_sys_mount+0x704/0x7f0 fs/namespace.c:4338
 __x64_sys_mount+0xe4/0x150 fs/namespace.c:4338
 x64_sys_call+0x39f0/0x3ea0 arch/x86/include/generated/asm/syscalls_64.h:166
 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
 do_syscall_64+0x134/0xf80 arch/x86/entry/syscall_64.c:94
 entry_SYSCALL_64_after_hwframe+0x77/0x7f

The root cause is: in f2fs_finish_read_bio(), we may access uninit data
in folio if we failed to read the data from device into folio, let's add
a check condition to avoid such issue.

Cc: stable@kernel.org
Fixes: 50ac3ecd8e ("f2fs: fix to do sanity check on node footer in {read,write}_end_io")
Reported-by: syzbot+9aac813cdc456cdd49f8@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/linux-f2fs-devel/69a9ca26.a70a0220.305d9a.0000.GAE@google.com
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2026-04-02 16:24:20 +00:00
Yongpeng Yang
95e159ad3e f2fs: fix fiemap boundary handling when read extent cache is incomplete
f2fs_fiemap() calls f2fs_map_blocks() to obtain the block mapping a
file, and then merges contiguous mappings into extents. If the mapping
is found in the read extent cache, node blocks do not need to be read.
However, in the following scenario, a contiguous extent can be split
into two extents:

$ dd if=/dev/zero of=data.128M bs=1M count=128
$ losetup -f data.128M
$ mkfs.f2fs /dev/loop0 -f
$ mount -o mode=lfs /dev/loop0 /mnt/f2fs/
$ cd /mnt/f2fs/
$ dd if=/dev/zero of=data.72M bs=1M count=72 && sync
$ dd if=/dev/zero of=data.4M bs=1M count=4 && sync
$ dd if=/dev/zero of=data.4M bs=1M count=2 seek=2 conv=notrunc && sync
$ echo 3 > /proc/sys/vm/drop_caches
$ dd if=/dev/zero of=data.4M bs=1M count=2 seek=0 conv=notrunc && sync
$ dd if=/dev/zero of=data.4M bs=1M count=2 seek=0 conv=notrunc && sync
$ f2fs_io fiemap 0 1024 data.4M
Fiemap: offset = 0 len = 1024
logical addr.    physical addr.   length           flags
0	0000000000000000 0000000006400000 0000000000200000 00001000
1	0000000000200000 0000000006600000 0000000000200000 00001001

Although the physical addresses of the ranges 0~2MB and 2M~4MB are
contiguous, the mapping for the 2M~4MB range is not present in memory.
When the physical addresses for the 0~2MB range are updated, no merge
happens because the adjacent mapping is missing from the in-memory
cache. As a result, fiemap reports two separate extents instead of a
single contiguous one.

The root cause is that the read extent cache does not guarantee that all
blocks of an extent are present in memory. Therefore, when the extent
length returned by f2fs_map_blocks_cached() is smaller than maxblocks,
the remaining mappings are retrieved via f2fs_get_dnode_of_data() to
ensure correct fiemap extent boundary handling.

Cc: stable@kernel.org
Fixes: cd8fc5226b ("f2fs: remove the create argument to f2fs_map_blocks")
Signed-off-by: Yongpeng Yang <yangyongpeng@xiaomi.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2026-03-24 17:21:00 +00:00
Yongpeng Yang
eb2ca3ca98 f2fs: fix incorrect multidevice info in trace_f2fs_map_blocks()
When f2fs_map_blocks()->f2fs_map_blocks_cached() hits the read extent
cache, map->m_multidev_dio is not updated, which leads to incorrect
multidevice information being reported by trace_f2fs_map_blocks().

This patch updates map->m_multidev_dio in f2fs_map_blocks_cached() when
the read extent cache is hit.

Cc: stable@kernel.org
Fixes: 0094e98bd1 ("f2fs: factor a f2fs_map_blocks_cached helper")
Signed-off-by: Yongpeng Yang <yangyongpeng@xiaomi.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2026-03-24 17:21:00 +00:00
Yongpeng Yang
1eaf7ee2e6 f2fs: drop unused sbi parameter from f2fs_in_warm_node_list()
The sbi parameter in f2fs_in_warm_node_list() is not used. Remove it to
simplify the function.

Signed-off-by: Yongpeng Yang <yangyongpeng@xiaomi.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2026-03-24 17:21:00 +00:00
Yongpeng Yang
2d9c4a4ed4 f2fs: fix UAF caused by decrementing sbi->nr_pages[] in f2fs_write_end_io()
The xfstests case "generic/107" and syzbot have both reported a NULL
pointer dereference.

The concurrent scenario that triggers the panic is as follows:

F2FS_WB_CP_DATA write callback          umount
                                        - f2fs_write_checkpoint
                                         - f2fs_wait_on_all_pages(sbi, F2FS_WB_CP_DATA)
- blk_mq_end_request
 - bio_endio
  - f2fs_write_end_io
   : dec_page_count(sbi, F2FS_WB_CP_DATA)
   : wake_up(&sbi->cp_wait)
                                        - kill_f2fs_super
                                         - kill_block_super
                                          - f2fs_put_super
                                           : iput(sbi->node_inode)
                                           : sbi->node_inode = NULL
   : f2fs_in_warm_node_list
    - is_node_folio // sbi->node_inode is NULL and panic

The root cause is that f2fs_put_super() calls iput(sbi->node_inode) and
sets sbi->node_inode to NULL after sbi->nr_pages[F2FS_WB_CP_DATA] is
decremented to zero. As a result, f2fs_in_warm_node_list() may
dereference a NULL node_inode when checking whether a folio belongs to
the node inode, leading to a panic.

This patch fixes the issue by calling f2fs_in_warm_node_list() before
decrementing sbi->nr_pages[F2FS_WB_CP_DATA], thus preventing the
use-after-free condition.

Cc: stable@kernel.org
Fixes: 50fa53eccf ("f2fs: fix to avoid broken of dnode block list")
Reported-by: syzbot+6e4cb1cac5efc96ea0ca@syzkaller.appspotmail.com
Signed-off-by: Yongpeng Yang <yangyongpeng@xiaomi.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2026-03-24 17:21:00 +00:00
Eric Biggers
d69ee59d38 f2fs: remove unreachable code in f2fs_encrypt_one_page()
Since commit 52e7e0d889 ("fscrypt: Switch to sync_skcipher and
on-stack requests") eliminated the dynamic allocation of crypto
requests, the only remaining dynamic memory allocation done by
fscrypt_encrypt_pagecache_blocks() is the bounce page allocation.

The bounce page is allocated from a mempool.  Mempool allocations with
GFP_NOFS never fail.  Therefore, fscrypt_encrypt_pagecache_blocks() can
no longer return -ENOMEM when passed GFP_NOFS.

Remove the now-unreachable code from f2fs_encrypt_one_page().

Suggested-by: Vlastimil Babka <vbabka@suse.cz>
Link: https://lore.kernel.org/all/d9dc2ee1-283d-4467-ad36-a6a4aa557589@suse.cz/
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
Acked-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2026-03-24 17:20:59 +00:00
Christoph Hellwig
3c7eaa775d fscrypt: pass a byte offset to fscrypt_set_bio_crypt_ctx
Logical offsets into an inode are usually expressed as bytes in the VFS.
Switch fscrypt_set_bio_crypt_ctx to that convention.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20260302141922.370070-9-hch@lst.de
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
2026-03-09 13:31:50 -07:00
Christoph Hellwig
22be86a23c fscrypt: pass a byte offset to fscrypt_mergeable_bio
Logical offsets into an inode are usually expressed as bytes in the VFS.
Switch fscrypt_mergeable_bio to that convention.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20260302141922.370070-8-hch@lst.de
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
2026-03-09 13:31:50 -07:00
Linus Torvalds
3e48a11675 f2fs-for-7.0-rc1
In this development cycle, we focused on several key performance optimizations:
 1) introducing large folio support to enhance read speeds for immutable files,
 2) reducing checkpoint=enable latency by flushing only committed dirty pages,
 and 3) implementing tracepoints to diagnose and resolve lock priority inversion.
 Additionally, we introduced the packed_ssa feature to optimize the SSA footprint
 when utilizing large block sizes.
 
 Enhancement:
  - support large folio for immutable non-compressed case
  - support non-4KB block size without packed_ssa feature
  - optimize f2fs_enable_checkpoint() to avoid long delay
  - optimize f2fs_overwrite_io() for f2fs_iomap_begin
  - optimize NAT block loading during checkpoint write
  - add write latency stats for NAT and SIT blocks in f2fs_write_checkpoint
  - pin files do not require sbi->writepages lock for ordering
  - avoid f2fs_map_blocks() for consecutive holes in readpages
  - flush plug periodically during GC to maximize readahead effect
  - add tracepoints to catch lock overheads
  - add several sysfs entries to tune internal lock priorities
 
 Bug fix:
  - fix lock priority inversion issue
  - fix incomplete block usage in compact SSA summaries
  - fix to show simulate_lock_timeout correctly
  - fix to avoid mapping wrong physical block for swapfile
  - fix IS_CHECKPOINTED flag inconsistency issue caused by concurrent atomic
    commit and checkpoint writes
  - fix to avoid UAF in f2fs_write_end_io()
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE00UqedjCtOrGVvQiQBSofoJIUNIFAmmP3mYACgkQQBSofoJI
 UNIvMA//c0vFSIB2Gsfjt5rk2kxDSeuxQHDetKNPR/xzz/tRJHw6F0y+3oFPbQDa
 bI62/DbhHPiCienq07l1LZQd44pYgheQEYmYtf6A2wGduh+S1Cy1uYZRmKJwtcfv
 t8gZoFIle4rufz5GlWoY6L70jhSJmpLPYLItltL7mxgJL1cR7Ea3L+fOAmSp9YYT
 mo0zT3jTaYSbCqad9Cgoa6GU/HwrvimiGPRFBVsxkZItRSIY22CTA0DmnXkG2iys
 GgcNKR1qMcy44rrt4oLXrlffmqLQXtLn4F62K79or0PMby34pGEZldxr+sWDxr0p
 /1lFwwnnAFZiJ/z9TLjND5z3KmZtF0ng98QWqj0uoTYLyCAzgqDkvrStBz6pJjjb
 oA/0XOWPLAxIMbB3xipeICJTzFauR6Pg69e0A0oDvB2CfkHuSuUbhU47HPWNfi2n
 ASL1jcFVtF6mZr7iV23W2vFWqWz6ZKDi2ZTphaRu9UXrMkyB3OYxNDumbJCwbd8c
 pb6xf8UoXG2MDHwJPKRbSuznPTCbM2ZohoTgDmED8YcTdxc+CE3FVDGNdObZWU8w
 guA1HJQxScXPPPUHcTybXN4qOjO/ppJBRkoq2tBzd4iLr4V+gQNTTOmtK+wfuLsM
 LSK0mQiGj1VPJD950NXwervibgaxnv85iLLgVYccc4N+E8aoLbQ=
 =Ugvd
 -----END PGP SIGNATURE-----

Merge tag 'f2fs-for-7.0-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs

Pull f2fs updates from Jaegeuk Kim:
 "In this development cycle, we focused on several key performance
  optimizations:

   - introducing large folio support to enhance read speeds for
     immutable files

   - reducing checkpoint=enable latency by flushing only committed dirty
     pages

   - implementing tracepoints to diagnose and resolve lock priority
     inversion.

  Additionally, we introduced the packed_ssa feature to optimize the SSA
  footprint when utilizing large block sizes.

  Detail summary:

  Enhancements:
   - support large folio for immutable non-compressed case
   - support non-4KB block size without packed_ssa feature
   - optimize f2fs_enable_checkpoint() to avoid long delay
   - optimize f2fs_overwrite_io() for f2fs_iomap_begin
   - optimize NAT block loading during checkpoint write
   - add write latency stats for NAT and SIT blocks in
     f2fs_write_checkpoint
   - pin files do not require sbi->writepages lock for ordering
   - avoid f2fs_map_blocks() for consecutive holes in readpages
   - flush plug periodically during GC to maximize readahead effect
   - add tracepoints to catch lock overheads
   - add several sysfs entries to tune internal lock priorities

  Fixes:
   - fix lock priority inversion issue
   - fix incomplete block usage in compact SSA summaries
   - fix to show simulate_lock_timeout correctly
   - fix to avoid mapping wrong physical block for swapfile
   - fix IS_CHECKPOINTED flag inconsistency issue caused by
     concurrent atomic commit and checkpoint writes
   - fix to avoid UAF in f2fs_write_end_io()"

* tag 'f2fs-for-7.0-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (61 commits)
  f2fs: sysfs: introduce critical_task_priority
  f2fs: introduce trace_f2fs_priority_update
  f2fs: fix lock priority inversion issue
  f2fs: optimize f2fs_overwrite_io() for f2fs_iomap_begin
  f2fs: fix incomplete block usage in compact SSA summaries
  f2fs: decrease maximum flush retry count in f2fs_enable_checkpoint()
  f2fs: optimize NAT block loading during checkpoint write
  f2fs: change size parameter of __has_cursum_space() to unsigned int
  f2fs: add write latency stats for NAT and SIT blocks in f2fs_write_checkpoint
  f2fs: pin files do not require sbi->writepages lock for ordering
  f2fs: fix to show simulate_lock_timeout correctly
  f2fs: introduce FAULT_SKIP_WRITE
  f2fs: check skipped write in f2fs_enable_checkpoint()
  Revert "f2fs: add timeout in f2fs_enable_checkpoint()"
  f2fs: fix to unlock folio in f2fs_read_data_large_folio()
  f2fs: fix error path handling in f2fs_read_data_large_folio()
  f2fs: use folio_end_read
  f2fs: fix to avoid mapping wrong physical block for swapfile
  f2fs: avoid f2fs_map_blocks() for consecutive holes in readpages
  f2fs: advance index and offset after zeroing in large folio read
  ...
2026-02-14 09:48:10 -08:00
Linus Torvalds
997f9640c9 fsverity updates for 7.0
fsverity cleanups, speedup, and memory usage optimization from
 Christoph Hellwig:
 
 - Move some logic into common code
 
 - Fix btrfs to reject truncates of fsverity files
 
 - Improve the readahead implementation
 
 - Store each inode's fsverity_info in a hash table instead of using a
   pointer in the filesystem-specific part of the inode.
 
   This optimizes for memory usage in the usual case where most files
   don't have fsverity enabled.
 
 - Look up the fsverity_info fewer times during verification, to
   amortize the hash table overhead
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQSacvsUNc7UX4ntmEPzXCl4vpKOKwUCaY0nZhQcZWJpZ2dlcnNA
 a2VybmVsLm9yZwAKCRDzXCl4vpKOK/AVAP9wSLEYsG3dqnNIHjIvLeK+9NC3Ni4d
 m+fvT1JfuideOwEA9r2EfztusLU5iyqWJlHyxekibXItUDgYGltaYb7eXAU=
 =a+To
 -----END PGP SIGNATURE-----

Merge tag 'fsverity-for-linus' of git://git.kernel.org/pub/scm/fs/fsverity/linux

Pull fsverity updates from Eric Biggers:
 "fsverity cleanups, speedup, and memory usage optimization from
  Christoph Hellwig:

   - Move some logic into common code

   - Fix btrfs to reject truncates of fsverity files

   - Improve the readahead implementation

   - Store each inode's fsverity_info in a hash table instead of using a
     pointer in the filesystem-specific part of the inode.

     This optimizes for memory usage in the usual case where most files
     don't have fsverity enabled.

   - Look up the fsverity_info fewer times during verification, to
     amortize the hash table overhead"

* tag 'fsverity-for-linus' of git://git.kernel.org/pub/scm/fs/fsverity/linux:
  fsverity: remove inode from fsverity_verification_ctx
  fsverity: use a hashtable to find the fsverity_info
  btrfs: consolidate fsverity_info lookup
  f2fs: consolidate fsverity_info lookup
  ext4: consolidate fsverity_info lookup
  fs: consolidate fsverity_info lookup in buffer.c
  fsverity: push out fsverity_info lookup
  fsverity: deconstify the inode pointer in struct fsverity_info
  fsverity: kick off hash readahead at data I/O submission time
  ext4: move ->read_folio and ->readahead to readpage.c
  readahead: push invalidate_lock out of page_cache_ra_unbounded
  fsverity: don't issue readahead for non-ENOENT errors from __filemap_get_folio
  fsverity: start consolidating pagecache code
  fsverity: pass struct file to ->write_merkle_tree_block
  f2fs: don't build the fsverity work handler for !CONFIG_FS_VERITY
  ext4: don't build the fsverity work handler for !CONFIG_FS_VERITY
  fs,fsverity: clear out fsverity_info from common code
  fs,fsverity: reject size changes on fsverity files in setattr_prepare
2026-02-12 10:41:34 -08:00
Christoph Hellwig
45dcb3ac98 f2fs: consolidate fsverity_info lookup
Look up the fsverity_info once in f2fs_mpage_readpages, and then use it
for the readahead, local verification of holes and pass it along to the
I/O completion workqueue in struct bio_post_read_ctx.  Do the same
thing in f2fs_get_read_data_folio for reads that come from garbage
collection and other background activities.

This amortizes the lookup better once it becomes less efficient.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20260202060754.270269-10-hch@lst.de
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
2026-02-04 11:31:54 -08:00
Christoph Hellwig
47bc2ac9b6 fsverity: push out fsverity_info lookup
Pass a struct fsverity_info to the verification and readahead helpers,
and push the lookup into the callers.  Right now this is a very dumb
almost mechanic move that open codes a lot of fsverity_info_addr() calls
in the file systems.  The subsequent patches will clean this up.

This prepares for reducing the number of fsverity_info lookups, which
will allow to amortize them better when using a more expensive lookup
method.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Acked-by: David Sterba <dsterba@suse.com> # btrfs
Link: https://lore.kernel.org/r/20260202060754.270269-7-hch@lst.de
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
2026-02-02 17:15:26 -08:00
Christoph Hellwig
f1a6cf44b3 fsverity: kick off hash readahead at data I/O submission time
Currently all reads of the fsverity hashes are kicked off from the data
I/O completion handler, leading to needlessly dependent I/O.  This is
worked around a bit by performing readahead on the level 0 nodes, but
still fairly ineffective.

Switch to a model where the ->read_folio and ->readahead methods instead
kick off explicit readahead of the fsverity hashed so they are usually
available at I/O completion time.

For 64k sequential reads on my test VM this improves read performance
from 2.4GB/s - 2.6GB/s to 3.5GB/s - 3.9GB/s.  The improvements for
random reads are likely to be even bigger.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: David Sterba <dsterba@suse.com> # btrfs
Link: https://lore.kernel.org/r/20260202060754.270269-5-hch@lst.de
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
2026-02-02 17:15:26 -08:00
Yeongjin Gil
d860974a7e f2fs: optimize f2fs_overwrite_io() for f2fs_iomap_begin
When overwriting already allocated blocks, f2fs_iomap_begin() calls
f2fs_overwrite_io() to check block mappings. However,
f2fs_overwrite_io() iterates through all mapped blocks in the range,
which can be inefficient for fragmented files with large I/O requests.

This patch optimizes f2fs_overwrite_io() by adding a 'check_first'
parameter and introducing __f2fs_overwrite_io() helper. When called from
f2fs_iomap_begin(), we only check the first mapping to determine if the
range is already allocated, which is sufficient for setting
map.m_may_create.

This optimization significantly reduces the number of f2fs_map_blocks()
calls in f2fs_overwrite_io() when called from f2fs_iomap_begin(),
especially for fragmented files with large I/O requests.

Cc: stable@kernel.org
Fixes: 351bc76133 ("f2fs: optimize f2fs DIO overwrites")
Reviewed-by: Sungjong Seo <sj1557.seo@samsung.com>
Reviewed-by: Sunmin Jeong <s_min.jeong@samsung.com>
Signed-off-by: Yeongjin Gil <youngjin.gil@samsung.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2026-01-31 03:23:31 +00:00
Yongpeng Yang
be38b5717a f2fs: pin files do not require sbi->writepages lock for ordering
For pinned files, the file mapping is already established before
writing, and since the writes are in IPU, there is no need to acquire
the sbi->writepages lock to guarantee write ordering.

Signed-off-by: Yongpeng Yang <yangyongpeng@xiaomi.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2026-01-27 02:45:58 +00:00
Chao Yu
1120764691 f2fs: introduce FAULT_SKIP_WRITE
In order to simulate skipped write during enable_checkpoint().

Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2026-01-27 02:45:58 +00:00
Chao Yu
ab59919c8a f2fs: check skipped write in f2fs_enable_checkpoint()
This patch introduces sbi->nr_pages[F2FS_SKIPPED_WRITE] to record any
skipped write during data flush in f2fs_enable_checkpoint().

So in the loop of data flush, if there is any skipped write in previous
flush, let's retry sync_inode_sb(), otherwise, all dirty data written
before f2fs_enable_checkpoint() should have been persisted, then break
the retry loop.

Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2026-01-27 02:45:44 +00:00
Chao Yu
a5d8b9d94e f2fs: fix to unlock folio in f2fs_read_data_large_folio()
We missed to unlock folio in error path of f2fs_read_data_large_folio(),
fix it.

With below testcase, it can reproduce the bug.

touch /mnt/f2fs/file
truncate -s $((1024*1024*1024)) /mnt/f2fs/file
f2fs_io setflags immutable /mnt/f2fs/file
sync
echo 3 > /proc/sys/vm/drop_caches
time dd if=/mnt/f2fs/file of=/dev/null bs=1M count=1024
f2fs_io clearflags immutable /mnt/f2fs/file
echo 1 > /proc/sys/vm/drop_caches
time dd if=/mnt/f2fs/file of=/dev/null bs=1M count=1024
time dd if=/mnt/f2fs/file of=/dev/null bs=1M count=1024

Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2026-01-19 17:07:47 +00:00
Chao Yu
fe15bc3d44 f2fs: fix error path handling in f2fs_read_data_large_folio()
In error path of f2fs_read_data_large_folio(), if bio is valid, it
may submit bio twice, fix it.

Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2026-01-19 17:07:47 +00:00
Jaegeuk Kim
ec8bb999dc f2fs: use folio_end_read
No logic change.

Suggested-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2026-01-17 00:00:35 +00:00
Chao Yu
5c145c0318 f2fs: fix to avoid mapping wrong physical block for swapfile
Xiaolong Guo reported a f2fs bug in bugzilla [1]

[1] https://bugzilla.kernel.org/show_bug.cgi?id=220951

Quoted:

"When using stress-ng's swap stress test on F2FS filesystem with kernel 6.6+,
the system experiences data corruption leading to either:
1 dm-verity corruption errors and device reboot
2 F2FS node corruption errors and boot hangs

The issue occurs specifically when:
1 Using F2FS filesystem (ext4 is unaffected)
2 Swapfile size is less than F2FS section size (2MB)
3 Swapfile has fragmented physical layout (multiple non-contiguous extents)
4 Kernel version is 6.6+ (6.1 is unaffected)

The root cause is in check_swap_activate() function in fs/f2fs/data.c. When the
first extent of a small swapfile (< 2MB) is not aligned to section boundaries,
the function incorrectly treats it as the last extent, failing to map
subsequent extents. This results in incorrect swap_extent creation where only
the first extent is mapped, causing subsequent swap writes to overwrite wrong
physical locations (other files' data).

Steps to Reproduce
1 Setup a device with F2FS-formatted userdata partition
2 Compile stress-ng from https://github.com/ColinIanKing/stress-ng
3 Run swap stress test: (Android devices)
adb shell "cd /data/stressng; ./stress-ng-64 --metrics-brief --timeout 60
--swap 0"

Log:
1 Ftrace shows in kernel 6.6, only first extent is mapped during second
f2fs_map_blocks call in check_swap_activate():
stress-ng-swap-8990: f2fs_map_blocks: ino=11002, file offset=0, start
blkaddr=0x43143, len=0x1
(Only 4KB mapped, not the full swapfile)
2 in kernel 6.1, both extents are correctly mapped:
stress-ng-swap-5966: f2fs_map_blocks: ino=28011, file offset=0, start
blkaddr=0x13cd4, len=0x1
stress-ng-swap-5966: f2fs_map_blocks: ino=28011, file offset=1, start
blkaddr=0x60c84b, len=0xff

The problematic code is in check_swap_activate():
if ((pblock - SM_I(sbi)->main_blkaddr) % blks_per_sec ||
    nr_pblocks % blks_per_sec ||
    !f2fs_valid_pinned_area(sbi, pblock)) {
    bool last_extent = false;

    not_aligned++;

    nr_pblocks = roundup(nr_pblocks, blks_per_sec);
    if (cur_lblock + nr_pblocks > sis->max)
        nr_pblocks -= blks_per_sec;

    /* this extent is last one */
    if (!nr_pblocks) {
        nr_pblocks = last_lblock - cur_lblock;
        last_extent = true;
    }

    ret = f2fs_migrate_blocks(inode, cur_lblock, nr_pblocks);
    if (ret) {
        if (ret == -ENOENT)
            ret = -EINVAL;
        goto out;
    }

    if (!last_extent)
        goto retry;
}

When the first extent is unaligned and roundup(nr_pblocks, blks_per_sec)
exceeds sis->max, we subtract blks_per_sec resulting in nr_pblocks = 0. The
code then incorrectly assumes this is the last extent, sets nr_pblocks =
last_lblock - cur_lblock (entire swapfile), and performs migration. After
migration, it doesn't retry mapping, so subsequent extents are never processed.
"

In order to fix this issue, we need to lookup block mapping info after
we migrate all blocks in the tail of swapfile.

Cc: stable@kernel.org
Fixes: 9703d69d9d ("f2fs: support file pinning for zoned devices")
Cc: Daeho Jeong <daehojeong@google.com>
Reported-and-tested-by: Xiaolong Guo <guoxiaolong2008@gmail.com>
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=220951
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2026-01-17 00:00:35 +00:00
Chao Yu
fe2961fb77 f2fs: avoid f2fs_map_blocks() for consecutive holes in readpages
For consecutive large hole mapping across {d,id,did}nodes , we don't
need to call f2fs_map_blocks() to check one hole block per one time,
instead, we can use map.m_next_pgofs as a hint of next potential valid
block, so that we can skip calling f2fs_map_blocks the range of
[cur_pgofs + 1, .m_next_pgofs).

1) regular case

touch /mnt/f2fs/file
truncate -s $((1024*1024*1024)) /mnt/f2fs/file
time dd if=/mnt/f2fs/file of=/dev/null bs=1M count=1024

Before:
real    0m0.706s
user    0m0.000s
sys     0m0.706s

After:
real    0m0.620s
user    0m0.008s
sys     0m0.611s

2) large folio case

touch /mnt/f2fs/file
truncate -s $((1024*1024*1024)) /mnt/f2fs/file
f2fs_io setflags immutable /mnt/f2fs/file
sync
echo 3 > /proc/sys/vm/drop_caches
time dd if=/mnt/f2fs/file of=/dev/null bs=1M count=1024

Before:
real    0m0.438s
user    0m0.004s
sys     0m0.433s

After:
real    0m0.368s
user    0m0.004s
sys     0m0.364s

Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2026-01-17 00:00:35 +00:00
Nanzhe Zhao
d194f112a9 f2fs: advance index and offset after zeroing in large folio read
In f2fs_read_data_large_folio(), the block zeroing path calls
folio_zero_range() and then continues the loop. However, it fails to
advance index and offset before continuing.

This can cause the loop to repeatedly process the same subpage of the
folio, leading to stalls/hangs and incorrect progress when reading large
folios with holes/zeroed blocks.

Fix it by advancing index and offset unconditionally in the loop
iteration, so they are updated even when the zeroing path continues.

Signed-off-by: Nanzhe Zhao <nzzhao@126.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2026-01-17 00:00:35 +00:00
Nanzhe Zhao
6afd05ca6d f2fs: add 'folio_in_bio' to handle readahead folios with no BIO submission
f2fs_read_data_large_folio() can build a single read BIO across multiple
folios during readahead. If a folio ends up having none of its subpages
added to the BIO (e.g. all subpages are zeroed / treated as holes), it
will never be seen by f2fs_finish_read_bio(), so folio_end_read() is
never called. This leaves the folio locked and not marked uptodate.

Track whether the current folio has been added to a BIO via a local
'folio_in_bio' bool flag, and when iterating readahead folios, explicitly
mark the folio uptodate (on success) and unlock it when nothing was added.

Signed-off-by: Nanzhe Zhao <nzzhao@126.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2026-01-17 00:00:35 +00:00
Yongpeng Yang
540d34c182 f2fs: avoid unnecessary block mapping lookups in f2fs_read_data_large_folio
In the second call to f2fs_map_blocks within f2fs_read_data_large_folio,
map.m_len exceeds the logical address space to be read. This patch
ensures map.m_len does not exceed the required address space.

Signed-off-by: Yongpeng Yang <yangyongpeng@xiaomi.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2026-01-17 00:00:35 +00:00
Chao Yu
50ac3ecd8e f2fs: fix to do sanity check on node footer in {read,write}_end_io
-----------[ cut here ]------------
kernel BUG at fs/f2fs/data.c:358!
Call Trace:
 <IRQ>
 blk_update_request+0x5eb/0xe70 block/blk-mq.c:987
 blk_mq_end_request+0x3e/0x70 block/blk-mq.c:1149
 blk_complete_reqs block/blk-mq.c:1224 [inline]
 blk_done_softirq+0x107/0x160 block/blk-mq.c:1229
 handle_softirqs+0x283/0x870 kernel/softirq.c:579
 __do_softirq kernel/softirq.c:613 [inline]
 invoke_softirq kernel/softirq.c:453 [inline]
 __irq_exit_rcu+0xca/0x1f0 kernel/softirq.c:680
 irq_exit_rcu+0x9/0x30 kernel/softirq.c:696
 instr_sysvec_apic_timer_interrupt arch/x86/kernel/apic/apic.c:1050 [inline]
 sysvec_apic_timer_interrupt+0xa6/0xc0 arch/x86/kernel/apic/apic.c:1050
 </IRQ>

In f2fs_write_end_io(), it detects there is inconsistency in between
node page index (nid) and footer.nid of node page.

If footer of node page is corrupted in fuzzed image, then we load corrupted
node page w/ async method, e.g. f2fs_ra_node_pages() or f2fs_ra_node_page(),
in where we won't do sanity check on node footer, once node page becomes
dirty, we will encounter this bug after node page writeback.

Cc: stable@kernel.org
Reported-by: syzbot+803dd716c4310d16ff3a@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=803dd716c4310d16ff3a
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2026-01-17 00:00:34 +00:00
Chao Yu
ce2739e482 f2fs: fix to avoid UAF in f2fs_write_end_io()
As syzbot reported an use-after-free issue in f2fs_write_end_io().

It is caused by below race condition:

loop device				umount
- worker_thread
 - loop_process_work
  - do_req_filebacked
   - lo_rw_aio
    - lo_rw_aio_complete
     - blk_mq_end_request
      - blk_update_request
       - f2fs_write_end_io
        - dec_page_count
        - folio_end_writeback
					- kill_f2fs_super
					 - kill_block_super
					  - f2fs_put_super
					 : free(sbi)
       : get_pages(, F2FS_WB_CP_DATA)
         accessed sbi which is freed

In kill_f2fs_super(), we will drop all page caches of f2fs inodes before
call free(sbi), it guarantee that all folios should end its writeback, so
it should be safe to access sbi before last folio_end_writeback().

Let's relocate ckpt thread wakeup flow before folio_end_writeback() to
resolve this issue.

Cc: stable@kernel.org
Fixes: e234088758 ("f2fs: avoid wait if IO end up when do_checkpoint for better performance")
Reported-by: syzbot+b4444e3c972a7a124187@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=b4444e3c972a7a124187
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2026-01-17 00:00:34 +00:00
Chao Yu
3996b70209 Revert "f2fs: block cache/dio write during f2fs_enable_checkpoint()"
This reverts commit 196c81fdd4.

Original patch may cause below deadlock, revert it.

write				remount
- write_begin
 - lock_page  --- lock A
 - prepare_write_begin
  - f2fs_map_lock
				- f2fs_enable_checkpoint
				 - down_write(cp_enable_rwsem)  --- lock B
				 - sync_inode_sb
				  - writepages
				   - lock_page			--- lock A
   - down_read(cp_enable_rwsem)  --- lock A

Cc: stable@kernel.org
Fixes: 196c81fdd4 ("f2fs: block cache/dio write during f2fs_enable_checkpoint()")
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2026-01-16 03:49:31 +00:00
Christoph Hellwig
bb8e2019ad blk-crypto: handle the fallback above the block layer
Add a blk_crypto_submit_bio helper that either submits the bio when
it is not encrypted or inline encryption is provided, but otherwise
handles the encryption before going down into the low-level driver.
This reduces the risk from bio reordering and keeps memory allocation
as high up in the stack as possible.

Note that if the submitter knows that inline enctryption is known to
be supported by the underyling driver, it can still use plain
submit_bio.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Eric Biggers <ebiggers@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-01-11 12:55:41 -07:00
Nanzhe Zhao
c0c589fa1d f2fs: Accounting large folio subpages before bio submission
In f2fs_read_data_large_folio(), read_pages_pending is incremented only
after the subpage has been added to the BIO.  With a heavily fragmented
file, each new subpage can force submission of the previous BIO.

If the BIO completes quickly, f2fs_finish_read_bio() may decrement
read_pages_pending to zero and call folio_end_read() while the read loop
is still processing other subpages of the same large folio.

Fix the ordering by incrementing read_pages_pending before any possible
BIO submission for the current subpage, matching the iomap ordering and
preventing premature folio_end_read().

Signed-off-by: Nanzhe Zhao <nzzhao@126.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2026-01-07 03:17:08 +00:00
Nanzhe Zhao
00feea1dfc f2fs: Zero f2fs_folio_state on allocation
f2fs_folio_state is attached to folio->private and is expected to start
with read_pages_pending == 0.  However, the structure was allocated from
ffs_entry_slab without being fully initialized, which can leave
read_pages_pending with stale values.

Allocate the object with __GFP_ZERO so all fields are reliably zeroed at
creation time.

Signed-off-by: Nanzhe Zhao <nzzhao@126.com>
Reviewed-by: Barry Song <baohua@kernel.org>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2026-01-07 03:17:08 +00:00
Chao Yu
67972c2b89 f2fs: trace elapsed time for io_rwsem lock
Use f2fs_{down,up}_{read,write}_trace for io_rwsem to trace lock elapsed time.

Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2026-01-07 03:17:07 +00:00
Chao Yu
bb28b66875 f2fs: trace elapsed time for node_write lock
Use f2fs_{down,up}_read_trace for node_write to trace lock elapsed time.

Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2026-01-07 03:17:06 +00:00
Chao Yu
f9f9360251 f2fs: trace elapsed time for node_change lock
Use f2fs_{down,up}_read_trace for node_change to trace lock elapsed time.

Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2026-01-07 03:17:06 +00:00
Chao Yu
66e9e0d55d f2fs: trace elapsed time for cp_rwsem lock
Use f2fs_{down,up}_read_trace for cp_rwsem to trace lock elapsed time.

Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2026-01-07 03:17:06 +00:00
Yongpeng Yang
db1a8a7813 f2fs: return immediately after submitting the specified folio in __submit_merged_write_cond
f2fs_folio_wait_writeback ensures the folio write is submitted to the
block layer via __submit_merged_write_cond, then waits for the folio to
complete. Other I/O submissions are irrelevant to
f2fs_folio_wait_writeback. Thus, if the folio write bio is already
submitted, the function can return immediately. This patch adds a
writeback parameter to __submit_merged_write_cond(), which signals an
immediate return after submitting the target folio, and waitting
writeback can use this parameter.

Signed-off-by: Yongpeng Yang <yangyongpeng@xiaomi.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2026-01-01 03:29:40 +00:00
Yongpeng Yang
86c1cf0578 f2fs: clean up the force parameter in __submit_merged_write_cond()
The force parameter in __submit_merged_write_cond is redundant, where
`force == true` implies `inode == NULL && folio == NULL && ino == 0` is
true, and `force == false` implies `inode != NULL || folio != NULL ||
ino != 0` is true. Thus, this patch replaces the force parameter with
a stack variable force.

Signed-off-by: Yongpeng Yang <yangyongpeng@xiaomi.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2026-01-01 03:29:35 +00:00
Jaegeuk Kim
903c6e95bc f2fs: add a tracepoint to see large folio read submission
For example,

1327.539878: f2fs_preload_pages_start: dev = (252,16), ino = 14, i_size = 4294967296 start: 0, end: 8191
1327.539878: page_cache_sync_ra: dev=252:16 ino=e index=0 req_count=8192 order=9 size=0 async_size=0 ra_pages=4096 mmap_miss=0 prev_pos=-1
1327.539879: page_cache_ra_order: dev=252:16 ino=e index=0 order=9 size=4096 async_size=2048 ra_pages=4096
1327.541895: f2fs_readpages: dev = (252,16), ino = 14, start = 0 nrpage = 4096
1327.541930: f2fs_lookup_extent_tree_start: dev = (252,16), ino = 14, pgofs = 0, type = Read
1327.541931: f2fs_lookup_read_extent_tree_end: dev = (252,16), ino = 14, pgofs = 0, read_ext_info(fofs: 0, len: 1048576, blk: 4221440)
1327.541931: f2fs_map_blocks: dev = (252,16), ino = 14, file offset = 0, start blkaddr = 0x406a00, len = 0x1000, flags = 2, seg_type = 8, may_create = 0, multidevice = 0, flag = 0, err = 0
1327.541989: f2fs_read_folio: dev = (252,16), ino = 14, DATA, FILE, index = 0, nr_pages = 512, dirty = 0, uptodate = 0
1327.542012: f2fs_read_folio: dev = (252,16), ino = 14, DATA, FILE, index = 512, nr_pages = 512, dirty = 0, uptodate = 0
1327.542036: f2fs_read_folio: dev = (252,16), ino = 14, DATA, FILE, index = 1024, nr_pages = 512, dirty = 0, uptodate = 0
1327.542080: f2fs_read_folio: dev = (252,16), ino = 14, DATA, FILE, index = 1536, nr_pages = 512, dirty = 0, uptodate = 0
1327.542127: f2fs_read_folio: dev = (252,16), ino = 14, DATA, FILE, index = 2048, nr_pages = 512, dirty = 0, uptodate = 0
1327.542151: f2fs_read_folio: dev = (252,16), ino = 14, DATA, FILE, index = 2560, nr_pages = 512, dirty = 0, uptodate = 0
1327.542196: f2fs_read_folio: dev = (252,16), ino = 14, DATA, FILE, index = 3072, nr_pages = 512, dirty = 0, uptodate = 0
1327.542219: f2fs_read_folio: dev = (252,16), ino = 14, DATA, FILE, index = 3584, nr_pages = 512, dirty = 0, uptodate = 0
1327.542239: f2fs_submit_read_bio: dev = (252,16)/(252,16), rw = READ(R), DATA, sector = 33771520, size = 16777216
1327.542269: page_cache_sync_ra: dev=252:16 ino=e index=4096 req_count=8192 order=9 size=4096 async_size=2048 ra_pages=4096 mmap_miss=0 prev_pos=-1
1327.542289: page_cache_ra_order: dev=252:16 ino=e index=4096 order=9 size=4096 async_size=2048 ra_pages=4096
1327.544485: f2fs_readpages: dev = (252,16), ino = 14, start = 4096 nrpage = 4096
1327.544521: f2fs_lookup_extent_tree_start: dev = (252,16), ino = 14, pgofs = 4096, type = Read
1327.544521: f2fs_lookup_read_extent_tree_end: dev = (252,16), ino = 14, pgofs = 4096, read_ext_info(fofs: 0, len: 1048576, blk: 4221440)
1327.544522: f2fs_map_blocks: dev = (252,16), ino = 14, file offset = 4096, start blkaddr = 0x407a00, len = 0x1000, flags = 2, seg_type = 8, may_create = 0, multidevice = 0, flag = 0, err = 0
1327.544550: f2fs_read_folio: dev = (252,16), ino = 14, DATA, FILE, index = 4096, nr_pages = 512, dirty = 0, uptodate = 0
1327.544575: f2fs_read_folio: dev = (252,16), ino = 14, DATA, FILE, index = 4608, nr_pages = 512, dirty = 0, uptodate = 0
1327.544601: f2fs_read_folio: dev = (252,16), ino = 14, DATA, FILE, index = 5120, nr_pages = 512, dirty = 0, uptodate = 0
1327.544647: f2fs_read_folio: dev = (252,16), ino = 14, DATA, FILE, index = 5632, nr_pages = 512, dirty = 0, uptodate = 0
1327.544692: f2fs_read_folio: dev = (252,16), ino = 14, DATA, FILE, index = 6144, nr_pages = 512, dirty = 0, uptodate = 0
1327.544734: f2fs_read_folio: dev = (252,16), ino = 14, DATA, FILE, index = 6656, nr_pages = 512, dirty = 0, uptodate = 0
1327.544777: f2fs_read_folio: dev = (252,16), ino = 14, DATA, FILE, index = 7168, nr_pages = 512, dirty = 0, uptodate = 0
1327.544805: f2fs_read_folio: dev = (252,16), ino = 14, DATA, FILE, index = 7680, nr_pages = 512, dirty = 0, uptodate = 0
1327.544826: f2fs_submit_read_bio: dev = (252,16)/(252,16), rw = READ(R), DATA, sector = 33804288, size = 16777216
1327.544852: f2fs_preload_pages_end: dev = (252,16), ino = 14, i_size = 4294967296 start: 8192, end: 8191

Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2025-12-16 00:46:49 +00:00
Jaegeuk Kim
05e65c14ea f2fs: support large folio for immutable non-compressed case
This patch enables large folio for limited case where we can get the high-order
memory allocation. It supports the encrypted and fsverity files, which are
essential for Android environment.

How to test:
- dd if=/dev/zero of=/mnt/test/test bs=1G count=4
- f2fs_io setflags immutable /mnt/test/test
- echo 3 > /proc/sys/vm/drop_caches
 : to reload inode with large folio
- f2fs_io read 32 0 1024 mmap 0 0 /mnt/test/test

Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2025-12-16 00:46:49 +00:00
Linus Torvalds
cb015814f8 f2fs-for-6.19-rc1
This series focuses on minor clean-ups and performance optimizations across
 sysfs, documentation, debugfs, tracepoints, slab allocation, and GC.
 Furthermore, it resolves several corner-case bugs caught by xfstests, as
 well as issues related to 16KB page support and f2fs_enable_checkpoint.
 
 Enhancement:
  - wrap ASCII tables in literal blocks to fix LaTeX build
  - optimize trace_f2fs_write_checkpoint with enums
  - support to show curseg.next_blkoff in debugfs
  - add a sysfs entry to show max open zones
  - add fadvise tracepoint
  - use global inline_xattr_slab instead of per-sb slab cache
  - set default valid_thresh_ratio to 80 for zoned devices
  - maintain one time GC mode is enabled during whole zoned GC cycle
 
 Bug fix:
  - ensure node page reads complete before f2fs_put_super() finishes
  - fix to not account invalid blocks in get_left_section_blocks()
  - revert summary entry count from 2048 to 512 in 16kb block support
  - fix to detect recoverable inode during dryrun of find_fsync_dnodes()
  - fix age extent cache insertion skip on counter overflow
  - Add sanity checks before unlinking and loading inodes
  - ensure minimum trim granularity accounts for all devices
  - block cache/dio write during f2fs_enable_checkpoint()
  - fix to propagate error from f2fs_enable_checkpoint()
  - invalidate dentry cache on failed whiteout creation
  - fix to avoid updating compression context during writeback
  - fix to avoid updating zero-sized extent in extent cache
  - fix to avoid potential deadlock
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE00UqedjCtOrGVvQiQBSofoJIUNIFAmk3N1kACgkQQBSofoJI
 UNKfGw//Z7+0Oy0w/3k8UkJHvz6b3sDFzzCGlyBtYUaQaxp0eXxytB9T7GNE4g8z
 UA6nOA7VvHdFyu8YvJkMrf8vejorVnO9I86vlUZ/uZcOqKPWkjNxaHJvMYg0ZvkS
 uwiFo8rSL5FO0MSbnVhZScnolNuEINYi1sYd0fb2BzHB3P7cSwRrDGYuU53E3S8p
 3JsOa1EN0DrxlL7YTI8q8wmMcN1+/BK9YP4Sl3r8nBAYNAoP/JLMY40YkOTk3gKy
 ppJ32e++D9XxVTEaZUvktW/z9zLKdSvqjFE0BduSbNrqlfGj2AEwU1WJouFPYDOs
 b4mDhi9y3Mv2LWY6fTeOXcT/nTf6IssopHNBpPI6Ay73GwENPOYf+q4oTNeqpa1f
 sGqmw6M8NGiEjQAPKrbON8IDSpdc6Yzk1ENRjOf5j7/xR0gtL1b3G0KV5FCO+25x
 QP9KupkhBc9yheCTrig6reCQlvfWU+I70tyB30YD/BcqhCB/EjBvM/v9kK1udN0e
 6wjr5eBfX8z8DGlqNYzAjjEQC8IfkwDc1qLkovTsBKBo2Z0fHPriAZERAcLU7TuU
 z06GZQT6QdZ4lAw4KfNWcef0S3m14qY5E8qJoQS2G7DwdMOglouJRakOi75nW1Dc
 lSZBI1m1JxwLsj7iXNXLEJoGMUR5u+oUzJyj46trn6fOG6AIbuo=
 =4ZOp
 -----END PGP SIGNATURE-----

Merge tag 'f2fs-for-6.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs

Pull f2fs updates from Jaegeuk Kim:
 "This series focuses on minor clean-ups and performance optimizations
  across sysfs, documentation, debugfs, tracepoints, slab allocation,
  and GC. Furthermore, it resolves several corner-case bugs caught by
  xfstests, as well as issues related to 16KB page support and
  f2fs_enable_checkpoint.

  Enhancement:
   - wrap ASCII tables in literal blocks to fix LaTeX build
   - optimize trace_f2fs_write_checkpoint with enums
   - support to show curseg.next_blkoff in debugfs
   - add a sysfs entry to show max open zones
   - add fadvise tracepoint
   - use global inline_xattr_slab instead of per-sb slab cache
   - set default valid_thresh_ratio to 80 for zoned devices
   - maintain one time GC mode is enabled during whole zoned GC cycle

  Bug fix:
   - ensure node page reads complete before f2fs_put_super() finishes
   - do not account invalid blocks in get_left_section_blocks()
   - revert summary entry count from 2048 to 512 in 16kb block support
   - detect recoverable inode during dryrun of find_fsync_dnodes()
   - fix age extent cache insertion skip on counter overflow
   - add sanity checks before unlinking and loading inodes
   - ensure minimum trim granularity accounts for all devices
   - block cache/dio write during f2fs_enable_checkpoint()
   - propagate error from f2fs_enable_checkpoint()
   - invalidate dentry cache on failed whiteout creation
   - avoid updating compression context during writeback
   - avoid updating zero-sized extent in extent cache
   - avoid potential deadlock"

* tag 'f2fs-for-6.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (39 commits)
  f2fs: ignore discard return value
  f2fs: optimize trace_f2fs_write_checkpoint with enums
  f2fs: fix to not account invalid blocks in get_left_section_blocks()
  f2fs: support to show curseg.next_blkoff in debugfs
  docs: f2fs: wrap ASCII tables in literal blocks to fix LaTeX build
  f2fs: expand scalability of f2fs mount option
  f2fs: change default schedule timeout value
  f2fs: introduce f2fs_schedule_timeout()
  f2fs: use memalloc_retry_wait() as much as possible
  f2fs: add a sysfs entry to show max open zones
  f2fs: wrap all unusable_blocks_per_sec code in CONFIG_BLK_DEV_ZONED
  f2fs: simplify list initialization in f2fs_recover_fsync_data()
  f2fs: revert summary entry count from 2048 to 512 in 16kb block support
  f2fs: fix to detect recoverable inode during dryrun of find_fsync_dnodes()
  f2fs: fix return value of f2fs_recover_fsync_data()
  f2fs: add fadvise tracepoint
  f2fs: fix age extent cache insertion skip on counter overflow
  f2fs: Add sanity checks before unlinking and loading inodes
  f2fs: Rename f2fs_unlink exit label
  f2fs: ensure minimum trim granularity accounts for all devices
  ...
2025-12-09 12:06:20 +09:00
Chao Yu
76e780d88c f2fs: introduce f2fs_schedule_timeout()
In f2fs retry logic, we will call f2fs_io_schedule_timeout() to sleep as
uninterruptible state (waiting for IO) for a while, however, in several
paths below, we are not blocked by IO:
- f2fs_write_single_data_page() return -EAGAIN due to racing on cp_rwsem.
- f2fs_flush_device_cache() failed to submit preflush command.
- __issue_discard_cmd_range() sleeps periodically in between two in batch
discard submissions.

So, in order to reveal state of task more accurate, let's introduce
f2fs_schedule_timeout() and call it in above paths in where we are waiting
for non-IO reasons.

Then we can get real reason of uninterruptible sleep for a thread in
tracepoint, perfetto, etc.

Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2025-12-04 02:00:05 +00:00
Chao Yu
196c81fdd4 f2fs: block cache/dio write during f2fs_enable_checkpoint()
If there are too many background IOs during f2fs_enable_checkpoint(),
sync_inodes_sb() may be blocked for long time due to it will loop to
write dirty datas which are generated by in parallel write()
continuously.

Let's change as below to resolve this issue:
- hold cp_enable_rwsem write lock to block any cache/dio write
- decrease DEF_ENABLE_INTERVAL from 16 to 5

In addition, dump more logs during f2fs_enable_checkpoint().

Testcase:
1. fill data into filesystem until 90% usage.
2. mount -o remount,checkpoint=disable:10% /data
3. fio --rw=randwrite  --bs=4kb  --size=1GB  --numjobs=10  \
--iodepth=64  --ioengine=psync  --time_based  --runtime=600 \
--directory=/data/fio_dir/ &
4. mount -o remount,checkpoint=enable /data

Before:
F2FS-fs (dm-51): f2fs_enable_checkpoint() finishes, writeback:7232, sync:39793, cp:457

After:
F2FS-fs (dm-51): f2fs_enable_checkpoint end, writeback:5032, lock:0, sync_inode:5552, sync_fs:84

Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2025-12-04 02:00:02 +00:00
Yongpeng Yang
89c16629e3 f2fs: change the unlock parameter of f2fs_put_page to bool
Change the type of the unlock parameter of f2fs_put_page to bool.
All callers should consistently pass true or false. No logical change.

Signed-off-by: Yongpeng Yang <yangyongpeng@xiaomi.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2025-12-04 02:00:02 +00:00
Chao Yu
10b591e7fb f2fs: fix to avoid updating compression context during writeback
Bai, Shuangpeng <sjb7183@psu.edu> reported a bug as below:

Oops: divide error: 0000 [#1] SMP KASAN PTI
CPU: 0 UID: 0 PID: 11441 Comm: syz.0.46 Not tainted 6.17.0 #1 PREEMPT(full)
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
RIP: 0010:f2fs_all_cluster_page_ready+0x106/0x550 fs/f2fs/compress.c:857
Call Trace:
 <TASK>
 f2fs_write_cache_pages fs/f2fs/data.c:3078 [inline]
 __f2fs_write_data_pages fs/f2fs/data.c:3290 [inline]
 f2fs_write_data_pages+0x1c19/0x3600 fs/f2fs/data.c:3317
 do_writepages+0x38e/0x640 mm/page-writeback.c:2634
 filemap_fdatawrite_wbc mm/filemap.c:386 [inline]
 __filemap_fdatawrite_range mm/filemap.c:419 [inline]
 file_write_and_wait_range+0x2ba/0x3e0 mm/filemap.c:794
 f2fs_do_sync_file+0x6e6/0x1b00 fs/f2fs/file.c:294
 generic_write_sync include/linux/fs.h:3043 [inline]
 f2fs_file_write_iter+0x76e/0x2700 fs/f2fs/file.c:5259
 new_sync_write fs/read_write.c:593 [inline]
 vfs_write+0x7e9/0xe00 fs/read_write.c:686
 ksys_write+0x19d/0x2d0 fs/read_write.c:738
 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
 do_syscall_64+0xf7/0x470 arch/x86/entry/syscall_64.c:94
 entry_SYSCALL_64_after_hwframe+0x77/0x7f

The bug was triggered w/ below race condition:

fsync				setattr			ioctl
- f2fs_do_sync_file
 - file_write_and_wait_range
  - f2fs_write_cache_pages
  : inode is non-compressed
  : cc.cluster_size =
    F2FS_I(inode)->i_cluster_size = 0
   - tag_pages_for_writeback
				- f2fs_setattr
				 - truncate_setsize
				 - f2fs_truncate
							- f2fs_fileattr_set
							 - f2fs_setflags_common
							  - set_compress_context
							  : F2FS_I(inode)->i_cluster_size = 4
							  : set_inode_flag(inode, FI_COMPRESSED_FILE)
   - f2fs_compressed_file
   : return true
   - f2fs_all_cluster_page_ready
   : "pgidx % cc->cluster_size" trigger dividing 0 issue

Let's change as below to fix this issue:
- introduce a new atomic type variable .writeback in structure f2fs_inode_info
to track the number of threads which calling f2fs_write_cache_pages().
- use .i_sem lock to protect .writeback update.
- check .writeback before update compression context in f2fs_setflags_common()
to avoid race w/ ->writepages.

Fixes: 4c8ff7095b ("f2fs: support data compression")
Cc: stable@kernel.org
Reported-by: Bai, Shuangpeng <sjb7183@psu.edu>
Tested-by: Bai, Shuangpeng <sjb7183@psu.edu>
Closes: https://lore.kernel.org/lkml/44D8F7B3-68AD-425F-9915-65D27591F93F@psu.edu
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2025-12-04 02:00:02 +00:00
Chao Yu
c1cdb00488 f2fs: use f2fs_filemap_get_folio() to support fault injection
Use f2fs_filemap_get_folio() instead of __filemap_get_folio() in:
- f2fs_find_data_folio
- f2fs_write_begin
- f2fs_read_merkle_tree_page

So that, we can trigger fault injection in those places.

Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2025-12-04 01:55:57 +00:00
Chao Yu
3b7e73ddc0 f2fs: convert add_ipu_page() to use folio
No logic changes.

Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2025-12-04 01:55:57 +00:00