// SPDX-License-Identifier: GPL-2.0
/*
 * linux/fs/ext4/page-io.c
 *
 * This contains the new page_io functions for ext4
 *
 * Written by Theodore Ts'o, 2010.
 */

#include <linux/blk-crypto.h>
#include <linux/fs.h>
#include <linux/time.h>
#include <linux/highuid.h>
#include <linux/pagemap.h>
#include <linux/quotaops.h>
#include <linux/string.h>
#include <linux/buffer_head.h>
#include <linux/writeback.h>
#include <linux/mpage.h>
#include <linux/namei.h>
#include <linux/uio.h>
#include <linux/bio.h>
#include <linux/workqueue.h>
#include <linux/kernel.h>
#include <linux/slab.h>
#include <linux/mm.h>
#include <linux/sched/mm.h>

#include "ext4_jbd2.h"
#include "xattr.h"
#include "acl.h"

static struct kmem_cache *io_end_cachep;
static struct kmem_cache *io_end_vec_cachep;

int __init ext4_init_pageio(void)
{
	io_end_cachep = KMEM_CACHE(ext4_io_end, SLAB_RECLAIM_ACCOUNT);
	if (io_end_cachep == NULL)
		return -ENOMEM;

	io_end_vec_cachep = KMEM_CACHE(ext4_io_end_vec, 0);
	if (io_end_vec_cachep == NULL) {
		kmem_cache_destroy(io_end_cachep);
		return -ENOMEM;
	}
	return 0;
}

void ext4_exit_pageio(void)
{
	kmem_cache_destroy(io_end_cachep);
	kmem_cache_destroy(io_end_vec_cachep);
}

struct ext4_io_end_vec *ext4_alloc_io_end_vec(ext4_io_end_t *io_end)
{
	struct ext4_io_end_vec *io_end_vec;

	io_end_vec = kmem_cache_zalloc(io_end_vec_cachep, GFP_NOFS);
	if (!io_end_vec)
		return ERR_PTR(-ENOMEM);
	INIT_LIST_HEAD(&io_end_vec->list);
	list_add_tail(&io_end_vec->list, &io_end->list_vec);
	return io_end_vec;
}

static void ext4_free_io_end_vec(ext4_io_end_t *io_end)
{
	struct ext4_io_end_vec *io_end_vec, *tmp;

	if (list_empty(&io_end->list_vec))
		return;
	list_for_each_entry_safe(io_end_vec, tmp, &io_end->list_vec, list) {
		list_del(&io_end_vec->list);
		kmem_cache_free(io_end_vec_cachep, io_end_vec);
	}
}

struct ext4_io_end_vec *ext4_last_io_end_vec(ext4_io_end_t *io_end)
{
	BUG_ON(list_empty(&io_end->list_vec));
	return list_last_entry(&io_end->list_vec, struct ext4_io_end_vec, list);
}

/*
 * Print a buffer I/O error compatible with fs/buffer.c.  This
 * provides compatibility with dmesg scrapers that look for a specific
 * buffer I/O error message.  We really need a unified error reporting
 * structure to userspace ala Digital Unix's uerf system, but it's
 * probably not going to happen in my lifetime, due to LKML politics...
 */
static void buffer_io_error(struct buffer_head *bh)
{
	printk_ratelimited(KERN_ERR "Buffer I/O error on device %pg, logical block %llu\n",
			   bh->b_bdev,
			   (unsigned long long)bh->b_blocknr);
}

static void ext4_finish_bio(struct bio *bio)
{
	struct folio_iter fi;

	bio_for_each_folio_all(fi, bio) {
		struct folio *folio = fi.folio;
		struct folio *io_folio = NULL;
		struct buffer_head *bh, *head;
		size_t bio_start = fi.offset;
		size_t bio_end = bio_start + fi.length;
		unsigned under_io = 0;
		unsigned long flags;

		if (fscrypt_is_bounce_folio(folio)) {
			io_folio = folio;
			folio = fscrypt_pagecache_folio(folio);
		}

		if (bio->bi_status) {
			int err = blk_status_to_errno(bio->bi_status);
			mapping_set_error(folio->mapping, err);
		}
		bh = head = folio_buffers(folio);
		/*
		 * We check all buffers in the folio under b_uptodate_lock
		 * to avoid races with other end io clearing async_write flags
		 */
		spin_lock_irqsave(&head->b_uptodate_lock, flags);
		do {
			if (bh_offset(bh) < bio_start ||
			    bh_offset(bh) + bh->b_size > bio_end) {
				if (buffer_async_write(bh))
					under_io++;
				continue;
			}
			clear_buffer_async_write(bh);
			if (bio->bi_status) {
				set_buffer_write_io_error(bh);
				buffer_io_error(bh);
			}
		} while ((bh = bh->b_this_page) != head);
		spin_unlock_irqrestore(&head->b_uptodate_lock, flags);
		if (!under_io) {
			fscrypt_free_bounce_page(&io_folio->page);
			folio_end_writeback(folio);
		}
	}
}

static void ext4_release_io_end(ext4_io_end_t *io_end)
{
	struct bio *bio, *next_bio;

	BUG_ON(!list_empty(&io_end->list));
	BUG_ON(io_end->flag & EXT4_IO_END_UNWRITTEN);
	WARN_ON(io_end->handle);

	for (bio = io_end->bio; bio; bio = next_bio) {
		next_bio = bio->bi_private;
		ext4_finish_bio(bio);
		bio_put(bio);
	}
	ext4_free_io_end_vec(io_end);
	kmem_cache_free(io_end_cachep, io_end);
}

/*
 * On successful IO, check a range of space and convert unwritten extents to
 * written. On IO failure, check if journal abort is needed. Note that
 * we are protected from truncate touching same part of extent tree by the
 * fact that truncate code waits for all DIO to finish (thus exclusion from
 * direct IO is achieved) and also waits for PageWriteback bits. Thus we
 * cannot get to ext4_ext_truncate() before all IOs overlapping that range are
 * completed (happens from ext4_free_ioend()).
 */
static int ext4_end_io_end(ext4_io_end_t *io_end)
{
	struct inode *inode = io_end->inode;
	handle_t *handle = io_end->handle;
	struct super_block *sb = inode->i_sb;
	int ret = 0;

	ext4_debug("ext4_end_io_nolock: io_end 0x%p from inode %llu,list->next 0x%p,"
		   "list->prev 0x%p\n",
		   io_end, inode->i_ino, io_end->list.next, io_end->list.prev);

	/*
	 * Do not convert the unwritten extents if data writeback fails,
	 * or stale data may be exposed.
	 */
	io_end->handle = NULL;  /* Following call will use up the handle */
	if (unlikely(io_end->flag & EXT4_IO_END_FAILED)) {
		ret = -EIO;
		if (handle)
			jbd2_journal_free_reserved(handle);

		if (test_opt(sb, DATA_ERR_ABORT))
			jbd2_journal_abort(EXT4_SB(sb)->s_journal, ret);
	} else {
		ret = ext4_convert_unwritten_io_end_vec(handle, io_end);
	}
	if (ret < 0 && !ext4_emergency_state(sb) &&
	    io_end->flag & EXT4_IO_END_UNWRITTEN) {
		ext4_msg(sb, KERN_EMERG,
			 "failed to convert unwritten extents to written "
			 "extents -- potential data loss!  "
			 "(inode %llu, error %d)", inode->i_ino, ret);
	}

	ext4_clear_io_unwritten_flag(io_end);
	ext4_release_io_end(io_end);
	return ret;
}

static void dump_completed_IO(struct inode *inode, struct list_head *head)
{
#ifdef	EXT4FS_DEBUG
	struct list_head *cur, *before, *after;
	ext4_io_end_t *io_end, *io_end0, *io_end1;

	if (list_empty(head))
		return;

	ext4_debug("Dump inode %llu completed io list\n", inode->i_ino);
	list_for_each_entry(io_end, head, list) {
		cur = &io_end->list;
		before = cur->prev;
		io_end0 = container_of(before, ext4_io_end_t, list);
		after = cur->next;
		io_end1 = container_of(after, ext4_io_end_t, list);

		ext4_debug("io 0x%p from inode %llu,prev 0x%p,next 0x%p\n",
			    io_end, inode->i_ino, io_end0, io_end1);
	}
#endif
}

static bool ext4_io_end_defer_completion(ext4_io_end_t *io_end)
{
	if (io_end->flag & EXT4_IO_END_UNWRITTEN &&
	    !list_empty(&io_end->list_vec))
		return true;
	if (test_opt(io_end->inode->i_sb, DATA_ERR_ABORT) &&
	    io_end->flag & EXT4_IO_END_FAILED &&
	    !ext4_emergency_state(io_end->inode->i_sb))
		return true;
	return false;
}

/* Add the io_end to per-inode completed end_io list. */
static void ext4_add_complete_io(ext4_io_end_t *io_end)
{
	struct ext4_inode_info *ei = EXT4_I(io_end->inode);
	struct ext4_sb_info *sbi = EXT4_SB(io_end->inode->i_sb);
	struct workqueue_struct *wq;
	unsigned long flags;

	/* Only reserved conversions or pending IO errors will enter here. */
	WARN_ON(!(io_end->flag & EXT4_IO_END_DEFER_COMPLETION));
	WARN_ON(io_end->flag & EXT4_IO_END_UNWRITTEN &&
		!io_end->handle && sbi->s_journal);
	WARN_ON(!io_end->bio);

	spin_lock_irqsave(&ei->i_completed_io_lock, flags);
	wq = sbi->rsv_conversion_wq;
	if (list_empty(&ei->i_rsv_conversion_list))
		queue_work(wq, &ei->i_rsv_conversion_work);
	list_add_tail(&io_end->list, &ei->i_rsv_conversion_list);
	spin_unlock_irqrestore(&ei->i_completed_io_lock, flags);
}

static int ext4_do_flush_completed_IO(struct inode *inode,
				      struct list_head *head)
{
	ext4_io_end_t *io_end;
	struct list_head unwritten;
	unsigned long flags;
	struct ext4_inode_info *ei = EXT4_I(inode);
	int err, ret = 0;

	spin_lock_irqsave(&ei->i_completed_io_lock, flags);
	dump_completed_IO(inode, head);
	list_replace_init(head, &unwritten);
	spin_unlock_irqrestore(&ei->i_completed_io_lock, flags);

	while (!list_empty(&unwritten)) {
		io_end = list_entry(unwritten.next, ext4_io_end_t, list);
		BUG_ON(!(io_end->flag & EXT4_IO_END_DEFER_COMPLETION));
		list_del_init(&io_end->list);

		err = ext4_end_io_end(io_end);
		if (unlikely(!ret && err))
			ret = err;
	}
	return ret;
}

/*
 * Used to convert unwritten extents to written extents upon IO completion,
 * or used to abort the journal upon IO errors.
 */
void ext4_end_io_rsv_work(struct work_struct *work)
{
	struct ext4_inode_info *ei = container_of(work, struct ext4_inode_info,
						  i_rsv_conversion_work);
	ext4_do_flush_completed_IO(&ei->vfs_inode, &ei->i_rsv_conversion_list);
}

ext4_io_end_t *ext4_init_io_end(struct inode *inode, gfp_t flags)
{
	ext4_io_end_t *io_end = kmem_cache_zalloc(io_end_cachep, flags);

	if (io_end) {
		io_end->inode = inode;
		INIT_LIST_HEAD(&io_end->list);
		INIT_LIST_HEAD(&io_end->list_vec);
		refcount_set(&io_end->count, 1);
	}
	return io_end;
}

void ext4_put_io_end_defer(ext4_io_end_t *io_end)
{
	if (refcount_dec_and_test(&io_end->count)) {
		if (ext4_io_end_defer_completion(io_end))
			return ext4_add_complete_io(io_end);

		ext4_release_io_end(io_end);
	}
}

int ext4_put_io_end(ext4_io_end_t *io_end)
{
	if (refcount_dec_and_test(&io_end->count)) {
		if (ext4_io_end_defer_completion(io_end))
			return ext4_end_io_end(io_end);

		ext4_release_io_end(io_end);
	}
	return 0;
}

ext4_io_end_t *ext4_get_io_end(ext4_io_end_t *io_end)
{
	refcount_inc(&io_end->count);
	return io_end;
}

/* BIO completion function for page writeback */
static void ext4_end_bio(struct bio *bio)
{
	ext4_io_end_t *io_end = bio->bi_private;
	sector_t bi_sector = bio->bi_iter.bi_sector;

	if (WARN_ONCE(!io_end, "io_end is NULL: %pg: sector %Lu len %u err %d\n",
		      bio->bi_bdev,
		      (long long) bio->bi_iter.bi_sector,
		      (unsigned) bio_sectors(bio),
		      bio->bi_status)) {
		ext4_finish_bio(bio);
		bio_put(bio);
		return;
	}
	bio->bi_end_io = NULL;

	if (bio->bi_status) {
		struct inode *inode = io_end->inode;

		ext4_warning(inode->i_sb, "I/O error %d writing to inode %llu "
			     "starting block %llu)",
			     bio->bi_status, inode->i_ino,
			     (unsigned long long)
			     bi_sector >> (inode->i_blkbits - 9));
		io_end->flag |= EXT4_IO_END_FAILED;
		mapping_set_error(inode->i_mapping,
				blk_status_to_errno(bio->bi_status));
	}

	if (ext4_io_end_defer_completion(io_end)) {
		/*
		 * Link bio into list hanging from io_end. We have to do it
		 * atomically as bio completions can be racing against each
		 * other.
		 */
		bio->bi_private = xchg(&io_end->bio, bio);
		ext4_put_io_end_defer(io_end);
	} else {
		/*
		 * Drop io_end reference early. Inode can get freed once
		 * we finish the bio.
		 */
		ext4_put_io_end_defer(io_end);
		ext4_finish_bio(bio);
		bio_put(bio);
	}
}

void ext4_io_submit(struct ext4_io_submit *io)
{
	struct bio *bio = io->io_bio;

	if (bio) {
		if (io->io_wbc->sync_mode == WB_SYNC_ALL)
			io->io_bio->bi_opf |= REQ_SYNC;
		blk_crypto_submit_bio(io->io_bio);
	}
	io->io_bio = NULL;
}

void ext4_io_submit_init(struct ext4_io_submit *io,
			 struct writeback_control *wbc)
{
	io->io_wbc = wbc;
	io->io_bio = NULL;
	io->io_end = NULL;
}

static void io_submit_init_bio(struct ext4_io_submit *io,
			       struct inode *inode,
			       struct folio *folio,
			       struct buffer_head *bh)
{
	struct bio *bio;

	/*
	 * bio_alloc will _always_ be able to allocate a bio if
	 * __GFP_DIRECT_RECLAIM is set, see comments for bio_alloc_bioset().
	 */
	bio = bio_alloc(bh->b_bdev, BIO_MAX_VECS, REQ_OP_WRITE, GFP_NOIO);
	fscrypt_set_bio_crypt_ctx(bio, inode, folio_pos(folio) + bh_offset(bh),
				  GFP_NOIO);
	bio->bi_iter.bi_sector = bh->b_blocknr * (bh->b_size >> 9);
	bio->bi_end_io = ext4_end_bio;
	bio->bi_private = ext4_get_io_end(io->io_end);
	bio->bi_write_hint = inode->i_write_hint;
	io->io_bio = bio;
	io->io_next_block = bh->b_blocknr;
	wbc_init_bio(io->io_wbc, bio);
}

static bool io_submit_need_new_bio(struct ext4_io_submit *io,
				   struct inode *inode,
				   struct folio *folio,
				   struct buffer_head *bh)
{
	if (bh->b_blocknr != io->io_next_block)
		return true;
	if (!fscrypt_mergeable_bio(io->io_bio, inode,
				   folio_pos(folio) + bh_offset(bh)))
		return true;
	return false;
}

static void io_submit_add_bh(struct ext4_io_submit *io,
			     struct inode *inode,
			     struct folio *folio,
			     struct folio *io_folio,
			     struct buffer_head *bh)
{
	if (io->io_bio && io_submit_need_new_bio(io, inode, folio, bh)) {
submit_and_retry:
		ext4_io_submit(io);
	}
	if (io->io_bio == NULL)
		io_submit_init_bio(io, inode, folio, bh);
	if (!bio_add_folio(io->io_bio, io_folio, bh->b_size, bh_offset(bh)))
		goto submit_and_retry;
	wbc_account_cgroup_owner(io->io_wbc, folio, bh->b_size);
	io->io_next_block++;
}

int ext4_bio_write_folio(struct ext4_io_submit *io, struct folio *folio,
		size_t len)
{
	struct folio *io_folio = folio;
	struct inode *inode = folio->mapping->host;
	unsigned block_start;
	struct buffer_head *bh, *head;
	int ret = 0;
	int nr_to_submit = 0;
	struct writeback_control *wbc = io->io_wbc;
	bool keep_towrite = false;

	BUG_ON(!folio_test_locked(folio));
	BUG_ON(folio_test_writeback(folio));

	/*
	 * Comments copied from block_write_full_folio:
	 *
	 * The folio straddles i_size.  It must be zeroed out on each and every
	 * writepage invocation because it may be mmapped.  "A file is mapped
	 * in multiples of the page size.  For a file that is not a multiple of
	 * the page size, the remaining memory is zeroed when mapped, and
	 * writes to that region are not written out to the file."
	 */
	if (len < folio_size(folio))
		folio_zero_segment(folio, len, folio_size(folio));
	/*
	 * In the first loop we prepare and mark buffers to submit. We have to
	 * mark all buffers in the folio before submitting so that
	 * folio_end_writeback() cannot be called from ext4_end_bio() when IO
	 * on the first buffer finishes and we are still working on submitting
	 * the second buffer.
	 */
	bh = head = folio_buffers(folio);
	do {
		block_start = bh_offset(bh);
		if (block_start >= len) {
			clear_buffer_dirty(bh);
			set_buffer_uptodate(bh);
			continue;
		}
		if (!buffer_dirty(bh) || buffer_delay(bh) ||
		    !buffer_mapped(bh) || buffer_unwritten(bh)) {
			/* A hole? We can safely clear the dirty bit */
			if (!buffer_mapped(bh))
				clear_buffer_dirty(bh);
			/*
			 * Keeping dirty some buffer we cannot write? Make sure
			 * to redirty the folio and keep TOWRITE tag so that
			 * racing WB_SYNC_ALL writeback does not skip the folio.
			 * This happens e.g. when doing writeout for
			 * transaction commit or when journalled data is not
			 * yet committed.
			 */
			if (buffer_dirty(bh) ||
			    (buffer_jbd(bh) && buffer_jbddirty(bh))) {
				if (!folio_test_dirty(folio))
					folio_redirty_for_writepage(wbc, folio);
				keep_towrite = true;
			}
			continue;
		}
		if (buffer_new(bh))
			clear_buffer_new(bh);
		set_buffer_async_write(bh);
		clear_buffer_dirty(bh);
		nr_to_submit++;
	} while ((bh = bh->b_this_page) != head);

	if (!nr_to_submit) {
		/*
		 * We have nothing to submit. Just cycle the folio through
		 * writeback state to properly update xarray tags.
		 */
		__folio_start_writeback(folio, keep_towrite);
		folio_end_writeback(folio);
		return 0;
	}

	bh = head = folio_buffers(folio);

	/*
	 * If any blocks are being written to an encrypted file, encrypt them
	 * into a bounce page.  For simplicity, just encrypt until the last
	 * block which might be needed.  This may cause some unneeded blocks
	 * (e.g. holes) to be unnecessarily encrypted, but this is rare and
	 * can't happen in the common case of blocksize == PAGE_SIZE.
	 */
	if (fscrypt_inode_uses_fs_layer_crypto(inode)) {
		gfp_t gfp_flags = GFP_NOFS;
		unsigned int enc_bytes = round_up(len, i_blocksize(inode));
		struct page *bounce_page;

		/*
		 * Since bounce page allocation uses a mempool, we can only use
		 * a waiting mask (i.e. request guaranteed allocation) on the
		 * first page of the bio.  Otherwise it can deadlock.
		 */
		if (io->io_bio)
			gfp_flags = GFP_NOWAIT;
	retry_encrypt:
		bounce_page = fscrypt_encrypt_pagecache_blocks(folio,
					enc_bytes, 0, gfp_flags);
		if (IS_ERR(bounce_page)) {
			ret = PTR_ERR(bounce_page);
			if (ret == -ENOMEM &&
			    (io->io_bio || wbc->sync_mode == WB_SYNC_ALL)) {
				gfp_t new_gfp_flags = GFP_NOFS;
				if (io->io_bio)
					ext4_io_submit(io);
				else
					new_gfp_flags |= __GFP_NOFAIL;
				memalloc_retry_wait(gfp_flags);
				gfp_flags = new_gfp_flags;
				goto retry_encrypt;
			}

			printk_ratelimited(KERN_ERR "%s: ret = %d\n", __func__, ret);
			folio_redirty_for_writepage(wbc, folio);
			do {
				if (buffer_async_write(bh)) {
					clear_buffer_async_write(bh);
					set_buffer_dirty(bh);
				}
				bh = bh->b_this_page;
			} while (bh != head);

			return ret;
		}
		io_folio = page_folio(bounce_page);
	}

	__folio_start_writeback(folio, keep_towrite);

	/* Now submit buffers to write */
	do {
		if (!buffer_async_write(bh))
			continue;
		io_submit_add_bh(io, inode, folio, io_folio, bh);
	} while ((bh = bh->b_this_page) != head);

	return 0;
}