linux

mirror of https://github.com/torvalds/linux.git synced 2026-05-25 15:41:52 +02:00

Author	SHA1	Message	Date
Qi Zheng	681d325b23	mm: thp: prevent memory cgroup release in folio_split_queue_lock{_irqsave}() In the near future, a folio will no longer pin its corresponding memory cgroup. To ensure safety, it will only be appropriate to hold the rcu read lock or acquire a reference to the memory cgroup returned by folio_memcg(), thereby preventing it from being released. In the current patch, the rcu read lock is employed to safeguard against the release of the memory cgroup in folio_split_queue_lock{_irqsave}(). Link: https://lore.kernel.org/ca2957c0df1126b2c71b40c738018fd5255525a6.1772711148.git.zhengqi.arch@bytedance.com Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com> Reviewed-by: Harry Yoo <harry.yoo@oracle.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Acked-by: David Hildenbrand (Red Hat) <david@kernel.org> Acked-by: Muchun Song <muchun.song@linux.dev> Cc: Allen Pais <apais@linux.microsoft.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Baoquan He <bhe@redhat.com> Cc: Chengming Zhou <chengming.zhou@linux.dev> Cc: Chen Ridong <chenridong@huawei.com> Cc: Hamza Mahfooz <hamzamahfooz@linux.microsoft.com> Cc: Hugh Dickins <hughd@google.com> Cc: Imran Khan <imran.f.khan@oracle.com> Cc: Kamalesh Babulal <kamalesh.babulal@oracle.com> Cc: Lance Yang <lance.yang@linux.dev> Cc: Liam Howlett <Liam.Howlett@oracle.com> Cc: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Michal Koutný <mkoutny@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Usama Arif <usamaarif642@gmail.com> Cc: Vlastimil Babka <vbabka@kernel.org> Cc: Wei Xu <weixugc@google.com> Cc: Yosry Ahmed <yosry@kernel.org> Cc: Yuanchu Xie <yuanchu@google.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-04-18 00:10:46 -07:00
Muchun Song	db128b2c6b	mm: rename unlock_page_lruvec_irq and its variants It is inappropriate to use folio_lruvec_lock() variants in conjunction with unlock_page_lruvec() variants, as this involves the inconsistent operation of locking a folio while unlocking a page. To rectify this, the functions unlock_page_lruvec{_irq, _irqrestore} are renamed to lruvec_unlock{_irq,_irqrestore}. Link: https://lore.kernel.org/4e5e05271a250df4d1812e1832be65636a78c957.1772711148.git.zhengqi.arch@bytedance.com Signed-off-by: Muchun Song <songmuchun@bytedance.com> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com> Acked-by: Roman Gushchin <roman.gushchin@linux.dev> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Harry Yoo <harry.yoo@oracle.com> Reviewed-by: Chen Ridong <chenridong@huawei.com> Acked-by: David Hildenbrand (Red Hat) <david@kernel.org> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Cc: Allen Pais <apais@linux.microsoft.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Baoquan He <bhe@redhat.com> Cc: Chengming Zhou <chengming.zhou@linux.dev> Cc: Hamza Mahfooz <hamzamahfooz@linux.microsoft.com> Cc: Hugh Dickins <hughd@google.com> Cc: Imran Khan <imran.f.khan@oracle.com> Cc: Kamalesh Babulal <kamalesh.babulal@oracle.com> Cc: Lance Yang <lance.yang@linux.dev> Cc: Liam Howlett <Liam.Howlett@oracle.com> Cc: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Michal Koutný <mkoutny@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Usama Arif <usamaarif642@gmail.com> Cc: Vlastimil Babka <vbabka@kernel.org> Cc: Wei Xu <weixugc@google.com> Cc: Yosry Ahmed <yosry@kernel.org> Cc: Yuanchu Xie <yuanchu@google.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-04-18 00:10:44 -07:00
Lance Yang	3bac011689	mm: fix deferred split queue races during migration migrate_folio_move() records the deferred split queue state from src and replays it on dst. Replaying it after remove_migration_ptes(src, dst, 0) makes dst visible before it is requeued, so a concurrent rmap-removal path can mark dst partially mapped and trip the WARN in deferred_split_folio(). Move the requeue before remove_migration_ptes() so dst is back on the deferred split queue before it becomes visible again. Because migration still holds dst locked at that point, teach deferred_split_scan() to requeue a folio when folio_trylock() fails. Otherwise a fully mapped underused folio can be dequeued by the shrinker and silently lost from split_queue. [ziy@nvidia.com: move the comment] Link: https://lkml.kernel.org/r/FB71A764-0F10-4E5A-B4A0-BA4C7F138408@nvidia.com Link: https://syzkaller.appspot.com/bug?extid=a7067a757858ac8eb085 Link: https://lkml.kernel.org/r/20260401131032.13011-1-lance.yang@linux.dev Fixes: 8a8ca142a488 ("mm: migrate: requeue destination folio on deferred split queue") Signed-off-by: Lance Yang <lance.yang@linux.dev> Signed-off-by: Zi Yan <ziy@nvidia.com> Reported-by: syzbot+a7067a757858ac8eb085@syzkaller.appspotmail.com Closes: https://lore.kernel.org/linux-mm/69ccb65b.050a0220.183828.003a.GAE@google.com/ Suggested-by: David Hildenbrand (Arm) <david@kernel.org> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Acked-by: Zi Yan <ziy@nvidia.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: Byungchul Park <byungchul@sk.com> Cc: David Hildenbrand <david@kernel.org> Cc: Deepanshu Kartikey <kartikey406@gmail.com> Cc: Dev Jain <dev.jain@arm.com> Cc: Gregory Price <gourry@gourry.net> Cc: "Huang, Ying" <ying.huang@linux.alibaba.com> Cc: Joshua Hahn <joshua.hahnjy@gmail.com> Cc: Lance Yang <lance.yang@linux.dev> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Cc: Matthew Brost <matthew.brost@intel.com> Cc: Nico Pache <npache@redhat.com> Cc: Rakie Kim <rakie.kim@sk.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: Ying Huang <ying.huang@linux.alibaba.com> Cc: Usama Arif <usama.arif@linux.dev> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-04-05 13:53:47 -07:00
Lorenzo Stoakes (Oracle)	bf263bcaf6	mm/huge_memory: add and use has_deposited_pgtable() Rather than thread has_deposited through zap_huge_pmd(), make things clearer by adding has_deposited_pgtable() with comments describing why in each case. [ljs@kernel.org: fix folio_put()-before-recheck issue, per Sashiko] Link: https://lkml.kernel.org/r/0a917f80-902f-49b0-a75f-1bbaf23d7f94@lucifer.local Link: https://lkml.kernel.org/r/f9db59ca90937e39913d50ecb4f662e2bad17bbb.1774029655.git.ljs@kernel.org Signed-off-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: Dev Jain <dev.jain@arm.com> Cc: Lance Yang <lance.yang@linux.dev> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Nico Pache <npache@redhat.com> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-04-05 13:53:47 -07:00
Lorenzo Stoakes (Oracle)	d80a9cb1a6	mm/huge_memory: add and use normal_or_softleaf_folio_pmd() Now we have pmd_to_softleaf_folio() available to us which also raises a CONFIG_DEBUG_VM warning if unexpectedly an invalid softleaf entry, we can now abstract folio handling altogether. vm_normal_folio() deals with the huge zero page (which is present), as well as PFN map/mixed map mappings in both cases returning NULL. Otherwise, we try to obtain the softleaf folio. This makes the logic far easier to comprehend and has it use the standard vm_normal_folio_pmd() path for decoding of present entries. Finally, we have to update the flushing logic to only do so if a folio is established. This patch also makes the 'is_present' value more accurate - because PFN map, mixed map and zero huge pages are present, just not present and 'normal'. [ljs@kernel.org: avoid bisection hazard] Link: https://lkml.kernel.org/r/d0cc6161-77a4-42ba-a411-96c23c78df1b@lucifer.local Link: https://lkml.kernel.org/r/c2be872d64ef9573b80727d9ab5446cf002f17b5.1774029655.git.ljs@kernel.org Signed-off-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: Dev Jain <dev.jain@arm.com> Cc: Lance Yang <lance.yang@linux.dev> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Nico Pache <npache@redhat.com> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-04-05 13:53:46 -07:00
Lorenzo Stoakes (Oracle)	f87854c909	mm/huge_memory: separate out the folio part of zap_huge_pmd() Place the part of the logic that manipulates counters and possibly updates the accessed bit of the folio into its own function to make zap_huge_pmd() more readable. Also rename flush_needed to is_present as we only require a flush for present entries. Additionally add comments as to why we're doing what we're doing with respect to softleaf entries. This also lays the ground for further refactoring. Link: https://lkml.kernel.org/r/6c4db67952f5529da4db102a6149b9050b5dda4e.1774029655.git.ljs@kernel.org Signed-off-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Cc: Barry Song <baohua@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: Dev Jain <dev.jain@arm.com> Cc: Lance Yang <lance.yang@linux.dev> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Nico Pache <npache@redhat.com> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-04-05 13:53:46 -07:00
Lorenzo Stoakes (Oracle)	1c6b7ff60b	mm/huge_memory: use mm instead of tlb->mm Reduce the repetition, and lay the ground for further refactorings by keeping this variable separate. Link: https://lkml.kernel.org/r/98104cde87e4b2aabeb16f236b8731591594457f.1774029655.git.ljs@kernel.org Signed-off-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Cc: Barry Song <baohua@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: Dev Jain <dev.jain@arm.com> Cc: Lance Yang <lance.yang@linux.dev> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Nico Pache <npache@redhat.com> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-04-05 13:53:46 -07:00
Lorenzo Stoakes (Oracle)	1fc034c1c9	mm/huge_memory: remove unnecessary sanity checks These checks have been in place since 2014, I think we can safely assume that we are in a place where we don't need these as runtime checks. In addition there are 4 other invocations of folio_remove_rmap_pmd(), none of which make this assertion. If we need to add this assertion, it should be in folio_remove_rmap_pmd(), and as a VM_WARN_ON_ONCE(), however these seem superfluous so just remove them. Link: https://lkml.kernel.org/r/0c4c5ab247c90f80cf44718e8124b217d6a22544.1774029655.git.ljs@kernel.org Signed-off-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: Dev Jain <dev.jain@arm.com> Cc: Lance Yang <lance.yang@linux.dev> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Nico Pache <npache@redhat.com> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-04-05 13:53:46 -07:00
Lorenzo Stoakes (Oracle)	d490940f3a	mm/huge_memory: deduplicate zap deposited table call Rather than having separate logic for each case determining whether to zap the deposited table, simply track this via a boolean. We default this to whether the architecture requires it, and update it as required elsewhere. Link: https://lkml.kernel.org/r/71f576a1fbcd27a86322d12caa937bcdacf75407.1774029655.git.ljs@kernel.org Signed-off-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Cc: Barry Song <baohua@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: Dev Jain <dev.jain@arm.com> Cc: Lance Yang <lance.yang@linux.dev> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Nico Pache <npache@redhat.com> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-04-05 13:53:46 -07:00
Lorenzo Stoakes (Oracle)	07f264680a	mm/huge_memory: remove unnecessary VM_BUG_ON_PAGE() This has been around since the beginnings of the THP implementation. I think we can safely assume that, if we have a THP folio, it will have a head page. Link: https://lkml.kernel.org/r/f3fa8eb4634ccb2e78209f570cc1a769a02ce93e.1774029655.git.ljs@kernel.org Signed-off-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Cc: Barry Song <baohua@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: Dev Jain <dev.jain@arm.com> Cc: Lance Yang <lance.yang@linux.dev> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Nico Pache <npache@redhat.com> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-04-05 13:53:46 -07:00
Lorenzo Stoakes (Oracle)	7217744e0a	mm/huge_memory: add a common exit path to zap_huge_pmd() Other than when we acquire the PTL, we always need to unlock the PTL, and optionally need to flush on exit. The code is currently very duplicated in this respect, so default flush_needed to false, set it true in the case in which it's required, then share the same logic for all exit paths. This also makes flush_needed make more sense as a function-scope value (we don't need to flush for the PFN map/mixed map, zero huge, error cases for instance). Link: https://lkml.kernel.org/r/6b281d8ed972dff0e89bdcbdd810c96c7ae8c9dc.1774029655.git.ljs@kernel.org Signed-off-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Cc: Barry Song <baohua@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: Dev Jain <dev.jain@arm.com> Cc: Lance Yang <lance.yang@linux.dev> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Nico Pache <npache@redhat.com> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-04-05 13:53:45 -07:00
Lorenzo Stoakes (Oracle)	7011140612	mm/huge_memory: handle buggy PMD entry in zap_huge_pmd() A recent bug I analysed managed to, through a bug in the userfaultfd implementation, reach an invalid point in the zap_huge_pmd() code where the PMD was none of: - A non-DAX, PFN or mixed map. - The huge zero folio - A present PMD entry - A softleaf entry The code at this point calls folio_test_anon() on a known-NULL folio. Having logic like this explicitly NULL dereference in the code is hard to understand, and makes debugging potentially more difficult. Add an else branch to handle this case and WARN(). No functional change intended. Link: https://lore.kernel.org/all/6b3d7ad7-49e1-407a-903d-3103704160d8@lucifer.local/ Link: https://lkml.kernel.org/r/fcf1f6de84a2ace188b6bf103fa15dde695f1ed8.1774029655.git.ljs@kernel.org Signed-off-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Cc: Barry Song <baohua@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: Dev Jain <dev.jain@arm.com> Cc: Lance Yang <lance.yang@linux.dev> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Nico Pache <npache@redhat.com> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-04-05 13:53:45 -07:00
Lorenzo Stoakes (Oracle)	b92b9d4f69	mm/huge_memory: have zap_huge_pmd return a boolean, add kdoc There's no need to use the ancient approach of returning an integer here, just return a boolean. Also update flush_needed to be a boolean, similarly. Also add a kdoc comment describing the function. No functional change intended. Link: https://lkml.kernel.org/r/132274566cd49d2960a2294c36dd2450593dfc55.1774029655.git.ljs@kernel.org Signed-off-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Acked-by: Qi Zheng <zhengqi.arch@bytedance.com> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Cc: Barry Song <baohua@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: Dev Jain <dev.jain@arm.com> Cc: Lance Yang <lance.yang@linux.dev> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Nico Pache <npache@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-04-05 13:53:45 -07:00
Lorenzo Stoakes (Oracle)	6886f93790	mm/huge: avoid big else branch in zap_huge_pmd() We don't need to have an extra level of indentation, we can simply exit early in the first two branches. No functional change intended. Link: https://lkml.kernel.org/r/6b4d5efdbf5554b8fe788f677d0b50f355eec999.1774029655.git.ljs@kernel.org Signed-off-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Acked-by: Qi Zheng <zhengqi.arch@bytedance.com> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Cc: Barry Song <baohua@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: Dev Jain <dev.jain@arm.com> Cc: Lance Yang <lance.yang@linux.dev> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Nico Pache <npache@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-04-05 13:53:45 -07:00
Lorenzo Stoakes (Oracle)	c0ea52c18c	mm/huge_memory: simplify vma_is_specal_huge() Patch series "mm/huge_memory: refactor zap_huge_pmd()", v3. zap_huge_pmd() is overly complicated, clean it up and also add an assert in the case that we encounter a buggy PMD entry that doesn't match expectations. This is motivated by a bug discovered [0] where the PMD entry was none of: * A non-DAX, PFN or mixed map. * The huge zero folio * A present PMD entry * A softleaf entry In zap_huge_pmd(), but due to the bug we managed to reach this code. It is useful to explicitly call this out rather than have an arbitrary NULL pointer dereference happen, which also improves understanding of what's going on. The series goes further to make use of vm_normal_folio_pmd() rather than implementing custom logic for retrieving the folio, and extends softleaf functionality to provide and use an equivalent softleaf function. This patch (of 13): This function is confused - it overloads the term 'special' yet again, checks for DAX but in many cases the code explicitly excludes DAX before invoking the predicate. It also unnecessarily checks for vma->vm_file - this has to be present for a driver to have set VMA_MIXEDMAP_BIT or VMA_PFNMAP_BIT. In fact, a far simpler form of this is to reverse the DAX predicate and return false if DAX is set. This makes sense from the point of view of 'special' as in vm_normal_page(), as DAX actually does potentially have retrievable folios. Also there's no need to have this in mm.h so move it to huge_memory.c. No functional change intended. 
Link: https://lkml.kernel.org/r/cover.1774029655.git.ljs@kernel.org Link: https://lkml.kernel.org/r/d2b65883dc4895f197c4b4a69fbf27a063463412.1774029655.git.ljs@kernel.org Link: https://lore.kernel.org/all/6b3d7ad7-49e1-407a-903d-3103704160d8@lucifer.local/ [0] Signed-off-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: Dev Jain <dev.jain@arm.com> Cc: Lance Yang <lance.yang@linux.dev> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Nico Pache <npache@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Zi Yan <ziy@nvidia.com> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-04-05 13:53:45 -07:00
Nico Pache	b90c453d26	mm: introduce is_pmd_order helper In order to add mTHP support to khugepaged, we will often be checking if a given order is (or is not) a PMD order. Some places in the kernel already use this check, so lets create a simple helper function to keep the code clean and readable. Link: https://lkml.kernel.org/r/20260325114022.444081-3-npache@redhat.com Signed-off-by: Nico Pache <npache@redhat.com> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Dev Jain <dev.jain@arm.com> Reviewed-by: Wei Yang <richard.weiyang@gmail.com> Reviewed-by: Lance Yang <lance.yang@linux.dev> Reviewed-by: Barry Song <baohua@kernel.org> Reviewed-by: Zi Yan <ziy@nvidia.com> Reviewed-by: Pedro Falcato <pfalcato@suse.de> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Suggested-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Brendan Jackman <jackmanb@google.com> Cc: Byungchul Park <byungchul@sk.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: David Rientjes <rientjes@google.com> Cc: Gregory Price <gourry@gourry.net> Cc: "Huang, Ying" <ying.huang@linux.alibaba.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: Jann Horn <jannh@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Joshua Hahn <joshua.hahnjy@gmail.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Matthew Brost <matthew.brost@intel.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Nanyong Sun <sunnanyong@huawei.com> Cc: Peter Xu <peterx@redhat.com> Cc: Rafael Aquini <raquini@redhat.com> Cc: Rakie Kim <rakie.kim@sk.com> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Shivank Garg <shivankg@amd.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Takashi Iwai (SUSE) <tiwai@suse.de> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com> Cc: Usama Arif <usamaarif642@gmail.com> Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Will Deacon <will@kernel.org> Cc: Yang Shi <yang@os.amperecomputing.com> Cc: Zach O'Keefe <zokeefe@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-04-05 13:53:29 -07:00
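The helper introduced by b90c453d26 can be pictured with a minimal user-space sketch. This is not the kernel's code: the `_sketch` names and the shift values (4 KiB base pages, 2 MiB PMD mappings) are assumptions chosen for illustration, and the real helper compares against the kernel's own PMD order constant.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed values for illustration only: 4 KiB base pages, 2 MiB PMD mappings. */
#define SKETCH_PAGE_SHIFT 12
#define SKETCH_PMD_SHIFT  21
#define SKETCH_PMD_ORDER  (SKETCH_PMD_SHIFT - SKETCH_PAGE_SHIFT)

/* Hypothetical stand-in for the helper: true only for the PMD order. */
static inline bool is_pmd_order_sketch(unsigned int order)
{
	return order == SKETCH_PMD_ORDER;
}

/* Hypothetical caller: number of base pages in a PMD-order folio. */
static inline unsigned long pmd_order_nr_pages_sketch(unsigned int order)
{
	/* The caller states its expectation through the helper. */
	assert(is_pmd_order_sketch(order));
	return 1UL << order;
}
```

Wrapping the comparison in a named predicate keeps call sites readable when both PMD-order and smaller (mTHP) orders flow through the same code path.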
Breno Leitao	35a01d9442	mm: huge_memory: refactor enabled_store() with set_global_enabled_mode() Refactor enabled_store() to use a new set_global_enabled_mode() helper. Introduce a separate enum global_enabled_mode and global_enabled_mode_strings[], mirroring the anon_enabled_mode pattern from the previous commit. A separate enum is necessary because the global THP setting does not support "inherit", only "always", "madvise", and "never". Reusing anon_enabled_mode would leave a NULL gap in the string array, causing sysfs_match_string() to stop early and fail to match entries after the gap. The helper uses the same loop pattern as set_anon_enabled_mode(), iterating over an array of flag bit positions and using test_and_set_bit()/test_and_clear_bit() to track whether the state actually changed. Link: https://lkml.kernel.org/r/20260317-thp_logs-v7-3-31eb98fa5a8b@debian.org Signed-off-by: Breno Leitao <leitao@debian.org> Reviewed-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Reviewed-by: Zi Yan <ziy@nvidia.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Wei Yang <richard.weiyang@gmail.com> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Cc: Barry Song <baohua@kernel.org> Cc: Brendan Jackman <jackmanb@google.com> Cc: Dev Jain <dev.jain@arm.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Lance Yang <lance.yang@linux.dev> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Nico Pache <npache@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Usama Arif <usamaarif642@gmail.com> Cc: Vlastimil Babka <vbabka@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-04-05 13:53:27 -07:00
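The NULL-gap problem that 35a01d9442 cites as the reason for a separate enum can be reproduced in user space. The sketch below is an illustration under assumptions: `match_string_sketch()` is a hypothetical stand-in that stops at the first NULL entry (the behaviour the commit message attributes to sysfs_match_string()), it uses plain strcmp() rather than sysfs-style newline-tolerant comparison, and the enum ordering is invented for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical stand-in for a table lookup that treats a NULL entry as
 * the end of the table, so entries after a gap are never examined.
 */
static int match_string_sketch(const char *const *table, size_t n,
			       const char *s)
{
	assert(table != NULL && s != NULL);
	for (size_t i = 0; i < n; i++) {
		if (table[i] == NULL)
			break;			/* gap: stop scanning early */
		if (strcmp(table[i], s) == 0)
			return (int)i;
	}
	return -1;				/* not found */
}

/* Assumed ordering, for illustration only. */
enum anon_mode_sketch { ANON_ALWAYS, ANON_INHERIT, ANON_MADVISE, ANON_NEVER, ANON_NR };
enum global_mode_sketch { GLOBAL_ALWAYS, GLOBAL_MADVISE, GLOBAL_NEVER, GLOBAL_NR };

/* Reusing the per-order enum for the global setting leaves "inherit" NULL. */
static const char *const reused_table[ANON_NR] = {
	[ANON_ALWAYS]  = "always",
	[ANON_MADVISE] = "madvise",
	[ANON_NEVER]   = "never",
};

/* A dedicated enum gives a dense table with no gap. */
static const char *const global_table[GLOBAL_NR] = {
	[GLOBAL_ALWAYS]  = "always",
	[GLOBAL_MADVISE] = "madvise",
	[GLOBAL_NEVER]   = "never",
};
```

With the reused table, a lookup of "madvise" or "never" fails because the scan ends at the empty "inherit" slot; with the dense table every supported mode is reachable.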
Breno Leitao	82d9ff648c	mm: huge_memory: refactor anon_enabled_store() with set_anon_enabled_mode() Consolidate the repeated spin_lock/set_bit/clear_bit pattern in anon_enabled_store() into a new set_anon_enabled_mode() helper that loops over an orders[] array, setting the bit for the selected mode and clearing the others. Introduce enum anon_enabled_mode and anon_enabled_mode_strings[] for the per-order anon THP setting. Use sysfs_match_string() with the anon_enabled_mode_strings[] table to replace the if/else chain of sysfs_streq() calls. The helper uses __test_and_set_bit()/__test_and_clear_bit() to track whether the state actually changed, so start_stop_khugepaged() is only called when needed. When the mode is unchanged, set_recommended_min_free_kbytes() is called directly to preserve the watermark recalculation behavior of the original code. Link: https://lkml.kernel.org/r/20260317-thp_logs-v7-2-31eb98fa5a8b@debian.org Signed-off-by: Breno Leitao <leitao@debian.org> Reviewed-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: Brendan Jackman <jackmanb@google.com> Cc: Dev Jain <dev.jain@arm.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Lance Yang <lance.yang@linux.dev> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Nico Pache <npache@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Usama Arif <usamaarif642@gmail.com> Cc: Vlastimil Babka <vbabka@kernel.org> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-04-05 13:53:27 -07:00
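The loop described in 82d9ff648c (set the bit for the selected mode, clear it in the others, and report whether anything changed) can be sketched in user space as follows. All names here are hypothetical, the bit helpers are simple non-atomic stand-ins rather than the kernel's, and the three-bitmap layout with "never" meaning "cleared everywhere" is an assumption made for the example. Locking is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <limits.h>

/* Assumed layout: one bitmap of orders per mode; "never" has no bitmap. */
enum mode_sketch { MODE_ALWAYS, MODE_INHERIT, MODE_MADVISE, MODE_NEVER };
#define NR_MASKS_SKETCH 3

/* Non-atomic stand-ins: return the previous value of the bit. */
static bool test_and_set_bit_sketch(unsigned int nr, unsigned long *word)
{
	unsigned long mask = 1UL << nr;
	bool old = (*word & mask) != 0;

	*word |= mask;
	return old;
}

static bool test_and_clear_bit_sketch(unsigned int nr, unsigned long *word)
{
	unsigned long mask = 1UL << nr;
	bool old = (*word & mask) != 0;

	*word &= ~mask;
	return old;
}

/*
 * Select `mode` for `order`: set the order's bit in the chosen bitmap and
 * clear it in every other one. Returns true only if some bit flipped, so
 * the caller can skip expensive follow-up work when nothing changed.
 */
static bool set_mode_sketch(unsigned long masks[NR_MASKS_SKETCH],
			    unsigned int order, enum mode_sketch mode)
{
	bool changed = false;

	assert(order < sizeof(unsigned long) * CHAR_BIT);
	for (int m = 0; m < NR_MASKS_SKETCH; m++) {
		if ((enum mode_sketch)m == mode)
			changed |= !test_and_set_bit_sketch(order, &masks[m]);
		else
			changed |= test_and_clear_bit_sketch(order, &masks[m]);
	}
	return changed;
}
```

Returning the "changed" flag is what lets the caller in the commit message restart khugepaged only when the mode really moved, and fall back to the cheaper watermark recalculation otherwise.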
Mike Rapoport (Microsoft)	9a1d0c738b	mm: rename my_zero_pfn() to zero_pfn() my_zero_pfn() is a silly name. Rename zero_pfn variable to zero_page_pfn and my_zero_pfn() function to zero_pfn(). While on it, move extern declarations of zero_page_pfn outside the functions that use it and add a comment about what ZERO_PAGE is. Link: https://lkml.kernel.org/r/20260211103141.3215197-3-rppt@kernel.org Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Andreas Larsson <andreas@gaisler.com> Cc: "Borislav Petkov (AMD)" <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christophe Leroy (CS GROUP) <chleroy@kernel.org> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David S. Miller <davem@davemloft.net> Cc: Dinh Nguyen <dinguyen@kernel.org> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Guo Ren <guoren@kernel.org> Cc: Helge Deller <deller@gmx.de> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Johannes Berg <johannes@sipsolutions.net> Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Magnus Lindholm <linmag7@gmail.com> Cc: Matt Turner <mattst88@gmail.com> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Hocko <mhocko@suse.com> Cc: Michal Simek <monstr@monstr.eu> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Richard Weinberger <richard@nod.at> Cc: Russell King <linux@armlinux.org.uk> Cc: Stafford Horne <shorne@gmail.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vineet Gupta <vgupta@kernel.org> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-04-05 13:53:01 -07:00
Chris Down	fae654083b	mm/huge_memory: fix use of NULL folio in move_pages_huge_pmd() move_pages_huge_pmd() handles UFFDIO_MOVE for both normal THPs and huge zero pages. For the huge zero page path, src_folio is explicitly set to NULL, and is used as a sentinel to skip folio operations like lock and rmap. In the huge zero page branch, src_folio is NULL, so folio_mk_pmd(NULL, pgprot) passes NULL through folio_pfn() and page_to_pfn(). With SPARSEMEM_VMEMMAP this silently produces a bogus PFN, installing a PMD pointing to non-existent physical memory. On other memory models it is a NULL dereference. Use page_folio(src_page) to obtain the valid huge zero folio from the page, which was obtained from pmd_page() and remains valid throughout. After commit `d82d09e482` ("mm/huge_memory: mark PMD mappings of the huge zero folio special"), moved huge zero PMDs must remain special so vm_normal_page_pmd() continues to treat them as special mappings. move_pages_huge_pmd() currently reconstructs the destination PMD in the huge zero page branch, which drops PMD state such as pmd_special() on architectures with CONFIG_ARCH_HAS_PTE_SPECIAL. As a result, vm_normal_page_pmd() can treat the moved huge zero PMD as a normal page and corrupt its refcount. Instead of reconstructing the PMD from the folio, derive the destination entry from src_pmdval after pmdp_huge_clear_flush(), then handle the PMD metadata the same way move_huge_pmd() does for moved entries by marking it soft-dirty and clearing uffd-wp. 
Link: https://lkml.kernel.org/r/a1e787dd-b911-474d-8570-f37685357d86@lucifer.local Fixes: `e3981db444` ("mm: add folio_mk_pmd()") Signed-off-by: Chris Down <chris@chrisdown.name> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Tested-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-03-10 16:01:49 -07:00
Zi Yan	577a1f495f	mm/huge_memory: fix a folio_split() race condition with folio_try_get() During a pagecache folio split, the values in the related xarray should not be changed from the original folio at xarray split time until all after-split folios are well formed and stored in the xarray. Current use of xas_try_split() in __split_unmapped_folio() lets some after-split folios show up at wrong indices in the xarray. When these misplaced after-split folios are unfrozen, before correct folios are stored via __xa_store(), and grabbed by folio_try_get(), they are returned to userspace at wrong file indices, causing data corruption. More detailed explanation is at the bottom. The reproducer is at: https://github.com/dfinity/thp-madv-remove-test It 1. creates a memfd, 2. forks, 3. in the child process, maps the file with large folios (via shmem code path) and reads the mapped file continuously with 16 threads, 4. in the parent process, uses madvise(MADV_REMOVE) to punch holes in the large folio. Data corruption can be observed without the fix. Basically, data from a wrong page->index is returned. Fix it by using the original folio in xas_try_split() calls, so that folio_try_get() can get the right after-split folios after the original folio is unfrozen. Uniform split, split_huge_page*(), is not affected, since it uses xas_split_alloc() and xas_split() only once and stores the original folio in the xarray. Change xas_split() used in uniform split branch to use the original folio to avoid confusion. The Fixes tag below points to the commit that introduces the code, but folio_split() is used in a later commit `7460b470a1` ("mm/truncate: use folio_split() in truncate operation"). 
More details: For example, a folio f is split non-uniformly into f, f2, f3, f4 like below: +----------------+---------+----+----+ \| f \| f2 \| f3 \| f4 \| +----------------+---------+----+----+ but the xarray would look like below after __split_unmapped_folio() is done: +----------------+---------+----+----+ \| f \| f2 \| f3 \| f3 \| +----------------+---------+----+----+ After __split_unmapped_folio(), the code changes the xarray and unfreezes after-split folios: 1. unfreezes f2, __xa_store(f2) 2. unfreezes f3, __xa_store(f3) 3. unfreezes f4, __xa_store(f4), which overwrites the second f3 to f4. 4. unfreezes f. Meanwhile, a parallel filemap_get_entry() can read the second f3 from the xarray and use folio_try_get() on it at step 2 when f3 is unfrozen. Then, f3 is wrongly returned to user. After the fix, the xarray looks like below after __split_unmapped_folio(): +----------------+---------+----+----+ \| f \| f \| f \| f \| +----------------+---------+----+----+ so that the race window no longer exists. 
[ziy@nvidia.com: move comment, per David] Link: https://lkml.kernel.org/r/5C9FA053-A4C6-4615-BE05-74E47A6462B3@nvidia.com Link: https://lkml.kernel.org/r/20260302203159.3208341-1-ziy@nvidia.com Fixes: `00527733d0` ("mm/huge_memory: add two new (not yet used) functions for folio_split()") Signed-off-by: Zi Yan <ziy@nvidia.com> Reported-by: Bas van Dijk <bas@dfinity.org> Closes: https://lore.kernel.org/all/CAKNNEtw5_kZomhkugedKMPOG-sxs5Q5OLumWJdiWXv+C9Yct0w@mail.gmail.com/ Tested-by: Lance Yang <lance.yang@linux.dev> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Wei Yang <richard.weiyang@gmail.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: Dev Jain <dev.jain@arm.com> Cc: Hugh Dickins <hughd@google.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Nico Pache <npache@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-03-04 09:44:24 -08:00
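The race window described in the commit above can be modelled in a few lines of userspace C. This is only a sketch under stated assumptions: `struct model_folio`, `model_try_get()` and `model_lookup()` are hypothetical names invented for illustration, the xarray is reduced to a plain array of four slots, and "frozen" is reduced to a boolean. It is not the kernel's implementation.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, heavily simplified folio: an id plus a "frozen" flag. */
struct model_folio {
	int id;
	bool frozen;	/* frozen folios cannot be grabbed by readers */
};

/* Models folio_try_get(): fails while the folio is frozen. */
static bool model_try_get(const struct model_folio *folio)
{
	return !folio->frozen;
}

/*
 * Models a lockless lookup: returns the id of the folio a reader would
 * obtain from slots[idx], or -1 if the reader has to retry.
 */
static int model_lookup(struct model_folio *const slots[], size_t idx)
{
	const struct model_folio *folio = slots[idx];

	return model_try_get(folio) ? folio->id : -1;
}
```

With the pre-fix layout at step 2 (slots `f, f2, f3, f3`, with f3 already unfrozen), a lookup of the last slot hands back f3 even though that range belongs to f4. With the post-fix layout at the same step (slots `f, f2, f3, f`, with f still frozen), the same lookup fails and the reader retries until f4 has been stored.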
Deepanshu Kartikey	dd085fe9a8	mm: thp: deny THP for files on anonymous inodes file_thp_enabled() incorrectly allows THP for files on anonymous inodes (e.g. guest_memfd and secretmem). These files are created via alloc_file_pseudo(), which does not call get_write_access() and leaves inode->i_writecount at 0. Combined with S_ISREG(inode->i_mode) being true, they appear as read-only regular files when CONFIG_READ_ONLY_THP_FOR_FS is enabled, making them eligible for THP collapse. Anonymous inodes can never pass the inode_is_open_for_write() check since their i_writecount is never incremented through the normal VFS open path. The right thing to do is to exclude them from THP eligibility altogether, since CONFIG_READ_ONLY_THP_FOR_FS was designed for real filesystem files (e.g. shared libraries), not for pseudo-filesystem inodes. For guest_memfd, this allows khugepaged and MADV_COLLAPSE to create large folios in the page cache via the collapse path, but the guest_memfd fault handler does not support large folios. This triggers WARN_ON_ONCE(folio_test_large(folio)) in kvm_gmem_fault_user_mapping(). For secretmem, collapse_file() tries to copy page contents through the direct map, but secretmem pages are removed from the direct map. This can result in a kernel crash: BUG: unable to handle page fault for address: ffff88810284d000 RIP: 0010:memcpy_orig+0x16/0x130 Call Trace: collapse_file hpage_collapse_scan_file madvise_collapse Secretmem is not affected by the crash on upstream as the memory failure recovery handles the failed copy gracefully, but it still triggers confusing false memory failure reports: Memory failure: 0x106d96f: recovery action for clean unevictable LRU page: Recovered Check IS_ANON_FILE(inode) in file_thp_enabled() to deny THP for all anonymous inode files. 
Link: https://syzkaller.appspot.com/bug?extid=33a04338019ac7e43a44 Link: https://lore.kernel.org/linux-mm/CAEvNRgHegcz3ro35ixkDw39ES8=U6rs6S7iP0gkR9enr7HoGtA@mail.gmail.com Link: https://lkml.kernel.org/r/20260214001535.435626-1-kartikey406@gmail.com Fixes: `7fbb5e1882` ("mm: remove VM_EXEC requirement for THP eligibility") Signed-off-by: Deepanshu Kartikey <Kartikey406@gmail.com> Reported-by: syzbot+33a04338019ac7e43a44@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=33a04338019ac7e43a44 Tested-by: syzbot+33a04338019ac7e43a44@syzkaller.appspotmail.com Tested-by: Lance Yang <lance.yang@linux.dev> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Reviewed-by: Barry Song <baohua@kernel.org> Reviewed-by: Ackerley Tng <ackerleytng@google.com> Tested-by: Ackerley Tng <ackerleytng@google.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Dev Jain <dev.jain@arm.com> Cc: Fangrui Song <i@maskray.me> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Nico Pache <npache@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Zi Yan <ziy@nvidia.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-02-24 11:13:26 -08:00
Linus Torvalds	bf4afc53b7	Convert 'alloc_obj' family to use the new default GFP_KERNEL argument This was done entirely with mindless brute force, using git grep -l '\<k[vmz]*alloc_objs*(.*, GFP_KERNEL)' \| xargs sed -i 's/\(alloc_objs*(.*\), GFP_KERNEL)/\1)/' to convert the new alloc_obj() users that had a simple GFP_KERNEL argument to just drop that argument. Note that due to the extreme simplicity of the scripting, any slightly more complex cases spread over multiple lines would not be triggered: they definitely exist, but this covers the vast bulk of the cases, and the resulting diff is also then easier to check automatically. For the same reason the 'flex' versions will be done as a separate conversion. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2026-02-21 17:09:51 -08:00
Kees Cook	69050f8d6d	treewide: Replace kmalloc with kmalloc_obj for non-scalar types This is the result of running the Coccinelle script from scripts/coccinelle/api/kmalloc_objs.cocci. The script is designed to avoid scalar types (which need careful case-by-case checking), and instead replace kmalloc-family calls that allocate struct or union object instances: Single allocations: kmalloc(sizeof(TYPE), ...) are replaced with: kmalloc_obj(TYPE, ...) Array allocations: kmalloc_array(COUNT, sizeof(TYPE), ...) are replaced with: kmalloc_objs(TYPE, COUNT, ...) Flex array allocations: kmalloc(struct_size(PTR, FAM, COUNT), ...) are replaced with: kmalloc_flex(PTR, FAM, COUNT, ...) (where TYPE may also be VAR) The resulting allocations no longer return "void *", instead returning "TYPE *". Signed-off-by: Kees Cook <kees@kernel.org>	2026-02-21 01:02:28 -08:00
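A rough userspace analogue of the typed-allocation idea in the commit above, as a sketch only: `malloc_obj()`, `malloc_objs()`, `struct point` and `point_new()` are hypothetical names built on the C library allocator. The kernel helpers take GFP flags and are defined differently; the point here is just that the macro yields a typed pointer instead of a bare `void *`.

```c
#include <stdlib.h>

/*
 * Hypothetical userspace stand-ins for the kernel's kmalloc_obj() and
 * kmalloc_objs(); names and definitions are illustrative assumptions.
 * Each yields a typed pointer rather than "void *".
 */
#define malloc_obj(TYPE)		((TYPE *)malloc(sizeof(TYPE)))
/* calloc() checks COUNT * sizeof(TYPE) for overflow and zeroes the memory. */
#define malloc_objs(TYPE, COUNT)	((TYPE *)calloc((COUNT), sizeof(TYPE)))

struct point {
	int x;
	int y;
};

/* Allocate one point and initialise it; returns NULL on allocation failure. */
static struct point *point_new(int x, int y)
{
	struct point *p = malloc_obj(struct point);

	if (p) {
		p->x = x;
		p->y = y;
	}
	return p;
}
```

Because the macro result already has type `struct point *`, assigning it to a pointer of a different struct type draws a compiler diagnostic, which a plain `malloc(sizeof(...))` call would not.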
Linus Torvalds	4cff5c05e0	mm.git review status for linus..mm-stable Everything: Total patches: 325 Reviews/patch: 1.39 Reviewed rate: 72% Excluding DAMON: Total patches: 262 Reviews/patch: 1.63 Reviewed rate: 82% Excluding DAMON and zram: Total patches: 248 Reviews/patch: 1.72 Reviewed rate: 86% - The 14 patch series "powerpc/64s: do not re-activate batched TLB flush" from Alexander Gordeev makes arch_{enter\|leave}_lazy_mmu_mode() nest properly. It adds a generic enter/leave layer and switches architectures to use it. Various hacks were removed in the process. - The 7 patch series "zram: introduce compressed data writeback" from Richard Chang and Sergey Senozhatsky implements data compression for zram writeback. - The 8 patch series "mm: folio_zero_user: clear page ranges" from David Hildenbrand adds clearing of contiguous page ranges for hugepages. Large improvements during demand faulting are demonstrated. - The 2 patch series "memcg cleanups" from Chen Ridong tidies up some memcg code. - The 12 patch series "mm/damon: introduce {,max_}nr_snapshots and tracepoint for damos stats" from SeongJae Park improves DAMOS stat's provided information, deterministic control, and readability. - The 3 patch series "selftests/mm: hugetlb cgroup charging: robustness fixes" from Li Wang fixes a few issues in the hugetlb cgroup charging selftests. - The 5 patch series "Fix va_high_addr_switch.sh test failure - again" from Chunyu Hu addresses several issues in the va_high_addr_switch test. - The 5 patch series "mm/damon/tests/core-kunit: extend existing test scenarios" from Shu Anzai improves the KUnit test coverage for DAMON. - The 2 patch series "mm/khugepaged: fix dirty page handling for MADV_COLLAPSE" from Shivank Garg fixes a glitch in khugepaged which was causing madvise(MADV_COLLAPSE) to transiently return -EAGAIN. 
- The 29 patch series "arch, mm: consolidate hugetlb early reservation" from Mike Rapoport reworks and consolidates a pile of straggly code related to reservation of hugetlb memory from bootmem and creation of CMA areas for hugetlb. - The 9 patch series "mm: clean up anon_vma implementation" from Lorenzo Stoakes cleans up the anon_vma implementation in various ways. - The 3 patch series "tweaks for __alloc_pages_slowpath()" from Vlastimil Babka does a little streamlining of the page allocator's slowpath code. - The 8 patch series "memcg: separate private and public ID namespaces" from Shakeel Butt cleans up the memcg ID code and prevents the internal-only private IDs from being exposed to userspace. - The 6 patch series "mm: hugetlb: allocate frozen gigantic folio" from Kefeng Wang cleans up the allocation of frozen folios and avoids some atomic refcount operations. - The 11 patch series "mm/damon: advance DAMOS-based LRU sorting" from SeongJae Park improves DAMOS's movement of memory between the active and inactive LRUs and adds auto-tuning of the ratio-based quotas and of monitoring intervals. - The 18 patch series "Support page table check on PowerPC" from Andrew Donnellan makes CONFIG_PAGE_TABLE_CHECK_ENFORCED work on powerpc. - The 3 patch series "nodemask: align nodes_and{,not} with underlying bitmap ops" from Yury Norov makes nodes_and() and nodes_andnot() propagate the return values from the underlying bit operations, enabling some cleanup in calling code. - The 5 patch series "mm/damon: hide kdamond and kdamond_lock from API callers" from SeongJae Park cleans up some DAMON internal interfaces. - The 4 patch series "mm/khugepaged: cleanups and scan limit fix" from Shivank Garg does some cleanup work in khugepaged and fixes a scan limit accounting issue. - The 24 patch series "mm: balloon infrastructure cleanups" from David Hildenbrand goes to town on the balloon infrastructure and its page migration function. Mainly cleanups, also some locking simplification. 
- The 2 patch series "mm/vmscan: add tracepoint and reason for kswapd_failures reset" from Jiayuan Chen adds additional tracepoints to the page reclaim code. - The 3 patch series "Replace wq users and add WQ_PERCPU to alloc_workqueue() users" from Marco Crivellari is part of Marco's kernel-wide migration from the legacy workqueue APIs over to the preferred unbound workqueues. - The 9 patch series "Various mm kselftests improvements/fixes" from Kevin Brodsky provides various unrelated improvements/fixes for the mm kselftests. - The 5 patch series "mm: accelerate gigantic folio allocation" from Kefeng Wang greatly speeds up gigantic folio allocation, mainly by avoiding unnecessary work in pfn_range_valid_contig(). - The 5 patch series "selftests/damon: improve leak detection and wss estimation reliability" from SeongJae Park improves the reliability of two of the DAMON selftests. - The 8 patch series "mm/damon: cleanup kdamond, damon_call(), damos filter and DAMON_MIN_REGION" from SeongJae Park does some cleanup work in the core DAMON code. - The 8 patch series "Docs/mm/damon: update intro, modules, maintainer profile, and misc" from SeongJae Park performs maintenance work on the DAMON documentation. - The 10 patch series "mm: add and use vma_assert_stabilised() helper" from Lorenzo Stoakes refactors and cleans up the core VMA code. The main aim here is to be able to use the mmap write lock's lockdep state to perform various assertions regarding the locking which the VMA code requires. - The 19 patch series "mm, swap: swap table phase II: unify swapin use" from Kairui Song removes some old swap code (swap cache bypassing and swap synchronization) which wasn't working very well. Various other cleanups and simplifications were made. The end result is a 20% speedup in one benchmark. 
- The 8 patch series "enable PT_RECLAIM on more 64-bit architectures" from Qi Zheng makes PT_RECLAIM available on 64-bit alpha, loongarch, mips, parisc, and um. Various cleanups were performed along the way. -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCaY1HfAAKCRDdBJ7gKXxA jqhZAP9H8ZlKKqCEgnr6U5XXmJ63Ep2FDQpl8p35yr9yVuU9+gEAgfyWiJ43l1fP rT0yjsUW3KQFBi/SEA3R6aYarmoIBgI= =+HLt -----END PGP SIGNATURE----- Merge tag 'mm-stable-2026-02-11-19-22' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: - "powerpc/64s: do not re-activate batched TLB flush" makes arch_{enter\|leave}_lazy_mmu_mode() nest properly (Alexander Gordeev) It adds a generic enter/leave layer and switches architectures to use it. Various hacks were removed in the process. - "zram: introduce compressed data writeback" implements data compression for zram writeback (Richard Chang and Sergey Senozhatsky) - "mm: folio_zero_user: clear page ranges" adds clearing of contiguous page ranges for hugepages. 
Large improvements during demand faulting are demonstrated (David Hildenbrand) - "memcg cleanups" tidies up some memcg code (Chen Ridong) - "mm/damon: introduce {,max_}nr_snapshots and tracepoint for damos stats" improves DAMOS stat's provided information, deterministic control, and readability (SeongJae Park) - "selftests/mm: hugetlb cgroup charging: robustness fixes" fixes a few issues in the hugetlb cgroup charging selftests (Li Wang) - "Fix va_high_addr_switch.sh test failure - again" addresses several issues in the va_high_addr_switch test (Chunyu Hu) - "mm/damon/tests/core-kunit: extend existing test scenarios" improves the KUnit test coverage for DAMON (Shu Anzai) - "mm/khugepaged: fix dirty page handling for MADV_COLLAPSE" fixes a glitch in khugepaged which was causing madvise(MADV_COLLAPSE) to transiently return -EAGAIN (Shivank Garg) - "arch, mm: consolidate hugetlb early reservation" reworks and consolidates a pile of straggly code related to reservation of hugetlb memory from bootmem and creation of CMA areas for hugetlb (Mike Rapoport) - "mm: clean up anon_vma implementation" cleans up the anon_vma implementation in various ways (Lorenzo Stoakes) - "tweaks for __alloc_pages_slowpath()" does a little streamlining of the page allocator's slowpath code (Vlastimil Babka) - "memcg: separate private and public ID namespaces" cleans up the memcg ID code and prevents the internal-only private IDs from being exposed to userspace (Shakeel Butt) - "mm: hugetlb: allocate frozen gigantic folio" cleans up the allocation of frozen folios and avoids some atomic refcount operations (Kefeng Wang) - "mm/damon: advance DAMOS-based LRU sorting" improves DAMOS's movement of memory between the active and inactive LRUs and adds auto-tuning of the ratio-based quotas and of monitoring intervals (SeongJae Park) - "Support page table check on PowerPC" makes CONFIG_PAGE_TABLE_CHECK_ENFORCED work on powerpc (Andrew Donnellan) - "nodemask: align nodes_and{,not} with underlying 
bitmap ops" makes nodes_and() and nodes_andnot() propagate the return values from the underlying bit operations, enabling some cleanup in calling code (Yury Norov) - "mm/damon: hide kdamond and kdamond_lock from API callers" cleans up some DAMON internal interfaces (SeongJae Park) - "mm/khugepaged: cleanups and scan limit fix" does some cleanup work in khugepaged and fixes a scan limit accounting issue (Shivank Garg) - "mm: balloon infrastructure cleanups" goes to town on the balloon infrastructure and its page migration function. Mainly cleanups, also some locking simplification (David Hildenbrand) - "mm/vmscan: add tracepoint and reason for kswapd_failures reset" adds additional tracepoints to the page reclaim code (Jiayuan Chen) - "Replace wq users and add WQ_PERCPU to alloc_workqueue() users" is part of Marco's kernel-wide migration from the legacy workqueue APIs over to the preferred unbound workqueues (Marco Crivellari) - "Various mm kselftests improvements/fixes" provides various unrelated improvements/fixes for the mm kselftests (Kevin Brodsky) - "mm: accelerate gigantic folio allocation" greatly speeds up gigantic folio allocation, mainly by avoiding unnecessary work in pfn_range_valid_contig() (Kefeng Wang) - "selftests/damon: improve leak detection and wss estimation reliability" improves the reliability of two of the DAMON selftests (SeongJae Park) - "mm/damon: cleanup kdamond, damon_call(), damos filter and DAMON_MIN_REGION" does some cleanup work in the core DAMON code (SeongJae Park) - "Docs/mm/damon: update intro, modules, maintainer profile, and misc" performs maintenance work on the DAMON documentation (SeongJae Park) - "mm: add and use vma_assert_stabilised() helper" refactors and cleans up the core VMA code. 
The main aim here is to be able to use the mmap write lock's lockdep state to perform various assertions regarding the locking which the VMA code requires (Lorenzo Stoakes) - "mm, swap: swap table phase II: unify swapin use" removes some old swap code (swap cache bypassing and swap synchronization) which wasn't working very well. Various other cleanups and simplifications were made. The end result is a 20% speedup in one benchmark (Kairui Song) - "enable PT_RECLAIM on more 64-bit architectures" makes PT_RECLAIM available on 64-bit alpha, loongarch, mips, parisc, and um. Various cleanups were performed along the way (Qi Zheng) * tag 'mm-stable-2026-02-11-19-22' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (325 commits) mm/memory: handle non-split locks correctly in zap_empty_pte_table() mm: move pte table reclaim code to memory.c mm: make PT_RECLAIM depends on MMU_GATHER_RCU_TABLE_FREE mm: convert __HAVE_ARCH_TLB_REMOVE_TABLE to CONFIG_HAVE_ARCH_TLB_REMOVE_TABLE config um: mm: enable MMU_GATHER_RCU_TABLE_FREE parisc: mm: enable MMU_GATHER_RCU_TABLE_FREE mips: mm: enable MMU_GATHER_RCU_TABLE_FREE LoongArch: mm: enable MMU_GATHER_RCU_TABLE_FREE alpha: mm: enable MMU_GATHER_RCU_TABLE_FREE mm: change mm/pt_reclaim.c to use asm/tlb.h instead of asm-generic/tlb.h mm/damon/stat: remove __read_mostly from memory_idle_ms_percentiles zsmalloc: make common caches global mm: add SPDX id lines to some mm source files mm/zswap: use %pe to print error pointers mm/vmscan: use %pe to print error pointers mm/readahead: fix typo in comment mm: khugepaged: fix NR_FILE_PAGES and NR_SHMEM in collapse_file() mm: refactor vma_map_pages to use vm_insert_pages mm/damon: unify address range representation with damon_addr_range mm/cma: replace snprintf with strscpy in cma_new_area ...	2026-02-12 11:32:37 -08:00
Matthew Wilcox (Oracle)	832d95b531	migrate: replace RMP_ flags with TTU_ flags Instead of translating between RMP_ and TTU_ flags, remove the RMP_ flags and just use the TTU_ flag space; there's plenty available. Possibly we should rename these to RMAP_ flags, and maybe even pass them in through rmap_walk_arg, but that can be done later. Link: https://lkml.kernel.org/r/20260109041345.3863089-3-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Acked-by: David Hildenbrand (Red Hat) <david@kernel.org> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Byungchul Park <byungchul@sk.com> Cc: Gregory Price <gourry@gourry.net> Cc: Jann Horn <jannh@google.com> Cc: Joshua Hahn <joshua.hahnjy@gmail.com> Cc: Lance Yang <lance.yang@linux.dev> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Matthew Brost <matthew.brost@intel.com> Cc: Rakie Kim <rakie.kim@sk.com> Cc: Rik van Riel <riel@surriel.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Ying Huang <ying.huang@linux.alibaba.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-01-26 20:02:33 -08:00
Al Viro	47b3b9bf93	simplify the callers of file_open_name() It accepts ERR_PTR() for name and does the right thing in that case. That allows the logic in callers to be simplified, making them trivial to switch to CLASS(filename). Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2026-01-13 15:18:08 -05:00
Linus Torvalds	2516a87153	Patch series in this pull request: - The 2 patch series "powerpc/pseries/cmm: two smaller fixes" from David Hildenbrand fixes a couple of minor things in ppc land. - The 4 patch series "Improve folio split related functions" from Zi Yan provides some cleanups and minorish fixes in the folio splitting code. -----BEGIN PGP SIGNATURE----- iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCaTseCwAKCRDdBJ7gKXxA jkCLAP9ttvtG7zhDf+tqvjCJFnybPEo6Z2B4Qx9g8i7s27gTZgD8DzYbIl+YdxrN /tS0tdgUfrJIXR2PzzipkydNrs+lxgg= =T65o -----END PGP SIGNATURE----- Merge tag 'mm-stable-2025-12-11-11-39' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull more MM updates from Andrew Morton: - "powerpc/pseries/cmm: two smaller fixes" (David Hildenbrand) fixes a couple of minor things in ppc land - "Improve folio split related functions" (Zi Yan) some cleanups and minorish fixes in the folio splitting code * tag 'mm-stable-2025-12-11-11-39' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: mm/damon/tests/core-kunit: avoid damos_test_commit stack warning mm: vmscan: correct nr_requested tracing in scan_folios MAINTAINERS: add idr core-api doc file to XARRAY mm/hugetlb: fix incorrect error return from hugetlb_reserve_pages() mm: fix CONFIG_STACK_GROWSUP typo in mm.h mm/huge_memory: fix folio split stats counting mm/huge_memory: make min_order_for_split() always return an order mm/huge_memory: replace can_split_folio() with direct refcount calculation mm/huge_memory: change folio_split_supported() to folio_check_splittable() mm/sparse: fix sparse_vmemmap_init_nid_early definition without CONFIG_SPARSEMEM powerpc/pseries/cmm: adjust BALLOON_MIGRATE when migrating pages powerpc/pseries/cmm: call balloon_devinfo_init() also without CONFIG_BALLOON_COMPACTION	2025-12-13 20:35:41 +12:00
Zi Yan	9dcdc0c207	mm/huge_memory: fix folio split stats counting The "return <error code>" statements for error checks at the beginning of __folio_split() skip necessary count_vm_event() and count_mthp_stat() at the end of the function. Fix these by replacing them with "ret = <error code>; goto out;". Link: https://lkml.kernel.org/r/20251126210618.1971206-5-ziy@nvidia.com Signed-off-by: Zi Yan <ziy@nvidia.com> Reviewed-by: Wei Yang <richard.weiyang@gmail.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Acked-by: David Hildenbrand (Red Hat) <david@kernel.org> Cc: Balbir Singh <balbirs@nvidia.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: Dev Jain <dev.jain@arm.com> Cc: Lance Yang <lance.yang@linux.dev> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: Nico Pache <npache@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-12-09 11:25:33 -08:00
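The pattern fixed by the commit above (an early `return` that bypasses accounting at the end of the function) can be sketched in plain C. Everything here is an illustrative assumption: `split_failed_events` stands in for the `count_vm_event()`/`count_mthp_stat()` calls, `SKETCH_EINVAL` is a local constant, and `split_before()`/`split_after()` are hypothetical functions, not `__folio_split()` itself.

```c
/* Hypothetical counter standing in for count_vm_event()/count_mthp_stat(). */
static unsigned int split_failed_events;

#define SKETCH_EINVAL 22	/* local stand-in for EINVAL */

/* Before: the early return skips the accounting at the end. */
static int split_before(int input_ok)
{
	int ret = 0;

	if (!input_ok)
		return -SKETCH_EINVAL;	/* failure is never counted */

	/* ... the actual split work would go here ... */

	if (ret)
		split_failed_events++;
	return ret;
}

/* After: every exit funnels through "out", so failures are counted. */
static int split_after(int input_ok)
{
	int ret = 0;

	if (!input_ok) {
		ret = -SKETCH_EINVAL;
		goto out;
	}

	/* ... the actual split work would go here ... */

out:
	if (ret)
		split_failed_events++;
	return ret;
}
```

Both variants return the same error code for bad input; only `split_after()` also bumps the counter, which is the behavioural difference the commit is after.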
Zi Yan	2f78910659	mm/huge_memory: make min_order_for_split() always return an order min_order_for_split() returns -EBUSY when the folio is truncated and cannot be split. In commit `77008e1b2e` ("mm/huge_memory: do not change split_huge_page*() target order silently"), memory_failure() does not handle it and pass -EBUSY to try_to_split_thp_page() directly. try_to_split_thp_page() returns -EINVAL since -EBUSY becomes 0xfffffff0 as new_order is unsigned int in __folio_split() and this large new_order is rejected as an invalid input. The code does not cause a bug. soft_offline_in_use_page() also uses min_order_for_split() but it always passes 0 as new_order for split. Fix it by making min_order_for_split() always return an order. When the given folio is truncated, namely folio->mapping == NULL, return 0 and let a subsequent split function handle the situation and return -EBUSY. Add kernel-doc to min_order_for_split() to clarify its use. Link: https://lkml.kernel.org/r/20251126210618.1971206-4-ziy@nvidia.com Signed-off-by: Zi Yan <ziy@nvidia.com> Reviewed-by: Wei Yang <richard.weiyang@gmail.com> Acked-by: David Hildenbrand (Red Hat) <david@kernel.org> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Balbir Singh <balbirs@nvidia.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: Dev Jain <dev.jain@arm.com> Cc: Lance Yang <lance.yang@linux.dev> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: Nico Pache <npache@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-12-09 11:25:33 -08:00
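The conversion mentioned in the commit above, where -EBUSY turns into 0xfffffff0 once it lands in an `unsigned int`, can be reproduced with a small sketch. Assumptions: `unsigned int` is 32 bits wide, `SKETCH_EBUSY` mirrors the Linux value of EBUSY (16), and both helper functions are hypothetical models rather than the kernel functions.

```c
#define SKETCH_EBUSY 16	/* value of EBUSY on Linux */

/* Hypothetical model of the old helper: an order, or -EBUSY if truncated. */
static int old_min_order_for_split(int truncated)
{
	return truncated ? -SKETCH_EBUSY : 0;
}

/* Models passing the int result into an "unsigned int new_order" parameter. */
static unsigned int as_new_order(int ret)
{
	/* Conversion is modulo 2^32, so -16 becomes 4294967280 (0xfffffff0). */
	return (unsigned int)ret;
}
```

An order of 0xfffffff0 is far outside any valid folio order, which is why the split path rejects it as invalid input instead of reporting the original -EBUSY.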
Zi Yan	5842bcbfc3	mm/huge_memory: replace can_split_folio() with direct refcount calculation can_split_folio() is just a refcount comparison, making sure only the split caller holds an extra pin. Open code it with folio_expected_ref_count() != folio_ref_count() - 1. For the extra_pins used by folio_ref_freeze(), add folio_cache_ref_count() to calculate it. Also replace folio_expected_ref_count() with folio_cache_ref_count() used by folio_ref_unfreeze(), since they are returning the same values when a folio is frozen and folio_cache_ref_count() does not have unnecessary folio_mapcount() in its implementation. Link: https://lkml.kernel.org/r/20251126210618.1971206-3-ziy@nvidia.com Signed-off-by: Zi Yan <ziy@nvidia.com> Suggested-by: David Hildenbrand (Red Hat) <david@kernel.org> Reviewed-by: Wei Yang <richard.weiyang@gmail.com> Acked-by: David Hildenbrand (Red Hat) <david@kernel.org> Cc: Balbir Singh <balbirs@nvidia.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: Dev Jain <dev.jain@arm.com> Cc: Lance Yang <lance.yang@linux.dev> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: Nico Pache <npache@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-12-09 11:25:32 -08:00
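The refcount comparison described in the commit above is small enough to sketch directly. This is a simplified model under assumptions: `struct sketch_folio` and `only_caller_holds_extra_pin()` are hypothetical, and the two fields merely stand in for the values that `folio_ref_count()` and `folio_expected_ref_count()` would return.

```c
#include <stdbool.h>

/* Hypothetical, simplified folio: only the two counts the check compares. */
struct sketch_folio {
	int ref_count;		/* stands in for folio_ref_count() */
	int expected_refs;	/* stands in for folio_expected_ref_count() */
};

/*
 * True when the only reference beyond the expected ones is the split
 * caller's own pin, i.e. expected == actual - 1.
 */
static bool only_caller_holds_extra_pin(const struct sketch_folio *folio)
{
	return folio->expected_refs == folio->ref_count - 1;
}
```

The commit open codes the inverse condition (`!=`) as the bail-out test; the helper here returns the positive form so that a true result means the split may proceed.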
Zi Yan	bdd0d69a32	mm/huge_memory: change folio_split_supported() to folio_check_splittable() Patch series "Improve folio split related functions", v4. This patchset improves several folio split related functions to avoid future misuse. The changes are: 1. Consolidated folio splittable checks by moving truncated folio check, huge zero folio check, and writeback folio check into folio_split_supported(). Changed the function return type. Renamed it to folio_check_splittable() for clarification. 2. Replaced can_split_folio() with open coded folio_expected_ref_count() and folio_ref_count() and introduced folio_cache_ref_count(). 3. Changed min_order_for_split() to always return an order. 4. Fixed folio split stats counting. Motivation ========== This is based on Wei's observation[1] and solves several potential issues: 1. Dereferencing NULL folio->mapping in try_folio_split_to_order() if it is called on truncated folios. 2. Not handling of negative return value of min_order_for_split() in mm/memory-failure.c There is no bug in the current code. This patch (of 4): folio_split_supported() used in try_folio_split_to_order() requires folio->mapping to be non NULL, but current try_folio_split_to_order() does not check it. There is no issue in the current code, since try_folio_split_to_order() is only used in truncate_inode_partial_folio(), where folio->mapping is not NULL. To prevent future misuse, move folio->mapping NULL check (i.e., folio is truncated) into folio_split_supported(). Since folio->mapping NULL check returns -EBUSY and folio_split_supported() == false means -EINVAL, change folio_split_supported() return type from bool to int and return error numbers accordingly. Rename folio_split_supported() to folio_check_splittable() to match the return type change. While at it, move is_huge_zero_folio() check and folio_test_writeback() check into folio_check_splittable() and add kernel-doc. 
Remove all warnings inside folio_check_splittable() and give warnings in __folio_split() instead, so that bool warns parameter can be removed. Link: https://lkml.kernel.org/r/20251126210618.1971206-1-ziy@nvidia.com Link: https://lkml.kernel.org/r/20251126210618.1971206-2-ziy@nvidia.com Signed-off-by: Zi Yan <ziy@nvidia.com> Reviewed-by: Wei Yang <richard.weiyang@gmail.com> Acked-by: Balbir Singh <balbirs@nvidia.com> Acked-by: David Hildenbrand (Red Hat) <david@kernel.org> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: Dev Jain <dev.jain@arm.com> Cc: Lance Yang <lance.yang@linux.dev> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: Nico Pache <npache@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-12-09 11:25:32 -08:00
Linus Torvalds	7203ca412f	Significant patch series in this merge are as follows: - The 10 patch series "__vmalloc()/kvmalloc() and no-block support" from Uladzislau Rezki reworks the vmalloc() code to support non-blocking allocations (GFP_ATOMIC, GFP_NOWAIT). - The 2 patch series "ksm: fix exec/fork inheritance" from xu xin fixes a rare case where the KSM MMF_VM_MERGE_ANY prctl state is not inherited across fork/exec. - The 4 patch series "mm/zswap: misc cleanup of code and documentations" from SeongJae Park does some light maintenance work on the zswap code. - The 5 patch series "mm/page_owner: add debugfs files 'show_handles' and 'show_stacks_handles'" from Mauricio Faria de Oliveira enhances the /sys/kernel/debug/page_owner debug feature. It adds unique identifiers to differentiate the various stack traces so that userspace monitoring tools can better match stack traces over time. - The 2 patch series "mm/page_alloc: pcp->batch cleanups" from Joshua Hahn makes some minor alterations to the page allocator's per-cpu-pages feature. - The 2 patch series "Improve UFFDIO_MOVE scalability by removing anon_vma lock" from Lokesh Gidra addresses a scalability issue in userfaultfd's UFFDIO_MOVE operation. - The 2 patch series "kasan: cleanups for kasan_enabled() checks" from Sabyrzhan Tasbolatov performs some cleanup in the KASAN code. - The 2 patch series "drivers/base/node: fold node register and unregister functions" from Donet Tom cleans up the NUMA node handling code a little. - The 4 patch series "mm: some optimizations for prot numa" from Kefeng Wang provides some cleanups and small optimizations to the NUMA allocation hinting code. - The 5 patch series "mm/page_alloc: Batch callers of free_pcppages_bulk" from Joshua Hahn addresses long lock hold times at boot on large machines. These were causing (harmless) softlockup warnings. 
- The 2 patch series "optimize the logic for handling dirty file folios during reclaim" from Baolin Wang removes some now-unnecessary work from page reclaim. - The 10 patch series "mm/damon: allow DAMOS auto-tuned for per-memcg per-node memory usage" from SeongJae Park enhances the DAMOS auto-tuning feature. - The 2 patch series "mm/damon: fixes for address alignment issues in DAMON_LRU_SORT and DAMON_RECLAIM" from Quanmin Yan fixes DAMON_LRU_SORT and DAMON_RECLAIM with certain userspace configuration. - The 15 patch series "expand mmap_prepare functionality, port more users" from Lorenzo Stoakes enhances the new(ish) file_operations.mmap_prepare() method and ports additional callsites from the old ->mmap() over to ->mmap_prepare(). - The 8 patch series "Fix stale IOTLB entries for kernel address space" from Lu Baolu fixes a bug (and possible security issue on non-x86) in the IOMMU code. In some situations the IOMMU could be left hanging onto a stale kernel pagetable entry. - The 4 patch series "mm/huge_memory: cleanup __split_unmapped_folio()" from Wei Yang cleans up and optimizes the folio splitting code. - The 5 patch series "mm, swap: misc cleanup and bugfix" from Kairui Song implements some cleanups and a minor fix in the swap discard code. - The 8 patch series "mm/damon: misc documentation fixups" from SeongJae Park does as advertised. - The 9 patch series "mm/damon: support pin-point targets removal" from SeongJae Park permits userspace to remove a specific monitoring target in the middle of the current targets list. - The 2 patch series "mm: MISC follow-up patches for linux/pgalloc.h" from Harry Yoo implements a couple of cleanups related to mm header file inclusion. - The 2 patch series "mm/swapfile.c: select swap devices of default priority round robin" from Baoquan He improves the selection of swap devices for NUMA machines. 
- The 3 patch series "mm: Convert memory block states (MEM_) macros to enums" from Israel Batista changes the memory block labels from macros to enums so they will appear in kernel debug info. - The 3 patch series "ksm: perform a range-walk to jump over holes in break_ksm" from Pedro Demarchi Gomes addresses an inefficiency when KSM unmerges an address range. - The 22 patch series "mm/damon/tests: fix memory bugs in kunit tests" from SeongJae Park fixes leaks and unhandled malloc() failures in DAMON userspace unit tests. - The 2 patch series "some cleanups for pageout()" from Baolin Wang cleans up a couple of minor things in the page scanner's writeback-for-eviction code. - The 2 patch series "mm/hugetlb: refactor sysfs/sysctl interfaces" from Hui Zhu moves hugetlb's sysfs/sysctl handling code into a new file. - The 9 patch series "introduce VM_MAYBE_GUARD and make it sticky" from Lorenzo Stoakes makes the VMA guard regions available in /proc/pid/smaps and improves the mergeability of guarded VMAs. - The 2 patch series "mm: perform guard region install/remove under VMA lock" from Lorenzo Stoakes reduces mmap lock contention for callers performing VMA guard region operations. - The 2 patch series "vma_start_write_killable" from Matthew Wilcox starts work in permitting applications to be killed when they are waiting on a read_lock on the VMA lock. - The 11 patch series "mm/damon/tests: add more tests for online parameters commit" from SeongJae Park adds additional userspace testing of DAMON's "commit" feature. - The 9 patch series "mm/damon: misc cleanups" from SeongJae Park does that. - The 2 patch series "make VM_SOFTDIRTY a sticky VMA flag" from Lorenzo Stoakes addresses the possible loss of a VMA's VM_SOFTDIRTY flag when that VMA is merged with another. - The 16 patch series "mm: support device-private THP" from Balbir Singh introduces support for Transparent Huge Page (THP) migration in zone device-private memory. 
- The 3 patch series "Optimize folio split in memory failure" from Zi Yan optimizes folio split operations in the memory failure code. - The 2 patch series "mm/huge_memory: Define split_type and consolidate split support checks" from Wei Yang provides some more cleanups in the folio splitting code. - The 16 patch series "mm: remove is_swap_[pte, pmd]() + non-swap entries, introduce leaf entries" from Lorenzo Stoakes cleans up our handling of pagetable leaf entries by introducing the concept of 'software leaf entries', of type softleaf_t. - The 4 patch series "reparent the THP split queue" from Muchun Song reparents the THP split queue to its parent memcg. This is in preparation for addressing the long-standing "dying memcg" problem, wherein dead memcg's linger for too long, consuming memory resources. - The 3 patch series "unify PMD scan results and remove redundant cleanup" from Wei Yang does a little cleanup in the hugepage collapse code. - The 6 patch series "zram: introduce writeback bio batching" from Sergey Senozhatsky improves zram writeback efficiency by introducing batched bio writeback support. - The 4 patch series "memcg: cleanup the memcg stats interfaces" from Shakeel Butt cleans up our handling of the interrupt safety of some memcg stats. - The 4 patch series "make vmalloc gfp flags usage more apparent" from Vishal Moola cleans up vmalloc's handling of incoming GFP flags. - The 6 patch series "mm: Add soft-dirty and uffd-wp support for RISC-V" from Chunyan Zhang teaches soft dirty and userfaultfd write protect tracking to use RISC-V's Svrsw60t59b extension. - The 5 patch series "mm: swap: small fixes and comment cleanups" from Youngjun Park fixes a small bug and cleans up some of the swap code. - The 4 patch series "initial work on making VMA flags a bitmap" from Lorenzo Stoakes starts work on converting the vma struct's flags to a bitmap, so we stop running out of them, especially on 32-bit. 
- The 2 patch series "mm/swapfile: fix and cleanup swap list iterations" from Youngjun Park addresses a possible bug in the swap discard code and cleans things up a little. Merge tag 'mm-stable-2025-12-03-21-26' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: "__vmalloc()/kvmalloc() and no-block support" (Uladzislau Rezki) Rework the vmalloc() code to support non-blocking allocations (GFP_ATOMIC, GFP_NOWAIT) "ksm: fix exec/fork inheritance" (xu xin) Fix a rare case where the KSM MMF_VM_MERGE_ANY prctl state is not inherited across fork/exec "mm/zswap: misc cleanup of code and documentations" (SeongJae Park) Some light maintenance work on the zswap code "mm/page_owner: add debugfs files 'show_handles' and 'show_stacks_handles'" (Mauricio Faria de Oliveira) Enhance the /sys/kernel/debug/page_owner debug feature by adding unique identifiers to differentiate the various stack traces so that userspace monitoring tools can better match stack traces over time "mm/page_alloc: pcp->batch cleanups" (Joshua Hahn) Minor alterations to the page allocator's per-cpu-pages feature "Improve UFFDIO_MOVE scalability by removing anon_vma lock" (Lokesh Gidra) Address a scalability issue in userfaultfd's UFFDIO_MOVE operation "kasan: cleanups for kasan_enabled() checks" (Sabyrzhan Tasbolatov) "drivers/base/node: fold node register and unregister functions" (Donet Tom) Clean up the NUMA node handling code a little "mm: some optimizations for prot numa" (Kefeng Wang) Cleanups and small optimizations to the NUMA allocation hinting code "mm/page_alloc: Batch callers of free_pcppages_bulk" (Joshua Hahn) Address long lock hold times at boot on large machines. 
These were causing (harmless) softlockup warnings "optimize the logic for handling dirty file folios during reclaim" (Baolin Wang) Remove some now-unnecessary work from page reclaim "mm/damon: allow DAMOS auto-tuned for per-memcg per-node memory usage" (SeongJae Park) Enhance the DAMOS auto-tuning feature "mm/damon: fixes for address alignment issues in DAMON_LRU_SORT and DAMON_RECLAIM" (Quanmin Yan) Fix DAMON_LRU_SORT and DAMON_RECLAIM with certain userspace configuration "expand mmap_prepare functionality, port more users" (Lorenzo Stoakes) Enhance the new(ish) file_operations.mmap_prepare() method and port additional callsites from the old ->mmap() over to ->mmap_prepare() "Fix stale IOTLB entries for kernel address space" (Lu Baolu) Fix a bug (and possible security issue on non-x86) in the IOMMU code. In some situations the IOMMU could be left hanging onto a stale kernel pagetable entry "mm/huge_memory: cleanup __split_unmapped_folio()" (Wei Yang) Clean up and optimize the folio splitting code "mm, swap: misc cleanup and bugfix" (Kairui Song) Some cleanups and a minor fix in the swap discard code "mm/damon: misc documentation fixups" (SeongJae Park) "mm/damon: support pin-point targets removal" (SeongJae Park) Permit userspace to remove a specific monitoring target in the middle of the current targets list "mm: MISC follow-up patches for linux/pgalloc.h" (Harry Yoo) A couple of cleanups related to mm header file inclusion "mm/swapfile.c: select swap devices of default priority round robin" (Baoquan He) improve the selection of swap devices for NUMA machines "mm: Convert memory block states (MEM_) macros to enums" (Israel Batista) Change the memory block labels from macros to enums so they will appear in kernel debug info "ksm: perform a range-walk to jump over holes in break_ksm" (Pedro Demarchi Gomes) Address an inefficiency when KSM unmerges an address range "mm/damon/tests: fix memory bugs in kunit tests" (SeongJae Park) Fix leaks and unhandled malloc() 
failures in DAMON userspace unit tests "some cleanups for pageout()" (Baolin Wang) Clean up a couple of minor things in the page scanner's writeback-for-eviction code "mm/hugetlb: refactor sysfs/sysctl interfaces" (Hui Zhu) Move hugetlb's sysfs/sysctl handling code into a new file "introduce VM_MAYBE_GUARD and make it sticky" (Lorenzo Stoakes) Make the VMA guard regions available in /proc/pid/smaps and improve the mergeability of guarded VMAs "mm: perform guard region install/remove under VMA lock" (Lorenzo Stoakes) Reduce mmap lock contention for callers performing VMA guard region operations "vma_start_write_killable" (Matthew Wilcox) Start work on permitting applications to be killed when they are waiting on a read_lock on the VMA lock "mm/damon/tests: add more tests for online parameters commit" (SeongJae Park) Add additional userspace testing of DAMON's "commit" feature "mm/damon: misc cleanups" (SeongJae Park) "make VM_SOFTDIRTY a sticky VMA flag" (Lorenzo Stoakes) Address the possible loss of a VMA's VM_SOFTDIRTY flag when that VMA is merged with another "mm: support device-private THP" (Balbir Singh) Introduce support for Transparent Huge Page (THP) migration in zone device-private memory "Optimize folio split in memory failure" (Zi Yan) "mm/huge_memory: Define split_type and consolidate split support checks" (Wei Yang) Some more cleanups in the folio splitting code "mm: remove is_swap_[pte, pmd]() + non-swap entries, introduce leaf entries" (Lorenzo Stoakes) Clean up our handling of pagetable leaf entries by introducing the concept of 'software leaf entries', of type softleaf_t "reparent the THP split queue" (Muchun Song) Reparent the THP split queue to its parent memcg. 
This is in preparation for addressing the long-standing "dying memcg" problem, wherein dead memcg's linger for too long, consuming memory resources "unify PMD scan results and remove redundant cleanup" (Wei Yang) A little cleanup in the hugepage collapse code "zram: introduce writeback bio batching" (Sergey Senozhatsky) Improve zram writeback efficiency by introducing batched bio writeback support "memcg: cleanup the memcg stats interfaces" (Shakeel Butt) Clean up our handling of the interrupt safety of some memcg stats "make vmalloc gfp flags usage more apparent" (Vishal Moola) Clean up vmalloc's handling of incoming GFP flags "mm: Add soft-dirty and uffd-wp support for RISC-V" (Chunyan Zhang) Teach soft dirty and userfaultfd write protect tracking to use RISC-V's Svrsw60t59b extension "mm: swap: small fixes and comment cleanups" (Youngjun Park) Fix a small bug and clean up some of the swap code "initial work on making VMA flags a bitmap" (Lorenzo Stoakes) Start work on converting the vma struct's flags to a bitmap, so we stop running out of them, especially on 32-bit "mm/swapfile: fix and cleanup swap list iterations" (Youngjun Park) Address a possible bug in the swap discard code and clean things up a little [ This merge also reverts commit `ebb9aeb980` ("vfio/nvgrace-gpu: register device memory for poison handling") because it looks broken to me, I've asked for clarification - Linus ] * tag 'mm-stable-2025-12-03-21-26' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (321 commits) mm: fix vma_start_write_killable() signal handling mm/swapfile: use plist_for_each_entry in __folio_throttle_swaprate mm/swapfile: fix list iteration when next node is removed during discard fs/proc/task_mmu.c: fix make_uffd_wp_huge_pte() huge pte handling mm/kfence: add reboot notifier to disable KFENCE on shutdown memcg: remove inc/dec_lruvec_kmem_state helpers selftests/mm/uffd: initialize char variable to Null mm: fix DEBUG_RODATA_TEST indentation in Kconfig mm: introduce 
VMA flags bitmap type tools/testing/vma: eliminate dependency on vma->__vm_flags mm: simplify and rename mm flags function for clarity mm: declare VMA flags by bit zram: fix a spelling mistake mm/page_alloc: optimize lowmem_reserve max lookup using its semantic monotonicity mm/vmscan: skip increasing kswapd_failures when reclaim was boosted pagemap: update BUDDY flag documentation mm: swap: remove scan_swap_map_slots() references from comments mm: swap: change swap_alloc_slow() to void mm, swap: remove redundant comment for read_swap_cache_async mm, swap: use SWP_SOLIDSTATE to determine if swap is rotational ...	2025-12-05 13:52:43 -08:00
Linus Torvalds	44fc84337b	arm64 updates for 6.19: Core features: - Basic Arm MPAM (Memory system resource Partitioning And Monitoring) driver under drivers/resctrl/ which makes use of the fs/resctrl/ API Perf and PMU: - Avoid cycle counter on multi-threaded CPUs - Extend CSPMU device probing and add additional filtering support for NVIDIA implementations - Add support for the PMUs on the NoC S3 interconnect - Add additional compatible strings for new Cortex and C1 CPUs - Add support for data source filtering to the SPE driver - Add support for i.MX8QM and "DB" PMU in the imx PMU driver Memory management: - Avoid broadcast TLBI if page reused in write fault - Elide TLB invalidation if the old PTE was not valid - Drop redundant cpu_set__tcr_t0sz() macros - Propagate pgtable_alloc() errors outside of __create_pgd_mapping() - Propagate return value from __change_memory_common() ACPI and EFI: - Call EFI runtime services without disabling preemption - Remove unused ACPI function Miscellaneous: - ptrace support to disable streaming on SME-only systems - Improve sysreg generation to include a 'Prefix' descriptor - Replace __ASSEMBLY__ with __ASSEMBLER__ - Align register dumps in the kselftest zt-test - Remove some no longer used macros/functions - Various spelling corrections
Merge tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux Pull arm64 updates from Catalin Marinas: "These are the arm64 updates for 6.19. The biggest part is the Arm MPAM driver under drivers/resctrl/. There's a patch touching mm/ to handle spurious faults for huge pmd (similar to the pte version). The corresponding arm64 part allows us to avoid the TLB maintenance if a (huge) page is reused after a write fault. There's EFI refactoring to allow runtime services with preemption enabled and the rest is the usual perf/PMU updates and several cleanups/typos. Summary: Core features: - Basic Arm MPAM (Memory system resource Partitioning And Monitoring) driver under drivers/resctrl/ which makes use of the fs/resctrl/ API Perf and PMU: - Avoid cycle counter on multi-threaded CPUs - Extend CSPMU device probing and add additional filtering support for NVIDIA implementations - Add support for the PMUs on the NoC S3 interconnect - Add additional compatible strings for new Cortex and C1 CPUs - Add support for data source filtering to the SPE driver - Add support for i.MX8QM and "DB" PMU in the imx PMU driver Memory management: - Avoid broadcast TLBI if page reused in write fault - Elide TLB invalidation if the old PTE was not valid - Drop redundant cpu_set__tcr_t0sz() macros - Propagate pgtable_alloc() errors outside of __create_pgd_mapping() - Propagate return value from __change_memory_common() ACPI and EFI: - Call EFI runtime services without disabling preemption - Remove unused ACPI function Miscellaneous: - ptrace support to disable streaming on SME-only systems - Improve sysreg generation to include a 'Prefix' descriptor - Replace __ASSEMBLY__ with __ASSEMBLER__ - Align register dumps in the kselftest zt-test - Remove some no longer used macros/functions - Various spelling corrections" * tag 
'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (94 commits) arm64/mm: Document why linear map split failure upon vm_reset_perms is not problematic arm64/pageattr: Propagate return value from __change_memory_common arm64/sysreg: Remove unused define ARM64_FEATURE_FIELD_BITS KVM: arm64: selftests: Consider all 7 possible levels of cache KVM: arm64: selftests: Remove ARM64_FEATURE_FIELD_BITS and its last user arm64: atomics: lse: Remove unused parameters from ATOMIC_FETCH_OP_AND macros Documentation/arm64: Fix the typo of register names ACPI: GTDT: Get rid of acpi_arch_timer_mem_init() perf: arm_spe: Add support for filtering on data source perf: Add perf_event_attr::config4 perf/imx_ddr: Add support for PMU in DB (system interconnects) perf/imx_ddr: Get and enable optional clks perf/imx_ddr: Move ida_alloc() from ddr_perf_init() to ddr_perf_probe() dt-bindings: perf: fsl-imx-ddr: Add compatible string for i.MX8QM, i.MX8QXP and i.MX8DXL arm64: remove duplicate ARCH_HAS_MEM_ENCRYPT arm64: mm: use untagged address to calculate page index MAINTAINERS: new entry for MPAM Driver arm_mpam: Add kunit tests for props_mismatch() arm_mpam: Add kunit test for bitmap reset arm_mpam: Add helper to reset saved mbwu state ...	2025-12-02 17:03:55 -08:00
Chunyan Zhang	277a1ae387	mm: softdirty: add pgtable_supports_soft_dirty() Patch series "mm: Add soft-dirty and uffd-wp support for RISC-V", v15. This patchset adds support for Svrsw60t59b [1] extension which is ratified now, also add soft dirty and userfaultfd write protect tracking for RISC-V. The patches 1 and 2 add macros to allow architectures to define their own checks if the soft-dirty / uffd_wp PTE bits are available, in other words for RISC-V, the Svrsw60t59b extension is supported on which device the kernel is running. Also patch1-2 are removing "ifdef CONFIG_MEM_SOFT_DIRTY" "ifdef CONFIG_HAVE_ARCH_USERFAULTFD_WP" and "ifdef CONFIG_PTE_MARKER_UFFD_WP" in favor of checks which if not overridden by the architecture, no change in behavior is expected. This patchset has been tested with kselftest mm suite in which soft-dirty, madv_populate, test_unmerge_uffd_wp, and uffd-unit-tests run and pass, and no regressions are observed in any of the other tests. This patch (of 6): Some platforms can customize the PTE PMD entry soft-dirty bit making it unavailable even if the architecture provides the resource. Add an API which architectures can define their specific implementations to detect if soft-dirty bit is available on which device the kernel is running. This patch is removing "ifdef CONFIG_MEM_SOFT_DIRTY" in favor of pgtable_supports_soft_dirty() checks that defaults to IS_ENABLED(CONFIG_MEM_SOFT_DIRTY), if not overridden by the architecture, no change in behavior is expected. We make sure to never set VM_SOFTDIRTY if !pgtable_supports_soft_dirty(), so we will never run into VM_SOFTDIRTY checks. 
[lorenzo.stoakes@oracle.com: fix VMA selftests] Link: https://lkml.kernel.org/r/dac6ddfe-773a-43d5-8f69-021b9ca4d24b@lucifer.local Link: https://lkml.kernel.org/r/20251113072806.795029-1-zhangchunyan@iscas.ac.cn Link: https://lkml.kernel.org/r/20251113072806.795029-2-zhangchunyan@iscas.ac.cn Link: https://github.com/riscv-non-isa/riscv-iommu/pull/543 [1] Signed-off-by: Chunyan Zhang <zhangchunyan@iscas.ac.cn> Acked-by: David Hildenbrand <david@redhat.com> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Alexandre Ghiti <alex@ghiti.fr> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Christian Brauner <brauner@kernel.org> Cc: Conor Dooley <conor@kernel.org> Cc: Deepak Gupta <debug@rivosinc.com> Cc: Jan Kara <jack@suse.cz> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Peter Xu <peterx@redhat.com> Cc: Rob Herring <robh@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Yuanchu Xie <yuanchu@google.com> Cc: Alexandre Ghiti <alexghiti@rivosinc.com> Cc: Andrew Jones <ajones@ventanamicro.com> Cc: Conor Dooley <conor.dooley@microchip.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-11-24 15:08:54 -08:00
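The "default unless the architecture overrides it" pattern that the commit above describes for pgtable_supports_soft_dirty() can be sketched in plain C. This is a minimal userspace illustration, not the kernel's definition: every `demo_`/`DEMO_` name is hypothetical, and `DEMO_CONFIG_MEM_SOFT_DIRTY` merely stands in for IS_ENABLED(CONFIG_MEM_SOFT_DIRTY).

```c
#include <stdbool.h>

/* Stand-in for IS_ENABLED(CONFIG_MEM_SOFT_DIRTY); assumed enabled here. */
#define DEMO_CONFIG_MEM_SOFT_DIRTY 1

/* Hypothetical flag an architecture could set after probing the hardware. */
bool demo_arch_has_soft_dirty_bit;

/*
 * An architecture needing a runtime check would define the macro first,
 * for example by building with -DDEMO_ARCH_OVERRIDES (an assumption of
 * this sketch, not a real kernel option).
 */
#ifdef DEMO_ARCH_OVERRIDES
#define demo_pgtable_supports_soft_dirty() (demo_arch_has_soft_dirty_bit)
#endif

/* Generic fallback: a compile-time constant, so behaviour is unchanged. */
#ifndef demo_pgtable_supports_soft_dirty
#define demo_pgtable_supports_soft_dirty() (DEMO_CONFIG_MEM_SOFT_DIRTY != 0)
#endif

#define DEMO_VM_SOFTDIRTY 0x1UL

/* Never hand out the soft-dirty flag when the page table cannot track it. */
static unsigned long demo_initial_vm_flags(void)
{
	return demo_pgtable_supports_soft_dirty() ? DEMO_VM_SOFTDIRTY : 0UL;
}
```

Because the flag is never set when the check fails, later code that tests for the flag needs no extra guard, which mirrors the reasoning in the commit message.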
Shakeel Butt	c1bd09994c	memcg: remove __lruvec_stat_mod_folio __lruvec_stat_mod_folio() is already safe against irqs, so there is no need to have a separate interface (i.e. lruvec_stat_mod_folio) which wraps calls to it with irq disabling and reenabling. Let's rename __lruvec_stat_mod_folio() to lruvec_stat_mod_folio(). Link: https://lkml.kernel.org/r/20251110232008.1352063-5-shakeel.butt@linux.dev Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev> Reviewed-by: Harry Yoo <harry.yoo@oracle.com> Acked-by: Roman Gushchin <roman.gushchin@linux.dev> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-11-24 15:08:54 -08:00
Balbir Singh	cab812d9c9	mm/huge_memory.c: introduce folio_split_unmapped Unmapped was added as a parameter to __folio_split() and related call sites to support splitting of folios already in the midst of a migration. This special case arose for device private folio migration since during migration there could be a disconnect between source and destination on the folio size. Introduce folio_split_unmapped() to handle this special case. Also refactor code and add __folio_freeze_and_split_unmapped() helper that is common to both __folio_split() and folio_split_unmapped(). This in turn removes the special casing introduced by the unmapped parameter in __folio_split(). [balbirs@nvidia.com: v2] Link: https://lkml.kernel.org/r/20251115084041.3914728-1-balbirs@nvidia.com [balbirs@nvidia.com: fix clang-20 build] Link: https://lkml.kernel.org/r/20251120134232.3588203-1-balbirs@nvidia.com [akpm@linux-foundation.org: add `inline' to shmem_uncharge() stub, per Balbir] Link: https://lkml.kernel.org/r/20251114012228.2634882-1-balbirs@nvidia.com Signed-off-by: Balbir Singh <balbirs@nvidia.com> Suggested-by: Zi Yan <ziy@nvidia.com> Acked-by: Zi Yan <ziy@nvidia.com> Cc: David Hildenbrand <david@redhat.com> Cc: Joshua Hahn <joshua.hahnjy@gmail.com> Cc: Rakie Kim <rakie.kim@sk.com> Cc: Byungchul Park <byungchul@sk.com> Cc: Gregory Price <gourry@gourry.net> Cc: Ying Huang <ying.huang@linux.alibaba.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: "Liam R. 
Howlett" <Liam.Howlett@oracle.com> Cc: Nico Pache <npache@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Dev Jain <dev.jain@arm.com> Cc: Barry Song <baohua@kernel.org> Cc: Lyude Paul <lyude@redhat.com> Cc: Danilo Krummrich <dakr@kernel.org> Cc: David Airlie <airlied@gmail.com> Cc: Simona Vetter <simona@ffwll.ch> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: Mika Penttilä <mpenttil@redhat.com> Cc: Matthew Brost <matthew.brost@intel.com> Cc: Francois Dugast <francois.dugast@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-11-24 15:08:53 -08:00
Qi Zheng	46156dba32	mm: thp: reparent the split queue during memcg offline Similar to list_lru, the split queue is relatively independent and does not need to be reparented along with objcg and LRU folios (holding objcg lock and lru lock). So let's apply a mechanism similar to list_lru's to reparent the split queue separately when memcg is offline. This is also a preparation for reparenting LRU folios. Link: https://lkml.kernel.org/r/8703f907c4d1f7e8a2ef2bfed3036a84fa53028b.1762762324.git.zhengqi.arch@bytedance.com Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com> Acked-by: Zi Yan <ziy@nvidia.com> Reviewed-by: Muchun Song <muchun.song@linux.dev> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Reviewed-by: Harry Yoo <harry.yoo@oracle.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: Dev Jain <dev.jain@arm.com> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Lance Yang <lance.yang@linux.dev> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Nico Pache <npache@redhat.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Wei Yang <richard.weiyang@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-11-24 15:08:52 -08:00
Muchun Song	776bde7caf	mm: thp: use folio_batch to handle THP splitting in deferred_split_scan() The maintenance of the folio->_deferred_list is intricate because it's reused in a local list. Here are some peculiarities: 1) When a folio is removed from its split queue and added to a local on-stack list in deferred_split_scan(), the ->split_queue_len isn't updated, leading to an inconsistency between it and the actual number of folios in the split queue. 2) When the folio is split via split_folio() later, it's removed from the local list while holding the split queue lock. At this time, the lock is not needed as it is not protecting anything. 3) To handle the race condition with a third-party freeing or migrating the preceding folio, we must ensure there's always one safe (with raised refcount) folio before by delaying its folio_put(). More details can be found in commit `e66f3185fa` ("mm/thp: fix deferred split queue not partially_mapped"). It's rather tricky. We can use the folio_batch infrastructure to handle this clearly. In this case, ->split_queue_len will be consistent with the real number of folios in the split queue. If list_empty(&folio->_deferred_list) returns false, it's clear the folio must be in its split queue (not in a local list anymore). In the future, we will reparent LRU folios during memcg offline to eliminate dying memory cgroups, which requires reparenting the split queue to its parent first. So this patch prepares for using folio_split_queue_lock_irqsave() as the memcg may change then. 
Link: https://lkml.kernel.org/r/59cb6b6fb5ffcff9d23b81890b252960139ad8e7.1762762324.git.zhengqi.arch@bytedance.com Signed-off-by: Muchun Song <songmuchun@bytedance.com> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Reviewed-by: Wei Yang <richard.weiyang@gmail.com> Reviewed-by: Harry Yoo <harry.yoo@oracle.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: Dev Jain <dev.jain@arm.com> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Lance Yang <lance.yang@linux.dev> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nico Pache <npache@redhat.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Ryan Roberts <ryan.roberts@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-11-24 15:08:52 -08:00
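The bookkeeping point in the commit above (items leave the queue in fixed-size batches and the length counter is updated at the same moment, so it always matches the real queue contents) can be sketched in userspace C. This is a simplified illustration under stated assumptions: the `demo_` types are hypothetical, the list is singly linked, and no locking or reference counting is modelled, unlike the kernel's folio_batch and split queue.

```c
#include <stddef.h>

#define DEMO_BATCH_SIZE 4

/* Hypothetical queue element; stands in for a folio on a split queue. */
struct demo_item {
	struct demo_item *next;
	int id;
};

struct demo_queue {
	struct demo_item *head;
	size_t len;		/* always equals the number of queued items */
};

/* Fixed-size batch, loosely analogous to a folio_batch. */
struct demo_batch {
	unsigned int nr;
	struct demo_item *items[DEMO_BATCH_SIZE];
};

static void demo_queue_push(struct demo_queue *q, struct demo_item *it)
{
	it->next = q->head;
	q->head = it;
	q->len++;
}

/*
 * Move up to DEMO_BATCH_SIZE items into the batch. The counter is
 * decremented as each item is unlinked, so it never drifts from the
 * queue's real length. Kernel code would hold the queue lock here.
 */
static unsigned int demo_queue_drain(struct demo_queue *q, struct demo_batch *b)
{
	b->nr = 0;
	while (q->head && b->nr < DEMO_BATCH_SIZE) {
		struct demo_item *it = q->head;

		q->head = it->next;
		it->next = NULL;
		q->len--;
		b->items[b->nr++] = it;
	}
	return b->nr;
}
```

Once an item sits in the batch it is no longer linked into any list, so a reader can tell from the item alone whether it is still queued.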
Muchun Song	ad7c7f4576	mm: thp: introduce folio_split_queue_lock and its variants In future memcg removal, the binding between a folio and a memcg may change, making the split lock within the memcg unstable when held. A new approach is required to reparent the split queue to its parent. This patch starts introducing a unified way to acquire the split lock for future work. It's a code-only refactoring with no functional changes. Link: https://lkml.kernel.org/r/a31a90bcac04dc754f775e87ae3205be3170b571.1762762324.git.zhengqi.arch@bytedance.com Signed-off-by: Muchun Song <songmuchun@bytedance.com> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Zi Yan <ziy@nvidia.com> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Harry Yoo <harry.yoo@oracle.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: Dev Jain <dev.jain@arm.com> Cc: Hugh Dickins <hughd@google.com> Cc: Lance Yang <lance.yang@linux.dev> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nico Pache <npache@redhat.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Wei Yang <richard.weiyang@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-11-24 15:08:52 -08:00
Muchun Song	fd603ae11e	mm: thp: replace folio_memcg() with folio_memcg_charged() Patch series "reparent the THP split queue", v6. In the future, we will reparent LRU folios during memcg offline to eliminate dying memory cgroups, which requires reparenting the THP split queue to its parent memcg. Similar to list_lru, the split queue is relatively independent and does not need to be reparented along with objcg and LRU folios (holding objcg lock and lru lock). Therefore, we can apply the same mechanism as list_lru to reparent the split queue first when memcg is offline. The first three patches in this series are separated from the series "Eliminate Dying Memory Cgroup" [1], mainly to do some cleanup and preparatory work. The last patch reparents the THP split queue to its parent memcg during memcg offline. This patch (of 4): folio_memcg_charged() is intended for use when the user is unconcerned about the returned memcg pointer. It is more efficient than folio_memcg(). Therefore, replace folio_memcg() with folio_memcg_charged(). 
Link: https://lkml.kernel.org/r/56624d537520e33e5a6b3755238b3dfb959a52ee.1762762324.git.zhengqi.arch@bytedance.com Link: https://lore.kernel.org/all/20250415024532.26632-1-songmuchun@bytedance.com/ [1] Signed-off-by: Muchun Song <songmuchun@bytedance.com> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev> Reviewed-by: Harry Yoo <harry.yoo@oracle.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: Dev Jain <dev.jain@arm.com> Cc: Hugh Dickins <hughd@google.com> Cc: Lance Yang <lance.yang@linux.dev> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Nico Pache <npache@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: Zi Yan <ziy@nvidia.com> Cc: Muchun Song <muchun.song@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-11-24 15:08:52 -08:00
Lorenzo Stoakes	c0a80c2ce6	mm: remove remaining is_swap_pmd() users and is_swap_pmd() Update copy_huge_pmd() and change_huge_pmd() to use pmd_is_valid_softleaf() - as this checks for the only valid non-present huge PMD states. Also update mm/debug_vm_pgtable.c to explicitly test for a valid leaf PMD entry (which it was not before, which was incorrect), and have it test against pmd_is_huge() and pmd_is_valid_softleaf() rather than is_swap_pmd(). With these changes done there are no further users of is_swap_pmd(), so remove it. Link: https://lkml.kernel.org/r/1628b00b00c8498bbd2c20b82117ee87845fb738.1762812360.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Baoquan He <bhe@redhat.com> Cc: Barry Song <baohua@kernel.org> Cc: Byungchul Park <byungchul@sk.com> Cc: Chengming Zhou <chengming.zhou@linux.dev> Cc: Chris Li <chrisl@kernel.org> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Christian Brauner <brauner@kernel.org> Cc: Claudio Imbrenda <imbrenda@linux.ibm.com> Cc: David Hildenbrand <david@redhat.com> Cc: Dev Jain <dev.jain@arm.com> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Gregory Price <gourry@gourry.net> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: "Huang, Ying" <ying.huang@linux.alibaba.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: Jann Horn <jannh@google.com> Cc: Janosch Frank <frankja@linux.ibm.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Joshua Hahn <joshua.hahnjy@gmail.com> Cc: Kairui Song <kasong@tencent.com> Cc: Kemeng Shi <shikemeng@huaweicloud.com> Cc: Lance Yang <lance.yang@linux.dev> Cc: Leon Romanovsky <leon@kernel.org> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Mathew Brost <matthew.brost@intel.com> 
Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Nico Pache <npache@redhat.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Peter Xu <peterx@redhat.com> Cc: Rakie Kim <rakie.kim@sk.com> Cc: Rik van Riel <riel@surriel.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: SeongJae Park <sj@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wei Xu <weixugc@google.com> Cc: xu xin <xu.xin16@zte.com.cn> Cc: Yuanchu Xie <yuanchu@google.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-11-24 15:08:51 -08:00
Lorenzo Stoakes	15eabc898d	mm: introduce pmd_is_huge() and use where appropriate The leaf entry PMD case is confusing as only migration entries and device private entries are valid at PMD level, not true swap entries. We repeatedly perform checks of the form is_swap_pmd() || pmd_trans_huge() which is itself confusing - it implies that leaf entries at PMD level exist and are different from huge entries. Address this confusion by introducing pmd_is_huge() which checks for either case. Sadly due to header dependency issues (huge_mm.h is included very early on in headers and cannot really rely on much else) we cannot use pmd_is_valid_softleaf() here. However since these are the only valid, handled cases the function is still achieving what it intends to do. We then replace all instances of is_swap_pmd() || pmd_trans_huge() with pmd_is_huge() invocations and adjust logic accordingly to accommodate this. No functional change intended. Link: https://lkml.kernel.org/r/00f79db3b15293cac8f7040a48d69c52d00117e4.1762812360.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Baoquan He <bhe@redhat.com> Cc: Barry Song <baohua@kernel.org> Cc: Byungchul Park <byungchul@sk.com> Cc: Chengming Zhou <chengming.zhou@linux.dev> Cc: Chris Li <chrisl@kernel.org> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Christian Brauner <brauner@kernel.org> Cc: Claudio Imbrenda <imbrenda@linux.ibm.com> Cc: David Hildenbrand <david@redhat.com> Cc: Dev Jain <dev.jain@arm.com> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Gregory Price <gourry@gourry.net> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: "Huang, Ying" <ying.huang@linux.alibaba.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jan 
Kara <jack@suse.cz> Cc: Jann Horn <jannh@google.com> Cc: Janosch Frank <frankja@linux.ibm.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Joshua Hahn <joshua.hahnjy@gmail.com> Cc: Kairui Song <kasong@tencent.com> Cc: Kemeng Shi <shikemeng@huaweicloud.com> Cc: Lance Yang <lance.yang@linux.dev> Cc: Leon Romanovsky <leon@kernel.org> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Mathew Brost <matthew.brost@intel.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Nico Pache <npache@redhat.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Peter Xu <peterx@redhat.com> Cc: Rakie Kim <rakie.kim@sk.com> Cc: Rik van Riel <riel@surriel.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: SeongJae Park <sj@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wei Xu <weixugc@google.com> Cc: xu xin <xu.xin16@zte.com.cn> Cc: Yuanchu Xie <yuanchu@google.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-11-24 15:08:51 -08:00
Lorenzo Stoakes	0ac881efe1	mm: replace pmd_to_swp_entry() with softleaf_from_pmd() Introduce softleaf_from_pmd() to do the equivalent operation for PMDs that softleaf_from_pte() fulfils, and cascade changes through code base accordingly, introducing helpers as necessary. We are then able to eliminate pmd_to_swp_entry(), is_pmd_migration_entry(), is_pmd_device_private_entry() and is_pmd_non_present_folio_entry(). This further establishes the use of leaf operations throughout the code base and further establishes the foundations for eliminating is_swap_pmd(). No functional change intended. [lorenzo.stoakes@oracle.com: check writable, not readable/writable, per Vlastimil] Link: https://lkml.kernel.org/r/cd97b6ec-00f9-45a4-9ae0-8f009c212a94@lucifer.local Link: https://lkml.kernel.org/r/3fb431699639ded8fdc63d2210aa77a38c8891f1.1762812360.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: SeongJae Park <sj@kernel.org> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Baoquan He <bhe@redhat.com> Cc: Barry Song <baohua@kernel.org> Cc: Byungchul Park <byungchul@sk.com> Cc: Chengming Zhou <chengming.zhou@linux.dev> Cc: Chris Li <chrisl@kernel.org> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Christian Brauner <brauner@kernel.org> Cc: Claudio Imbrenda <imbrenda@linux.ibm.com> Cc: David Hildenbrand <david@redhat.com> Cc: Dev Jain <dev.jain@arm.com> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Gregory Price <gourry@gourry.net> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: "Huang, Ying" <ying.huang@linux.alibaba.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: Jann Horn <jannh@google.com> Cc: Janosch Frank <frankja@linux.ibm.com> 
Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Joshua Hahn <joshua.hahnjy@gmail.com> Cc: Kairui Song <kasong@tencent.com> Cc: Kemeng Shi <shikemeng@huaweicloud.com> Cc: Lance Yang <lance.yang@linux.dev> Cc: Leon Romanovsky <leon@kernel.org> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Mathew Brost <matthew.brost@intel.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Nico Pache <npache@redhat.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Peter Xu <peterx@redhat.com> Cc: Rakie Kim <rakie.kim@sk.com> Cc: Rik van Riel <riel@surriel.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Wei Xu <weixugc@google.com> Cc: xu xin <xu.xin16@zte.com.cn> Cc: Yuanchu Xie <yuanchu@google.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-11-24 15:08:51 -08:00
Lorenzo Stoakes	5dfa791605	mm/huge_memory: refactor change_huge_pmd() non-present logic Similar to copy_huge_pmd(), there is a large mass of open-coded logic for the CONFIG_ARCH_ENABLE_THP_MIGRATION non-present entry case that does not use thp_migration_supported() consistently. Resolve this by separating out this logic and introduce change_non_present_huge_pmd(). No functional change intended. Link: https://lkml.kernel.org/r/451b85636ad711e307fdfbff19af699fdab4d05f.1762812360.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Baoquan He <bhe@redhat.com> Cc: Barry Song <baohua@kernel.org> Cc: Byungchul Park <byungchul@sk.com> Cc: Chengming Zhou <chengming.zhou@linux.dev> Cc: Chris Li <chrisl@kernel.org> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Christian Brauner <brauner@kernel.org> Cc: Claudio Imbrenda <imbrenda@linux.ibm.com> Cc: David Hildenbrand <david@redhat.com> Cc: Dev Jain <dev.jain@arm.com> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Gregory Price <gourry@gourry.net> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: "Huang, Ying" <ying.huang@linux.alibaba.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: Jann Horn <jannh@google.com> Cc: Janosch Frank <frankja@linux.ibm.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Joshua Hahn <joshua.hahnjy@gmail.com> Cc: Kairui Song <kasong@tencent.com> Cc: Kemeng Shi <shikemeng@huaweicloud.com> Cc: Lance Yang <lance.yang@linux.dev> Cc: Leon Romanovsky <leon@kernel.org> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Mathew Brost <matthew.brost@intel.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> 
Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Nico Pache <npache@redhat.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Peter Xu <peterx@redhat.com> Cc: Rakie Kim <rakie.kim@sk.com> Cc: Rik van Riel <riel@surriel.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: SeongJae Park <sj@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Wei Xu <weixugc@google.com> Cc: xu xin <xu.xin16@zte.com.cn> Cc: Yuanchu Xie <yuanchu@google.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-11-24 15:08:51 -08:00
Lorenzo Stoakes	e244d82d02	mm/huge_memory: refactor copy_huge_pmd() non-present logic Right now we are inconsistent in our use of thp_migration_supported(): static inline bool thp_migration_supported(void) { return IS_ENABLED(CONFIG_ARCH_ENABLE_THP_MIGRATION); } And simply having arbitrary and ugly #ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION blocks in code. This is exhibited in copy_huge_pmd(), which inserts a large #ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION block and an if-branch which is difficult to follow. It's difficult to follow the logic of such a large function and the non-present PMD logic is clearly separate as it sits in a giant if-branch. Therefore this patch both separates out the logic and utilises thp_migration_supported(). No functional change intended. Link: https://lkml.kernel.org/r/6eaadc23ed512d370ede65561e34e96241c54b9d.1762812360.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Baoquan He <bhe@redhat.com> Cc: Barry Song <baohua@kernel.org> Cc: Byungchul Park <byungchul@sk.com> Cc: Chengming Zhou <chengming.zhou@linux.dev> Cc: Chris Li <chrisl@kernel.org> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Christian Brauner <brauner@kernel.org> Cc: Claudio Imbrenda <imbrenda@linux.ibm.com> Cc: David Hildenbrand <david@redhat.com> Cc: Dev Jain <dev.jain@arm.com> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Gregory Price <gourry@gourry.net> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: "Huang, Ying" <ying.huang@linux.alibaba.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: Jann Horn <jannh@google.com> Cc: Janosch Frank <frankja@linux.ibm.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> 
Cc: Joshua Hahn <joshua.hahnjy@gmail.com> Cc: Kairui Song <kasong@tencent.com> Cc: Kemeng Shi <shikemeng@huaweicloud.com> Cc: Lance Yang <lance.yang@linux.dev> Cc: Leon Romanovsky <leon@kernel.org> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Mathew Brost <matthew.brost@intel.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Nico Pache <npache@redhat.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Peter Xu <peterx@redhat.com> Cc: Rakie Kim <rakie.kim@sk.com> Cc: Rik van Riel <riel@surriel.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: SeongJae Park <sj@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Wei Xu <weixugc@google.com> Cc: xu xin <xu.xin16@zte.com.cn> Cc: Yuanchu Xie <yuanchu@google.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-11-24 15:08:51 -08:00
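The pattern the commit above argues for (a compile-time predicate plus an early return, rather than an #ifdef block inside a large function) can be sketched in user space roughly as follows. This is a minimal illustration, not the kernel code: THP_MIGRATION_ENABLED is a simplified stand-in for IS_ENABLED(), and copy_non_present_entry() is a hypothetical helper name invented here.

```c
#include <errno.h>
#include <stdbool.h>

/*
 * Simplified stand-in for the kernel's IS_ENABLED() (assumption: the real
 * macro is more general; this handles only this one option). It evaluates
 * to 1 when the option is defined at build time and to 0 otherwise.
 */
#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
#define THP_MIGRATION_ENABLED 1
#else
#define THP_MIGRATION_ENABLED 0
#endif

static inline bool thp_migration_supported(void)
{
	return THP_MIGRATION_ENABLED;
}

/*
 * Hypothetical helper: the non-present case lives in its own function and
 * is guarded by an ordinary if, so both branches are always compiled and
 * type-checked, and the disabled one can be discarded as dead code.
 */
static int copy_non_present_entry(int entry)
{
	if (!thp_migration_supported())
		return -EINVAL;

	return entry;	/* placeholder for the real copy logic */
}
```

Built without -DCONFIG_ARCH_ENABLE_THP_MIGRATION, the helper always returns -EINVAL; built with it, the placeholder path is taken.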
Lorenzo Stoakes	aa62204cb6	mm: avoid unnecessary use of is_swap_pmd() PMD 'non-swap' swap entries are currently used for PMD-level migration entries and device private entries. To add to the confusion in this terminology we use is_swap_pmd() in an inconsistent way similar to how is_swap_pte() was being used - sometimes adopting the convention that !pmd_none(), !pmd_present() implies PMD 'swap' entry, sometimes not. This patch handles the low-hanging fruit of cases where we can simply substitute other predicates for is_swap_pmd(). No functional change intended. Link: https://lkml.kernel.org/r/8a1704b36a009c18032d5bea4cb68e71448fbbe5.1762812360.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Baoquan He <bhe@redhat.com> Cc: Barry Song <baohua@kernel.org> Cc: Byungchul Park <byungchul@sk.com> Cc: Chengming Zhou <chengming.zhou@linux.dev> Cc: Chris Li <chrisl@kernel.org> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Christian Brauner <brauner@kernel.org> Cc: Claudio Imbrenda <imbrenda@linux.ibm.com> Cc: David Hildenbrand <david@redhat.com> Cc: Dev Jain <dev.jain@arm.com> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Gregory Price <gourry@gourry.net> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: "Huang, Ying" <ying.huang@linux.alibaba.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: Jann Horn <jannh@google.com> Cc: Janosch Frank <frankja@linux.ibm.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Joshua Hahn <joshua.hahnjy@gmail.com> Cc: Kairui Song <kasong@tencent.com> Cc: Kemeng Shi <shikemeng@huaweicloud.com> Cc: Lance Yang <lance.yang@linux.dev> Cc: Leon Romanovsky <leon@kernel.org> Cc: 
Liam Howlett <liam.howlett@oracle.com> Cc: Mathew Brost <matthew.brost@intel.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Nico Pache <npache@redhat.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Peter Xu <peterx@redhat.com> Cc: Rakie Kim <rakie.kim@sk.com> Cc: Rik van Riel <riel@surriel.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: SeongJae Park <sj@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wei Xu <weixugc@google.com> Cc: xu xin <xu.xin16@zte.com.cn> Cc: Yuanchu Xie <yuanchu@google.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-11-24 15:08:51 -08:00
Wei Yang	8a0e4bdddd	mm/huge_memory: merge uniform_split_supported() and non_uniform_split_supported() uniform_split_supported() and non_uniform_split_supported() share significantly similar logic. The only functional difference is that uniform_split_supported() includes an additional check on the requested @new_order. The reason for this check comes from the following two aspects: * some file systems and the swap cache support only order-0 folios * the behavioral difference between uniform/non-uniform split The behavioral difference between uniform split and non-uniform: * uniform split splits folio directly to @new_order * non-uniform split creates after-split folios with orders from folio_order(folio) - 1 to new_order. This means for non-uniform split or !new_order split we should check the file system and swap cache respectively. This commit unifies the logic and merges the two functions into a single combined helper, removing redundant code and simplifying the split support checking mechanism. Link: https://lkml.kernel.org/r/20251106034155.21398-3-richard.weiyang@gmail.com Fixes: `c010d47f10` ("mm: thp: split huge page to any lower order pages") Signed-off-by: Wei Yang <richard.weiyang@gmail.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Cc: Zi Yan <ziy@nvidia.com> Cc: "David Hildenbrand (Red Hat)" <david@kernel.org> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: Dev Jain <dev.jain@arm.com> Cc: Lance Yang <lance.yang@linux.dev> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Nico Pache <npache@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-11-24 15:08:50 -08:00
Wei Yang	c467061fbb	mm/huge_memory: introduce enum split_type for clarity Patch series "mm/huge_memory: Define split_type and consolidate split support checks", v3. This two-patch series focuses on improving code clarity and removing redundancy in the huge memory handling logic related to folio splitting. The series is based on an original proposal to merge two significantly identical functions that check folio split support[1]. During this process, we found an opportunity to improve readability by explicitly defining the split types. Patch 1: define split_type and use it Patch 2: merge uniform_split_supported() and non_uniform_split_supported() This patch (of 2): We currently handle two distinct types of large folio splitting: * uniform split * non-uniform split Differentiating between these types using a simple boolean variable is not obvious and can harm code readability. This commit introduces enum split_type to explicitly define these two types. Replacing the existing boolean variable with this enumeration significantly improves code clarity and expressiveness when dealing with folio splitting logic. No functional change is expected. [akpm@linux-foundation.org: tweak layout, per David] Link: https://lkml.kernel.org/r/20251106034155.21398-1-richard.weiyang@gmail.com Link: https://lkml.kernel.org/r/20251106034155.21398-2-richard.weiyang@gmail.com Signed-off-by: Wei Yang <richard.weiyang@gmail.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Cc: "David Hildenbrand (Red Hat)" <david@kernel.org> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: Dev Jain <dev.jain@arm.com> Cc: Lance Yang <lance.yang@linux.dev> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Nico Pache <npache@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-11-24 15:08:50 -08:00
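As a rough illustration of why an enum reads better than a boolean here, the sketch below counts the folios each split type would leave behind. It is user-space C written for this note, not code from mm/huge_memory.c: the enumerator names and folios_after_split() are assumptions, and the non-uniform count is inferred from the description "orders from folio_order(folio) - 1 to new_order" (one folio per intermediate order, plus two at new_order).

```c
/* Assumed enumerator names; the commit message only names the enum itself. */
enum split_type {
	SPLIT_TYPE_UNIFORM,
	SPLIT_TYPE_NON_UNIFORM,
};

/*
 * Hypothetical helper: number of folios left after splitting one folio of
 * old_order down to new_order. Assumes old_order >= new_order.
 */
static unsigned int folios_after_split(unsigned int old_order,
				       unsigned int new_order,
				       enum split_type type)
{
	unsigned int steps = old_order - new_order;

	switch (type) {
	case SPLIT_TYPE_UNIFORM:
		/* Every resulting folio has order new_order. */
		return 1u << steps;
	case SPLIT_TYPE_NON_UNIFORM:
		/* Orders old_order - 1 ... new_order + 1, then two of new_order. */
		return steps + 1;
	}
	return 0;
}
```

A call such as folios_after_split(9, 0, SPLIT_TYPE_NON_UNIFORM) states its intent at the call site, whereas a trailing true or false argument would not.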
Zi Yan	50d0598cf2	mm/huge_memory: fix kernel-doc comments for folio_split() and related try_folio_split_to_order(), folio_split(), __folio_split(), and __split_unmapped_folio() do not have correct kernel-doc comment format. Fix them. [ziy@nvidia.com: kernel-doc fixup] Link: https://lkml.kernel.org/r/BE7AC5F3-9E64-4923-861D-C2C4E0CB91EB@nvidia.com [ziy@nvidia.com: add newline to fix an error and a warning from docutils] Link: https://lkml.kernel.org/r/040B38C0-23C6-4AEA-B069-69AE6DAA828B@nvidia.com Link: https://lkml.kernel.org/r/20251031162001.670503-4-ziy@nvidia.com Signed-off-by: Zi Yan <ziy@nvidia.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Lance Yang <lance.yang@linux.dev> Reviewed-by: Barry Song <baohua@kernel.org> Reviewed-by: Miaohe Lin <linmiaohe@huawei.com> Acked-by: David Hildenbrand (Red Hat) <david@kernel.org> Reviewed-by: Wei Yang <richard.weiyang@gmail.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Dev Jain <dev.jain@arm.com> Cc: Jane Chu <jane.chu@oracle.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Luis Chamberalin <mcgrof@kernel.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: Nico Pache <npache@redhat.com> Cc: Pankaj Raghav <kernel@pankajraghav.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Yang Shi <shy828301@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-11-24 15:08:49 -08:00

1094 Commits