linux

mirror of https://github.com/torvalds/linux.git synced 2026-05-30 10:04:04 +02:00

Author	SHA1	Message	Date
Chengwen Feng	a7034e9e44	ACPI: PPTT: Use acpi_get_cpu_uid() and remove get_acpi_id_for_cpu() Update acpi/pptt.c to use acpi_get_cpu_uid() and remove unused get_acpi_id_for_cpu() from arm64/loongarch/riscv, completing PPTT's migration to the unified ACPI CPU UID interface Signed-off-by: Chengwen Feng <fengchengwen@huawei.com> Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com> Link: https://patch.msgid.link/20260401081640.26875-8-fengchengwen@huawei.com Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2026-04-06 16:55:16 +02:00
Chengwen Feng	f652d0a4e1	ACPI: Centralize acpi_get_cpu_uid() declaration in include/linux/acpi.h Centralize acpi_get_cpu_uid() in include/linux/acpi.h (global scope) and remove arch-specific declarations from arm64/loongarch/riscv/x86 asm/acpi.h. This unifies the interface across architectures and simplifies maintenance by eliminating duplicate prototypes. Signed-off-by: Chengwen Feng <fengchengwen@huawei.com> Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com> Link: https://patch.msgid.link/20260401081640.26875-6-fengchengwen@huawei.com Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2026-04-06 16:55:16 +02:00
Chengwen Feng	7cd5f5659a	arm64: acpi: Add acpi_get_cpu_uid() for unified ACPI CPU UID retrieval As a step towards unifying the interface for retrieving ACPI CPU UID across architectures, introduce a new function acpi_get_cpu_uid() for arm64. While at it, add input validation to make the code more robust. Reimplement get_cpu_for_acpi_id() based on acpi_get_cpu_uid() for consistency, and move its implementation next to the new function for code coherence. Signed-off-by: Chengwen Feng <fengchengwen@huawei.com> Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com> Acked-by: Catalin Marinas <catalin.marinas@arm.com> Link: https://patch.msgid.link/20260401081640.26875-2-fengchengwen@huawei.com Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2026-04-06 16:55:15 +02:00
Lorenzo Stoakes (Oracle)	3a6455d56b	mm: convert do_brk_flags() to use vma_flags_t In order to be able to do this, we need to change VM_DATA_DEFAULT_FLAGS and friends and update the architecture-specific definitions also. We then have to update some KSM logic to handle VMA flags, and introduce VMA_STACK_FLAGS to define the vma_flags_t equivalent of VM_STACK_FLAGS. We also introduce two helper functions for use during the time we are converting legacy flags to vma_flags_t values - vma_flags_to_legacy() and legacy_to_vma_flags(). This enables us to iteratively make changes to break these changes up into separate parts. We use these explicitly here to keep VM_STACK_FLAGS around for certain users which need to maintain the legacy vm_flags_t values for the time being. We are no longer able to rely on the simple VM_xxx being set to zero if the feature is not enabled, so in the case of VM_DROPPABLE we introduce VMA_DROPPABLE as the vma_flags_t equivalent, which is set to EMPTY_VMA_FLAGS if the droppable flag is not available. While we're here, we make the description of do_brk_flags() into a kdoc comment, as it almost was already. We use vma_flags_to_legacy() to not need to update the vm_get_page_prot() logic as this time. Note that in create_init_stack_vma() we have to replace the BUILD_BUG_ON() with a VM_WARN_ON_ONCE() as the tested values are no longer build time available. We also update mprotect_fixup() to use VMA flags where possible, though we have to live with a little duplication between vm_flags_t and vma_flags_t values for the time being until further conversions are made. While we're here, update VM_SPECIAL to be defined in terms of VMA_SPECIAL_FLAGS now we have vma_flags_to_legacy(). Finally, we update the VMA tests to reflect these changes. Link: https://lkml.kernel.org/r/d02e3e45d9a33d7904b149f5604904089fd640ae.1774034900.git.ljs@kernel.org Signed-off-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Acked-by: Paul Moore <paul@paul-moore.com> [SELinux] Acked-by: Vlastimil Babka (SUSE) <vbabka@kernel.org> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Alexandre Ghiti <alex@ghiti.fr> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Anton Ivanov <anton.ivanov@cambridgegreys.com> Cc: "Borislav Petkov (AMD)" <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Chengming Zhou <chengming.zhou@linux.dev> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Christian Brauner <brauner@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: Dinh Nguyen <dinguyen@kernel.org> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jan Kara <jack@suse.cz> Cc: Jann Horn <jannh@google.com> Cc: Johannes Berg <johannes@sipsolutions.net> Cc: Kees Cook <kees@kernel.org> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Ondrej Mosnacek <omosnace@redhat.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Pedro Falcato <pfalcato@suse.de> Cc: Richard Weinberger <richard@nod.at> Cc: Russell King <linux@armlinux.org.uk> Cc: Stephen Smalley <stephen.smalley.work@gmail.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vineet Gupta <vgupta@kernel.org> Cc: WANG Xuerui <kernel@xen0n.name> Cc: Will Deacon <will@kernel.org> Cc: xu xin <xu.xin16@zte.com.cn> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-04-05 13:53:40 -07:00
Baolin Wang	42e26354c4	mm: change to return bool for pmdp_test_and_clear_young() Callers use pmdp_test_and_clear_young() to clear the young flag and check whether it was set for this PMD entry. Change the return type to bool to make the intention clearer. Link: https://lkml.kernel.org/r/f1d31307a13365d3d0fed5809727dcc2dd59631b.1774075004.git.baolin.wang@linux.alibaba.com Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Reviewed-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-04-05 13:53:35 -07:00
Baolin Wang	06c4dfa3ce	mm: change to return bool for ptep_clear_flush_young()/clear_flush_young_ptes() The ptep_clear_flush_young() and clear_flush_young_ptes() are used to clear the young flag and flush the TLB, returning whether the young flag was set. Change the return type to bool to make the intention clearer. Link: https://lkml.kernel.org/r/24af5144b96103631594501f77d4525f2475c1be.1774075004.git.baolin.wang@linux.alibaba.com Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Reviewed-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-04-05 13:53:35 -07:00
Baolin Wang	a62ca3f40f	mm: change to return bool for ptep_test_and_clear_young() Patch series "change young flag check functions to return bool", v2. This is a cleanup patchset to change all young flag check functions to return bool, as discussed with David in the previous thread[1]. Since callers only care about whether the young flag was set, returning bool makes the intention clearer. No functional changes intended. This patch (of 6): Callers use ptep_test_and_clear_young() to clear the young flag and check whether it was set. Change the return type to bool to make the intention clearer. Link: https://lkml.kernel.org/r/cover.1774075004.git.baolin.wang@linux.alibaba.com Link: https://lkml.kernel.org/r/57e70efa9703d43959aa645246ea3cbdba14fa17.1774075004.git.baolin.wang@linux.alibaba.com Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Reviewed-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-04-05 13:53:35 -07:00
Baolin Wang	9970a9a27f	arm64: mm: implement the architecture-specific test_and_clear_young_ptes() Implement the Arm64 architecture-specific test_and_clear_young_ptes() to enable batched checking of young flags, improving performance during large folio reclamation when MGLRU is enabled. While we're at it, simplify ptep_test_and_clear_young() by calling test_and_clear_young_ptes(). Since callers guarantee that PTEs are present before calling these functions, we can use pte_cont() to check the CONT_PTE flag instead of pte_valid_cont(). Performance testing: Enable MGLRU, then allocate 10G clean file-backed folios by mmap() in a memory cgroup, and try to reclaim 8G file-backed folios via the memory.reclaim interface. I can observe 60%+ performance improvement on my Arm64 32-core server (and about 15% improvement on my X86 machine). W/o patchset: real 0m0.470s user 0m0.000s sys 0m0.470s W/ patchset: real 0m0.180s user 0m0.001s sys 0m0.179s Link: https://lkml.kernel.org/r/7f891d42a720cc2e57862f3b79e4f774404f313c.1772778858.git.baolin.wang@linux.alibaba.com Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Rik van Riel <riel@surriel.com> Reviewed-by: David Hildenbrand (Arm) <david@kernel.org> Cc: Alistair Popple <apopple@nvidia.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Barry Song <baohua@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Dev Jain <dev.jain@arm.com> Cc: Jann Horn <jannh@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wei Xu <weixugc@google.com> Cc: Will Deacon <will@kernel.org> Cc: Yuanchu Xie <yuanchu@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-04-05 13:53:16 -07:00
Mike Rapoport (Microsoft)	26513781d1	mm: cache struct page for empty_zero_page and return it from ZERO_PAGE() For most architectures every invocation of ZERO_PAGE() does virt_to_page(empty_zero_page). But empty_zero_page is in BSS and it is enough to get its struct page once at initialization time and then use it whenever a zero page should be accessed. Add yet another __zero_page variable that will be initialized as virt_to_page(empty_zero_page) for most architectures in a weak arch_setup_zero_pages() function. For architectures that use colored zero pages (MIPS and s390) rename their setup_zero_pages() to arch_setup_zero_pages() and make it global rather than static. For architectures that cannot use virt_to_page() for BSS (arm64 and sparc64) add override of arch_setup_zero_pages(). Link: https://lkml.kernel.org/r/20260211103141.3215197-5-rppt@kernel.org Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Acked-by: Catalin Marinas <catalin.marinas@arm.com> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Acked-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Andreas Larsson <andreas@gaisler.com> Cc: "Borislav Petkov (AMD)" <bp@alien8.de> Cc: Christophe Leroy (CS GROUP) <chleroy@kernel.org> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David S. Miller <davem@davemloft.net> Cc: Dinh Nguyen <dinguyen@kernel.org> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Guo Ren <guoren@kernel.org> Cc: Helge Deller <deller@gmx.de> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Johannes Berg <johannes@sipsolutions.net> Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Magnus Lindholm <linmag7@gmail.com> Cc: Matt Turner <mattst88@gmail.com> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Hocko <mhocko@suse.com> Cc: Michal Simek <monstr@monstr.eu> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Richard Weinberger <richard@nod.at> Cc: Russell King <linux@armlinux.org.uk> Cc: Stafford Horne <shorne@gmail.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vineet Gupta <vgupta@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-04-05 13:53:01 -07:00
Mike Rapoport (Microsoft)	6215d9f447	arch, mm: consolidate empty_zero_page Reduce 22 declarations of empty_zero_page to 3 and 23 declarations of ZERO_PAGE() to 4. Every architecture defines empty_zero_page that way or another, but for the most of them it is always a page aligned page in BSS and most definitions of ZERO_PAGE do virt_to_page(empty_zero_page). Move Linus vetted x86 definition of empty_zero_page and ZERO_PAGE() to the core MM and drop these definitions in architectures that do not implement colored zero page (MIPS and s390). ZERO_PAGE() remains a macro because turning it to a wrapper for a static inline causes severe pain in header dependencies. For the most part the change is mechanical, with these being noteworthy: * alpha: aliased empty_zero_page with ZERO_PGE that was also used for boot parameters. Switching to a generic empty_zero_page removes the aliasing and keeps ZERO_PGE for boot parameters only * arm64: uses __pa_symbol() in ZERO_PAGE() so that definition of ZERO_PAGE() is kept intact. * m68k/parisc/um: allocated empty_zero_page from memblock, although they do not support zero page coloring and having it in BSS will work fine. * sparc64 can have empty_zero_page in BSS rather allocate it, but it can't use virt_to_page() for BSS. Keep it's definition of ZERO_PAGE() but instead of allocating it, make mem_map_zero point to empty_zero_page. * sh: used empty_zero_page for boot parameters at the very early boot. Rename the parameters page to boot_params_page and let sh use the generic empty_zero_page. * hexagon: had an amusing comment about empty_zero_page /* A handy thing to have if one has the RAM. Declared in head.S */ that unfortunately had to go :) Link: https://lkml.kernel.org/r/20260211103141.3215197-4-rppt@kernel.org Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Acked-by: Helge Deller <deller@gmx.de> [parisc] Tested-by: Helge Deller <deller@gmx.de> [parisc] Reviewed-by: Christophe Leroy (CS GROUP) <chleroy@kernel.org> Acked-by: Dave Hansen <dave.hansen@linux.intel.com> Acked-by: Catalin Marinas <catalin.marinas@arm.com> Acked-by: Magnus Lindholm <linmag7@gmail.com> [alpha] Acked-by: Dinh Nguyen <dinguyen@kernel.org> [nios2] Acked-by: Andreas Larsson <andreas@gaisler.com> [sparc] Acked-by: David Hildenbrand (Arm) <david@kernel.org> Acked-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: "Borislav Petkov (AMD)" <bp@alien8.de> Cc: David S. Miller <davem@davemloft.net> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Guo Ren <guoren@kernel.org> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Johannes Berg <johannes@sipsolutions.net> Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Matt Turner <mattst88@gmail.com> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Hocko <mhocko@suse.com> Cc: Michal Simek <monstr@monstr.eu> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Richard Weinberger <richard@nod.at> Cc: Russell King <linux@armlinux.org.uk> Cc: Stafford Horne <shorne@gmail.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vineet Gupta <vgupta@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-04-05 13:53:01 -07:00
Seongsu Park	3d443691ed	mm/pkeys: remove unused tsk parameter from arch_set_user_pkey_access() The tsk parameter in arch_set_user_pkey_access() is never used in the function implementations across all architectures (arm64, powerpc, x86). Link: https://lkml.kernel.org/r/20260219063506.545148-1-sgsu.park@samsung.com Signed-off-by: Seongsu Park <sgsu.park@samsung.com> Cc: "Borislav Petkov (AMD)" <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-04-05 13:52:57 -07:00
Linus Torvalds	441c63ff42	arm64 fix for v7.0 - Implement a basic static call trampoline to fix CFI failures with the generic implementation. -----BEGIN PGP SIGNATURE----- iQFEBAABCgAuFiEEPxTL6PPUbjXGY88ct6xw3ITBYzQFAmnPh0AQHHdpbGxAa2Vy bmVsLm9yZwAKCRC3rHDchMFjNHRVB/97IOb/LZAq2yguGy6rMptm3tCdCsUmgPkh aPBeI4BE1JXofRcyM1oaavM/wC6M3ASb8JJbg5Ceta3wXwPfjzR2F9+6OEzipXzC nQzm0Da5GvwiHOY6GGhOgUy91+JJB1g7402ALIRjCiaadDBTLgys/YzDFUGC4+8N QKToOJykO4sCUR4lpYpuJvd1NQv1VkJo4ZgtlWvanHo9ovkTXOuCJsCTBv6EHMo6 nJg9iSZOMj3L20VSmnY5fa0MpCNCXH8cfYtbmHBYBxI3e3sKYI8A2j0H22FP4oIH 2+tkIg5TxQsmejf9u9V1JES2/0712SmG/hS0y1BsQtYzVuDp7pBZ =qSXb -----END PGP SIGNATURE----- Merge tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux Pull arm64 fix from Will Deacon: - Implement a basic static call trampoline to fix CFI failures with the generic implementation * tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: arm64: Use static call trampolines when kCFI is enabled	2026-04-03 08:47:13 -07:00
Christoph Hellwig	e20043b476	xor: make xor.ko self-contained in lib/raid/ Move the asm/xor.h headers to lib/raid/xor/$(SRCARCH)/xor_arch.h and include/linux/raid/xor_impl.h to lib/raid/xor/xor_impl.h so that the xor.ko module implementation is self-contained in lib/raid/. As this remove the asm-generic mechanism a new kconfig symbol is added to indicate that a architecture-specific implementations exists, and xor_arch.h should be included. Link: https://lkml.kernel.org/r/20260327061704.3707577-22-hch@lst.de Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Eric Biggers <ebiggers@kernel.org> Tested-by: Eric Biggers <ebiggers@kernel.org> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Alexandre Ghiti <alex@ghiti.fr> Cc: Andreas Larsson <andreas@gaisler.com> Cc: Anton Ivanov <anton.ivanov@cambridgegreys.com> Cc: Ard Biesheuvel <ardb@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: "Borislav Petkov (AMD)" <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Chris Mason <clm@fb.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: David S. Miller <davem@davemloft.net> Cc: David Sterba <dsterba@suse.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jason A. Donenfeld <jason@zx2c4.com> Cc: Johannes Berg <johannes@sipsolutions.net> Cc: Li Nan <linan122@huawei.com> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Magnus Lindholm <linmag7@gmail.com> Cc: Matt Turner <mattst88@gmail.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Richard Henderson <richard.henderson@linaro.org> Cc: Richard Weinberger <richard@nod.at> Cc: Russell King <linux@armlinux.org.uk> Cc: Song Liu <song@kernel.org> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Ted Ts'o <tytso@mit.edu> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: WANG Xuerui <kernel@xen0n.name> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-04-02 23:36:20 -07:00
Christoph Hellwig	352ebd066b	xor: avoid indirect calls for arm64-optimized ops Remove the inner xor_block_templates, and instead have two separate actual template that call into the neon-enabled compilation unit. Link: https://lkml.kernel.org/r/20260327061704.3707577-21-hch@lst.de Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Eric Biggers <ebiggers@kernel.org> Tested-by: Eric Biggers <ebiggers@kernel.org> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Alexandre Ghiti <alex@ghiti.fr> Cc: Andreas Larsson <andreas@gaisler.com> Cc: Anton Ivanov <anton.ivanov@cambridgegreys.com> Cc: Ard Biesheuvel <ardb@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: "Borislav Petkov (AMD)" <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Chris Mason <clm@fb.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: David S. Miller <davem@davemloft.net> Cc: David Sterba <dsterba@suse.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jason A. Donenfeld <jason@zx2c4.com> Cc: Johannes Berg <johannes@sipsolutions.net> Cc: Li Nan <linan122@huawei.com> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Magnus Lindholm <linmag7@gmail.com> Cc: Matt Turner <mattst88@gmail.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Richard Henderson <richard.henderson@linaro.org> Cc: Richard Weinberger <richard@nod.at> Cc: Russell King <linux@armlinux.org.uk> Cc: Song Liu <song@kernel.org> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Ted Ts'o <tytso@mit.edu> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: WANG Xuerui <kernel@xen0n.name> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-04-02 23:36:20 -07:00
Christoph Hellwig	3786f2ad00	arm64: move the XOR code to lib/raid/ Move the optimized XOR into lib/raid and include it it in the main xor.ko instead of building a separate module for it. Note that this drops the CONFIG_KERNEL_MODE_NEON dependency, as that is always set for arm64. Link: https://lkml.kernel.org/r/20260327061704.3707577-14-hch@lst.de Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Eric Biggers <ebiggers@kernel.org> Tested-by: Eric Biggers <ebiggers@kernel.org> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Alexandre Ghiti <alex@ghiti.fr> Cc: Andreas Larsson <andreas@gaisler.com> Cc: Anton Ivanov <anton.ivanov@cambridgegreys.com> Cc: Ard Biesheuvel <ardb@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: "Borislav Petkov (AMD)" <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Chris Mason <clm@fb.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: David S. Miller <davem@davemloft.net> Cc: David Sterba <dsterba@suse.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jason A. Donenfeld <jason@zx2c4.com> Cc: Johannes Berg <johannes@sipsolutions.net> Cc: Li Nan <linan122@huawei.com> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Magnus Lindholm <linmag7@gmail.com> Cc: Matt Turner <mattst88@gmail.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Richard Henderson <richard.henderson@linaro.org> Cc: Richard Weinberger <richard@nod.at> Cc: Russell King <linux@armlinux.org.uk> Cc: Song Liu <song@kernel.org> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Ted Ts'o <tytso@mit.edu> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: WANG Xuerui <kernel@xen0n.name> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-04-02 23:36:18 -07:00
Christoph Hellwig	35ebc4de10	xor: remove macro abuse for XOR implementation registrations Drop the pretty confusing historic XOR_TRY_TEMPLATES and XOR_SELECT_TEMPLATE, and instead let the architectures provide a arch_xor_init that calls either xor_register to register candidates or xor_force to force a specific implementation. Link: https://lkml.kernel.org/r/20260327061704.3707577-10-hch@lst.de Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Eric Biggers <ebiggers@kernel.org> Tested-by: Eric Biggers <ebiggers@kernel.org> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Alexandre Ghiti <alex@ghiti.fr> Cc: Andreas Larsson <andreas@gaisler.com> Cc: Anton Ivanov <anton.ivanov@cambridgegreys.com> Cc: Ard Biesheuvel <ardb@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: "Borislav Petkov (AMD)" <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Chris Mason <clm@fb.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: David S. Miller <davem@davemloft.net> Cc: David Sterba <dsterba@suse.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jason A. Donenfeld <jason@zx2c4.com> Cc: Johannes Berg <johannes@sipsolutions.net> Cc: Li Nan <linan122@huawei.com> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Magnus Lindholm <linmag7@gmail.com> Cc: Matt Turner <mattst88@gmail.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Richard Henderson <richard.henderson@linaro.org> Cc: Richard Weinberger <richard@nod.at> Cc: Russell King <linux@armlinux.org.uk> Cc: Song Liu <song@kernel.org> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Ted Ts'o <tytso@mit.edu> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: WANG Xuerui <kernel@xen0n.name> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-04-02 23:36:17 -07:00
Christoph Hellwig	675a0dd596	arm64/xor: fix conflicting attributes for xor_block_template Commit `2c54b423cf` ("arm64/xor: use EOR3 instructions when available") changes the definition to __ro_after_init instead of const, but failed to update the external declaration in xor.h. This was not found because xor-neon.c doesn't include <asm/xor.h>, and can't easily do that due to current architecture of the XOR code. Link: https://lkml.kernel.org/r/20260327061704.3707577-4-hch@lst.de Fixes: `2c54b423cf` ("arm64/xor: use EOR3 instructions when available") Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Eric Biggers <ebiggers@kernel.org> Tested-by: Eric Biggers <ebiggers@kernel.org> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Alexandre Ghiti <alex@ghiti.fr> Cc: Andreas Larsson <andreas@gaisler.com> Cc: Anton Ivanov <anton.ivanov@cambridgegreys.com> Cc: Ard Biesheuvel <ardb@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: "Borislav Petkov (AMD)" <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Chris Mason <clm@fb.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: David S. Miller <davem@davemloft.net> Cc: David Sterba <dsterba@suse.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jason A. Donenfeld <jason@zx2c4.com> Cc: Johannes Berg <johannes@sipsolutions.net> Cc: Li Nan <linan122@huawei.com> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Magnus Lindholm <linmag7@gmail.com> Cc: Matt Turner <mattst88@gmail.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Richard Henderson <richard.henderson@linaro.org> Cc: Richard Weinberger <richard@nod.at> Cc: Russell King <linux@armlinux.org.uk> Cc: Song Liu <song@kernel.org> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Ted Ts'o <tytso@mit.edu> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: WANG Xuerui <kernel@xen0n.name> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-04-02 23:36:15 -07:00
Ryan Roberts	1d37713fa8	arm64: mm: Remove pmd_sect() and pud_sect() The semantics of pXd_leaf() are very similar to pXd_sect(). The only difference is that pXd_sect() only considers it a section if PTE_VALID is set, whereas pXd_leaf() permits both "valid" and "present-invalid" types. Using pXd_sect() has caused issues now that large leaf entries can be present-invalid since commit `a166563e7e` ("arm64: mm: support large block mapping when rodata=full"), so let's just remove the API and standardize on pXd_leaf(). There are a few callsites of the form pXd_leaf(READ_ONCE(*pXdp)). This was previously fine for the pXd_sect() macro because it only evaluated its argument once. But pXd_leaf() evaluates its argument multiple times. So let's avoid unintended side effects by reimplementing pXd_leaf() as an inline function. Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>	2026-04-02 20:49:16 +01:00
Ryan Roberts	15bfba1ad7	arm64: mm: Handle invalid large leaf mappings correctly It has been possible for a long time to mark ptes in the linear map as invalid. This is done for secretmem, kfence, realm dma memory un/share, and others, by simply clearing the PTE_VALID bit. But until commit `a166563e7e` ("arm64: mm: support large block mapping when rodata=full") large leaf mappings were never made invalid in this way. It turns out various parts of the code base are not equipped to handle invalid large leaf mappings (in the way they are currently encoded) and I've observed a kernel panic while booting a realm guest on a BBML2_NOABORT system as a result: [ 15.432706] software IO TLB: Memory encryption is active and system is using DMA bounce buffers [ 15.476896] Unable to handle kernel paging request at virtual address ffff000019600000 [ 15.513762] Mem abort info: [ 15.527245] ESR = 0x0000000096000046 [ 15.548553] EC = 0x25: DABT (current EL), IL = 32 bits [ 15.572146] SET = 0, FnV = 0 [ 15.592141] EA = 0, S1PTW = 0 [ 15.612694] FSC = 0x06: level 2 translation fault [ 15.640644] Data abort info: [ 15.661983] ISV = 0, ISS = 0x00000046, ISS2 = 0x00000000 [ 15.694875] CM = 0, WnR = 1, TnD = 0, TagAccess = 0 [ 15.723740] GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0 [ 15.755776] swapper pgtable: 4k pages, 48-bit VAs, pgdp=0000000081f3f000 [ 15.800410] [ffff000019600000] pgd=0000000000000000, p4d=180000009ffff403, pud=180000009fffe403, pmd=00e8000199600704 [ 15.855046] Internal error: Oops: 0000000096000046 [#1] SMP [ 15.886394] Modules linked in: [ 15.900029] CPU: 0 UID: 0 PID: 1 Comm: swapper/0 Not tainted 7.0.0-rc4-dirty #4 PREEMPT [ 15.935258] Hardware name: linux,dummy-virt (DT) [ 15.955612] pstate: 21400005 (nzCv daif +PAN -UAO -TCO +DIT -SSBS BTYPE=--) [ 15.986009] pc : __pi_memcpy_generic+0x128/0x22c [ 16.006163] lr : swiotlb_bounce+0xf4/0x158 [ 16.024145] sp : ffff80008000b8f0 [ 16.038896] x29: ffff80008000b8f0 x28: 0000000000000000 x27: 0000000000000000 [ 16.069953] x26: ffffb3976d261ba8 x25: 0000000000000000 x24: ffff000019600000 [ 16.100876] x23: 0000000000000001 x22: ffff0000043430d0 x21: 0000000000007ff0 [ 16.131946] x20: 0000000084570010 x19: 0000000000000000 x18: ffff00001ffe3fcc [ 16.163073] x17: 0000000000000000 x16: 00000000003fffff x15: 646e612065766974 [ 16.194131] x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000000000 [ 16.225059] x11: 0000000000000000 x10: 0000000000000010 x9 : 0000000000000018 [ 16.256113] x8 : 0000000000000018 x7 : 0000000000000000 x6 : 0000000000000000 [ 16.287203] x5 : ffff000019607ff0 x4 : ffff000004578000 x3 : ffff000019600000 [ 16.318145] x2 : 0000000000007ff0 x1 : ffff000004570010 x0 : ffff000019600000 [ 16.349071] Call trace: [ 16.360143] __pi_memcpy_generic+0x128/0x22c (P) [ 16.380310] swiotlb_tbl_map_single+0x154/0x2b4 [ 16.400282] swiotlb_map+0x5c/0x228 [ 16.415984] dma_map_phys+0x244/0x2b8 [ 16.432199] dma_map_page_attrs+0x44/0x58 [ 16.449782] virtqueue_map_page_attrs+0x38/0x44 [ 16.469596] virtqueue_map_single_attrs+0xc0/0x130 [ 16.490509] virtnet_rq_alloc.isra.0+0xa4/0x1fc [ 16.510355] try_fill_recv+0x2a4/0x584 [ 16.526989] virtnet_open+0xd4/0x238 [ 16.542775] __dev_open+0x110/0x24c [ 16.558280] __dev_change_flags+0x194/0x20c [ 16.576879] netif_change_flags+0x24/0x6c [ 16.594489] dev_change_flags+0x48/0x7c [ 16.611462] ip_auto_config+0x258/0x1114 [ 16.628727] do_one_initcall+0x80/0x1c8 [ 16.645590] kernel_init_freeable+0x208/0x2f0 [ 16.664917] kernel_init+0x24/0x1e0 [ 16.680295] ret_from_fork+0x10/0x20 [ 16.696369] Code: 927cec03 cb0e0021 8b0e0042 a9411c26 (a900340c) [ 16.723106] ---[ end trace 0000000000000000 ]--- [ 16.752866] Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b [ 16.792556] Kernel Offset: 0x3396ea200000 from 0xffff800080000000 [ 16.818966] PHYS_OFFSET: 0xfff1000080000000 [ 16.837237] CPU features: 0x0000000,00060005,13e38581,957e772f [ 16.862904] Memory Limit: none [ 16.876526] ---[ end Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b ]--- This panic occurs because the swiotlb memory was previously shared to the host (__set_memory_enc_dec()), which involves transitioning the (large) leaf mappings to invalid, sharing to the host, then marking the mappings valid again. But pageattr_p[mu]d_entry() would only update the entry if it is a section mapping, since otherwise it concluded it must be a table entry so shouldn't be modified. But p[mu]d_sect() only returns true if the entry is valid. So the result was that the large leaf entry was made invalid in the first pass then ignored in the second pass. It remains invalid until the above code tries to access it and blows up. The simple fix would be to update pageattr_pmd_entry() to use !pmd_table() instead of pmd_sect(). That would solve this problem. But the ptdump code also suffers from a similar issue. It checks pmd_leaf() and doesn't call into the arch-specific note_page() machinery if it returns false. As a result of this, ptdump wasn't even able to show the invalid large leaf mappings; it looked like they were valid which made this super fun to debug. the ptdump code is core-mm and pmd_table() is arm64-specific so we can't use the same trick to solve that. But we already support the concept of "present-invalid" for user space entries. And even better, pmd_leaf() will return true for a leaf mapping that is marked present-invalid. So let's just use that encoding for present-invalid kernel mappings too. Then we can use pmd_leaf() where we previously used pmd_sect() and everything is magically fixed. Additionally, from inspection kernel_page_present() was broken in a similar way, so I'm also updating that to use pmd_leaf(). The transitional page tables component was also similarly broken; it creates a copy of the kernel page tables, making RO leaf mappings RW in the process. It also makes invalid (but-not-none) pte mappings valid. But it was not doing this for large leaf mappings. This could have resulted in crashes at kexec- or hibernate-time. This code is fixed to flip "present-invalid" mappings back to "present-valid" at all levels. Finally, I have hardened split_pmd()/split_pud() so that if it is passed a "present-invalid" leaf, it will maintain that property in the split leaves, since I wasn't able to convince myself that it would only ever be called for "present-valid" leaves. Fixes: `a166563e7e` ("arm64: mm: support large block mapping when rodata=full") Cc: stable@vger.kernel.org Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>	2026-04-02 20:49:16 +01:00
Ryan Roberts	f12b435de2	arm64: mm: Fix rodata=full block mapping support for realm guests Commit `a166563e7e` ("arm64: mm: support large block mapping when rodata=full") enabled the linear map to be mapped by block/cont while still allowing granular permission changes on BBML2_NOABORT systems by lazily splitting the live mappings. This mechanism was intended to be usable by realm guests since they need to dynamically share dma buffers with the host by "decrypting" them - which for Arm CCA, means marking them as shared in the page tables. However, it turns out that the mechanism was failing for realm guests because realms need to share their dma buffers (via __set_memory_enc_dec()) much earlier during boot than split_kernel_leaf_mapping() was able to handle. The report linked below showed that GIC's ITS was one such user. But during the investigation I found other callsites that could not meet the split_kernel_leaf_mapping() constraints. The problem is that we block map the linear map based on the boot CPU supporting BBML2_NOABORT, then check that all the other CPUs support it too when finalizing the caps. If they don't, then we stop_machine() and split to ptes. For safety, split_kernel_leaf_mapping() previously wouldn't permit splitting until after the caps were finalized. That ensured that if any secondary cpus were running that didn't support BBML2_NOABORT, we wouldn't risk breaking them. I've fix this problem by reducing the black-out window where we refuse to split; there are now 2 windows. The first is from T0 until the page allocator is inititialized. Splitting allocates memory for the page allocator so it must be in use. The second covers the period between starting to online the secondary cpus until the system caps are finalized (this is a very small window). All of the problematic callers are calling __set_memory_enc_dec() before the secondary cpus come online, so this solves the problem. However, one of these callers, swiotlb_update_mem_attributes(), was trying to split before the page allocator was initialized. So I have moved this call from arch_mm_preinit() to mem_init(), which solves the ordering issue. I've added warnings and return an error if any attempt is made to split in the black-out windows. Note there are other issues which prevent booting all the way to user space, which will be fixed in subsequent patches. Reported-by: Jinjiang Tu <tujinjiang@huawei.com> Closes: https://lore.kernel.org/all/0b2a4ae5-fc51-4d77-b177-b2e9db74f11d@huawei.com/ Fixes: `a166563e7e` ("arm64: mm: support large block mapping when rodata=full") Cc: stable@vger.kernel.org Reviewed-by: Kevin Brodsky <kevin.brodsky@arm.com> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> Reviewed-by: Suzuki K Poulose <suzuki.poulose@arm.com> Tested-by: Suzuki K Poulose <suzuki.poulose@arm.com> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>	2026-04-02 20:49:16 +01:00
Eric Biggers	180e92df9a	arm64: fpsimd: Remove obsolete cond_yield macro All invocations of the cond_yield macro have been removed, so remove the macro definition as well. Reviewed-by: Ard Biesheuvel <ardb@kernel.org> Link: https://lore.kernel.org/r/20260401000548.133151-10-ebiggers@kernel.org Signed-off-by: Eric Biggers <ebiggers@kernel.org>	2026-04-01 13:02:10 -07:00
Kevin Brodsky	5e0deb0a6b	arm64: mm: Use generic enum pgtable_level enum pgtable_type was introduced for arm64 by commit `c64f46ee13` ("arm64: mm: use enum to identify pgtable level instead of *_SHIFT"). In the meantime, the generic enum pgtable_level got introduced by commit `b22cc9a9c7` ("mm/rmap: convert "enum rmap_level" to "enum pgtable_level""). Let's switch to the generic enum pgtable_level. The only difference is that it also includes PGD level; __pgd_pgtable_alloc() isn't expected to create PGD tables so we add a VM_WARN_ON() for that case. Suggested-by: David Hildenbrand (Arm) <david@kernel.org> Reviewed-by: David Hildenbrand (Arm) <david@kernel.org> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com> Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>	2026-04-01 18:20:12 +01:00
Sascha Bischoff	9c1ac77ddf	KVM: arm64: vgic-v5: Fold PPI state for all exposed PPIs GICv5 supports up to 128 PPIs, which would introduce a large amount of overhead if all of them were actively tracked. Rather than keeping track of all 128 potential PPIs, we instead only consider the set of architected PPIs (the first 64). Moreover, we further reduce that set by only exposing a subset of the PPIs to a guest. In practice, this means that only 4 PPIs are typically exposed to a guest - the SW_PPI, PMUIRQ, and the timers. When folding the PPI state, changed bits in the active or pending were used to choose which state to sync back. However, this breaks badly for Edge interrupts when exiting the guest before it has consumed the edge. There is no change in pending state detected, and the edge is lost forever. Given the reduced set of PPIs exposed to the guest, and the issues around tracking the edges, drop the tracking of changed state, and instead iterate over the limited subset of PPIs exposed to the guest directly. This change drops the second copy of the PPI pending state used for detecting edges in the pending state, and reworks vgic_v5_fold_ppi_state() to iterate over the VM's PPI mask instead. Signed-off-by: Sascha Bischoff <sascha.bischoff@arm.com> Link: https://patch.msgid.link/20260401162152.932243-1-sascha.bischoff@arm.com Signed-off-by: Marc Zyngier <maz@kernel.org>	2026-04-01 17:52:17 +01:00
Ard Biesheuvel	54ac9ff8f1	arm64: Use static call trampolines when kCFI is enabled Implement arm64 support for the 'unoptimized' static call variety, which routes all calls through a trampoline that performs a tail call to the chosen function, and wire it up for use when kCFI is enabled. This works around an issue with kCFI and generic static calls, where the prototypes of default handlers such as __static_call_nop() and __static_call_ret0() don't match the expected prototype of the call site, resulting in kCFI false positives [0]. Since static call targets may be located in modules loaded out of direct branching range, this needs an ADRP/LDR pair to load the branch target into R16 and a branch-to-register (BR) instruction to perform an indirect call. Unlike on x86, there is no pressing need on arm64 to avoid indirect calls at all cost, but hiding it from the compiler as is done here does have some benefits: - the literal is located in .rodata, which gives us the same robustness advantage that code patching does; - no D-cache pollution from fetching hash values from .text sections. From an execution speed PoV, this is unlikely to make any difference at all. Cc: Sami Tolvanen <samitolvanen@google.com> Cc: Sean Christopherson <seanjc@google.com> Cc: Kees Cook <kees@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Will McVicker <willmcvicker@google.com> Reported-by: Carlos Llamas <cmllamas@google.com> Closes: https://lore.kernel.org/all/20260311225822.1565895-1-cmllamas@google.com/ [0] Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Will Deacon <will@kernel.org> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>	2026-04-01 15:29:59 +01:00
Linus Torvalds	809b997a5c	x86-64/arm64/powerpc: clean up and rename __copy_from_user_flushcache This finishes the work on these odd functions that were only implemented by a handful of architectures. The 'flushcache' function was only used from the iterator code, and let's make it do the same thing that the nontemporal version does: remove the two underscores and add the user address checking. Yes, yes, the user address checking is also done at iovec import time, but we have long since walked away from the old double-underscore thing where we try to avoid address checking overhead at access time, and these functions shouldn't be so special and old-fashioned. The arm64 version already did the address check, in fact, so there it's just a matter of renaming it. For powerpc and x86-64 we now do the proper user access boilerplate. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2026-03-30 15:05:57 -07:00
Will Deacon	8800dbf661	KVM: arm64: Allow userspace to create protected VMs when pKVM is enabled Introduce a new VM type for KVM/arm64 to allow userspace to request the creation of a "protected VM" when the host has booted with pKVM enabled. For now, this feature results in a taint on first use as many aspects of a protected VM are not yet protected! Tested-by: Fuad Tabba <tabba@google.com> Tested-by: Mostafa Saleh <smostafa@google.com> Signed-off-by: Will Deacon <will@kernel.org> Link: https://patch.msgid.link/20260330144841.26181-32-will@kernel.org Signed-off-by: Marc Zyngier <maz@kernel.org>	2026-03-30 16:58:09 +01:00
Will Deacon	5991916392	KVM: arm64: Return -EFAULT from VCPU_RUN on access to a poisoned pte If a protected vCPU faults on an IPA which appears to be mapped, query the hypervisor to determine whether or not the faulting pte has been poisoned by a forceful reclaim. If the pte has been poisoned, return -EFAULT back to userspace rather than retrying the instruction forever. Tested-by: Fuad Tabba <tabba@google.com> Tested-by: Mostafa Saleh <smostafa@google.com> Signed-off-by: Will Deacon <will@kernel.org> Link: https://patch.msgid.link/20260330144841.26181-28-will@kernel.org Signed-off-by: Marc Zyngier <maz@kernel.org>	2026-03-30 16:58:09 +01:00
Will Deacon	281a38ad29	KVM: arm64: Reclaim faulting page from pKVM in spurious fault handler Host kernel accesses to pages that are inaccessible at stage-2 result in the injection of a translation fault, which is fatal unless an exception table fixup is registered for the faulting PC (e.g. for user access routines). This is undesirable, since a get_user_pages() call could be used to obtain a reference to a donated page and then a subsequent access via a kernel mapping would lead to a panic(). Rework the spurious fault handler so that stage-2 faults injected back into the host result in the target page being forcefully reclaimed when no exception table fixup handler is registered. Tested-by: Fuad Tabba <tabba@google.com> Tested-by: Mostafa Saleh <smostafa@google.com> Signed-off-by: Will Deacon <will@kernel.org> Link: https://patch.msgid.link/20260330144841.26181-27-will@kernel.org Signed-off-by: Marc Zyngier <maz@kernel.org>	2026-03-30 16:58:09 +01:00
Will Deacon	56080f53a6	KVM: arm64: Introduce hypercall to force reclaim of a protected page Introduce a new hypercall, __pkvm_force_reclaim_guest_page(), to allow the host to forcefully reclaim a physical page that was previous donated to a protected guest. This results in the page being zeroed and the previous guest mapping being poisoned so that new pages cannot be subsequently donated at the same IPA. Tested-by: Fuad Tabba <tabba@google.com> Tested-by: Mostafa Saleh <smostafa@google.com> Signed-off-by: Will Deacon <will@kernel.org> Link: https://patch.msgid.link/20260330144841.26181-26-will@kernel.org Signed-off-by: Marc Zyngier <maz@kernel.org>	2026-03-30 16:58:09 +01:00
Will Deacon	44887977ab	KVM: arm64: Change 'pkvm_handle_t' to u16 'pkvm_handle_t' doesn't need to be a 32-bit type and subsequent patches will rely on it being no more than 16 bits so that it can be encoded into a pte annotation. Change 'pkvm_handle_t' to a u16 and add a compile-type check that the maximum handle fits into the reduced type. Reviewed-by: Fuad Tabba <tabba@google.com> Tested-by: Fuad Tabba <tabba@google.com> Tested-by: Mostafa Saleh <smostafa@google.com> Signed-off-by: Will Deacon <will@kernel.org> Link: https://patch.msgid.link/20260330144841.26181-24-will@kernel.org Signed-off-by: Marc Zyngier <maz@kernel.org>	2026-03-30 16:58:08 +01:00
Will Deacon	afa72d207e	KVM: arm64: Introduce host_stage2_set_owner_metadata_locked() Rework host_stage2_set_owner_locked() to add a new helper function, host_stage2_set_owner_metadata_locked(), which will allow us to store additional metadata alongside a 3-bit owner ID for invalid host stage-2 entries. Tested-by: Fuad Tabba <tabba@google.com> Tested-by: Mostafa Saleh <smostafa@google.com> Signed-off-by: Will Deacon <will@kernel.org> Link: https://patch.msgid.link/20260330144841.26181-23-will@kernel.org Signed-off-by: Marc Zyngier <maz@kernel.org>	2026-03-30 16:58:08 +01:00
Will Deacon	c6ba94640c	KVM: arm64: Generalise kvm_pgtable_stage2_set_owner() kvm_pgtable_stage2_set_owner() can be generalised into a way to store up to 59 bits in the page tables alongside a 4-bit 'type' identifier specific to the format of the 59-bit payload. Introduce kvm_pgtable_stage2_annotate() and move the existing invalid ptes (for locked ptes and donated pages) over to the new scheme. Tested-by: Fuad Tabba <tabba@google.com> Tested-by: Mostafa Saleh <smostafa@google.com> Signed-off-by: Will Deacon <will@kernel.org> Link: https://patch.msgid.link/20260330144841.26181-22-will@kernel.org Signed-off-by: Marc Zyngier <maz@kernel.org>	2026-03-30 16:58:08 +01:00
Will Deacon	0bf5f4d400	KVM: arm64: Introduce __pkvm_reclaim_dying_guest_page() To enable reclaim of pages from a protected VM during teardown, introduce a new hypercall to reclaim a single page from a protected guest that is in the dying state. Since the EL2 code is non-preemptible, the new hypercall deliberately acts on a single page at a time so as to allow EL1 to reschedule frequently during the teardown operation. Reviewed-by: Vincent Donnefort <vdonnefort@google.com> Tested-by: Fuad Tabba <tabba@google.com> Tested-by: Mostafa Saleh <smostafa@google.com> Co-developed-by: Quentin Perret <qperret@google.com> Signed-off-by: Quentin Perret <qperret@google.com> Signed-off-by: Will Deacon <will@kernel.org> Link: https://patch.msgid.link/20260330144841.26181-16-will@kernel.org Signed-off-by: Marc Zyngier <maz@kernel.org>	2026-03-30 16:58:08 +01:00
Will Deacon	1e579adca1	KVM: arm64: Introduce __pkvm_host_donate_guest() In preparation for supporting protected VMs, whose memory pages are isolated from the host, introduce a new pKVM hypercall to allow the donation of pages to a guest. Tested-by: Fuad Tabba <tabba@google.com> Tested-by: Mostafa Saleh <smostafa@google.com> Signed-off-by: Will Deacon <will@kernel.org> Link: https://patch.msgid.link/20260330144841.26181-13-will@kernel.org Signed-off-by: Marc Zyngier <maz@kernel.org>	2026-03-30 16:58:08 +01:00
Will Deacon	6c58f914eb	KVM: arm64: Split teardown hypercall into two phases In preparation for reclaiming protected guest VM pages from the host during teardown, split the current 'pkvm_teardown_vm' hypercall into separate 'start' and 'finalise' calls. The 'pkvm_start_teardown_vm' hypercall puts the VM into a new 'is_dying' state, which is a point of no return past which no vCPU of the pVM is allowed to run any more. Once in this new state, 'pkvm_finalize_teardown_vm' can be used to reclaim meta-data and page-table pages from the VM. A subsequent patch will add support for reclaiming the individual guest memory pages. Reviewed-by: Fuad Tabba <tabba@google.com> Tested-by: Fuad Tabba <tabba@google.com> Tested-by: Mostafa Saleh <smostafa@google.com> Co-developed-by: Quentin Perret <qperret@google.com> Signed-off-by: Quentin Perret <qperret@google.com> Signed-off-by: Will Deacon <will@kernel.org> Link: https://patch.msgid.link/20260330144841.26181-12-will@kernel.org Signed-off-by: Marc Zyngier <maz@kernel.org>	2026-03-30 16:58:07 +01:00
Will Deacon	733774b520	KVM: arm64: Remove is_protected_kvm_enabled() checks from hypercalls When pKVM is not enabled, the host shouldn't issue pKVM-specific hypercalls and so there's no point checking for this in the pKVM hypercall handlers. Remove the redundant is_protected_kvm_enabled() checks from each hypercall and instead rejig the hypercall table so that the pKVM-specific hypercalls are unreachable when pKVM is not being used. Reviewed-by: Quentin Perret <qperret@google.com> Reviewed-by: Fuad Tabba <tabba@google.com> Tested-by: Fuad Tabba <tabba@google.com> Tested-by: Mostafa Saleh <smostafa@google.com> Signed-off-by: Will Deacon <will@kernel.org> Link: https://patch.msgid.link/20260330144841.26181-8-will@kernel.org Signed-off-by: Marc Zyngier <maz@kernel.org>	2026-03-30 16:58:07 +01:00
Will Deacon	be3473c338	KVM: arm64: Don't advertise unsupported features for protected guests Both SVE and PMUv3 are treated as "restricted" features for protected guests and attempts to access their corresponding architectural state from a protected guest result in an undefined exception being injected by the hypervisor. Since these exceptions are unexpected and typically fatal for the guest, don't advertise these features for protected guests. Reviewed-by: Fuad Tabba <tabba@google.com> Tested-by: Fuad Tabba <tabba@google.com> Tested-by: Mostafa Saleh <smostafa@google.com> Signed-off-by: Will Deacon <will@kernel.org> Link: https://patch.msgid.link/20260330144841.26181-6-will@kernel.org Signed-off-by: Marc Zyngier <maz@kernel.org>	2026-03-30 16:58:07 +01:00
Will Deacon	07695f7dc1	KVM: arm64: Disable SPE Profiling Buffer when running in guest context The nVHE world-switch code relies on zeroing PMSCR_EL1 to disable profiling data generation in guest context when SPE is in use by the host. Unfortunately, this may leave PMBLIMITR_EL1.E set and consequently we can end up running in guest/hypervisor context with the Profiling Buffer enabled. The current "known issues" document for Rev M.a of the Arm ARM states that this can lead to speculative, out-of-context translations: \| 2.18 D23136: \| \| When the Profiling Buffer is enabled, profiling is not stopped, and \| Discard mode is not enabled, the Statistical Profiling Unit might \| cause speculative translations for the owning translation regime, \| including when the owning translation regime is out-of-context. In a similar fashion to TRBE, ensure that the Profiling Buffer is disabled during the nVHE world switch before we start messing with the stage-2 MMU and trap configuration. Cc: Marc Zyngier <maz@kernel.org> Cc: Oliver Upton <oupton@kernel.org> Cc: James Clark <james.clark@linaro.org> Cc: Leo Yan <leo.yan@arm.com> Cc: Suzuki K Poulose <suzuki.poulose@arm.com> Cc: Fuad Tabba <tabba@google.com> Cc: Alexandru Elisei <alexandru.elisei@arm.com> Reviewed-by: Alexandru Elisei <alexandru.elisei@arm.com> Reviewed-by: Fuad Tabba <tabba@google.com> Tested-by: Alexandru Elisei <alexandru.elisei@arm.com> Tested-by: Fuad Tabba <tabba@google.com> Fixes: `f85279b4bd` ("arm64: KVM: Save/restore the host SPE state when entering/leaving a VM") Signed-off-by: Will Deacon <will@kernel.org> Link: https://patch.msgid.link/20260327130047.21065-3-will@kernel.org Signed-off-by: Marc Zyngier <maz@kernel.org>	2026-03-28 17:07:49 +00:00
Will Deacon	d133aa75e3	KVM: arm64: Disable TRBE Trace Buffer Unit when running in guest context The nVHE world-switch code relies on zeroing TRFCR_EL1 to disable trace generation in guest context when self-hosted TRBE is in use by the host. Per D3.2.1 ("Controls to prohibit trace at Exception levels"), clearing TRFCR_EL1 means that trace generation is prohibited at EL1 and EL0 but per R_YCHKJ the Trace Buffer Unit will still be enabled if TRBLIMITR_EL1.E is set. R_SJFRQ goes on to state that, when enabled, the Trace Buffer Unit can perform address translation for the "owning exception level" even when it is out of context. Consequently, we can end up in a state where TRBE performs speculative page-table walks for a host VA/IPA in guest/hypervisor context depending on the value of MDCR_EL2.E2TB, which changes over world-switch. The potential result appears to be a heady mixture of SErrors, data corruption and hardware lockups. Extend the TRBE world-switch code to clear TRBLIMITR_EL1.E after draining the buffer, restoring the register on return to the host. This unfortunately means we need to tackle CPU errata #2064142 and #2038923 which add additional synchronisation requirements around manipulations of the limit register. Hopefully this doesn't need to be fast. Cc: Marc Zyngier <maz@kernel.org> Cc: Oliver Upton <oupton@kernel.org> Cc: James Clark <james.clark@linaro.org> Cc: Leo Yan <leo.yan@arm.com> Cc: Suzuki K Poulose <suzuki.poulose@arm.com> Cc: Fuad Tabba <tabba@google.com> Cc: Alexandru Elisei <alexandru.elisei@arm.com> Tested-by: Leo Yan <leo.yan@arm.com> Tested-by: Fuad Tabba <tabba@google.com> Reviewed-by: Suzuki K Poulose <suzuki.poulose@arm.com> Reviewed-by: Fuad Tabba <tabba@google.com> Fixes: `a1319260bf` ("arm64: KVM: Enable access to TRBE support for host") Signed-off-by: Will Deacon <will@kernel.org> Link: https://patch.msgid.link/20260327130047.21065-2-will@kernel.org Signed-off-by: Marc Zyngier <maz@kernel.org>	2026-03-28 17:07:49 +00:00
James Morse	4aab135bda	arm64: mpam: Select ARCH_HAS_CPU_RESCTRL Enough MPAM support is present to enable ARCH_HAS_CPU_RESCTRL. Let it rip^Wlink! ARCH_HAS_CPU_RESCTRL indicates resctrl can be enabled. It is enabled by the arch code simply because it has 'arch' in its name. This removes ARM_CPU_RESCTRL as a mimic of X86_CPU_RESCTRL. While here, move the ACPI dependency to the driver's Kconfig file. Tested-by: Gavin Shan <gshan@redhat.com> Tested-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com> Tested-by: Peter Newman <peternewman@google.com> Tested-by: Zeng Heng <zengheng4@huawei.com> Tested-by: Punit Agrawal <punit.agrawal@oss.qualcomm.com> Tested-by: Jesse Chick <jessechick@os.amperecomputing.com> Reviewed-by: Zeng Heng <zengheng4@huawei.com> Reviewed-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com> Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com> Reviewed-by: Gavin Shan <gshan@redhat.com> Acked-by: Catalin Marinas <catalin.marinas@arm.com> Co-developed-by: Ben Horgan <ben.horgan@arm.com> Signed-off-by: Ben Horgan <ben.horgan@arm.com> Signed-off-by: James Morse <james.morse@arm.com>	2026-03-27 15:32:11 +00:00
James Morse	6789fb9928	arm_mpam: resctrl: Add CDP emulation Intel RDT's CDP feature allows the cache to use a different control value depending on whether the accesses was for instruction fetch or a data access. MPAM's equivalent feature is the other way up: the CPU assigns a different partid label to traffic depending on whether it was instruction fetch or a data access, which causes the cache to use a different control value based solely on the partid. MPAM can emulate CDP, with the side effect that the alternative partid is seen by all MSC, it can't be enabled per-MSC. Add the resctrl hooks to turn this on or off. Add the helpers that match a closid against a task, which need to be aware that the value written to hardware is not the same as the one resctrl is using. Update the 'arm64_mpam_global_default' variable the arch code uses during context switch to know when the per-cpu value should be used instead. Also, update these per-cpu values and sync the resulting mpam partid/pmg configuration to hardware. resctrl can enable CDP for L2 caches, L3 caches or both. When it is enabled by one and not the other MPAM globally enabled CDP but hides the effect on the other cache resource. This hiding is possible as CPOR is the only supported cache control and that uses a resource bitmap; two partids with the same bitmap act as one. Awkwardly, the MB controls don't implement CDP and CDP can't be hidden as the memory bandwidth control is a maximum per partid which can't be modelled with more partids. If the total maximum is used for both the data and instruction partids then then the maximum may be exceeded and if it is split in two then the one using more bandwidth will hit a lower limit. Hence, hide the MB controls completely if CDP is enabled for any resource. Tested-by: Gavin Shan <gshan@redhat.com> Tested-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com> Tested-by: Peter Newman <peternewman@google.com> Tested-by: Zeng Heng <zengheng4@huawei.com> Tested-by: Punit Agrawal <punit.agrawal@oss.qualcomm.com> Tested-by: Jesse Chick <jessechick@os.amperecomputing.com> Cc: Dave Martin <Dave.Martin@arm.com> Cc: Amit Singh Tomar <amitsinght@marvell.com> Reviewed-by: Zeng Heng <zengheng4@huawei.com> Reviewed-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com> Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com> Reviewed-by: Gavin Shan <gshan@redhat.com> Co-developed-by: Ben Horgan <ben.horgan@arm.com> Signed-off-by: Ben Horgan <ben.horgan@arm.com> Signed-off-by: James Morse <james.morse@arm.com>	2026-03-27 15:30:10 +00:00
James Morse	2cf9ca3fae	arm64: mpam: Add helpers to change a task or cpu's MPAM PARTID/PMG values Care must be taken when modifying the PARTID and PMG of a task in any per-task structure as writing these values may race with the task being scheduled in, and reading the modified values. Add helpers to set the task properties, and the CPU default value. These use WRITE_ONCE() that pairs with the READ_ONCE() in mpam_get_regval() to avoid causing torn values. Tested-by: Gavin Shan <gshan@redhat.com> Tested-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com> Tested-by: Peter Newman <peternewman@google.com> Tested-by: Zeng Heng <zengheng4@huawei.com> Tested-by: Punit Agrawal <punit.agrawal@oss.qualcomm.com> Tested-by: Jesse Chick <jessechick@os.amperecomputing.com> Cc: Dave Martin <Dave.Martin@arm.com> Reviewed-by: Zeng Heng <zengheng4@huawei.com> Reviewed-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com> Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> Reviewed-by: Gavin Shan <gshan@redhat.com> Co-developed-by: Ben Horgan <ben.horgan@arm.com> Signed-off-by: Ben Horgan <ben.horgan@arm.com> Signed-off-by: James Morse <james.morse@arm.com>	2026-03-27 15:29:09 +00:00
Ben Horgan	37fe0f984d	arm64: mpam: Initialise and context switch the MPAMSM_EL1 register The MPAMSM_EL1 sets the MPAM labels, PMG and PARTID, for loads and stores generated by a shared SMCU. Disable the traps so the kernel can use it and set it to the same configuration as the per-EL cpu MPAM configuration. If an SMCU is not shared with other cpus then it is implementation defined whether the configuration from MPAMSM_EL1 is used or that from the appropriate MPAMy_ELx. As we set the same, PMG_D and PARTID_D, configuration for MPAM0_EL1, MPAM1_EL1 and MPAMSM_EL1 the resulting configuration is the same regardless. The range of valid configurations for the PARTID and PMG in MPAMSM_EL1 is not currently specified in Arm Architectural Reference Manual but the architect has confirmed that it is intended to be the same as that for the cpu configuration in the MPAMy_ELx registers. Tested-by: Gavin Shan <gshan@redhat.com> Tested-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com> Tested-by: Peter Newman <peternewman@google.com> Tested-by: Zeng Heng <zengheng4@huawei.com> Tested-by: Punit Agrawal <punit.agrawal@oss.qualcomm.com> Tested-by: Jesse Chick <jessechick@os.amperecomputing.com> Reviewed-by: Zeng Heng <zengheng4@huawei.com> Reviewed-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com> Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com> Reviewed-by: Gavin Shan <gshan@redhat.com> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> Reviewed-by: James Morse <james.morse@arm.com> Signed-off-by: Ben Horgan <ben.horgan@arm.com> Signed-off-by: James Morse <james.morse@arm.com>	2026-03-27 15:29:02 +00:00
James Morse	8e06d04ff1	arm64: mpam: Context switch the MPAM registers MPAM allows traffic in the SoC to be labeled by the OS, these labels are used to apply policy in caches and bandwidth regulators, and to monitor traffic in the SoC. The label is made up of a PARTID and PMG value. The x86 equivalent calls these CLOSID and RMID, but they don't map precisely. MPAM has two CPU system registers that is used to hold the PARTID and PMG values that traffic generated at each exception level will use. These can be set per-task by the resctrl file system. (resctrl is the defacto interface for controlling this stuff). Add a helper to switch this. struct task_struct's separate CLOSID and RMID fields are insufficient to implement resctrl using MPAM, as resctrl can change the PARTID (CLOSID) and PMG (sort of like the RMID) separately. On x86, the rmid is an independent number, so a race that writes a mismatched closid and rmid into hardware is benign. On arm64, the pmg bits extend the partid. (i.e. partid-5 has a pmg-0 that is not the same as partid-6's pmg-0). In this case, mismatching the values will 'dirty' a pmg value that resctrl believes is clean, and is not tracking with its 'limbo' code. To avoid this, the partid and pmg are always read and written as a pair. This requires a new u64 field. In struct task_struct there are two u32, rmid and closid for the x86 case, but as we can't use them here do something else. Add this new field, mpam_partid_pmg, to struct thread_info to avoid adding more architecture specific code to struct task_struct. Always use READ_ONCE()/WRITE_ONCE() when accessing this field. Resctrl allows a per-cpu 'default' value to be set, this overrides the values when scheduling a task in the default control-group, which has PARTID 0. The way 'code data prioritisation' gets emulated means the register value for the default group needs to be a variable. The current system register value is kept in a per-cpu variable to avoid writing to the system register if the value isn't going to change. Writes to this register may reset the hardware state for regulating bandwidth. Finally, there is no reason to context switch these registers unless there is a driver changing the values in struct task_struct. Hide the whole thing behind a static key. This also allows the driver to disable MPAM in response to errors reported by hardware. Move the existing static key to belong to the arch code, as in the future the MPAM driver may become a loadable module. All this should depend on whether there is an MPAM driver, hide it behind CONFIG_ARM64_MPAM. Tested-by: Gavin Shan <gshan@redhat.com> Tested-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com> Tested-by: Peter Newman <peternewman@google.com> Tested-by: Zeng Heng <zengheng4@huawei.com> Tested-by: Punit Agrawal <punit.agrawal@oss.qualcomm.com> Tested-by: Jesse Chick <jessechick@os.amperecomputing.com> CC: Amit Singh Tomar <amitsinght@marvell.com> Reviewed-by: Zeng Heng <zengheng4@huawei.com> Reviewed-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com> Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com> Reviewed-by: Gavin Shan <gshan@redhat.com> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> Co-developed-by: Ben Horgan <ben.horgan@arm.com> Signed-off-by: Ben Horgan <ben.horgan@arm.com> Signed-off-by: James Morse <james.morse@arm.com>	2026-03-27 15:27:59 +00:00
Yeoreum Yun	44adf2bf40	arm64: futex: Support futex with FEAT_LSUI Current futex atomic operations are implemented using LL/SC instructions while temporarily clearing PSTATE.PAN and setting PSTATE.TCO (if KASAN_HW_TAGS is enabled). With Armv9.6, FEAT_LSUI provides atomic instructions for user memory access in the kernel without the need for PSTATE bits toggling. Use the FEAT_LSUI instructions to implement the futex atomic operations. Note that some futex operations do not have a matching LSUI instruction, (eor or word-sized cmpxchg). For such cases, use cas{al}t to implement the operation. Signed-off-by: Yeoreum Yun <yeoreum.yun@arm.com> [catalin.marinas@arm.com: add comment on -EAGAIN in __lsui_futex_cmpxchg()] Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>	2026-03-27 12:52:04 +00:00
Yeoreum Yun	eaa3babcce	arm64: futex: Refactor futex atomic operation Refactor the futex atomic operations using ll/sc instructions in preparation for FEAT_LSUI support. In addition, use named operands for the inline asm. No functional change. Signed-off-by: Yeoreum Yun <yeoreum.yun@arm.com> [catalin.marinas@arm.com: remove unnecessary stringify.h include] Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>	2026-03-27 09:47:33 +00:00
Yeoreum Yun	7181f718cb	arm64: cpufeature: Add FEAT_LSUI Since Armv9.6, FEAT_LSUI introduces atomic instructions that allow privileged code to access user memory without clearing the PSTATE.PAN bit. Add CPU feature detection for FEAT_LSUI. Signed-off-by: Yeoreum Yun <yeoreum.yun@arm.com> [catalin.marinas@arm.com: Remove commit log references to SW_PAN] Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>	2026-03-26 18:19:07 +00:00
Ryan Roberts	b7d9d2e3a8	arm64: mm: __ptep_set_access_flags must hint correct TTL It has been reported that since commit `752a0d1d48` ("arm64: mm: Provide level hint for flush_tlb_page()"), the arm64 check_hugetlb_options selftest has been locking up while running "Check child hugetlb memory with private mapping, sync error mode and mmap memory". This is due to hugetlb (and THP) helpers casting their PMD/PUD entries to PTE and calling __ptep_set_access_flags(), which issues a __flush_tlb_page(). Now that this is hinted for level 3, in this case, the TLB entry does not get evicted and we end up in a spurious fault loop. Fix this by creating a __ptep_set_access_flags_anysz() function which takes the pgsize of the entry. It can then add the appropriate hint. The "_anysz" approach is the established pattern for problems of this class. Reported-by: Aishwarya TCV <Aishwarya.TCV@arm.com> Fixes: `752a0d1d48` ("arm64: mm: Provide level hint for flush_tlb_page()") Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>	2026-03-25 18:08:13 +00:00
Marc Zyngier	54a3cc1454	KVM: arm64: Remove extra ISBs when using msr_hcr_el2 The msr_hcr_el2 macro is slightly awkward, as it provides an ISB when CONFIG_AMPERE_ERRATUM_AC04_CPU_23 is present, and none otherwise. Note that this this option is 'default y', meaning that it is likely to be selected. Most instances of msr_hcr_el2 are also immediately followed by an ISB, meaning that in most cases, you end-up with two back-to-back ISBs. This isn't a big deal, but once you have seen that, you can't unsee it. Rework the msr_hcr_el2 macro to always provide the ISB, and drop the superfluous ISBs everywhere else. Reviewed-by: Fuad Tabba <tabba@google.com> Tested-by: Fuad Tabba <tabba@google.com> Link: https://patch.msgid.link/20260321212419.2803972-6-maz@kernel.org Signed-off-by: Marc Zyngier <maz@kernel.org>	2026-03-23 11:03:53 +00:00
Marc Zyngier	59c6e12d40	KVM: arm64: pkvm: Use direct function pointers for cpu_{on,resume} Instead of using a boolean to decide whether a CPU is booting or resuming, just pass an actual function pointer around. This makes the code a bit more straightforward to understand. Reviewed-by: Fuad Tabba <tabba@google.com> Tested-by: Fuad Tabba <tabba@google.com> Link: https://patch.msgid.link/20260321212419.2803972-5-maz@kernel.org Signed-off-by: Marc Zyngier <maz@kernel.org>	2026-03-23 11:03:53 +00:00
Wei-Lin Chang	19e15dc73f	KVM: arm64: nv: Expose shadow page tables in debugfs Exposing shadow page tables in debugfs improves the debugability and testability of NV. With this patch a new directory "nested" is created for each VM created if the host is NV capable. Within the directory each valid s2 mmu will have its shadow page table exposed as a readable file with the file name formatted as 0x<vttbr>-0x<vtcr>-s2-{en,dis}abled. The creation and removal of the files happen at the points when an s2 mmu becomes valid, or the context it represents change. In the future the "nested" directory can also hold other NV related information. This is gated behind CONFIG_PTDUMP_STAGE2_DEBUGFS. Suggested-by: Marc Zyngier <maz@kernel.org> Reviewed-by: Sebastian Ene <sebastianene@google.com> Signed-off-by: Wei-Lin Chang <weilin.chang@arm.com> Reviewed-by: Joey Gouly <joey.gouly@arm.com> Link: https://patch.msgid.link/20260317182638.1592507-3-weilin.chang@arm.com [maz: minor refactor, full 16 chars addresses] Signed-off-by: Marc Zyngier <maz@kernel.org>	2026-03-23 10:06:50 +00:00
Sascha Bischoff	a8946fde86	KVM: arm64: gic-v5: Set ICH_VCTLR_EL2.En on boot This control enables virtual HPPI selection, i.e., selection and delivery of interrupts for a guest (assuming that the guest itself has opted to receive interrupts). This is set to enabled on boot as there is no reason for disabling it in normal operation as virtual interrupt signalling itself is still controlled via the HCR_EL2. Signed-off-by: Sascha Bischoff <sascha.bischoff@arm.com> Link: https://patch.msgid.link/20260319154937.3619520-37-sascha.bischoff@arm.com Signed-off-by: Marc Zyngier <maz@kernel.org>	2026-03-19 18:21:29 +00:00
Sascha Bischoff	5aefaf11f9	KVM: arm64: gic: Hide GICv5 for protected guests We don't support running protected guest with GICv5 at the moment. Therefore, be sure that we don't expose it to the guest at all by actively hiding it when running a protected guest. Signed-off-by: Sascha Bischoff <sascha.bischoff@arm.com> Link: https://patch.msgid.link/20260319154937.3619520-34-sascha.bischoff@arm.com Signed-off-by: Marc Zyngier <maz@kernel.org>	2026-03-19 18:21:29 +00:00
Sascha Bischoff	d1328c6151	KVM: arm64: gic-v5: Trap and mask guest ICC_PPI_ENABLERx_EL1 writes A guest should not be able to detect if a PPI that is not exposed to the guest is implemented or not. Avoid the guest enabling any PPIs that are not implemented as far as the guest is concerned by trapping and masking writes to the two ICC_PPI_ENABLERx_EL1 registers. When a guest writes these registers, the write is masked with the set of PPIs actually exposed to the guest, and the state is written back to KVM's shadow state. As there is now no way for the guest to change the PPI enable state without it being trapped, saving of the PPI Enable state is dropped from guest exit. Reads for the above registers are not masked. When the guest is running and reads from the above registers, it is presented with what KVM provides in the ICH_PPI_ENABLERx_EL2 registers, which is the masked version of what the guest last wrote. Signed-off-by: Sascha Bischoff <sascha.bischoff@arm.com> Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com> Link: https://patch.msgid.link/20260319154937.3619520-25-sascha.bischoff@arm.com Signed-off-by: Marc Zyngier <maz@kernel.org>	2026-03-19 18:21:28 +00:00
Sascha Bischoff	af325e87af	KVM: arm64: gic-v5: Add vgic-v5 save/restore hyp interface Introduce the following hyp functions to save/restore GICv5 state: * __vgic_v5_save_apr() * __vgic_v5_restore_vmcr_apr() * __vgic_v5_save_ppi_state() - no hypercall required * __vgic_v5_restore_ppi_state() - no hypercall required * __vgic_v5_save_state() - no hypercall required * __vgic_v5_restore_state() - no hypercall required Note that the functions tagged as not requiring hypercalls are always called directly from the same context. They are either called via the vgic_save_state()/vgic_restore_state() path when running with VHE, or via __hyp_vgic_save_state()/__hyp_vgic_restore_state() otherwise. This mimics how vgic_v3_save_state()/vgic_v3_restore_state() are implemented. Overall, the state of the following registers is saved/restored: * ICC_ICSR_EL1 * ICH_APR_EL2 * ICH_PPI_ACTIVERx_EL2 * ICH_PPI_DVIRx_EL2 * ICH_PPI_ENABLERx_EL2 * ICH_PPI_PENDRx_EL2 * ICH_PPI_PRIORITYRx_EL2 * ICH_VMCR_EL2 All of these are saved/restored to/from the KVM vgic_v5 CPUIF shadow state, with the exception of the PPI active, pending, and enable state. The pending state is saved and restored from kvm_host_data as any changes here need to be tracked and propagated back to the vgic_irq shadow structures (coming in a future commit). Therefore, an entry and an exit copy is required. The active and enable state is restored from the vgic_v5 CPUIF, but is saved to kvm_host_data. Again, this needs to by synced back into the shadow data structures. The ICSR must be save/restored as this register is shared between host and guest. Therefore, to avoid leaking host state to the guest, this must be saved and restored. Moreover, as this can by used by the host at any time, it must be save/restored eagerly. Note: the host state is not preserved as the host should only use this register when preemption is disabled. As with GICv3, the VMCR is eagerly saved as this is required when checking if interrupts can be injected or not, and therefore impacts things such as WFI. As part of restoring the ICH_VMCR_EL2 and ICH_APR_EL2, GICv3-compat mode is also disabled by setting the ICH_VCTLR_EL2.V3 bit to 0. The correspoinding GICv3-compat mode enable is part of the VMCR & APR restore for a GICv3 guest as it only takes effect when actually running a guest. Co-authored-by: Timothy Hayes <timothy.hayes@arm.com> Signed-off-by: Timothy Hayes <timothy.hayes@arm.com> Signed-off-by: Sascha Bischoff <sascha.bischoff@arm.com> Link: https://patch.msgid.link/20260319154937.3619520-17-sascha.bischoff@arm.com Signed-off-by: Marc Zyngier <maz@kernel.org>	2026-03-19 18:21:28 +00:00
Sascha Bischoff	9d6d9514c0	KVM: arm64: gic-v5: Support GICv5 FGTs & FGUs Extend the existing FGT/FGU infrastructure to include the GICv5 trap registers (ICH_HFGRTR_EL2, ICH_HFGWTR_EL2, ICH_HFGITR_EL2). This involves mapping the trap registers and their bits to the corresponding feature that introduces them (FEAT_GCIE for all, in this case), and mapping each trap bit to the system register/instruction controlled by it. As of this change, none of the GICv5 instructions or register accesses are being trapped. Signed-off-by: Sascha Bischoff <sascha.bischoff@arm.com> Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com> Link: https://patch.msgid.link/20260319154937.3619520-14-sascha.bischoff@arm.com Signed-off-by: Marc Zyngier <maz@kernel.org>	2026-03-19 18:21:27 +00:00
Sascha Bischoff	59991153f0	arm64/sysreg: Add GICR CDNMIA encoding The encoding for the GICR CDNMIA system instruction is thus far unused (and shall remain unused for the time being). However, in order to plumb the FGTs into KVM correctly, KVM needs to be made aware of the encoding of this system instruction. Signed-off-by: Sascha Bischoff <sascha.bischoff@arm.com> Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com> Link: https://patch.msgid.link/20260319154937.3619520-8-sascha.bischoff@arm.com Signed-off-by: Marc Zyngier <maz@kernel.org>	2026-03-19 18:21:27 +00:00
Linus Torvalds	11e8c7e947	ARM: - Correctly handle deeactivation of interrupts that were activated from LRs. Since EOIcount only denotes deactivation of interrupts that are not present in an LR, start EOIcount deactivation walk after the last irq that made it into an LR. - Avoid calling into the stubs to probe for ICH_VTR_EL2.TDS when pKVM is already enabled -- not only thhis isn't possible (pKVM will reject the call), but it is also useless: this can only happen for a CPU that has already booted once, and the capability will not change. - Fix a couple of low-severity bugs in our S2 fault handling path, affecting the recently introduced LS64 handling and the even more esoteric handling of hwpoison in a nested context - Address yet another syzkaller finding in the vgic initialisation, where we would end-up destroying an uninitialised vgic with nasty consequences - Address an annoying case of pKVM failing to boot when some of the memblock regions that the host is faulting in are not page-aligned - Inject some sanity in the NV stage-2 walker by checking the limits against the advertised PA size, and correctly report the resulting faults PPC: - Fix a PPC e500 build error due to a long-standing wart that was exposed by the recent conversion to kmalloc_obj(); rip out all the ugliness that led to the wart. RISC-V: - Prevent speculative out-of-bounds access using array_index_nospec() in APLIC interrupt handling, ONE_REG regiser access, AIA CSR access, float register access, and PMU counter access - Fix potential use-after-free issues in kvm_riscv_gstage_get_leaf(), kvm_riscv_aia_aplic_has_attr(), and kvm_riscv_aia_imsic_has_attr() - Fix potential null pointer dereference in kvm_riscv_vcpu_aia_rmw_topei() - Fix off-by-one array access in SBI PMU - Skip THP support check during dirty logging - Fix error code returned for Smstateen and Ssaia ONE_REG interface - Check host Ssaia extension when creating AIA irqchip x86: - Fix cases where CPUID mitigation features were incorrectly marked as available whenever the kernel used scattered feature words for them. - Validate _all_ GVAs, rather than just the first GVA, when processing a range of GVAs for Hyper-V's TLB flush hypercalls. - Fix a brown paper bug in add_atomic_switch_msr(). - Use hlist_for_each_entry_srcu() when traversing mask_notifier_list, to fix a lockdep warning; KVM doesn't hold RCU, just irq_srcu. - Ensure AVIC VMCB fields are initialized if the VM has an in-kernel local APIC (and AVIC is enabled at the module level). - Update CR8 write interception when AVIC is (de)activated, to fix a bug where the guest can run in perpetuity with the CR8 intercept enabled. - Add a quirk to skip the consistency check on FREEZE_IN_SMM, i.e. to allow L1 hypervisors to set FREEZE_IN_SMM. This reverts (by default) an unintentional tightening of userspace ABI in 6.17, and provides some amount of backwards compatibility with hypervisors who want to freeze PMCs on VM-Entry. - Validate the VMCS/VMCB on return to a nested guest from SMM, because either userspace or the guest could stash invalid values in memory and trigger the processor's consistency checks. Generic: - Remove a subtle pseudo-overlay of kvm_stats_desc, which, aside from being unnecessary and confusing, triggered compiler warnings due to -Wflex-array-member-not-at-end. - Document that vcpu->mutex is take outside of kvm->slots_lock and kvm->slots_arch_lock, which is intentional and desirable despite being rather unintuitive. Selftests: - Increase the maximum number of NUMA nodes in the guest_memfd selftest to 64 (from 8). -----BEGIN PGP SIGNATURE----- iQFIBAABCgAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmmy6n8UHHBib256aW5p QHJlZGhhdC5jb20ACgkQv/vSX3jHroNX7ggAhWoCG+AE6P3yrp6Mi+nRYpeRGC3q q2IiZCn0UoCg6q3c2kgn7b/N2zLJs0Q8FZRCEp2Je+2uvptpmdp/BMEfiIU3n2/a 61z+Dydbpyc+kUmhJzUJ+aotq5FnMNmAAmqSKoc19GhAx2OQhQmBP/JOZ0P/eqLE Is0qNBgr/Zms2ib3GFf/JT+urysL2mX47qe92HTzq1T9EEG0KleID0Jz8vYQI8Fr I5N9+lTxagQDi8ytwOM85Cn8K7wh+CQIgzmciHcVErpAvAWkrEjrPlQltpEz2C5B aWEcRgw46utEaAiwPQGJRW6TeoKUG0pUR3v6T90nBkjjJ1npm6gPVE6TBA== =7nQ9 -----END PGP SIGNATURE----- Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm Pull kvm fixes from Paolo Bonzini: "Quite a large pull request, partly due to skipping last week and therefore having material from ~all submaintainers in this one. About a fourth of it is a new selftest, and a couple more changes are large in number of files touched (fixing a -Wflex-array-member-not-at-end compiler warning) or lines changed (reformatting of a table in the API documentation, thanks rST). But who am I kidding---it's a lot of commits and there are a lot of bugs being fixed here, some of them on the nastier side like the RISC-V ones. ARM: - Correctly handle deactivation of interrupts that were activated from LRs. Since EOIcount only denotes deactivation of interrupts that are not present in an LR, start EOIcount deactivation walk after the last irq that made it into an LR - Avoid calling into the stubs to probe for ICH_VTR_EL2.TDS when pKVM is already enabled -- not only thhis isn't possible (pKVM will reject the call), but it is also useless: this can only happen for a CPU that has already booted once, and the capability will not change - Fix a couple of low-severity bugs in our S2 fault handling path, affecting the recently introduced LS64 handling and the even more esoteric handling of hwpoison in a nested context - Address yet another syzkaller finding in the vgic initialisation, where we would end-up destroying an uninitialised vgic with nasty consequences - Address an annoying case of pKVM failing to boot when some of the memblock regions that the host is faulting in are not page-aligned - Inject some sanity in the NV stage-2 walker by checking the limits against the advertised PA size, and correctly report the resulting faults PPC: - Fix a PPC e500 build error due to a long-standing wart that was exposed by the recent conversion to kmalloc_obj(); rip out all the ugliness that led to the wart RISC-V: - Prevent speculative out-of-bounds access using array_index_nospec() in APLIC interrupt handling, ONE_REG regiser access, AIA CSR access, float register access, and PMU counter access - Fix potential use-after-free issues in kvm_riscv_gstage_get_leaf(), kvm_riscv_aia_aplic_has_attr(), and kvm_riscv_aia_imsic_has_attr() - Fix potential null pointer dereference in kvm_riscv_vcpu_aia_rmw_topei() - Fix off-by-one array access in SBI PMU - Skip THP support check during dirty logging - Fix error code returned for Smstateen and Ssaia ONE_REG interface - Check host Ssaia extension when creating AIA irqchip x86: - Fix cases where CPUID mitigation features were incorrectly marked as available whenever the kernel used scattered feature words for them - Validate _all_ GVAs, rather than just the first GVA, when processing a range of GVAs for Hyper-V's TLB flush hypercalls - Fix a brown paper bug in add_atomic_switch_msr() - Use hlist_for_each_entry_srcu() when traversing mask_notifier_list, to fix a lockdep warning; KVM doesn't hold RCU, just irq_srcu - Ensure AVIC VMCB fields are initialized if the VM has an in-kernel local APIC (and AVIC is enabled at the module level) - Update CR8 write interception when AVIC is (de)activated, to fix a bug where the guest can run in perpetuity with the CR8 intercept enabled - Add a quirk to skip the consistency check on FREEZE_IN_SMM, i.e. to allow L1 hypervisors to set FREEZE_IN_SMM. This reverts (by default) an unintentional tightening of userspace ABI in 6.17, and provides some amount of backwards compatibility with hypervisors who want to freeze PMCs on VM-Entry - Validate the VMCS/VMCB on return to a nested guest from SMM, because either userspace or the guest could stash invalid values in memory and trigger the processor's consistency checks Generic: - Remove a subtle pseudo-overlay of kvm_stats_desc, which, aside from being unnecessary and confusing, triggered compiler warnings due to -Wflex-array-member-not-at-end - Document that vcpu->mutex is take outside of kvm->slots_lock and kvm->slots_arch_lock, which is intentional and desirable despite being rather unintuitive Selftests: - Increase the maximum number of NUMA nodes in the guest_memfd selftest to 64 (from 8)" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (43 commits) KVM: selftests: Verify SEV+ guests can read and write EFER, CR0, CR4, and CR8 Documentation: kvm: fix formatting of the quirks table KVM: x86: clarify leave_smm() return value selftests: kvm: add a test that VMX validates controls on RSM selftests: kvm: extract common functionality out of smm_test.c KVM: SVM: check validity of VMCB controls when returning from SMM KVM: VMX: check validity of VMCS controls when returning from SMM KVM: SVM: Set/clear CR8 write interception when AVIC is (de)activated KVM: SVM: Initialize AVIC VMCB fields if AVIC is enabled with in-kernel APIC KVM: x86: Introduce KVM_X86_QUIRK_VMCS12_ALLOW_FREEZE_IN_SMM KVM: x86: Fix SRCU list traversal in kvm_fire_mask_notifiers() KVM: VMX: Fix a wrong MSR update in add_atomic_switch_msr() KVM: x86: hyper-v: Validate all GVAs during PV TLB flush KVM: x86: synthesize CPUID bits only if CPU capability is set KVM: PPC: e500: Rip out "struct tlbe_ref" KVM: PPC: e500: Fix build error due to using kmalloc_obj() with wrong type KVM: selftests: Increase 'maxnode' for guest_memfd tests KVM: arm64: pkvm: Don't reprobe for ICH_VTR_EL2.TDS on CPU hotplug KVM: arm64: vgic: Pick EOIcount deactivations from AP-list tail KVM: arm64: Remove the redundant ISB in __kvm_at_s1e2() ...	2026-03-15 12:22:10 -07:00
Anshuman Khandual	be6e9dee0e	arm64/mm: Directly use TTBRx_EL1_CnP Replace all TTBR_CNP_BIT macro instances with TTBRx_EL1_CNP_BIT which is a standard field from tools sysreg format. Drop the now redundant custom macro TTBR_CNP_BIT. No functional change. Cc: Will Deacon <will@kernel.org> Cc: Marc Zyngier <maz@kernel.org> Cc: Oliver Upton <oupton@kernel.org> Cc: linux-arm-kernel@lists.infradead.org Cc: linux-kernel@vger.kernel.org Cc: kvmarm@lists.linux.dev Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>	2026-03-14 16:12:27 +00:00
Anshuman Khandual	d989010bbe	arm64/mm: Directly use TTBRx_EL1_ASID_MASK Replace all TTBR_ASID_MASK macro instances with TTBRx_EL1_ASID_MASK which is a standard field mask from tools sysreg format. Drop the now redundant custom macro TTBR_ASID_MASK. No functional change. Cc: Will Deacon <will@kernel.org> Cc: Marc Zyngier <maz@kernel.org> Cc: Oliver Upton <oupton@kernel.org> Cc: linux-arm-kernel@lists.infradead.org Cc: linux-kernel@vger.kernel.org Cc: kvmarm@lists.linux.dev Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>	2026-03-14 16:12:27 +00:00
Anshuman Khandual	2615924e45	arm64/mm: Describe TTBR1_BADDR_4852_OFFSET TTBR1_BADDR_4852_OFFSET is a constant offset which gets added into kernel page table physical address for TTBR1_EL1 when kernel is build for 52 bit VA but found to be running on 48 bit VA capable system. Although there is no explanation on how the macro is computed. Describe TTBR1_BADDR_4852_OFFSET computation in detail via deriving from all required parameters involved thus improving clarity and readability. Cc: Will Deacon <will@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: linux-arm-kernel@lists.infradead.org Cc: linux-kernel@vger.kernel.org Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>	2026-03-14 16:11:44 +00:00
Barry Song	d7eafe655b	dma-mapping: Separate DMA sync issuing and completion waiting Currently, arch_sync_dma_for_cpu and arch_sync_dma_for_device always wait for the completion of each DMA buffer. That is, issuing the DMA sync and waiting for completion is done in a single API call. For scatter-gather lists with multiple entries, this means issuing and waiting is repeated for each entry, which can hurt performance. Architectures like ARM64 may be able to issue all DMA sync operations for all entries first and then wait for completion together. To address this, arch_sync_dma_for_* now batches DMA operations and performs a flush afterward. On ARM64, the flush is implemented with a dsb instruction in arch_sync_dma_flush(). On other architectures, arch_sync_dma_flush() is currently a nop. Cc: Leon Romanovsky <leon@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will@kernel.org> Cc: Marek Szyprowski <m.szyprowski@samsung.com> Cc: Robin Murphy <robin.murphy@arm.com> Cc: Ada Couprie Diaz <ada.coupriediaz@arm.com> Cc: Ard Biesheuvel <ardb@kernel.org> Cc: Marc Zyngier <maz@kernel.org> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Joerg Roedel <joro@8bytes.org> Cc: Stefano Stabellini <sstabellini@kernel.org> Cc: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com> Cc: Tangquan Zheng <zhengtangquan@oppo.com> Reviewed-by: Juergen Gross <jgross@suse.com> # drivers/xen/swiotlb-xen.c Tested-by: Xueyuan Chen <xueyuan.chen21@gmail.com> Signed-off-by: Barry Song <baohua@kernel.org> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com> Link: https://lore.kernel.org/r/20260228221316.59934-1-21cnbao@gmail.com	2026-03-13 23:47:31 +01:00
Barry Song	cf875c4b68	arm64: Provide dcache_inval_poc_nosync helper dcache_inval_poc_nosync does not wait for the data cache invalidation to complete. Later, we defer the synchronization so we can wait for all SG entries together. Cc: Leon Romanovsky <leon@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will@kernel.org> Cc: Marek Szyprowski <m.szyprowski@samsung.com> Cc: Robin Murphy <robin.murphy@arm.com> Cc: Ada Couprie Diaz <ada.coupriediaz@arm.com> Cc: Ard Biesheuvel <ardb@kernel.org> Cc: Marc Zyngier <maz@kernel.org> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Tangquan Zheng <zhengtangquan@oppo.com> Tested-by: Xueyuan Chen <xueyuan.chen21@gmail.com> Signed-off-by: Barry Song <baohua@kernel.org> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com> Link: https://lore.kernel.org/r/20260228221258.59918-1-21cnbao@gmail.com	2026-03-13 23:47:16 +01:00
Barry Song	1c3a7f9e6b	arm64: Provide dcache_clean_poc_nosync helper dcache_clean_poc_nosync does not wait for the data cache clean to complete. Later, we wait for completion of all scatter-gather entries together. Cc: Leon Romanovsky <leon@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will@kernel.org> Cc: Marek Szyprowski <m.szyprowski@samsung.com> Cc: Robin Murphy <robin.murphy@arm.com> Cc: Ada Couprie Diaz <ada.coupriediaz@arm.com> Cc: Ard Biesheuvel <ardb@kernel.org> Cc: Marc Zyngier <maz@kernel.org> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Tangquan Zheng <zhengtangquan@oppo.com> Tested-by: Xueyuan Chen <xueyuan.chen21@gmail.com> Signed-off-by: Barry Song <baohua@kernel.org> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com> Link: https://lore.kernel.org/r/20260228221239.59903-1-21cnbao@gmail.com	2026-03-13 23:46:48 +01:00
Barry Song	2c92eff008	arm64: Provide dcache_by_myline_op_nosync helper dcache_by_myline_op ensures completion of the data cache operations for a region, while dcache_by_myline_op_nosync only issues them without waiting. This enables deferred synchronization so completion for multiple regions can be handled together later. Cc: Leon Romanovsky <leon@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will@kernel.org> Cc: Marek Szyprowski <m.szyprowski@samsung.com> Cc: Robin Murphy <robin.murphy@arm.com> Cc: Ada Couprie Diaz <ada.coupriediaz@arm.com> Cc: Ard Biesheuvel <ardb@kernel.org> Cc: Marc Zyngier <maz@kernel.org> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Tangquan Zheng <zhengtangquan@oppo.com> Tested-by: Xueyuan Chen <xueyuan.chen21@gmail.com> Signed-off-by: Barry Song <baohua@kernel.org> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com> Link: https://lore.kernel.org/r/20260228221216.59886-1-21cnbao@gmail.com	2026-03-13 23:46:32 +01:00
Ryan Roberts	752a0d1d48	arm64: mm: Provide level hint for flush_tlb_page() Previously tlb invalidations issued by __flush_tlb_page() did not contain a level hint. From the core API documentation, this function is clearly only ever intended to target level 3 (PTE) tlb entries: \| 4) ``void flush_tlb_page(struct vm_area_struct *vma, unsigned long addr)`` \| \| This time we need to remove the PAGE_SIZE sized translation \| from the TLB. However, the arm64 documentation is more relaxed allowing any last level: \| this operation only invalidates a single, last-level page-table \| entry and therefore does not affect any walk-caches It turns out that the function was actually being used to invalidate a level 2 mapping via flush_tlb_fix_spurious_fault_pmd(). The bug was benign because the level hint was not set so the HW would still invalidate the PMD mapping, and also because the TLBF_NONOTIFY flag was set, the bounds of the mapping were never used for anything else. Now that we have the new and improved range-invalidation API, it is trival to fix flush_tlb_fix_spurious_fault_pmd() to explicitly flush the whole range (locally, without notification and last level only). So let's do that, and then update __flush_tlb_page() to hint level 3. Reviewed-by: Linu Cherian <linu.cherian@arm.com> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> [catalin.marinas@arm.com: use "level 3" in the __flush_tlb_page() description] [catalin.marinas@arm.com: tweak the commit message to include the core API text] Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>	2026-03-13 19:02:27 +00:00
Ryan Roberts	15397e3c38	arm64: mm: Wrap flush_tlb_page() around __do_flush_tlb_range() Flushing a page from the tlb is just a special case of flushing a range. So let's rework flush_tlb_page() so that it simply wraps __do_flush_tlb_range(). While at it, let's also update the API to take the same flags that we use when flushing a range. This allows us to delete all the ugly "_nosync", "_local" and "_nonotify" variants. Thanks to constant folding, all of the complex looping and tlbi-by-range options get eliminated so that the generated code for flush_tlb_page() looks very similar to the previous version. Reviewed-by: Linu Cherian <linu.cherian@arm.com> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>	2026-03-13 17:26:26 +00:00
Ryan Roberts	0477fc5696	arm64: mm: More flags for __flush_tlb_range() Refactor function variants with "_nosync", "_local" and "_nonotify" into a single __always_inline implementation that takes flags and rely on constant folding to select the parts that are actually needed at any given callsite, based on the provided flags. Flags all live in the tlbf_t (TLB flags) type; TLBF_NONE (0) continues to provide the strongest semantics (i.e. evict from walk cache, broadcast, synchronise and notify). Each flag reduces the strength in some way; TLBF_NONOTIFY, TLBF_NOSYNC and TLBF_NOBROADCAST are added to complement the existing TLBF_NOWALKCACHE. There are no users that require TLBF_NOBROADCAST without TLBF_NOWALKCACHE so implement that as BUILD_BUG() to avoid needing to introduce dead code for vae1 invalidations. The result is a clearer, simpler, more powerful API. Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>	2026-03-13 17:24:09 +00:00
Ryan Roberts	11f6dd8dd2	arm64: mm: Refactor __flush_tlb_range() to take flags We have function variants with "_nosync", "_local", "_nonotify" as well as the "last_level" parameter. Let's generalize and simplify by using a flags parameter to encode all these variants. As a first step, convert the "last_level" boolean parameter to a flags parameter and create the first flag, TLBF_NOWALKCACHE. When present, walk cache entries are not evicted, which is the same as the old last_level=true. Reviewed-by: Linu Cherian <linu.cherian@arm.com> Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>	2026-03-13 17:23:04 +00:00
Ryan Roberts	64212d6893	arm64: mm: Refactor flush_tlb_page() to use __tlbi_level_asid() Now that we have __tlbi_level_asid(), let's refactor the flush_tlb_page() variants to use it rather than open coding. The emitted tlbi(s) is/are intended to be exactly the same as before; no TTL hint is provided. Although the spec for flush_tlb_page() allows for setting the TTL hint to 3, it turns out that flush_tlb_fix_spurious_fault_pmd() depends on local_flush_tlb_page_nonotify() to invalidate the level 2 entry. This will be fixed separately. Reviewed-by: Linu Cherian <linu.cherian@arm.com> Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>	2026-03-13 17:23:04 +00:00
Will Deacon	c753d667d9	arm64: mm: Simplify __flush_tlb_range_limit_excess() __flush_tlb_range_limit_excess() is unnecessarily complicated: - It takes a 'start', 'end' and 'pages' argument, whereas it only needs 'pages' (which the caller has computed from the other two arguments!). - It erroneously compares 'pages' with MAX_TLBI_RANGE_PAGES when the system doesn't support range-based invalidation but the range to be invalidated would result in fewer than MAX_DVM_OPS invalidations. Simplify the function so that it no longer takes the 'start' and 'end' arguments and only considers the MAX_TLBI_RANGE_PAGES threshold on systems that implement range-based invalidation. Signed-off-by: Will Deacon <will@kernel.org> Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>	2026-03-13 17:23:04 +00:00
Will Deacon	057bbd8e06	arm64: mm: Simplify __TLBI_RANGE_NUM() macro Since commit `e2768b798a` ("arm64/mm: Modify range-based tlbi to decrement scale"), we don't need to clamp the 'pages' argument to fit the range for the specified 'scale' as we know that the upper bits will have been processed in a prior iteration. Drop the clamping and simplify the __TLBI_RANGE_NUM() macro. Signed-off-by: Will Deacon <will@kernel.org> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com> Reviewed-by: Dev Jain <dev.jain@arm.com> Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>	2026-03-13 17:23:04 +00:00
Ryan Roberts	5e63b73f3d	arm64: mm: Re-implement the __flush_tlb_range_op macro in C The __flush_tlb_range_op() macro is horrible and has been a previous source of bugs thanks to multiple expansions of its arguments (see commit `f7edb07ad7` ("Fix mmu notifiers for range-based invalidates")). Rewrite the thing in C. Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Co-developed-by: Will Deacon <will@kernel.org> Signed-off-by: Will Deacon <will@kernel.org> Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>	2026-03-13 17:23:04 +00:00
Will Deacon	d4b048ca14	arm64: mm: Inline __TLBI_VADDR_RANGE() into __tlbi_range() The __TLBI_VADDR_RANGE() macro is only used in one place and isn't something that's generally useful outside of the low-level range invalidation gubbins. Inline __TLBI_VADDR_RANGE() into the __tlbi_range() function so that the macro can be removed entirely. Signed-off-by: Will Deacon <will@kernel.org> Reviewed-by: Linu Cherian <linu.cherian@arm.com> Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>	2026-03-13 17:23:04 +00:00
Will Deacon	a371003560	arm64: mm: Push __TLBI_VADDR() into __tlbi_level() The __TLBI_VADDR() macro takes an ASID and an address and converts them into a single argument formatted correctly for a TLB invalidation instruction. Rather than have callers worry about this (especially in the case where the ASID is zero), push the macro down into __tlbi_level() via a new __tlbi_level_asid() helper. Signed-off-by: Will Deacon <will@kernel.org> Reviewed-by: Linu Cherian <linu.cherian@arm.com> Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>	2026-03-13 17:23:03 +00:00
Ryan Roberts	edc55b7abb	arm64: mm: Implicitly invalidate user ASID based on TLBI operation When kpti is enabled, separate ASIDs are used for userspace and kernelspace, requiring ASID-qualified TLB invalidation by virtual address to invalidate both of them. Push the logic for invalidating the two ASIDs down into the low-level tlbi-op-specific functions and remove the burden from the caller to handle the kpti-specific behaviour. Co-developed-by: Will Deacon <will@kernel.org> Signed-off-by: Will Deacon <will@kernel.org> Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>	2026-03-13 17:23:03 +00:00
Ryan Roberts	d2bf322695	arm64: mm: Introduce a C wrapper for by-range TLB invalidation As part of efforts to reduce our reliance on complex preprocessor macros for TLB invalidation routines, introduce a new C wrapper for by-range TLB invalidation which can be used instead of the __tlbi() macro and can additionally be called from C code. Each specific tlbi range op is implemented as a C function and the appropriate function pointer is passed to __tlbi_range(). Since everything is declared inline and is statically resolvable, the compiler will convert the indirect function call to a direct inline execution. Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>	2026-03-13 17:23:03 +00:00
Ryan Roberts	5b3fb8a6b4	arm64: mm: Re-implement the __tlbi_level macro as a C function As part of efforts to reduce our reliance on complex preprocessor macros for TLB invalidation routines, convert the __tlbi_level macro to a C function for by-level TLB invalidation. Each specific tlbi level op is implemented as a C function and the appropriate function pointer is passed to __tlbi_level(). Since everything is declared inline and is statically resolvable, the compiler will convert the indirect function call to a direct inline execution. Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>	2026-03-13 17:23:03 +00:00
Will Deacon	3ce8f5860f	arm64: scs: Remove redundant save/restore of SCS SP on entry to/from EL0 When returning to userspace, the SCS is empty and so the SCS SP just points to the base address of the SCS page. Rather than saving and restoring this address in the current task, we can simply restore the SCS SP to point at the base of the stack on entry to EL1 from EL0. Cc: Ard Biesheuvel <ardb@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Will Deacon <will@kernel.org> Acked-by: Mark Rutland <mark.rutland@arm.com> Acked-by: Ard Biesheuvel <ardb@kernel.org> Reviewed-by: Sami Tolvanen <samitolvanen@google.com> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>	2026-03-13 17:17:58 +00:00
Tobias Huschle	cf8771ca4c	mm/page_table_check: Pass mm_struct to pxx_user_accessible_page() Unlike other architectures, s390 does not have means to distinguish kernel vs user page table entries - neither an entry itself, nor the address could be used for that. It is only the mm_struct that indicates whether an entry in question is mapped to a user space. So pass mm_struct to pxx_user_accessible_page() callbacks. [agordeev@linux.ibm.com: rephrased commit message, removed braces] Acked-by: Madhavan Srinivasan <maddy@linux.ibm.com> Reviewed-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> #powerpc Signed-off-by: Tobias Huschle <huschle@linux.ibm.com> Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com> Link: https://lore.kernel.org/r/ca77f3489453c2fe01b25e50e53b778929e0dfc5.1772812343.git.agordeev@linux.ibm.com Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>	2026-03-13 00:07:47 +01:00
Thomas Weißschuh	2b8cf39d7e	arm64: vDSO: compat_gettimeofday: Add explicit includes The reference to VDSO_CLOCKMODE_ARCHTIMER requires vdso/clocksource.h and 'struct old_timespec32' requires vdso/time32.h. Currently these headers are included transitively, but those transitive inclusions are about to go away. Explicitly include the headers. Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Acked-by: Catalin Marinas <catalin.marinas@arm.com> Link: https://patch.msgid.link/20260227-vdso-header-cleanups-v2-2-35d60acf7410@linutronix.de	2026-03-11 15:12:22 +01:00
Thomas Weißschuh	b18ec8b5e0	arm64: vDSO: gettimeofday: Explicitly include vdso/clocksource.h The reference to VDSO_CLOCKMODE_NONE requires vdso/clocksource.h. Currently this header is included transitively, but that transitive inclusion is about to go away. Explicitly include the header. Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Acked-by: Catalin Marinas <catalin.marinas@arm.com> Link: https://patch.msgid.link/20260227-vdso-header-cleanups-v2-1-35d60acf7410@linutronix.de	2026-03-11 15:12:15 +01:00
Vincent Donnefort	5bbbed42f7	KVM: arm64: Add selftest event support to nVHE/pKVM hyp Add a selftest event that can be triggered from a `write_event` tracefs file. This intends to be used by trace remote selftests. Signed-off-by: Vincent Donnefort <vdonnefort@google.com> Link: https://patch.msgid.link/20260309162516.2623589-30-vdonnefort@google.com Signed-off-by: Marc Zyngier <maz@kernel.org>	2026-03-11 08:51:17 +00:00
Vincent Donnefort	696dfec22b	KVM: arm64: Add hyp_enter/hyp_exit events to nVHE/pKVM hyp The hyp_enter and hyp_exit events are logged by the hypervisor any time it is entered and exited. Signed-off-by: Vincent Donnefort <vdonnefort@google.com> Link: https://patch.msgid.link/20260309162516.2623589-29-vdonnefort@google.com Signed-off-by: Marc Zyngier <maz@kernel.org>	2026-03-11 08:51:17 +00:00
Vincent Donnefort	0a90fbc8a1	KVM: arm64: Add event support to the nVHE/pKVM hyp and trace remote Allow the creation of hypervisor and trace remote events with a single macro HYP_EVENT(). That macro expands in the kernel side to add all the required declarations (based on REMOTE_EVENT()) as well as in the hypervisor side to create the trace_<event>() function. Signed-off-by: Vincent Donnefort <vdonnefort@google.com> Link: https://patch.msgid.link/20260309162516.2623589-28-vdonnefort@google.com Signed-off-by: Marc Zyngier <maz@kernel.org>	2026-03-11 08:51:17 +00:00
Vincent Donnefort	2194d317e0	KVM: arm64: Add trace reset to the nVHE/pKVM hyp Make the hypervisor reset either the whole tracing buffer or a specific ring-buffer, on remotes/hypervisor/trace or per_cpu/<cpu>/trace write access. Signed-off-by: Vincent Donnefort <vdonnefort@google.com> Link: https://patch.msgid.link/20260309162516.2623589-27-vdonnefort@google.com Signed-off-by: Marc Zyngier <maz@kernel.org>	2026-03-11 08:51:16 +00:00
Vincent Donnefort	b22888917f	KVM: arm64: Sync boot clock with the nVHE/pKVM hyp Configure the hypervisor tracing clock with the kernel boot clock. For tracing purposes, the boot clock is interesting: it doesn't stop on suspend. However, it is corrected on a regular basis, which implies the need to re-evaluate it every once in a while. Cc: John Stultz <jstultz@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Stephen Boyd <sboyd@kernel.org> Cc: Christopher S. Hall <christopher.s.hall@intel.com> Cc: Richard Cochran <richardcochran@gmail.com> Signed-off-by: Vincent Donnefort <vdonnefort@google.com> Link: https://patch.msgid.link/20260309162516.2623589-26-vdonnefort@google.com Signed-off-by: Marc Zyngier <maz@kernel.org>	2026-03-11 08:51:16 +00:00
Vincent Donnefort	680a04c333	KVM: arm64: Add tracing capability for the nVHE/pKVM hyp There is currently no way to inspect or log what's happening at EL2 when the nVHE or pKVM hypervisor is used. With the growing set of features for pKVM, the need for tooling is more pressing. And tracefs, by its reliability, versatility and support for user-space is fit for purpose. Add support to write into a tracefs compatible ring-buffer. There's no way the hypervisor could log events directly into the host tracefs ring-buffers. So instead let's use our own, where the hypervisor is the writer and the host the reader. Signed-off-by: Vincent Donnefort <vdonnefort@google.com> Link: https://patch.msgid.link/20260309162516.2623589-24-vdonnefort@google.com Signed-off-by: Marc Zyngier <maz@kernel.org>	2026-03-11 08:51:16 +00:00
Vincent Donnefort	8bbeb4d169	KVM: arm64: Initialise hyp_nr_cpus for nVHE hyp Knowing the number of CPUs is necessary for determining the boundaries of per-cpu variables, which will be used for upcoming hypervisor tracing. hyp_nr_cpus which stores this value, is only initialised for the pKVM hypervisor. Make it accessible for the nVHE hypervisor as well. With the kernel now responsible for initialising hyp_nr_cpus, the nr_cpus parameter is no longer needed in __pkvm_init. Signed-off-by: Vincent Donnefort <vdonnefort@google.com> Link: https://patch.msgid.link/20260309162516.2623589-22-vdonnefort@google.com Signed-off-by: Marc Zyngier <maz@kernel.org>	2026-03-11 08:51:16 +00:00
Marc Zyngier	6da5e537f5	KVM: arm64: vgic: Pick EOIcount deactivations from AP-list tail Valentine reports that their guests fail to boot correctly, losing interrupts, and indicates that the wrong interrupt gets deactivated. What happens here is that if the maintenance interrupt is slow enough to kick us out of the guest, extra interrupts can be activated from the LRs. We then exit and proceed to handle EOIcount deactivations, picking active interrupts from the AP list. But we start from the top of the list, potentially deactivating interrupts that were in the LRs, while EOIcount only denotes deactivation of interrupts that are not present in an LR. Solve this by tracking the last interrupt that made it in the LRs, and start the EOIcount deactivation walk after that interrupt. Since this only makes sense while the vcpu is loaded, stash this in the per-CPU host state. Huge thanks to Valentine for doing all the detective work and providing an initial patch. Fixes: `3cfd59f81e` ("KVM: arm64: GICv3: Handle LR overflow when EOImode==0") Fixes: `281c6c06e2` ("KVM: arm64: GICv2: Handle LR overflow when EOImode==0") Reported-by: Valentine Burley <valentine.burley@collabora.com> Tested-by: Valentine Burley <valentine.burley@collabora.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20260307115955.369455-1-valentine.burley@collabora.com Link: https://patch.msgid.link/20260307191151.3781182-1-maz@kernel.org Cc: stable@vger.kernel.org	2026-03-07 21:45:58 +00:00
Linus Torvalds	4660e168c6	arm64 fixes for -rc3 - Fix kexec/hibernation hang due to bogus read-only mappings. - Fix sparse warnings in our cmpxchg() implementation. - Prevent runtime-const being used in modules, just like x86. - Fix broken elision of access flag modifications for contiguous entries on systems without support for hardware updates. - Fix a broken SVE selftest that was testing the wrong instruction. -----BEGIN PGP SIGNATURE----- iQFEBAABCgAuFiEEPxTL6PPUbjXGY88ct6xw3ITBYzQFAmmrH5wQHHdpbGxAa2Vy bmVsLm9yZwAKCRC3rHDchMFjNLiWB/40+A3Q3gz9VB3obupFeC/s688TjGMwLbIO m03Qu/ulGwBZaPRPZxsxnr8pFZKjSple5NJiHv5kQ/wR4Cfc4zwF2zOSdRvAI/c3 qPT2YL0CcVt0OgbWd2VCjiThTuFREewdCqRWbmkDaPYd27k0KWY14gHHpriRw7XM QY0OOz8wrWi3lg2Wyvub9wWLkyjKtFlrkwZaACD5D90k/CwKVgncC1z4vh41hQxk qjxdygNJt2sV+31+F7QMoY/rbyVnUkdSvWSwe9z2Bs9mwebaoGgx4c1l47Wq+oQD NiVgHOZnPQkDgd2MWkUiCwzAr6C3B0aF2BCu+NTgILkbX7PyG792 =knFu -----END PGP SIGNATURE----- Merge tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux Pull arm64 fixes from Will Deacon: "The main changes are a fix to the way in which we manage the access flag setting for mappings using the contiguous bit and a fix for a hang on the kexec/hibernation path. Summary: - Fix kexec/hibernation hang due to bogus read-only mappings - Fix sparse warnings in our cmpxchg() implementation - Prevent runtime-const being used in modules, just like x86 - Fix broken elision of access flag modifications for contiguous entries on systems without support for hardware updates - Fix a broken SVE selftest that was testing the wrong instruction" * tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: selftest/arm64: Fix sve2p1_sigill() to hwcap test arm64: contpte: fix set_access_flags() no-op check for SMMU/ATS faults arm64: make runtime const not usable by modules arm64: mm: Add PTE_DIRTY back to PAGE_KERNEL* to fix kexec/hibernation arm64: Silence sparse warnings caused by the type casting in (cmp)xchg	2026-03-06 19:57:03 -08:00
Jisheng Zhang	0100e495cd	arm64: make runtime const not usable by modules Similar as commit `284922f4c5` ("x86: uaccess: don't use runtime-const rewriting in modules") does, make arm64's runtime const not usable by modules too, to "make sure this doesn't get forgotten the next time somebody wants to do runtime constant optimizations". The reason is well explained in the above commit: "The runtime-const infrastructure was never designed to handle the modular case, because the constant fixup is only done at boot time for core kernel code." Signed-off-by: Jisheng Zhang <jszhang@kernel.org> Signed-off-by: Will Deacon <will@kernel.org>	2026-03-04 16:13:58 +00:00
Catalin Marinas	c25c4aa3f7	arm64: mm: Add PTE_DIRTY back to PAGE_KERNEL* to fix kexec/hibernation Commit `143937ca51` ("arm64, mm: avoid always making PTE dirty in pte_mkwrite()") changed pte_mkwrite_novma() to only clear PTE_RDONLY when PTE_DIRTY is set. This was to allow writable-clean PTEs for swap pages that haven't actually been written. However, this broke kexec and hibernation for some platforms. Both go through trans_pgd_create_copy() -> _copy_pte(), which calls pte_mkwrite_novma() to make the temporary linear-map copy fully writable. With the updated pte_mkwrite_novma(), read-only kernel pages (without PTE_DIRTY) remain read-only in the temporary mapping. While such behaviour is fine for user pages where hardware DBM or trapping will make them writeable, subsequent in-kernel writes by the kexec relocation code will fault. Add PTE_DIRTY back to all _PAGE_KERNEL* protection definitions. This was the case prior to 5.4, commit `aa57157be6` ("arm64: Ensure VM_WRITE\|VM_SHARED ptes are clean by default"). With the kernel linear-map PTEs always having PTE_DIRTY set, pte_mkwrite_novma() correctly clears PTE_RDONLY. Fixes: `143937ca51` ("arm64, mm: avoid always making PTE dirty in pte_mkwrite()") Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Cc: stable@vger.kernel.org Reported-by: Jianpeng Chang <jianpeng.chang.cn@windriver.com> Link: https://lore.kernel.org/r/20251204062722.3367201-1-jianpeng.chang.cn@windriver.com Cc: Will Deacon <will@kernel.org> Cc: Huang, Ying <ying.huang@linux.alibaba.com> Cc: Guenter Roeck <linux@roeck-us.net> Reviewed-by: Huang Ying <ying.huang@linux.alibaba.com> Signed-off-by: Will Deacon <will@kernel.org>	2026-03-04 14:33:08 +00:00
Catalin Marinas	212dd84776	arm64: Silence sparse warnings caused by the type casting in (cmp)xchg The arm64 xchg/cmpxchg() wrappers cast the arguments to (unsigned long) prior to invoking the static inline functions implementing the operation. Some restrictive type annotations (e.g. __bitwise) lead to sparse warnings like below: sparse warnings: (new ones prefixed by >>) fs/crypto/bio.c:67:17: sparse: sparse: cast from restricted blk_status_t >> fs/crypto/bio.c:67:17: sparse: sparse: cast to restricted blk_status_t Force the casting in the arm64 xchg/cmpxchg() wrappers to silence sparse. Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202602230947.uNRsPyBn-lkp@intel.com/ Link: https://lore.kernel.org/r/202602230947.uNRsPyBn-lkp@intel.com/ Cc: Will Deacon <will@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Christoph Hellwig <hch@lst.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Will Deacon <will@kernel.org>	2026-03-04 14:31:27 +00:00
Linus Torvalds	949d0a46ad	Arm: * Make sure we don't leak any S1POE state from guest to guest when the feature is supported on the HW, but not enabled on the host * Propagate the ID registers from the host into non-protected VMs managed by pKVM, ensuring that the guest sees the intended feature set * Drop double kern_hyp_va() from unpin_host_sve_state(), which could bite us if we were to change kern_hyp_va() to not being idempotent * Don't leak stage-2 mappings in protected mode * Correctly align the faulting address when dealing with single page stage-2 mappings for PAGE_SIZE > 4kB * Fix detection of virtualisation-capable GICv5 IRS, due to the maintainer being obviously fat fingered... [his words, not mine] * Remove duplication of code retrieving the ASID for the purpose of S1 PT handling * Fix slightly abusive const-ification in vgic_set_kvm_info() Generic: * Remove internal Kconfigs that are now set on all architectures. * Remove per-architecture code to enable KVM_CAP_SYNC_MMU, all architectures finally enable it in Linux 7.0. -----BEGIN PGP SIGNATURE----- iQFIBAABCgAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmmkSMUUHHBib256aW5p QHJlZGhhdC5jb20ACgkQv/vSX3jHroO2tggAgOHvH9lwgHxFADAOKXPEsv+Tw6cB zuUBPo6e5XQQx6BJQVo8ctveCBbA0ybRJXbju6lgS/Bg9qQYKjZbNrYdKwBwhw3b JBUFTIJgl7lLU5gEANNtvgiiNUc2z1GfG8NQ5ovMJ0/MDEEI0YZjkGoe8ZyJ54vg qrNHBROKvt54wVHXFGhZ1z8u4RqJ/WCmy21wYbwiMgQGXq9ugRRfhnu3iGNO8BcK bSSh4cP7qusO5Nj0m/XYD/68VwvyogaEkh4sNS1VH0FOoONRWs80Q6iT2HjCGgv6 mAzCx/yB9ziSQRis3oYYEBH23HZxRVDVBeSBlPcbxE3VM7SGzp7O8L692g== =+mC+ -----END PGP SIGNATURE----- Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm Pull kvm fixes from Paolo Bonzini: "Arm: - Make sure we don't leak any S1POE state from guest to guest when the feature is supported on the HW, but not enabled on the host - Propagate the ID registers from the host into non-protected VMs managed by pKVM, ensuring that the guest sees the intended feature set - Drop double kern_hyp_va() from unpin_host_sve_state(), which could bite us if we were to change kern_hyp_va() to not being idempotent - Don't leak stage-2 mappings in protected mode - Correctly align the faulting address when dealing with single page stage-2 mappings for PAGE_SIZE > 4kB - Fix detection of virtualisation-capable GICv5 IRS, due to the maintainer being obviously fat fingered... [his words, not mine] - Remove duplication of code retrieving the ASID for the purpose of S1 PT handling - Fix slightly abusive const-ification in vgic_set_kvm_info() Generic: - Remove internal Kconfigs that are now set on all architectures - Remove per-architecture code to enable KVM_CAP_SYNC_MMU, all architectures finally enable it in Linux 7.0" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: KVM: always define KVM_CAP_SYNC_MMU KVM: remove CONFIG_KVM_GENERIC_MMU_NOTIFIER KVM: arm64: Deduplicate ASID retrieval code irqchip/gic-v5: Fix inversion of IRS_IDR0.virt flag KVM: arm64: Revert accidental drop of kvm_uninit_stage2_mmu() for non-NV VMs KVM: arm64: Fix protected mode handling of pages larger than 4kB KVM: arm64: vgic: Handle const qualifier from gic_kvm_info allocation type KVM: arm64: Remove redundant kern_hyp_va() in unpin_host_sve_state() KVM: arm64: Fix ID register initialization for non-protected pKVM guests KVM: arm64: Optimise away S1POE handling when not supported by host KVM: arm64: Hide S1POE from guests when not supported by the host	2026-03-01 15:34:47 -08:00
Paolo Bonzini	55365ab85a	KVM/arm64 fixes for 7.0, take #1 - Make sure we don't leak any S1POE state from guest to guest when the feature is supported on the HW, but not enabled on the host - Propagate the ID registers from the host into non-protected VMs managed by pKVM, ensuring that the guest sees the intended feature set - Drop double kern_hyp_va() from unpin_host_sve_state(), which could bite us if we were to change kern_hyp_va() to not being idempotent - Don't leak stage-2 mappings in protected mode - Correctly align the faulting address when dealing with single page stage-2 mappings for PAGE_SIZE > 4kB - Fix detection of virtualisation-capable GICv5 IRS, due to the maintainer being obviously fat fingered... - Remove duplication of code retrieving the ASID for the purpose of S1 PT handling - Fix slightly abusive const-ification in vgic_set_kvm_info() -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEn9UcU+C1Yxj9lZw9I9DQutE9ekMFAmmfIDAACgkQI9DQutE9 ekNw0w//aFrTa4yoUns0Jd1HeCMnKIP3Iljz1FR0g58KAgIX5tTvgysThhNN7yjW k7Xp03oD7K4UtxqJVQnB4nPTX0PYFFOxfcsHNN1+cxhepo6n9t/61KMa6IM/EE2X lH5Cypz4pxHEynMXEDugnGJQaMKIe5eU9UHs5rTqqUeMGupqUOk2mjBOk77CWfio oIvOIne4h7SD9tzYcgZADZ5YAtc1QcRVZkLTAtfsd2xRzNlpBDNAqVX8PJPzEW35 cffrEvT0NGBu+0MLBzc3bZ3cViGv5/8DHqiTcQhvK14dHMUIxl1GxUrkXVHb8Ca0 hoBEV7dJHS0avPQoKXb+DjB+HkxHda4g8EJ7RxWJisofS1AztWWDZOsZnfgTaaLB Ia0WekWra0yuEiYlU5kE/HPRlXAKUAa3mhih3uu5Uq1c5xuhCzHY7pyB9uNCdYfz KiSsTFG0i+T0ddT8E92LOFTt65G2wZyd498yVoVJ7MDOUSb+eESvgu5BPH4hiJLM txnDDvEvcHn5zlnayUItxk4SSHLZHFEP/tjfWVTrEpFW3s9O86aQlSqCsxOQReYZ ZVSa4Qr/aef8KzS4IMxFcSIMGkUoNjFIe0+SsVj/5vLoX6eK0ppE0GoQhmREU2cc hcicwGH5VYdJXhxJUvxCqtrmX+F14JlENmdXBZsE4N0b41pi7G8= =Ppfu -----END PGP SIGNATURE----- Merge tag 'kvmarm-fixes-7.0-1' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD KVM/arm64 fixes for 7.0, take #1 - Make sure we don't leak any S1POE state from guest to guest when the feature is supported on the HW, but not enabled on the host - Propagate the ID registers from the host into non-protected VMs managed by pKVM, ensuring that the guest sees the intended feature set - Drop double kern_hyp_va() from unpin_host_sve_state(), which could bite us if we were to change kern_hyp_va() to not being idempotent - Don't leak stage-2 mappings in protected mode - Correctly align the faulting address when dealing with single page stage-2 mappings for PAGE_SIZE > 4kB - Fix detection of virtualisation-capable GICv5 IRS, due to the maintainer being obviously fat fingered... - Remove duplication of code retrieving the ASID for the purpose of S1 PT handling - Fix slightly abusive const-ification in vgic_set_kvm_info()	2026-02-28 15:33:34 +01:00
Marco Elver	773b24bced	arm64, compiler-context-analysis: Permit alias analysis through __READ_ONCE() with CONFIG_LTO=y When enabling Clang's Context Analysis (aka. Thread Safety Analysis) on kernel/futex/core.o (see Peter's changes at [1]), in arm64 LTO builds we could see: \| kernel/futex/core.c:982:1: warning: spinlock 'atomic ? __u.__val : q->lock_ptr' is still held at the end of function [-Wthread-safety-analysis] \| 982 \| } \| \| ^ \| kernel/futex/core.c:976:2: note: spinlock acquired here \| 976 \| spin_lock(lock_ptr); \| \| ^ \| kernel/futex/core.c:982:1: warning: expecting spinlock 'q->lock_ptr' to be held at the end of function [-Wthread-safety-analysis] \| 982 \| } \| \| ^ \| kernel/futex/core.c:966:6: note: spinlock acquired here \| 966 \| void futex_q_lockptr_lock(struct futex_q q) \| \| ^ \| 2 warnings generated. Where we have: extern void futex_q_lockptr_lock(struct futex_q q) __acquires(q->lock_ptr); .. void futex_q_lockptr_lock(struct futex_q q) { spinlock_t lock_ptr; /* * See futex_unqueue() why lock_ptr can change. */ guard(rcu)(); retry: >> lock_ptr = READ_ONCE(q->lock_ptr); spin_lock(lock_ptr); ... } At the time of the above report (prior to removal of the 'atomic' flag), Clang Thread Safety Analysis's alias analysis resolved 'lock_ptr' to 'atomic ? __u.__val : q->lock_ptr' (now just '__u.__val'), and used this as the identity of the context lock given it cannot "see through" the inline assembly; however, we want 'q->lock_ptr' as the canonical context lock. While for code generation the compiler simplified to '__u.__val' for pointers (8 byte case -> 'atomic' was set), TSA's analysis (a) happens much earlier on the AST, and (b) would be the wrong deduction. Now that we've gotten rid of the 'atomic' ternary comparison, we can return '__u.__val' through a pointer that we initialize with '&x', but then update via a pointer-to-pointer. When READ_ONCE()'ing a context lock pointer, TSA's alias analysis does not invalidate the initial alias when updated through the pointer-to-pointer, and we make it effectively "see through" the __READ_ONCE(). Code generation is unchanged. Link: https://lkml.kernel.org/r/20260121110704.221498346@infradead.org [1] Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202601221040.TeM0ihff-lkp@intel.com/ Cc: Peter Zijlstra <peterz@infradead.org> Tested-by: Boqun Feng <boqun@kernel.org> Reviewed-by: David Laight <david.laight.linux@gmail.com> Signed-off-by: Marco Elver <elver@google.com> Signed-off-by: Will Deacon <will@kernel.org>	2026-02-26 18:03:07 +00:00
Marco Elver	abf1be684d	arm64: Optimize __READ_ONCE() with CONFIG_LTO=y Rework arm64 LTO __READ_ONCE() to improve code generation as follows: 1. Replace _Generic-based __unqual_scalar_typeof() with more complete __rwonce_typeof_unqual(). This strips qualifiers from all types, not just integer types, which is required to be able to assign (must be non-const) to __u.__val in the non-atomic case (required for #2). One subtle point here is that non-integer types of __val could be const or volatile within the union with the old __unqual_scalar_typeof(), if the passed variable is const or volatile. This would then result in a forced load from the stack if __u.__val is volatile; in the case of const, it does look odd if the underlying storage changes, but the compiler is told said member is "const" -- it smells like UB. 2. Eliminate the atomic flag and ternary conditional expression. Move the fallback volatile load into the default case of the switch, ensuring __u is unconditionally initialized across all paths. The statement expression now unconditionally returns __u.__val. This refactoring appears to help the compiler improve (or fix) code generation. With a defconfig + LTO + debug options builds, we observe different codegen for the following functions: btrfs_reclaim_sweep (708 -> 1032 bytes) btrfs_sinfo_bg_reclaim_threshold_store (200 -> 204 bytes) check_mem_access (3652 -> 3692 bytes) [inlined bpf_map_is_rdonly] console_flush_all (1268 -> 1264 bytes) console_lock_spinning_disable_and_check (180 -> 176 bytes) igb_add_filter (640 -> 636 bytes) igb_config_tx_modes (2404 -> 2400 bytes) kvm_vcpu_on_spin (480 -> 476 bytes) map_freeze (376 -> 380 bytes) netlink_bind (1664 -> 1656 bytes) nmi_cpu_backtrace (404 -> 400 bytes) set_rps_cpu (516 -> 520 bytes) swap_cluster_readahead (944 -> 932 bytes) tcp_accecn_third_ack (328 -> 336 bytes) tcp_create_openreq_child (1764 -> 1772 bytes) tcp_data_queue (5784 -> 5892 bytes) tcp_ecn_rcv_synack (620 -> 628 bytes) xen_manage_runstate_time (944 -> 896 bytes) xen_steal_clock (340 -> 296 bytes) Increase of some functions are due to more aggressive inlining due to better codegen (in this build, e.g. bpf_map_is_rdonly is no longer present due to being inlined completely). NOTE: The return-value-of-function-drops-qualifiers hack was first suggested by Al Viro in [1], which notes some of its limitations which make it unsuitable for a general __unqual_scalar_typeof() replacement. Notably, array types are not supported, and GCC 8.1-8.3 still fail. Why should we use it here? READ_ONCE() does not support reading whole arrays, and the GCC version problem only affects 3 minor releases of a very ancient still-supported GCC version; not only that, this arm64 READ_ONCE() version is currently only activated by LTO builds, which to-date are only supported by Clang! Link: https://lore.kernel.org/all/20260111182010.GH3634291@ZenIV/ [1] Signed-off-by: Marco Elver <elver@google.com> Signed-off-by: Will Deacon <will@kernel.org>	2026-02-26 18:03:07 +00:00
Mark Rutland	a8f78680ee	arm64: tlb: Optimize ARM64_WORKAROUND_REPEAT_TLBI The ARM64_WORKAROUND_REPEAT_TLBI workaround is used to mitigate several errata where broadcast TLBI;DSB sequences don't provide all the architecturally required synchronization. The workaround performs more work than necessary, and can have significant overhead. This patch optimizes the workaround, as explained below. The workaround was originally added for Qualcomm Falkor erratum 1009 in commit: `d9ff80f83e` ("arm64: Work around Falkor erratum 1009") As noted in the message for that commit, the workaround is applied even in cases where it is not strictly necessary. The workaround was later reused without changes for: * Arm Cortex-A76 erratum #1286807 SDEN v33: https://developer.arm.com/documentation/SDEN-885749/33-0/ * Arm Cortex-A55 erratum #2441007 SDEN v16: https://developer.arm.com/documentation/SDEN-859338/1600/ * Arm Cortex-A510 erratum #2441009 SDEN v19: https://developer.arm.com/documentation/SDEN-1873351/1900/ The important details to note are as follows: 1. All relevant errata only affect the ordering and/or completion of memory accesses which have been translated by an invalidated TLB entry. The actual invalidation of TLB entries is unaffected. 2. The existing workaround is applied to both broadcast and local TLB invalidation, whereas for all relevant errata it is only necessary to apply a workaround for broadcast invalidation. 3. The existing workaround replaces every TLBI with a TLBI;DSB;TLBI sequence, whereas for all relevant errata it is only necessary to execute a single additional TLBI;DSB sequence after any number of TLBIs are completed by a DSB. For example, for a sequence of batched TLBIs: TLBI <op1>[, <arg1>] TLBI <op2>[, <arg2>] TLBI <op3>[, <arg3>] DSB ISH ... the existing workaround will expand this to: TLBI <op1>[, <arg1>] DSB ISH // additional TLBI <op1>[, <arg1>] // additional TLBI <op2>[, <arg2>] DSB ISH // additional TLBI <op2>[, <arg2>] // additional TLBI <op3>[, <arg3>] DSB ISH // additional TLBI <op3>[, <arg3>] // additional DSB ISH ... whereas it is sufficient to have: TLBI <op1>[, <arg1>] TLBI <op2>[, <arg2>] TLBI <op3>[, <arg3>] DSB ISH TLBI <opX>[, <argX>] // additional DSB ISH // additional Using a single additional TBLI and DSB at the end of the sequence can have significantly lower overhead as each DSB which completes a TLBI must synchronize with other PEs in the system, with potential performance effects both locally and system-wide. 4. The existing workaround repeats each specific TLBI operation, whereas for all relevant errata it is sufficient for the additional TLBI to use any operation which will be broadcast, regardless of which translation regime or stage of translation the operation applies to. For example, for a single TLBI: TLBI ALLE2IS DSB ISH ... the existing workaround will expand this to: TLBI ALLE2IS DSB ISH TLBI ALLE2IS // additional DSB ISH // additional ... whereas it is sufficient to have: TLBI ALLE2IS DSB ISH TLBI VALE1IS, XZR // additional DSB ISH // additional As the additional TLBI doesn't have to match a specific earlier TLBI, the additional TLBI can be implemented in separate code, with no memory of the earlier TLBIs. The additional TLBI can also use a cheaper TLBI operation. 5. The existing workaround is applied to both Stage-1 and Stage-2 TLB invalidation, whereas for all relevant errata it is only necessary to apply a workaround for Stage-1 invalidation. Architecturally, TLBI operations which invalidate only Stage-2 information (e.g. IPAS2E1IS) are not required to invalidate TLB entries which combine information from Stage-1 and Stage-2 translation table entries, and consequently may not complete memory accesses translated by those combined entries. In these cases, completion of memory accesses is only guaranteed after subsequent invalidation of Stage-1 information (e.g. VMALLE1IS). Taking the above points into account, this patch reworks the workaround logic to reduce overhead: * New __tlbi_sync_s1ish() and __tlbi_sync_s1ish_hyp() functions are added and used in place of any dsb(ish) which is used to complete broadcast Stage-1 TLB maintenance. When the ARM64_WORKAROUND_REPEAT_TLBI workaround is enabled, these helpers will execute an additional TLBI;DSB sequence. For consistency, it might make sense to add __tlbi_sync_() helpers for local and stage 2 maintenance. For now I've left those with open-coded dsb() to keep the diff small. The duplication of TLBIs in __TLBI_0() and __TLBI_1() is removed. This is no longer needed as the necessary synchronization will happen in __tlbi_sync_s1ish() or __tlbi_sync_s1ish_hyp(). * The additional TLBI operation is chosen to have minimal impact: - __tlbi_sync_s1ish() uses "TLBI VALE1IS, XZR". This is only used at EL1 or at EL2 with {E2H,TGE}=={1,1}, where it will target an unused entry for the reserved ASID in the kernel's own translation regime, and have no adverse affect. - __tlbi_sync_s1ish_hyp() uses "TLBI VALE2IS, XZR". This is only used in hyp code, where it will target an unused entry in the hyp code's TTBR0 mapping, and should have no adverse effect. * As __TLBI_0() and __TLBI_1() no longer replace each TLBI with a TLBI;DSB;TLBI sequence, batching TLBIs is worthwhile, and there's no need for arch_tlbbatch_should_defer() to consider ARM64_WORKAROUND_REPEAT_TLBI. When building defconfig with GCC 15.1.0, compared to v6.19-rc1, this patch saves ~1KiB of text, makes the vmlinux ~42KiB smaller, and makes the resulting Image 64KiB smaller: \| [mark@lakrids:~/src/linux]% size vmlinux-* \| text data bss dec hex filename \| 21179831 19660919 708216 `41548966` 279fca6 vmlinux-after \| 21181075 19660903 708216 41550194 27a0172 vmlinux-before \| [mark@lakrids:~/src/linux]% ls -l vmlinux-* \| -rwxr-xr-x 1 mark mark 157771472 Feb 4 12:05 vmlinux-after \| -rwxr-xr-x 1 mark mark 157815432 Feb 4 12:05 vmlinux-before \| [mark@lakrids:~/src/linux]% ls -l Image-* \| -rw-r--r-- 1 mark mark 41007616 Feb 4 12:05 Image-after \| -rw-r--r-- 1 mark mark 41073152 Feb 4 12:05 Image-before Signed-off-by: Mark Rutland <mark.rutland@arm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Marc Zyngier <maz@kernel.org> Cc: Oliver Upton <oupton@kernel.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Will Deacon <will@kernel.org>	2026-02-25 21:37:44 +00:00
Mark Rutland	bfd9c931d1	arm64: tlb: Allow XZR argument to TLBI ops The TLBI instruction accepts XZR as a register argument, and for TLBI operations with a register argument, there is no functional difference between using XZR or another GPR which contains zeroes. Operations without a register argument are encoded as if XZR were used. Allow the __TLBI_1() macro to use XZR when a register argument is all zeroes. Today this only results in a trivial code saving in __do_compat_cache_op()'s workaround for Neoverse-N1 erratum #1542419. In subsequent patches this pattern will be used more generally. There should be no functional change as a result of this patch. Signed-off-by: Mark Rutland <mark.rutland@arm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Marc Zyngier <maz@kernel.org> Cc: Oliver Upton <oupton@kernel.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Will Deacon <will@kernel.org>	2026-02-25 21:37:44 +00:00

1 2 3 4 5 ...

6328 Commits