From 80501dff814eeccebf44a59340c3fe3a205eb120 Mon Sep 17 00:00:00 2001 From: "Borislav Petkov (AMD)" Date: Wed, 20 May 2026 13:25:07 -0700 Subject: [PATCH 1/5] Documentation/arch/x86: Hide clearcpuid= This option was never meant to be used in production because it solely clears the X86_FEATURE kernel-internal representation of what CPUID bits it has detected and doesn't do any *proper* feature disablement like clearing CR4.CET in the user shadow stack case, for example. So remove its documentation so that it doesn't get used in production and people get silly ideas. It is meant strictly for debugging; and if a chicken bit for properly disabling a feature is warranted, then that would need proper enablement. No functional changes. Signed-off-by: Borislav Petkov (AMD) Signed-off-by: Ingo Molnar Cc: Mathias Krause Cc: Linus Torvalds Link: https://patch.msgid.link/20260520202508.160112-1-bp@kernel.org --- .../admin-guide/kernel-parameters.txt | 18 ------------------ Documentation/arch/x86/cpuinfo.rst | 4 ++++ 2 files changed, 4 insertions(+), 18 deletions(-) diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index 4d0f545fb3ec..97007f4f69d4 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -789,24 +789,6 @@ Kernel parameters cio_ignore= [S390] See Documentation/arch/s390/common_io.rst for details. - clearcpuid=X[,X...] [X86] - Disable CPUID feature X for the kernel. See - arch/x86/include/asm/cpufeatures.h for the valid bit - numbers X. Note the Linux-specific bits are not necessarily - stable over kernel options, but the vendor-specific - ones should be. - X can also be a string as appearing in the flags: line - in /proc/cpuinfo which does not have the above - instability issue. However, not all features have names - in /proc/cpuinfo. - Note that using this option will taint your kernel. - Also note that user programs calling CPUID directly - or using the feature without checking anything - will still see it. This just prevents it from - being used by the kernel or shown in /proc/cpuinfo. - Also note the kernel might malfunction if you disable - some critical bits. - clk_ignore_unused [CLK] Prevents the clock framework from automatically gating diff --git a/Documentation/arch/x86/cpuinfo.rst b/Documentation/arch/x86/cpuinfo.rst index 9f2e47c4b1c8..17fce95367e6 100644 --- a/Documentation/arch/x86/cpuinfo.rst +++ b/Documentation/arch/x86/cpuinfo.rst @@ -187,6 +187,10 @@ to disable features using the feature number as defined in Protection can be disabled using clearcpuid=514. The number 514 is calculated from #define X86_FEATURE_UMIP (16*32 + 2). +DO NOT USE this cmdline option in production - it is meant to be used only as +a quick'n'dirty debugging aid to rule out a feature-enabling code is the +culprit. If you use it, it'll taint the kernel. + In addition, there exists a variety of custom command-line parameters that disable specific features. The list of parameters includes, but is not limited to, nofsgsbase, nosgx, noxsave, etc. 5-level paging can also be disabled using From cda64169bade79427f264e43d0f422eaed9dc116 Mon Sep 17 00:00:00 2001 From: Borislav Petkov Date: Wed, 13 May 2026 22:06:01 +0200 Subject: [PATCH 2/5] x86/microcode: Do not access MSR_IA32_PLATFORM_ID when running as a guest Patch in Fixes: causes the usual: unchecked MSR access error: RDMSR from 0x17 at ... (intel_get_platform_id) Call Trace: early_init_intel early_cpu_init setup_arch _printk start_kernel x86_64_start_reservations x86_64_start_kernel common_startup_64 because the kernel is booted in a guest. In order to avoid it, this MSR access needs to be prevented when running virtualized. That is usually done by checking X86_FEATURE_HYPERVISOR but for this particular case it is too early yet. The platform ID needs to be read as early as when microcode is loaded on the BSP: load_ucode_bsp ... -> get_microcode_blob ... -> intel_find_matching_signature and by that time, CPUID leafs haven't been parsed yet. The microcode loader already has logic to check early whether the kernel is running virtualized so make that globally available to arch/x86/. The query whether running virtualized is getting more and more prominent in recent times so might as well make it an arch-global var which the rest of the code can use. Fixes: d8630b67ca1ed ("x86/cpu: Add platform ID to CPU info structure") Reported-by: Vishal Verma Signed-off-by: Borislav Petkov (AMD) Reviewed-by: Binbin Wu Reviewed-by: Xiaoyao Li Tested-by: Binbin Wu Link: https://lore.kernel.org/all/20260430020953.1405535-1-binbin.wu@linux.intel.com --- arch/x86/include/asm/processor.h | 1 + arch/x86/kernel/cpu/microcode/amd.c | 4 ++-- arch/x86/kernel/cpu/microcode/core.c | 22 ++++++++++------------ arch/x86/kernel/cpu/microcode/intel.c | 3 +++ arch/x86/kernel/cpu/microcode/internal.h | 1 - 5 files changed, 16 insertions(+), 15 deletions(-) diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h index 10b5355b323e..67dd932305db 100644 --- a/arch/x86/include/asm/processor.h +++ b/arch/x86/include/asm/processor.h @@ -733,6 +733,7 @@ bool xen_set_default_idle(void); #endif void __noreturn stop_this_cpu(void *dummy); +extern bool x86_hypervisor_present; void microcode_check(struct cpuinfo_x86 *prev_info); void store_cpu_caps(struct cpuinfo_x86 *info); diff --git a/arch/x86/kernel/cpu/microcode/amd.c b/arch/x86/kernel/cpu/microcode/amd.c index e533881284a1..5c0afae75e9f 100644 --- a/arch/x86/kernel/cpu/microcode/amd.c +++ b/arch/x86/kernel/cpu/microcode/amd.c @@ -322,7 +322,7 @@ static u32 get_patch_level(void) { u32 rev, dummy __always_unused; - if (IS_ENABLED(CONFIG_MICROCODE_DBG) && hypervisor_present) { + if (IS_ENABLED(CONFIG_MICROCODE_DBG) && x86_hypervisor_present) { int cpu = smp_processor_id(); if (!microcode_rev[cpu]) { @@ -714,7 +714,7 @@ static bool __apply_microcode_amd(struct microcode_amd *mc, u32 *cur_rev, invlpg(p_addr_end); } - if (IS_ENABLED(CONFIG_MICROCODE_DBG) && hypervisor_present) + if (IS_ENABLED(CONFIG_MICROCODE_DBG) && x86_hypervisor_present) microcode_rev[smp_processor_id()] = mc->hdr.patch_id; /* verify patch application was successful */ diff --git a/arch/x86/kernel/cpu/microcode/core.c b/arch/x86/kernel/cpu/microcode/core.c index 651202e6fefb..45ca406a8112 100644 --- a/arch/x86/kernel/cpu/microcode/core.c +++ b/arch/x86/kernel/cpu/microcode/core.c @@ -57,7 +57,7 @@ bool force_minrev = IS_ENABLED(CONFIG_MICROCODE_LATE_FORCE_MINREV); u32 base_rev; u32 microcode_rev[NR_CPUS] = {}; -bool hypervisor_present; +bool __ro_after_init x86_hypervisor_present; /* * Synchronization. @@ -118,14 +118,9 @@ bool __init microcode_loader_disabled(void) /* * Disable when: * - * 1) The CPU does not support CPUID. - */ - if (!cpuid_feature()) { - dis_ucode_ldr = true; - return dis_ucode_ldr; - } - - /* + * 1) The CPU does not support CPUID, detected below in + * load_ucode_bsp(). + * * 2) Bit 31 in CPUID[1]:ECX is clear * The bit is reserved for hypervisor use. This is still not * completely accurate as XEN PV guests don't see that CPUID bit @@ -135,9 +130,7 @@ bool __init microcode_loader_disabled(void) * 3) Certain AMD patch levels are not allowed to be * overwritten. */ - hypervisor_present = native_cpuid_ecx(1) & BIT(31); - - if ((hypervisor_present && !IS_ENABLED(CONFIG_MICROCODE_DBG)) || + if ((x86_hypervisor_present && !IS_ENABLED(CONFIG_MICROCODE_DBG)) || amd_check_current_patch_level()) dis_ucode_ldr = true; @@ -179,6 +172,11 @@ void __init load_ucode_bsp(void) early_parse_cmdline(); + if (!cpuid_feature()) + dis_ucode_ldr = true; + else + x86_hypervisor_present = native_cpuid_ecx(1) & BIT(31); + if (microcode_loader_disabled()) return; diff --git a/arch/x86/kernel/cpu/microcode/intel.c b/arch/x86/kernel/cpu/microcode/intel.c index 37ac4afe0972..a4c0a0cf928b 100644 --- a/arch/x86/kernel/cpu/microcode/intel.c +++ b/arch/x86/kernel/cpu/microcode/intel.c @@ -138,6 +138,9 @@ u32 intel_get_platform_id(void) { unsigned int val[2]; + if (x86_hypervisor_present) + return 0; + /* * This can be called early. Use CPUID directly instead of * relying on cpuinfo_x86 which may not be fully initialized. diff --git a/arch/x86/kernel/cpu/microcode/internal.h b/arch/x86/kernel/cpu/microcode/internal.h index 3b93c0676b4f..a10b547eda1e 100644 --- a/arch/x86/kernel/cpu/microcode/internal.h +++ b/arch/x86/kernel/cpu/microcode/internal.h @@ -48,7 +48,6 @@ extern struct early_load_data early_data; extern struct ucode_cpu_info ucode_cpu_info[]; extern u32 microcode_rev[NR_CPUS]; extern u32 base_rev; -extern bool hypervisor_present; struct cpio_data find_microcode_in_initrd(const char *path); From a17dc12bfed8868e6a86f3b45c16065a70641acb Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Alexis=20Lothor=C3=A9=20=28eBPF=20Foundation=29?= Date: Wed, 27 May 2026 21:12:31 +0200 Subject: [PATCH 3/5] x86/ftrace: Relocate %rip-relative percpu refs in dynamic trampolines MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit With CONFIG_CALL_DEPTH_TRACKING enabled on an x86 retbleed-affected platform (eg: Skylake), with retbleed=stuff, registering a dynamic ftrace trampoline crashes on the first call into the traced function: BUG: unable to handle page fault for address: ffff88817ae18880 #PF: supervisor write access in kernel mode #PF: error_code(0x0002) - not-present page PGD 4b53067 P4D 4b53067 PUD 0 Oops: Oops: 0002 [#1] SMP PTI CPU: 3 UID: 0 PID: 187 Comm: usleep Not tainted 7.0.10 #243 PREEMPT(full) Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Arch Linux 1.17.0-2-2 04/01/2014 Code: 24 78 00 00 00 00 48 89 ea 48 89 54 24 20 48 8b b4 24 b8 00 00 00 48 8b bc 24 b0 00 00 00 48 89 bc 24 80 00 00 00 48 83 ef 05 <65> 48 c1 3d 1f a8 b6 02 05 48 8b 15 f6 00 00 00 4c 89 3c 24 4c 89 Call Trace: ? find_held_lock ? exc_page_fault ? lock_release ? __x64_sys_clock_nanosleep ? lockdep_hardirqs_on_prepare ? trace_hardirqs_on __x64_sys_clock_nanosleep do_syscall_64 ? exc_page_fault ? call_depth_return_thunk entry_SYSCALL_64_after_hwframe ... Kernel panic - not syncing: Fatal exception This small reproducer allows to easily trigger the crash: # echo 'p __x64_sys_clock_nanosleep' > /sys/kernel/tracing/kprobe_events # echo 1 > /sys/kernel/tracing/events/kprobes/p___x64_sys_clock_nanosleep_0/enable # usleep 1 Monitoring the crash under GDB points to the exact instruction in charge of incrementing the call depth: sarq $5, %gs:__x86_call_depth(%rip) This instruction matches the one inserted by the ftrace_regs_caller from ftrace_64.S. This emitted code was likely working fine until the introduction of 59bec00ace28 ("x86/percpu: Introduce %rip-relative addressing to PER_CPU_VAR()"): it has made the call depth accounting addressing relative to $rip, instead of being based on an absolute address. As this code exact location depends on where the trampoline lives in memory, the corresponding displacement needs to be adjusted at runtime to actually correctly find the per-cpu __x86_call_depth value, otherwise the targeted address is wrong, leading to the page fault seen above. Fix the %rip-relative displacement of the copied CALL_DEPTH_ACCOUNT instruction (from ftrace_regs_caller) by calling text_poke_apply_relocation(), as it is done for example by the x86 BPF JIT compiler through x86_call_depth_emit_accounting(). This corrects both CALL_DEPTH_ACCOUNT slots, in ftrace_caller and ftrace_regs_caller. [ bp: Massage. ] Fixes: 59bec00ace28 ("x86/percpu: Introduce %rip-relative addressing to PER_CPU_VAR()") Signed-off-by: Alexis Lothoré (eBPF Foundation) Signed-off-by: Borislav Petkov (AMD) Acked-by: Peter Zijlstra (Intel) Acked-by: Steven Rostedt Cc: Link: https://patch.msgid.link/20260527-fix_call_depth_in_trampoline-v1-1-1c1abc8ae310@bootlin.com --- arch/x86/kernel/ftrace.c | 7 +++++++ 1 file changed, 7 insertions(+) diff --git a/arch/x86/kernel/ftrace.c b/arch/x86/kernel/ftrace.c index 0543b57f54ee..17d6edfcb7e0 100644 --- a/arch/x86/kernel/ftrace.c +++ b/arch/x86/kernel/ftrace.c @@ -375,6 +375,13 @@ create_trampoline(struct ftrace_ops *ops, unsigned int *tramp_size) goto fail; } + /* + * Generated trampoline may contain rIP-relative addressing which + * displacement needs to be fixed. + */ + text_poke_apply_relocation(trampoline, trampoline, size, + (void *)start_offset, size); + /* * The address of the ftrace_ops that is used for this trampoline * is stored at the end of the trampoline. This will be used to From 8aeb879baf12fe64889f019da9a4f8347c604e91 Mon Sep 17 00:00:00 2001 From: Peter Zijlstra Date: Tue, 26 May 2026 11:06:31 +0200 Subject: [PATCH 4/5] x86/kvm/vmx: Fix x86_64 CFI build It was missed that idt_do_interrupt_irqoff() gets compiled on x84_64; this is a problem for CFI builds because it includes an unadorned indirect call. It is however completely dead code. Rework things to not emit this function at all. Fixes: 0701c9e17bd9 ("x86/kvm/vmx: Move IRQ/NMI dispatch from KVM into x86 core") Reported-by: Nathan Chancellor Reported-by: Calvin Owens Signed-off-by: Peter Zijlstra (Intel) Tested-by: Nathan Chancellor Link: https://patch.msgid.link/20260526090631.GA4149641@noisy.programming.kicks-ass.net --- arch/x86/entry/common.c | 2 +- arch/x86/entry/entry.S | 2 ++ arch/x86/kernel/idt.c | 12 ++---------- 3 files changed, 5 insertions(+), 11 deletions(-) diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c index 06c7c6ebd6f9..14cd43d4da6c 100644 --- a/arch/x86/entry/common.c +++ b/arch/x86/entry/common.c @@ -55,7 +55,7 @@ noinstr void x86_entry_from_kvm(unsigned int event_type, unsigned int vector) * The FRED NMI context is significantly different and will not work * right (specifically FRED fixed the NMI recursion issue). */ - idt_entry_from_kvm(vector); + idt_do_nmi_irqoff(); } EXPORT_SYMBOL_FOR_KVM(x86_entry_from_kvm); #endif diff --git a/arch/x86/entry/entry.S b/arch/x86/entry/entry.S index a56e043b266d..2bc217bb5475 100644 --- a/arch/x86/entry/entry.S +++ b/arch/x86/entry/entry.S @@ -109,11 +109,13 @@ EXPORT_SYMBOL(__ref_stack_chk_guard); RET .endm +#ifndef CONFIG_X86_64 .pushsection .text, "ax" SYM_FUNC_START(idt_do_interrupt_irqoff) IDT_DO_EVENT_IRQOFF CALL_NOSPEC _ASM_ARG1 SYM_FUNC_END(idt_do_interrupt_irqoff) .popsection +#endif .pushsection .noinstr.text, "ax" SYM_FUNC_START(idt_do_nmi_irqoff) diff --git a/arch/x86/kernel/idt.c b/arch/x86/kernel/idt.c index 7bcf1decc034..90a22e24a9eb 100644 --- a/arch/x86/kernel/idt.c +++ b/arch/x86/kernel/idt.c @@ -268,18 +268,10 @@ void __init idt_setup_early_pf(void) } #endif -#if IS_ENABLED(CONFIG_KVM_INTEL) -noinstr void idt_entry_from_kvm(unsigned int vector) +#if IS_ENABLED(CONFIG_KVM_INTEL) && !defined(CONFIG_X86_64) +void idt_entry_from_kvm(unsigned int vector) { - if (vector == NMI_VECTOR) - return idt_do_nmi_irqoff(); - - /* - * Only the NMI path requires noinstr. - */ - instrumentation_begin(); idt_do_interrupt_irqoff(gate_offset(idt_table + vector)); - instrumentation_end(); } #endif From 44eeff9bc467bc7d1fec34fc3f6001f385fe462c Mon Sep 17 00:00:00 2001 From: Andrei Vagin Date: Tue, 26 May 2026 20:50:43 +0000 Subject: [PATCH 5/5] Revert "x86/fpu: Refine and simplify the magic number check during signal return" This reverts dc8aa31a7ac2 ("x86/fpu: Refine and simplify the magic number check during signal return"). The aforementioned commit broke applications that construct signal frames in userspace (such as CRIU and gVisor) if the frame's xstate size is smaller than the kernel's fpstate->user_size. Furthermore, this introduces a critical issue for checkpoint/restore tools like CRIU. If a process is checkpointed while inside a signal handler, its stack contains a signal frame formatted according to the source host's xstate capabilities. If that process is later restored on a destination host with larger xstate capabilities (e.g., a newer CPU with more features enabled, resulting in a larger fpstate->user_size), the kernel will look for FP_XSTATE_MAGIC2 at the destination host's larger user_size offset instead of the offset encoded in the frame's fx_sw->xstate_size. This causes the magic2 check to fail, forcing sigreturn to silently fall back to "FX-only" mode. Upon return from the signal handler, the process's extended state is reset to initial values instead of being restored, leading to silent data corruption. The aforementioned commit cited d877550eaf2d ("x86/fpu: Stop relying on userspace for info to fault in xsave buffer") as justification to stop relying on userspace for the magic number check. However, these two changes are fundamentally different. The last one only changed how much memory the kernel ensures is paged-in before running XRSTOR to prevent an infinite loop. It did not change the signal frame format or how the layout is validated. Reverting this change restores the use of fx_sw->xstate_size for locating magic2 and restores the necessary sanity checks, ensuring that the signal frame remains self-describing and portable. [ bp: Massage commit message. ] Fixes: dc8aa31a7ac2 ("x86/fpu: Refine and simplify the magic number check during signal return") Signed-off-by: Andrei Vagin Signed-off-by: Borislav Petkov (AMD) Acked-by: Chang S. Bae Cc: stable@vger.kernel.org Link: https://lore.kernel.org/all/20260429000623.3356606-1-avagin@google.com --- arch/x86/kernel/fpu/signal.c | 11 ++++++++--- 1 file changed, 8 insertions(+), 3 deletions(-) diff --git a/arch/x86/kernel/fpu/signal.c b/arch/x86/kernel/fpu/signal.c index c3ec2512f2bb..20b638c507ca 100644 --- a/arch/x86/kernel/fpu/signal.c +++ b/arch/x86/kernel/fpu/signal.c @@ -27,14 +27,19 @@ static inline bool check_xstate_in_sigframe(struct fxregs_state __user *fxbuf, struct _fpx_sw_bytes *fx_sw) { + int min_xstate_size = sizeof(struct fxregs_state) + + sizeof(struct xstate_header); void __user *fpstate = fxbuf; unsigned int magic2; if (__copy_from_user(fx_sw, &fxbuf->sw_reserved[0], sizeof(*fx_sw))) return false; - /* Check for the first magic field */ - if (fx_sw->magic1 != FP_XSTATE_MAGIC1) + /* Check for the first magic field and other error scenarios. */ + if (fx_sw->magic1 != FP_XSTATE_MAGIC1 || + fx_sw->xstate_size < min_xstate_size || + fx_sw->xstate_size > x86_task_fpu(current)->fpstate->user_size || + fx_sw->xstate_size > fx_sw->extended_size) goto setfx; /* @@ -43,7 +48,7 @@ static inline bool check_xstate_in_sigframe(struct fxregs_state __user *fxbuf, * fpstate layout with out copying the extended state information * in the memory layout. */ - if (__get_user(magic2, (__u32 __user *)(fpstate + x86_task_fpu(current)->fpstate->user_size))) + if (__get_user(magic2, (__u32 __user *)(fpstate + fx_sw->xstate_size))) return false; if (likely(magic2 == FP_XSTATE_MAGIC2))