From 80501dff814eeccebf44a59340c3fe3a205eb120 Mon Sep 17 00:00:00 2001
From: "Borislav Petkov (AMD)" <bp@alien8.de>
Date: Wed, 20 May 2026 13:25:07 -0700
Subject: [PATCH 1/5] Documentation/arch/x86: Hide clearcpuid=

This option was never meant to be used in production because it solely
clears the X86_FEATURE kernel-internal representation of what CPUID bits
it has detected and doesn't do any *proper* feature disablement like
clearing CR4.CET in the user shadow stack case, for example.

So remove its documentation so that it doesn't get used in production
and people don't get silly ideas. It is meant strictly for debugging; and if
a chicken bit for properly disabling a feature is warranted, then that
would need proper enablement.

No functional changes.

Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Mathias Krause <minipli@grsecurity.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://patch.msgid.link/20260520202508.160112-1-bp@kernel.org
---
 .../admin-guide/kernel-parameters.txt          | 18 ------------------
 Documentation/arch/x86/cpuinfo.rst             |  4 ++++
 2 files changed, 4 insertions(+), 18 deletions(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 4d0f545fb3ec..97007f4f69d4 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -789,24 +789,6 @@ Kernel parameters
 	cio_ignore=	[S390]
 			See Documentation/arch/s390/common_io.rst for details.
 
-	clearcpuid=X[,X...] [X86]
-			Disable CPUID feature X for the kernel. See
-			arch/x86/include/asm/cpufeatures.h for the valid bit
-			numbers X. Note the Linux-specific bits are not necessarily
-			stable over kernel options, but the vendor-specific
-			ones should be.
-			X can also be a string as appearing in the flags: line
-			in /proc/cpuinfo which does not have the above
-			instability issue. However, not all features have names
-			in /proc/cpuinfo.
-			Note that using this option will taint your kernel.
-			Also note that user programs calling CPUID directly
-			or using the feature without checking anything
-			will still see it. This just prevents it from
-			being used by the kernel or shown in /proc/cpuinfo.
-			Also note the kernel might malfunction if you disable
-			some critical bits.
-
 	clk_ignore_unused
 			[CLK]
 			Prevents the clock framework from automatically gating
diff --git a/Documentation/arch/x86/cpuinfo.rst b/Documentation/arch/x86/cpuinfo.rst
index 9f2e47c4b1c8..17fce95367e6 100644
--- a/Documentation/arch/x86/cpuinfo.rst
+++ b/Documentation/arch/x86/cpuinfo.rst
@@ -187,6 +187,10 @@ to disable features using the feature number as defined in
 Protection can be disabled using clearcpuid=514. The number 514 is calculated
 from #define X86_FEATURE_UMIP (16*32 + 2).
 
+DO NOT USE this cmdline option in production - it is meant to be used only as
+a quick'n'dirty debugging aid to rule out that feature-enabling code is the
+culprit. If you use it, it'll taint the kernel.
+
 In addition, there exists a variety of custom command-line parameters that
 disable specific features. The list of parameters includes, but is not limited
 to, nofsgsbase, nosgx, noxsave, etc. 5-level paging can also be disabled using

From cda64169bade79427f264e43d0f422eaed9dc116 Mon Sep 17 00:00:00 2001
From: Borislav Petkov <bp@alien8.de>
Date: Wed, 13 May 2026 22:06:01 +0200
Subject: [PATCH 2/5] x86/microcode: Do not access MSR_IA32_PLATFORM_ID when
 running as a guest

Patch in Fixes: causes the usual:

  unchecked MSR access error: RDMSR from 0x17 at ... (intel_get_platform_id)
  Call Trace:
   early_init_intel
   early_cpu_init
   setup_arch
   _printk
   start_kernel
   x86_64_start_reservations
   x86_64_start_kernel
   common_startup_64

because the kernel is booted in a guest.

In order to avoid it, this MSR access needs to be prevented when running
virtualized. That is usually done by checking X86_FEATURE_HYPERVISOR but
for this particular case it is too early yet.

The platform ID needs to be read as early as when microcode is loaded on
the BSP:

  load_ucode_bsp ... -> get_microcode_blob ... -> intel_find_matching_signature

and by that time, CPUID leafs haven't been parsed yet.

The microcode loader already has logic to check early whether the kernel
is running virtualized so make that globally available to arch/x86/. The
query whether running virtualized is getting more and more prominent in
recent times so might as well make it an arch-global var which the rest
of the code can use.
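
For illustration only (not part of this patch): a minimal userspace sketch of
the same early check, testing bit 31 of CPUID[1]:ECX. The helper names
ecx_has_hypervisor_bit() and hypervisor_bit_set() are hypothetical; the
__get_cpuid() helper is assumed to come from the GCC/Clang <cpuid.h> header.
As noted in the loader's own comment, Xen PV guests may not see this bit, so
a clear bit does not prove bare metal.

```c
#include <stdbool.h>
#include <stdint.h>

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>	/* GCC/Clang helper header, assumed available */
#endif

/* Hypothetical helper: test bit 31 of a CPUID[1]:ECX value. */
static bool ecx_has_hypervisor_bit(uint32_t ecx)
{
	return (ecx >> 31) & 1u;
}

/* Hypothetical helper: query CPUID leaf 1 and test the hypervisor bit. */
static bool hypervisor_bit_set(void)
{
#if defined(__x86_64__) || defined(__i386__)
	unsigned int eax, ebx, ecx, edx;

	/* __get_cpuid() returns 0 if the requested leaf is unsupported. */
	if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
		return false;

	return ecx_has_hypervisor_bit(ecx);
#else
	/* Not an x86 build: there is no CPUID instruction to query. */
	return false;
#endif
}
```

Splitting the bit test from the CPUID query keeps the decision logic testable
on any host.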

Fixes: d8630b67ca1ed ("x86/cpu: Add platform ID to CPU info structure")
Reported-by: Vishal Verma <vishal.l.verma@intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Tested-by: Binbin Wu <binbin.wu@linux.intel.com>
Link: https://lore.kernel.org/all/20260430020953.1405535-1-binbin.wu@linux.intel.com
---
 arch/x86/include/asm/processor.h         |  1 +
 arch/x86/kernel/cpu/microcode/amd.c      |  4 ++--
 arch/x86/kernel/cpu/microcode/core.c     | 22 ++++++++++------------
 arch/x86/kernel/cpu/microcode/intel.c    |  3 +++
 arch/x86/kernel/cpu/microcode/internal.h |  1 -
 5 files changed, 16 insertions(+), 15 deletions(-)

diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index 10b5355b323e..67dd932305db 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -733,6 +733,7 @@ bool xen_set_default_idle(void);
 #endif
 
 void __noreturn stop_this_cpu(void *dummy);
+extern bool x86_hypervisor_present;
 void microcode_check(struct cpuinfo_x86 *prev_info);
 void store_cpu_caps(struct cpuinfo_x86 *info);
 
diff --git a/arch/x86/kernel/cpu/microcode/amd.c b/arch/x86/kernel/cpu/microcode/amd.c
index e533881284a1..5c0afae75e9f 100644
--- a/arch/x86/kernel/cpu/microcode/amd.c
+++ b/arch/x86/kernel/cpu/microcode/amd.c
@@ -322,7 +322,7 @@ static u32 get_patch_level(void)
 {
 	u32 rev, dummy __always_unused;
 
-	if (IS_ENABLED(CONFIG_MICROCODE_DBG) && hypervisor_present) {
+	if (IS_ENABLED(CONFIG_MICROCODE_DBG) && x86_hypervisor_present) {
 		int cpu = smp_processor_id();
 
 		if (!microcode_rev[cpu]) {
@@ -714,7 +714,7 @@ static bool __apply_microcode_amd(struct microcode_amd *mc, u32 *cur_rev,
 			invlpg(p_addr_end);
 	}
 
-	if (IS_ENABLED(CONFIG_MICROCODE_DBG) && hypervisor_present)
+	if (IS_ENABLED(CONFIG_MICROCODE_DBG) && x86_hypervisor_present)
 		microcode_rev[smp_processor_id()] = mc->hdr.patch_id;
 
 	/* verify patch application was successful */
diff --git a/arch/x86/kernel/cpu/microcode/core.c b/arch/x86/kernel/cpu/microcode/core.c
index 651202e6fefb..45ca406a8112 100644
--- a/arch/x86/kernel/cpu/microcode/core.c
+++ b/arch/x86/kernel/cpu/microcode/core.c
@@ -57,7 +57,7 @@ bool force_minrev = IS_ENABLED(CONFIG_MICROCODE_LATE_FORCE_MINREV);
 u32 base_rev;
 u32 microcode_rev[NR_CPUS] = {};
 
-bool hypervisor_present;
+bool __ro_after_init x86_hypervisor_present;
 
 /*
  * Synchronization.
@@ -118,14 +118,9 @@ bool __init microcode_loader_disabled(void)
 	/*
 	 * Disable when:
 	 *
-	 * 1) The CPU does not support CPUID.
-	 */
-	if (!cpuid_feature()) {
-		dis_ucode_ldr = true;
-		return dis_ucode_ldr;
-	}
-
-	/*
+	 * 1) The CPU does not support CPUID, detected below in
+	 *    load_ucode_bsp().
+	 *
 	 * 2) Bit 31 in CPUID[1]:ECX is clear
 	 *    The bit is reserved for hypervisor use. This is still not
 	 *    completely accurate as XEN PV guests don't see that CPUID bit
@@ -135,9 +130,7 @@ bool __init microcode_loader_disabled(void)
 	 * 3) Certain AMD patch levels are not allowed to be
 	 *    overwritten.
 	 */
-	hypervisor_present = native_cpuid_ecx(1) & BIT(31);
-
-	if ((hypervisor_present && !IS_ENABLED(CONFIG_MICROCODE_DBG)) ||
+	if ((x86_hypervisor_present && !IS_ENABLED(CONFIG_MICROCODE_DBG)) ||
 	    amd_check_current_patch_level())
 		dis_ucode_ldr = true;
 
@@ -179,6 +172,11 @@ void __init load_ucode_bsp(void)
 
 	early_parse_cmdline();
 
+	if (!cpuid_feature())
+		dis_ucode_ldr = true;
+	else
+		x86_hypervisor_present = native_cpuid_ecx(1) & BIT(31);
+
 	if (microcode_loader_disabled())
 		return;
 
diff --git a/arch/x86/kernel/cpu/microcode/intel.c b/arch/x86/kernel/cpu/microcode/intel.c
index 37ac4afe0972..a4c0a0cf928b 100644
--- a/arch/x86/kernel/cpu/microcode/intel.c
+++ b/arch/x86/kernel/cpu/microcode/intel.c
@@ -138,6 +138,9 @@ u32 intel_get_platform_id(void)
 {
 	unsigned int val[2];
 
+	if (x86_hypervisor_present)
+		return 0;
+
 	/*
 	 * This can be called early. Use CPUID directly instead of
 	 * relying on cpuinfo_x86 which may not be fully initialized.
diff --git a/arch/x86/kernel/cpu/microcode/internal.h b/arch/x86/kernel/cpu/microcode/internal.h
index 3b93c0676b4f..a10b547eda1e 100644
--- a/arch/x86/kernel/cpu/microcode/internal.h
+++ b/arch/x86/kernel/cpu/microcode/internal.h
@@ -48,7 +48,6 @@ extern struct early_load_data early_data;
 extern struct ucode_cpu_info ucode_cpu_info[];
 extern u32 microcode_rev[NR_CPUS];
 extern u32 base_rev;
-extern bool hypervisor_present;
 
 struct cpio_data find_microcode_in_initrd(const char *path);
 

From a17dc12bfed8868e6a86f3b45c16065a70641acb Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Alexis=20Lothor=C3=A9=20=28eBPF=20Foundation=29?=
 <alexis.lothore@bootlin.com>
Date: Wed, 27 May 2026 21:12:31 +0200
Subject: [PATCH 3/5] x86/ftrace: Relocate %rip-relative percpu refs in dynamic
 trampolines
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

With CONFIG_CALL_DEPTH_TRACKING enabled on an x86 retbleed-affected platform
(eg: Skylake), with retbleed=stuff, registering a dynamic ftrace trampoline
crashes on the first call into the traced function:

  BUG: unable to handle page fault for address: ffff88817ae18880
  #PF: supervisor write access in kernel mode
  #PF: error_code(0x0002) - not-present page
  PGD 4b53067 P4D 4b53067 PUD 0
  Oops: Oops: 0002 [#1] SMP PTI
  CPU: 3 UID: 0 PID: 187 Comm: usleep Not tainted 7.0.10 #243 PREEMPT(full)
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Arch Linux 1.17.0-2-2 04/01/2014
  Code: 24 78 00 00 00 00 48 89 ea 48 89 54 24 20 48 8b b4 24 b8 00 00 00 48 8b bc 24 b0 00 00 00 48 89 bc 24 80 00 00 00 48 83 ef 05 <65> 48 c1 3d 1f a8 b6 02 05 48 8b 15 f6 00 00 00 4c 89 3c 24 4c 89
  Call Trace:
   <TASK>
   ? find_held_lock
   ? exc_page_fault
   ? lock_release
   ? __x64_sys_clock_nanosleep
   ? lockdep_hardirqs_on_prepare
   ? trace_hardirqs_on
   __x64_sys_clock_nanosleep
   do_syscall_64
   ? exc_page_fault
   ? call_depth_return_thunk
   entry_SYSCALL_64_after_hwframe
  ...
  Kernel panic - not syncing: Fatal exception

This small reproducer easily triggers the crash:

  # echo 'p __x64_sys_clock_nanosleep' > /sys/kernel/tracing/kprobe_events
  # echo 1 > /sys/kernel/tracing/events/kprobes/p___x64_sys_clock_nanosleep_0/enable
  # usleep 1

Monitoring the crash under GDB points to the exact instruction in charge of
incrementing the call depth:

  sarq $5, %gs:__x86_call_depth(%rip)

This instruction matches the one inserted by the ftrace_regs_caller from
ftrace_64.S. This emitted code was likely working fine until the introduction
of

  59bec00ace28 ("x86/percpu: Introduce %rip-relative addressing to PER_CPU_VAR()"):

it has made the call depth accounting addressing relative to %rip, instead of
being based on an absolute address.

As this code's exact location depends on where the trampoline lives in memory,
the corresponding displacement needs to be adjusted at runtime to actually
correctly find the per-cpu __x86_call_depth value, otherwise the targeted
address is wrong, leading to the page fault seen above.
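
For illustration only: a minimal sketch of the displacement arithmetic
involved when a %rip-relative instruction is copied to a new address. This is
not the implementation of text_poke_apply_relocation(); the function name
relocate_rip_disp32() and its parameters are hypothetical. It assumes the
displacement is a signed 32-bit value relative to the address of the next
instruction, as is the case for %rip-relative operands on x86-64.

```c
#include <stdint.h>

/*
 * Hypothetical helper: recompute a %rip-relative disp32 after the
 * instruction has been copied.
 *
 * disp:        displacement encoded in the original instruction
 * src_next_ip: address following the instruction at its original location
 * dst_next_ip: address following the instruction at its new location
 */
static int32_t relocate_rip_disp32(int32_t disp, uint64_t src_next_ip,
				   uint64_t dst_next_ip)
{
	/* Absolute address the original instruction refers to. */
	uint64_t target = src_next_ip + (uint64_t)(int64_t)disp;

	/* Displacement that reaches the same target from the copy. */
	return (int32_t)(int64_t)(target - dst_next_ip);
}
```

If the copy keeps the original displacement, it refers to an address shifted
by (dst_next_ip - src_next_ip), which is the kind of wrong address that
faults in the trace above.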

Fix the %rip-relative displacement of the copied CALL_DEPTH_ACCOUNT
instruction (from ftrace_regs_caller) by calling text_poke_apply_relocation(),
as it is done for example by the x86 BPF JIT compiler through
x86_call_depth_emit_accounting(). This corrects both CALL_DEPTH_ACCOUNT slots,
in ftrace_caller and ftrace_regs_caller.

  [ bp: Massage. ]

Fixes: 59bec00ace28 ("x86/percpu: Introduce %rip-relative addressing to PER_CPU_VAR()")
Signed-off-by: Alexis Lothoré (eBPF Foundation) <alexis.lothore@bootlin.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Cc: <stable@kernel.org>
Link: https://patch.msgid.link/20260527-fix_call_depth_in_trampoline-v1-1-1c1abc8ae310@bootlin.com
---
 arch/x86/kernel/ftrace.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/arch/x86/kernel/ftrace.c b/arch/x86/kernel/ftrace.c
index 0543b57f54ee..17d6edfcb7e0 100644
--- a/arch/x86/kernel/ftrace.c
+++ b/arch/x86/kernel/ftrace.c
@@ -375,6 +375,13 @@ create_trampoline(struct ftrace_ops *ops, unsigned int *tramp_size)
 			goto fail;
 	}
 
+	/*
+	 * The generated trampoline may contain rIP-relative addressing whose
+	 * displacement needs to be fixed.
+	 */
+	text_poke_apply_relocation(trampoline, trampoline, size,
+				   (void *)start_offset, size);
+
 	/*
 	 * The address of the ftrace_ops that is used for this trampoline
 	 * is stored at the end of the trampoline. This will be used to

From 8aeb879baf12fe64889f019da9a4f8347c604e91 Mon Sep 17 00:00:00 2001
From: Peter Zijlstra <peterz@infradead.org>
Date: Tue, 26 May 2026 11:06:31 +0200
Subject: [PATCH 4/5] x86/kvm/vmx: Fix x86_64 CFI build

It was missed that idt_do_interrupt_irqoff() gets compiled on x86_64;
this is a problem for CFI builds because it includes an unadorned
indirect call. It is however completely dead code.

Rework things to not emit this function at all.

Fixes: 0701c9e17bd9 ("x86/kvm/vmx: Move IRQ/NMI dispatch from KVM into x86 core")
Reported-by: Nathan Chancellor <nathan@kernel.org>
Reported-by: Calvin Owens <calvin@wbinvd.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Nathan Chancellor <nathan@kernel.org>
Link: https://patch.msgid.link/20260526090631.GA4149641@noisy.programming.kicks-ass.net
---
 arch/x86/entry/common.c |  2 +-
 arch/x86/entry/entry.S  |  2 ++
 arch/x86/kernel/idt.c   | 12 ++----------
 3 files changed, 5 insertions(+), 11 deletions(-)

diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index 06c7c6ebd6f9..14cd43d4da6c 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -55,7 +55,7 @@ noinstr void x86_entry_from_kvm(unsigned int event_type, unsigned int vector)
 	 * The FRED NMI context is significantly different and will not work
 	 * right (specifically FRED fixed the NMI recursion issue).
 	 */
-	idt_entry_from_kvm(vector);
+	idt_do_nmi_irqoff();
 }
 EXPORT_SYMBOL_FOR_KVM(x86_entry_from_kvm);
 #endif
diff --git a/arch/x86/entry/entry.S b/arch/x86/entry/entry.S
index a56e043b266d..2bc217bb5475 100644
--- a/arch/x86/entry/entry.S
+++ b/arch/x86/entry/entry.S
@@ -109,11 +109,13 @@ EXPORT_SYMBOL(__ref_stack_chk_guard);
 	RET
 .endm
 
+#ifndef CONFIG_X86_64
 .pushsection .text, "ax"
 SYM_FUNC_START(idt_do_interrupt_irqoff)
 	IDT_DO_EVENT_IRQOFF CALL_NOSPEC _ASM_ARG1
 SYM_FUNC_END(idt_do_interrupt_irqoff)
 .popsection
+#endif
 
 .pushsection .noinstr.text, "ax"
 SYM_FUNC_START(idt_do_nmi_irqoff)
diff --git a/arch/x86/kernel/idt.c b/arch/x86/kernel/idt.c
index 7bcf1decc034..90a22e24a9eb 100644
--- a/arch/x86/kernel/idt.c
+++ b/arch/x86/kernel/idt.c
@@ -268,18 +268,10 @@ void __init idt_setup_early_pf(void)
 }
 #endif
 
-#if IS_ENABLED(CONFIG_KVM_INTEL)
-noinstr void idt_entry_from_kvm(unsigned int vector)
+#if IS_ENABLED(CONFIG_KVM_INTEL) && !defined(CONFIG_X86_64)
+void idt_entry_from_kvm(unsigned int vector)
 {
-	if (vector == NMI_VECTOR)
-		return idt_do_nmi_irqoff();
-
-	/*
-	 * Only the NMI path requires noinstr.
-	 */
-	instrumentation_begin();
 	idt_do_interrupt_irqoff(gate_offset(idt_table + vector));
-	instrumentation_end();
 }
 #endif
 

From 44eeff9bc467bc7d1fec34fc3f6001f385fe462c Mon Sep 17 00:00:00 2001
From: Andrei Vagin <avagin@google.com>
Date: Tue, 26 May 2026 20:50:43 +0000
Subject: [PATCH 5/5] Revert "x86/fpu: Refine and simplify the magic number
 check during signal return"

This reverts

  dc8aa31a7ac2 ("x86/fpu: Refine and simplify the magic number check during signal return").

The aforementioned commit broke applications that construct signal frames in
userspace (such as CRIU and gVisor) if the frame's xstate size is smaller than
the kernel's fpstate->user_size.

Furthermore, this introduces a critical issue for checkpoint/restore tools
like CRIU. If a process is checkpointed while inside a signal handler, its
stack contains a signal frame formatted according to the source host's xstate
capabilities.

If that process is later restored on a destination host with larger xstate
capabilities (e.g., a newer CPU with more features enabled, resulting in
a larger fpstate->user_size), the kernel will look for FP_XSTATE_MAGIC2 at the
destination host's larger user_size offset instead of the offset encoded in
the frame's fx_sw->xstate_size.

This causes the magic2 check to fail, forcing sigreturn to silently fall back
to "FX-only" mode. Upon return from the signal handler, the process's extended
state is reset to initial values instead of being restored, leading to silent
data corruption.

The aforementioned commit cited

  d877550eaf2d ("x86/fpu: Stop relying on userspace for info to fault in xsave buffer")

as justification to stop relying on userspace for the magic number check.

However, these two changes are fundamentally different. The latter only
changed how much memory the kernel ensures is paged-in before running XRSTOR
to prevent an infinite loop. It did not change the signal frame format or how
the layout is validated.

Reverting this change restores the use of fx_sw->xstate_size for
locating magic2 and restores the necessary sanity checks, ensuring that
the signal frame remains self-describing and portable.
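
For illustration only: a simplified userspace model of the restored check.
The struct sw_bytes_model type, the frame_has_xstate() function, and the
MIN_XSTATE_SIZE macro are hypothetical names; the 512 + 64 byte minimum
assumes the usual sizes of struct fxregs_state and struct xstate_header. The
model folds the kernel's "fault" and "fall back to FX-only" outcomes into
a single false result.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define FP_XSTATE_MAGIC1	0x46505853U
#define FP_XSTATE_MAGIC2	0x46505845U
/* Assumed: sizeof(struct fxregs_state) + sizeof(struct xstate_header). */
#define MIN_XSTATE_SIZE		(512u + 64u)

/* Hypothetical model of the software-reserved bytes used by the check. */
struct sw_bytes_model {
	uint32_t magic1;
	uint32_t extended_size;
	uint32_t xstate_size;
};

/* Returns true if the frame describes a valid xstate area. */
static bool frame_has_xstate(const struct sw_bytes_model *sw,
			     const uint8_t *frame, size_t frame_len,
			     uint32_t kernel_user_size)
{
	uint32_t magic2;

	/* Sanity checks on the sizes the frame claims for itself. */
	if (sw->magic1 != FP_XSTATE_MAGIC1 ||
	    sw->xstate_size < MIN_XSTATE_SIZE ||
	    sw->xstate_size > kernel_user_size ||
	    sw->xstate_size > sw->extended_size)
		return false;

	/* Model-only bounds check standing in for the user-copy fault. */
	if ((size_t)sw->xstate_size + sizeof(magic2) > frame_len)
		return false;

	/* magic2 is located by the frame's own size, not the host's. */
	memcpy(&magic2, frame + sw->xstate_size, sizeof(magic2));
	return magic2 == FP_XSTATE_MAGIC2;
}
```

In this model, a frame built with a smaller xstate_size is still accepted
when the kernel's user_size is larger, because the magic2 offset comes from
the frame rather than from the host.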

  [ bp: Massage commit message. ]

Fixes: dc8aa31a7ac2 ("x86/fpu: Refine and simplify the magic number check during signal return")
Signed-off-by: Andrei Vagin <avagin@google.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Acked-by: Chang S. Bae <chang.seok.bae@intel.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/all/20260429000623.3356606-1-avagin@google.com
---
 arch/x86/kernel/fpu/signal.c | 11 ++++++++---
 1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kernel/fpu/signal.c b/arch/x86/kernel/fpu/signal.c
index c3ec2512f2bb..20b638c507ca 100644
--- a/arch/x86/kernel/fpu/signal.c
+++ b/arch/x86/kernel/fpu/signal.c
@@ -27,14 +27,19 @@
 static inline bool check_xstate_in_sigframe(struct fxregs_state __user *fxbuf,
 					    struct _fpx_sw_bytes *fx_sw)
 {
+	int min_xstate_size = sizeof(struct fxregs_state) +
+			      sizeof(struct xstate_header);
 	void __user *fpstate = fxbuf;
 	unsigned int magic2;
 
 	if (__copy_from_user(fx_sw, &fxbuf->sw_reserved[0], sizeof(*fx_sw)))
 		return false;
 
-	/* Check for the first magic field */
-	if (fx_sw->magic1 != FP_XSTATE_MAGIC1)
+	/* Check for the first magic field and other error scenarios. */
+	if (fx_sw->magic1 != FP_XSTATE_MAGIC1 ||
+	    fx_sw->xstate_size < min_xstate_size ||
+	    fx_sw->xstate_size > x86_task_fpu(current)->fpstate->user_size ||
+	    fx_sw->xstate_size > fx_sw->extended_size)
 		goto setfx;
 
 	/*
@@ -43,7 +48,7 @@ static inline bool check_xstate_in_sigframe(struct fxregs_state __user *fxbuf,
 	 * fpstate layout with out copying the extended state information
 	 * in the memory layout.
 	 */
-	if (__get_user(magic2, (__u32 __user *)(fpstate + x86_task_fpu(current)->fpstate->user_size)))
+	if (__get_user(magic2, (__u32 __user *)(fpstate + fx_sw->xstate_size)))
 		return false;
 
 	if (likely(magic2 == FP_XSTATE_MAGIC2))