Commit Graph

40974 Commits

Author SHA1 Message Date
Linus Torvalds
9c6913b749 - Fix the MSI message data struct definition
- Use local labels in the exception table macros to avoid symbol
 conflicts with clang LTO builds
 
 - A couple of fixes to objtool checking of the relatively newly added
 SLS and IBT code
 
 - Rename a local var in the WARN* macro machinery to prevent shadowing
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmJSwSkACgkQEsHwGGHe
 VUp6QQ//TGhL2xxLoN+7pYjIBDEDHJ3Oi0m6fOweqyQAZTYcm/rAPqd7hvoWVSoO
 YsLdWi9jeMwkzG0ItSm/qPVm/UvrViXwuQMdz4nDWqg2IPFIbhgNA3CKCIyPTio2
 WHp2NXvYyDnwPMr6xTTRndMDoxiwxMBnXf91pNwoU3toxw0GuUuXan0Y+GKnvx1A
 sqhbpWO27bAmhKb26wPw5soJVxBbSqx+1TbFVG0Sz/uwYQowMa+nfNg1DXF0sXyJ
 E/ssqBB6wjl7ANVbQsxBQHRzr/EksLVPwHHrlT8ga/5loin+VJ6mTBCPLgG7SMBE
 +R1fm79Bp/9KU194fcqhJ3pvnyJPi8hfizzCqNKnK871V8LRzC+jW0l3EdvASEXC
 sDj0XWsSFoWft9eAtMV11d641uVC4rLB90GyyzmWWrEw9BbxmasBgED6QBx9d+V6
 o1L4y58Tsz88HKzwd0PtBkeGDkvkA7xOx8ViG24IeLA0tcbixnfnATQdelQeWKqO
 4m3o1JU8ogJp9JCEBY7ZeXyStFjZMedM4U/V0akF6AKnpDuVfR3T5C68cYhoLKBu
 XU6Swf5sFHImNWp0+54HPnXhHj/uhuwj9YWCkxx/eXViwvVlxSdTdIQWa380EddN
 0KhOFLwLOdhha2+81FJc6vmkDHwiu6hlR38yqdGvdxZf/KPKjM0=
 =kMtP
 -----END PGP SIGNATURE-----

Merge tag 'x86_urgent_for_v5.18_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 fixes from Borislav Petkov:

 - Fix the MSI message data struct definition

 - Use local labels in the exception table macros to avoid symbol
   conflicts with clang LTO builds

 - A couple of fixes to objtool checking of the relatively newly added
   SLS and IBT code

 - Rename a local var in the WARN* macro machinery to prevent shadowing

* tag 'x86_urgent_for_v5.18_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/msi: Fix msi message data shadow struct
  x86/extable: Prefer local labels in .set directives
  x86,bpf: Avoid IBT objtool warning
  objtool: Fix SLS validation for kcov tail-call replacement
  objtool: Fix IBT tail-call detection
  x86/bug: Prevent shadowing in __WARN_FLAGS
  x86/mm/tlb: Revert retpoline avoidance approach
2022-04-10 07:12:27 -10:00
Linus Torvalds
b51f86e990 - A couple of fixes to cgroup-related handling of perf events
- A couple of fixes to event encoding on Sapphire Rapids
 
 - Pass event caps of inherited events so that perf doesn't fail wrongly at fork()
 
 - Add support for a new Raptor Lake CPU
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmJSvt0ACgkQEsHwGGHe
 VUpKBRAAtegwh4ilwoRM0LePH2TX752pREy+M1qfEUp/XyH3tF8VAixCAmIg7qlI
 IyjRX0AKDC1F08sM7/JmTf0M+hnl/oH2YPG8Q6p3igtfARvn+5bPZdSBpTAC9P5L
 QX3S2WVzv5X78IomIfENqbg5HyZP3IXeg7R7sqZhHbtoG54n5NEv/+aJl5HmHFTt
 gLTrXetL46OSMnLzKfd3hlJqCWSnTz1aGKgGX2cZy9ipI63+XrYMuNmiwJ+CrA3G
 pI98RmKnCPqV2rXij1GpVQNyG2aPR+VVZM3aaq6XBAmiNTaCfnvWbEBGhCkjaSgA
 UU7Y6D1Qxc0OZ1plcjhKc4l/W1oj8jqmG9nS6J2Xy4szdpZIdxBhlWq89xCrb9AC
 yIgKif2iVl7eMVKVG1Jq1u2wTwurBAamH73sCCNn8ndctBjicoM8pbtHMHxzceyZ
 w4Cff0yUNzHgPiqSHQRARw/CaUceL9kDoGzPeEQOR0A+27MpNulchts4HCtIvwzI
 yLIK1JFPHDrCACLTMuAhvov3EMTeoTIfc91eOZRjubRTPx7TxujaZHdP7N+R3nkk
 Giehc/l6IhFPhT8QACk0bziTVJ9in+Jx8pCnocGKuj80Uqs7Sq7swjlasy1Zoy7r
 x9Qzy1gZhPHnvPd6LWU4WyPa767D07DlG/zFdg+P3EeWa/3efdw=
 =ba3V
 -----END PGP SIGNATURE-----

Merge tag 'perf_urgent_for_v5.18_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull perf fixes from Borislav Petkov:

 - A couple of fixes to cgroup-related handling of perf events

 - A couple of fixes to event encoding on Sapphire Rapids

 - Pass event caps of inherited events so that perf doesn't fail wrongly
   at fork()

 - Add support for a new Raptor Lake CPU

* tag 'perf_urgent_for_v5.18_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf/core: Always set cpuctx cgrp when enable cgroup event
  perf/core: Fix perf_cgroup_switch()
  perf/core: Use perf_cgroup_info->active to check if cgroup is active
  perf/core: Don't pass task around when ctx sched in
  perf/x86/intel: Update the FRONTEND MSR mask on Sapphire Rapids
  perf/x86/intel: Don't extend the pseudo-encoding to GP counters
  perf/core: Inherit event_caps
  perf/x86/uncore: Add Raptor Lake uncore support
  perf/x86/msr: Add Raptor Lake CPU support
  perf/x86/cstate: Add Raptor Lake support
  perf/x86: Add Intel Raptor Lake support
2022-04-10 07:08:22 -10:00
Linus Torvalds
50c94de67c - Allow the compiler to optimize away unused percpu accesses and change
the local_lock_* macros back to inline functions
 
 - A couple of fixes to static call insn patching
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmJStZ4ACgkQEsHwGGHe
 VUpUpA/8DHOMUQa7rM8z49ZWBV01HNVCLECTeeKshQBLyJfWc84MNOfdPbpgEGvY
 XE/eIZDnTMB5UKD0bfRqD+AQ0fXjl3NiLnJrdDZJqEQAiP/wGBswKNXMire8xPT8
 9MfaOKYWYPl0LY2uZBWVLcdC+lVe4kRGfhqAcl4LRx0ZSvMzgjcFy34NeXY8LlXD
 kFQJEzHa97CTROje54mtmXEt7Y5bxjxWwVTSyfEt0hJPGo1bJtJP6FaY01Muj+Xu
 h/OGNx3KLOYf9MqQC31caAwKgtUOptm8bTpvG3onaHg29qJgz2umKwONyOjYrUUn
 2PE3NREfMuKI38nf88pX+lOCs6/I1uVIjJPvAVJijIcuI1ZBXrfm26IP0lZ3LqG1
 h/9Y5gChiZPn1j90VnF4UCJUm4u3bYEAHqKIQgUdpcpUqX0NlxbDiXoYxJWfHnmB
 PBJ0PE7Vdo4MPK0n3BGVrzXAFeOyHsohAsKFijT8afRCMAOF/ebmVs/tI5NygFrK
 11e/U13/78iKkazZSxWew8vU3yXA39W5Rym7aPnhR2lWxvN+xQOjNTgZTxF9hUcZ
 6AcsaYJgHR7nD8SM7Y9+cwHWOWaDEdZMg9XSkgvyd1p0tHb4u+Ve/SQK7sA3j9q7
 ZmZyFSE1X3K+M1i+75rUSVmIEVM5cpfhodN89iRje/JIZ1KyRT8=
 =hSOc
 -----END PGP SIGNATURE-----

Merge tag 'locking_urgent_for_v5.18_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull locking fixes from Borislav Petkov:

 - Allow the compiler to optimize away unused percpu accesses and change
   the local_lock_* macros back to inline functions

 - A couple of fixes to static call insn patching

* tag 'locking_urgent_for_v5.18_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  Revert "mm/page_alloc: mark pagesets as __maybe_unused"
  Revert "locking/local_lock: Make the empty local_lock_*() function a macro."
  x86/percpu: Remove volatile from arch_raw_cpu_ptr().
  static_call: Remove __DEFINE_STATIC_CALL macro
  static_call: Properly initialise DEFINE_STATIC_CALL_RET0()
  static_call: Don't make __static_call_return0 static
  x86,static_call: Fix __static_call_return0 for i386
2022-04-10 06:56:46 -10:00
Reto Buerki
59b18a1e65 x86/msi: Fix msi message data shadow struct
The x86 MSI message data is 32 bits in total and is either in
compatibility or remappable format, see Intel Virtualization Technology
for Directed I/O, section 5.1.2.

Fixes: 6285aa5073 ("x86/msi: Provide msi message shadow structs")
Co-developed-by: Adrian-Ken Rueegsegger <ken@codelabs.ch>
Signed-off-by: Adrian-Ken Rueegsegger <ken@codelabs.ch>
Signed-off-by: Reto Buerki <reet@codelabs.ch>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20220407110647.67372-1-reet@codelabs.ch
2022-04-07 15:19:32 +02:00
Nick Desaulniers
334865b291 x86/extable: Prefer local labels in .set directives
Bernardo reported an error that Nathan bisected down to
(x86_64) defconfig+LTO_CLANG_FULL+X86_PMEM_LEGACY.

    LTO     vmlinux.o
  ld.lld: error: <instantiation>:1:13: redefinition of 'found'
  .set found, 0
              ^

  <inline asm>:29:1: while in macro instantiation
  extable_type_reg reg=%eax, type=(17 | ((0) << 16))
  ^

This appears to be another LTO specific issue similar to what was folded
into commit 4b5305decc ("x86/extable: Extend extable functionality"),
where the `.set found, 0` in DEFINE_EXTABLE_TYPE_REG in
arch/x86/include/asm/asm.h conflicts with the symbol for the static
function `found` in arch/x86/kernel/pmem.c.

Assembler .set directive declare symbols with global visibility, so the
assembler may not rename such symbols in the event of a conflict. LTO
could rename static functions if there was a conflict in C sources, but
it cannot see into symbols defined in inline asm.

The symbols are also retained in the symbol table, regardless of LTO.

Give the symbols .L prefixes making them locally visible, so that they
may be renamed for LTO to avoid conflicts, and to drop them from the
symbol table regardless of LTO.

Fixes: 4b5305decc ("x86/extable: Extend extable functionality")
Reported-by: Bernardo Meurer Costa <beme@google.com>
Debugged-by: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Nick Desaulniers <ndesaulniers@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Nathan Chancellor <nathan@kernel.org>
Tested-by: Nathan Chancellor <nathan@kernel.org>
Link: https://lore.kernel.org/r/20220329202148.2379697-1-ndesaulniers@google.com
2022-04-07 11:27:02 +02:00
Peter Zijlstra
be8a096521 x86,bpf: Avoid IBT objtool warning
Clang can inline emit_indirect_jump() and then folds constants, which
results in:

  | vmlinux.o: warning: objtool: emit_bpf_dispatcher()+0x6a4: relocation to !ENDBR: .text.__x86.indirect_thunk+0x40
  | vmlinux.o: warning: objtool: emit_bpf_dispatcher()+0x67d: relocation to !ENDBR: .text.__x86.indirect_thunk+0x40
  | vmlinux.o: warning: objtool: emit_bpf_tail_call_indirect()+0x386: relocation to !ENDBR: .text.__x86.indirect_thunk+0x20
  | vmlinux.o: warning: objtool: emit_bpf_tail_call_indirect()+0x35d: relocation to !ENDBR: .text.__x86.indirect_thunk+0x20

Suppress the optimization such that it must emit a code reference to
the __x86_indirect_thunk_array[] base.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lkml.kernel.org/r/20220405075531.GB30877@worktop.programming.kicks-ass.net
2022-04-07 11:27:02 +02:00
Pawan Gupta
e2a1256b17 x86/speculation: Restore speculation related MSRs during S3 resume
After resuming from suspend-to-RAM, the MSRs that control CPU's
speculative execution behavior are not being restored on the boot CPU.

These MSRs are used to mitigate speculative execution vulnerabilities.
Not restoring them correctly may leave the CPU vulnerable.  Secondary
CPU's MSRs are correctly being restored at S3 resume by
identify_secondary_cpu().

During S3 resume, restore these MSRs for boot CPU when restoring its
processor state.

Fixes: 772439717d ("x86/bugs/intel: Set proper CPU features and setup RDS")
Reported-by: Neelima Krishnan <neelima.krishnan@intel.com>
Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Tested-by: Neelima Krishnan <neelima.krishnan@intel.com>
Acked-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: stable@vger.kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-04-05 10:18:31 -07:00
Pawan Gupta
73924ec4d5 x86/pm: Save the MSR validity status at context setup
The mechanism to save/restore MSRs during S3 suspend/resume checks for
the MSR validity during suspend, and only restores the MSR if its a
valid MSR.  This is not optimal, as an invalid MSR will unnecessarily
throw an exception for every suspend cycle.  The more invalid MSRs,
higher the impact will be.

Check and save the MSR validity at setup.  This ensures that only valid
MSRs that are guaranteed to not throw an exception will be attempted
during suspend.

Fixes: 7a9c2dd08e ("x86/pm: Introduce quirk framework to save/restore extra MSR registers around suspend/resume")
Suggested-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Borislav Petkov <bp@suse.de>
Cc: stable@vger.kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-04-05 10:18:31 -07:00
Vincent Mailhol
9ce02f0fc6 x86/bug: Prevent shadowing in __WARN_FLAGS
The macro __WARN_FLAGS() uses a local variable named "f". This being a
common name, there is a risk of shadowing other variables.

For example, GCC would yield:

| In file included from ./include/linux/bug.h:5,
|                  from ./include/linux/cpumask.h:14,
|                  from ./arch/x86/include/asm/cpumask.h:5,
|                  from ./arch/x86/include/asm/msr.h:11,
|                  from ./arch/x86/include/asm/processor.h:22,
|                  from ./arch/x86/include/asm/timex.h:5,
|                  from ./include/linux/timex.h:65,
|                  from ./include/linux/time32.h:13,
|                  from ./include/linux/time.h:60,
|                  from ./include/linux/stat.h:19,
|                  from ./include/linux/module.h:13,
|                  from virt/lib/irqbypass.mod.c:1:
| ./include/linux/rcupdate.h: In function 'rcu_head_after_call_rcu':
| ./arch/x86/include/asm/bug.h:80:21: warning: declaration of 'f' shadows a parameter [-Wshadow]
|    80 |         __auto_type f = BUGFLAG_WARNING|(flags);                \
|       |                     ^
| ./include/asm-generic/bug.h:106:17: note: in expansion of macro '__WARN_FLAGS'
|   106 |                 __WARN_FLAGS(BUGFLAG_ONCE |                     \
|       |                 ^~~~~~~~~~~~
| ./include/linux/rcupdate.h:1007:9: note: in expansion of macro 'WARN_ON_ONCE'
|  1007 |         WARN_ON_ONCE(func != (rcu_callback_t)~0L);
|       |         ^~~~~~~~~~~~
| In file included from ./include/linux/rbtree.h:24,
|                  from ./include/linux/mm_types.h:11,
|                  from ./include/linux/buildid.h:5,
|                  from ./include/linux/module.h:14,
|                  from virt/lib/irqbypass.mod.c:1:
| ./include/linux/rcupdate.h:1001:62: note: shadowed declaration is here
|  1001 | rcu_head_after_call_rcu(struct rcu_head *rhp, rcu_callback_t f)
|       |                                               ~~~~~~~~~~~~~~~^

For reference, sparse also warns about it, c.f. [1].

This patch renames the variable from f to __flags (with two underscore
prefixes as suggested in the Linux kernel coding style [2]) in order
to prevent collisions.

[1] https://lore.kernel.org/all/CAFGhKbyifH1a+nAMCvWM88TK6fpNPdzFtUXPmRGnnQeePV+1sw@mail.gmail.com/

[2] Linux kernel coding style, section 12) Macros, Enums and RTL,
paragraph 5) namespace collisions when defining local variables in
macros resembling functions
https://www.kernel.org/doc/html/latest/process/coding-style.html#macros-enums-and-rtl

Fixes: bfb1a7c91f ("x86/bug: Merge annotate_reachable() into_BUG_FLAGS() asm")
Signed-off-by: Vincent Mailhol <mailhol.vincent@wanadoo.fr>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
Acked-by: Josh Poimboeuf <jpoimboe@redhat.com>
Link: https://lkml.kernel.org/r/20220324023742.106546-1-mailhol.vincent@wanadoo.fr
2022-04-05 10:24:40 +02:00
Kan Liang
e590928de7 perf/x86/intel: Update the FRONTEND MSR mask on Sapphire Rapids
On Sapphire Rapids, the FRONTEND_RETIRED.MS_FLOWS event requires the
FRONTEND MSR value 0x8. However, the current FRONTEND MSR mask doesn't
support it.

Update intel_spr_extra_regs[] to support it.

Fixes: 61b985e3e7 ("perf/x86/intel: Add perf core PMU support for Sapphire Rapids")
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/1648482543-14923-2-git-send-email-kan.liang@linux.intel.com
2022-04-05 09:59:44 +02:00
Kan Liang
4a263bf331 perf/x86/intel: Don't extend the pseudo-encoding to GP counters
The INST_RETIRED.PREC_DIST event (0x0100) doesn't count on SPR.
perf stat -e cpu/event=0xc0,umask=0x0/,cpu/event=0x0,umask=0x1/ -C0

 Performance counter stats for 'CPU(s) 0':

           607,246      cpu/event=0xc0,umask=0x0/
                 0      cpu/event=0x0,umask=0x1/

The encoding for INST_RETIRED.PREC_DIST is pseudo-encoding, which
doesn't work on the generic counters. However, current perf extends its
mask to the generic counters.

The pseudo event-code for a fixed counter must be 0x00. Check and avoid
extending the mask for the fixed counter event which using the
pseudo-encoding, e.g., ref-cycles and PREC_DIST event.

With the patch,
perf stat -e cpu/event=0xc0,umask=0x0/,cpu/event=0x0,umask=0x1/ -C0

 Performance counter stats for 'CPU(s) 0':

           583,184      cpu/event=0xc0,umask=0x0/
           583,048      cpu/event=0x0,umask=0x1/

Fixes: 2de71ee153 ("perf/x86/intel: Fix ICL/SPR INST_RETIRED.PREC_DIST encodings")
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/1648482543-14923-1-git-send-email-kan.liang@linux.intel.com
2022-04-05 09:59:44 +02:00
Kan Liang
ad4878d4d7 perf/x86/uncore: Add Raptor Lake uncore support
The uncore PMU of the Raptor Lake is the same as Alder Lake.
Add new PCIIDs of IMC for Raptor Lake.

Signed-off-by: Kan Liang <kan.liang@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/1647366360-82824-4-git-send-email-kan.liang@linux.intel.com
2022-04-05 09:59:43 +02:00
Kan Liang
82cd83047a perf/x86/msr: Add Raptor Lake CPU support
Raptor Lake is Intel's successor to Alder lake. PPERF and SMI_COUNT MSRs
are also supported.

Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/1647366360-82824-3-git-send-email-kan.liang@linux.intel.com
2022-04-05 09:59:43 +02:00
Kan Liang
2da202aa1c perf/x86/cstate: Add Raptor Lake support
Raptor Lake is Intel's successor to Alder lake. From the perspective of
Intel cstate residency counters, there is nothing changed compared with
Alder lake.

Share adl_cstates with Alder lake.
Update the comments for Raptor Lake.

Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/1647366360-82824-2-git-send-email-kan.liang@linux.intel.com
2022-04-05 09:59:43 +02:00
Kan Liang
c61759e581 perf/x86: Add Intel Raptor Lake support
From PMU's perspective, Raptor Lake is the same as the Alder Lake. The
only difference is the event list, which will be supported in the perf
tool later.

Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/1647366360-82824-1-git-send-email-kan.liang@linux.intel.com
2022-04-05 09:59:43 +02:00
Sebastian Andrzej Siewior
1c1e7e3c23 x86/percpu: Remove volatile from arch_raw_cpu_ptr().
The volatile attribute in the inline assembly of arch_raw_cpu_ptr()
forces the compiler to always generate the code, even if the compiler
can decide upfront that its result is not needed.

For instance invoking __intel_pmu_disable_all(false) (like
intel_pmu_snapshot_arch_branch_stack() does) leads to loading the
address of &cpu_hw_events into the register while compiler knows that it
has no need for it. This ends up with code like:

|	movq	$cpu_hw_events, %rax			#, tcp_ptr__
|	add	%gs:this_cpu_off(%rip), %rax		# this_cpu_off, tcp_ptr__
|	xorl	%eax, %eax				# tmp93

It also creates additional code within local_lock() with !RT &&
!LOCKDEP which is not desired.

By removing the volatile attribute the compiler can place the
function freely and avoid it if it is not needed in the end.
By using the function twice the compiler properly caches only the
variable offset and always loads the CPU-offset.

this_cpu_ptr() also remains properly placed within a preempt_disable()
sections because
- arch_raw_cpu_ptr() assembly has a memory input ("m" (this_cpu_off))
- prempt_{dis,en}able() fundamentally has a 'barrier()' in it

Therefore this_cpu_ptr() is already properly serialized and does not
rely on the 'volatile' attribute.

Remove volatile from arch_raw_cpu_ptr().

[ bigeasy: Added Linus' explanation why this_cpu_ptr() is not moved out
  of a preempt_disable() section without the 'volatile' attribute. ]

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20220328145810.86783-2-bigeasy@linutronix.de
2022-04-05 09:59:38 +02:00
Christophe Leroy
5517d50082 static_call: Properly initialise DEFINE_STATIC_CALL_RET0()
When a static call is updated with __static_call_return0() as target,
arch_static_call_transform() set it to use an optimised set of
instructions which are meant to lay in the same cacheline.

But when initialising a static call with DEFINE_STATIC_CALL_RET0(),
we get a branch to the real __static_call_return0() function instead
of getting the optimised setup:

	c00d8120 <__SCT__perf_snapshot_branch_stack>:
	c00d8120:	4b ff ff f4 	b       c00d8114 <__static_call_return0>
	c00d8124:	3d 80 c0 0e 	lis     r12,-16370
	c00d8128:	81 8c 81 3c 	lwz     r12,-32452(r12)
	c00d812c:	7d 89 03 a6 	mtctr   r12
	c00d8130:	4e 80 04 20 	bctr
	c00d8134:	38 60 00 00 	li      r3,0
	c00d8138:	4e 80 00 20 	blr
	c00d813c:	00 00 00 00 	.long 0x0

Add ARCH_DEFINE_STATIC_CALL_RET0_TRAMP() defined by each architecture
to setup the optimised configuration, and rework
DEFINE_STATIC_CALL_RET0() to call it:

	c00d8120 <__SCT__perf_snapshot_branch_stack>:
	c00d8120:	48 00 00 14 	b       c00d8134 <__SCT__perf_snapshot_branch_stack+0x14>
	c00d8124:	3d 80 c0 0e 	lis     r12,-16370
	c00d8128:	81 8c 81 3c 	lwz     r12,-32452(r12)
	c00d812c:	7d 89 03 a6 	mtctr   r12
	c00d8130:	4e 80 04 20 	bctr
	c00d8134:	38 60 00 00 	li      r3,0
	c00d8138:	4e 80 00 20 	blr
	c00d813c:	00 00 00 00 	.long 0x0

Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Josh Poimboeuf <jpoimboe@redhat.com>
Link: https://lore.kernel.org/r/1e0a61a88f52a460f62a58ffc2a5f847d1f7d9d8.1647253456.git.christophe.leroy@csgroup.eu
2022-04-05 09:59:38 +02:00
Peter Zijlstra
1cd5f059d9 x86,static_call: Fix __static_call_return0 for i386
Paolo reported that the instruction sequence that is used to replace:

    call __static_call_return0

namely:

    66 66 48 31 c0	data16 data16 xor %rax,%rax

decodes to something else on i386, namely:

    66 66 48		data16 dec %ax
    31 c0		xor    %eax,%eax

Which is a nonsensical sequence that happens to have the same outcome.
*However* an important distinction is that it consists of 2
instructions which is a problem when the thing needs to be overwriten
with a regular call instruction again.

As such, replace the instruction with something that decodes the same
on both i386 and x86_64.

Fixes: 3f2a8fc4b1 ("static_call/x86: Add __static_call_return0()")
Reported-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20220318204419.GT8939@worktop.programming.kicks-ass.net
2022-04-05 09:59:37 +02:00
Dave Hansen
d39268ad24 x86/mm/tlb: Revert retpoline avoidance approach
0day reported a regression on a microbenchmark which is intended to
stress the TLB flushing path:

	https://lore.kernel.org/all/20220317090415.GE735@xsang-OptiPlex-9020/

It pointed at a commit from Nadav which intended to remove retpoline
overhead in the TLB flushing path by taking the 'cond'-ition in
on_each_cpu_cond_mask(), pre-calculating it, and incorporating it into
'cpumask'.  That allowed the code to use a bunch of earlier direct
calls instead of later indirect calls that need a retpoline.

But, in practice, threads can go idle (and into lazy TLB mode where
they don't need to flush their TLB) between the early and late calls.
It works in this direction and not in the other because TLB-flushing
threads tend to hold mmap_lock for write.  Contention on that lock
causes threads to _go_ idle right in this early/late window.

There was not any performance data in the original commit specific
to the retpoline overhead.  I did a few tests on a system with
retpolines:

	https://lore.kernel.org/all/dd8be93c-ded6-b962-50d4-96b1c3afb2b7@intel.com/

which showed a possible small win.  But, that small win pales in
comparison with the bigger loss induced on non-retpoline systems.

Revert the patch that removed the retpolines.  This was not a
clean revert, but it was self-contained enough not to be too painful.

Fixes: 6035152d8e ("x86/mm/tlb: Open-code on_each_cpu_cond_mask() for tlb_is_not_lazy()")
Reported-by: kernel test robot <oliver.sang@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Nadav Amit <namit@vmware.com>
Cc: <stable@vger.kernel.org>
Link: https://lkml.kernel.org/r/164874672286.389.7021457716635788197.tip-bot2@tip-bot2
2022-04-04 19:41:36 +02:00
Linus Torvalds
8b5656bc4e A set of x86 fixes and updates:
- Make the prctl() for enabling dynamic XSTATE components correct so it
     adds the newly requested feature to the permission bitmap instead of
     overwriting it. Add a selftest which validates that.
 
   - Unroll string MMIO for encrypted SEV guests as the hypervisor cannot
     emulate it.
 
   - Handle supervisor states correctly in the FPU/XSTATE code so it takes
     the feature set of the fpstate buffer into account. The feature sets
     can differ between host and guest buffers. Guest buffers do not contain
     supervisor states. So far this was not an issue, but with enabling
     PASID it needs to be handled in the buffer offset calculation and in
     the permission bitmaps.
 
   - Avoid a gazillion of repeated CPUID invocations in by caching the values
     early in the FPU/XSTATE code.
 
   - Enable CONFIG_WERROR for X86.
 
   - Make the X86 defconfigs more useful by adapting them to Y2022 reality.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmJJWwwTHHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYoT3mEACA9xkNjECn/MHN3B0X5wTPhVyw9+TJ
 OdfpqL7C9pbAU1s2mwf3TyicrCOqx8nlnOYB/mXgfRGnbZqmUeGQFpZFM587dm/I
 r/BtouAzSASjnaW7SijT3gnRTqMPVNTcLOTUEVjnTa7zatw+t4rH1uxE9dLqEq9B
 cKMtsBOJyTTbj4ie3ngkUS2PQngNNHLJ4oQGZW4wCA5snLuwF1LlgcZJy8Zkrlpo
 D58h/ZV6K2/tI7INWLINlqGnxaL2B/Ld4zXsFH+t05XGh+JOiq8ueLi5tdfEPG9f
 /pzuGia0Cv6WBv+jOHLCBe2kfgvBx+Y8Goi0tqL0hwKCGjpZlQkhRccrjbVSAPhW
 2SfxOD1pulTwI1J75csYXjTc/heJvAv/ZpZSz3wldM3fyiwnmgfWKlMYqG6Xb9+T
 2OHwEUJHJQnon/f25+yb9dWI7HYMw2fEIqu3CgbRyOviObcB9MM1uKVErkCYAUWY
 W7Q8ShjNPrUguCPbw4YFPIwaazuhRbR8t2kRvfBOyTYwh3jo6U3eRL72Cov84uik
 hnFtUdiusWtvV59ngZelREmd3iVKif2hxx7EoGDY/VV2Ru4C2X/xgJemKJeKSR/f
 gm6pp8wbPSC4TBJOfP6IwYtoZKyu03miIeupPPUDxx0hLbx5j2e6EgVM5NVAeJFF
 fu4MEkGvStZc+w==
 =GK27
 -----END PGP SIGNATURE-----

Merge tag 'x86-urgent-2022-04-03' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 fixes from Thomas Gleixner:
 "A set of x86 fixes and updates:

   - Make the prctl() for enabling dynamic XSTATE components correct so
     it adds the newly requested feature to the permission bitmap
     instead of overwriting it. Add a selftest which validates that.

   - Unroll string MMIO for encrypted SEV guests as the hypervisor
     cannot emulate it.

   - Handle supervisor states correctly in the FPU/XSTATE code so it
     takes the feature set of the fpstate buffer into account. The
     feature sets can differ between host and guest buffers. Guest
     buffers do not contain supervisor states. So far this was not an
     issue, but with enabling PASID it needs to be handled in the buffer
     offset calculation and in the permission bitmaps.

   - Avoid a gazillion of repeated CPUID invocations in by caching the
     values early in the FPU/XSTATE code.

   - Enable CONFIG_WERROR in x86 defconfig.

   - Make the X86 defconfigs more useful by adapting them to Y2022
     reality"

* tag 'x86-urgent-2022-04-03' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/fpu/xstate: Consolidate size calculations
  x86/fpu/xstate: Handle supervisor states in XSTATE permissions
  x86/fpu/xsave: Handle compacted offsets correctly with supervisor states
  x86/fpu: Cache xfeature flags from CPUID
  x86/fpu/xsave: Initialize offset/size cache early
  x86/fpu: Remove unused supervisor only offsets
  x86/fpu: Remove redundant XCOMP_BV initialization
  x86/sev: Unroll string mmio with CC_ATTR_GUEST_UNROLL_STRING_IO
  x86/config: Make the x86 defconfigs a bit more usable
  x86/defconfig: Enable WERROR
  selftests/x86/amx: Update the ARCH_REQ_XCOMP_PERM test
  x86/fpu/xstate: Fix the ARCH_REQ_XCOMP_PERM implementation
2022-04-03 12:15:47 -07:00
Linus Torvalds
e235f4192f Revert the RT related signal changes. They need to be reworked and
generalized.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmJJV1gTHHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYof8tD/0Xs4qpxlR81PgZSJ3QJ9vok5tKpe3j
 O+ZLvQtyc2dnkduSOpJXiKe5YxDZ39Ihb7Fb9ETSUFS0ohJFDYiR6bKVXqKBjp6g
 Z0u57B3j/ZrZt9W3oK2BxlKBgen3MTYmybPQja+oTZfuu+Vd+DKD6NEyGcOZe53G
 +ZzEnBevar+f+/ble4PmJrnu5fP63jlUDPlY6h7HnsS2+MYTlx8JOMyhc4v4KxpR
 od4/9NUMbcpV4q2hReC5D22TArhr/7woNaCFswnOuk+mb9d8sPvqv9U8iHC/YoTM
 IeX3Bt1qHRT++Sjkkup2/k0xAy50H/7wMbQP+Jb993rWlLiWSd2WY0OHZ+gWSfgG
 oM6a2yAZ029klyMBvV0AdiAYpvhlDs36UZBLyIIa8M4zRgH9h+//F9UZ5qnt+0kp
 ACTd/B+bksbvO4A1npxZ1fUWPw6L5a8730GIy/csvAsoRlOaITfCFVA98ob+36TF
 JUdyuzRAOrbt3H7pRUB+xz0pxxPkceoBBwrBTcSw1cyIyV3b8CaFT2oRWY3nt+er
 THWuiXY4Jy2wtNcHMhKIZKBCtUZ7sDUBhcnplxL+qoRJ0V340B2Kh1J8/0mnjDD+
 Aks4E7Q3ogpyuMXAKDEGebyTPcRe0bQXyyjJVR9cuPn5i8AM9/rv5Iqem4Ed1hLK
 dQeXuWx6zLcGrw==
 =mJKF
 -----END PGP SIGNATURE-----

Merge tag 'core-urgent-2022-04-03' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull RT signal fix from Thomas Gleixner:
 "Revert the RT related signal changes. They need to be reworked and
  generalized"

* tag 'core-urgent-2022-04-03' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  Revert "signal, x86: Delay calling signals in atomic on RT enabled kernels"
2022-04-03 12:08:26 -07:00
Linus Torvalds
38904911e8 * Only do MSR filtering for MSRs accessed by rdmsr/wrmsr
* Documentation improvements
 
 * Prevent module exit until all VMs are freed
 
 * PMU Virtualization fixes
 
 * Fix for kvm_irq_delivery_to_apic_fast() NULL-pointer dereferences
 
 * Other miscellaneous bugfixes
 -----BEGIN PGP SIGNATURE-----
 
 iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmJIGV8UHHBib256aW5p
 QHJlZGhhdC5jb20ACgkQv/vSX3jHroO5FQgAhls4+Nu+NqId/yvvyNxr3vXq0dHI
 hLlHtvzgGzZisZ7y2bNeyIpJVBDT5LCbrptPD/5eTvchVswDh0+kCVC0Uni5ugGT
 tLT/Pv9Oq9e0X7aGdHRyuHIivIFDC20zIZO2DV48Lrj/+r6DafB2Fghq2XQLlBxN
 p8KislvuqAAos543BPC1+Lk3dhOLuZ8qcFD8wGRlcCwjNwYaitrQ16rO04cLfUur
 OwIks1I6TdI2JpLBhm6oWYVG/YnRsoo4bQE8cjdQ6yNSbwWtRpV33q7X6onw8x8K
 BEeESoTnMqfaxIF/6mPl6bnDblVHFp6Xhld/vJcgeWQTdajFtuFE/K4sCA==
 =xnQ6
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull kvm fixes from Paolo Bonzini:

 - Only do MSR filtering for MSRs accessed by rdmsr/wrmsr

 - Documentation improvements

 - Prevent module exit until all VMs are freed

 - PMU Virtualization fixes

 - Fix for kvm_irq_delivery_to_apic_fast() NULL-pointer dereferences

 - Other miscellaneous bugfixes

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (42 commits)
  KVM: x86: fix sending PV IPI
  KVM: x86/mmu: do compare-and-exchange of gPTE via the user address
  KVM: x86: Remove redundant vm_entry_controls_clearbit() call
  KVM: x86: cleanup enter_rmode()
  KVM: x86: SVM: fix tsc scaling when the host doesn't support it
  kvm: x86: SVM: remove unused defines
  KVM: x86: SVM: move tsc ratio definitions to svm.h
  KVM: x86: SVM: fix avic spec based definitions again
  KVM: MIPS: remove reference to trap&emulate virtualization
  KVM: x86: document limitations of MSR filtering
  KVM: x86: Only do MSR filtering when access MSR by rdmsr/wrmsr
  KVM: x86/emulator: Emulate RDPID only if it is enabled in guest
  KVM: x86/pmu: Fix and isolate TSX-specific performance event logic
  KVM: x86: mmu: trace kvm_mmu_set_spte after the new SPTE was set
  KVM: x86/svm: Clear reserved bits written to PerfEvtSeln MSRs
  KVM: x86: Trace all APICv inhibit changes and capture overall status
  KVM: x86: Add wrappers for setting/clearing APICv inhibits
  KVM: x86: Make APICv inhibit reasons an enum and cleanup naming
  KVM: X86: Handle implicit supervisor access with SMAP
  KVM: X86: Rename variable smap to not_smap in permission_fault()
  ...
2022-04-02 12:09:02 -07:00
Li RongQing
c15e0ae42c KVM: x86: fix sending PV IPI
If apic_id is less than min, and (max - apic_id) is greater than
KVM_IPI_CLUSTER_SIZE, then the third check condition is satisfied but
the new apic_id does not fit the bitmask.  In this case __send_ipi_mask
should send the IPI.

This is mostly theoretical, but it can happen if the apic_ids on three
iterations of the loop are for example 1, KVM_IPI_CLUSTER_SIZE, 0.

Fixes: aaffcfd1e8 ("KVM: X86: Implement PV IPIs in linux guest")
Signed-off-by: Li RongQing <lirongqing@baidu.com>
Message-Id: <1646814944-51801-1-git-send-email-lirongqing@baidu.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-02 05:37:27 -04:00
Paolo Bonzini
2a8859f373 KVM: x86/mmu: do compare-and-exchange of gPTE via the user address
FNAME(cmpxchg_gpte) is an inefficient mess.  It is at least decent if it
can go through get_user_pages_fast(), but if it cannot then it tries to
use memremap(); that is not just terribly slow, it is also wrong because
it assumes that the VM_PFNMAP VMA is contiguous.

The right way to do it would be to do the same thing as
hva_to_pfn_remapped() does since commit add6a0cd1c ("KVM: MMU: try to
fix up page faults before giving up", 2016-07-05), using follow_pte()
and fixup_user_fault() to determine the correct address to use for
memremap().  To do this, one could for example extract hva_to_pfn()
for use outside virt/kvm/kvm_main.c.  But really there is no reason to
do that either, because there is already a perfectly valid address to
do the cmpxchg() on, only it is a userspace address.  That means doing
user_access_begin()/user_access_end() and writing the code in assembly
to handle exceptions correctly.  Worse, the guest PTE can be 8-byte
even on i686 so there is the extra complication of using cmpxchg8b to
account for.  But at least it is an efficient mess.

(Thanks to Linus for suggesting improvement on the inline assembly).

Reported-by: Qiuhao Li <qiuhao@sysec.org>
Reported-by: Gaoning Pan <pgn@zju.edu.cn>
Reported-by: Yongkang Jia <kangel@zju.edu.cn>
Reported-by: syzbot+6cde2282daa792c49ab8@syzkaller.appspotmail.com
Debugged-by: Tadeusz Struk <tadeusz.struk@linaro.org>
Tested-by: Maxim Levitsky <mlevitsk@redhat.com>
Cc: stable@vger.kernel.org
Fixes: bd53cb35a3 ("X86/KVM: Handle PFNs outside of kernel reach when touching GPTEs")
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-02 05:37:27 -04:00
Zhenzhong Duan
4335edbbc1 KVM: x86: Remove redundant vm_entry_controls_clearbit() call
When emulating exit from long mode, EFER_LMA is cleared with
vmx_set_efer().  This will already unset the VM_ENTRY_IA32E_MODE control
bit as requested by SDM, so there is no need to unset VM_ENTRY_IA32E_MODE
again in exit_lmode() explicitly.  In case EFER isn't supported by
hardware, long mode isn't supported, so exit_lmode() cannot be reached.

Note that, thanks to the shadow controls mechanism, this change doesn't
eliminate vmread or vmwrite.

Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
Message-Id: <20220311102643.807507-3-zhenzhong.duan@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-02 05:37:26 -04:00
Zhenzhong Duan
b76edfe91a KVM: x86: cleanup enter_rmode()
vmx_set_efer() sets uret->data but, in fact if the value of uret->data
will be used vmx_setup_uret_msrs() will have rewritten it with the value
returned by update_transition_efer().  uret->data is consumed if and only
if uret->load_into_hardware is true, and vmx_setup_uret_msrs() takes care
of (a) updating uret->data before setting uret->load_into_hardware to true
(b) setting uret->load_into_hardware to false if uret->data isn't updated.

Opportunistically use "vmx" directly instead of redoing to_vmx().

Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
Message-Id: <20220311102643.807507-2-zhenzhong.duan@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-02 05:37:26 -04:00
Maxim Levitsky
8809931383 KVM: x86: SVM: fix tsc scaling when the host doesn't support it
It was decided that when TSC scaling is not supported,
the virtual MSR_AMD64_TSC_RATIO should still have the default '1.0'
value.

However in this case kvm_max_tsc_scaling_ratio is not set,
which breaks various assumptions.

Fix this by always calculating kvm_max_tsc_scaling_ratio regardless of
host support.  For consistency, do the same for VMX.

Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20220322172449.235575-8-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-02 05:37:26 -04:00
Maxim Levitsky
f37b735e31 kvm: x86: SVM: remove unused defines
Remove some unused #defines from svm.c

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20220322172449.235575-7-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-02 05:37:25 -04:00
Maxim Levitsky
bb2aa78e9a KVM: x86: SVM: move tsc ratio definitions to svm.h
Another piece of SVM spec which should be in the header file

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20220322172449.235575-6-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-02 05:37:25 -04:00
Maxim Levitsky
0dacc3df89 KVM: x86: SVM: fix avic spec based definitions again
Due to wrong rebase, commit
4a204f7895 ("KVM: SVM: Allow AVIC support on system w/ physical APIC ID > 255")

moved avic spec #defines back to avic.c.

Move them back, and while at it extend AVIC_DOORBELL_PHYSICAL_ID_MASK to 12
bits as well (it will be used in nested avic)

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20220322172449.235575-5-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-02 05:37:24 -04:00
Hou Wenlong
ac8d6cad3c KVM: x86: Only do MSR filtering when access MSR by rdmsr/wrmsr
If MSR access is rejected by MSR filtering,
kvm_set_msr()/kvm_get_msr() would return KVM_MSR_RET_FILTERED,
and the return value is only handled well for rdmsr/wrmsr.
However, some instruction emulation and state transition also
use kvm_set_msr()/kvm_get_msr() to do msr access but may trigger
some unexpected results if MSR access is rejected, E.g. RDPID
emulation would inject a #UD but RDPID wouldn't cause a exit
when RDPID is supported in hardware and ENABLE_RDTSCP is set.
And it would also cause failure when load MSR at nested entry/exit.
Since msr filtering is based on MSR bitmap, it is better to only
do MSR filtering for rdmsr/wrmsr.

Signed-off-by: Hou Wenlong <houwenlong.hwl@antgroup.com>
Message-Id: <2b2774154f7532c96a6f04d71c82a8bec7d9e80b.1646655860.git.houwenlong.hwl@antgroup.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-02 05:34:47 -04:00
Hou Wenlong
a836839cbf KVM: x86/emulator: Emulate RDPID only if it is enabled in guest
When RDTSCP is supported but RDPID is not supported in host,
RDPID emulation is available. However, __kvm_get_msr() would
only fail when RDTSCP/RDPID both are disabled in guest, so
the emulator wouldn't inject a #UD when RDPID is disabled but
RDTSCP is enabled in guest.

Fixes: fb6d4d340e ("KVM: x86: emulate RDPID")
Signed-off-by: Hou Wenlong <houwenlong.hwl@antgroup.com>
Message-Id: <1dfd46ae5b76d3ed87bde3154d51c64ea64c99c1.1646226788.git.houwenlong.hwl@antgroup.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-02 05:34:46 -04:00
Like Xu
e644896f51 KVM: x86/pmu: Fix and isolate TSX-specific performance event logic
HSW_IN_TX* bits are used in generic code which are not supported on
AMD. Worse, these bits overlap with AMD EventSelect[11:8] and hence
using HSW_IN_TX* bits unconditionally in generic code is resulting in
unintentional pmu behavior on AMD. For example, if EventSelect[11:8]
is 0x2, pmc_reprogram_counter() wrongly assumes that
HSW_IN_TX_CHECKPOINTED is set and thus forces sampling period to be 0.

Also per the SDM, both bits 32 and 33 "may only be set if the processor
supports HLE or RTM" and for "IN_TXCP (bit 33): this bit may only be set
for IA32_PERFEVTSEL2."

Opportunistically eliminate code redundancy, because if the HSW_IN_TX*
bit is set in pmc->eventsel, it is already set in attr.config.

Reported-by: Ravi Bangoria <ravi.bangoria@amd.com>
Reported-by: Jim Mattson <jmattson@google.com>
Fixes: 103af0a987 ("perf, kvm: Support the in_tx/in_tx_cp modifiers in KVM arch perfmon emulation v5")
Co-developed-by: Ravi Bangoria <ravi.bangoria@amd.com>
Signed-off-by: Ravi Bangoria <ravi.bangoria@amd.com>
Signed-off-by: Like Xu <likexu@tencent.com>
Message-Id: <20220309084257.88931-1-likexu@tencent.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-02 05:34:46 -04:00
Maxim Levitsky
5959ff4ae9 KVM: x86: mmu: trace kvm_mmu_set_spte after the new SPTE was set
It makes more sense to print new SPTE value than the
old value.

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220302102457.588450-1-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-02 05:34:45 -04:00
Jim Mattson
9b026073db KVM: x86/svm: Clear reserved bits written to PerfEvtSeln MSRs
AMD EPYC CPUs never raise a #GP for a WRMSR to a PerfEvtSeln MSR. Some
reserved bits are cleared, and some are not. Specifically, on
Zen3/Milan, bits 19 and 42 are not cleared.

When emulating such a WRMSR, KVM should not synthesize a #GP,
regardless of which bits are set. However, undocumented bits should
not be passed through to the hardware MSR. So, rather than checking
for reserved bits and synthesizing a #GP, just clear the reserved
bits.

This may seem pedantic, but since KVM currently does not support the
"Host/Guest Only" bits (41:40), it is necessary to clear these bits
rather than synthesizing #GP, because some popular guests (e.g Linux)
will set the "Host Only" bit even on CPUs that don't support
EFER.SVME, and they don't expect a #GP.

For example,

root@Ubuntu1804:~# perf stat -e r26 -a sleep 1

 Performance counter stats for 'system wide':

                 0      r26

       1.001070977 seconds time elapsed

Feb 23 03:59:58 Ubuntu1804 kernel: [  405.379957] unchecked MSR access error: WRMSR to 0xc0010200 (tried to write 0x0000020000130026) at rIP: 0xffffffff9b276a28 (native_write_msr+0x8/0x30)
Feb 23 03:59:58 Ubuntu1804 kernel: [  405.379958] Call Trace:
Feb 23 03:59:58 Ubuntu1804 kernel: [  405.379963]  amd_pmu_disable_event+0x27/0x90

Fixes: ca724305a2 ("KVM: x86/vPMU: Implement AMD vPMU code for KVM")
Reported-by: Lotus Fenn <lotusf@google.com>
Signed-off-by: Jim Mattson <jmattson@google.com>
Reviewed-by: Like Xu <likexu@tencent.com>
Reviewed-by: David Dunn <daviddunn@google.com>
Message-Id: <20220226234131.2167175-1-jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-02 05:34:45 -04:00
Sean Christopherson
4f4c4a3ee5 KVM: x86: Trace all APICv inhibit changes and capture overall status
Trace all APICv inhibit changes instead of just those that result in
APICv being (un)inhibited, and log the current state.  Debugging why
APICv isn't working is frustrating as it's hard to see why APICv is still
inhibited, and logging only the first inhibition means unnecessary onion
peeling.

Opportunistically drop the export of the tracepoint, it is not and should
not be used by vendor code due to the need to serialize toggling via
apicv_update_lock.

Note, using the common flow means kvm_apicv_init() switched from atomic
to non-atomic bitwise operations.  The VM is unreachable at init, so
non-atomic is perfectly ok.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220311043517.17027-4-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-02 05:34:45 -04:00
Sean Christopherson
320af55a93 KVM: x86: Add wrappers for setting/clearing APICv inhibits
Add set/clear wrappers for toggling APICv inhibits to make the call sites
more readable, and opportunistically rename the inner helpers to align
with the new wrappers and to make them more readable as well.  Invert the
flag from "activate" to "set"; activate is painfully ambiguous as it's
not obvious if the inhibit is being activated, or if APICv is being
activated, in which case the inhibit is being deactivated.

For the functions that take @set, swap the order of the inhibit reason
and @set so that the call sites are visually similar to those that bounce
through the wrapper.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220311043517.17027-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-02 05:34:44 -04:00
Sean Christopherson
7491b7b2e1 KVM: x86: Make APICv inhibit reasons an enum and cleanup naming
Use an enum for the APICv inhibit reasons, there is no meaning behind
their values and they most definitely are not "unsigned longs".  Rename
the various params to "reason" for consistency and clarity (inhibit may
be confused as a command, i.e. inhibit APICv, instead of the reason that
is getting toggled/checked).

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220311043517.17027-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-02 05:34:44 -04:00
Lai Jiangshan
4f4aa80e3b KVM: X86: Handle implicit supervisor access with SMAP
There are two kinds of implicit supervisor access
	implicit supervisor access when CPL = 3
	implicit supervisor access when CPL < 3

Current permission_fault() handles only the first kind for SMAP.

But if the access is implicit when SMAP is on, data may not be read
nor write from any user-mode address regardless the current CPL.

So the second kind should be also supported.

The first kind can be detect via CPL and access mode: if it is
supervisor access and CPL = 3, it must be implicit supervisor access.

But it is not possible to detect the second kind without extra
information, so this patch adds an artificial PFERR_EXPLICIT_ACCESS
into @access. This extra information also works for the first kind, so
the logic is changed to use this information for both cases.

The value of PFERR_EXPLICIT_ACCESS is deliberately chosen to be bit 48
which is in the most significant 16 bits of u64 and less likely to be
forced to change due to future hardware uses it.

This patch removes the call to ->get_cpl() for access mode is determined
by @access.  Not only does it reduce a function call, but also remove
confusions when the permission is checked for nested TDP.  The nested
TDP shouldn't have SMAP checking nor even the L2's CPL have any bearing
on it.  The original code works just because it is always user walk for
NPT and SMAP fault is not set for EPT in update_permission_bitmask.

Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Message-Id: <20220311070346.45023-5-jiangshanlai@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-02 05:34:43 -04:00
Lai Jiangshan
8873c1434f KVM: X86: Rename variable smap to not_smap in permission_fault()
Comments above the variable says the bit is set when SMAP is overridden
or the same meaning in update_permission_bitmask(): it is not subjected
to SMAP restriction.

Renaming it to reflect the negative implication and make the code better
readability.

Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Message-Id: <20220311070346.45023-4-jiangshanlai@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-02 05:34:43 -04:00
Lai Jiangshan
94b4a2f174 KVM: X86: Fix comments in update_permission_bitmask
The commit 09f037aa48 ("KVM: MMU: speedup update_permission_bitmask")
refactored the code of update_permission_bitmask() and change the
comments.  It added a condition into a list to match the new code,
so the number/order for conditions in the comments should be updated
too.

Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Message-Id: <20220311070346.45023-3-jiangshanlai@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-02 05:34:42 -04:00
Lai Jiangshan
5b22bbe717 KVM: X86: Change the type of access u32 to u64
Change the type of access u32 to u64 for FNAME(walk_addr) and
->gva_to_gpa().

The kinds of accesses are usually combinations of UWX, and VMX/SVM's
nested paging adds a new factor of access: is it an access for a guest
page table or for a final guest physical address.

And SMAP relies a factor for supervisor access: explicit or implicit.

So @access in FNAME(walk_addr) and ->gva_to_gpa() is better to include
all these information to do the walk.

Although @access(u32) has enough bits to encode all the kinds, this
patch extends it to u64:
	o Extra bits will be in the higher 32 bits, so that we can
	  easily obtain the traditional access mode (UWX) by converting
	  it to u32.
	o Reuse the value for the access kind defined by SVM's nested
	  paging (PFERR_GUEST_FINAL_MASK and PFERR_GUEST_PAGE_MASK) as
	  @error_code in kvm_handle_page_fault().

Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Message-Id: <20220311070346.45023-2-jiangshanlai@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-02 05:34:42 -04:00
David Woodhouse
cf1d88b36b KVM: Remove dirty handling from gfn_to_pfn_cache completely
It isn't OK to cache the dirty status of a page in internal structures
for an indefinite period of time.

Any time a vCPU exits the run loop to userspace might be its last; the
VMM might do its final check of the dirty log, flush the last remaining
dirty pages to the destination and complete a live migration. If we
have internal 'dirty' state which doesn't get flushed until the vCPU
is finally destroyed on the source after migration is complete, then
we have lost data because that will escape the final copy.

This problem already exists with the use of kvm_vcpu_unmap() to mark
pages dirty in e.g. VMX nesting.

Note that the actual Linux MM already considers the page to be dirty
since we have a writeable mapping of it. This is just about the KVM
dirty logging.

For the nesting-style use cases (KVM_GUEST_USES_PFN) we will need to
track which gfn_to_pfn_caches have been used and explicitly mark the
corresponding pages dirty before returning to userspace. But we would
have needed external tracking of that anyway, rather than walking the
full list of GPCs to find those belonging to this vCPU which are dirty.

So let's rely *solely* on that external tracking, and keep it simple
rather than laying a tempting trap for callers to fall into.

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20220303154127.202856-3-dwmw2@infradead.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-02 05:34:41 -04:00
Sean Christopherson
d0d96121d0 KVM: Use enum to track if cached PFN will be used in guest and/or host
Replace the guest_uses_pa and kernel_map booleans in the PFN cache code
with a unified enum/bitmask. Using explicit names makes it easier to
review and audit call sites.

Opportunistically add a WARN to prevent passing garbage; instantating a
cache without declaring its usage is either buggy or pointless.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20220303154127.202856-2-dwmw2@infradead.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-02 05:34:41 -04:00
Peter Gonda
4a9e7b9ea2 KVM: SVM: Fix kvm_cache_regs.h inclusions for is_guest_mode()
Include kvm_cache_regs.h to pick up the definition of is_guest_mode(),
which is referenced by nested_svm_virtualize_tpr() in svm.h. Remove
include from svm_onhpyerv.c which was done only because of lack of
include in svm.h.

Fixes: 883b0a91f4 ("KVM: SVM: Move Nested SVM Implementation to nested.c")
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Sean Christopherson <seanjc@google.com>
Cc: kvm@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Peter Gonda <pgonda@google.com>
Message-Id: <20220304161032.2270688-1-pgonda@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-02 05:34:40 -04:00
Jim Mattson
95b065bf5c KVM: x86/pmu: Use different raw event masks for AMD and Intel
The third nybble of AMD's event select overlaps with Intel's IN_TX and
IN_TXCP bits. Therefore, we can't use AMD64_RAW_EVENT_MASK on Intel
platforms that support TSX.

Declare a raw_event_mask in the kvm_pmu structure, initialize it in
the vendor-specific pmu_refresh() functions, and use that mask for
PERF_TYPE_RAW configurations in reprogram_gp_counter().

Fixes: 710c476514 ("KVM: x86/pmu: Use AMD64_RAW_EVENT_MASK for PERF_TYPE_RAW")
Signed-off-by: Jim Mattson <jmattson@google.com>
Message-Id: <20220308012452.3468611-1-jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-02 05:34:40 -04:00
Sean Christopherson
f47e5bbbc9 KVM: x86/mmu: Zap only TDP MMU leafs in zap range and mmu_notifier unmap
Re-introduce zapping only leaf SPTEs in kvm_zap_gfn_range() and
kvm_tdp_mmu_unmap_gfn_range(), this time without losing a pending TLB
flush when processing multiple roots (including nested TDP shadow roots).
Dropping the TLB flush resulted in random crashes when running Hyper-V
Server 2019 in a guest with KSM enabled in the host (or any source of
mmu_notifier invalidations, KSM is just the easiest to force).

This effectively revert commits 873dd12217
and fcb93eb6d0, and thus restores commit
cf3e26427c, plus this delta on top:

bool kvm_tdp_mmu_zap_leafs(struct kvm *kvm, int as_id, gfn_t start, gfn_t end,
        struct kvm_mmu_page *root;

        for_each_tdp_mmu_root_yield_safe(kvm, root, as_id)
-               flush = tdp_mmu_zap_leafs(kvm, root, start, end, can_yield, false);
+               flush = tdp_mmu_zap_leafs(kvm, root, start, end, can_yield, flush);

        return flush;
 }

Cc: Ben Gardon <bgardon@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Tested-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20220325230348.2587437-1-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-02 05:34:39 -04:00
Yi Wang
a80ced6ea5 KVM: SVM: fix panic on out-of-bounds guest IRQ
As guest_irq is coming from KVM_IRQFD API call, it may trigger
crash in svm_update_pi_irte() due to out-of-bounds:

crash> bt
PID: 22218  TASK: ffff951a6ad74980  CPU: 73  COMMAND: "vcpu8"
 #0 [ffffb1ba6707fa40] machine_kexec at ffffffff8565b397
 #1 [ffffb1ba6707fa90] __crash_kexec at ffffffff85788a6d
 #2 [ffffb1ba6707fb58] crash_kexec at ffffffff8578995d
 #3 [ffffb1ba6707fb70] oops_end at ffffffff85623c0d
 #4 [ffffb1ba6707fb90] no_context at ffffffff856692c9
 #5 [ffffb1ba6707fbf8] exc_page_fault at ffffffff85f95b51
 #6 [ffffb1ba6707fc50] asm_exc_page_fault at ffffffff86000ace
    [exception RIP: svm_update_pi_irte+227]
    RIP: ffffffffc0761b53  RSP: ffffb1ba6707fd08  RFLAGS: 00010086
    RAX: ffffb1ba6707fd78  RBX: ffffb1ba66d91000  RCX: 0000000000000001
    RDX: 00003c803f63f1c0  RSI: 000000000000019a  RDI: ffffb1ba66db2ab8
    RBP: 000000000000019a   R8: 0000000000000040   R9: ffff94ca41b82200
    R10: ffffffffffffffcf  R11: 0000000000000001  R12: 0000000000000001
    R13: 0000000000000001  R14: ffffffffffffffcf  R15: 000000000000005f
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
 #7 [ffffb1ba6707fdb8] kvm_irq_routing_update at ffffffffc09f19a1 [kvm]
 #8 [ffffb1ba6707fde0] kvm_set_irq_routing at ffffffffc09f2133 [kvm]
 #9 [ffffb1ba6707fe18] kvm_vm_ioctl at ffffffffc09ef544 [kvm]
    RIP: 00007f143c36488b  RSP: 00007f143a4e04b8  RFLAGS: 00000246
    RAX: ffffffffffffffda  RBX: 00007f05780041d0  RCX: 00007f143c36488b
    RDX: 00007f05780041d0  RSI: 000000004008ae6a  RDI: 0000000000000020
    RBP: 00000000000004e8   R8: 0000000000000008   R9: 00007f05780041e0
    R10: 00007f0578004560  R11: 0000000000000246  R12: 00000000000004e0
    R13: 000000000000001a  R14: 00007f1424001c60  R15: 00007f0578003bc0
    ORIG_RAX: 0000000000000010  CS: 0033  SS: 002b

Vmx have been fix this in commit 3a8b0677fc (KVM: VMX: Do not BUG() on
out-of-bounds guest IRQ), so we can just copy source from that to fix
this.

Co-developed-by: Yi Liu <liu.yi24@zte.com.cn>
Signed-off-by: Yi Liu <liu.yi24@zte.com.cn>
Signed-off-by: Yi Wang <wang.yi59@zte.com.cn>
Message-Id: <20220309113025.44469-1-wang.yi59@zte.com.cn>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-02 05:34:38 -04:00
Paolo Bonzini
a1a39128fa KVM: MMU: propagate alloc_workqueue failure
If kvm->arch.tdp_mmu_zap_wq cannot be created, the failure has
to be propagated up to kvm_mmu_init_vm and kvm_arch_init_vm.
kvm_arch_init_vm also has to undo all the initialization, so
group all the MMU initialization code at the beginning and
handle cleaning up of kvm_page_track_init.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-02 05:34:38 -04:00
Linus Torvalds
88e6c02076 Merge branch 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs updates from Al Viro:
 "Assorted bits and pieces"

* 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  aio: drop needless assignment in aio_read()
  clean overflow checks in count_mounts() a bit
  seq_file: fix NULL pointer arithmetic warning
  uml/x86: use x86 load_unaligned_zeropad()
  asm/user.h: killed unused macros
  constify struct path argument of finish_automount()/do_add_mount()
  fs: Remove FIXME comment in generic_write_checks()
2022-04-01 19:57:03 -07:00