linux/arch
HATAYAMA Daisuke 6b3f0da3d2 perf/x86/intel: ignore CondChgd bit to avoid false NMI handling
commit b292d7a104 upstream.

Currently, any NMI is falsely handled by the NMI watchdog's NMI handler
if the CondChgd bit in the MSR_CORE_PERF_GLOBAL_STATUS MSR is set.

For example, we use an external NMI to make the system panic and get a
crash dump, but in this case the external NMI is falsely handled due to
this issue.

This commit deals with the issue simply by ignoring CondChgd bit.

Here is a detailed explanation.

On x86, the NMI watchdog uses the performance monitoring feature to
signal an NMI each time a performance counter overflows.

intel_pmu_handle_irq() is called as an NMI_LOCAL handler from the NMI
watchdog's NMI handler, perf_event_nmi_handler(). It identifies the
owner of a given NMI by looking at the overflow status bits in the
MSR_CORE_PERF_GLOBAL_STATUS MSR. If any of the bits are set, it
handles the given NMI as its own.

The problem is that intel_pmu_handle_irq() doesn't distinguish the
CondChgd bit from the other bits. Unlike the other status bits, the
CondChgd bit doesn't represent overflow status for a performance
counter. Thus, the CondChgd bit cannot be treated as a mark indicating
that a given NMI belongs to the NMI watchdog.

As a result, if the CondChgd bit is set, any NMI is falsely handled by
the NMI watchdog's NMI handler. Also, if the type of the falsely handled
NMI is NMI_UNKNOWN, NMI_SERR or NMI_IO_CHECK, the corresponding
action is never performed until the CondChgd bit is cleared.
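
The ownership decision described above can be illustrated with a small
userspace sketch. It is not the kernel change itself. The macro and
function names are hypothetical, and the only facts taken from this
message are that CondChgd is bit 63 of the status value and that the
remaining status bits report counter overflow.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical name; bit 63 is CondChgd per the SDM table quoted in this message. */
#define GLOBAL_STATUS_COND_CHGD (1ULL << 63)

/* Problematic behaviour: any set bit, including CondChgd, claims the NMI. */
bool claims_nmi_before(uint64_t status)
{
    return status != 0;
}

/* Behaviour with CondChgd ignored: only the remaining bits can claim the NMI. */
bool claims_nmi_after(uint64_t status)
{
    status &= ~GLOBAL_STATUS_COND_CHGD;
    return status != 0;
}

/* Compares both decisions for a status value with only CondChgd set. */
void cond_chgd_demo(void)
{
    uint64_t only_cond_chgd = GLOBAL_STATUS_COND_CHGD;
    uint64_t cond_chgd_and_overflow = GLOBAL_STATUS_COND_CHGD | 1ULL;

    /* An unrelated NMI arriving while only CondChgd is set. */
    assert(claims_nmi_before(only_cond_chgd));   /* falsely claimed */
    assert(!claims_nmi_after(only_cond_chgd));   /* left for other handlers */

    /* A genuine counter overflow is still claimed when CondChgd is also set. */
    assert(claims_nmi_after(cond_chgd_and_overflow));
}
```

With the bit masked out, an NMI that arrives while only CondChgd is set
falls through to the other handlers, so the NMI_UNKNOWN, NMI_SERR and
NMI_IO_CHECK actions can run.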

I noticed this behavior on systems with Ivy Bridge processors: Intel
Xeon CPU E5-2630 v2 and Intel Xeon CPU E7-8890 v2. On both systems,
the CondChgd bit in the MSR_CORE_PERF_GLOBAL_STATUS MSR is already set
early at boot. The CondChgd bit is then immediately cleared by the
next wrmsr to the MSR_CORE_PERF_GLOBAL_CTRL MSR and appears to remain
0.

On the other hand, on older processors such as Nehalem (Xeon E7540),
the CondChgd bit is not set early at boot.

I'm not sure about the exact behavior of the CondChgd bit, in particular
when this bit gets set. I read the Intel System Programmer's Manual to
figure that out, but the only descriptions I found are:

  In 18.9.1:

  "The MSR_PERF_GLOBAL_STATUS MSR also provides a 'sticky bit' to
   indicate changes to the state of performance monitoring hardware"

  In Table 35-2 IA-32 Architectural MSRs

  63 CondChg: status bits of this register has changed.

These differ from the behaviour I see on the actual systems, as I
explained above.

At the very least, I think ignoring the CondChgd bit should be enough
from the NMI watchdog's perspective.

Signed-off-by: HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com>
Acked-by: Don Zickus <dzickus@redhat.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: linux-kernel@vger.kernel.org
Link: http://lkml.kernel.org/r/20140625.103503.409316067.d.hatayama@jp.fujitsu.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-07-28 08:00:06 -07:00
..
alpha Removal of GENERIC_GPIO for v3.10 2013-05-09 09:59:16 -07:00
arc ARC: !PREEMPT: Ensure Return to kernel mode is IRQ safe 2014-05-13 13:59:46 +02:00
arm ARM: OMAP2+: Fix parser-bug in platform muxing code 2014-07-09 11:14:01 -07:00
arm64 arm64: implement TASK_SIZE_OF 2014-07-17 15:58:02 -07:00
avr32 avr32: Makefile: add '-D__linux__' flag for gcc-4.4.7 use 2014-03-06 21:30:02 -08:00
blackfin blackfin updates for Linux 3.10 2013-05-10 07:21:16 -07:00
c6x arch: c6x: mm: include "asm/uaccess.h" to pass compiling 2013-07-21 18:21:29 -07:00
cris cris: media platform drivers: fix build 2013-11-29 11:11:53 -08:00
frv Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next 2013-05-01 14:08:52 -07:00
h8300 We get rid of the general module prefix confusion with a binary config option, 2013-05-05 10:58:06 -07:00
hexagon Removal of GENERIC_GPIO for v3.10 2013-05-09 09:59:16 -07:00
ia64 exec/ptrace: fix get_dumpable() incorrect tests 2013-11-29 11:11:44 -08:00
m32r Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next 2013-05-01 14:08:52 -07:00
m68k m68k: Skip futex_atomic_cmpxchg_inatomic() test 2014-04-14 06:42:19 -07:00
metag metag: Reduce maximum stack size to 256MB 2014-06-07 13:25:38 -07:00
microblaze microblaze: fix clone syscall 2013-08-20 08:43:02 -07:00
mips MIPS: KVM: Fix memory leak on VCPU 2014-07-06 18:54:15 -07:00
mn10300 mn10300: Use early_param() to parse "mem=" parameter 2013-06-28 16:53:03 +01:00
openrisc Removal of GENERIC_GPIO for v3.10 2013-05-09 09:59:16 -07:00
parisc parisc: add serial ports of C8000/1GHz machine to hardware database 2014-07-17 15:57:59 -07:00
powerpc powerpc/perf: Clear MMCR2 when enabling PMU 2014-07-17 15:58:01 -07:00
s390 s390/lowcore: reserve 96 bytes for IRB in lowcore 2014-06-30 20:09:42 -07:00
score Score: Modify the Makefile of Score, remove -mlong-calls for compiling 2014-07-17 15:58:04 -07:00
sh sh: fix format string bug in stack tracer 2014-05-06 07:55:32 -07:00
sparc net: filter: fix sparc32 typo 2014-06-26 15:12:38 -04:00
tile tile: remove compat_sys_lookup_dcookie declaration to fix compile error 2014-02-13 13:48:00 -08:00
um uml: check length in exitcode_proc_write() 2013-11-13 12:05:33 +09:00
unicore32 arch/unicore32/mm/alignment.c: include "asm/pgtable.h" to avoid compiling error 2014-07-09 11:14:02 -07:00
x86 perf/x86/intel: ignore CondChgd bit to avoid false NMI handling 2014-07-28 08:00:06 -07:00
xtensa xtensa: introduce spill_registers_kernel macro 2014-03-06 21:30:11 -08:00
.gitignore
Kconfig microblaze: fix clone syscall 2013-08-20 08:43:02 -07:00