Commit Graph

28691 Commits

Author SHA1 Message Date
Greg Kroah-Hartman
d885da678e This is the 4.19.34 stable release
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAlynu40ACgkQONu9yGCS
 aT5X6g//Wkfm/+qSZ0GhLDQkPniiH1QkvzhOmVrrxu+KB0qsiwsEl8Srw33ZVkJK
 LT8+IPGiG9jEGu9dj+BYXTIfy9ZvfSsEL2N6GhYwDSXP0fok2rUaHbZvv1IB2g4W
 afhGdNwNAUCJ/j1UrUsi+SAFJ+xWbVxFpGstd0cqM9IbKdEV7RIukvuKckHiKOKR
 qI8FxC+G2PAr+BtnETfk5/suPDJ7B3ZicDoMhiWJGxJ6dfFTVmkSmasSoPDaMiHm
 4S3hN2lu+WTeRpRPPB17Dlk4MmIp0k+bGYBKAlaxAMCc/RZxvbT2pRYaMQbId2/L
 mNUfSnOQFGEAhlAPfb7wdbObphnyT34GhlkWfZBTrnhPO0/FomLOvU6xVdcNuakX
 Tv2JKfDzb+2ttcMZ+0T84Ru9RztoswFATSw8uFMVxW8oTS6MVWnHu96Kxfl7QO3J
 PdlIGcyqxSuWNE8OX1QVtdSruGZfwUDNs94S4nQJtkB8BViRwhGJlqaXuy4d9Wp6
 fGlI2W6qhjyosi2wBSMTjh/ytk/jq0vfs+z2XjR2gAYssvB/SOLR/AlSVguWsDnf
 WaoFBkXvCbuPvPlo0TrLpl5RW5WlOtLXHE3Vr3dKp458wLwpf/OZBGoZiknp7DrF
 PzBZs2ie5tmyqTxbAygl7WkbQPJ682pd5R4nf5CY+zvUaOMZv1g=
 =Iuup
 -----END PGP SIGNATURE-----

Merge 4.19.34 into android-4.19

Changes in 4.19.34
	arm64: debug: Don't propagate UNKNOWN FAR into si_code for debug signals
	ext4: cleanup bh release code in ext4_ind_remove_space()
	tty/serial: atmel: Add is_half_duplex helper
	tty/serial: atmel: RS485 HD w/DMA: enable RX after TX is stopped
	CIFS: fix POSIX lock leak and invalid ptr deref
	h8300: use cc-cross-prefix instead of hardcoding h8300-unknown-linux-
	f2fs: fix to adapt small inline xattr space in __find_inline_xattr()
	f2fs: fix to avoid deadlock in f2fs_read_inline_dir()
	tracing: kdb: Fix ftdump to not sleep
	net/mlx5: Avoid panic when setting vport rate
	net/mlx5: Avoid panic when setting vport mac, getting vport config
	gpio: gpio-omap: fix level interrupt idling
	include/linux/relay.h: fix percpu annotation in struct rchan
	sysctl: handle overflow for file-max
	net: stmmac: Avoid sometimes uninitialized Clang warnings
	enic: fix build warning without CONFIG_CPUMASK_OFFSTACK
	libbpf: force fixdep compilation at the start of the build
	scsi: hisi_sas: Set PHY linkrate when disconnected
	scsi: hisi_sas: Fix a timeout race of driver internal and SMP IO
	iio: adc: fix warning in Qualcomm PM8xxx HK/XOADC driver
	x86/hyperv: Fix kernel panic when kexec on HyperV
	perf c2c: Fix c2c report for empty numa node
	mm/sparse: fix a bad comparison
	mm/cma.c: cma_declare_contiguous: correct err handling
	mm/page_ext.c: fix an imbalance with kmemleak
	mm, swap: bounds check swap_info array accesses to avoid NULL derefs
	mm,oom: don't kill global init via memory.oom.group
	memcg: killed threads should not invoke memcg OOM killer
	mm, mempolicy: fix uninit memory access
	mm/vmalloc.c: fix kernel BUG at mm/vmalloc.c:512!
	mm/slab.c: kmemleak no scan alien caches
	ocfs2: fix a panic problem caused by o2cb_ctl
	f2fs: do not use mutex lock in atomic context
	fs/file.c: initialize init_files.resize_wait
	page_poison: play nicely with KASAN
	cifs: use correct format characters
	dm thin: add sanity checks to thin-pool and external snapshot creation
	f2fs: fix to check inline_xattr_size boundary correctly
	cifs: Accept validate negotiate if server return NT_STATUS_NOT_SUPPORTED
	cifs: Fix NULL pointer dereference of devname
	netfilter: nf_tables: check the result of dereferencing base_chain->stats
	netfilter: conntrack: tcp: only close if RST matches exact sequence
	jbd2: fix invalid descriptor block checksum
	fs: fix guard_bio_eod to check for real EOD errors
	tools lib traceevent: Fix buffer overflow in arg_eval
	PCI/PME: Fix hotplug/sysfs remove deadlock in pcie_pme_remove()
	wil6210: check null pointer in _wil_cfg80211_merge_extra_ies
	mt76: fix a leaked reference by adding a missing of_node_put
	crypto: crypto4xx - add missing of_node_put after of_device_is_available
	crypto: cavium/zip - fix collision with generic cra_driver_name
	usb: chipidea: Grab the (legacy) USB PHY by phandle first
	powerpc/powernv/ioda: Fix locked_vm counting for memory used by IOMMU tables
	scsi: core: replace GFP_ATOMIC with GFP_KERNEL in scsi_scan.c
	kbuild: invoke syncconfig if include/config/auto.conf.cmd is missing
	powerpc/xmon: Fix opcode being uninitialized in print_insn_powerpc
	coresight: etm4x: Add support to enable ETMv4.2
	serial: 8250_pxa: honor the port number from devicetree
	ARM: 8840/1: use a raw_spinlock_t in unwind
	iommu/io-pgtable-arm-v7s: Only kmemleak_ignore L2 tables
	powerpc/hugetlb: Handle mmap_min_addr correctly in get_unmapped_area callback
	btrfs: qgroup: Make qgroup async transaction commit more aggressive
	mmc: omap: fix the maximum timeout setting
	net: dsa: mv88e6xxx: Add lockdep classes to fix false positive splat
	e1000e: Fix -Wformat-truncation warnings
	mlxsw: spectrum: Avoid -Wformat-truncation warnings
	platform/x86: ideapad-laptop: Fix no_hw_rfkill_list for Lenovo RESCUER R720-15IKBN
	platform/mellanox: mlxreg-hotplug: Fix KASAN warning
	loop: set GENHD_FL_NO_PART_SCAN after blkdev_reread_part()
	IB/mlx4: Increase the timeout for CM cache
	clk: fractional-divider: check parent rate only if flag is set
	perf annotate: Fix getting source line failure
	ASoC: qcom: Fix of-node refcount unbalance in qcom_snd_parse_of()
	cpufreq: acpi-cpufreq: Report if CPU doesn't support boost technologies
	efi: cper: Fix possible out-of-bounds access
	s390/ism: ignore some errors during deregistration
	scsi: megaraid_sas: return error when create DMA pool failed
	scsi: fcoe: make use of fip_mode enum complete
	drm/amd/display: Clear stream->mode_changed after commit
	perf test: Fix failure of 'evsel-tp-sched' test on s390
	mwifiex: don't advertise IBSS features without FW support
	perf report: Don't shadow inlined symbol with different addr range
	SoC: imx-sgtl5000: add missing put_device()
	media: ov7740: fix runtime pm initialization
	media: sh_veu: Correct return type for mem2mem buffer helpers
	media: s5p-jpeg: Correct return type for mem2mem buffer helpers
	media: rockchip/rga: Correct return type for mem2mem buffer helpers
	media: s5p-g2d: Correct return type for mem2mem buffer helpers
	media: mx2_emmaprp: Correct return type for mem2mem buffer helpers
	media: mtk-jpeg: Correct return type for mem2mem buffer helpers
	mt76: usb: do not run mt76u_queues_deinit twice
	xen/gntdev: Do not destroy context while dma-bufs are in use
	vfs: fix preadv64v2 and pwritev64v2 compat syscalls with offset == -1
	HID: intel-ish-hid: avoid binding wrong ishtp_cl_device
	cgroup, rstat: Don't flush subtree root unless necessary
	jbd2: fix race when writing superblock
	leds: lp55xx: fix null deref on firmware load failure
	perf report: Add s390 diagnosic sampling descriptor size
	iwlwifi: pcie: fix emergency path
	ACPI / video: Refactor and fix dmi_is_desktop()
	selftests: skip seccomp get_metadata test if not real root
	kprobes: Prohibit probing on bsearch()
	kprobes: Prohibit probing on RCU debug routine
	netfilter: conntrack: fix cloned unconfirmed skb->_nfct race in __nf_conntrack_confirm
	ARM: 8833/1: Ensure that NEON code always compiles with Clang
	ARM: dts: meson8b: fix the Ethernet data line signals in eth_rgmii_pins
	ALSA: PCM: check if ops are defined before suspending PCM
	ath10k: fix shadow register implementation for WCN3990
	usb: f_fs: Avoid crash due to out-of-scope stack ptr access
	sched/topology: Fix percpu data types in struct sd_data & struct s_data
	bcache: fix input overflow to cache set sysfs file io_error_halflife
	bcache: fix input overflow to sequential_cutoff
	bcache: fix potential div-zero error of writeback_rate_i_term_inverse
	bcache: improve sysfs_strtoul_clamp()
	genirq: Avoid summation loops for /proc/stat
	net: marvell: mvpp2: fix stuck in-band SGMII negotiation
	iw_cxgb4: fix srqidx leak during connection abort
	net: phy: consider latched link-down status in polling mode
	fbdev: fbmem: fix memory access if logo is bigger than the screen
	cdrom: Fix race condition in cdrom_sysctl_register
	drm: rcar-du: add missing of_node_put
	drm/amd/display: Don't re-program planes for DPMS changes
	drm/amd/display: Disconnect mpcc when changing tg
	perf/aux: Make perf_event accessible to setup_aux()
	e1000e: fix cyclic resets at link up with active tx
	e1000e: Exclude device from suspend direct complete optimization
	platform/x86: intel_pmc_core: Fix PCH IP sts reading
	i2c: of: Try to find an I2C adapter matching the parent
	staging: spi: mt7621: Add return code check on device_reset()
	iwlwifi: mvm: fix RFH config command with >=10 CPUs
	ASoC: fsl-asoc-card: fix object reference leaks in fsl_asoc_card_probe
	sched/debug: Initialize sd_sysctl_cpus if !CONFIG_CPUMASK_OFFSTACK
	efi/memattr: Don't bail on zero VA if it equals the region's PA
	sched/core: Use READ_ONCE()/WRITE_ONCE() in move_queued_task()/task_rq_lock()
	drm/vkms: Bugfix extra vblank frame
	ARM: dts: lpc32xx: Remove leading 0x and 0s from bindings notation
	efi/arm/arm64: Allow SetVirtualAddressMap() to be omitted
	soc: qcom: gsbi: Fix error handling in gsbi_probe()
	mt7601u: bump supported EEPROM version
	ARM: 8830/1: NOMMU: Toggle only bits in EXC_RETURN we are really care of
	ARM: avoid Cortex-A9 livelock on tight dmb loops
	block, bfq: fix in-service-queue check for queue merging
	bpf: fix missing prototype warnings
	selftests/bpf: skip verifier tests for unsupported program types
	powerpc/64s: Clear on-stack exception marker upon exception return
	cgroup/pids: turn cgroup_subsys->free() into cgroup_subsys->release() to fix the accounting
	backlight: pwm_bl: Use gpiod_get_value_cansleep() to get initial state
	tty: increase the default flip buffer limit to 2*640K
	powerpc/pseries: Perform full re-add of CPU for topology update post-migration
	drm/amd/display: Enable vblank interrupt during CRC capture
	ALSA: dice: add support for Solid State Logic Duende Classic/Mini
	usb: dwc3: gadget: Fix OTG events when gadget driver isn't loaded
	platform/x86: intel-hid: Missing power button release on some Dell models
	perf script python: Use PyBytes for attr in trace-event-python
	perf script python: Add trace_context extension module to sys.modules
	media: mt9m111: set initial frame size other than 0x0
	hwrng: virtio - Avoid repeated init of completion
	soc/tegra: fuse: Fix illegal free of IO base address
	HID: intel-ish: ipc: handle PIMR before ish_wakeup also clear PISR busy_clear bit
	f2fs: UBSAN: set boolean value iostat_enable correctly
	hpet: Fix missing '=' character in the __setup() code of hpet_mmap_enable
	cpu/hotplug: Mute hotplug lockdep during init
	dmaengine: imx-dma: fix warning comparison of distinct pointer types
	dmaengine: qcom_hidma: assign channel cookie correctly
	dmaengine: qcom_hidma: initialize tx flags in hidma_prep_dma_*
	netfilter: physdev: relax br_netfilter dependency
	media: rcar-vin: Allow independent VIN link enablement
	media: s5p-jpeg: Check for fmt_ver_flag when doing fmt enumeration
	regulator: act8865: Fix act8600_sudcdc_voltage_ranges setting
	pinctrl: meson: meson8b: add the eth_rxd2 and eth_rxd3 pins
	drm: Auto-set allow_fb_modifiers when given modifiers at plane init
	drm/nouveau: Stop using drm_crtc_force_disable
	x86/build: Specify elf_i386 linker emulation explicitly for i386 objects
	selinux: do not override context on context mounts
	brcmfmac: Use firmware_request_nowarn for the clm_blob
	wlcore: Fix memory leak in case wl12xx_fetch_firmware failure
	x86/build: Mark per-CPU symbols as absolute explicitly for LLD
	drm/fb-helper: fix leaks in error path of drm_fb_helper_fbdev_setup
	clk: meson: clean-up clock registration
	clk: rockchip: fix frac settings of GPLL clock for rk3328
	dmaengine: tegra: avoid overflow of byte tracking
	Input: soc_button_array - fix mapping of the 5th GPIO in a PNP0C40 device
	drm/dp/mst: Configure no_stop_bit correctly for remote i2c xfers
	net: stmmac: Avoid one more sometimes uninitialized Clang warning
	ACPI / video: Extend chassis-type detection with a "Lunch Box" check
	bcache: fix potential div-zero error of writeback_rate_p_term_inverse
	kprobes/x86: Blacklist non-attachable interrupt functions
	Linux 4.19.34

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
2019-04-05 22:43:09 +02:00
Valentin Schneider
fba4c61e98 cpu/hotplug: Mute hotplug lockdep during init
[ Upstream commit ce48c457b9 ]

Since we've had:

  commit cb538267ea ("jump_label/lockdep: Assert we hold the hotplug lock for _cpuslocked() operations")

we've been getting some lockdep warnings during init, such as on HiKey960:

[    0.820495] WARNING: CPU: 4 PID: 0 at kernel/cpu.c:316 lockdep_assert_cpus_held+0x3c/0x48
[    0.820498] Modules linked in:
[    0.820509] CPU: 4 PID: 0 Comm: swapper/4 Tainted: G S                4.20.0-rc5-00051-g4cae42a #34
[    0.820511] Hardware name: HiKey960 (DT)
[    0.820516] pstate: 600001c5 (nZCv dAIF -PAN -UAO)
[    0.820520] pc : lockdep_assert_cpus_held+0x3c/0x48
[    0.820523] lr : lockdep_assert_cpus_held+0x38/0x48
[    0.820526] sp : ffff00000a9cbe50
[    0.820528] x29: ffff00000a9cbe50 x28: 0000000000000000
[    0.820533] x27: 00008000b69e5000 x26: ffff8000bff4cfe0
[    0.820537] x25: ffff000008ba69e0 x24: 0000000000000001
[    0.820541] x23: ffff000008fce000 x22: ffff000008ba70c8
[    0.820545] x21: 0000000000000001 x20: 0000000000000003
[    0.820548] x19: ffff00000a35d628 x18: ffffffffffffffff
[    0.820552] x17: 0000000000000000 x16: 0000000000000000
[    0.820556] x15: ffff00000958f848 x14: 455f3052464d4d34
[    0.820559] x13: 00000000769dde98 x12: ffff8000bf3f65a8
[    0.820564] x11: 0000000000000000 x10: ffff00000958f848
[    0.820567] x9 : ffff000009592000 x8 : ffff00000958f848
[    0.820571] x7 : ffff00000818ffa0 x6 : 0000000000000000
[    0.820574] x5 : 0000000000000000 x4 : 0000000000000001
[    0.820578] x3 : 0000000000000000 x2 : 0000000000000001
[    0.820582] x1 : 00000000ffffffff x0 : 0000000000000000
[    0.820587] Call trace:
[    0.820591]  lockdep_assert_cpus_held+0x3c/0x48
[    0.820598]  static_key_enable_cpuslocked+0x28/0xd0
[    0.820606]  arch_timer_check_ool_workaround+0xe8/0x228
[    0.820610]  arch_timer_starting_cpu+0xe4/0x2d8
[    0.820615]  cpuhp_invoke_callback+0xe8/0xd08
[    0.820619]  notify_cpu_starting+0x80/0xb8
[    0.820625]  secondary_start_kernel+0x118/0x1d0

We've also had a similar warning in sched_init_smp() for every
asymmetric system that would enable the sched_asym_cpucapacity static
key, although that was singled out in:

  commit 40fa3780ba ("sched/core: Take the hotplug lock in sched_init_smp()")

Those warnings are actually harmless, since we cannot have hotplug
operations at the time they appear. Instead of starting to sprinkle
useless hotplug lock operations in the init codepaths, mute the
warnings until they start warning about real problems.

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will.deacon@arm.com>
Cc: cai@gmx.us
Cc: daniel.lezcano@linaro.org
Cc: dietmar.eggemann@arm.com
Cc: linux-arm-kernel@lists.infradead.org
Cc: longman@redhat.com
Cc: marc.zyngier@arm.com
Cc: mark.rutland@arm.com
Link: https://lkml.kernel.org/r/1545243796-23224-2-git-send-email-valentin.schneider@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-04-05 22:33:14 +02:00
Oleg Nesterov
d0bc74c563 cgroup/pids: turn cgroup_subsys->free() into cgroup_subsys->release() to fix the accounting
[ Upstream commit 51bee5abea ]

The only user of cgroup_subsys->free() callback is pids_cgrp_subsys which
needs pids_free() to uncharge the pid.

However, ->free() is called from __put_task_struct()->cgroup_free() and this
is too late. Even the trivial program which does

	for (;;) {
		int pid = fork();
		assert(pid >= 0);
		if (pid)
			wait(NULL);
		else
			exit(0);
	}

can run out of limits because release_task()->call_rcu(delayed_put_task_struct)
implies an RCU gp after the task/pid goes away and before the final put().

Test-case:

	mkdir -p /tmp/CG
	mount -t cgroup2 none /tmp/CG
	echo '+pids' > /tmp/CG/cgroup.subtree_control

	mkdir /tmp/CG/PID
	echo 2 > /tmp/CG/PID/pids.max

	perl -e 'while ($p = fork) { wait; } $p // die "fork failed: $!\n"' &
	echo $! > /tmp/CG/PID/cgroup.procs

Without this patch the forking process fails soon after migration.

Rename cgroup_subsys->free() to cgroup_subsys->release() and move the callsite
into the new helper, cgroup_release(), called by release_task() which actually
frees the pid(s).

Reported-by: Herton R. Krzesinski <hkrzesin@redhat.com>
Reported-by: Jan Stancek <jstancek@redhat.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-04-05 22:33:13 +02:00
Andrea Parri
e8e0bd4915 sched/core: Use READ_ONCE()/WRITE_ONCE() in move_queued_task()/task_rq_lock()
[ Upstream commit c546951d9c ]

move_queued_task() synchronizes with task_rq_lock() as follows:

	move_queued_task()		task_rq_lock()

	[S] ->on_rq = MIGRATING		[L] rq = task_rq()
	WMB (__set_task_cpu())		ACQUIRE (rq->lock);
	[S] ->cpu = new_cpu		[L] ->on_rq

where "[L] rq = task_rq()" is ordered before "ACQUIRE (rq->lock)" by an
address dependency and, in turn, "ACQUIRE (rq->lock)" is ordered before
"[L] ->on_rq" by the ACQUIRE itself.

Use READ_ONCE() to load ->cpu in task_rq() (c.f., task_cpu()) to honor
this address dependency.  Also, mark the accesses to ->cpu and ->on_rq
with READ_ONCE()/WRITE_ONCE() to comply with the LKMM.

Signed-off-by: Andrea Parri <andrea.parri@amarulasolutions.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul E. McKenney <paulmck@linux.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will.deacon@arm.com>
Link: https://lkml.kernel.org/r/20190121155240.27173-1-andrea.parri@amarulasolutions.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-04-05 22:33:12 +02:00
Hidetoshi Seto
f056c90f07 sched/debug: Initialize sd_sysctl_cpus if !CONFIG_CPUMASK_OFFSTACK
[ Upstream commit 1ca4fa3ab6 ]

register_sched_domain_sysctl() copies the cpu_possible_mask into
sd_sysctl_cpus, but only if sd_sysctl_cpus hasn't already been
allocated (ie, CONFIG_CPUMASK_OFFSTACK is set).  However, when
CONFIG_CPUMASK_OFFSTACK is not set, sd_sysctl_cpus is left
uninitialized (all zeroes) and the kernel may fail to initialize
sched_domain sysctl entries for all possible CPUs.

This is visible to the user if the kernel is booted with maxcpus=n, or
if ACPI tables have been modified to leave CPUs offline, and then
checking for missing /proc/sys/kernel/sched_domain/cpu* entries.

Fix this by separating the allocation and initialization, and adding a
flag to initialize the possible CPU entries while system booting only.

Tested-by: Syuuichirou Ishii <ishii.shuuichir@jp.fujitsu.com>
Tested-by: Tarumizu, Kohei <tarumizu.kohei@jp.fujitsu.com>
Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com>
Acked-by: Joe Lawrence <joe.lawrence@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Masayoshi Mizuma <msys.mizuma@gmail.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20190129151245.5073-1-msys.mizuma@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-04-05 22:33:11 +02:00
Mathieu Poirier
efd85d83ac perf/aux: Make perf_event accessible to setup_aux()
[ Upstream commit 840018668c ]

When pmu::setup_aux() is called the coresight PMU needs to know which
sink to use for the session by looking up the information in the
event's attr::config2 field.

As such simply replace the cpu information by the complete perf_event
structure and change all affected customers.

Signed-off-by: Mathieu Poirier <mathieu.poirier@linaro.org>
Reviewed-by: Suzuki Poulouse <suzuki.poulose@arm.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will.deacon@arm.com>
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-s390@vger.kernel.org
Link: http://lkml.kernel.org/r/20190131184714.20388-2-mathieu.poirier@linaro.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-04-05 22:33:11 +02:00
Thomas Gleixner
1f3694865d genirq: Avoid summation loops for /proc/stat
[ Upstream commit 1136b07289 ]

Waiman reported that on large systems with a large amount of interrupts the
readout of /proc/stat takes a long time to sum up the interrupt
statistics. In principle this is not a problem. but for unknown reasons
some enterprise quality software reads /proc/stat with a high frequency.

The reason for this is that interrupt statistics are accounted per cpu. So
the /proc/stat logic has to sum up the interrupt stats for each interrupt.

This can be largely avoided for interrupts which are not marked as
'PER_CPU' interrupts by simply adding a per interrupt summation counter
which is incremented along with the per interrupt per cpu counter.

The PER_CPU interrupts need to avoid that and use only per cpu accounting
because they share the interrupt number and the interrupt descriptor and
concurrent updates would conflict or require unwanted synchronization.

Reported-by: Waiman Long <longman@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Waiman Long <longman@redhat.com>
Reviewed-by: Marc Zyngier <marc.zyngier@arm.com>
Reviewed-by: Davidlohr Bueso <dbueso@suse.de>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: linux-fsdevel@vger.kernel.org
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: Daniel Colascione <dancol@google.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Link: https://lkml.kernel.org/r/20190208135020.925487496@linutronix.de

8<-------------

v2: Undo the unintentional layout change of struct irq_desc.

 include/linux/irqdesc.h |    1 +
 kernel/irq/chip.c       |   12 ++++++++++--
 kernel/irq/internals.h  |    8 +++++++-
 kernel/irq/irqdesc.c    |    7 ++++++-
 4 files changed, 24 insertions(+), 4 deletions(-)

Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-04-05 22:33:09 +02:00
Luc Van Oostenryck
845d4849b6 sched/topology: Fix percpu data types in struct sd_data & struct s_data
[ Upstream commit 99687cdbb3 ]

The percpu members of struct sd_data and s_data are declared as:

	struct ... ** __percpu member;

So their type is:

	__percpu pointer to pointer to struct ...

But looking at how they're used, their type should be:

	pointer to __percpu pointer to struct ...

and they should thus be declared as:

	struct ... * __percpu *member;

So fix the placement of '__percpu' in the definition of these
structures.

This addresses a bunch of Sparse's warnings like:

	warning: incorrect type in initializer (different address spaces)
	  expected void const [noderef] <asn:3> *__vpp_verify
	  got struct sched_domain **

Signed-off-by: Luc Van Oostenryck <luc.vanoostenryck@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20190118144936.79158-1-luc.vanoostenryck@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-04-05 22:33:09 +02:00
Masami Hiramatsu
d53b295f78 kprobes: Prohibit probing on RCU debug routine
[ Upstream commit a39f15b964 ]

Since kprobe itself depends on RCU, probing on RCU debug
routine can cause recursive breakpoint bugs.

Prohibit probing on RCU debug routines.

int3
 ->do_int3()
   ->ist_enter()
     ->RCU_LOCKDEP_WARN()
       ->debug_lockdep_rcu_enabled() -> int3

Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Andrea Righi <righi.andrea@gmail.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/154998807741.31052.11229157537816341591.stgit@devbox
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-04-05 22:33:08 +02:00
Tejun Heo
a74ebf047e cgroup, rstat: Don't flush subtree root unless necessary
[ Upstream commit b4ff1b44bc ]

cgroup_rstat_cpu_pop_updated() is used to traverse the updated cgroups
on flush.  While it was only visiting updated ones in the subtree, it
was visiting @root unconditionally.  We can easily check whether @root
is updated or not by looking at its ->updated_next just as with the
cgroups in the subtree.

* Remove the unnecessary cgroup_parent() test.  The system root cgroup
  is never updated and thus its ->updated_next is always NULL.  No
  need to test whether cgroup_parent() exists in addition to
  ->updated_next.

* Terminate traverse if ->updated_next is NULL.  This can only happen
  for subtree @root and there's no reason to visit it if it's not
  marked updated.

This reduces cpu consumption when reading a lot of rstat backed files.
In a micro benchmark reading stat from ~1600 cgroups, the sys time was
lowered by >40%.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-04-05 22:33:06 +02:00
Christian Brauner
b227f15712 sysctl: handle overflow for file-max
[ Upstream commit 32a5ad9c22 ]

Currently, when writing

  echo 18446744073709551616 > /proc/sys/fs/file-max

/proc/sys/fs/file-max will overflow and be set to 0.  That quickly
crashes the system.

This commit sets the max and min value for file-max.  The max value is
set to long int.  Any higher value cannot currently be used as the
percpu counters are long ints and not unsigned integers.

Note that the file-max value is ultimately parsed via
__do_proc_doulongvec_minmax().  This function does not report error when
min or max are exceeded.  Which means if a value largen that long int is
written userspace will not receive an error instead the old value will be
kept.  There is an argument to be made that this should be changed and
__do_proc_doulongvec_minmax() should return an error when a dedicated min
or max value are exceeded.  However this has the potential to break
userspace so let's defer this to an RFC patch.

Link: http://lkml.kernel.org/r/20190107222700.15954-3-christian@brauner.io
Signed-off-by: Christian Brauner <christian@brauner.io>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Dominik Brodowski <linux@dominikbrodowski.net>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Joe Lawrence <joe.lawrence@redhat.com>
Cc: Luis Chamberlain <mcgrof@kernel.org>
Cc: Waiman Long <longman@redhat.com>
[christian@brauner.io: v4]
  Link: http://lkml.kernel.org/r/20190210203943.8227-3-christian@brauner.io
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-04-05 22:32:57 +02:00
Douglas Anderson
b73c7d0204 tracing: kdb: Fix ftdump to not sleep
[ Upstream commit 31b265b3ba ]

As reported back in 2016-11 [1], the "ftdump" kdb command triggers a
BUG for "sleeping function called from invalid context".

kdb's "ftdump" command wants to call ring_buffer_read_prepare() in
atomic context.  A very simple solution for this is to add allocation
flags to ring_buffer_read_prepare() so kdb can call it without
triggering the allocation error.  This patch does that.

Note that in the original email thread about this, it was suggested
that perhaps the solution for kdb was to either preallocate the buffer
ahead of time or create our own iterator.  I'm hoping that this
alternative of adding allocation flags to ring_buffer_read_prepare()
can be considered since it means I don't need to duplicate more of the
core trace code into "trace_kdb.c" (for either creating my own
iterator or re-preparing a ring allocator whose memory was already
allocated).

NOTE: another option for kdb is to actually figure out how to make it
reuse the existing ftrace_dump() function and totally eliminate the
duplication.  This sounds very appealing and actually works (the "sr
z" command can be seen to properly dump the ftrace buffer).  The
downside here is that ftrace_dump() fully consumes the trace buffer.
Unless that is changed I'd rather not use it because it means "ftdump
| grep xyz" won't be very useful to search the ftrace buffer since it
will throw away the whole trace on the first grep.  A future patch to
dump only the last few lines of the buffer will also be hard to
implement.

[1] https://lkml.kernel.org/r/20161117191605.GA21459@google.com

Link: http://lkml.kernel.org/r/20190308193205.213659-1-dianders@chromium.org

Reported-by: Brian Norris <briannorris@chromium.org>
Signed-off-by: Douglas Anderson <dianders@chromium.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-04-05 22:32:56 +02:00
Greg Kroah-Hartman
0b065cd568 This is the 4.19.33 stable release
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAlykNfcACgkQONu9yGCS
 aT40dRAAiCeYjEC1zH8dkAnbFlKo6IZuhKgISfVTgrWlRe9nUTYaenBXqAfGjufH
 EzXHrD1IANRAnFfWg8xt01TNBfTaiEYnYzFmJkWAHFWGKxa5fRU5Kan0MB97r8s9
 NjoSRsnFl8l2oJI88zwFa7k89Itop9ST/zvZIgnrysAr+j8yEZb7BZWaU2UrKK/q
 qfnJxjfCb/jeqAxwVh3OkasXj0gG2JkGR/uEGTw2EARuI6pvKo5OCzYz0tXTN6ZJ
 CSzM4X7dhkGSgLIUw3JOCB28riK9TYbOdPr4MFYMrnoU5VL8+n62tXoKXewobJ1C
 2+Pmg5E54r13Rr35eoGCiHsW2LGQrOyvm8S9TFB/0SmTPtzUjFlHBC62Vs/AW5ut
 HSmwVy+ILM/xTIdts5QT58Gw+5NCmHxw2oEdrgcct+6FtnR9XPOqZYyzH1tw1ZB+
 DL2PqYyT9czuo2bKanWA37M8339q2INDFskXqKRokQ9GNiqUx1E6fNBqtK5SqyDI
 CdohtAs7xSQoPPbKDiITOmt82MM8xvefKmqTIvHN5B7Ns4lT1QC74DCXEFEKtw6M
 l+p64h6Qw4DiKmqna7fsbKPjmg8pg1lVrCOwD5iUF3JXy+Wxi/OKnr2gqAVMChAq
 GSfFFf+MMZhYzJPRQKxOn4GwDBDz5niaWvQmlPcPWLJ5jdbzl+Y=
 =Yyc/
 -----END PGP SIGNATURE-----

Merge 4.19.33 into android-4.19

Changes in 4.19.33
	Bluetooth: Check L2CAP option sizes returned from l2cap_get_conf_opt
	Bluetooth: Verify that l2cap_get_conf_opt provides large enough buffer
	ipmi_si: Fix crash when using hard-coded device
	dccp: do not use ipv6 header for ipv4 flow
	genetlink: Fix a memory leak on error path
	gtp: change NET_UDP_TUNNEL dependency to select
	ipv6: make ip6_create_rt_rcu return ip6_null_entry instead of NULL
	mac8390: Fix mmio access size probe
	mISDN: hfcpci: Test both vendor & device ID for Digium HFC4S
	net: aquantia: fix rx checksum offload for UDP/TCP over IPv6
	net: datagram: fix unbounded loop in __skb_try_recv_datagram()
	net/packet: Set __GFP_NOWARN upon allocation in alloc_pg_vec
	net: phy: meson-gxl: fix interrupt support
	net: rose: fix a possible stack overflow
	net: stmmac: fix memory corruption with large MTUs
	net-sysfs: call dev_hold if kobject_init_and_add success
	packets: Always register packet sk in the same order
	rhashtable: Still do rehash when we get EEXIST
	sctp: get sctphdr by offset in sctp_compute_cksum
	sctp: use memdup_user instead of vmemdup_user
	tcp: do not use ipv6 header for ipv4 flow
	tipc: allow service ranges to be connect()'ed on RDM/DGRAM
	tipc: change to check tipc_own_id to return in tipc_net_stop
	tipc: fix cancellation of topology subscriptions
	tun: properly test for IFF_UP
	vrf: prevent adding upper devices
	vxlan: Don't call gro_cells_destroy() before device is unregistered
	ila: Fix rhashtable walker list corruption
	net: sched: fix cleanup NULL pointer exception in act_mirr
	thunderx: enable page recycling for non-XDP case
	thunderx: eliminate extra calls to put_page() for pages held for recycling
	tun: add a missing rcu_read_unlock() in error path
	powerpc/fsl: Add infrastructure to fixup branch predictor flush
	powerpc/fsl: Add macro to flush the branch predictor
	powerpc/fsl: Emulate SPRN_BUCSR register
	powerpc/fsl: Add nospectre_v2 command line argument
	powerpc/fsl: Flush the branch predictor at each kernel entry (64bit)
	powerpc/fsl: Flush the branch predictor at each kernel entry (32 bit)
	powerpc/fsl: Flush branch predictor when entering KVM
	powerpc/fsl: Enable runtime patching if nospectre_v2 boot arg is used
	powerpc/fsl: Update Spectre v2 reporting
	powerpc/fsl: Fixed warning: orphan section `__btb_flush_fixup'
	powerpc/fsl: Fix the flush of branch predictor.
	powerpc/security: Fix spectre_v2 reporting
	Btrfs: fix incorrect file size after shrinking truncate and fsync
	btrfs: remove WARN_ON in log_dir_items
	btrfs: don't report readahead errors and don't update statistics
	btrfs: raid56: properly unmap parity page in finish_parity_scrub()
	btrfs: Avoid possible qgroup_rsv_size overflow in btrfs_calculate_inode_block_rsv_size
	Btrfs: fix assertion failure on fsync with NO_HOLES enabled
	ARM: imx6q: cpuidle: fix bug that CPU might not wake up at expected time
	powerpc: bpf: Fix generation of load/store DW instructions
	vfio: ccw: only free cp on final interrupt
	NFS: fix mount/umount race in nlmclnt.
	NFSv4.1 don't free interrupted slot on open
	net: dsa: qca8k: remove leftover phy accessors
	ALSA: rawmidi: Fix potential Spectre v1 vulnerability
	ALSA: seq: oss: Fix Spectre v1 vulnerability
	ALSA: pcm: Fix possible OOB access in PCM oss plugins
	ALSA: pcm: Don't suspend stream in unrecoverable PCM state
	ALSA: hda/realtek - Add support headset mode for DELL WYSE AIO
	ALSA: hda/realtek - Add support headset mode for New DELL WYSE NB
	ALSA: hda/realtek: Enable headset MIC of Acer AIO with ALC286
	ALSA: hda/realtek: Enable headset MIC of Acer Aspire Z24-890 with ALC286
	ALSA: hda/realtek - Add support for Acer Aspire E5-523G/ES1-432 headset mic
	ALSA: hda/realtek: Enable ASUS X441MB and X705FD headset MIC with ALC256
	ALSA: hda/realtek: Enable headset mic of ASUS P5440FF with ALC256
	ALSA: hda/realtek: Enable headset MIC of ASUS X430UN and X512DK with ALC256
	ALSA: hda/realtek - Fix speakers on Acer Predator Helios 500 Ryzen laptops
	kbuild: modversions: Fix relative CRC byte order interpretation
	fs/open.c: allow opening only regular files during execve()
	ocfs2: fix inode bh swapping mixup in ocfs2_reflink_inodes_lock
	scsi: sd: Fix a race between closing an sd device and sd I/O
	scsi: sd: Quiesce warning if device does not report optimal I/O size
	scsi: zfcp: fix rport unblock if deleted SCSI devices on Scsi_Host
	scsi: zfcp: fix scsi_eh host reset with port_forced ERP for non-NPIV FCP devices
	drm/rockchip: vop: reset scale mode when win is disabled
	tty: mxs-auart: fix a potential NULL pointer dereference
	tty: atmel_serial: fix a potential NULL pointer dereference
	tty: serial: qcom_geni_serial: Initialize baud in qcom_geni_console_setup
	staging: comedi: ni_mio_common: Fix divide-by-zero for DIO cmdtest
	staging: speakup_soft: Fix alternate speech with other synths
	staging: vt6655: Remove vif check from vnt_interrupt
	staging: vt6655: Fix interrupt race condition on device start up.
	staging: erofs: fix to handle error path of erofs_vmap()
	serial: max310x: Fix to avoid potential NULL pointer dereference
	serial: mvebu-uart: Fix to avoid a potential NULL pointer dereference
	serial: sh-sci: Fix setting SCSCR_TIE while transferring data
	USB: serial: cp210x: add new device id
	USB: serial: ftdi_sio: add additional NovaTech products
	USB: serial: mos7720: fix mos_parport refcount imbalance on error path
	USB: serial: option: set driver_info for SIM5218 and compatibles
	USB: serial: option: add support for Quectel EM12
	USB: serial: option: add Olicard 600
	Disable kgdboc failed by echo space to /sys/module/kgdboc/parameters/kgdboc
	fs/proc/proc_sysctl.c: fix NULL pointer dereference in put_links
	drm/vgem: fix use-after-free when drm_gem_handle_create() fails
	drm/vkms: fix use-after-free when drm_gem_handle_create() fails
	drm/i915/gvt: Fix MI_FLUSH_DW parsing with correct index check
	gpio: exar: add a check for the return value of ida_simple_get fails
	gpio: adnp: Fix testing wrong value in adnp_gpio_direction_input
	phy: sun4i-usb: Support set_mode to USB_HOST for non-OTG PHYs
	usb: mtu3: fix EXTCON dependency
	USB: gadget: f_hid: fix deadlock in f_hidg_write()
	usb: common: Consider only available nodes for dr_mode
	usb: host: xhci-rcar: Add XHCI_TRUST_TX_LENGTH quirk
	xhci: Fix port resume done detection for SS ports with LPM enabled
	usb: xhci: dbc: Don't free all memory with spinlock held
	xhci: Don't let USB3 ports stuck in polling state prevent suspend
	usb: cdc-acm: fix race during wakeup blocking TX traffic
	mm: add support for kmem caches in DMA32 zone
	iommu/io-pgtable-arm-v7s: request DMA32 memory, and improve debugging
	mm: mempolicy: make mbind() return -EIO when MPOL_MF_STRICT is specified
	mm/migrate.c: add missing flush_dcache_page for non-mapped page migrate
	perf pmu: Fix parser error for uncore event alias
	perf intel-pt: Fix TSC slip
	objtool: Query pkg-config for libelf location
	powerpc/pseries/energy: Use OF accessor functions to read ibm,drc-indexes
	powerpc/64: Fix memcmp reading past the end of src/dest
	watchdog: Respect watchdog cpumask on CPU hotplug
	cpu/hotplug: Prevent crash when CPU bringup fails on CONFIG_HOTPLUG_CPU=n
	x86/smp: Enforce CONFIG_HOTPLUG_CPU when SMP=y
	KVM: Reject device ioctls from processes other than the VM's creator
	KVM: x86: update %rip after emulating IO
	KVM: x86: Emulate MSR_IA32_ARCH_CAPABILITIES on AMD hosts
	staging: erofs: fix error handling when failed to read compresssed data
	staging: erofs: keep corrupted fs from crashing kernel in erofs_readdir()
	bpf: do not restore dst_reg when cur_state is freed
	drivers: base: Helpers for adding device connection descriptions
	platform: x86: intel_cht_int33fe: Register all connections at once
	platform: x86: intel_cht_int33fe: Add connection for the DP alt mode
	platform: x86: intel_cht_int33fe: Add connections for the USB Type-C port
	usb: typec: class: Don't use port parent for getting mux handles
	platform: x86: intel_cht_int33fe: Remove the old connections for the muxes
	Linux 4.19.33

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
2019-04-03 06:53:19 +02:00
Xu Yu
f5959dec08 bpf: do not restore dst_reg when cur_state is freed
commit 0803278b0b upstream.

Syzkaller hit 'KASAN: use-after-free Write in sanitize_ptr_alu' bug.

Call trace:

  dump_stack+0xbf/0x12e
  print_address_description+0x6a/0x280
  kasan_report+0x237/0x360
  sanitize_ptr_alu+0x85a/0x8d0
  adjust_ptr_min_max_vals+0x8f2/0x1ca0
  adjust_reg_min_max_vals+0x8ed/0x22e0
  do_check+0x1ca6/0x5d00
  bpf_check+0x9ca/0x2570
  bpf_prog_load+0xc91/0x1030
  __se_sys_bpf+0x61e/0x1f00
  do_syscall_64+0xc8/0x550
  entry_SYSCALL_64_after_hwframe+0x49/0xbe

Fault injection trace:

  kfree+0xea/0x290
  free_func_state+0x4a/0x60
  free_verifier_state+0x61/0xe0
  push_stack+0x216/0x2f0	          <- inject failslab
  sanitize_ptr_alu+0x2b1/0x8d0
  adjust_ptr_min_max_vals+0x8f2/0x1ca0
  adjust_reg_min_max_vals+0x8ed/0x22e0
  do_check+0x1ca6/0x5d00
  bpf_check+0x9ca/0x2570
  bpf_prog_load+0xc91/0x1030
  __se_sys_bpf+0x61e/0x1f00
  do_syscall_64+0xc8/0x550
  entry_SYSCALL_64_after_hwframe+0x49/0xbe

When kzalloc() fails in push_stack(), free_verifier_state() will free
current verifier state. As push_stack() returns, dst_reg was restored
if ptr_is_dst_reg is false. However, as member of the cur_state,
dst_reg is also freed, and error occurs when dereferencing dst_reg.
Simply fix it by testing ret of push_stack() before restoring dst_reg.

Fixes: 979d63d50c ("bpf: prevent out of bounds speculation on pointer arithmetic")
Signed-off-by: Xu Yu <xuyu@linux.alibaba.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-04-03 06:26:30 +02:00
Thomas Gleixner
a56aa02e6f cpu/hotplug: Prevent crash when CPU bringup fails on CONFIG_HOTPLUG_CPU=n
commit 206b92353c upstream.

Tianyu reported a crash in a CPU hotplug teardown callback when booting a
kernel which has CONFIG_HOTPLUG_CPU disabled with the 'nosmt' boot
parameter.

It turns out that the SMP=y CONFIG_HOTPLUG_CPU=n case has been broken
forever in case that a bringup callback fails. Unfortunately this issue was
not recognized when the CPU hotplug code was reworked, so the shortcoming
just stayed in place.

When a bringup callback fails, the CPU hotplug code rolls back the
operation and takes the CPU offline.

The 'nosmt' command line argument uses a bringup failure to abort the
bringup of SMT sibling CPUs. This partial bringup is required due to the
MCE misdesign on Intel CPUs.

With CONFIG_HOTPLUG_CPU=y the rollback works perfectly fine, but
CONFIG_HOTPLUG_CPU=n lacks essential mechanisms to exercise the low level
teardown of a CPU including the synchronizations in various facilities like
RCU, NOHZ and others.

As a consequence the teardown callbacks which must be executed on the
outgoing CPU within stop machine with interrupts disabled are executed on
the control CPU in interrupt enabled and preemptible context causing the
kernel to crash and burn. The pre state machine code has a different
failure mode which is more subtle and resulting in a less obvious use after
free crash because the control side frees resources which are still in use
by the undead CPU.

But this is not a x86 only problem. Any architecture which supports the
SMP=y HOTPLUG_CPU=n combination suffers from the same issue. It's just less
likely to be triggered because in 99.99999% of the cases all bringup
callbacks succeed.

The easy solution of making HOTPLUG_CPU mandatory for SMP is not working on
all architectures as the following architectures have either no hotplug
support at all or not all subarchitectures support it:

 alpha, arc, hexagon, openrisc, riscv, sparc (32bit), mips (partial).

Crashing the kernel in such a situation is not an acceptable state
either.

Implement a minimal rollback variant by limiting the teardown to the point
where all regular teardown callbacks have been invoked and leave the CPU in
the 'dead' idle state. This has the following consequences:

 - the CPU is brought down to the point where the stop_machine takedown
   would happen.

 - the CPU stays there forever and is idle

 - The CPU is cleared in the CPU active mask, but not in the CPU online
   mask which is a legit state.

 - Interrupts are not forced away from the CPU

 - All facilities which only look at online mask would still see it, but
   that is the case during normal hotplug/unplug operations as well. It's
   just a (way) longer time frame.

This will expose issues, which haven't been exposed before or only seldom,
because now the normally transient state of being non active but online is
a permanent state. In testing this exposed already an issue vs. work queues
where the vmstat code schedules work on the almost dead CPU which ends up
in an unbound workqueue and triggers 'preemtible context' warnings. This is
not a problem of this change, it merily exposes an already existing issue.
Still this is better than crashing fully without a chance to debug it.

This is mainly thought as workaround for those architectures which do not
support HOTPLUG_CPU. All others should enforce HOTPLUG_CPU for SMP.

Fixes: 2e1a3483ce ("cpu/hotplug: Split out the state walk into functions")
Reported-by: Tianyu Lan <Tianyu.Lan@microsoft.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Tianyu Lan <Tianyu.Lan@microsoft.com>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Konrad Wilk <konrad.wilk@oracle.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Mukesh Ojha <mojha@codeaurora.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Rik van Riel <riel@surriel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Micheal Kelley <michael.h.kelley@microsoft.com>
Cc: "K. Y. Srinivasan" <kys@microsoft.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: K. Y. Srinivasan <kys@microsoft.com>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20190326163811.503390616@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-04-03 06:26:29 +02:00
Thomas Gleixner
336f6b23b5 watchdog: Respect watchdog cpumask on CPU hotplug
commit 7dd4761711 upstream.

The rework of the watchdog core to use cpu_stop_work broke the watchdog
cpumask on CPU hotplug.

The watchdog_enable/disable() functions are now called unconditionally from
the hotplug callback, i.e. even on CPUs which are not in the watchdog
cpumask. As a consequence the watchdog can become unstoppable.

Only invoke them when the plugged CPU is in the watchdog cpumask.

Fixes: 9cf57731b6 ("watchdog/softlockup: Replace "watchdog/%u" threads with cpu_stop_work")
Reported-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Don Zickus <dzickus@redhat.com>
Cc: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/alpine.DEB.2.21.1903262245490.1789@nanos.tec.linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-04-03 06:26:29 +02:00
Greg Kroah-Hartman
6f994bf048 This is the 4.19.32 stable release
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAlybBsMACgkQONu9yGCS
 aT6mnhAApfo3mX+F3z5Ikcx7LKQZkO7AbBO/PmjPmn2AQN/I77qlYZgv3jOTk0al
 6Jk8reVS7PjKi+RDugku+xA5iaEkalFW/epS8MIp95yLEiPHLrEMmFmqd9Bbk9dy
 sPmQ1l5ZZ4h4mdlScxzIMKLDlWVB4w4Sk5zBl2zGwG/KiQ2zEWz+3Tfz6glQPu+o
 GGH9AL+9HBBVqtTlF63LBPvdz6er5NHTOrgHC7K4GXLnt9B8+kcObOjoLDtGqXNG
 tl1cQDOtMyMm64r+OTvfEwzIQ6shfcbxQTdLheJlJmCkIHTY5A9Xeyb9S1Opa/Xg
 k8zj03StKMTdqQfOHgbYdIVyHJ/nWmsRNIP5fDp8YGl91pBaHrqHCSLaYtFrrV6n
 yvHfl29e8QH8SYH/1VMXziFGTncqUO2/7NTmWZWJ+B/1oxHwSrvcIpo2q6mRaJwD
 i3XRnanvczvpefMaQPcUrI+aMUXPPeEytrGbqW2KuX/uxhXtV0jFB517JeY6UQw+
 OqEiIRYx3FyQZDNvxUn66+Prr2wt4vOMK7WzV/PrH49/JmxPJypjSXjoRsbRxq8N
 hnD+JTK8mX6K2NgBwh2Ez2fnCQxPTbH12fk2NIRCVcOY8ZoiQud10mhyY9oyAkCj
 pq7X2US1W+Xml3Nn4XJHQg38rv7PrN0nFJ6Eib4EizoHzy0CFUk=
 =bYhI
 -----END PGP SIGNATURE-----

Merge 4.19.32 into android-4.19

Changes in 4.19.32
	ALSA: hda - add Lenovo IdeaCentre B550 to the power_save_blacklist
	ALSA: firewire-motu: use 'version' field of unit directory to identify model
	mmc: pxamci: fix enum type confusion
	mmc: mxcmmc: "Revert mmc: mxcmmc: handle highmem pages"
	mmc: renesas_sdhi: limit block count to 16 bit for old revisions
	drm/vmwgfx: Don't double-free the mode stored in par->set_mode
	drm/vmwgfx: Return 0 when gmrid::get_node runs out of ID's
	iommu/amd: fix sg->dma_address for sg->offset bigger than PAGE_SIZE
	libceph: wait for latest osdmap in ceph_monc_blacklist_add()
	udf: Fix crash on IO error during truncate
	mips: loongson64: lemote-2f: Add IRQF_NO_SUSPEND to "cascade" irqaction.
	MIPS: Ensure ELF appended dtb is relocated
	MIPS: Fix kernel crash for R6 in jump label branch function
	powerpc/vdso64: Fix CLOCK_MONOTONIC inconsistencies across Y2038
	scsi: ibmvscsi: Protect ibmvscsi_head from concurrent modificaiton
	scsi: ibmvscsi: Fix empty event pool access during host removal
	futex: Ensure that futex address is aligned in handle_futex_death()
	cifs: allow guest mounts to work for smb3.11
	perf probe: Fix getting the kernel map
	objtool: Move objtool_file struct off the stack
	irqchip/gic-v3-its: Fix comparison logic in lpi_range_cmp
	SMB3: Fix SMB3.1.1 guest mounts to Samba
	ALSA: x86: Fix runtime PM for hdmi-lpe-audio
	ALSA: hda/ca0132 - make pci_iounmap() call conditional
	ALSA: ac97: Fix of-node refcount unbalance
	ext4: fix NULL pointer dereference while journal is aborted
	ext4: fix data corruption caused by unaligned direct AIO
	ext4: brelse all indirect buffer in ext4_ind_remove_space()
	media: v4l2-ctrls.c/uvc: zero v4l2_event
	Bluetooth: hci_uart: Check if socket buffer is ERR_PTR in h4_recv_buf()
	Bluetooth: Fix decrementing reference count twice in releasing socket
	Bluetooth: hci_ldisc: Initialize hci_dev before open()
	Bluetooth: hci_ldisc: Postpone HCI_UART_PROTO_READY bit set in hci_uart_set_proto()
	drm: Reorder set_property_atomic to avoid returning with an active ww_ctx
	RDMA/cma: Rollback source IP address if failing to acquire device
	f2fs: fix to avoid deadlock of atomic file operations
	netfilter: ebtables: remove BUGPRINT messages
	loop: access lo_backing_file only when the loop device is Lo_bound
	x86/unwind: Handle NULL pointer calls better in frame unwinder
	x86/unwind: Add hardcoded ORC entry for NULL
	locking/lockdep: Add debug_locks check in __lock_downgrade()
	ALSA: hda - Record the current power state before suspend/resume calls
	ALSA: hda - Enforces runtime_resume after S3 and S4 for each codec
	power: supply: charger-manager: Fix incorrect return value
	Linux 4.19.32

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
2019-03-30 08:40:51 +01:00
Quentin Perret
549020c814 ANDROID: sched: Disable find_best_target() by default
Now that the mainline EAS wake-up path has been extended to cope with
prefer-idle tasks, the need for a dedicated Android-specific wake-up
routine (find_best_target()) becomes less clear. Indeed, main reasons
for introducting find_best_target() in the first place were:

  1. the energy_diff function was very slow, so we couldn't afford to
     use it on all CPUs for each wake-up for latency reasons;

  2. schedtune provides additional information about tasks (the
     prefer-idle flag in particular) which needed to be taken into
     account in the placement algorithm.

Now that the energy diff calculation is much faster (with the simplified
energy model) and that the EAS path is aware of prefer-idle tasks, there
is no clear reason to use find_best_target() any more.

So, let's disable it for now to minimize the amount of out-of-tree code
used in the scheduler. If using the mainline path doesn't cause
regressions, it is a good sign find_best_target() can be removed safely,
eventually. Otherwise, reverting back to the old behaviour is trivial
since this patch only changes the sched_feat default, but doesn't remove
the fbt() code path.

Bug: 120440300
Change-Id: Idb5d68a3c4af7d2212e0922ab6d9a089170b5e1c
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
2019-03-27 15:58:02 +00:00
Quentin Perret
d0eb1f3514 ANDROID: sched/fair: Make the EAS wake-up prefer-idle aware
Make the mainline EAS wake-up path aware of prefer idle tasks in
preparation for disabling find_best_target().

What is done in the mainline algoritm isn't strictly equivalent to the
find_best_target() algorithm but comes real close, and isn't very
invasive. The main differences with the original find_best_target()
behaviour are the following:

 1. the policy for prefer idle when there isn't a single idle CPU in the
    system is simpler now. We just pick the CPU with the highest spare
    capacity;

 2. the cstate awareness for prefer idle is implemented by minimizing
    the exit latency rather than the idle state index. This is how it is
    done in the slow path (find_idlest_group_cpu()), it doesn't require
    us to keep hooks into CPUIdle, and should actually be better because
    what we want is a CPU that can wake up quickly;

 3. non-prefer-idle tasks just use the standard mainline energy-aware
    wake-up path, which decides the placement using the Energy Model.

Bug: 120440300
Change-Id: I57769c90c57115f6a28d27c5a88e08aa93a30a56
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
2019-03-27 15:58:02 +00:00
Waiman Long
0e0f7b3072 locking/lockdep: Add debug_locks check in __lock_downgrade()
commit 7149258057 upstream.

Tetsuo Handa had reported he saw an incorrect "downgrading a read lock"
warning right after a previous lockdep warning. It is likely that the
previous warning turned off lock debugging causing the lockdep to have
inconsistency states leading to the lock downgrade warning.

Fix that by add a check for debug_locks at the beginning of
__lock_downgrade().

Debugged-by: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Reported-by: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Reported-by: syzbot+53383ae265fb161ef488@syzkaller.appspotmail.com
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will.deacon@arm.com>
Link: https://lkml.kernel.org/r/1547093005-26085-1-git-send-email-longman@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-03-27 14:14:43 +09:00
Chen Jie
36d52f5bcd futex: Ensure that futex address is aligned in handle_futex_death()
commit 5a07168d8d upstream.

The futex code requires that the user space addresses of futexes are 32bit
aligned. sys_futex() checks this in futex_get_keys() but the robust list
code has no alignment check in place.

As a consequence the kernel crashes on architectures with strict alignment
requirements in handle_futex_death() when trying to cmpxchg() on an
unaligned futex address which was retrieved from the robust list.

[ tglx: Rewrote changelog, proper sizeof() based alignement check and add
  	comment ]

Fixes: 0771dfefc9 ("[PATCH] lightweight robust futexes: core")
Signed-off-by: Chen Jie <chenjie6@huawei.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: <dvhart@infradead.org>
Cc: <peterz@infradead.org>
Cc: <zengweilin@huawei.com>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/1552621478-119787-1-git-send-email-chenjie6@huawei.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-03-27 14:14:40 +09:00
Vincent Guittot
d36e8b820e UPSTREAM: sched/pelt: Skip updating util_est when utilization is higher than CPU's capacity
util_est is mainly meant to be a lower-bound for tasks utilization.
That's why task_util_est() returns the actual util_avg when it's higher
than the estimated utilization.

With new invaraince signal and without any special check on samples
collection, if a task is limited because of thermal capping for
example, we could end up overestimating its utilization and thus
perhaps generating an unwanted frequency spike when the capping is
relaxed... and (even worst) it will take some more activations for the
estimated utilization to converge back to the actual utilization.

Since we cannot easily know if there is idle time in a CPU when a task
completes an activation with a utilization higher then the CPU capacity,
we skip the sampling when utilization is higher than CPU's capacity.

Bug: 120440300
Change-Id: If1a6001451f80acb953e2a5f955fd302b1b73bc0
Suggested-by: Patrick Bellasi <patrick.bellasi@arm.com>
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Morten.Rasmussen@arm.com
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bsegall@google.com
Cc: dietmar.eggemann@arm.com
Cc: pjt@google.com
Cc: pkondeti@codeaurora.org
Cc: quentin.perret@arm.com
Cc: rjw@rjwysocki.net
Cc: srinivas.pandruvada@linux.intel.com
Cc: thara.gopinath@linaro.org
Link: https://lkml.kernel.org/r/1548257214-13745-4-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit 10a35e6812)
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
2019-03-26 14:22:50 +00:00
Vincent Guittot
eb0db1782a UPSTREAM: sched/fair: Update scale invariance of PELT
The current implementation of load tracking invariance scales the
contribution with current frequency and uarch performance (only for
utilization) of the CPU. One main result of this formula is that the
figures are capped by current capacity of CPU. Another one is that the
load_avg is not invariant because not scaled with uarch.

The util_avg of a periodic task that runs r time slots every p time slots
varies in the range :

    U * (1-y^r)/(1-y^p) * y^i < Utilization < U * (1-y^r)/(1-y^p)

with U is the max util_avg value = SCHED_CAPACITY_SCALE

At a lower capacity, the range becomes:

    U * C * (1-y^r')/(1-y^p) * y^i' < Utilization <  U * C * (1-y^r')/(1-y^p)

with C reflecting the compute capacity ratio between current capacity and
max capacity.

so C tries to compensate changes in (1-y^r') but it can't be accurate.

Instead of scaling the contribution value of PELT algo, we should scale the
running time. The PELT signal aims to track the amount of computation of
tasks and/or rq so it seems more correct to scale the running time to
reflect the effective amount of computation done since the last update.

In order to be fully invariant, we need to apply the same amount of
running time and idle time whatever the current capacity. Because running
at lower capacity implies that the task will run longer, we have to ensure
that the same amount of idle time will be applied when system becomes idle
and no idle time has been "stolen". But reaching the maximum utilization
value (SCHED_CAPACITY_SCALE) means that the task is seen as an
always-running task whatever the capacity of the CPU (even at max compute
capacity). In this case, we can discard this "stolen" idle times which
becomes meaningless.

In order to achieve this time scaling, a new clock_pelt is created per rq.
The increase of this clock scales with current capacity when something
is running on rq and synchronizes with clock_task when rq is idle. With
this mechanism, we ensure the same running and idle time whatever the
current capacity. This also enables to simplify the pelt algorithm by
removing all references of uarch and frequency and applying the same
contribution to utilization and loads. Furthermore, the scaling is done
only once per update of clock (update_rq_clock_task()) instead of during
each update of sched_entities and cfs/rt/dl_rq of the rq like the current
implementation. This is interesting when cgroup are involved as shown in
the results below:

On a hikey (octo Arm64 platform).
Performance cpufreq governor and only shallowest c-state to remove variance
generated by those power features so we only track the impact of pelt algo.

each test runs 16 times:

	./perf bench sched pipe
	(higher is better)
	kernel	tip/sched/core     + patch
	        ops/seconds        ops/seconds         diff
	cgroup
	root    59652(+/- 0.18%)   59876(+/- 0.24%)    +0.38%
	level1  55608(+/- 0.27%)   55923(+/- 0.24%)    +0.57%
	level2  52115(+/- 0.29%)   52564(+/- 0.22%)    +0.86%

	hackbench -l 1000
	(lower is better)
	kernel	tip/sched/core     + patch
	        duration(sec)      duration(sec)        diff
	cgroup
	root    4.453(+/- 2.37%)   4.383(+/- 2.88%)     -1.57%
	level1  4.859(+/- 8.50%)   4.830(+/- 7.07%)     -0.60%
	level2  5.063(+/- 9.83%)   4.928(+/- 9.66%)     -2.66%

Then, the responsiveness of PELT is improved when CPU is not running at max
capacity with this new algorithm. I have put below some examples of
duration to reach some typical load values according to the capacity of the
CPU with current implementation and with this patch. These values has been
computed based on the geometric series and the half period value:

  Util (%)     max capacity  half capacity(mainline)  half capacity(w/ patch)
  972 (95%)    138ms         not reachable            276ms
  486 (47.5%)  30ms          138ms                     60ms
  256 (25%)    13ms           32ms                     26ms

On my hikey (octo Arm64 platform) with schedutil governor, the time to
reach max OPP when starting from a null utilization, decreases from 223ms
with current scale invariance down to 121ms with the new algorithm.

Bug: 120440300
Change-Id: I0bd4ed2317f2a9a965634e53ce1476417af697a6
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Morten.Rasmussen@arm.com
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bsegall@google.com
Cc: dietmar.eggemann@arm.com
Cc: patrick.bellasi@arm.com
Cc: pjt@google.com
Cc: pkondeti@codeaurora.org
Cc: quentin.perret@arm.com
Cc: rjw@rjwysocki.net
Cc: srinivas.pandruvada@linux.intel.com
Cc: thara.gopinath@linaro.org
Link: https://lkml.kernel.org/r/1548257214-13745-3-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit 2312729688)
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
2019-03-26 14:22:50 +00:00
Vincent Guittot
0dd28f4253 UPSTREAM: sched/fair: Move the rq_of() helper function
Move rq_of() helper function so it can be used in pelt.c

[ mingo: Improve readability while at it. ]

Bug: 120440300
Change-Id: I2133979476631d68baaffcaa308f4cdab94f22b1
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Morten.Rasmussen@arm.com
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bsegall@google.com
Cc: dietmar.eggemann@arm.com
Cc: patrick.bellasi@arm.com
Cc: pjt@google.com
Cc: pkondeti@codeaurora.org
Cc: quentin.perret@arm.com
Cc: rjw@rjwysocki.net
Cc: srinivas.pandruvada@linux.intel.com
Cc: thara.gopinath@linaro.org
Link: https://lkml.kernel.org/r/1548257214-13745-2-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit 62478d9911)
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
2019-03-26 14:22:50 +00:00
Dietmar Eggemann
5bfd3ce4e0 UPSTREAM: sched/fair: Remove setting task's se->runnable_weight during PELT update
A CFS (SCHED_OTHER, SCHED_BATCH or SCHED_IDLE policy) task's
se->runnable_weight must always be in sync with its se->load.weight.

se->runnable_weight is set to se->load.weight when the task is
forked (init_entity_runnable_average()) or reniced (reweight_entity()).

There are two cases in set_load_weight() which since they currently only
set se->load.weight could lead to a situation in which se->load.weight
is different to se->runnable_weight for a CFS task:

(1) A task switches to SCHED_IDLE.

(2) A SCHED_FIFO, SCHED_RR or SCHED_DEADLINE task which has been reniced
    (during which only its static priority gets set) switches to
    SCHED_OTHER or SCHED_BATCH.

Set se->runnable_weight to se->load.weight in these two cases to prevent
this. This eliminates the need to explicitly set it to se->load.weight
during PELT updates in the CFS scheduler fastpath.

Bug: 120440300
Change-Id: I52184a9e1fd53cb42ef3ae546b1fae78b744c9ad
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Patrick Bellasi <patrick.bellasi@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Quentin Perret <quentin.perret@arm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Link: http://lkml.kernel.org/r/20180803140538.1178-1-dietmar.eggemann@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit 4a465e3ebb)
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
2019-03-26 14:22:49 +00:00
Greg Kroah-Hartman
bb418a146a This is the 4.19.31 stable release
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAlyWhJcACgkQONu9yGCS
 aT6XzxAAzP2QGzC4SVPgcFH1woF/d8Cz0zQ81mLXzjXtEPm39fZCM2hbBnxkXLu1
 peFyrKNk6/c9541D9gsQCQT6Fu+H6u1bJKcIezlKJ2xyB/MsU1hXkjZrTJYW3RRs
 gimy1EGdood2el1ubEBZiaspazoeRzBqtg1Nsmr4V0l+RT8HwtKKw+0+Nxixfp59
 NoVkqTpPI5mL0FiH2R9ogcfg3SvgMZOsOhOBjdPvSjiJJsbvIWcW48MCs95XSUpF
 R+l/fWn+oiFCcIqBaFheujuqZMvVrUHZHaWAPMuoR/c3Cdf0lTBokdv6UM9c0nv3
 61jX5r5ImRI/dfQANN5mbB1YKcs5xOI+I7QZHQ2q4clsWrWyLapXW4clrAZJ6z5t
 UVeVbuLV2y5PL9GJyBcXpyY0BOf4e2gZURaPY3C5McNwgybNoiR0ZePqKb8ZhZyh
 jYOYRoBjJJpZoVTSt6MNX95NTvGaSAtqKMu1s3IeMfpwCfQKBPMOuBHr/dUqSC6I
 U0xxjk/71C15dSPVcTVJT/lmcKc6TXgoagnfbn8GBtDOAjBNsYyUJLQI+db1ERCe
 9MEB9k1Z87ROQ5jQCQmWsewOVAtFZBEvSszFmpKv3zTe8M2oFpXG56zckdiumwHU
 nSfeZTTeWzsFJd30MioEnGYm3ZwKwZx7wi0x4B4WWvBfSpp20Us=
 =xtLx
 -----END PGP SIGNATURE-----

Merge 4.19.31 into android-4.19

Changes in 4.19.31
	media: videobuf2-v4l2: drop WARN_ON in vb2_warn_zero_bytesused()
	9p: use inode->i_lock to protect i_size_write() under 32-bit
	9p/net: fix memory leak in p9_client_create
	ASoC: fsl_esai: fix register setting issue in RIGHT_J mode
	ASoC: codecs: pcm186x: fix wrong usage of DECLARE_TLV_DB_SCALE()
	ASoC: codecs: pcm186x: Fix energysense SLEEP bit
	iio: adc: exynos-adc: Fix NULL pointer exception on unbind
	mei: hbm: clean the feature flags on link reset
	mei: bus: move hw module get/put to probe/release
	stm class: Fix an endless loop in channel allocation
	crypto: caam - fix hash context DMA unmap size
	crypto: ccree - fix missing break in switch statement
	crypto: caam - fixed handling of sg list
	crypto: caam - fix DMA mapping of stack memory
	crypto: ccree - fix free of unallocated mlli buffer
	crypto: ccree - unmap buffer before copying IV
	crypto: ccree - don't copy zero size ciphertext
	crypto: cfb - add missing 'chunksize' property
	crypto: cfb - remove bogus memcpy() with src == dest
	crypto: ahash - fix another early termination in hash walk
	crypto: rockchip - fix scatterlist nents error
	crypto: rockchip - update new iv to device in multiple operations
	drm/imx: ignore plane updates on disabled crtcs
	gpu: ipu-v3: Fix i.MX51 CSI control registers offset
	drm/imx: imx-ldb: add missing of_node_puts
	gpu: ipu-v3: Fix CSI offsets for imx53
	ASoC: rt5682: Correct the setting while select ASRC clk for AD/DA filter
	clocksource: timer-ti-dm: Fix pwm dmtimer usage of fck reparenting
	KVM: arm/arm64: vgic: Make vgic_dist->lpi_list_lock a raw_spinlock
	arm64: dts: rockchip: fix graph_port warning on rk3399 bob kevin and excavator
	s390/dasd: fix using offset into zero size array error
	Input: pwm-vibra - prevent unbalanced regulator
	Input: pwm-vibra - stop regulator after disabling pwm, not before
	ARM: dts: Configure clock parent for pwm vibra
	ARM: OMAP2+: Variable "reg" in function omap4_dsi_mux_pads() could be uninitialized
	ASoC: dapm: fix out-of-bounds accesses to DAPM lookup tables
	ASoC: rsnd: fixup rsnd_ssi_master_clk_start() user count check
	KVM: arm/arm64: Reset the VCPU without preemption and vcpu state loaded
	arm/arm64: KVM: Allow a VCPU to fully reset itself
	arm/arm64: KVM: Don't panic on failure to properly reset system registers
	KVM: arm/arm64: vgic: Always initialize the group of private IRQs
	KVM: arm64: Forbid kprobing of the VHE world-switch code
	ASoC: samsung: Prevent clk_get_rate() calls in atomic context
	ARM: OMAP2+: fix lack of timer interrupts on CPU1 after hotplug
	Input: cap11xx - switch to using set_brightness_blocking()
	Input: ps2-gpio - flush TX work when closing port
	Input: matrix_keypad - use flush_delayed_work()
	mac80211: call drv_ibss_join() on restart
	mac80211: Fix Tx aggregation session tear down with ITXQs
	netfilter: compat: initialize all fields in xt_init
	blk-mq: insert rq with DONTPREP to hctx dispatch list when requeue
	ipvs: fix dependency on nf_defrag_ipv6
	floppy: check_events callback should not return a negative number
	xprtrdma: Make sure Send CQ is allocated on an existing compvec
	NFS: Don't use page_file_mapping after removing the page
	mm/gup: fix gup_pmd_range() for dax
	Revert "mm: use early_pfn_to_nid in page_ext_init"
	scsi: qla2xxx: Fix panic from use after free in qla2x00_async_tm_cmd
	net: dsa: bcm_sf2: potential array overflow in bcm_sf2_sw_suspend()
	x86/CPU: Add Icelake model number
	mm: page_alloc: fix ref bias in page_frag_alloc() for 1-byte allocs
	net: hns: Fix object reference leaks in hns_dsaf_roce_reset()
	i2c: cadence: Fix the hold bit setting
	i2c: bcm2835: Clear current buffer pointers and counts after a transfer
	auxdisplay: ht16k33: fix potential user-after-free on module unload
	Input: st-keyscan - fix potential zalloc NULL dereference
	clk: sunxi-ng: v3s: Fix TCON reset de-assert bit
	kallsyms: Handle too long symbols in kallsyms.c
	clk: sunxi: A31: Fix wrong AHB gate number
	esp: Skip TX bytes accounting when sending from a request socket
	ARM: 8824/1: fix a migrating irq bug when hotplug cpu
	bpf: only adjust gso_size on bytestream protocols
	bpf: fix lockdep false positive in stackmap
	af_key: unconditionally clone on broadcast
	ARM: 8835/1: dma-mapping: Clear DMA ops on teardown
	assoc_array: Fix shortcut creation
	keys: Fix dependency loop between construction record and auth key
	scsi: libiscsi: Fix race between iscsi_xmit_task and iscsi_complete_task
	net: systemport: Fix reception of BPDUs
	net: dsa: bcm_sf2: Do not assume DSA master supports WoL
	pinctrl: meson: meson8b: fix the sdxc_a data 1..3 pins
	qmi_wwan: apply SET_DTR quirk to Sierra WP7607
	net: mv643xx_eth: disable clk on error path in mv643xx_eth_shared_probe()
	xfrm: Fix inbound traffic via XFRM interfaces across network namespaces
	mailbox: bcm-flexrm-mailbox: Fix FlexRM ring flush timeout issue
	ASoC: topology: free created components in tplg load error
	qed: Fix iWARP buffer size provided for syn packet processing.
	qed: Fix iWARP syn packet mac address validation.
	ARM: dts: armada-xp: fix Armada XP boards NAND description
	arm64: Relax GIC version check during early boot
	ARM: tegra: Restore DT ABI on Tegra124 Chromebooks
	net: marvell: mvneta: fix DMA debug warning
	mm: handle lru_add_drain_all for UP properly
	tmpfs: fix link accounting when a tmpfile is linked in
	ixgbe: fix older devices that do not support IXGBE_MRQC_L3L4TXSWEN
	ARCv2: lib: memcpy: fix doing prefetchw outside of buffer
	ARC: uacces: remove lp_start, lp_end from clobber list
	ARCv2: support manual regfile save on interrupts
	ARCv2: don't assume core 0x54 has dual issue
	phonet: fix building with clang
	mac80211_hwsim: propagate genlmsg_reply return code
	bpf, lpm: fix lookup bug in map_delete_elem
	net: thunderx: make CFG_DONE message to run through generic send-ack sequence
	net: thunderx: add nicvf_send_msg_to_pf result check for set_rx_mode_task
	nfp: bpf: fix code-gen bug on BPF_ALU | BPF_XOR | BPF_K
	nfp: bpf: fix ALU32 high bits clearance bug
	bnxt_en: Fix typo in firmware message timeout logic.
	bnxt_en: Wait longer for the firmware message response to complete.
	net: set static variable an initial value in atl2_probe()
	selftests: fib_tests: sleep after changing carrier. again.
	tmpfs: fix uninitialized return value in shmem_link
	stm class: Prevent division by zero
	nfit: acpi_nfit_ctl(): Check out_obj->type in the right place
	acpi/nfit: Fix bus command validation
	nfit/ars: Attempt a short-ARS whenever the ARS state is idle at boot
	nfit/ars: Attempt short-ARS even in the no_init_ars case
	libnvdimm/label: Clear 'updating' flag after label-set update
	libnvdimm, pfn: Fix over-trim in trim_pfn_device()
	libnvdimm/pmem: Honor force_raw for legacy pmem regions
	libnvdimm: Fix altmap reservation size calculation
	fix cgroup_do_mount() handling of failure exits
	crypto: aead - set CRYPTO_TFM_NEED_KEY if ->setkey() fails
	crypto: aegis - fix handling chunked inputs
	crypto: arm/crct10dif - revert to C code for short inputs
	crypto: arm64/aes-neonbs - fix returning final keystream block
	crypto: arm64/crct10dif - revert to C code for short inputs
	crypto: hash - set CRYPTO_TFM_NEED_KEY if ->setkey() fails
	crypto: morus - fix handling chunked inputs
	crypto: pcbc - remove bogus memcpy()s with src == dest
	crypto: skcipher - set CRYPTO_TFM_NEED_KEY if ->setkey() fails
	crypto: testmgr - skip crc32c context test for ahash algorithms
	crypto: x86/aegis - fix handling chunked inputs and MAY_SLEEP
	crypto: x86/aesni-gcm - fix crash on empty plaintext
	crypto: x86/morus - fix handling chunked inputs and MAY_SLEEP
	crypto: arm64/aes-ccm - fix logical bug in AAD MAC handling
	crypto: arm64/aes-ccm - fix bugs in non-NEON fallback routine
	CIFS: Do not reset lease state to NONE on lease break
	CIFS: Do not skip SMB2 message IDs on send failures
	CIFS: Fix read after write for files with read caching
	tracing: Use strncpy instead of memcpy for string keys in hist triggers
	tracing: Do not free iter->trace in fail path of tracing_open_pipe()
	tracing/perf: Use strndup_user() instead of buggy open-coded version
	xen: fix dom0 boot on huge systems
	ACPI / device_sysfs: Avoid OF modalias creation for removed device
	mmc: sdhci-esdhc-imx: fix HS400 timing issue
	mmc:fix a bug when max_discard is 0
	netfilter: ipt_CLUSTERIP: fix warning unused variable cn
	spi: ti-qspi: Fix mmap read when more than one CS in use
	spi: pxa2xx: Setup maximum supported DMA transfer length
	regulator: s2mps11: Fix steps for buck7, buck8 and LDO35
	regulator: max77620: Initialize values for DT properties
	regulator: s2mpa01: Fix step values for some LDOs
	clocksource/drivers/exynos_mct: Move one-shot check from tick clear to ISR
	clocksource/drivers/exynos_mct: Clear timer interrupt when shutdown
	clocksource/drivers/arch_timer: Workaround for Allwinner A64 timer instability
	s390/setup: fix early warning messages
	s390/virtio: handle find on invalid queue gracefully
	scsi: virtio_scsi: don't send sc payload with tmfs
	scsi: aacraid: Fix performance issue on logical drives
	scsi: sd: Optimal I/O size should be a multiple of physical block size
	scsi: target/iscsi: Avoid iscsit_release_commands_from_conn() deadlock
	scsi: qla2xxx: Fix LUN discovery if loop id is not assigned yet by firmware
	fs/devpts: always delete dcache dentry-s in dput()
	splice: don't merge into linked buffers
	ovl: During copy up, first copy up data and then xattrs
	ovl: Do not lose security.capability xattr over metadata file copy-up
	m68k: Add -ffreestanding to CFLAGS
	Btrfs: setup a nofs context for memory allocation at btrfs_create_tree()
	Btrfs: setup a nofs context for memory allocation at __btrfs_set_acl
	btrfs: ensure that a DUP or RAID1 block group has exactly two stripes
	Btrfs: fix corruption reading shared and compressed extents after hole punching
	soc: qcom: rpmh: Avoid accessing freed memory from batch API
	libertas_tf: don't set URB_ZERO_PACKET on IN USB transfer
	irqchip/gic-v3-its: Avoid parsing _indirect_ twice for Device table
	irqchip/brcmstb-l2: Use _irqsave locking variants in non-interrupt code
	x86/kprobes: Prohibit probing on optprobe template code
	cpufreq: kryo: Release OPP tables on module removal
	cpufreq: tegra124: add missing of_node_put()
	cpufreq: pxa2xx: remove incorrect __init annotation
	ext4: fix check of inode in swap_inode_boot_loader
	ext4: cleanup pagecache before swap i_data
	ext4: update quota information while swapping boot loader inode
	ext4: add mask of ext4 flags to swap
	ext4: fix crash during online resizing
	PCI/ASPM: Use LTR if already enabled by platform
	PCI/DPC: Fix print AER status in DPC event handling
	PCI: dwc: skip MSI init if MSIs have been explicitly disabled
	IB/hfi1: Close race condition on user context disable and close
	cxl: Wrap iterations over afu slices inside 'afu_list_lock'
	ext2: Fix underflow in ext2_max_size()
	clk: uniphier: Fix update register for CPU-gear
	clk: clk-twl6040: Fix imprecise external abort for pdmclk
	clk: samsung: exynos5: Fix possible NULL pointer exception on platform_device_alloc() failure
	clk: samsung: exynos5: Fix kfree() of const memory on setting driver_override
	clk: ingenic: Fix round_rate misbehaving with non-integer dividers
	clk: ingenic: Fix doc of ingenic_cgu_div_info
	usb: chipidea: tegra: Fix missed ci_hdrc_remove_device()
	usb: typec: tps6598x: handle block writes separately with plain-I2C adapters
	dmaengine: usb-dmac: Make DMAC system sleep callbacks explicit
	mm: hwpoison: fix thp split handing in soft_offline_in_use_page()
	mm/vmalloc: fix size check for remap_vmalloc_range_partial()
	mm/memory.c: do_fault: avoid usage of stale vm_area_struct
	kernel/sysctl.c: add missing range check in do_proc_dointvec_minmax_conv
	device property: Fix the length used in PROPERTY_ENTRY_STRING()
	intel_th: Don't reference unassigned outputs
	parport_pc: fix find_superio io compare code, should use equal test.
	i2c: tegra: fix maximum transfer size
	media: i2c: ov5640: Fix post-reset delay
	gpio: pca953x: Fix dereference of irq data in shutdown
	can: flexcan: FLEXCAN_IFLAG_MB: add () around macro argument
	drm/i915: Relax mmap VMA check
	bpf: only test gso type on gso packets
	serial: uartps: Fix stuck ISR if RX disabled with non-empty FIFO
	serial: 8250_of: assume reg-shift of 2 for mrvl,mmp-uart
	serial: 8250_pci: Fix number of ports for ACCES serial cards
	serial: 8250_pci: Have ACCES cards that use the four port Pericom PI7C9X7954 chip use the pci_pericom_setup()
	jbd2: clear dirty flag when revoking a buffer from an older transaction
	jbd2: fix compile warning when using JBUFFER_TRACE
	selinux: add the missing walk_size + len check in selinux_sctp_bind_connect
	security/selinux: fix SECURITY_LSM_NATIVE_LABELS on reused superblock
	powerpc/32: Clear on-stack exception marker upon exception return
	powerpc/wii: properly disable use of BATs when requested.
	powerpc/powernv: Make opal log only readable by root
	powerpc/83xx: Also save/restore SPRG4-7 during suspend
	powerpc/powernv: Don't reprogram SLW image on every KVM guest entry/exit
	powerpc: Fix 32-bit KVM-PR lockup and host crash with MacOS guest
	powerpc/ptrace: Simplify vr_get/set() to avoid GCC warning
	powerpc/hugetlb: Don't do runtime allocation of 16G pages in LPAR configuration
	powerpc/traps: fix recoverability of machine check handling on book3s/32
	powerpc/traps: Fix the message printed when stack overflows
	ARM: s3c24xx: Fix boolean expressions in osiris_dvs_notify
	arm64: Fix HCR.TGE status for NMI contexts
	arm64: debug: Ensure debug handlers check triggering exception level
	arm64: KVM: Fix architecturally invalid reset value for FPEXC32_EL2
	ipmi_si: fix use-after-free of resource->name
	dm: fix to_sector() for 32bit
	dm integrity: limit the rate of error messages
	mfd: sm501: Fix potential NULL pointer dereference
	cpcap-charger: generate events for userspace
	NFS: Fix I/O request leakages
	NFS: Fix an I/O request leakage in nfs_do_recoalesce
	NFS: Don't recoalesce on error in nfs_pageio_complete_mirror()
	nfsd: fix performance-limiting session calculation
	nfsd: fix memory corruption caused by readdir
	nfsd: fix wrong check in write_v4_end_grace()
	NFSv4.1: Reinitialise sequence results before retransmitting a request
	svcrpc: fix UDP on servers with lots of threads
	PM / wakeup: Rework wakeup source timer cancellation
	bcache: never writeback a discard operation
	stable-kernel-rules.rst: add link to networking patch queue
	vt: perform safe console erase in the right order
	x86/unwind/orc: Fix ORC unwind table alignment
	perf intel-pt: Fix CYC timestamp calculation after OVF
	perf tools: Fix split_kallsyms_for_kcore() for trampoline symbols
	perf auxtrace: Define auxtrace record alignment
	perf intel-pt: Fix overlap calculation for padding
	perf/x86/intel/uncore: Fix client IMC events return huge result
	perf intel-pt: Fix divide by zero when TSC is not available
	md: Fix failed allocation of md_register_thread
	tpm/tpm_crb: Avoid unaligned reads in crb_recv()
	tpm: Unify the send callback behaviour
	rcu: Do RCU GP kthread self-wakeup from softirq and interrupt
	media: imx: prpencvf: Stop upstream before disabling IDMA channel
	media: lgdt330x: fix lock status reporting
	media: uvcvideo: Avoid NULL pointer dereference at the end of streaming
	media: vimc: Add vimc-streamer for stream control
	media: imx: csi: Disable CSI immediately after last EOF
	media: imx: csi: Stop upstream before disabling IDMA channel
	drm/fb-helper: generic: Fix drm_fbdev_client_restore()
	drm/radeon/evergreen_cs: fix missing break in switch statement
	drm/amd/powerplay: correct power reading on fiji
	drm/amd/display: don't call dm_pp_ function from an fpu block
	KVM: Call kvm_arch_memslots_updated() before updating memslots
	KVM: x86/mmu: Detect MMIO generation wrap in any address space
	KVM: x86/mmu: Do not cache MMIO accesses while memslots are in flux
	KVM: nVMX: Sign extend displacements of VMX instr's mem operands
	KVM: nVMX: Apply addr size mask to effective address for VMX instructions
	KVM: nVMX: Ignore limit checks on VMX instructions using flat segments
	bcache: use (REQ_META|REQ_PRIO) to indicate bio for metadata
	s390/setup: fix boot crash for machine without EDAT-1
	Linux 4.19.31

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
2019-03-23 21:13:30 +01:00
Zhang, Jun
e97a32a5a3 rcu: Do RCU GP kthread self-wakeup from softirq and interrupt
commit 1d1f898df6 upstream.

The rcu_gp_kthread_wake() function is invoked when it might be necessary
to wake the RCU grace-period kthread.  Because self-wakeups are normally
a useless waste of CPU cycles, if rcu_gp_kthread_wake() is invoked from
this kthread, it naturally refuses to do the wakeup.

Unfortunately, natural though it might be, this heuristic fails when
rcu_gp_kthread_wake() is invoked from an interrupt or softirq handler
that interrupted the grace-period kthread just after the final check of
the wait-event condition but just before the schedule() call.  In this
case, a wakeup is required, even though the call to rcu_gp_kthread_wake()
is within the RCU grace-period kthread's context.  Failing to provide
this wakeup can result in grace periods failing to start, which in turn
results in out-of-memory conditions.

This race window is quite narrow, but it actually did happen during real
testing.  It would of course need to be fixed even if it was strictly
theoretical in nature.

This patch does not Cc stable because it does not apply cleanly to
earlier kernel versions.

Fixes: 48a7639ce8 ("rcu: Make callers awaken grace-period kthread")
Reported-by: "He, Bo" <bo.he@intel.com>
Co-developed-by: "Zhang, Jun" <jun.zhang@intel.com>
Co-developed-by: "He, Bo" <bo.he@intel.com>
Co-developed-by: "xiao, jin" <jin.xiao@intel.com>
Co-developed-by: Bai, Jie A <jie.a.bai@intel.com>
Signed-off: "Zhang, Jun" <jun.zhang@intel.com>
Signed-off: "He, Bo" <bo.he@intel.com>
Signed-off: "xiao, jin" <jin.xiao@intel.com>
Signed-off: Bai, Jie A <jie.a.bai@intel.com>
Signed-off-by: "Zhang, Jun" <jun.zhang@intel.com>
[ paulmck: Switch from !in_softirq() to "!in_interrupt() &&
  !in_serving_softirq() to avoid redundant wakeups and to also handle the
  interrupt-handler scenario as well as the softirq-handler scenario that
  actually occurred in testing. ]
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
Link: https://lkml.kernel.org/r/CD6925E8781EFD4D8E11882D20FC406D52A11F61@SHSMSX104.ccr.corp.intel.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-03-23 20:10:12 +01:00
Zev Weiss
93c8a44a82 kernel/sysctl.c: add missing range check in do_proc_dointvec_minmax_conv
commit 8cf7630b29 upstream.

This bug has apparently existed since the introduction of this function
in the pre-git era (4500e91754d3 in Thomas Gleixner's history.git,
"[NET]: Add proc_dointvec_userhz_jiffies, use it for proper handling of
neighbour sysctls.").

As a minimal fix we can simply duplicate the corresponding check in
do_proc_dointvec_conv().

Link: http://lkml.kernel.org/r/20190207123426.9202-3-zev@bewilderbeest.net
Signed-off-by: Zev Weiss <zev@bewilderbeest.net>
Cc: Brendan Higgins <brendanhiggins@google.com>
Cc: Iurii Zaikin <yzaikin@google.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Luis Chamberlain <mcgrof@kernel.org>
Cc: <stable@vger.kernel.org>	[2.6.2+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-03-23 20:10:04 +01:00
Jann Horn
24d5097655 tracing/perf: Use strndup_user() instead of buggy open-coded version
commit 83540fbc88 upstream.

The first version of this method was missing the check for
`ret == PATH_MAX`; then such a check was added, but it didn't call kfree()
on error, so there was still a small memory leak in the error case.
Fix it by using strndup_user() instead of open-coding it.

Link: http://lkml.kernel.org/r/20190220165443.152385-1-jannh@google.com

Cc: Ingo Molnar <mingo@kernel.org>
Cc: stable@vger.kernel.org
Fixes: 0eadcc7a7b ("perf/core: Fix perf_uprobe_init()")
Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org>
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-03-23 20:09:56 +01:00
zhangyi (F)
f27077e5f5 tracing: Do not free iter->trace in fail path of tracing_open_pipe()
commit e7f0c424d0 upstream.

Commit d716ff71dd ("tracing: Remove taking of trace_types_lock in
pipe files") use the current tracer instead of the copy in
tracing_open_pipe(), but it forget to remove the freeing sentence in
the error path.

There's an error path that can call kfree(iter->trace) after the iter->trace
was assigned to tr->current_trace, which would be bad to free.

Link: http://lkml.kernel.org/r/1550060946-45984-1-git-send-email-yi.zhang@huawei.com

Cc: stable@vger.kernel.org
Fixes: d716ff71dd ("tracing: Remove taking of trace_types_lock in pipe files")
Signed-off-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-03-23 20:09:56 +01:00
Tom Zanussi
ebca08d7e8 tracing: Use strncpy instead of memcpy for string keys in hist triggers
commit 9f0bbf3115 upstream.

Because there may be random garbage beyond a string's null terminator,
it's not correct to copy the the complete character array for use as a
hist trigger key.  This results in multiple histogram entries for the
'same' string key.

So, in the case of a string key, use strncpy instead of memcpy to
avoid copying in the extra bytes.

Before, using the gdbus entries in the following hist trigger as an
example:

  # echo 'hist:key=comm' > /sys/kernel/debug/tracing/events/sched/sched_waking/trigger
  # cat /sys/kernel/debug/tracing/events/sched/sched_waking/hist

  ...

  { comm: ImgDecoder #4                      } hitcount:        203
  { comm: gmain                              } hitcount:        213
  { comm: gmain                              } hitcount:        216
  { comm: StreamTrans #73                    } hitcount:        221
  { comm: mozStorage #3                      } hitcount:        230
  { comm: gdbus                              } hitcount:        233
  { comm: StyleThread#5                      } hitcount:        253
  { comm: gdbus                              } hitcount:        256
  { comm: gdbus                              } hitcount:        260
  { comm: StyleThread#4                      } hitcount:        271

  ...

  # cat /sys/kernel/debug/tracing/events/sched/sched_waking/hist | egrep gdbus | wc -l
  51

After:

  # cat /sys/kernel/debug/tracing/events/sched/sched_waking/hist | egrep gdbus | wc -l
  1

Link: http://lkml.kernel.org/r/50c35ae1267d64eee975b8125e151e600071d4dc.1549309756.git.tom.zanussi@linux.intel.com

Cc: Namhyung Kim <namhyung@kernel.org>
Cc: stable@vger.kernel.org
Fixes: 79e577cbce ("tracing: Support string type key properly")
Signed-off-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-03-23 20:09:56 +01:00
Al Viro
7a8b048430 fix cgroup_do_mount() handling of failure exits
commit 399504e21a upstream.

same story as with last May fixes in sysfs (7b745a4e40
"unfuck sysfs_mount()"); new_sb is left uninitialized
in case of early errors in kernfs_mount_ns() and papering
over it by treating any error from kernfs_mount_ns() as
equivalent to !new_ns ends up conflating the cases when
objects had never been transferred to a superblock with
ones when that has happened and resulting new superblock
had been dropped.  Easily fixed (same way as in sysfs
case).  Additionally, there's a superblock leak on
kernfs_node_dentry() failure *and* a dentry leak inside
kernfs_node_dentry() itself - the latter on probably
impossible errors, but the former not impossible to trigger
(as the matter of fact, injecting allocation failures
at that point *does* trigger it).

Cc: stable@kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-03-23 20:09:53 +01:00
Alban Crequy
02f8211b75 bpf, lpm: fix lookup bug in map_delete_elem
[ Upstream commit 7c0cdf0b39 ]

trie_delete_elem() was deleting an entry even though it was not matching
if the prefixlen was correct. This patch adds a check on matchlen.

Reproducer:

$ sudo bpftool map create /sys/fs/bpf/mylpm type lpm_trie key 8 value 1 entries 128 name mylpm flags 1
$ sudo bpftool map update pinned /sys/fs/bpf/mylpm key hex 10 00 00 00 aa bb cc dd value hex 01
$ sudo bpftool map dump pinned /sys/fs/bpf/mylpm
key: 10 00 00 00 aa bb cc dd  value: 01
Found 1 element
$ sudo bpftool map delete pinned /sys/fs/bpf/mylpm key hex 10 00 00 00 ff ff ff ff
$ echo $?
0
$ sudo bpftool map dump pinned /sys/fs/bpf/mylpm
Found 0 elements

A similar reproducer is added in the selftests.

Without the patch:

$ sudo ./tools/testing/selftests/bpf/test_lpm_map
test_lpm_map: test_lpm_map.c:485: test_lpm_delete: Assertion `bpf_map_delete_elem(map_fd, key) == -1 && errno == ENOENT' failed.
Aborted

With the patch: test_lpm_map runs without errors.

Fixes: e454cf5958 ("bpf: Implement map_delete_elem for BPF_MAP_TYPE_LPM_TRIE")
Cc: Craig Gallek <kraig@google.com>
Signed-off-by: Alban Crequy <alban@kinvolk.io>
Acked-by: Craig Gallek <kraig@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-03-23 20:09:51 +01:00
Alexei Starovoitov
c7c68a1b9a bpf: fix lockdep false positive in stackmap
[ Upstream commit 3defaf2f15 ]

Lockdep warns about false positive:
[   11.211460] ------------[ cut here ]------------
[   11.211936] DEBUG_LOCKS_WARN_ON(depth <= 0)
[   11.211985] WARNING: CPU: 0 PID: 141 at ../kernel/locking/lockdep.c:3592 lock_release+0x1ad/0x280
[   11.213134] Modules linked in:
[   11.214954] RIP: 0010:lock_release+0x1ad/0x280
[   11.223508] Call Trace:
[   11.223705]  <IRQ>
[   11.223874]  ? __local_bh_enable+0x7a/0x80
[   11.224199]  up_read+0x1c/0xa0
[   11.224446]  do_up_read+0x12/0x20
[   11.224713]  irq_work_run_list+0x43/0x70
[   11.225030]  irq_work_run+0x26/0x50
[   11.225310]  smp_irq_work_interrupt+0x57/0x1f0
[   11.225662]  irq_work_interrupt+0xf/0x20

since rw_semaphore is released in a different task vs task that locked the sema.
It is expected behavior.
Fix the warning with up_read_non_owner() and rwsem_release() annotation.

Fixes: bae77c5eb5 ("bpf: enable stackmap with build_id in nmi context")
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-03-23 20:09:48 +01:00
Suren Baghdasaryan
617a4ba0ec FROMLIST: psi: introduce psi monitor
Psi monitor aims to provide a low-latency short-term pressure
detection mechanism configurable by users. It allows users to
monitor psi metrics growth and trigger events whenever a metric
raises above user-defined threshold within user-defined time window.

Time window and threshold are both expressed in usecs. Multiple psi
resources with different thresholds and window sizes can be monitored
concurrently.

Psi monitors activate when system enters stall state for the monitored
psi metric and deactivate upon exit from the stall state. While system
is in the stall state psi signal growth is monitored at a rate of 10 times
per tracking window. Min window size is 500ms, therefore the min monitoring
interval is 50ms. Max window size is 10s with monitoring interval of 1s.

When activated psi monitor stays active for at least the duration of one
tracking window to avoid repeated activations/deactivations when psi
signal is bouncing.

Notifications to the users are rate-limited to one per tracking window.

Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

(not upstream yet, latest version published at: https://lore.kernel.org/patchwork/patch/1052418/)

Bug: 127712811
Bug: 129157727
Test: lmkd in PSI mode
Change-Id: I860049d32420485346ad545c4650f990fe0c08e3
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
2019-03-22 23:07:14 +00:00
Suren Baghdasaryan
3a905dc573 FROMLIST: refactor header includes to allow kthread.h inclusion in psi_types.h
kthread.h can't be included in psi_types.h because it creates a circular
inclusion with kthread.h eventually including psi_types.h and complaining
on kthread structures not being defined because they are defined further
in the kthread.h. Resolve this by removing psi_types.h inclusion from the
headers included from kthread.h.

Signed-off-by: Suren Baghdasaryan <surenb@google.com>

(not upstream yet, latest version published at: https://lore.kernel.org/patchwork/patch/1052417/)

Bug: 127712811
Bug: 129157727
Test: lmkd in PSI mode
Change-Id: I88cd99f41534f0b9df18043cde8d1ee54aaa93de
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
2019-03-22 23:07:04 +00:00
Suren Baghdasaryan
23c32cf595 FROMLIST: psi: track changed states
Introduce changed_states parameter into collect_percpu_times to track
the states changed since the last update.

Signed-off-by: Suren Baghdasaryan <surenb@google.com>

(not upstream yet, latest version published at: https://lore.kernel.org/patchwork/patch/1052420/)

Bug: 127712811
Bug: 129157727
Test: lmkd in PSI mode
Change-Id: I944b024cd65e8520a57097bf5a3d7b2c01605bd0
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
2019-03-22 23:06:55 +00:00
Suren Baghdasaryan
f270022469 FROMLIST: psi: split update_stats into parts
Split update_stats into collect_percpu_times and update_averages for
collect_percpu_times to be reused later inside psi monitor.

Signed-off-by: Suren Baghdasaryan <surenb@google.com>

(not upstream yet, latest version published at: https://lore.kernel.org/patchwork/patch/1052419/)

Bug: 127712811
Bug: 129157727
Test: lmkd in PSI mode
Change-Id: Ia9cfed8964fd57e41098fca285a2be0252fd5277
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
2019-03-22 23:06:46 +00:00
Suren Baghdasaryan
c6e18d9458 FROMLIST: psi: rename psi fields in preparation for psi trigger addition
Renaming psi_group structure member fields used for calculating psi totals
and averages for clear distinction between them and trigger-related fields
that will be added next.

Signed-off-by: Suren Baghdasaryan <surenb@google.com>

(not upstream yet, latest version published at: https://lore.kernel.org/patchwork/patch/1052416/)

Bug: 127712811
Bug: 129157727
Test: lmkd in PSI mode
Change-Id: I579a60e0915fa8fedaa508357d3d1aefab9428c4
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
2019-03-22 23:06:38 +00:00
Suren Baghdasaryan
18d15b1861 FROMLIST: psi: make psi_enable static
psi_enable is not used outside of psi.c, make it static.

Suggested-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Suren Baghdasaryan <surenb@google.com>

(not upstream yet, latest version published at: https://lore.kernel.org/patchwork/patch/1052415/)

Bug: 127712811
Bug: 129157727
Test: lmkd in PSI mode
Change-Id: I249c6d2271f93a7975f1622faf2d2b4196b701bc
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
2019-03-22 23:06:29 +00:00
Suren Baghdasaryan
ada57da3b1 FROMLIST: psi: introduce state_mask to represent stalled psi states
The psi monitoring patches will need to determine the same states as
record_times().  To avoid calculating them twice, maintain a state mask
that can be consulted cheaply.  Do this in a separate patch to keep the
churn in the main feature patch at a minimum.

This adds 4-byte state_mask member into psi_group_cpu struct which results
in its first cacheline-aligned part becoming 52 bytes long.  Add explicit
values to enumeration element counters that affect psi_group_cpu struct
size.

Link: http://lkml.kernel.org/r/20190124211518.244221-4-surenb@google.com
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Dennis Zhou <dennis@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>

(not upstream yet, latest version published at: https://lore.kernel.org/patchwork/patch/1052414/)

Bug: 127712811
Bug: 129157727
Test: lmkd in PSI mode
Change-Id: I38a1ca3d5c9e6cc3ba39e88c6a9af29ecdc0df5b
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
2019-03-22 23:06:13 +00:00
Johannes Weiner
9f79143ebb UPSTREAM: kernel: cgroup: add poll file operation
Cgroup has a standardized poll/notification mechanism for waking all
pollers on all fds when a filesystem node changes.  To allow polling for
custom events, add a .poll callback that can override the default.

This is in preparation for pollable cgroup pressure files which have
per-fd trigger configurations.

Link: http://lkml.kernel.org/r/20190124211518.244221-3-surenb@google.com
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Cc: Dennis Zhou <dennis@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>

(cherry picked from commit: dc50537bdd)

Bug: 127712811
Test: lmkd in PSI mode
Change-Id: Idc648e7b7b7bd5fc00c7b32163e55a93b0f49a98
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
2019-03-21 16:25:28 -07:00
Johannes Weiner
ec350213df UPSTREAM: psi: avoid divide-by-zero crash inside virtual machines
We've been seeing hard-to-trigger psi crashes when running inside VM
instances:

    divide error: 0000 [#1] SMP PTI
    Modules linked in: [...]
    CPU: 0 PID: 212 Comm: kworker/0:2 Not tainted 4.16.18-119_fbk9_3817_gfe944c98d695 #119
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 0.0.0 02/06/2015
    Workqueue: events psi_clock
    RIP: 0010:psi_update_stats+0x270/0x490
    RSP: 0018:ffffc90001117e10 EFLAGS: 00010246
    RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffff8800a35a13f8
    RDX: 0000000000000000 RSI: ffff8800a35a1340 RDI: 0000000000000000
    RBP: 0000000000000658 R08: ffff8800a35a1470 R09: 0000000000000000
    R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
    R13: 0000000000000000 R14: 0000000000000000 R15: 00000000000f8502
    FS:  0000000000000000(0000) GS:ffff88023fc00000(0000) knlGS:0000000000000000
    CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 00007fbe370fa000 CR3: 00000000b1e3a000 CR4: 00000000000006f0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    Call Trace:
     psi_clock+0x12/0x50
     process_one_work+0x1e0/0x390
     worker_thread+0x2b/0x3c0
     ? rescuer_thread+0x330/0x330
     kthread+0x113/0x130
     ? kthread_create_worker_on_cpu+0x40/0x40
     ? SyS_exit_group+0x10/0x10
     ret_from_fork+0x35/0x40
    Code: 48 0f 47 c7 48 01 c2 45 85 e4 48 89 16 0f 85 e6 00 00 00 4c 8b 49 10 4c 8b 51 08 49 69 d9 f2 07 00 00 48 6b c0 64 4c 8b 29 31 d2 <48> f7 f7 49 69 d5 8d 06 00 00 48 89 c5 4c 69 f0 00 98 0b 00 48

The Code-line points to `period` being 0 inside update_stats(), and we
divide by that when calculating that period's pressure percentage.

The elapsed period should never be 0.  The reason this can happen is due
to an off-by-one in the idle time / missing period calculation combined
with a coarse sched_clock() in the virtual machine.

The target time for aggregation is advanced into the future on a fixed
grid to prevent clock drift.  So when an aggregation runs after some idle
period, we can not just set it to "now + psi_period", but have to
calculate the downtime and advance the target time relative to itself.

However, if the aggregator was disabled exactly one psi_period (ns), we
drop one idle period in the calculation due to a > when we should do >=.
In that case, next_update will be advanced from 'now - psi_period' to
'now' when it should be moved to 'now + psi_period'.  The run finishes
with last_update == next_update == sched_clock().

With hardware clocks, this exact nanosecond match isn't likely in the
first place; but if it does happen, the clock will still have moved on and
the period non-zero by the time the worker runs.  A pointlessly short
period, but besides the extra work, no harm no foul.  However, a slow
sched_clock() like we have on VMs might not have advanced either by the
time the worker runs again.  And when we calculate the elapsed period, the
result, our pressure divisor, will be 0.  Ouch.

Fix this by correctly handling the situation when the elapsed time between
aggregation runs is precisely two periods, and advance the expiration
timestamp correctly to period into the future.

Link: http://lkml.kernel.org/r/20190214193157.15788-1-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reported-by: Łukasz Siudut <lsiudut@fb.com
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

(cherry picked from commit 4e37504d1c)

Bug: 127712811
Test: lmkd in PSI mode
Change-Id: I40917c84354f9f32259c6703f00b6b1d21f45f02
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
2019-03-21 16:25:27 -07:00
Johannes Weiner
2a070382c9 UPSTREAM: psi: fix aggregation idle shut-off
psi has provisions to shut off the periodic aggregation worker when
there is a period of no task activity - and thus no data that needs
aggregating.  However, while developing psi monitoring, Suren noticed
that the aggregation clock currently won't stay shut off for good.

Debugging this revealed a flaw in the idle design: an aggregation run
will see no task activity and decide to go to sleep; shortly thereafter,
the kworker thread that executed the aggregation will go idle and cause
a scheduling change, during which the psi callback will kick the
!pending worker again.  This will ping-pong forever, and is equivalent
to having no shut-off logic at all (but with more code!)

Fix this by exempting aggregation workers from psi's clock waking logic
when the state change is them going to sleep.  To do this, tag workers
with the last work function they executed, and if in psi we see a worker
going to sleep after aggregating psi data, we will not reschedule the
aggregation work item.

What if the worker is also executing other items before or after?

Any psi state times that were incurred by work items preceding the
aggregation work will have been collected from the per-cpu buckets
during the aggregation itself.  If there are work items following the
aggregation work, the worker's last_func tag will be overwritten and the
aggregator will be kept alive to process this genuine new activity.

If the aggregation work is the last thing the worker does, and we decide
to go idle, the brief period of non-idle time incurred between the
aggregation run and the kworker's dequeue will be stranded in the
per-cpu buckets until the clock is woken by later activity.  But that
should not be a problem.  The buckets can hold 4s worth of time, and
future activity will wake the clock with a 2s delay, giving us 2s worth
of data we can leave behind when disabling aggregation.  If it takes a
worker more than two seconds to go idle after it finishes its last work
item, we likely have bigger problems in the system, and won't notice one
sample that was averaged with a bogus per-CPU weight.

Link: http://lkml.kernel.org/r/20190116193501.1910-1-hannes@cmpxchg.org
Fixes: eb414681d5 ("psi: pressure stall information for CPU, memory, and IO")
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reported-by: Suren Baghdasaryan <surenb@google.com>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

(cherry picked from commit 1b69ac6b40)

Bug: 127712811
Test: lmkd in PSI mode
Change-Id: I2877fec3d381b1006b8bd1261895fdfd68bd21db
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
2019-03-21 16:25:27 -07:00
Johannes Weiner
3bbcbc8039 UPSTREAM: psi: make disabling/enabling easier for vendor kernels
Mel Gorman reports a hackbench regression with psi that would prohibit
shipping the suse kernel with it default-enabled, but he'd still like
users to be able to opt in at little to no cost to others.

With the current combination of CONFIG_PSI and the psi_disabled bool set
from the commandline, this is a challenge.  Do the following things to
make it easier:

1. Add a config option CONFIG_PSI_DEFAULT_DISABLED that allows distros
   to enable CONFIG_PSI in their kernel but leave the feature disabled
   unless a user requests it at boot-time.

   To avoid double negatives, rename psi_disabled= to psi=.

2. Make psi_disabled a static branch to eliminate any branch costs
   when the feature is disabled.

In terms of numbers before and after this patch, Mel says:

: The following is a comparision using CONFIG_PSI=n as a baseline against
: your patch and a vanilla kernel
:
:                          4.20.0-rc4             4.20.0-rc4             4.20.0-rc4
:                 kconfigdisable-v1r1                vanilla        psidisable-v1r1
: Amean     1       1.3100 (   0.00%)      1.3923 (  -6.28%)      1.3427 (  -2.49%)
: Amean     3       3.8860 (   0.00%)      4.1230 *  -6.10%*      3.8860 (  -0.00%)
: Amean     5       6.8847 (   0.00%)      8.0390 * -16.77%*      6.7727 (   1.63%)
: Amean     7       9.9310 (   0.00%)     10.8367 *  -9.12%*      9.9910 (  -0.60%)
: Amean     12     16.6577 (   0.00%)     18.2363 *  -9.48%*     17.1083 (  -2.71%)
: Amean     18     26.5133 (   0.00%)     27.8833 *  -5.17%*     25.7663 (   2.82%)
: Amean     24     34.3003 (   0.00%)     34.6830 (  -1.12%)     32.0450 (   6.58%)
: Amean     30     40.0063 (   0.00%)     40.5800 (  -1.43%)     41.5087 (  -3.76%)
: Amean     32     40.1407 (   0.00%)     41.2273 (  -2.71%)     39.9417 (   0.50%)
:
: It's showing that the vanilla kernel takes a hit (as the bisection
: indicated it would) and that disabling PSI by default is reasonably
: close in terms of performance for this particular workload on this
: particular machine so;

Link: http://lkml.kernel.org/r/20181127165329.GA29728@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Tested-by: Mel Gorman <mgorman@techsingularity.net>
Reported-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

(cherry picked from commit e0c274472d)

Bug: 127712811
Test: lmkd in PSI mode
Change-Id: I6cb666fa351e8901df82e4d6931bfec0c5ce230d
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
2019-03-21 16:25:27 -07:00
Olof Johansson
b822a6da85 UPSTREAM: kernel/sched/psi.c: simplify cgroup_move_task()
The existing code triggered an invalid warning about 'rq' possibly being
used uninitialized.  Instead of doing the silly warning suppression by
initializa it to NULL, refactor the code to bail out early instead.

Warning was:

  kernel/sched/psi.c: In function `cgroup_move_task':
  kernel/sched/psi.c:639:13: warning: `rq' may be used uninitialized in this function [-Wmaybe-uninitialized]

Link: http://lkml.kernel.org/r/20181103183339.8669-1-olof@lixom.net
Fixes: 2ce7135adc ("psi: cgroup support")
Signed-off-by: Olof Johansson <olof@lixom.net>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

(cherry picked from commit 8fcb2312d1)

Bug: 127712811
Test: lmkd in PSI mode
Change-Id: Id989da224a726082e0cfa5d5d9460bf63d448a93
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
2019-03-21 16:25:27 -07:00
Johannes Weiner
dc9cd29ded UPSTREAM: psi: cgroup support
On a system that executes multiple cgrouped jobs and independent
workloads, we don't just care about the health of the overall system, but
also that of individual jobs, so that we can ensure individual job health,
fairness between jobs, or prioritize some jobs over others.

This patch implements pressure stall tracking for cgroups.  In kernels
with CONFIG_PSI=y, cgroup2 groups will have cpu.pressure, memory.pressure,
and io.pressure files that track aggregate pressure stall times for only
the tasks inside the cgroup.

Link: http://lkml.kernel.org/r/20180828172258.3185-10-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Daniel Drake <drake@endlessm.com>
Tested-by: Suren Baghdasaryan <surenb@google.com>
Cc: Christopher Lameter <cl@linux.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Johannes Weiner <jweiner@fb.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Enderborg <peter.enderborg@sony.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Vinayak Menon <vinmenon@codeaurora.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

(cherry picked from commit 2ce7135adc)

Bug: 127712811
Test: lmkd in PSI mode
Change-Id: I163e6657aaa60aa5aab9372616a3bce2a65e90ec
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
2019-03-21 16:25:27 -07:00
Johannes Weiner
e550f94252 UPSTREAM: psi: pressure stall information for CPU, memory, and IO
When systems are overcommitted and resources become contended, it's hard
to tell exactly the impact this has on workload productivity, or how close
the system is to lockups and OOM kills.  In particular, when machines work
multiple jobs concurrently, the impact of overcommit in terms of latency
and throughput on the individual job can be enormous.

In order to maximize hardware utilization without sacrificing individual
job health or risk complete machine lockups, this patch implements a way
to quantify resource pressure in the system.

A kernel built with CONFIG_PSI=y creates files in /proc/pressure/ that
expose the percentage of time the system is stalled on CPU, memory, or IO,
respectively.  Stall states are aggregate versions of the per-task delay
accounting delays:

       cpu: some tasks are runnable but not executing on a CPU
       memory: tasks are reclaiming, or waiting for swapin or thrashing cache
       io: tasks are waiting for io completions

These percentages of walltime can be thought of as pressure percentages,
and they give a general sense of system health and productivity loss
incurred by resource overcommit.  They can also indicate when the system
is approaching lockup scenarios and OOMs.

To do this, psi keeps track of the task states associated with each CPU
and samples the time they spend in stall states.  Every 2 seconds, the
samples are averaged across CPUs - weighted by the CPUs' non-idle time to
eliminate artifacts from unused CPUs - and translated into percentages of
walltime.  A running average of those percentages is maintained over 10s,
1m, and 5m periods (similar to the loadaverage).

[hannes@cmpxchg.org: doc fixlet, per Randy]
  Link: http://lkml.kernel.org/r/20180828205625.GA14030@cmpxchg.org
[hannes@cmpxchg.org: code optimization]
  Link: http://lkml.kernel.org/r/20180907175015.GA8479@cmpxchg.org
[hannes@cmpxchg.org: rename psi_clock() to psi_update_work(), per Peter]
  Link: http://lkml.kernel.org/r/20180907145404.GB11088@cmpxchg.org
[hannes@cmpxchg.org: fix build]
  Link: http://lkml.kernel.org/r/20180913014222.GA2370@cmpxchg.org
Link: http://lkml.kernel.org/r/20180828172258.3185-9-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Daniel Drake <drake@endlessm.com>
Tested-by: Suren Baghdasaryan <surenb@google.com>
Cc: Christopher Lameter <cl@linux.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Johannes Weiner <jweiner@fb.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Enderborg <peter.enderborg@sony.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vinayak Menon <vinmenon@codeaurora.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

(cherry picked from commit eb414681d5)

Bug: 127712811
Test: lmkd in PSI mode
Change-Id: Id00d23c977169b0c4636d92016fc1fee0274be05
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
2019-03-21 16:25:27 -07:00
Johannes Weiner
8cd88f5398 UPSTREAM: sched: introduce this_rq_lock_irq()
do_sched_yield() disables IRQs, looks up this_rq() and locks it.  The next
patch is adding another site with the same pattern, so provide a
convenience function for it.

Link: http://lkml.kernel.org/r/20180828172258.3185-8-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Suren Baghdasaryan <surenb@google.com>
Tested-by: Daniel Drake <drake@endlessm.com>
Cc: Christopher Lameter <cl@linux.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Johannes Weiner <jweiner@fb.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Enderborg <peter.enderborg@sony.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vinayak Menon <vinmenon@codeaurora.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

(cherry picked from commit 246b3b3342)

Bug: 127712811
Test: lmkd in PSI mode
Change-Id: I24b42cff1624c80633f116b7cb485564f53a30a7
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
2019-03-21 16:25:27 -07:00
Johannes Weiner
cdda3cf652 UPSTREAM: sched: sched.h: make rq locking and clock functions available in stats.h
kernel/sched/sched.h includes "stats.h" half-way through the file.  The
next patch introduces users of sched.h's rq locking functions and
update_rq_clock() in kernel/sched/stats.h.  Move those definitions up in
the file so they are available in stats.h.

Link: http://lkml.kernel.org/r/20180828172258.3185-7-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Suren Baghdasaryan <surenb@google.com>
Tested-by: Daniel Drake <drake@endlessm.com>
Cc: Christopher Lameter <cl@linux.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Johannes Weiner <jweiner@fb.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Enderborg <peter.enderborg@sony.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vinayak Menon <vinmenon@codeaurora.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

(cherry picked from commit 1f351d7f75)

Bug: 127712811
Test: lmkd in PSI mode
Change-Id: Id342e0ba9a62b49e64f2ce8b87f883ea70230b2f
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
2019-03-21 16:25:27 -07:00