linux

mirror of https://github.com/torvalds/linux.git synced 2026-06-08 22:52:35 +02:00

Author	SHA1	Message	Date
Greg Kroah-Hartman	d885da678e	This is the 4.19.34 stable release -----BEGIN PGP SIGNATURE----- iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAlynu40ACgkQONu9yGCS aT5X6g//Wkfm/+qSZ0GhLDQkPniiH1QkvzhOmVrrxu+KB0qsiwsEl8Srw33ZVkJK LT8+IPGiG9jEGu9dj+BYXTIfy9ZvfSsEL2N6GhYwDSXP0fok2rUaHbZvv1IB2g4W afhGdNwNAUCJ/j1UrUsi+SAFJ+xWbVxFpGstd0cqM9IbKdEV7RIukvuKckHiKOKR qI8FxC+G2PAr+BtnETfk5/suPDJ7B3ZicDoMhiWJGxJ6dfFTVmkSmasSoPDaMiHm 4S3hN2lu+WTeRpRPPB17Dlk4MmIp0k+bGYBKAlaxAMCc/RZxvbT2pRYaMQbId2/L mNUfSnOQFGEAhlAPfb7wdbObphnyT34GhlkWfZBTrnhPO0/FomLOvU6xVdcNuakX Tv2JKfDzb+2ttcMZ+0T84Ru9RztoswFATSw8uFMVxW8oTS6MVWnHu96Kxfl7QO3J PdlIGcyqxSuWNE8OX1QVtdSruGZfwUDNs94S4nQJtkB8BViRwhGJlqaXuy4d9Wp6 fGlI2W6qhjyosi2wBSMTjh/ytk/jq0vfs+z2XjR2gAYssvB/SOLR/AlSVguWsDnf WaoFBkXvCbuPvPlo0TrLpl5RW5WlOtLXHE3Vr3dKp458wLwpf/OZBGoZiknp7DrF PzBZs2ie5tmyqTxbAygl7WkbQPJ682pd5R4nf5CY+zvUaOMZv1g= =Iuup -----END PGP SIGNATURE----- Merge 4.19.34 into android-4.19 Changes in 4.19.34 arm64: debug: Don't propagate UNKNOWN FAR into si_code for debug signals ext4: cleanup bh release code in ext4_ind_remove_space() tty/serial: atmel: Add is_half_duplex helper tty/serial: atmel: RS485 HD w/DMA: enable RX after TX is stopped CIFS: fix POSIX lock leak and invalid ptr deref h8300: use cc-cross-prefix instead of hardcoding h8300-unknown-linux- f2fs: fix to adapt small inline xattr space in __find_inline_xattr() f2fs: fix to avoid deadlock in f2fs_read_inline_dir() tracing: kdb: Fix ftdump to not sleep net/mlx5: Avoid panic when setting vport rate net/mlx5: Avoid panic when setting vport mac, getting vport config gpio: gpio-omap: fix level interrupt idling include/linux/relay.h: fix percpu annotation in struct rchan sysctl: handle overflow for file-max net: stmmac: Avoid sometimes uninitialized Clang warnings enic: fix build warning without CONFIG_CPUMASK_OFFSTACK libbpf: force fixdep compilation at the start of the build scsi: hisi_sas: Set PHY linkrate when disconnected scsi: hisi_sas: Fix a timeout race of driver internal 
and SMP IO iio: adc: fix warning in Qualcomm PM8xxx HK/XOADC driver x86/hyperv: Fix kernel panic when kexec on HyperV perf c2c: Fix c2c report for empty numa node mm/sparse: fix a bad comparison mm/cma.c: cma_declare_contiguous: correct err handling mm/page_ext.c: fix an imbalance with kmemleak mm, swap: bounds check swap_info array accesses to avoid NULL derefs mm,oom: don't kill global init via memory.oom.group memcg: killed threads should not invoke memcg OOM killer mm, mempolicy: fix uninit memory access mm/vmalloc.c: fix kernel BUG at mm/vmalloc.c:512! mm/slab.c: kmemleak no scan alien caches ocfs2: fix a panic problem caused by o2cb_ctl f2fs: do not use mutex lock in atomic context fs/file.c: initialize init_files.resize_wait page_poison: play nicely with KASAN cifs: use correct format characters dm thin: add sanity checks to thin-pool and external snapshot creation f2fs: fix to check inline_xattr_size boundary correctly cifs: Accept validate negotiate if server return NT_STATUS_NOT_SUPPORTED cifs: Fix NULL pointer dereference of devname netfilter: nf_tables: check the result of dereferencing base_chain->stats netfilter: conntrack: tcp: only close if RST matches exact sequence jbd2: fix invalid descriptor block checksum fs: fix guard_bio_eod to check for real EOD errors tools lib traceevent: Fix buffer overflow in arg_eval PCI/PME: Fix hotplug/sysfs remove deadlock in pcie_pme_remove() wil6210: check null pointer in _wil_cfg80211_merge_extra_ies mt76: fix a leaked reference by adding a missing of_node_put crypto: crypto4xx - add missing of_node_put after of_device_is_available crypto: cavium/zip - fix collision with generic cra_driver_name usb: chipidea: Grab the (legacy) USB PHY by phandle first powerpc/powernv/ioda: Fix locked_vm counting for memory used by IOMMU tables scsi: core: replace GFP_ATOMIC with GFP_KERNEL in scsi_scan.c kbuild: invoke syncconfig if include/config/auto.conf.cmd is missing powerpc/xmon: Fix opcode being uninitialized in 
print_insn_powerpc coresight: etm4x: Add support to enable ETMv4.2 serial: 8250_pxa: honor the port number from devicetree ARM: 8840/1: use a raw_spinlock_t in unwind iommu/io-pgtable-arm-v7s: Only kmemleak_ignore L2 tables powerpc/hugetlb: Handle mmap_min_addr correctly in get_unmapped_area callback btrfs: qgroup: Make qgroup async transaction commit more aggressive mmc: omap: fix the maximum timeout setting net: dsa: mv88e6xxx: Add lockdep classes to fix false positive splat e1000e: Fix -Wformat-truncation warnings mlxsw: spectrum: Avoid -Wformat-truncation warnings platform/x86: ideapad-laptop: Fix no_hw_rfkill_list for Lenovo RESCUER R720-15IKBN platform/mellanox: mlxreg-hotplug: Fix KASAN warning loop: set GENHD_FL_NO_PART_SCAN after blkdev_reread_part() IB/mlx4: Increase the timeout for CM cache clk: fractional-divider: check parent rate only if flag is set perf annotate: Fix getting source line failure ASoC: qcom: Fix of-node refcount unbalance in qcom_snd_parse_of() cpufreq: acpi-cpufreq: Report if CPU doesn't support boost technologies efi: cper: Fix possible out-of-bounds access s390/ism: ignore some errors during deregistration scsi: megaraid_sas: return error when create DMA pool failed scsi: fcoe: make use of fip_mode enum complete drm/amd/display: Clear stream->mode_changed after commit perf test: Fix failure of 'evsel-tp-sched' test on s390 mwifiex: don't advertise IBSS features without FW support perf report: Don't shadow inlined symbol with different addr range SoC: imx-sgtl5000: add missing put_device() media: ov7740: fix runtime pm initialization media: sh_veu: Correct return type for mem2mem buffer helpers media: s5p-jpeg: Correct return type for mem2mem buffer helpers media: rockchip/rga: Correct return type for mem2mem buffer helpers media: s5p-g2d: Correct return type for mem2mem buffer helpers media: mx2_emmaprp: Correct return type for mem2mem buffer helpers media: mtk-jpeg: Correct return type for mem2mem buffer helpers mt76: usb: do not 
run mt76u_queues_deinit twice xen/gntdev: Do not destroy context while dma-bufs are in use vfs: fix preadv64v2 and pwritev64v2 compat syscalls with offset == -1 HID: intel-ish-hid: avoid binding wrong ishtp_cl_device cgroup, rstat: Don't flush subtree root unless necessary jbd2: fix race when writing superblock leds: lp55xx: fix null deref on firmware load failure perf report: Add s390 diagnosic sampling descriptor size iwlwifi: pcie: fix emergency path ACPI / video: Refactor and fix dmi_is_desktop() selftests: skip seccomp get_metadata test if not real root kprobes: Prohibit probing on bsearch() kprobes: Prohibit probing on RCU debug routine netfilter: conntrack: fix cloned unconfirmed skb->_nfct race in __nf_conntrack_confirm ARM: 8833/1: Ensure that NEON code always compiles with Clang ARM: dts: meson8b: fix the Ethernet data line signals in eth_rgmii_pins ALSA: PCM: check if ops are defined before suspending PCM ath10k: fix shadow register implementation for WCN3990 usb: f_fs: Avoid crash due to out-of-scope stack ptr access sched/topology: Fix percpu data types in struct sd_data & struct s_data bcache: fix input overflow to cache set sysfs file io_error_halflife bcache: fix input overflow to sequential_cutoff bcache: fix potential div-zero error of writeback_rate_i_term_inverse bcache: improve sysfs_strtoul_clamp() genirq: Avoid summation loops for /proc/stat net: marvell: mvpp2: fix stuck in-band SGMII negotiation iw_cxgb4: fix srqidx leak during connection abort net: phy: consider latched link-down status in polling mode fbdev: fbmem: fix memory access if logo is bigger than the screen cdrom: Fix race condition in cdrom_sysctl_register drm: rcar-du: add missing of_node_put drm/amd/display: Don't re-program planes for DPMS changes drm/amd/display: Disconnect mpcc when changing tg perf/aux: Make perf_event accessible to setup_aux() e1000e: fix cyclic resets at link up with active tx e1000e: Exclude device from suspend direct complete optimization platform/x86: 
intel_pmc_core: Fix PCH IP sts reading i2c: of: Try to find an I2C adapter matching the parent staging: spi: mt7621: Add return code check on device_reset() iwlwifi: mvm: fix RFH config command with >=10 CPUs ASoC: fsl-asoc-card: fix object reference leaks in fsl_asoc_card_probe sched/debug: Initialize sd_sysctl_cpus if !CONFIG_CPUMASK_OFFSTACK efi/memattr: Don't bail on zero VA if it equals the region's PA sched/core: Use READ_ONCE()/WRITE_ONCE() in move_queued_task()/task_rq_lock() drm/vkms: Bugfix extra vblank frame ARM: dts: lpc32xx: Remove leading 0x and 0s from bindings notation efi/arm/arm64: Allow SetVirtualAddressMap() to be omitted soc: qcom: gsbi: Fix error handling in gsbi_probe() mt7601u: bump supported EEPROM version ARM: 8830/1: NOMMU: Toggle only bits in EXC_RETURN we are really care of ARM: avoid Cortex-A9 livelock on tight dmb loops block, bfq: fix in-service-queue check for queue merging bpf: fix missing prototype warnings selftests/bpf: skip verifier tests for unsupported program types powerpc/64s: Clear on-stack exception marker upon exception return cgroup/pids: turn cgroup_subsys->free() into cgroup_subsys->release() to fix the accounting backlight: pwm_bl: Use gpiod_get_value_cansleep() to get initial state tty: increase the default flip buffer limit to 2640K powerpc/pseries: Perform full re-add of CPU for topology update post-migration drm/amd/display: Enable vblank interrupt during CRC capture ALSA: dice: add support for Solid State Logic Duende Classic/Mini usb: dwc3: gadget: Fix OTG events when gadget driver isn't loaded platform/x86: intel-hid: Missing power button release on some Dell models perf script python: Use PyBytes for attr in trace-event-python perf script python: Add trace_context extension module to sys.modules media: mt9m111: set initial frame size other than 0x0 hwrng: virtio - Avoid repeated init of completion soc/tegra: fuse: Fix illegal free of IO base address HID: intel-ish: ipc: handle PIMR before ish_wakeup also 
clear PISR busy_clear bit f2fs: UBSAN: set boolean value iostat_enable correctly hpet: Fix missing '=' character in the __setup() code of hpet_mmap_enable cpu/hotplug: Mute hotplug lockdep during init dmaengine: imx-dma: fix warning comparison of distinct pointer types dmaengine: qcom_hidma: assign channel cookie correctly dmaengine: qcom_hidma: initialize tx flags in hidma_prep_dma_ netfilter: physdev: relax br_netfilter dependency media: rcar-vin: Allow independent VIN link enablement media: s5p-jpeg: Check for fmt_ver_flag when doing fmt enumeration regulator: act8865: Fix act8600_sudcdc_voltage_ranges setting pinctrl: meson: meson8b: add the eth_rxd2 and eth_rxd3 pins drm: Auto-set allow_fb_modifiers when given modifiers at plane init drm/nouveau: Stop using drm_crtc_force_disable x86/build: Specify elf_i386 linker emulation explicitly for i386 objects selinux: do not override context on context mounts brcmfmac: Use firmware_request_nowarn for the clm_blob wlcore: Fix memory leak in case wl12xx_fetch_firmware failure x86/build: Mark per-CPU symbols as absolute explicitly for LLD drm/fb-helper: fix leaks in error path of drm_fb_helper_fbdev_setup clk: meson: clean-up clock registration clk: rockchip: fix frac settings of GPLL clock for rk3328 dmaengine: tegra: avoid overflow of byte tracking Input: soc_button_array - fix mapping of the 5th GPIO in a PNP0C40 device drm/dp/mst: Configure no_stop_bit correctly for remote i2c xfers net: stmmac: Avoid one more sometimes uninitialized Clang warning ACPI / video: Extend chassis-type detection with a "Lunch Box" check bcache: fix potential div-zero error of writeback_rate_p_term_inverse kprobes/x86: Blacklist non-attachable interrupt functions Linux 4.19.34 Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>	2019-04-05 22:43:09 +02:00
Valentin Schneider	fba4c61e98	cpu/hotplug: Mute hotplug lockdep during init [ Upstream commit `ce48c457b9` ] Since we've had: commit `cb538267ea` ("jump_label/lockdep: Assert we hold the hotplug lock for _cpuslocked() operations") we've been getting some lockdep warnings during init, such as on HiKey960: [ 0.820495] WARNING: CPU: 4 PID: 0 at kernel/cpu.c:316 lockdep_assert_cpus_held+0x3c/0x48 [ 0.820498] Modules linked in: [ 0.820509] CPU: 4 PID: 0 Comm: swapper/4 Tainted: G S 4.20.0-rc5-00051-g4cae42a #34 [ 0.820511] Hardware name: HiKey960 (DT) [ 0.820516] pstate: 600001c5 (nZCv dAIF -PAN -UAO) [ 0.820520] pc : lockdep_assert_cpus_held+0x3c/0x48 [ 0.820523] lr : lockdep_assert_cpus_held+0x38/0x48 [ 0.820526] sp : ffff00000a9cbe50 [ 0.820528] x29: ffff00000a9cbe50 x28: 0000000000000000 [ 0.820533] x27: 00008000b69e5000 x26: ffff8000bff4cfe0 [ 0.820537] x25: ffff000008ba69e0 x24: 0000000000000001 [ 0.820541] x23: ffff000008fce000 x22: ffff000008ba70c8 [ 0.820545] x21: 0000000000000001 x20: 0000000000000003 [ 0.820548] x19: ffff00000a35d628 x18: ffffffffffffffff [ 0.820552] x17: 0000000000000000 x16: 0000000000000000 [ 0.820556] x15: ffff00000958f848 x14: 455f3052464d4d34 [ 0.820559] x13: 00000000769dde98 x12: ffff8000bf3f65a8 [ 0.820564] x11: 0000000000000000 x10: ffff00000958f848 [ 0.820567] x9 : ffff000009592000 x8 : ffff00000958f848 [ 0.820571] x7 : ffff00000818ffa0 x6 : 0000000000000000 [ 0.820574] x5 : 0000000000000000 x4 : 0000000000000001 [ 0.820578] x3 : 0000000000000000 x2 : 0000000000000001 [ 0.820582] x1 : 00000000ffffffff x0 : 0000000000000000 [ 0.820587] Call trace: [ 0.820591] lockdep_assert_cpus_held+0x3c/0x48 [ 0.820598] static_key_enable_cpuslocked+0x28/0xd0 [ 0.820606] arch_timer_check_ool_workaround+0xe8/0x228 [ 0.820610] arch_timer_starting_cpu+0xe4/0x2d8 [ 0.820615] cpuhp_invoke_callback+0xe8/0xd08 [ 0.820619] notify_cpu_starting+0x80/0xb8 [ 0.820625] secondary_start_kernel+0x118/0x1d0 We've also had a similar warning in sched_init_smp() for 
every asymmetric system that would enable the sched_asym_cpucapacity static key, although that was singled out in: commit `40fa3780ba` ("sched/core: Take the hotplug lock in sched_init_smp()") Those warnings are actually harmless, since we cannot have hotplug operations at the time they appear. Instead of starting to sprinkle useless hotplug lock operations in the init codepaths, mute the warnings until they start warning about real problems. Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Valentin Schneider <valentin.schneider@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will.deacon@arm.com> Cc: cai@gmx.us Cc: daniel.lezcano@linaro.org Cc: dietmar.eggemann@arm.com Cc: linux-arm-kernel@lists.infradead.org Cc: longman@redhat.com Cc: marc.zyngier@arm.com Cc: mark.rutland@arm.com Link: https://lkml.kernel.org/r/1545243796-23224-2-git-send-email-valentin.schneider@arm.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>	2019-04-05 22:33:14 +02:00
Oleg Nesterov	d0bc74c563	cgroup/pids: turn cgroup_subsys->free() into cgroup_subsys->release() to fix the accounting [ Upstream commit `51bee5abea` ] The only user of cgroup_subsys->free() callback is pids_cgrp_subsys which needs pids_free() to uncharge the pid. However, ->free() is called from __put_task_struct()->cgroup_free() and this is too late. Even the trivial program which does for (;;) { int pid = fork(); assert(pid >= 0); if (pid) wait(NULL); else exit(0); } can run out of limits because release_task()->call_rcu(delayed_put_task_struct) implies an RCU gp after the task/pid goes away and before the final put(). Test-case: mkdir -p /tmp/CG mount -t cgroup2 none /tmp/CG echo '+pids' > /tmp/CG/cgroup.subtree_control mkdir /tmp/CG/PID echo 2 > /tmp/CG/PID/pids.max perl -e 'while ($p = fork) { wait; } $p // die "fork failed: $!\n"' & echo $! > /tmp/CG/PID/cgroup.procs Without this patch the forking process fails soon after migration. Rename cgroup_subsys->free() to cgroup_subsys->release() and move the callsite into the new helper, cgroup_release(), called by release_task() which actually frees the pid(s). Reported-by: Herton R. Krzesinski <hkrzesin@redhat.com> Reported-by: Jan Stancek <jstancek@redhat.com> Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>	2019-04-05 22:33:13 +02:00
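The fork/wait loop quoted in the commit message above can be tried in bounded form. This is a minimal sketch, not the author's test program: the iteration limit, the function name `fork_wait_loop`, and the return value are assumptions added so the loop terminates and reports how many forks succeeded.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Bounded variant of the fork/wait loop from the commit message.
 * The iteration limit is an addition for this sketch; the original
 * loops forever. Returns the number of forks that succeeded and
 * whose child was reaped.
 */
int fork_wait_loop(int iterations)
{
	int ok = 0;

	assert(iterations >= 0);
	for (int i = 0; i < iterations; i++) {
		pid_t pid = fork();

		if (pid < 0)
			break;		/* fork failed, e.g. a pid limit was hit */
		if (pid == 0)
			_exit(0);	/* child: exit immediately */
		if (waitpid(pid, NULL, 0) == pid)
			ok++;		/* parent: reap before forking again */
	}
	return ok;
}
```

Run inside a cgroup with a small `pids.max`, as in the shell test case above, a return value lower than `iterations` would indicate that a fork was refused even though only one child exists at a time.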
Andrea Parri	e8e0bd4915	sched/core: Use READ_ONCE()/WRITE_ONCE() in move_queued_task()/task_rq_lock() [ Upstream commit `c546951d9c` ] move_queued_task() synchronizes with task_rq_lock() as follows: move_queued_task() task_rq_lock() [S] ->on_rq = MIGRATING [L] rq = task_rq() WMB (__set_task_cpu()) ACQUIRE (rq->lock); [S] ->cpu = new_cpu [L] ->on_rq where "[L] rq = task_rq()" is ordered before "ACQUIRE (rq->lock)" by an address dependency and, in turn, "ACQUIRE (rq->lock)" is ordered before "[L] ->on_rq" by the ACQUIRE itself. Use READ_ONCE() to load ->cpu in task_rq() (c.f., task_cpu()) to honor this address dependency. Also, mark the accesses to ->cpu and ->on_rq with READ_ONCE()/WRITE_ONCE() to comply with the LKMM. Signed-off-by: Andrea Parri <andrea.parri@amarulasolutions.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Alan Stern <stern@rowland.harvard.edu> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Paul E. McKenney <paulmck@linux.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will.deacon@arm.com> Link: https://lkml.kernel.org/r/20190121155240.27173-1-andrea.parri@amarulasolutions.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>	2019-04-05 22:33:12 +02:00
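The store order described in the commit message above can be sketched in userspace C. The macros below are simplified stand-ins for the kernel's `READ_ONCE()`/`WRITE_ONCE()` (plain volatile accesses, which is an assumption of this sketch), `struct task` is a hypothetical reduced task, and `__sync_synchronize()` merely stands in for the WMB; none of this is the kernel implementation.

```c
#include <assert.h>

/*
 * Simplified userspace stand-ins for the kernel macros. Assumption:
 * a volatile access is enough for this illustration. __typeof__ is a
 * GCC/Clang extension.
 */
#define READ_ONCE(x)		(*(const volatile __typeof__(x) *)&(x))
#define WRITE_ONCE(x, val)	(*(volatile __typeof__(x) *)&(x) = (val))

#define TASK_ON_RQ_MIGRATING 2	/* value chosen for the sketch */

struct task {			/* hypothetical, reduced task */
	int cpu;
	int on_rq;
};

/* Writer side, in the order given in the commit message. */
void move_task(struct task *p, int new_cpu)
{
	assert(p != 0);
	WRITE_ONCE(p->on_rq, TASK_ON_RQ_MIGRATING);	/* [S] ->on_rq = MIGRATING */
	__sync_synchronize();				/* stands in for the WMB */
	WRITE_ONCE(p->cpu, new_cpu);			/* [S] ->cpu = new_cpu */
}

/* Reader side: load ->cpu exactly once. */
int task_cpu(const struct task *p)
{
	return READ_ONCE(p->cpu);
}
```

The sketch only marks the accesses; the ordering argument on the reader side (address dependency, then ACQUIRE of `rq->lock`) is not modelled here.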
Hidetoshi Seto	f056c90f07	sched/debug: Initialize sd_sysctl_cpus if !CONFIG_CPUMASK_OFFSTACK [ Upstream commit `1ca4fa3ab6` ] register_sched_domain_sysctl() copies the cpu_possible_mask into sd_sysctl_cpus, but only if sd_sysctl_cpus hasn't already been allocated (ie, CONFIG_CPUMASK_OFFSTACK is set). However, when CONFIG_CPUMASK_OFFSTACK is not set, sd_sysctl_cpus is left uninitialized (all zeroes) and the kernel may fail to initialize sched_domain sysctl entries for all possible CPUs. This is visible to the user if the kernel is booted with maxcpus=n, or if ACPI tables have been modified to leave CPUs offline, and then checking for missing /proc/sys/kernel/sched_domain/cpu* entries. Fix this by separating the allocation and initialization, and adding a flag to initialize the possible CPU entries while system booting only. Tested-by: Syuuichirou Ishii <ishii.shuuichir@jp.fujitsu.com> Tested-by: Tarumizu, Kohei <tarumizu.kohei@jp.fujitsu.com> Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com> Acked-by: Joe Lawrence <joe.lawrence@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Masayoshi Mizuma <msys.mizuma@gmail.com> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/20190129151245.5073-1-msys.mizuma@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>	2019-04-05 22:33:11 +02:00
Mathieu Poirier	efd85d83ac	perf/aux: Make perf_event accessible to setup_aux() [ Upstream commit `840018668c` ] When pmu::setup_aux() is called the coresight PMU needs to know which sink to use for the session by looking up the information in the event's attr::config2 field. As such simply replace the cpu information by the complete perf_event structure and change all affected customers. Signed-off-by: Mathieu Poirier <mathieu.poirier@linaro.org> Reviewed-by: Suzuki Poulouse <suzuki.poulose@arm.com> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will.deacon@arm.com> Cc: linux-arm-kernel@lists.infradead.org Cc: linux-s390@vger.kernel.org Link: http://lkml.kernel.org/r/20190131184714.20388-2-mathieu.poirier@linaro.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>	2019-04-05 22:33:11 +02:00
Thomas Gleixner	1f3694865d	genirq: Avoid summation loops for /proc/stat [ Upstream commit `1136b07289` ] Waiman reported that on large systems with a large amount of interrupts the readout of /proc/stat takes a long time to sum up the interrupt statistics. In principle this is not a problem, but for unknown reasons some enterprise quality software reads /proc/stat with a high frequency. The reason for this is that interrupt statistics are accounted per cpu. So the /proc/stat logic has to sum up the interrupt stats for each interrupt. This can be largely avoided for interrupts which are not marked as 'PER_CPU' interrupts by simply adding a per interrupt summation counter which is incremented along with the per interrupt per cpu counter. The PER_CPU interrupts need to avoid that and use only per cpu accounting because they share the interrupt number and the interrupt descriptor and concurrent updates would conflict or require unwanted synchronization. Reported-by: Waiman Long <longman@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Waiman Long <longman@redhat.com> Reviewed-by: Marc Zyngier <marc.zyngier@arm.com> Reviewed-by: Davidlohr Bueso <dbueso@suse.de> Cc: Matthew Wilcox <willy@infradead.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Kees Cook <keescook@chromium.org> Cc: linux-fsdevel@vger.kernel.org Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Miklos Szeredi <miklos@szeredi.hu> Cc: Daniel Colascione <dancol@google.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Randy Dunlap <rdunlap@infradead.org> Link: https://lkml.kernel.org/r/20190208135020.925487496@linutronix.de 8<------------- v2: Undo the unintentional layout change of struct irq_desc. 
include/linux/irqdesc.h | 1 + kernel/irq/chip.c | 12 ++++++++++-- kernel/irq/internals.h | 8 +++++++- kernel/irq/irqdesc.c | 7 ++++++- 4 files changed, 24 insertions(+), 4 deletions(-) Signed-off-by: Sasha Levin <sashal@kernel.org>	2019-04-05 22:33:09 +02:00
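The counting scheme described in the genirq commit message can be sketched as follows. This is a single-threaded userspace illustration, not the kernel code: `struct irq_stat`, its fields, `NR_CPUS` and all function names are hypothetical, and the `is_per_cpu` flag stands in for interrupts marked 'PER_CPU'.

```c
#include <assert.h>

#define NR_CPUS 4	/* hypothetical CPU count for the sketch */

/* Hypothetical stand-in for an interrupt descriptor. */
struct irq_stat {
	unsigned int per_cpu[NR_CPUS];	/* per interrupt, per CPU counters */
	unsigned int total;		/* per interrupt summation counter */
	int is_per_cpu;			/* models a 'PER_CPU' interrupt */
};

/* Account one interrupt on the given CPU. */
void irq_account(struct irq_stat *s, int cpu)
{
	assert(cpu >= 0 && cpu < NR_CPUS);
	s->per_cpu[cpu]++;
	if (!s->is_per_cpu)
		s->total++;	/* PER_CPU interrupts skip the shared counter */
}

/* Slow path: walk every CPU and add up its counter. */
unsigned int irq_sum_slow(const struct irq_stat *s)
{
	unsigned int sum = 0;

	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		sum += s->per_cpu[cpu];
	return sum;
}

/* Readout: use the summation counter where it is maintained. */
unsigned int irq_sum(const struct irq_stat *s)
{
	return s->is_per_cpu ? irq_sum_slow(s) : s->total;
}
```

For ordinary interrupts the readout becomes a single load instead of a loop over all CPUs; only the PER_CPU case still pays for the walk.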
Luc Van Oostenryck	845d4849b6	sched/topology: Fix percpu data types in struct sd_data & struct s_data [ Upstream commit `99687cdbb3` ] The percpu members of struct sd_data and s_data are declared as: struct ... ** __percpu member; So their type is: __percpu pointer to pointer to struct ... But looking at how they're used, their type should be: pointer to __percpu pointer to struct ... and they should thus be declared as: struct ... * __percpu *member; So fix the placement of '__percpu' in the definition of these structures. This addresses a bunch of Sparse's warnings like: warning: incorrect type in initializer (different address spaces) expected void const [noderef] <asn:3> __vpp_verify got struct sched_domain ** Signed-off-by: Luc Van Oostenryck <luc.vanoostenryck@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/20190118144936.79158-1-luc.vanoostenryck@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>	2019-04-05 22:33:09 +02:00
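The two placements of the annotation can be compared in a small standalone file. This is a sketch under assumptions: `__percpu` is defined away here because it only matters to a checker such as Sparse, the struct bodies are placeholders, and the second declaration assumes the annotation sits between the two stars, as the "pointer to __percpu pointer to struct" description implies.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Assumption for this sketch: outside a checker run the annotation has
 * no effect on the compiler, so it is defined to nothing.
 */
#define __percpu

struct sched_domain { int level; };	/* placeholder body for the sketch */

struct sd_data_before {
	/* reads as: __percpu pointer to pointer to struct */
	struct sched_domain ** __percpu sd;
};

struct sd_data_after {
	/* reads as: pointer to __percpu pointer to struct */
	struct sched_domain * __percpu *sd;
};

/* Both spellings are the same plain C type once the annotation is gone. */
int same_plain_type(void)
{
	struct sd_data_before b = { NULL };
	struct sd_data_after a = { NULL };

	return _Generic(b.sd, struct sched_domain **: 1, default: 0) &&
	       _Generic(a.sd, struct sched_domain **: 1, default: 0);
}
```

Because the compiler sees identical types, the change only affects what a checker reports; the generated code is not expected to differ.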
Masami Hiramatsu	d53b295f78	kprobes: Prohibit probing on RCU debug routine [ Upstream commit `a39f15b964` ] Since kprobe itself depends on RCU, probing on RCU debug routine can cause recursive breakpoint bugs. Prohibit probing on RCU debug routines. int3 ->do_int3() ->ist_enter() ->RCU_LOCKDEP_WARN() ->debug_lockdep_rcu_enabled() -> int3 Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Andrea Righi <righi.andrea@gmail.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/154998807741.31052.11229157537816341591.stgit@devbox Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>	2019-04-05 22:33:08 +02:00
Tejun Heo	a74ebf047e	cgroup, rstat: Don't flush subtree root unless necessary [ Upstream commit `b4ff1b44bc` ] cgroup_rstat_cpu_pop_updated() is used to traverse the updated cgroups on flush. While it was only visiting updated ones in the subtree, it was visiting @root unconditionally. We can easily check whether @root is updated or not by looking at its ->updated_next just as with the cgroups in the subtree. * Remove the unnecessary cgroup_parent() test. The system root cgroup is never updated and thus its ->updated_next is always NULL. No need to test whether cgroup_parent() exists in addition to ->updated_next. * Terminate traverse if ->updated_next is NULL. This can only happen for subtree @root and there's no reason to visit it if it's not marked updated. This reduces cpu consumption when reading a lot of rstat backed files. In a micro benchmark reading stat from ~1600 cgroups, the sys time was lowered by >40%. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>	2019-04-05 22:33:06 +02:00
Christian Brauner	b227f15712	sysctl: handle overflow for file-max [ Upstream commit `32a5ad9c22` ] Currently, when writing echo 18446744073709551616 > /proc/sys/fs/file-max /proc/sys/fs/file-max will overflow and be set to 0. That quickly crashes the system. This commit sets the max and min value for file-max. The max value is set to long int. Any higher value cannot currently be used as the percpu counters are long ints and not unsigned integers. Note that the file-max value is ultimately parsed via __do_proc_doulongvec_minmax(). This function does not report error when min or max are exceeded. Which means if a value larger than long int is written userspace will not receive an error; instead the old value will be kept. There is an argument to be made that this should be changed and __do_proc_doulongvec_minmax() should return an error when a dedicated min or max value is exceeded. However this has the potential to break userspace so let's defer this to an RFC patch. Link: http://lkml.kernel.org/r/20190107222700.15954-3-christian@brauner.io Signed-off-by: Christian Brauner <christian@brauner.io> Acked-by: Kees Cook <keescook@chromium.org> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Dominik Brodowski <linux@dominikbrodowski.net> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Joe Lawrence <joe.lawrence@redhat.com> Cc: Luis Chamberlain <mcgrof@kernel.org> Cc: Waiman Long <longman@redhat.com> [christian@brauner.io: v4] Link: http://lkml.kernel.org/r/20190210203943.8227-3-christian@brauner.io Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>	2019-04-05 22:32:57 +02:00
Douglas Anderson	b73c7d0204	tracing: kdb: Fix ftdump to not sleep [ Upstream commit `31b265b3ba` ] As reported back in 2016-11 [1], the "ftdump" kdb command triggers a BUG for "sleeping function called from invalid context". kdb's "ftdump" command wants to call ring_buffer_read_prepare() in atomic context. A very simple solution for this is to add allocation flags to ring_buffer_read_prepare() so kdb can call it without triggering the allocation error. This patch does that. Note that in the original email thread about this, it was suggested that perhaps the solution for kdb was to either preallocate the buffer ahead of time or create our own iterator. I'm hoping that this alternative of adding allocation flags to ring_buffer_read_prepare() can be considered since it means I don't need to duplicate more of the core trace code into "trace_kdb.c" (for either creating my own iterator or re-preparing a ring allocator whose memory was already allocated). NOTE: another option for kdb is to actually figure out how to make it reuse the existing ftrace_dump() function and totally eliminate the duplication. This sounds very appealing and actually works (the "sr z" command can be seen to properly dump the ftrace buffer). The downside here is that ftrace_dump() fully consumes the trace buffer. Unless that is changed I'd rather not use it because it means "ftdump | grep xyz" won't be very useful to search the ftrace buffer since it will throw away the whole trace on the first grep. A future patch to dump only the last few lines of the buffer will also be hard to implement. [1] https://lkml.kernel.org/r/20161117191605.GA21459@google.com Link: http://lkml.kernel.org/r/20190308193205.213659-1-dianders@chromium.org Reported-by: Brian Norris <briannorris@chromium.org> Signed-off-by: Douglas Anderson <dianders@chromium.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Sasha Levin <sashal@kernel.org>	2019-04-05 22:32:56 +02:00
Greg Kroah-Hartman	0b065cd568	This is the 4.19.33 stable release -----BEGIN PGP SIGNATURE----- iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAlykNfcACgkQONu9yGCS aT40dRAAiCeYjEC1zH8dkAnbFlKo6IZuhKgISfVTgrWlRe9nUTYaenBXqAfGjufH EzXHrD1IANRAnFfWg8xt01TNBfTaiEYnYzFmJkWAHFWGKxa5fRU5Kan0MB97r8s9 NjoSRsnFl8l2oJI88zwFa7k89Itop9ST/zvZIgnrysAr+j8yEZb7BZWaU2UrKK/q qfnJxjfCb/jeqAxwVh3OkasXj0gG2JkGR/uEGTw2EARuI6pvKo5OCzYz0tXTN6ZJ CSzM4X7dhkGSgLIUw3JOCB28riK9TYbOdPr4MFYMrnoU5VL8+n62tXoKXewobJ1C 2+Pmg5E54r13Rr35eoGCiHsW2LGQrOyvm8S9TFB/0SmTPtzUjFlHBC62Vs/AW5ut HSmwVy+ILM/xTIdts5QT58Gw+5NCmHxw2oEdrgcct+6FtnR9XPOqZYyzH1tw1ZB+ DL2PqYyT9czuo2bKanWA37M8339q2INDFskXqKRokQ9GNiqUx1E6fNBqtK5SqyDI CdohtAs7xSQoPPbKDiITOmt82MM8xvefKmqTIvHN5B7Ns4lT1QC74DCXEFEKtw6M l+p64h6Qw4DiKmqna7fsbKPjmg8pg1lVrCOwD5iUF3JXy+Wxi/OKnr2gqAVMChAq GSfFFf+MMZhYzJPRQKxOn4GwDBDz5niaWvQmlPcPWLJ5jdbzl+Y= =Yyc/ -----END PGP SIGNATURE----- Merge 4.19.33 into android-4.19 Changes in 4.19.33 Bluetooth: Check L2CAP option sizes returned from l2cap_get_conf_opt Bluetooth: Verify that l2cap_get_conf_opt provides large enough buffer ipmi_si: Fix crash when using hard-coded device dccp: do not use ipv6 header for ipv4 flow genetlink: Fix a memory leak on error path gtp: change NET_UDP_TUNNEL dependency to select ipv6: make ip6_create_rt_rcu return ip6_null_entry instead of NULL mac8390: Fix mmio access size probe mISDN: hfcpci: Test both vendor & device ID for Digium HFC4S net: aquantia: fix rx checksum offload for UDP/TCP over IPv6 net: datagram: fix unbounded loop in __skb_try_recv_datagram() net/packet: Set __GFP_NOWARN upon allocation in alloc_pg_vec net: phy: meson-gxl: fix interrupt support net: rose: fix a possible stack overflow net: stmmac: fix memory corruption with large MTUs net-sysfs: call dev_hold if kobject_init_and_add success packets: Always register packet sk in the same order rhashtable: Still do rehash when we get EEXIST sctp: get sctphdr by offset in sctp_compute_cksum sctp: use memdup_user instead of 
vmemdup_user tcp: do not use ipv6 header for ipv4 flow tipc: allow service ranges to be connect()'ed on RDM/DGRAM tipc: change to check tipc_own_id to return in tipc_net_stop tipc: fix cancellation of topology subscriptions tun: properly test for IFF_UP vrf: prevent adding upper devices vxlan: Don't call gro_cells_destroy() before device is unregistered ila: Fix rhashtable walker list corruption net: sched: fix cleanup NULL pointer exception in act_mirr thunderx: enable page recycling for non-XDP case thunderx: eliminate extra calls to put_page() for pages held for recycling tun: add a missing rcu_read_unlock() in error path powerpc/fsl: Add infrastructure to fixup branch predictor flush powerpc/fsl: Add macro to flush the branch predictor powerpc/fsl: Emulate SPRN_BUCSR register powerpc/fsl: Add nospectre_v2 command line argument powerpc/fsl: Flush the branch predictor at each kernel entry (64bit) powerpc/fsl: Flush the branch predictor at each kernel entry (32 bit) powerpc/fsl: Flush branch predictor when entering KVM powerpc/fsl: Enable runtime patching if nospectre_v2 boot arg is used powerpc/fsl: Update Spectre v2 reporting powerpc/fsl: Fixed warning: orphan section `__btb_flush_fixup' powerpc/fsl: Fix the flush of branch predictor. powerpc/security: Fix spectre_v2 reporting Btrfs: fix incorrect file size after shrinking truncate and fsync btrfs: remove WARN_ON in log_dir_items btrfs: don't report readahead errors and don't update statistics btrfs: raid56: properly unmap parity page in finish_parity_scrub() btrfs: Avoid possible qgroup_rsv_size overflow in btrfs_calculate_inode_block_rsv_size Btrfs: fix assertion failure on fsync with NO_HOLES enabled ARM: imx6q: cpuidle: fix bug that CPU might not wake up at expected time powerpc: bpf: Fix generation of load/store DW instructions vfio: ccw: only free cp on final interrupt NFS: fix mount/umount race in nlmclnt. 
NFSv4.1 don't free interrupted slot on open net: dsa: qca8k: remove leftover phy accessors ALSA: rawmidi: Fix potential Spectre v1 vulnerability ALSA: seq: oss: Fix Spectre v1 vulnerability ALSA: pcm: Fix possible OOB access in PCM oss plugins ALSA: pcm: Don't suspend stream in unrecoverable PCM state ALSA: hda/realtek - Add support headset mode for DELL WYSE AIO ALSA: hda/realtek - Add support headset mode for New DELL WYSE NB ALSA: hda/realtek: Enable headset MIC of Acer AIO with ALC286 ALSA: hda/realtek: Enable headset MIC of Acer Aspire Z24-890 with ALC286 ALSA: hda/realtek - Add support for Acer Aspire E5-523G/ES1-432 headset mic ALSA: hda/realtek: Enable ASUS X441MB and X705FD headset MIC with ALC256 ALSA: hda/realtek: Enable headset mic of ASUS P5440FF with ALC256 ALSA: hda/realtek: Enable headset MIC of ASUS X430UN and X512DK with ALC256 ALSA: hda/realtek - Fix speakers on Acer Predator Helios 500 Ryzen laptops kbuild: modversions: Fix relative CRC byte order interpretation fs/open.c: allow opening only regular files during execve() ocfs2: fix inode bh swapping mixup in ocfs2_reflink_inodes_lock scsi: sd: Fix a race between closing an sd device and sd I/O scsi: sd: Quiesce warning if device does not report optimal I/O size scsi: zfcp: fix rport unblock if deleted SCSI devices on Scsi_Host scsi: zfcp: fix scsi_eh host reset with port_forced ERP for non-NPIV FCP devices drm/rockchip: vop: reset scale mode when win is disabled tty: mxs-auart: fix a potential NULL pointer dereference tty: atmel_serial: fix a potential NULL pointer dereference tty: serial: qcom_geni_serial: Initialize baud in qcom_geni_console_setup staging: comedi: ni_mio_common: Fix divide-by-zero for DIO cmdtest staging: speakup_soft: Fix alternate speech with other synths staging: vt6655: Remove vif check from vnt_interrupt staging: vt6655: Fix interrupt race condition on device start up. 
staging: erofs: fix to handle error path of erofs_vmap() serial: max310x: Fix to avoid potential NULL pointer dereference serial: mvebu-uart: Fix to avoid a potential NULL pointer dereference serial: sh-sci: Fix setting SCSCR_TIE while transferring data USB: serial: cp210x: add new device id USB: serial: ftdi_sio: add additional NovaTech products USB: serial: mos7720: fix mos_parport refcount imbalance on error path USB: serial: option: set driver_info for SIM5218 and compatibles USB: serial: option: add support for Quectel EM12 USB: serial: option: add Olicard 600 Disable kgdboc failed by echo space to /sys/module/kgdboc/parameters/kgdboc fs/proc/proc_sysctl.c: fix NULL pointer dereference in put_links drm/vgem: fix use-after-free when drm_gem_handle_create() fails drm/vkms: fix use-after-free when drm_gem_handle_create() fails drm/i915/gvt: Fix MI_FLUSH_DW parsing with correct index check gpio: exar: add a check for the return value of ida_simple_get fails gpio: adnp: Fix testing wrong value in adnp_gpio_direction_input phy: sun4i-usb: Support set_mode to USB_HOST for non-OTG PHYs usb: mtu3: fix EXTCON dependency USB: gadget: f_hid: fix deadlock in f_hidg_write() usb: common: Consider only available nodes for dr_mode usb: host: xhci-rcar: Add XHCI_TRUST_TX_LENGTH quirk xhci: Fix port resume done detection for SS ports with LPM enabled usb: xhci: dbc: Don't free all memory with spinlock held xhci: Don't let USB3 ports stuck in polling state prevent suspend usb: cdc-acm: fix race during wakeup blocking TX traffic mm: add support for kmem caches in DMA32 zone iommu/io-pgtable-arm-v7s: request DMA32 memory, and improve debugging mm: mempolicy: make mbind() return -EIO when MPOL_MF_STRICT is specified mm/migrate.c: add missing flush_dcache_page for non-mapped page migrate perf pmu: Fix parser error for uncore event alias perf intel-pt: Fix TSC slip objtool: Query pkg-config for libelf location powerpc/pseries/energy: Use OF accessor functions to read ibm,drc-indexes 
powerpc/64: Fix memcmp reading past the end of src/dest watchdog: Respect watchdog cpumask on CPU hotplug cpu/hotplug: Prevent crash when CPU bringup fails on CONFIG_HOTPLUG_CPU=n x86/smp: Enforce CONFIG_HOTPLUG_CPU when SMP=y KVM: Reject device ioctls from processes other than the VM's creator KVM: x86: update %rip after emulating IO KVM: x86: Emulate MSR_IA32_ARCH_CAPABILITIES on AMD hosts staging: erofs: fix error handling when failed to read compresssed data staging: erofs: keep corrupted fs from crashing kernel in erofs_readdir() bpf: do not restore dst_reg when cur_state is freed drivers: base: Helpers for adding device connection descriptions platform: x86: intel_cht_int33fe: Register all connections at once platform: x86: intel_cht_int33fe: Add connection for the DP alt mode platform: x86: intel_cht_int33fe: Add connections for the USB Type-C port usb: typec: class: Don't use port parent for getting mux handles platform: x86: intel_cht_int33fe: Remove the old connections for the muxes Linux 4.19.33 Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>	2019-04-03 06:53:19 +02:00
Xu Yu	f5959dec08	bpf: do not restore dst_reg when cur_state is freed commit `0803278b0b` upstream. Syzkaller hit 'KASAN: use-after-free Write in sanitize_ptr_alu' bug. Call trace: dump_stack+0xbf/0x12e print_address_description+0x6a/0x280 kasan_report+0x237/0x360 sanitize_ptr_alu+0x85a/0x8d0 adjust_ptr_min_max_vals+0x8f2/0x1ca0 adjust_reg_min_max_vals+0x8ed/0x22e0 do_check+0x1ca6/0x5d00 bpf_check+0x9ca/0x2570 bpf_prog_load+0xc91/0x1030 __se_sys_bpf+0x61e/0x1f00 do_syscall_64+0xc8/0x550 entry_SYSCALL_64_after_hwframe+0x49/0xbe Fault injection trace: kfree+0xea/0x290 free_func_state+0x4a/0x60 free_verifier_state+0x61/0xe0 push_stack+0x216/0x2f0 <- inject failslab sanitize_ptr_alu+0x2b1/0x8d0 adjust_ptr_min_max_vals+0x8f2/0x1ca0 adjust_reg_min_max_vals+0x8ed/0x22e0 do_check+0x1ca6/0x5d00 bpf_check+0x9ca/0x2570 bpf_prog_load+0xc91/0x1030 __se_sys_bpf+0x61e/0x1f00 do_syscall_64+0xc8/0x550 entry_SYSCALL_64_after_hwframe+0x49/0xbe When kzalloc() fails in push_stack(), free_verifier_state() will free current verifier state. As push_stack() returns, dst_reg was restored if ptr_is_dst_reg is false. However, as member of the cur_state, dst_reg is also freed, and error occurs when dereferencing dst_reg. Simply fix it by testing ret of push_stack() before restoring dst_reg. Fixes: `979d63d50c` ("bpf: prevent out of bounds speculation on pointer arithmetic") Signed-off-by: Xu Yu <xuyu@linux.alibaba.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2019-04-03 06:26:30 +02:00
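The ordering problem described in the bpf commit above (f5959dec08) can be illustrated with a small userspace model. This is only a sketch: `toy_state`, `toy_reg`, `toy_push_stack()` and `toy_sanitize()` are hypothetical stand-ins, not the verifier's real types or functions. It shows the one point the commit message makes: the saved register may only be written back when the push succeeded, because a failed push has already freed the state that holds it.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the verifier state and a register in it. */
struct toy_reg { long value; };
struct toy_state { struct toy_reg dst; };

/*
 * Models the failure behaviour described for push_stack(): when the
 * allocation fails, the current state is freed and false is returned.
 */
static bool toy_push_stack(struct toy_state **cur, bool inject_failure)
{
	struct toy_state *copy = inject_failure ? NULL : malloc(sizeof(*copy));

	if (!copy) {
		free(*cur);
		*cur = NULL;
		return false;
	}
	*copy = **cur;
	free(copy);	/* a real implementation would queue the copy instead */
	return true;
}

/* Returns 0 on success, -1 when the state was freed by a failed push. */
static int toy_sanitize(struct toy_state **cur, bool ptr_is_dst_reg,
			bool inject_failure)
{
	struct toy_reg *dst = &(*cur)->dst;
	struct toy_reg saved = *dst;
	bool ok;

	dst->value = 0;		/* temporary modification before the push */
	ok = toy_push_stack(cur, inject_failure);

	/* Test the result first: on failure, dst points into freed memory. */
	if (ok && !ptr_is_dst_reg)
		*dst = saved;

	return ok ? 0 : -1;
}
```

Restoring unconditionally (dropping the `ok &&` test) would reproduce the use-after-free write that KASAN reported.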
Thomas Gleixner	a56aa02e6f	cpu/hotplug: Prevent crash when CPU bringup fails on CONFIG_HOTPLUG_CPU=n commit `206b92353c` upstream. Tianyu reported a crash in a CPU hotplug teardown callback when booting a kernel which has CONFIG_HOTPLUG_CPU disabled with the 'nosmt' boot parameter. It turns out that the SMP=y CONFIG_HOTPLUG_CPU=n case has been broken forever in case that a bringup callback fails. Unfortunately this issue was not recognized when the CPU hotplug code was reworked, so the shortcoming just stayed in place. When a bringup callback fails, the CPU hotplug code rolls back the operation and takes the CPU offline. The 'nosmt' command line argument uses a bringup failure to abort the bringup of SMT sibling CPUs. This partial bringup is required due to the MCE misdesign on Intel CPUs. With CONFIG_HOTPLUG_CPU=y the rollback works perfectly fine, but CONFIG_HOTPLUG_CPU=n lacks essential mechanisms to exercise the low level teardown of a CPU including the synchronizations in various facilities like RCU, NOHZ and others. As a consequence the teardown callbacks which must be executed on the outgoing CPU within stop machine with interrupts disabled are executed on the control CPU in interrupt enabled and preemptible context causing the kernel to crash and burn. The pre state machine code has a different failure mode which is more subtle and resulting in a less obvious use after free crash because the control side frees resources which are still in use by the undead CPU. But this is not a x86 only problem. Any architecture which supports the SMP=y HOTPLUG_CPU=n combination suffers from the same issue. It's just less likely to be triggered because in 99.99999% of the cases all bringup callbacks succeed. 
The easy solution of making HOTPLUG_CPU mandatory for SMP is not working on all architectures as the following architectures have either no hotplug support at all or not all subarchitectures support it: alpha, arc, hexagon, openrisc, riscv, sparc (32bit), mips (partial). Crashing the kernel in such a situation is not an acceptable state either. Implement a minimal rollback variant by limiting the teardown to the point where all regular teardown callbacks have been invoked and leave the CPU in the 'dead' idle state. This has the following consequences: - the CPU is brought down to the point where the stop_machine takedown would happen. - the CPU stays there forever and is idle - The CPU is cleared in the CPU active mask, but not in the CPU online mask which is a legit state. - Interrupts are not forced away from the CPU - All facilities which only look at online mask would still see it, but that is the case during normal hotplug/unplug operations as well. It's just a (way) longer time frame. This will expose issues, which haven't been exposed before or only seldom, because now the normally transient state of being non active but online is a permanent state. In testing this already exposed an issue vs. work queues where the vmstat code schedules work on the almost dead CPU which ends up in an unbound workqueue and triggers 'preemptible context' warnings. This is not a problem of this change, it merely exposes an already existing issue. Still this is better than crashing fully without a chance to debug it. This is mainly intended as a workaround for those architectures which do not support HOTPLUG_CPU. All others should enforce HOTPLUG_CPU for SMP. 
Fixes: `2e1a3483ce` ("cpu/hotplug: Split out the state walk into functions") Reported-by: Tianyu Lan <Tianyu.Lan@microsoft.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Tianyu Lan <Tianyu.Lan@microsoft.com> Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Konrad Wilk <konrad.wilk@oracle.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Mukesh Ojha <mojha@codeaurora.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Jiri Kosina <jkosina@suse.cz> Cc: Rik van Riel <riel@surriel.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Micheal Kelley <michael.h.kelley@microsoft.com> Cc: "K. Y. Srinivasan" <kys@microsoft.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Borislav Petkov <bp@alien8.de> Cc: K. Y. Srinivasan <kys@microsoft.com> Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/20190326163811.503390616@linutronix.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2019-04-03 06:26:29 +02:00
Thomas Gleixner	336f6b23b5	watchdog: Respect watchdog cpumask on CPU hotplug commit `7dd4761711` upstream. The rework of the watchdog core to use cpu_stop_work broke the watchdog cpumask on CPU hotplug. The watchdog_enable/disable() functions are now called unconditionally from the hotplug callback, i.e. even on CPUs which are not in the watchdog cpumask. As a consequence the watchdog can become unstoppable. Only invoke them when the plugged CPU is in the watchdog cpumask. Fixes: `9cf57731b6` ("watchdog/softlockup: Replace "watchdog/%u" threads with cpu_stop_work") Reported-by: Maxime Coquelin <maxime.coquelin@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Maxime Coquelin <maxime.coquelin@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Don Zickus <dzickus@redhat.com> Cc: Ricardo Neri <ricardo.neri-calderon@linux.intel.com> Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/alpine.DEB.2.21.1903262245490.1789@nanos.tec.linutronix.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2019-04-03 06:26:29 +02:00
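The rule stated in the watchdog commit above (336f6b23b5), "only invoke them when the plugged CPU is in the watchdog cpumask", can be sketched in userspace with a plain 64-bit mask. All names here (`toy_watchdog_mask`, `toy_enabled`, `toy_cpu_online()`, `toy_cpu_offline()`) are hypothetical; the kernel uses its own cpumask API and hotplug callbacks.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model: one bit per CPU, at most 64 CPUs. */
static uint64_t toy_watchdog_mask;	/* CPUs allowed to run the watchdog */
static uint64_t toy_enabled;		/* CPUs where the watchdog is running */

static bool toy_cpu_in_mask(unsigned int cpu)
{
	return cpu < 64 && ((toy_watchdog_mask >> cpu) & 1);
}

/* Hotplug "online" callback: enable only for CPUs in the mask. */
static void toy_cpu_online(unsigned int cpu)
{
	if (toy_cpu_in_mask(cpu))
		toy_enabled |= UINT64_C(1) << cpu;
}

/* Hotplug "offline" callback: same guard on the way down. */
static void toy_cpu_offline(unsigned int cpu)
{
	if (toy_cpu_in_mask(cpu))
		toy_enabled &= ~(UINT64_C(1) << cpu);
}
```

Without the guard in `toy_cpu_online()`, a CPU outside the mask would get a running watchdog that the mask-driven code paths never stop, which is the "unstoppable" behaviour the commit message describes.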
Greg Kroah-Hartman	6f994bf048	This is the 4.19.32 stable release -----BEGIN PGP SIGNATURE----- iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAlybBsMACgkQONu9yGCS aT6mnhAApfo3mX+F3z5Ikcx7LKQZkO7AbBO/PmjPmn2AQN/I77qlYZgv3jOTk0al 6Jk8reVS7PjKi+RDugku+xA5iaEkalFW/epS8MIp95yLEiPHLrEMmFmqd9Bbk9dy sPmQ1l5ZZ4h4mdlScxzIMKLDlWVB4w4Sk5zBl2zGwG/KiQ2zEWz+3Tfz6glQPu+o GGH9AL+9HBBVqtTlF63LBPvdz6er5NHTOrgHC7K4GXLnt9B8+kcObOjoLDtGqXNG tl1cQDOtMyMm64r+OTvfEwzIQ6shfcbxQTdLheJlJmCkIHTY5A9Xeyb9S1Opa/Xg k8zj03StKMTdqQfOHgbYdIVyHJ/nWmsRNIP5fDp8YGl91pBaHrqHCSLaYtFrrV6n yvHfl29e8QH8SYH/1VMXziFGTncqUO2/7NTmWZWJ+B/1oxHwSrvcIpo2q6mRaJwD i3XRnanvczvpefMaQPcUrI+aMUXPPeEytrGbqW2KuX/uxhXtV0jFB517JeY6UQw+ OqEiIRYx3FyQZDNvxUn66+Prr2wt4vOMK7WzV/PrH49/JmxPJypjSXjoRsbRxq8N hnD+JTK8mX6K2NgBwh2Ez2fnCQxPTbH12fk2NIRCVcOY8ZoiQud10mhyY9oyAkCj pq7X2US1W+Xml3Nn4XJHQg38rv7PrN0nFJ6Eib4EizoHzy0CFUk= =bYhI -----END PGP SIGNATURE----- Merge 4.19.32 into android-4.19 Changes in 4.19.32 ALSA: hda - add Lenovo IdeaCentre B550 to the power_save_blacklist ALSA: firewire-motu: use 'version' field of unit directory to identify model mmc: pxamci: fix enum type confusion mmc: mxcmmc: "Revert mmc: mxcmmc: handle highmem pages" mmc: renesas_sdhi: limit block count to 16 bit for old revisions drm/vmwgfx: Don't double-free the mode stored in par->set_mode drm/vmwgfx: Return 0 when gmrid::get_node runs out of ID's iommu/amd: fix sg->dma_address for sg->offset bigger than PAGE_SIZE libceph: wait for latest osdmap in ceph_monc_blacklist_add() udf: Fix crash on IO error during truncate mips: loongson64: lemote-2f: Add IRQF_NO_SUSPEND to "cascade" irqaction. 
MIPS: Ensure ELF appended dtb is relocated MIPS: Fix kernel crash for R6 in jump label branch function powerpc/vdso64: Fix CLOCK_MONOTONIC inconsistencies across Y2038 scsi: ibmvscsi: Protect ibmvscsi_head from concurrent modificaiton scsi: ibmvscsi: Fix empty event pool access during host removal futex: Ensure that futex address is aligned in handle_futex_death() cifs: allow guest mounts to work for smb3.11 perf probe: Fix getting the kernel map objtool: Move objtool_file struct off the stack irqchip/gic-v3-its: Fix comparison logic in lpi_range_cmp SMB3: Fix SMB3.1.1 guest mounts to Samba ALSA: x86: Fix runtime PM for hdmi-lpe-audio ALSA: hda/ca0132 - make pci_iounmap() call conditional ALSA: ac97: Fix of-node refcount unbalance ext4: fix NULL pointer dereference while journal is aborted ext4: fix data corruption caused by unaligned direct AIO ext4: brelse all indirect buffer in ext4_ind_remove_space() media: v4l2-ctrls.c/uvc: zero v4l2_event Bluetooth: hci_uart: Check if socket buffer is ERR_PTR in h4_recv_buf() Bluetooth: Fix decrementing reference count twice in releasing socket Bluetooth: hci_ldisc: Initialize hci_dev before open() Bluetooth: hci_ldisc: Postpone HCI_UART_PROTO_READY bit set in hci_uart_set_proto() drm: Reorder set_property_atomic to avoid returning with an active ww_ctx RDMA/cma: Rollback source IP address if failing to acquire device f2fs: fix to avoid deadlock of atomic file operations netfilter: ebtables: remove BUGPRINT messages loop: access lo_backing_file only when the loop device is Lo_bound x86/unwind: Handle NULL pointer calls better in frame unwinder x86/unwind: Add hardcoded ORC entry for NULL locking/lockdep: Add debug_locks check in __lock_downgrade() ALSA: hda - Record the current power state before suspend/resume calls ALSA: hda - Enforces runtime_resume after S3 and S4 for each codec power: supply: charger-manager: Fix incorrect return value Linux 4.19.32 Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>	2019-03-30 
08:40:51 +01:00
Quentin Perret	549020c814	ANDROID: sched: Disable find_best_target() by default Now that the mainline EAS wake-up path has been extended to cope with prefer-idle tasks, the need for a dedicated Android-specific wake-up routine (find_best_target()) becomes less clear. Indeed, the main reasons for introducing find_best_target() in the first place were: 1. the energy_diff function was very slow, so we couldn't afford to use it on all CPUs for each wake-up for latency reasons; 2. schedtune provides additional information about tasks (the prefer-idle flag in particular) which needed to be taken into account in the placement algorithm. Now that the energy diff calculation is much faster (with the simplified energy model) and that the EAS path is aware of prefer-idle tasks, there is no clear reason to use find_best_target() any more. So, let's disable it for now to minimize the amount of out-of-tree code used in the scheduler. If using the mainline path doesn't cause regressions, it is a good sign find_best_target() can be removed safely, eventually. Otherwise, reverting back to the old behaviour is trivial since this patch only changes the sched_feat default, but doesn't remove the fbt() code path. Bug: 120440300 Change-Id: Idb5d68a3c4af7d2212e0922ab6d9a089170b5e1c Signed-off-by: Quentin Perret <quentin.perret@arm.com>	2019-03-27 15:58:02 +00:00
Quentin Perret	d0eb1f3514	ANDROID: sched/fair: Make the EAS wake-up prefer-idle aware Make the mainline EAS wake-up path aware of prefer idle tasks in preparation for disabling find_best_target(). What is done in the mainline algorithm isn't strictly equivalent to the find_best_target() algorithm but comes very close, and isn't very invasive. The main differences with the original find_best_target() behaviour are the following: 1. the policy for prefer idle when there isn't a single idle CPU in the system is simpler now. We just pick the CPU with the highest spare capacity; 2. the cstate awareness for prefer idle is implemented by minimizing the exit latency rather than the idle state index. This is how it is done in the slow path (find_idlest_group_cpu()), it doesn't require us to keep hooks into CPUIdle, and should actually be better because what we want is a CPU that can wake up quickly; 3. non-prefer-idle tasks just use the standard mainline energy-aware wake-up path, which decides the placement using the Energy Model. Bug: 120440300 Change-Id: I57769c90c57115f6a28d27c5a88e08aa93a30a56 Signed-off-by: Quentin Perret <quentin.perret@arm.com>	2019-03-27 15:58:02 +00:00
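Points 1 and 2 of the prefer-idle commit above (d0eb1f3514) describe a simple selection policy, which might be sketched as follows. This is a userspace illustration under stated assumptions, not the scheduler code: `struct toy_cpu` and `toy_pick_prefer_idle_cpu()` are hypothetical, and the real path also has to consider affinity, capacity fit and other constraints that are left out here.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-CPU snapshot used only for this illustration. */
struct toy_cpu {
	bool idle;
	unsigned int exit_latency_us;	/* meaningful only when idle */
	unsigned long spare_capacity;	/* capacity minus current utilization */
};

/*
 * Prefer-idle placement as described in the commit message:
 *  - among idle CPUs, pick the one with the smallest exit latency;
 *  - if no CPU is idle, pick the one with the highest spare capacity.
 * Returns the chosen index, or -1 when n == 0.
 */
static int toy_pick_prefer_idle_cpu(const struct toy_cpu *cpus, size_t n)
{
	int best_idle = -1;
	int best_spare = -1;

	for (size_t i = 0; i < n; i++) {
		if (cpus[i].idle) {
			if (best_idle < 0 ||
			    cpus[i].exit_latency_us < cpus[best_idle].exit_latency_us)
				best_idle = (int)i;
		} else if (best_spare < 0 ||
			   cpus[i].spare_capacity > cpus[best_spare].spare_capacity) {
			best_spare = (int)i;
		}
	}

	return best_idle >= 0 ? best_idle : best_spare;
}
```

Comparing exit latency rather than the idle state index means two CPUs in differently numbered states but with the same wake-up cost are treated as equivalent, which is the property the commit message argues for.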
Waiman Long	0e0f7b3072	locking/lockdep: Add debug_locks check in __lock_downgrade() commit `7149258057` upstream. Tetsuo Handa had reported he saw an incorrect "downgrading a read lock" warning right after a previous lockdep warning. It is likely that the previous warning turned off lock debugging, leaving lockdep in an inconsistent state and leading to the lock downgrade warning. Fix that by adding a check for debug_locks at the beginning of __lock_downgrade(). Debugged-by: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Reported-by: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Reported-by: syzbot+53383ae265fb161ef488@syzkaller.appspotmail.com Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will.deacon@arm.com> Link: https://lkml.kernel.org/r/1547093005-26085-1-git-send-email-longman@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2019-03-27 14:14:43 +09:00
Chen Jie	36d52f5bcd	futex: Ensure that futex address is aligned in handle_futex_death() commit `5a07168d8d` upstream. The futex code requires that the user space addresses of futexes are 32bit aligned. sys_futex() checks this in futex_get_keys() but the robust list code has no alignment check in place. As a consequence the kernel crashes on architectures with strict alignment requirements in handle_futex_death() when trying to cmpxchg() on an unaligned futex address which was retrieved from the robust list. [ tglx: Rewrote changelog, proper sizeof() based alignment check and add comment ] Fixes: `0771dfefc9` ("[PATCH] lightweight robust futexes: core") Signed-off-by: Chen Jie <chenjie6@huawei.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: <dvhart@infradead.org> Cc: <peterz@infradead.org> Cc: <zengweilin@huawei.com> Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/1552621478-119787-1-git-send-email-chenjie6@huawei.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2019-03-27 14:14:40 +09:00
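The "sizeof() based alignment check" mentioned in the futex commit above (36d52f5bcd) amounts to rejecting any address that is not a multiple of the futex word size. A minimal userspace sketch, assuming a 32-bit futex word as the commit message states; `toy_futex_addr_ok()` is a hypothetical name, and the address is passed as an integer so the example never forms a misaligned pointer:

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Returns true when addr is suitably aligned for a 32-bit futex word.
 * Using sizeof() rather than a literal 4 keeps the check tied to the type.
 */
static bool toy_futex_addr_ok(uintptr_t addr)
{
	return (addr % sizeof(uint32_t)) == 0;
}
```

A caller in the position of the robust-list walker would bail out before attempting any atomic operation when this returns false, instead of letting a strict-alignment architecture fault on the access.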
Vincent Guittot	d36e8b820e	UPSTREAM: sched/pelt: Skip updating util_est when utilization is higher than CPU's capacity util_est is mainly meant to be a lower-bound for task utilization. That's why task_util_est() returns the actual util_avg when it's higher than the estimated utilization. With the new invariance signal and without any special check on samples collection, if a task is limited because of thermal capping for example, we could end up overestimating its utilization and thus perhaps generating an unwanted frequency spike when the capping is relaxed... and (even worse) it will take some more activations for the estimated utilization to converge back to the actual utilization. Since we cannot easily know if there is idle time in a CPU when a task completes an activation with a utilization higher than the CPU capacity, we skip the sampling when utilization is higher than CPU's capacity. Bug: 120440300 Change-Id: If1a6001451f80acb953e2a5f955fd302b1b73bc0 Suggested-by: Patrick Bellasi <patrick.bellasi@arm.com> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Morten.Rasmussen@arm.com Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: bsegall@google.com Cc: dietmar.eggemann@arm.com Cc: pjt@google.com Cc: pkondeti@codeaurora.org Cc: quentin.perret@arm.com Cc: rjw@rjwysocki.net Cc: srinivas.pandruvada@linux.intel.com Cc: thara.gopinath@linaro.org Link: https://lkml.kernel.org/r/1548257214-13745-4-git-send-email-vincent.guittot@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org> (cherry picked from commit `10a35e6812`) Signed-off-by: Quentin Perret <quentin.perret@arm.com>	2019-03-26 14:22:50 +00:00
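The rule in the util_est commit above (d36e8b820e), "skip the sampling when utilization is higher than CPU's capacity", can be modelled in a few lines. This is a sketch with hypothetical names (`struct toy_util_est`, `toy_util_est_update()`); the 1/4-weight moving average is an assumption chosen for illustration and is not taken from the commit message.

```c
/* Hypothetical model of a task's utilization estimate. */
struct toy_util_est {
	unsigned int enqueued;	/* last accepted sample */
	unsigned int ewma;	/* moving average of accepted samples */
};

/*
 * Record a new utilization sample at the end of an activation.
 * Samples above the CPU's capacity are dropped: the task may have been
 * capped (for example thermally), so the sample would overestimate it.
 */
static void toy_util_est_update(struct toy_util_est *ue,
				unsigned int task_util,
				unsigned int cpu_capacity)
{
	if (task_util > cpu_capacity)
		return;

	ue->enqueued = task_util;
	/* Assumed weighting: new sample contributes one quarter. */
	ue->ewma = (3 * ue->ewma + task_util) / 4;
}
```

Dropping the sample leaves the previous estimate untouched, so the estimate keeps acting as a lower bound and no inflated value has to decay away over later activations.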
Vincent Guittot	eb0db1782a	UPSTREAM: sched/fair: Update scale invariance of PELT The current implementation of load tracking invariance scales the contribution with current frequency and uarch performance (only for utilization) of the CPU. One main result of this formula is that the figures are capped by current capacity of CPU. Another one is that the load_avg is not invariant because not scaled with uarch. The util_avg of a periodic task that runs r time slots every p time slots varies in the range : U * (1-y^r)/(1-y^p) * y^i < Utilization < U * (1-y^r)/(1-y^p) with U is the max util_avg value = SCHED_CAPACITY_SCALE At a lower capacity, the range becomes: U * C * (1-y^r')/(1-y^p) * y^i' < Utilization < U * C * (1-y^r')/(1-y^p) with C reflecting the compute capacity ratio between current capacity and max capacity. so C tries to compensate changes in (1-y^r') but it can't be accurate. Instead of scaling the contribution value of PELT algo, we should scale the running time. The PELT signal aims to track the amount of computation of tasks and/or rq so it seems more correct to scale the running time to reflect the effective amount of computation done since the last update. In order to be fully invariant, we need to apply the same amount of running time and idle time whatever the current capacity. Because running at lower capacity implies that the task will run longer, we have to ensure that the same amount of idle time will be applied when system becomes idle and no idle time has been "stolen". But reaching the maximum utilization value (SCHED_CAPACITY_SCALE) means that the task is seen as an always-running task whatever the capacity of the CPU (even at max compute capacity). In this case, we can discard this "stolen" idle times which becomes meaningless. In order to achieve this time scaling, a new clock_pelt is created per rq. The increase of this clock scales with current capacity when something is running on rq and synchronizes with clock_task when rq is idle. 
With this mechanism, we ensure the same running and idle time whatever the current capacity. This also makes it possible to simplify the PELT algorithm by removing all references to uarch and frequency and applying the same contribution to utilization and loads. Furthermore, the scaling is done only once per update of clock (update_rq_clock_task()) instead of during each update of sched_entities and cfs/rt/dl_rq of the rq like the current implementation. This is interesting when cgroups are involved as shown in the results below: On a hikey (octo Arm64 platform). Performance cpufreq governor and only shallowest c-state to remove variance generated by those power features so we only track the impact of pelt algo. each test runs 16 times: ./perf bench sched pipe (higher is better) kernel tip/sched/core + patch ops/seconds ops/seconds diff cgroup root 59652(+/- 0.18%) 59876(+/- 0.24%) +0.38% level1 55608(+/- 0.27%) 55923(+/- 0.24%) +0.57% level2 52115(+/- 0.29%) 52564(+/- 0.22%) +0.86% hackbench -l 1000 (lower is better) kernel tip/sched/core + patch duration(sec) duration(sec) diff cgroup root 4.453(+/- 2.37%) 4.383(+/- 2.88%) -1.57% level1 4.859(+/- 8.50%) 4.830(+/- 7.07%) -0.60% level2 5.063(+/- 9.83%) 4.928(+/- 9.66%) -2.66% Then, the responsiveness of PELT is improved when CPU is not running at max capacity with this new algorithm. I have put below some examples of duration to reach some typical load values according to the capacity of the CPU with current implementation and with this patch. These values have been computed based on the geometric series and the half period value: Util (%) max capacity half capacity(mainline) half capacity(w/ patch) 972 (95%) 138ms not reachable 276ms 486 (47.5%) 30ms 138ms 60ms 256 (25%) 13ms 32ms 26ms On my hikey (octo Arm64 platform) with schedutil governor, the time to reach max OPP when starting from a null utilization, decreases from 223ms with current scale invariance down to 121ms with the new algorithm. 
Bug: 120440300 Change-Id: I0bd4ed2317f2a9a965634e53ce1476417af697a6 Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Morten.Rasmussen@arm.com Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: bsegall@google.com Cc: dietmar.eggemann@arm.com Cc: patrick.bellasi@arm.com Cc: pjt@google.com Cc: pkondeti@codeaurora.org Cc: quentin.perret@arm.com Cc: rjw@rjwysocki.net Cc: srinivas.pandruvada@linux.intel.com Cc: thara.gopinath@linaro.org Link: https://lkml.kernel.org/r/1548257214-13745-3-git-send-email-vincent.guittot@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org> (cherry picked from commit `2312729688`) Signed-off-by: Quentin Perret <quentin.perret@arm.com>	2019-03-26 14:22:50 +00:00
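The "duration to reach" figures in the PELT commit above (eb0db1782a) can be checked by hand. The working below is a sketch that treats the decaying sum as a continuous exponential and assumes the usual PELT half-life of T = 32 ms (the commit message only says "the half period value") together with U = SCHED_CAPACITY_SCALE = 1024; the symbols u, t and T are introduced here for illustration.

```latex
% Always-running task starting from zero utilization, at max capacity:
\[
  u(t) = U\left(1 - 2^{-t/T}\right)
  \quad\Longrightarrow\quad
  t(u) = T \log_2\frac{U}{U - u},
  \qquad U = 1024,\; T = 32\,\text{ms}.
\]
\begin{align*}
  t(972) &= 32 \log_2\frac{1024}{52}  \approx 32 \times 4.30 \approx 138\,\text{ms},\\
  t(486) &= 32 \log_2\frac{1024}{538} \approx 32 \times 0.93 \approx 30\,\text{ms},\\
  t(256) &= 32 \log_2\frac{1024}{768} \approx 32 \times 0.415 \approx 13\,\text{ms}.
\end{align*}
% Half capacity, mainline: the contribution is scaled by C = 1/2,
% so the signal saturates at U/2 = 512 and 972 cannot be reached.
\[
  u(t) = \frac{U}{2}\left(1 - 2^{-t/T}\right)
  \quad\Longrightarrow\quad
  t(486) = 32 \log_2\frac{512}{26} \approx 138\,\text{ms},
  \qquad
  t(256) = 32 \log_2\frac{512}{256} = 32\,\text{ms}.
\]
% Half capacity, with the patch: the PELT clock advances at half rate
% while running, so every duration roughly doubles.
\[
  t'(u) \approx 2\,t(u):\qquad
  t'(972) \approx 276\,\text{ms},\quad
  t'(486) \approx 60\,\text{ms},\quad
  t'(256) \approx 26\,\text{ms}.
\]
```

Under these assumptions the results agree with the table in the commit message up to rounding, including the "not reachable" entry for 972 at half capacity on mainline.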
Vincent Guittot	0dd28f4253	UPSTREAM: sched/fair: Move the rq_of() helper function Move rq_of() helper function so it can be used in pelt.c [ mingo: Improve readability while at it. ] Bug: 120440300 Change-Id: I2133979476631d68baaffcaa308f4cdab94f22b1 Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Morten.Rasmussen@arm.com Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: bsegall@google.com Cc: dietmar.eggemann@arm.com Cc: patrick.bellasi@arm.com Cc: pjt@google.com Cc: pkondeti@codeaurora.org Cc: quentin.perret@arm.com Cc: rjw@rjwysocki.net Cc: srinivas.pandruvada@linux.intel.com Cc: thara.gopinath@linaro.org Link: https://lkml.kernel.org/r/1548257214-13745-2-git-send-email-vincent.guittot@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org> (cherry picked from commit `62478d9911`) Signed-off-by: Quentin Perret <quentin.perret@arm.com>	2019-03-26 14:22:50 +00:00
Dietmar Eggemann	5bfd3ce4e0	UPSTREAM: sched/fair: Remove setting task's se->runnable_weight during PELT update A CFS (SCHED_OTHER, SCHED_BATCH or SCHED_IDLE policy) task's se->runnable_weight must always be in sync with its se->load.weight. se->runnable_weight is set to se->load.weight when the task is forked (init_entity_runnable_average()) or reniced (reweight_entity()). There are two cases in set_load_weight() which since they currently only set se->load.weight could lead to a situation in which se->load.weight is different to se->runnable_weight for a CFS task: (1) A task switches to SCHED_IDLE. (2) A SCHED_FIFO, SCHED_RR or SCHED_DEADLINE task which has been reniced (during which only its static priority gets set) switches to SCHED_OTHER or SCHED_BATCH. Set se->runnable_weight to se->load.weight in these two cases to prevent this. This eliminates the need to explicitly set it to se->load.weight during PELT updates in the CFS scheduler fastpath. Bug: 120440300 Change-Id: I52184a9e1fd53cb42ef3ae546b1fae78b744c9ad Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Joel Fernandes <joelaf@google.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Morten Rasmussen <morten.rasmussen@arm.com> Cc: Patrick Bellasi <patrick.bellasi@arm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Quentin Perret <quentin.perret@arm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vincent Guittot <vincent.guittot@linaro.org> Link: http://lkml.kernel.org/r/20180803140538.1178-1-dietmar.eggemann@arm.com Signed-off-by: Ingo Molnar <mingo@kernel.org> (cherry picked from commit `4a465e3ebb`) Signed-off-by: Quentin Perret <quentin.perret@arm.com>	2019-03-26 14:22:49 +00:00
Greg Kroah-Hartman	bb418a146a	This is the 4.19.31 stable release -----BEGIN PGP SIGNATURE----- iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAlyWhJcACgkQONu9yGCS aT6XzxAAzP2QGzC4SVPgcFH1woF/d8Cz0zQ81mLXzjXtEPm39fZCM2hbBnxkXLu1 peFyrKNk6/c9541D9gsQCQT6Fu+H6u1bJKcIezlKJ2xyB/MsU1hXkjZrTJYW3RRs gimy1EGdood2el1ubEBZiaspazoeRzBqtg1Nsmr4V0l+RT8HwtKKw+0+Nxixfp59 NoVkqTpPI5mL0FiH2R9ogcfg3SvgMZOsOhOBjdPvSjiJJsbvIWcW48MCs95XSUpF R+l/fWn+oiFCcIqBaFheujuqZMvVrUHZHaWAPMuoR/c3Cdf0lTBokdv6UM9c0nv3 61jX5r5ImRI/dfQANN5mbB1YKcs5xOI+I7QZHQ2q4clsWrWyLapXW4clrAZJ6z5t UVeVbuLV2y5PL9GJyBcXpyY0BOf4e2gZURaPY3C5McNwgybNoiR0ZePqKb8ZhZyh jYOYRoBjJJpZoVTSt6MNX95NTvGaSAtqKMu1s3IeMfpwCfQKBPMOuBHr/dUqSC6I U0xxjk/71C15dSPVcTVJT/lmcKc6TXgoagnfbn8GBtDOAjBNsYyUJLQI+db1ERCe 9MEB9k1Z87ROQ5jQCQmWsewOVAtFZBEvSszFmpKv3zTe8M2oFpXG56zckdiumwHU nSfeZTTeWzsFJd30MioEnGYm3ZwKwZx7wi0x4B4WWvBfSpp20Us= =xtLx -----END PGP SIGNATURE----- Merge 4.19.31 into android-4.19 Changes in 4.19.31 media: videobuf2-v4l2: drop WARN_ON in vb2_warn_zero_bytesused() 9p: use inode->i_lock to protect i_size_write() under 32-bit 9p/net: fix memory leak in p9_client_create ASoC: fsl_esai: fix register setting issue in RIGHT_J mode ASoC: codecs: pcm186x: fix wrong usage of DECLARE_TLV_DB_SCALE() ASoC: codecs: pcm186x: Fix energysense SLEEP bit iio: adc: exynos-adc: Fix NULL pointer exception on unbind mei: hbm: clean the feature flags on link reset mei: bus: move hw module get/put to probe/release stm class: Fix an endless loop in channel allocation crypto: caam - fix hash context DMA unmap size crypto: ccree - fix missing break in switch statement crypto: caam - fixed handling of sg list crypto: caam - fix DMA mapping of stack memory crypto: ccree - fix free of unallocated mlli buffer crypto: ccree - unmap buffer before copying IV crypto: ccree - don't copy zero size ciphertext crypto: cfb - add missing 'chunksize' property crypto: cfb - remove bogus memcpy() with src == dest crypto: ahash - fix another early termination in hash walk 
crypto: rockchip - fix scatterlist nents error crypto: rockchip - update new iv to device in multiple operations drm/imx: ignore plane updates on disabled crtcs gpu: ipu-v3: Fix i.MX51 CSI control registers offset drm/imx: imx-ldb: add missing of_node_puts gpu: ipu-v3: Fix CSI offsets for imx53 ASoC: rt5682: Correct the setting while select ASRC clk for AD/DA filter clocksource: timer-ti-dm: Fix pwm dmtimer usage of fck reparenting KVM: arm/arm64: vgic: Make vgic_dist->lpi_list_lock a raw_spinlock arm64: dts: rockchip: fix graph_port warning on rk3399 bob kevin and excavator s390/dasd: fix using offset into zero size array error Input: pwm-vibra - prevent unbalanced regulator Input: pwm-vibra - stop regulator after disabling pwm, not before ARM: dts: Configure clock parent for pwm vibra ARM: OMAP2+: Variable "reg" in function omap4_dsi_mux_pads() could be uninitialized ASoC: dapm: fix out-of-bounds accesses to DAPM lookup tables ASoC: rsnd: fixup rsnd_ssi_master_clk_start() user count check KVM: arm/arm64: Reset the VCPU without preemption and vcpu state loaded arm/arm64: KVM: Allow a VCPU to fully reset itself arm/arm64: KVM: Don't panic on failure to properly reset system registers KVM: arm/arm64: vgic: Always initialize the group of private IRQs KVM: arm64: Forbid kprobing of the VHE world-switch code ASoC: samsung: Prevent clk_get_rate() calls in atomic context ARM: OMAP2+: fix lack of timer interrupts on CPU1 after hotplug Input: cap11xx - switch to using set_brightness_blocking() Input: ps2-gpio - flush TX work when closing port Input: matrix_keypad - use flush_delayed_work() mac80211: call drv_ibss_join() on restart mac80211: Fix Tx aggregation session tear down with ITXQs netfilter: compat: initialize all fields in xt_init blk-mq: insert rq with DONTPREP to hctx dispatch list when requeue ipvs: fix dependency on nf_defrag_ipv6 floppy: check_events callback should not return a negative number xprtrdma: Make sure Send CQ is allocated on an existing compvec 
NFS: Don't use page_file_mapping after removing the page mm/gup: fix gup_pmd_range() for dax Revert "mm: use early_pfn_to_nid in page_ext_init" scsi: qla2xxx: Fix panic from use after free in qla2x00_async_tm_cmd net: dsa: bcm_sf2: potential array overflow in bcm_sf2_sw_suspend() x86/CPU: Add Icelake model number mm: page_alloc: fix ref bias in page_frag_alloc() for 1-byte allocs net: hns: Fix object reference leaks in hns_dsaf_roce_reset() i2c: cadence: Fix the hold bit setting i2c: bcm2835: Clear current buffer pointers and counts after a transfer auxdisplay: ht16k33: fix potential user-after-free on module unload Input: st-keyscan - fix potential zalloc NULL dereference clk: sunxi-ng: v3s: Fix TCON reset de-assert bit kallsyms: Handle too long symbols in kallsyms.c clk: sunxi: A31: Fix wrong AHB gate number esp: Skip TX bytes accounting when sending from a request socket ARM: 8824/1: fix a migrating irq bug when hotplug cpu bpf: only adjust gso_size on bytestream protocols bpf: fix lockdep false positive in stackmap af_key: unconditionally clone on broadcast ARM: 8835/1: dma-mapping: Clear DMA ops on teardown assoc_array: Fix shortcut creation keys: Fix dependency loop between construction record and auth key scsi: libiscsi: Fix race between iscsi_xmit_task and iscsi_complete_task net: systemport: Fix reception of BPDUs net: dsa: bcm_sf2: Do not assume DSA master supports WoL pinctrl: meson: meson8b: fix the sdxc_a data 1..3 pins qmi_wwan: apply SET_DTR quirk to Sierra WP7607 net: mv643xx_eth: disable clk on error path in mv643xx_eth_shared_probe() xfrm: Fix inbound traffic via XFRM interfaces across network namespaces mailbox: bcm-flexrm-mailbox: Fix FlexRM ring flush timeout issue ASoC: topology: free created components in tplg load error qed: Fix iWARP buffer size provided for syn packet processing. qed: Fix iWARP syn packet mac address validation. 
ARM: dts: armada-xp: fix Armada XP boards NAND description arm64: Relax GIC version check during early boot ARM: tegra: Restore DT ABI on Tegra124 Chromebooks net: marvell: mvneta: fix DMA debug warning mm: handle lru_add_drain_all for UP properly tmpfs: fix link accounting when a tmpfile is linked in ixgbe: fix older devices that do not support IXGBE_MRQC_L3L4TXSWEN ARCv2: lib: memcpy: fix doing prefetchw outside of buffer ARC: uacces: remove lp_start, lp_end from clobber list ARCv2: support manual regfile save on interrupts ARCv2: don't assume core 0x54 has dual issue phonet: fix building with clang mac80211_hwsim: propagate genlmsg_reply return code bpf, lpm: fix lookup bug in map_delete_elem net: thunderx: make CFG_DONE message to run through generic send-ack sequence net: thunderx: add nicvf_send_msg_to_pf result check for set_rx_mode_task nfp: bpf: fix code-gen bug on BPF_ALU | BPF_XOR | BPF_K nfp: bpf: fix ALU32 high bits clearance bug bnxt_en: Fix typo in firmware message timeout logic. bnxt_en: Wait longer for the firmware message response to complete. net: set static variable an initial value in atl2_probe() selftests: fib_tests: sleep after changing carrier. again. 
tmpfs: fix uninitialized return value in shmem_link stm class: Prevent division by zero nfit: acpi_nfit_ctl(): Check out_obj->type in the right place acpi/nfit: Fix bus command validation nfit/ars: Attempt a short-ARS whenever the ARS state is idle at boot nfit/ars: Attempt short-ARS even in the no_init_ars case libnvdimm/label: Clear 'updating' flag after label-set update libnvdimm, pfn: Fix over-trim in trim_pfn_device() libnvdimm/pmem: Honor force_raw for legacy pmem regions libnvdimm: Fix altmap reservation size calculation fix cgroup_do_mount() handling of failure exits crypto: aead - set CRYPTO_TFM_NEED_KEY if ->setkey() fails crypto: aegis - fix handling chunked inputs crypto: arm/crct10dif - revert to C code for short inputs crypto: arm64/aes-neonbs - fix returning final keystream block crypto: arm64/crct10dif - revert to C code for short inputs crypto: hash - set CRYPTO_TFM_NEED_KEY if ->setkey() fails crypto: morus - fix handling chunked inputs crypto: pcbc - remove bogus memcpy()s with src == dest crypto: skcipher - set CRYPTO_TFM_NEED_KEY if ->setkey() fails crypto: testmgr - skip crc32c context test for ahash algorithms crypto: x86/aegis - fix handling chunked inputs and MAY_SLEEP crypto: x86/aesni-gcm - fix crash on empty plaintext crypto: x86/morus - fix handling chunked inputs and MAY_SLEEP crypto: arm64/aes-ccm - fix logical bug in AAD MAC handling crypto: arm64/aes-ccm - fix bugs in non-NEON fallback routine CIFS: Do not reset lease state to NONE on lease break CIFS: Do not skip SMB2 message IDs on send failures CIFS: Fix read after write for files with read caching tracing: Use strncpy instead of memcpy for string keys in hist triggers tracing: Do not free iter->trace in fail path of tracing_open_pipe() tracing/perf: Use strndup_user() instead of buggy open-coded version xen: fix dom0 boot on huge systems ACPI / device_sysfs: Avoid OF modalias creation for removed device mmc: sdhci-esdhc-imx: fix HS400 timing issue mmc:fix a bug when max_discard 
is 0 netfilter: ipt_CLUSTERIP: fix warning unused variable cn spi: ti-qspi: Fix mmap read when more than one CS in use spi: pxa2xx: Setup maximum supported DMA transfer length regulator: s2mps11: Fix steps for buck7, buck8 and LDO35 regulator: max77620: Initialize values for DT properties regulator: s2mpa01: Fix step values for some LDOs clocksource/drivers/exynos_mct: Move one-shot check from tick clear to ISR clocksource/drivers/exynos_mct: Clear timer interrupt when shutdown clocksource/drivers/arch_timer: Workaround for Allwinner A64 timer instability s390/setup: fix early warning messages s390/virtio: handle find on invalid queue gracefully scsi: virtio_scsi: don't send sc payload with tmfs scsi: aacraid: Fix performance issue on logical drives scsi: sd: Optimal I/O size should be a multiple of physical block size scsi: target/iscsi: Avoid iscsit_release_commands_from_conn() deadlock scsi: qla2xxx: Fix LUN discovery if loop id is not assigned yet by firmware fs/devpts: always delete dcache dentry-s in dput() splice: don't merge into linked buffers ovl: During copy up, first copy up data and then xattrs ovl: Do not lose security.capability xattr over metadata file copy-up m68k: Add -ffreestanding to CFLAGS Btrfs: setup a nofs context for memory allocation at btrfs_create_tree() Btrfs: setup a nofs context for memory allocation at __btrfs_set_acl btrfs: ensure that a DUP or RAID1 block group has exactly two stripes Btrfs: fix corruption reading shared and compressed extents after hole punching soc: qcom: rpmh: Avoid accessing freed memory from batch API libertas_tf: don't set URB_ZERO_PACKET on IN USB transfer irqchip/gic-v3-its: Avoid parsing _indirect_ twice for Device table irqchip/brcmstb-l2: Use _irqsave locking variants in non-interrupt code x86/kprobes: Prohibit probing on optprobe template code cpufreq: kryo: Release OPP tables on module removal cpufreq: tegra124: add missing of_node_put() cpufreq: pxa2xx: remove incorrect __init annotation ext4: fix 
check of inode in swap_inode_boot_loader ext4: cleanup pagecache before swap i_data ext4: update quota information while swapping boot loader inode ext4: add mask of ext4 flags to swap ext4: fix crash during online resizing PCI/ASPM: Use LTR if already enabled by platform PCI/DPC: Fix print AER status in DPC event handling PCI: dwc: skip MSI init if MSIs have been explicitly disabled IB/hfi1: Close race condition on user context disable and close cxl: Wrap iterations over afu slices inside 'afu_list_lock' ext2: Fix underflow in ext2_max_size() clk: uniphier: Fix update register for CPU-gear clk: clk-twl6040: Fix imprecise external abort for pdmclk clk: samsung: exynos5: Fix possible NULL pointer exception on platform_device_alloc() failure clk: samsung: exynos5: Fix kfree() of const memory on setting driver_override clk: ingenic: Fix round_rate misbehaving with non-integer dividers clk: ingenic: Fix doc of ingenic_cgu_div_info usb: chipidea: tegra: Fix missed ci_hdrc_remove_device() usb: typec: tps6598x: handle block writes separately with plain-I2C adapters dmaengine: usb-dmac: Make DMAC system sleep callbacks explicit mm: hwpoison: fix thp split handing in soft_offline_in_use_page() mm/vmalloc: fix size check for remap_vmalloc_range_partial() mm/memory.c: do_fault: avoid usage of stale vm_area_struct kernel/sysctl.c: add missing range check in do_proc_dointvec_minmax_conv device property: Fix the length used in PROPERTY_ENTRY_STRING() intel_th: Don't reference unassigned outputs parport_pc: fix find_superio io compare code, should use equal test. 
i2c: tegra: fix maximum transfer size media: i2c: ov5640: Fix post-reset delay gpio: pca953x: Fix dereference of irq data in shutdown can: flexcan: FLEXCAN_IFLAG_MB: add () around macro argument drm/i915: Relax mmap VMA check bpf: only test gso type on gso packets serial: uartps: Fix stuck ISR if RX disabled with non-empty FIFO serial: 8250_of: assume reg-shift of 2 for mrvl,mmp-uart serial: 8250_pci: Fix number of ports for ACCES serial cards serial: 8250_pci: Have ACCES cards that use the four port Pericom PI7C9X7954 chip use the pci_pericom_setup() jbd2: clear dirty flag when revoking a buffer from an older transaction jbd2: fix compile warning when using JBUFFER_TRACE selinux: add the missing walk_size + len check in selinux_sctp_bind_connect security/selinux: fix SECURITY_LSM_NATIVE_LABELS on reused superblock powerpc/32: Clear on-stack exception marker upon exception return powerpc/wii: properly disable use of BATs when requested. powerpc/powernv: Make opal log only readable by root powerpc/83xx: Also save/restore SPRG4-7 during suspend powerpc/powernv: Don't reprogram SLW image on every KVM guest entry/exit powerpc: Fix 32-bit KVM-PR lockup and host crash with MacOS guest powerpc/ptrace: Simplify vr_get/set() to avoid GCC warning powerpc/hugetlb: Don't do runtime allocation of 16G pages in LPAR configuration powerpc/traps: fix recoverability of machine check handling on book3s/32 powerpc/traps: Fix the message printed when stack overflows ARM: s3c24xx: Fix boolean expressions in osiris_dvs_notify arm64: Fix HCR.TGE status for NMI contexts arm64: debug: Ensure debug handlers check triggering exception level arm64: KVM: Fix architecturally invalid reset value for FPEXC32_EL2 ipmi_si: fix use-after-free of resource->name dm: fix to_sector() for 32bit dm integrity: limit the rate of error messages mfd: sm501: Fix potential NULL pointer dereference cpcap-charger: generate events for userspace NFS: Fix I/O request leakages NFS: Fix an I/O request leakage in 
nfs_do_recoalesce NFS: Don't recoalesce on error in nfs_pageio_complete_mirror() nfsd: fix performance-limiting session calculation nfsd: fix memory corruption caused by readdir nfsd: fix wrong check in write_v4_end_grace() NFSv4.1: Reinitialise sequence results before retransmitting a request svcrpc: fix UDP on servers with lots of threads PM / wakeup: Rework wakeup source timer cancellation bcache: never writeback a discard operation stable-kernel-rules.rst: add link to networking patch queue vt: perform safe console erase in the right order x86/unwind/orc: Fix ORC unwind table alignment perf intel-pt: Fix CYC timestamp calculation after OVF perf tools: Fix split_kallsyms_for_kcore() for trampoline symbols perf auxtrace: Define auxtrace record alignment perf intel-pt: Fix overlap calculation for padding perf/x86/intel/uncore: Fix client IMC events return huge result perf intel-pt: Fix divide by zero when TSC is not available md: Fix failed allocation of md_register_thread tpm/tpm_crb: Avoid unaligned reads in crb_recv() tpm: Unify the send callback behaviour rcu: Do RCU GP kthread self-wakeup from softirq and interrupt media: imx: prpencvf: Stop upstream before disabling IDMA channel media: lgdt330x: fix lock status reporting media: uvcvideo: Avoid NULL pointer dereference at the end of streaming media: vimc: Add vimc-streamer for stream control media: imx: csi: Disable CSI immediately after last EOF media: imx: csi: Stop upstream before disabling IDMA channel drm/fb-helper: generic: Fix drm_fbdev_client_restore() drm/radeon/evergreen_cs: fix missing break in switch statement drm/amd/powerplay: correct power reading on fiji drm/amd/display: don't call dm_pp_ function from an fpu block KVM: Call kvm_arch_memslots_updated() before updating memslots KVM: x86/mmu: Detect MMIO generation wrap in any address space KVM: x86/mmu: Do not cache MMIO accesses while memslots are in flux KVM: nVMX: Sign extend displacements of VMX instr's mem operands KVM: nVMX: Apply addr 
size mask to effective address for VMX instructions KVM: nVMX: Ignore limit checks on VMX instructions using flat segments bcache: use (REQ_META|REQ_PRIO) to indicate bio for metadata s390/setup: fix boot crash for machine without EDAT-1 Linux 4.19.31 Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>	2019-03-23 21:13:30 +01:00
Zhang, Jun	e97a32a5a3	rcu: Do RCU GP kthread self-wakeup from softirq and interrupt commit `1d1f898df6` upstream. The rcu_gp_kthread_wake() function is invoked when it might be necessary to wake the RCU grace-period kthread. Because self-wakeups are normally a useless waste of CPU cycles, if rcu_gp_kthread_wake() is invoked from this kthread, it naturally refuses to do the wakeup. Unfortunately, natural though it might be, this heuristic fails when rcu_gp_kthread_wake() is invoked from an interrupt or softirq handler that interrupted the grace-period kthread just after the final check of the wait-event condition but just before the schedule() call. In this case, a wakeup is required, even though the call to rcu_gp_kthread_wake() is within the RCU grace-period kthread's context. Failing to provide this wakeup can result in grace periods failing to start, which in turn results in out-of-memory conditions. This race window is quite narrow, but it actually did happen during real testing. It would of course need to be fixed even if it was strictly theoretical in nature. This patch does not Cc stable because it does not apply cleanly to earlier kernel versions. Fixes: `48a7639ce8` ("rcu: Make callers awaken grace-period kthread") Reported-by: "He, Bo" <bo.he@intel.com> Co-developed-by: "Zhang, Jun" <jun.zhang@intel.com> Co-developed-by: "He, Bo" <bo.he@intel.com> Co-developed-by: "xiao, jin" <jin.xiao@intel.com> Co-developed-by: Bai, Jie A <jie.a.bai@intel.com> Signed-off: "Zhang, Jun" <jun.zhang@intel.com> Signed-off: "He, Bo" <bo.he@intel.com> Signed-off: "xiao, jin" <jin.xiao@intel.com> Signed-off: Bai, Jie A <jie.a.bai@intel.com> Signed-off-by: "Zhang, Jun" <jun.zhang@intel.com> [ paulmck: Switch from !in_softirq() to "!in_interrupt() && !in_serving_softirq() to avoid redundant wakeups and to also handle the interrupt-handler scenario as well as the softirq-handler scenario that actually occurred in testing. ] Signed-off-by: Paul E. 
McKenney <paulmck@linux.ibm.com> Link: https://lkml.kernel.org/r/CD6925E8781EFD4D8E11882D20FC406D52A11F61@SHSMSX104.ccr.corp.intel.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2019-03-23 20:10:12 +01:00
Zev Weiss	93c8a44a82	kernel/sysctl.c: add missing range check in do_proc_dointvec_minmax_conv commit `8cf7630b29` upstream. This bug has apparently existed since the introduction of this function in the pre-git era (4500e91754d3 in Thomas Gleixner's history.git, "[NET]: Add proc_dointvec_userhz_jiffies, use it for proper handling of neighbour sysctls."). As a minimal fix we can simply duplicate the corresponding check in do_proc_dointvec_conv(). Link: http://lkml.kernel.org/r/20190207123426.9202-3-zev@bewilderbeest.net Signed-off-by: Zev Weiss <zev@bewilderbeest.net> Cc: Brendan Higgins <brendanhiggins@google.com> Cc: Iurii Zaikin <yzaikin@google.com> Cc: Kees Cook <keescook@chromium.org> Cc: Luis Chamberlain <mcgrof@kernel.org> Cc: <stable@vger.kernel.org> [2.6.2+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2019-03-23 20:10:04 +01:00
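The range check described in the entry above can be illustrated with a small userspace sketch. This is not the kernel function: `conv_int_minmax` and its parameters are hypothetical names chosen for the example, and the two-step shape (first check that the parsed magnitude fits in an `int`, then compare against the optional bounds) is an assumption based on the commit's statement that the fix duplicates the check from do_proc_dointvec_conv().

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical userspace model of a sign + magnitude to int conversion
 * with optional min/max bounds. Returns 0 on success, -EINVAL otherwise.
 */
int conv_int_minmax(bool neg, unsigned long lval,
                    const int *min, const int *max, int *out)
{
    int val;

    assert(out != NULL);

    /* Step 1: reject magnitudes that do not fit in an int at all. */
    if (neg) {
        if (lval > (unsigned long)INT_MAX + 1)
            return -EINVAL;
        val = (int)(-(long long)lval);
    } else {
        if (lval > (unsigned long)INT_MAX)
            return -EINVAL;
        val = (int)lval;
    }

    /* Step 2: only a representable value is compared with the bounds. */
    if ((min && val < *min) || (max && val > *max))
        return -EINVAL;

    *out = val;
    return 0;
}
```

With bounds of 0 and 100, an input of 42 is accepted and 101 is rejected by step 2, while a magnitude of `INT_MAX + 1` without a sign is rejected by step 1 before any bound is consulted.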
Jann Horn	24d5097655	tracing/perf: Use strndup_user() instead of buggy open-coded version commit `83540fbc88` upstream. The first version of this method was missing the check for `ret == PATH_MAX`; then such a check was added, but it didn't call kfree() on error, so there was still a small memory leak in the error case. Fix it by using strndup_user() instead of open-coding it. Link: http://lkml.kernel.org/r/20190220165443.152385-1-jannh@google.com Cc: Ingo Molnar <mingo@kernel.org> Cc: stable@vger.kernel.org Fixes: `0eadcc7a7b` ("perf/core: Fix perf_uprobe_init()") Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org> Acked-by: Song Liu <songliubraving@fb.com> Signed-off-by: Jann Horn <jannh@google.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2019-03-23 20:09:56 +01:00
zhangyi (F)	f27077e5f5	tracing: Do not free iter->trace in fail path of tracing_open_pipe() commit `e7f0c424d0` upstream. Commit `d716ff71dd` ("tracing: Remove taking of trace_types_lock in pipe files") uses the current tracer instead of the copy in tracing_open_pipe(), but it forgot to remove the freeing statement in the error path. There's an error path that can call kfree(iter->trace) after the iter->trace was assigned to tr->current_trace, which would be bad to free. Link: http://lkml.kernel.org/r/1550060946-45984-1-git-send-email-yi.zhang@huawei.com Cc: stable@vger.kernel.org Fixes: `d716ff71dd` ("tracing: Remove taking of trace_types_lock in pipe files") Signed-off-by: zhangyi (F) <yi.zhang@huawei.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2019-03-23 20:09:56 +01:00
Tom Zanussi	ebca08d7e8	tracing: Use strncpy instead of memcpy for string keys in hist triggers commit `9f0bbf3115` upstream. Because there may be random garbage beyond a string's null terminator, it's not correct to copy the complete character array for use as a hist trigger key. This results in multiple histogram entries for the 'same' string key. So, in the case of a string key, use strncpy instead of memcpy to avoid copying in the extra bytes. Before, using the gdbus entries in the following hist trigger as an example: # echo 'hist:key=comm' > /sys/kernel/debug/tracing/events/sched/sched_waking/trigger # cat /sys/kernel/debug/tracing/events/sched/sched_waking/hist ... { comm: ImgDecoder #4 } hitcount: 203 { comm: gmain } hitcount: 213 { comm: gmain } hitcount: 216 { comm: StreamTrans #73 } hitcount: 221 { comm: mozStorage #3 } hitcount: 230 { comm: gdbus } hitcount: 233 { comm: StyleThread#5 } hitcount: 253 { comm: gdbus } hitcount: 256 { comm: gdbus } hitcount: 260 { comm: StyleThread#4 } hitcount: 271 ... # cat /sys/kernel/debug/tracing/events/sched/sched_waking/hist | egrep gdbus | wc -l 51 After: # cat /sys/kernel/debug/tracing/events/sched/sched_waking/hist | egrep gdbus | wc -l 1 Link: http://lkml.kernel.org/r/50c35ae1267d64eee975b8125e151e600071d4dc.1549309756.git.tom.zanussi@linux.intel.com Cc: Namhyung Kim <namhyung@kernel.org> Cc: stable@vger.kernel.org Fixes: `79e577cbce` ("tracing: Support string type key properly") Signed-off-by: Tom Zanussi <tom.zanussi@linux.intel.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2019-03-23 20:09:56 +01:00
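The effect described in the entry above, where the same string yields several distinct keys, can be reproduced in a few lines of userspace C. This is a sketch under assumptions, not the tracing code: `KEY_SIZE`, `make_field` and `make_key` are hypothetical, and keys are modelled as fixed-width byte arrays compared with memcmp().

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define KEY_SIZE 16 /* hypothetical fixed key width */

/* Build a fixed-width field: the name, its NUL, then stale bytes. */
void make_field(char field[KEY_SIZE], const char *name, char stale)
{
    size_t len = strlen(name);

    assert(len < KEY_SIZE);
    memset(field, stale, KEY_SIZE);
    memcpy(field, name, len + 1); /* copy the name and its terminator */
}

/* Derive a key from a field, either byte-for-byte or as a string. */
void make_key(char key[KEY_SIZE], const char field[KEY_SIZE],
              bool use_strncpy)
{
    if (use_strncpy)
        strncpy(key, field, KEY_SIZE); /* stops at NUL, zero-pads the rest */
    else
        memcpy(key, field, KEY_SIZE);  /* drags the stale bytes along */
}
```

Two fields that both hold "gdbus" but differ in the bytes after the terminator produce different keys through the memcpy() path and identical keys through the strncpy() path, because strncpy() fills the remainder of the destination with zero bytes.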
Al Viro	7a8b048430	fix cgroup_do_mount() handling of failure exits commit `399504e21a` upstream. same story as with last May fixes in sysfs (`7b745a4e40` "unfuck sysfs_mount()"); new_sb is left uninitialized in case of early errors in kernfs_mount_ns() and papering over it by treating any error from kernfs_mount_ns() as equivalent to !new_ns ends up conflating the cases when objects had never been transferred to a superblock with ones when that has happened and resulting new superblock had been dropped. Easily fixed (same way as in sysfs case). Additionally, there's a superblock leak on kernfs_node_dentry() failure and a dentry leak inside kernfs_node_dentry() itself - the latter on probably impossible errors, but the former not impossible to trigger (as the matter of fact, injecting allocation failures at that point does trigger it). Cc: stable@kernel.org Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2019-03-23 20:09:53 +01:00
Alban Crequy	02f8211b75	bpf, lpm: fix lookup bug in map_delete_elem [ Upstream commit `7c0cdf0b39` ] trie_delete_elem() was deleting an entry even though it was not matching if the prefixlen was correct. This patch adds a check on matchlen. Reproducer: $ sudo bpftool map create /sys/fs/bpf/mylpm type lpm_trie key 8 value 1 entries 128 name mylpm flags 1 $ sudo bpftool map update pinned /sys/fs/bpf/mylpm key hex 10 00 00 00 aa bb cc dd value hex 01 $ sudo bpftool map dump pinned /sys/fs/bpf/mylpm key: 10 00 00 00 aa bb cc dd value: 01 Found 1 element $ sudo bpftool map delete pinned /sys/fs/bpf/mylpm key hex 10 00 00 00 ff ff ff ff $ echo $? 0 $ sudo bpftool map dump pinned /sys/fs/bpf/mylpm Found 0 elements A similar reproducer is added in the selftests. Without the patch: $ sudo ./tools/testing/selftests/bpf/test_lpm_map test_lpm_map: test_lpm_map.c:485: test_lpm_delete: Assertion `bpf_map_delete_elem(map_fd, key) == -1 && errno == ENOENT' failed. Aborted With the patch: test_lpm_map runs without errors. Fixes: `e454cf5958` ("bpf: Implement map_delete_elem for BPF_MAP_TYPE_LPM_TRIE") Cc: Craig Gallek <kraig@google.com> Signed-off-by: Alban Crequy <alban@kinvolk.io> Acked-by: Craig Gallek <kraig@google.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Sasha Levin <sashal@kernel.org>	2019-03-23 20:09:51 +01:00
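The missing matchlen check from the entry above can be sketched as follows. The struct and function names are hypothetical and the trie walk itself is omitted; the example only contrasts a delete test that compares prefix lengths with one that also requires every prefix bit to match. It assumes 4 data bytes per key, as in the reproducer, where `10 00 00 00` reads as a 16-bit prefix on a little-endian host.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical key layout: prefix length in bits, then 4 data bytes. */
struct lpm_key {
    uint32_t prefixlen;
    uint8_t data[4];
};

/* Count leading bits shared by a and b, capped at the shorter prefix. */
uint32_t match_len(const struct lpm_key *a, const struct lpm_key *b)
{
    uint32_t limit = a->prefixlen < b->prefixlen ? a->prefixlen : b->prefixlen;
    uint32_t n = 0;

    assert(limit <= 32);
    while (n < limit) {
        uint8_t diff = a->data[n / 8] ^ b->data[n / 8];

        if (diff & (0x80u >> (n % 8)))
            break; /* first differing bit, most significant bit first */
        n++;
    }
    return n;
}

/* Behaviour described as buggy: equal prefix lengths are enough. */
bool delete_matches_buggy(const struct lpm_key *node, const struct lpm_key *key)
{
    return node->prefixlen == key->prefixlen;
}

/* Behaviour after adding the check: all prefix bits must match too. */
bool delete_matches_fixed(const struct lpm_key *node, const struct lpm_key *key)
{
    return node->prefixlen == key->prefixlen &&
           match_len(node, key) == node->prefixlen;
}
```

For the reproducer's stored key `aa bb cc dd` and delete key `ff ff ff ff`, both with a 16-bit prefix, only the first bit matches, so the buggy test accepts the delete and the fixed test rejects it.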
Alexei Starovoitov	c7c68a1b9a	bpf: fix lockdep false positive in stackmap [ Upstream commit `3defaf2f15` ] Lockdep warns about false positive: [ 11.211460] ------------[ cut here ]------------ [ 11.211936] DEBUG_LOCKS_WARN_ON(depth <= 0) [ 11.211985] WARNING: CPU: 0 PID: 141 at ../kernel/locking/lockdep.c:3592 lock_release+0x1ad/0x280 [ 11.213134] Modules linked in: [ 11.214954] RIP: 0010:lock_release+0x1ad/0x280 [ 11.223508] Call Trace: [ 11.223705] <IRQ> [ 11.223874] ? __local_bh_enable+0x7a/0x80 [ 11.224199] up_read+0x1c/0xa0 [ 11.224446] do_up_read+0x12/0x20 [ 11.224713] irq_work_run_list+0x43/0x70 [ 11.225030] irq_work_run+0x26/0x50 [ 11.225310] smp_irq_work_interrupt+0x57/0x1f0 [ 11.225662] irq_work_interrupt+0xf/0x20 since rw_semaphore is released in a different task vs task that locked the sema. It is expected behavior. Fix the warning with up_read_non_owner() and rwsem_release() annotation. Fixes: `bae77c5eb5` ("bpf: enable stackmap with build_id in nmi context") Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Sasha Levin <sashal@kernel.org>	2019-03-23 20:09:48 +01:00
Suren Baghdasaryan	617a4ba0ec	FROMLIST: psi: introduce psi monitor Psi monitor aims to provide a low-latency short-term pressure detection mechanism configurable by users. It allows users to monitor psi metrics growth and trigger events whenever a metric raises above user-defined threshold within user-defined time window. Time window and threshold are both expressed in usecs. Multiple psi resources with different thresholds and window sizes can be monitored concurrently. Psi monitors activate when system enters stall state for the monitored psi metric and deactivate upon exit from the stall state. While system is in the stall state psi signal growth is monitored at a rate of 10 times per tracking window. Min window size is 500ms, therefore the min monitoring interval is 50ms. Max window size is 10s with monitoring interval of 1s. When activated psi monitor stays active for at least the duration of one tracking window to avoid repeated activations/deactivations when psi signal is bouncing. Notifications to the users are rate-limited to one per tracking window. Signed-off-by: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> (not upstream yet, latest version published at: https://lore.kernel.org/patchwork/patch/1052418/) Bug: 127712811 Bug: 129157727 Test: lmkd in PSI mode Change-Id: I860049d32420485346ad545c4650f990fe0c08e3 Signed-off-by: Suren Baghdasaryan <surenb@google.com>	2019-03-22 23:07:14 +00:00
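The window arithmetic in the entry above (10 updates per tracking window, 500ms minimum, 10s maximum, values in usecs) works out as in this sketch. The constant and function names are hypothetical, and rejecting out-of-range windows with -EINVAL is an assumption made for the example; the commit message only states the limits.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Values taken from the commit text; the names are hypothetical. */
#define UPDATES_PER_WINDOW 10
#define WINDOW_MIN_US      500000ULL   /* 500ms */
#define WINDOW_MAX_US      10000000ULL /* 10s */

/*
 * Compute how often the signal is sampled for a given tracking window.
 * Returns 0 and stores the interval, or -EINVAL for an unsupported window.
 */
int monitor_interval_us(uint64_t window_us, uint64_t *interval_us)
{
    assert(interval_us != NULL);

    if (window_us < WINDOW_MIN_US || window_us > WINDOW_MAX_US)
        return -EINVAL;

    *interval_us = window_us / UPDATES_PER_WINDOW;
    return 0;
}
```

A 500000 usec window gives a 50000 usec (50ms) interval and a 10000000 usec window gives 1000000 usec (1s), matching the figures quoted in the message.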
Suren Baghdasaryan	3a905dc573	FROMLIST: refactor header includes to allow kthread.h inclusion in psi_types.h kthread.h can't be included in psi_types.h because it creates a circular inclusion with kthread.h eventually including psi_types.h and complaining on kthread structures not being defined because they are defined further in the kthread.h. Resolve this by removing psi_types.h inclusion from the headers included from kthread.h. Signed-off-by: Suren Baghdasaryan <surenb@google.com> (not upstream yet, latest version published at: https://lore.kernel.org/patchwork/patch/1052417/) Bug: 127712811 Bug: 129157727 Test: lmkd in PSI mode Change-Id: I88cd99f41534f0b9df18043cde8d1ee54aaa93de Signed-off-by: Suren Baghdasaryan <surenb@google.com>	2019-03-22 23:07:04 +00:00
Suren Baghdasaryan	23c32cf595	FROMLIST: psi: track changed states Introduce changed_states parameter into collect_percpu_times to track the states changed since the last update. Signed-off-by: Suren Baghdasaryan <surenb@google.com> (not upstream yet, latest version published at: https://lore.kernel.org/patchwork/patch/1052420/) Bug: 127712811 Bug: 129157727 Test: lmkd in PSI mode Change-Id: I944b024cd65e8520a57097bf5a3d7b2c01605bd0 Signed-off-by: Suren Baghdasaryan <surenb@google.com>	2019-03-22 23:06:55 +00:00
Suren Baghdasaryan	f270022469	FROMLIST: psi: split update_stats into parts Split update_stats into collect_percpu_times and update_averages for collect_percpu_times to be reused later inside psi monitor. Signed-off-by: Suren Baghdasaryan <surenb@google.com> (not upstream yet, latest version published at: https://lore.kernel.org/patchwork/patch/1052419/) Bug: 127712811 Bug: 129157727 Test: lmkd in PSI mode Change-Id: Ia9cfed8964fd57e41098fca285a2be0252fd5277 Signed-off-by: Suren Baghdasaryan <surenb@google.com>	2019-03-22 23:06:46 +00:00
Suren Baghdasaryan	c6e18d9458	FROMLIST: psi: rename psi fields in preparation for psi trigger addition Renaming psi_group structure member fields used for calculating psi totals and averages for clear distinction between them and trigger-related fields that will be added next. Signed-off-by: Suren Baghdasaryan <surenb@google.com> (not upstream yet, latest version published at: https://lore.kernel.org/patchwork/patch/1052416/) Bug: 127712811 Bug: 129157727 Test: lmkd in PSI mode Change-Id: I579a60e0915fa8fedaa508357d3d1aefab9428c4 Signed-off-by: Suren Baghdasaryan <surenb@google.com>	2019-03-22 23:06:38 +00:00
Suren Baghdasaryan	18d15b1861	FROMLIST: psi: make psi_enable static psi_enable is not used outside of psi.c, make it static. Suggested-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Suren Baghdasaryan <surenb@google.com> (not upstream yet, latest version published at: https://lore.kernel.org/patchwork/patch/1052415/) Bug: 127712811 Bug: 129157727 Test: lmkd in PSI mode Change-Id: I249c6d2271f93a7975f1622faf2d2b4196b701bc Signed-off-by: Suren Baghdasaryan <surenb@google.com>	2019-03-22 23:06:29 +00:00
Suren Baghdasaryan	ada57da3b1	FROMLIST: psi: introduce state_mask to represent stalled psi states The psi monitoring patches will need to determine the same states as record_times(). To avoid calculating them twice, maintain a state mask that can be consulted cheaply. Do this in a separate patch to keep the churn in the main feature patch at a minimum. This adds 4-byte state_mask member into psi_group_cpu struct which results in its first cacheline-aligned part becoming 52 bytes long. Add explicit values to enumeration element counters that affect psi_group_cpu struct size. Link: http://lkml.kernel.org/r/20190124211518.244221-4-surenb@google.com Signed-off-by: Suren Baghdasaryan <surenb@google.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Dennis Zhou <dennis@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Li Zefan <lizefan@huawei.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> (not upstream yet, latest version published at: https://lore.kernel.org/patchwork/patch/1052414/) Bug: 127712811 Bug: 129157727 Test: lmkd in PSI mode Change-Id: I38a1ca3d5c9e6cc3ba39e88c6a9af29ecdc0df5b Signed-off-by: Suren Baghdasaryan <surenb@google.com>	2019-03-22 23:06:13 +00:00
Johannes Weiner	9f79143ebb	UPSTREAM: kernel: cgroup: add poll file operation Cgroup has a standardized poll/notification mechanism for waking all pollers on all fds when a filesystem node changes. To allow polling for custom events, add a .poll callback that can override the default. This is in preparation for pollable cgroup pressure files which have per-fd trigger configurations. Link: http://lkml.kernel.org/r/20190124211518.244221-3-surenb@google.com Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Suren Baghdasaryan <surenb@google.com> Cc: Dennis Zhou <dennis@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Li Zefan <lizefan@huawei.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> (cherry picked from commit: `dc50537bdd`) Bug: 127712811 Test: lmkd in PSI mode Change-Id: Idc648e7b7b7bd5fc00c7b32163e55a93b0f49a98 Signed-off-by: Suren Baghdasaryan <surenb@google.com>	2019-03-21 16:25:28 -07:00
Johannes Weiner	ec350213df	UPSTREAM: psi: avoid divide-by-zero crash inside virtual machines We've been seeing hard-to-trigger psi crashes when running inside VM instances: divide error: 0000 [#1] SMP PTI Modules linked in: [...] CPU: 0 PID: 212 Comm: kworker/0:2 Not tainted 4.16.18-119_fbk9_3817_gfe944c98d695 #119 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 0.0.0 02/06/2015 Workqueue: events psi_clock RIP: 0010:psi_update_stats+0x270/0x490 RSP: 0018:ffffc90001117e10 EFLAGS: 00010246 RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffff8800a35a13f8 RDX: 0000000000000000 RSI: ffff8800a35a1340 RDI: 0000000000000000 RBP: 0000000000000658 R08: ffff8800a35a1470 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000 R13: 0000000000000000 R14: 0000000000000000 R15: 00000000000f8502 FS: 0000000000000000(0000) GS:ffff88023fc00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007fbe370fa000 CR3: 00000000b1e3a000 CR4: 00000000000006f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: psi_clock+0x12/0x50 process_one_work+0x1e0/0x390 worker_thread+0x2b/0x3c0 ? rescuer_thread+0x330/0x330 kthread+0x113/0x130 ? kthread_create_worker_on_cpu+0x40/0x40 ? SyS_exit_group+0x10/0x10 ret_from_fork+0x35/0x40 Code: 48 0f 47 c7 48 01 c2 45 85 e4 48 89 16 0f 85 e6 00 00 00 4c 8b 49 10 4c 8b 51 08 49 69 d9 f2 07 00 00 48 6b c0 64 4c 8b 29 31 d2 <48> f7 f7 49 69 d5 8d 06 00 00 48 89 c5 4c 69 f0 00 98 0b 00 48 The Code-line points to `period` being 0 inside update_stats(), and we divide by that when calculating that period's pressure percentage. The elapsed period should never be 0. The reason this can happen is due to an off-by-one in the idle time / missing period calculation combined with a coarse sched_clock() in the virtual machine. 
The target time for aggregation is advanced into the future on a fixed grid to prevent clock drift. So when an aggregation runs after some idle period, we cannot just set it to "now + psi_period", but have to calculate the downtime and advance the target time relative to itself. However, if the aggregator was disabled for exactly one psi_period (ns), we drop one idle period in the calculation due to a > when we should do >=. In that case, next_update will be advanced from 'now - psi_period' to 'now' when it should be moved to 'now + psi_period'. The run finishes with last_update == next_update == sched_clock(). With hardware clocks, this exact nanosecond match isn't likely in the first place; but if it does happen, the clock will still have moved on and the period will be non-zero by the time the worker runs. A pointlessly short period, but besides the extra work, no harm no foul. However, a slow sched_clock() like we have on VMs might not have advanced either by the time the worker runs again. And when we calculate the elapsed period, the result, our pressure divisor, will be 0. Ouch. Fix this by correctly handling the situation when the elapsed time between aggregation runs is precisely two periods, and advance the expiration timestamp correctly to one period into the future. Link: http://lkml.kernel.org/r/20190214193157.15788-1-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reported-by: Łukasz Siudut <lsiudut@fb.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> (cherry picked from commit `4e37504d1c`) Bug: 127712811 Test: lmkd in PSI mode Change-Id: I40917c84354f9f32259c6703f00b6b1d21f45f02 Signed-off-by: Suren Baghdasaryan <surenb@google.com>	2019-03-21 16:25:27 -07:00
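The off-by-one described in this message can be reproduced with a small model. This is a sketch in Python, not the kernel's C code: the function name `advance_target`, its parameters, and the `inclusive` switch are hypothetical, and the 2-second period is taken from the original psi commit message. It assumes only what the message states, namely that the target is advanced relative to itself by whole periods.

```python
PSI_PERIOD_NS = 2_000_000_000  # 2 s aggregation period (assumed from the psi commit message)


def advance_target(expires, now, period=PSI_PERIOD_NS, inclusive=True):
    """Return the next aggregation target on the fixed grid.

    expires   -- the previous target time (ns)
    now       -- the current clock reading (ns), with now >= expires
    inclusive -- True models the fixed '>=' comparison,
                 False models the buggy '>' comparison
    """
    elapsed = now - expires
    missed = 0
    # Count whole periods that passed while the aggregator was idle.
    if elapsed >= period if inclusive else elapsed > period:
        missed = elapsed // period
    # Advance relative to the old target so the grid does not drift.
    return expires + (1 + missed) * period


if __name__ == "__main__":
    now = 10 * PSI_PERIOD_NS
    expires = now - PSI_PERIOD_NS  # idle for exactly one period
    buggy = advance_target(expires, now, inclusive=False)
    fixed = advance_target(expires, now, inclusive=True)
    print("buggy target is", buggy - now, "ns after now")
    print("fixed target is", fixed - now, "ns after now")
```

With the exclusive comparison the new target equals `now`, so a clock that has not advanced by the next run yields an elapsed period of zero. With the inclusive comparison the target lands one full period ahead.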
Johannes Weiner	2a070382c9	UPSTREAM: psi: fix aggregation idle shut-off psi has provisions to shut off the periodic aggregation worker when there is a period of no task activity - and thus no data that needs aggregating. However, while developing psi monitoring, Suren noticed that the aggregation clock currently won't stay shut off for good. Debugging this revealed a flaw in the idle design: an aggregation run will see no task activity and decide to go to sleep; shortly thereafter, the kworker thread that executed the aggregation will go idle and cause a scheduling change, during which the psi callback will kick the !pending worker again. This will ping-pong forever, and is equivalent to having no shut-off logic at all (but with more code!) Fix this by exempting aggregation workers from psi's clock waking logic when the state change is them going to sleep. To do this, tag workers with the last work function they executed, and if in psi we see a worker going to sleep after aggregating psi data, we will not reschedule the aggregation work item. What if the worker is also executing other items before or after? Any psi state times that were incurred by work items preceding the aggregation work will have been collected from the per-cpu buckets during the aggregation itself. If there are work items following the aggregation work, the worker's last_func tag will be overwritten and the aggregator will be kept alive to process this genuine new activity. If the aggregation work is the last thing the worker does, and we decide to go idle, the brief period of non-idle time incurred between the aggregation run and the kworker's dequeue will be stranded in the per-cpu buckets until the clock is woken by later activity. But that should not be a problem. The buckets can hold 4s worth of time, and future activity will wake the clock with a 2s delay, giving us 2s worth of data we can leave behind when disabling aggregation. 
If it takes a worker more than two seconds to go idle after it finishes its last work item, we likely have bigger problems in the system, and won't notice one sample that was averaged with a bogus per-CPU weight. Link: http://lkml.kernel.org/r/20190116193501.1910-1-hannes@cmpxchg.org Fixes: `eb414681d5` ("psi: pressure stall information for CPU, memory, and IO") Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reported-by: Suren Baghdasaryan <surenb@google.com> Acked-by: Tejun Heo <tj@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Lai Jiangshan <jiangshanlai@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> (cherry picked from commit `1b69ac6b40`) Bug: 127712811 Test: lmkd in PSI mode Change-Id: I2877fec3d381b1006b8bd1261895fdfd68bd21db Signed-off-by: Suren Baghdasaryan <surenb@google.com>	2019-03-21 16:25:27 -07:00
Johannes Weiner	3bbcbc8039	UPSTREAM: psi: make disabling/enabling easier for vendor kernels Mel Gorman reports a hackbench regression with psi that would prohibit shipping the suse kernel with it default-enabled, but he'd still like users to be able to opt in at little to no cost to others. With the current combination of CONFIG_PSI and the psi_disabled bool set from the commandline, this is a challenge. Do the following things to make it easier:
1. Add a config option CONFIG_PSI_DEFAULT_DISABLED that allows distros to enable CONFIG_PSI in their kernel but leave the feature disabled unless a user requests it at boot-time. To avoid double negatives, rename psi_disabled= to psi=.
2. Make psi_disabled a static branch to eliminate any branch costs when the feature is disabled.
In terms of numbers before and after this patch, Mel says:
: The following is a comparison using CONFIG_PSI=n as a baseline against
: your patch and a vanilla kernel
:
:                  4.20.0-rc4         4.20.0-rc4         4.20.0-rc4
:         kconfigdisable-v1r1            vanilla    psidisable-v1r1
: Amean  1   1.3100 (  0.00%)   1.3923 ( -6.28%)   1.3427 ( -2.49%)
: Amean  3   3.8860 (  0.00%)   4.1230 * -6.10%*   3.8860 ( -0.00%)
: Amean  5   6.8847 (  0.00%)   8.0390 *-16.77%*   6.7727 (  1.63%)
: Amean  7   9.9310 (  0.00%)  10.8367 * -9.12%*   9.9910 ( -0.60%)
: Amean 12  16.6577 (  0.00%)  18.2363 * -9.48%*  17.1083 ( -2.71%)
: Amean 18  26.5133 (  0.00%)  27.8833 * -5.17%*  25.7663 (  2.82%)
: Amean 24  34.3003 (  0.00%)  34.6830 ( -1.12%)  32.0450 (  6.58%)
: Amean 30  40.0063 (  0.00%)  40.5800 ( -1.43%)  41.5087 ( -3.76%)
: Amean 32  40.1407 (  0.00%)  41.2273 ( -2.71%)  39.9417 (  0.50%)
:
: It's showing that the vanilla kernel takes a hit (as the bisection
: indicated it would) and that disabling PSI by default is reasonably
: close in terms of performance for this particular workload on this
: particular machine so;
Link: http://lkml.kernel.org/r/20181127165329.GA29728@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Tested-by: Mel Gorman <mgorman@techsingularity.net> 
Reported-by: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> (cherry picked from commit `e0c274472d`) Bug: 127712811 Test: lmkd in PSI mode Change-Id: I6cb666fa351e8901df82e4d6931bfec0c5ce230d Signed-off-by: Suren Baghdasaryan <surenb@google.com>	2019-03-21 16:25:27 -07:00
Olof Johansson	b822a6da85	UPSTREAM: kernel/sched/psi.c: simplify cgroup_move_task() The existing code triggered an invalid warning about 'rq' possibly being used uninitialized. Instead of doing the silly warning suppression by initializing it to NULL, refactor the code to bail out early instead. Warning was: kernel/sched/psi.c: In function `cgroup_move_task': kernel/sched/psi.c:639:13: warning: `rq' may be used uninitialized in this function [-Wmaybe-uninitialized] Link: http://lkml.kernel.org/r/20181103183339.8669-1-olof@lixom.net Fixes: `2ce7135adc` ("psi: cgroup support") Signed-off-by: Olof Johansson <olof@lixom.net> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> (cherry picked from commit `8fcb2312d1`) Bug: 127712811 Test: lmkd in PSI mode Change-Id: Id989da224a726082e0cfa5d5d9460bf63d448a93 Signed-off-by: Suren Baghdasaryan <surenb@google.com>	2019-03-21 16:25:27 -07:00
Johannes Weiner	dc9cd29ded	UPSTREAM: psi: cgroup support On a system that executes multiple cgrouped jobs and independent workloads, we don't just care about the health of the overall system, but also that of individual jobs, so that we can ensure individual job health, fairness between jobs, or prioritize some jobs over others. This patch implements pressure stall tracking for cgroups. In kernels with CONFIG_PSI=y, cgroup2 groups will have cpu.pressure, memory.pressure, and io.pressure files that track aggregate pressure stall times for only the tasks inside the cgroup. Link: http://lkml.kernel.org/r/20180828172258.3185-10-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Tejun Heo <tj@kernel.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Daniel Drake <drake@endlessm.com> Tested-by: Suren Baghdasaryan <surenb@google.com> Cc: Christopher Lameter <cl@linux.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Johannes Weiner <jweiner@fb.com> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Enderborg <peter.enderborg@sony.com> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Shakeel Butt <shakeelb@google.com> Cc: Vinayak Menon <vinmenon@codeaurora.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> (cherry picked from commit `2ce7135adc`) Bug: 127712811 Test: lmkd in PSI mode Change-Id: I163e6657aaa60aa5aab9372616a3bce2a65e90ec Signed-off-by: Suren Baghdasaryan <surenb@google.com>	2019-03-21 16:25:27 -07:00
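As a rough userspace illustration of the per-cgroup files named in this message, the following Python sketch lists which of `cpu.pressure`, `memory.pressure`, and `io.pressure` exist in a given cgroup2 directory. The helper name `cgroup_pressure_files` is hypothetical, and the directory passed in is whatever path the cgroup2 hierarchy is mounted at on the system in question; nothing here is taken from the kernel patch itself.

```python
from pathlib import Path

# File names as given in the commit message for CONFIG_PSI=y kernels.
PRESSURE_FILES = ("cpu.pressure", "memory.pressure", "io.pressure")


def cgroup_pressure_files(cgroup_dir):
    """Return {resource: Path} for the pressure files present in cgroup_dir."""
    base = Path(cgroup_dir)
    found = {}
    for name in PRESSURE_FILES:
        path = base / name
        if path.is_file():
            # Key by resource name: "cpu", "memory" or "io".
            found[name.split(".")[0]] = path
    return found
```

A caller might pass a path such as `/sys/fs/cgroup/<group>` (an assumed mount point). An empty result suggests the kernel was built without psi or the directory is not a cgroup2 group.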
Johannes Weiner	e550f94252	UPSTREAM: psi: pressure stall information for CPU, memory, and IO When systems are overcommitted and resources become contended, it's hard to tell exactly the impact this has on workload productivity, or how close the system is to lockups and OOM kills. In particular, when machines work multiple jobs concurrently, the impact of overcommit in terms of latency and throughput on the individual job can be enormous. In order to maximize hardware utilization without sacrificing individual job health or risking complete machine lockups, this patch implements a way to quantify resource pressure in the system. A kernel built with CONFIG_PSI=y creates files in /proc/pressure/ that expose the percentage of time the system is stalled on CPU, memory, or IO, respectively. Stall states are aggregate versions of the per-task delay accounting delays:
cpu: some tasks are runnable but not executing on a CPU
memory: tasks are reclaiming, or waiting for swapin or thrashing cache
io: tasks are waiting for io completions
These percentages of walltime can be thought of as pressure percentages, and they give a general sense of system health and productivity loss incurred by resource overcommit. They can also indicate when the system is approaching lockup scenarios and OOMs. To do this, psi keeps track of the task states associated with each CPU and samples the time they spend in stall states. Every 2 seconds, the samples are averaged across CPUs - weighted by the CPUs' non-idle time to eliminate artifacts from unused CPUs - and translated into percentages of walltime. A running average of those percentages is maintained over 10s, 1m, and 5m periods (similar to the loadaverage). 
[hannes@cmpxchg.org: doc fixlet, per Randy] Link: http://lkml.kernel.org/r/20180828205625.GA14030@cmpxchg.org [hannes@cmpxchg.org: code optimization] Link: http://lkml.kernel.org/r/20180907175015.GA8479@cmpxchg.org [hannes@cmpxchg.org: rename psi_clock() to psi_update_work(), per Peter] Link: http://lkml.kernel.org/r/20180907145404.GB11088@cmpxchg.org [hannes@cmpxchg.org: fix build] Link: http://lkml.kernel.org/r/20180913014222.GA2370@cmpxchg.org Link: http://lkml.kernel.org/r/20180828172258.3185-9-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Daniel Drake <drake@endlessm.com> Tested-by: Suren Baghdasaryan <surenb@google.com> Cc: Christopher Lameter <cl@linux.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Johannes Weiner <jweiner@fb.com> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Enderborg <peter.enderborg@sony.com> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Shakeel Butt <shakeelb@google.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vinayak Menon <vinmenon@codeaurora.org> Cc: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> (cherry picked from commit `eb414681d5`) Bug: 127712811 Test: lmkd in PSI mode Change-Id: Id00d23c977169b0c4636d92016fc1fee0274be05 Signed-off-by: Suren Baghdasaryan <surenb@google.com>	2019-03-21 16:25:27 -07:00
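For readers who want to consume the /proc/pressure/ files from userspace, here is a minimal Python sketch. The line layout it expects (`some avg10=... avg60=... avg300=... total=...`) is an assumption based on the upstream psi documentation and is not spelled out in this commit message; the function names `parse_pressure` and `read_pressure` are hypothetical.

```python
def parse_pressure(text):
    """Parse the contents of a psi pressure file into nested dicts.

    Assumed input, one line per stall kind:
        some avg10=0.00 avg60=0.00 avg300=0.00 total=0
    """
    result = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        kind, pairs = fields[0], fields[1:]
        values = {}
        for pair in pairs:
            key, _, raw = pair.partition("=")
            # avg10/avg60/avg300 are running-average percentages;
            # "total" is a cumulative stall-time counter.
            values[key] = int(raw) if key == "total" else float(raw)
        result[kind] = values
    return result


def read_pressure(resource, root="/proc/pressure"):
    """Read and parse one pressure file, e.g. resource='memory'."""
    with open(f"{root}/{resource}") as f:
        return parse_pressure(f.read())
```

On a kernel built with CONFIG_PSI=y, `read_pressure("memory")["some"]["avg10"]` would then give the 10-second running average described above. The parser is kept separate from the file access so it can be exercised on sample text.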
Johannes Weiner	8cd88f5398	UPSTREAM: sched: introduce this_rq_lock_irq() do_sched_yield() disables IRQs, looks up this_rq() and locks it. The next patch is adding another site with the same pattern, so provide a convenience function for it. Link: http://lkml.kernel.org/r/20180828172258.3185-8-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Suren Baghdasaryan <surenb@google.com> Tested-by: Daniel Drake <drake@endlessm.com> Cc: Christopher Lameter <cl@linux.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Johannes Weiner <jweiner@fb.com> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Enderborg <peter.enderborg@sony.com> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Shakeel Butt <shakeelb@google.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vinayak Menon <vinmenon@codeaurora.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> (cherry picked from commit `246b3b3342`) Bug: 127712811 Test: lmkd in PSI mode Change-Id: I24b42cff1624c80633f116b7cb485564f53a30a7 Signed-off-by: Suren Baghdasaryan <surenb@google.com>	2019-03-21 16:25:27 -07:00
Johannes Weiner	cdda3cf652	UPSTREAM: sched: sched.h: make rq locking and clock functions available in stats.h kernel/sched/sched.h includes "stats.h" half-way through the file. The next patch introduces users of sched.h's rq locking functions and update_rq_clock() in kernel/sched/stats.h. Move those definitions up in the file so they are available in stats.h. Link: http://lkml.kernel.org/r/20180828172258.3185-7-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Suren Baghdasaryan <surenb@google.com> Tested-by: Daniel Drake <drake@endlessm.com> Cc: Christopher Lameter <cl@linux.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Johannes Weiner <jweiner@fb.com> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Enderborg <peter.enderborg@sony.com> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Shakeel Butt <shakeelb@google.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vinayak Menon <vinmenon@codeaurora.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> (cherry picked from commit `1f351d7f75`) Bug: 127712811 Test: lmkd in PSI mode Change-Id: Id342e0ba9a62b49e64f2ce8b87f883ea70230b2f Signed-off-by: Suren Baghdasaryan <surenb@google.com>	2019-03-21 16:25:27 -07:00

28691 Commits