Commit Graph

1044914 Commits

Author SHA1 Message Date
Eric Whitney
0add491df4 ext4: remove extent cache entries when truncating inline data
Conditionally remove all cached extents belonging to an inode
when truncating its inline data.  It's only necessary to attempt to
remove cached extents when a conversion from inline to extent storage
has been initiated (!EXT4_STATE_MAY_INLINE_DATA).  This avoids
unnecessary es lock overhead in the more common inline case.

Signed-off-by: Eric Whitney <enwlinux@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Link: https://lore.kernel.org/r/20210819144927.25163-2-enwlinux@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2021-09-09 10:52:05 -04:00
Theodore Ts'o
11ef08c9eb Merge branch 'delalloc-buffer-write' into dev
Fix a bug in how we update i_disksize, and the error path in
inline_data_end.  Finally, drop an unnecessary creation of a journal
handle which was only needed for inline data, which can give us a
large performance gain in delayed allocation writes.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2021-09-09 10:47:06 -04:00
Matthias Kaehlcke
70ee251ded thermal/drivers/qcom/spmi-adc-tm5: Don't abort probing if a sensor is not used
adc_tm5_register_tzd() registers the thermal zone sensors for all
channels of the thermal monitor. If the registration of one channel
fails the function skips the processing of the remaining channels
and returns an error, which results in _probe() being aborted.

One of the reasons the registration could fail is that none of the
thermal zones is using the channel/sensor, which hardly is a critical
error (if it is an error at all). If this case is detected emit a
warning and continue with processing the remaining channels.

Fixes: ca66dca5ed ("thermal: qcom: add support for adc-tm5 PMIC thermal monitor")
Signed-off-by: Matthias Kaehlcke <mka@chromium.org>
Reported-by: Stephen Boyd <swboyd@chromium.org>
Reviewed-by: Stephen Boyd <swboyd@chromium.org>
Reviewed-by: Dmitry Baryshkov <dmitry.baryshkov@linaro.org>
Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Link: https://lore.kernel.org/r/20210823134726.1.I1dd23ddf77e5b3568625d80d6827653af071ce19@changeid
2021-09-09 16:33:29 +02:00
Srinivas Pandruvada
5950fc44a5 thermal/drivers/intel: Allow processing of HWP interrupt
Add a weak function to process HWP (Hardware P-states) notifications and
move updating HWP_STATUS MSR to this function.

This allows HWP interrupts to be processed by the intel_pstate driver in
HWP mode by overriding the implementation.

Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Acked-by: Zhang Rui <rui.zhang@intel.com>
Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Link: https://lore.kernel.org/r/20210820024006.2347720-1-srinivas.pandruvada@linux.intel.com
2021-09-09 16:33:20 +02:00
Guenter Roeck
2bab94090b
spi: tegra20-slink: Declare runtime suspend and resume functions conditionally
The following build error is seen with CONFIG_PM=n.

drivers/spi/spi-tegra20-slink.c:1188:12: error:
	'tegra_slink_runtime_suspend' defined but not used
drivers/spi/spi-tegra20-slink.c:1200:12: error:
	'tegra_slink_runtime_resume' defined but not used

Declare the functions only if PM is enabled. While at it, remove the
unnecessary forward declarations.

Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Reviewed-by: Geert Uytterhoeven <geert+renesas@glider.be>
Acked-by: Thierry Reding <treding@nvidia.com>
Link: https://lore.kernel.org/r/20210907045358.2138282-1-linux@roeck-us.net
Signed-off-by: Mark Brown <broonie@kernel.org>
2021-09-09 14:16:27 +01:00
Trevor Wu
5a80dea931
ASoC: mediatek: add required config dependency
Because SND_SOC_MT8195 depends on COMPILE_TEST, it's possible to build
MT8195 driver in many different config combinations.
Add all dependent config for SND_SOC_MT8195 in case some errors happen
when COMPILE_TEST is enabled.

Signed-off-by: Trevor Wu <trevor.wu@mediatek.com>
Reported-by: Randy Dunlap <rdunlap@infradead.org>
Link: https://lore.kernel.org/r/20210909065533.2114-1-trevor.wu@mediatek.com
Signed-off-by: Mark Brown <broonie@kernel.org>
2021-09-09 14:16:25 +01:00
Pierre-Louis Bossart
58eafe1ff5
ASoC: Intel: sof_sdw: tag SoundWire BEs as non-atomic
The SoundWire BEs make use of 'stream' functions for .prepare and
.trigger. These functions will in turn force a Bank Switch, which
implies a wait operation.

Mark SoundWire BEs as nonatomic for consistency, but keep all other
types of BEs as is. The initialization of .nonatomic is done outside
of the create_sdw_dailink helper to avoid adding more parameters to
deal with a single exception to the rule that BEs are atomic.

Suggested-by: Takashi Iwai <tiwai@suse.de>
Signed-off-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com>
Reviewed-by: Rander Wang <rander.wang@intel.com>
Reviewed-by: Ranjani Sridharan <ranjani.sridharan@linux.intel.com>
Reviewed-by: Bard Liao <bard.liao@intel.com>
Link: https://lore.kernel.org/r/20210907184436.33152-1-pierre-louis.bossart@linux.intel.com
Signed-off-by: Mark Brown <broonie@kernel.org>
2021-09-09 14:16:24 +01:00
Qiang.zhang
66e70be722 io-wq: fix memory leak in create_io_worker()
BUG: memory leak
unreferenced object 0xffff888126fcd6c0 (size 192):
  comm "syz-executor.1", pid 11934, jiffies 4294983026 (age 15.690s)
  backtrace:
    [<ffffffff81632c91>] kmalloc_node include/linux/slab.h:609 [inline]
    [<ffffffff81632c91>] kzalloc_node include/linux/slab.h:732 [inline]
    [<ffffffff81632c91>] create_io_worker+0x41/0x1e0 fs/io-wq.c:739
    [<ffffffff8163311e>] io_wqe_create_worker fs/io-wq.c:267 [inline]
    [<ffffffff8163311e>] io_wqe_enqueue+0x1fe/0x330 fs/io-wq.c:866
    [<ffffffff81620b64>] io_queue_async_work+0xc4/0x200 fs/io_uring.c:1473
    [<ffffffff8162c59c>] __io_queue_sqe+0x34c/0x510 fs/io_uring.c:6933
    [<ffffffff8162c7ab>] io_req_task_submit+0x4b/0xa0 fs/io_uring.c:2233
    [<ffffffff8162cb48>] io_async_task_func+0x108/0x1c0 fs/io_uring.c:5462
    [<ffffffff816259e3>] tctx_task_work+0x1b3/0x3a0 fs/io_uring.c:2158
    [<ffffffff81269b43>] task_work_run+0x73/0xb0 kernel/task_work.c:164
    [<ffffffff812dcdd1>] tracehook_notify_signal include/linux/tracehook.h:212 [inline]
    [<ffffffff812dcdd1>] handle_signal_work kernel/entry/common.c:146 [inline]
    [<ffffffff812dcdd1>] exit_to_user_mode_loop kernel/entry/common.c:172 [inline]
    [<ffffffff812dcdd1>] exit_to_user_mode_prepare+0x151/0x180 kernel/entry/common.c:209
    [<ffffffff843ff25d>] __syscall_exit_to_user_mode_work kernel/entry/common.c:291 [inline]
    [<ffffffff843ff25d>] syscall_exit_to_user_mode+0x1d/0x40 kernel/entry/common.c:302
    [<ffffffff843fa4a2>] do_syscall_64+0x42/0xb0 arch/x86/entry/common.c:86
    [<ffffffff84600068>] entry_SYSCALL_64_after_hwframe+0x44/0xae

when create_io_thread() return error, and not retry, the worker object
need to be freed.

Reported-by: syzbot+65454c239241d3d647da@syzkaller.appspotmail.com
Signed-off-by: Qiang.zhang <qiang.zhang@windriver.com>
Link: https://lore.kernel.org/r/20210909115822.181188-1-qiang.zhang@windriver.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-09-09 06:57:04 -06:00
Robin Murphy
8cc633190b iommu: Clarify default domain Kconfig
Although strictly it is the AMD and Intel drivers which have an existing
expectation of lazy behaviour by default, it ends up being rather
unintuitive to describe this literally in Kconfig. Express it instead as
an architecture dependency, to clarify that it is a valid config-time
decision. The end result is the same since virtio-iommu doesn't support
lazy mode and thus falls back to strict at runtime regardless.

The per-architecture disparity is a matter of historical expectations:
the AMD and Intel drivers have been lazy by default since 2008, and
changing that gets noticed by people asking where their I/O throughput
has gone. Conversely, Arm-based systems with their wider assortment of
IOMMU drivers mostly only support strict mode anyway; only the Arm SMMU
drivers have later grown support for passthrough and lazy mode, for
users who wanted to explicitly trade off isolation for performance.
These days, reducing the default level of isolation in a way which may
go unnoticed by users who expect otherwise hardly seems worth risking
for the sake of one line of Kconfig, so here's where we are.

Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Robin Murphy <robin.murphy@arm.com>
Link: https://lore.kernel.org/r/69a0c6f17b000b54b8333ee42b3124c1d5a869e2.1631105737.git.robin.murphy@arm.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
2021-09-09 13:18:07 +02:00
Fenghua Yu
6ef0505158 iommu/vt-d: Fix a deadlock in intel_svm_drain_prq()
pasid_mutex and dev->iommu->param->lock are held while unbinding mm is
flushing IO page fault workqueue and waiting for all page fault works to
finish. But an in-flight page fault work also need to hold the two locks
while unbinding mm are holding them and waiting for the work to finish.
This may cause an ABBA deadlock issue as shown below:

	idxd 0000:00:0a.0: unbind PASID 2
	======================================================
	WARNING: possible circular locking dependency detected
	5.14.0-rc7+ #549 Not tainted [  186.615245] ----------
	dsa_test/898 is trying to acquire lock:
	ffff888100d854e8 (&param->lock){+.+.}-{3:3}, at:
	iopf_queue_flush_dev+0x29/0x60
	but task is already holding lock:
	ffffffff82b2f7c8 (pasid_mutex){+.+.}-{3:3}, at:
	intel_svm_unbind+0x34/0x1e0
	which lock already depends on the new lock.

	the existing dependency chain (in reverse order) is:

	-> #2 (pasid_mutex){+.+.}-{3:3}:
	       __mutex_lock+0x75/0x730
	       mutex_lock_nested+0x1b/0x20
	       intel_svm_page_response+0x8e/0x260
	       iommu_page_response+0x122/0x200
	       iopf_handle_group+0x1c2/0x240
	       process_one_work+0x2a5/0x5a0
	       worker_thread+0x55/0x400
	       kthread+0x13b/0x160
	       ret_from_fork+0x22/0x30

	-> #1 (&param->fault_param->lock){+.+.}-{3:3}:
	       __mutex_lock+0x75/0x730
	       mutex_lock_nested+0x1b/0x20
	       iommu_report_device_fault+0xc2/0x170
	       prq_event_thread+0x28a/0x580
	       irq_thread_fn+0x28/0x60
	       irq_thread+0xcf/0x180
	       kthread+0x13b/0x160
	       ret_from_fork+0x22/0x30

	-> #0 (&param->lock){+.+.}-{3:3}:
	       __lock_acquire+0x1134/0x1d60
	       lock_acquire+0xc6/0x2e0
	       __mutex_lock+0x75/0x730
	       mutex_lock_nested+0x1b/0x20
	       iopf_queue_flush_dev+0x29/0x60
	       intel_svm_drain_prq+0x127/0x210
	       intel_svm_unbind+0xc5/0x1e0
	       iommu_sva_unbind_device+0x62/0x80
	       idxd_cdev_release+0x15a/0x200 [idxd]
	       __fput+0x9c/0x250
	       ____fput+0xe/0x10
	       task_work_run+0x64/0xa0
	       exit_to_user_mode_prepare+0x227/0x230
	       syscall_exit_to_user_mode+0x2c/0x60
	       do_syscall_64+0x48/0x90
	       entry_SYSCALL_64_after_hwframe+0x44/0xae

	other info that might help us debug this:

	Chain exists of:
	  &param->lock --> &param->fault_param->lock --> pasid_mutex

	 Possible unsafe locking scenario:

	       CPU0                    CPU1
	       ----                    ----
	  lock(pasid_mutex);
				       lock(&param->fault_param->lock);
				       lock(pasid_mutex);
	  lock(&param->lock);

	 *** DEADLOCK ***

	2 locks held by dsa_test/898:
	 #0: ffff888100cc1cc0 (&group->mutex){+.+.}-{3:3}, at:
	 iommu_sva_unbind_device+0x53/0x80
	 #1: ffffffff82b2f7c8 (pasid_mutex){+.+.}-{3:3}, at:
	 intel_svm_unbind+0x34/0x1e0

	stack backtrace:
	CPU: 2 PID: 898 Comm: dsa_test Not tainted 5.14.0-rc7+ #549
	Hardware name: Intel Corporation Kabylake Client platform/KBL S
	DDR4 UD IMM CRB, BIOS KBLSE2R1.R00.X050.P01.1608011715 08/01/2016
	Call Trace:
	 dump_stack_lvl+0x5b/0x74
	 dump_stack+0x10/0x12
	 print_circular_bug.cold+0x13d/0x142
	 check_noncircular+0xf1/0x110
	 __lock_acquire+0x1134/0x1d60
	 lock_acquire+0xc6/0x2e0
	 ? iopf_queue_flush_dev+0x29/0x60
	 ? pci_mmcfg_read+0xde/0x240
	 __mutex_lock+0x75/0x730
	 ? iopf_queue_flush_dev+0x29/0x60
	 ? pci_mmcfg_read+0xfd/0x240
	 ? iopf_queue_flush_dev+0x29/0x60
	 mutex_lock_nested+0x1b/0x20
	 iopf_queue_flush_dev+0x29/0x60
	 intel_svm_drain_prq+0x127/0x210
	 ? intel_pasid_tear_down_entry+0x22e/0x240
	 intel_svm_unbind+0xc5/0x1e0
	 iommu_sva_unbind_device+0x62/0x80
	 idxd_cdev_release+0x15a/0x200

pasid_mutex protects pasid and svm data mapping data. It's unnecessary
to hold pasid_mutex while flushing the workqueue. To fix the deadlock
issue, unlock pasid_pasid during flushing the workqueue to allow the works
to be handled.

Fixes: d5b9e4bfe0 ("iommu/vt-d: Report prq to io-pgfault framework")
Reported-and-tested-by: Dave Jiang <dave.jiang@intel.com>
Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
Link: https://lore.kernel.org/r/20210826215918.4073446-1-fenghua.yu@intel.com
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Link: https://lore.kernel.org/r/20210828070622.2437559-3-baolu.lu@linux.intel.com
[joro: Removed timing information from kernel log messages]
Signed-off-by: Joerg Roedel <jroedel@suse.de>
2021-09-09 13:18:07 +02:00
Fenghua Yu
a21518cb23 iommu/vt-d: Fix PASID leak in intel_svm_unbind_mm()
The mm->pasid will be used in intel_svm_free_pasid() after load_pasid()
during unbinding mm. Clearing it in load_pasid() will cause PASID cannot
be freed in intel_svm_free_pasid().

Additionally mm->pasid was updated already before load_pasid() during pasid
allocation. No need to update it again in load_pasid() during binding mm.
Don't update mm->pasid to avoid the issues in both binding mm and unbinding
mm.

Fixes: 4048377414 ("iommu/vt-d: Use iommu_sva_alloc(free)_pasid() helpers")
Reported-and-tested-by: Dave Jiang <dave.jiang@intel.com>
Co-developed-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
Link: https://lore.kernel.org/r/20210826215918.4073446-1-fenghua.yu@intel.com
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Link: https://lore.kernel.org/r/20210828070622.2437559-2-baolu.lu@linux.intel.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
2021-09-09 13:18:06 +02:00
Suravee Suthikulpanit
eb03f2d2f6 iommu/amd: Remove iommu_init_ga()
Since the function has been simplified and only call iommu_init_ga_log(),
remove the function and replace with iommu_init_ga_log() instead.

Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Link: https://lore.kernel.org/r/20210820202957.187572-4-suravee.suthikulpanit@amd.com
Fixes: 8bda0cfbdc ("iommu/amd: Detect and initialize guest vAPIC log")
Signed-off-by: Joerg Roedel <jroedel@suse.de>
2021-09-09 13:18:06 +02:00
Wei Huang
c3811a50ad iommu/amd: Relocate GAMSup check to early_enable_iommus
Currently, iommu_init_ga() checks and disables IOMMU VAPIC support
(i.e. AMD AVIC support in IOMMU) when GAMSup feature bit is not set.
However it forgets to clear IRQ_POSTING_CAP from the previously set
amd_iommu_irq_ops.capability.

This triggers an invalid page fault bug during guest VM warm reboot
if AVIC is enabled since the irq_remapping_cap(IRQ_POSTING_CAP) is
incorrectly set, and crash the system with the following kernel trace.

    BUG: unable to handle page fault for address: 0000000000400dd8
    RIP: 0010:amd_iommu_deactivate_guest_mode+0x19/0xbc
    Call Trace:
     svm_set_pi_irte_mode+0x8a/0xc0 [kvm_amd]
     ? kvm_make_all_cpus_request_except+0x50/0x70 [kvm]
     kvm_request_apicv_update+0x10c/0x150 [kvm]
     svm_toggle_avic_for_irq_window+0x52/0x90 [kvm_amd]
     svm_enable_irq_window+0x26/0xa0 [kvm_amd]
     vcpu_enter_guest+0xbbe/0x1560 [kvm]
     ? avic_vcpu_load+0xd5/0x120 [kvm_amd]
     ? kvm_arch_vcpu_load+0x76/0x240 [kvm]
     ? svm_get_segment_base+0xa/0x10 [kvm_amd]
     kvm_arch_vcpu_ioctl_run+0x103/0x590 [kvm]
     kvm_vcpu_ioctl+0x22a/0x5d0 [kvm]
     __x64_sys_ioctl+0x84/0xc0
     do_syscall_64+0x33/0x40
     entry_SYSCALL_64_after_hwframe+0x44/0xae

Fixes by moving the initializing of AMD IOMMU interrupt remapping mode
(amd_iommu_guest_ir) earlier before setting up the
amd_iommu_irq_ops.capability with appropriate IRQ_POSTING_CAP flag.

[joro:	Squashed the two patches and limited
	check_features_on_all_iommus() to CONFIG_IRQ_REMAP
	to fix a compile warning.]

Signed-off-by: Wei Huang <wei.huang2@amd.com>
Co-developed-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Link: https://lore.kernel.org/r/20210820202957.187572-2-suravee.suthikulpanit@amd.com
Link: https://lore.kernel.org/r/20210820202957.187572-3-suravee.suthikulpanit@amd.com
Fixes: 8bda0cfbdc ("iommu/amd: Detect and initialize guest vAPIC log")
Signed-off-by: Joerg Roedel <jroedel@suse.de>
2021-09-09 13:15:02 +02:00
Helge Deller
d97180ad68 parisc: Mark sched_clock unstable only if clocks are not syncronized
We check at runtime if the cr16 clocks are stable across CPUs. Only mark
the sched_clock unstable by calling clear_sched_clock_stable() if we
know that we run on a system which isn't syncronized across CPUs.

Signed-off-by: Helge Deller <deller@gmx.de>
2021-09-09 12:44:31 +02:00
Guenter Roeck
907872baa9 parisc: Move pci_dev_is_behind_card_dino to where it is used
parisc build test images fail to compile with the following error.

drivers/parisc/dino.c:160:12: error:
	'pci_dev_is_behind_card_dino' defined but not used

Move the function just ahead of its only caller to avoid the error.

Fixes: 5fa1659105 ("parisc: Disable HP HSC-PCI Cards to prevent kernel crash")
Cc: Helge Deller <deller@gmx.de>
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Helge Deller <deller@gmx.de>
2021-09-09 12:44:31 +02:00
Helge Deller
e4f2006f12 parisc: Reduce sigreturn trampoline to 3 instructions
We can move the INSN_LDI_R20 instruction into the branch delay slot.

Signed-off-by: Helge Deller <deller@gmx.de>
2021-09-09 12:44:31 +02:00
Helge Deller
3e4a1aff2a parisc: Check user signal stack trampoline is inside TASK_SIZE
Add some additional checks to ensure the signal stack is inside
userspace bounds.

Signed-off-by: Helge Deller <deller@gmx.de>
2021-09-09 12:44:31 +02:00
Helge Deller
ea4b3fca18 parisc: Drop useless debug info and comments from signal.c
Signed-off-by: Helge Deller <deller@gmx.de>
2021-09-09 12:44:30 +02:00
Helge Deller
1260dea6d2 parisc: Drop strnlen_user() in favour of generic version
As suggested by Arnd Bergmann, drop the parisc version of
strnlen_user() and switch to the generic version.

Suggested-by: Arnd Bergmann <arnd@kernel.org>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Helge Deller <deller@gmx.de>
2021-09-09 12:44:30 +02:00
Helge Deller
3da6379a6d parisc: Add missing FORCE prerequisite in Makefile
Signed-off-by: Helge Deller <deller@gmx.de>
2021-09-09 12:44:30 +02:00
Guenter Roeck
e011912651 net: ni65: Avoid typecast of pointer to u32
Building alpha:allmodconfig results in the following error.

drivers/net/ethernet/amd/ni65.c: In function 'ni65_stop_start':
drivers/net/ethernet/amd/ni65.c:751:37: error:
	cast from pointer to integer of different size
		buffer[i] = (u32) isa_bus_to_virt(tmdp->u.buffer);

'buffer[]' is declared as unsigned long, so replace the typecast to u32
with a typecast to unsigned long to fix the problem.

Cc: Arnd Bergmann <arnd@kernel.org>
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-09-09 11:21:19 +01:00
David S. Miller
e3a843f98c Merge branch 'sfx-xdp-fallback-tx-queues'
Íñigo Huguet says:

====================
sfc: fallback for lack of xdp tx queues

If there are not enough hardware resources to allocate one tx queue per
CPU for XDP, XDP_TX and XDP_REDIRECT actions were unavailable, and using
them resulted each time with the packet being drop and this message in
the logs: XDP TX failed (-22)

These patches implement 2 fallback solutions for 2 different situations
that might happen:
1. There are not enough free resources for all the tx queues, but there
   are some free resources available
2. There are not enough free resources at all for tx queues.

Both solutions are based in sharing tx queues, using __netif_tx_lock for
synchronization. In the second case, as there are not XDP TX queues to
share, network stack queues are used instead, but since we're taking
__netif_tx_lock, concurrent access to the queues is correctly protected.

The solution for this second case might affect performance both of XDP
traffic and normal traffice due to lock contention if both are used
intensively. That's why I call it a "last resort" fallback: it's not a
desirable situation, but at least we have XDP TX working.

Some tests has shown good results and indicate that the non-fallback
case is not being damaged by this changes. They are also promising for
the fallback cases. This is the test:
1. From another machine, send high amount of packets with pktgen, script
   samples/pktgen/pktgen_sample04_many_flows.sh
2. In the tested machine, run samples/bpf/xdp_rxq_info with arguments
   "-a XDP_TX --swapmac" and see the results
3. In the tested machine, run also pktgen_sample04 to create high TX
   normal traffic, and see how xdp_rxq_info results vary

Note that this test doesn't check the worst situations for the fallback
solutions because XDP_TX will only be executed from the same CPUs that
are processed by sfc, and not from every CPU in the system, so the
performance drop due to the highest locking contention doesn't happen.
I'd like to test that, as well, but I don't have access right now to a
proper environment.

Test results:

Without doing TX:
Before changes: ~2,900,000 pps
After changes, 1 queues/core: ~2,900,000 pps
After changes, 2 queues/core: ~2,900,000 pps
After changes, 8 queues/core: ~2,900,000 pps
After changes, borrowing from network stack: ~2,900,000 pps

With multiflow TX at the same time:
Before changes: ~1,700,000 - 2,900,000 pps
After changes, 1 queues/core: ~1,700,000 - 2,900,000 pps
After changes, 2 queues/core: ~1,700,000 pps
After changes, 8 queues/core: ~1,700,000 pps
After changes, borrowing from network stack: 1,150,000 pps

Sporadic "XDP TX failed (-5)" warnings are shown when running xdp program
and pktgen simultaneously. This was expected because XDP doesn't have any
buffering system if the NIC is under very high pressure. Thousands of
these warnings are shown in the case of borrowing net stack queues. As I
said before, this was also expected.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2021-09-09 11:17:37 +01:00
Íñigo Huguet
6215b608a8 sfc: last resort fallback for lack of xdp tx queues
Previous patch addressed the situation of having some free resources for
xdp tx but not enough for one tx queue per CPU. This patch address the
worst case of not having resources at all for xdp tx.

Instead of using queues dedicated to xdp, normal queues used by network
stack are shared for both cases, using __netif_tx_lock for
synchronization. Also queue stop/restart must be considered in the xdp
path to avoid freezing the queue.

This is not the ideal situation we might want to be, and a performance
penalty is expected both for normal and xdp traffic, but at least XDP
will work in all possible situations (with a warning in the logs),
improving a bit the pain of not knowing in what situations we can use it
and in what situations we cannot.

Signed-off-by: Íñigo Huguet <ihuguet@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-09-09 11:17:37 +01:00
Íñigo Huguet
415446185b sfc: fallback for lack of xdp tx queues
If there are not enough resources to allocate one TX queue per core for
XDP TX it was completely disabled.

This patch implements a fallback solution for sharing the available
queues using __netif_tx_lock for synchronization. In the normal case that
there is one TX queue per CPU, no locking is done, as it was before.

With this fallback solution, XDP TX will work in much more cases that
were failing, specially in machines with many CPUs. It's hard for XDP
users to know what features are supported across different NICs and
configurations, so they will benefit on having wider support.

Signed-off-by: Íñigo Huguet <ihuguet@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-09-09 11:17:37 +01:00
Joakim Zhang
2a48d96fd5 net: stmmac: platform: fix build warning when with !CONFIG_PM_SLEEP
Use __maybe_unused for noirq_suspend()/noirq_resume() hooks to avoid
build warning with !CONFIG_PM_SLEEP:

>> drivers/net/ethernet/stmicro/stmmac/stmmac_platform.c:796:12: error: 'stmmac_pltfr_noirq_resume' defined but not used [-Werror=unused-function]
     796 | static int stmmac_pltfr_noirq_resume(struct device *dev)
         |            ^~~~~~~~~~~~~~~~~~~~~~~~~
>> drivers/net/ethernet/stmicro/stmmac/stmmac_platform.c:775:12: error: 'stmmac_pltfr_noirq_suspend' defined but not used [-Werror=unused-function]
     775 | static int stmmac_pltfr_noirq_suspend(struct device *dev)
         |            ^~~~~~~~~~~~~~~~~~~~~~~~~~
   cc1: all warnings being treated as errors

Fixes: 276aae3772 ("net: stmmac: fix system hang caused by eee_ctrl_timer during suspend/resume")
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Joakim Zhang <qiangqing.zhang@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-09-09 11:14:07 +01:00
Pavel Skripkin
3c10ffddc6 net: xfrm: fix shift-out-of-bounds in xfrm_get_default
Syzbot hit shift-out-of-bounds in xfrm_get_default. The problem was in
missing validation check for user data.

up->dirmask comes from user-space, so we need to check if this value
is less than XFRM_USERPOLICY_DIRMASK_MAX to avoid shift-out-of-bounds bugs.

Fixes: 2d151d3907 ("xfrm: Add possibility to set the default to block if we have no policy")
Reported-and-tested-by: syzbot+b2be9dd8ca6f6c73ee2d@syzkaller.appspotmail.com
Signed-off-by: Pavel Skripkin <paskripkin@gmail.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2021-09-09 12:12:17 +02:00
Xiyu Yang
9b6ff7eb66 net/l2tp: Fix reference count leak in l2tp_udp_recv_core
The reference count leak issue may take place in an error handling
path. If both conditions of tunnel->version == L2TP_HDR_VER_3 and the
return value of l2tp_v3_ensure_opt_in_linear is nonzero, the function
would directly jump to label invalid, without decrementing the reference
count of the l2tp_session object session increased earlier by
l2tp_tunnel_get_session(). This may result in refcount leaks.

Fix this issue by decrease the reference count before jumping to the
label invalid.

Fixes: 4522a70db7 ("l2tp: fix reading optional fields of L2TPv3")
Signed-off-by: Xiyu Yang <xiyuyang19@fudan.edu.cn>
Signed-off-by: Xin Xiong <xiongx18@fudan.edu.cn>
Signed-off-by: Xin Tan <tanxin.ctf@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-09-09 11:00:20 +01:00
Eric Dumazet
04f08eb44b net/af_unix: fix a data-race in unix_dgram_poll
syzbot reported another data-race in af_unix [1]

Lets change __skb_insert() to use WRITE_ONCE() when changing
skb head qlen.

Also, change unix_dgram_poll() to use lockless version
of unix_recvq_full()

It is verry possible we can switch all/most unix_recvq_full()
to the lockless version, this will be done in a future kernel version.

[1] HEAD commit: 8596e589b7

BUG: KCSAN: data-race in skb_queue_tail / unix_dgram_poll

write to 0xffff88814eeb24e0 of 4 bytes by task 25815 on cpu 0:
 __skb_insert include/linux/skbuff.h:1938 [inline]
 __skb_queue_before include/linux/skbuff.h:2043 [inline]
 __skb_queue_tail include/linux/skbuff.h:2076 [inline]
 skb_queue_tail+0x80/0xa0 net/core/skbuff.c:3264
 unix_dgram_sendmsg+0xff2/0x1600 net/unix/af_unix.c:1850
 sock_sendmsg_nosec net/socket.c:703 [inline]
 sock_sendmsg net/socket.c:723 [inline]
 ____sys_sendmsg+0x360/0x4d0 net/socket.c:2392
 ___sys_sendmsg net/socket.c:2446 [inline]
 __sys_sendmmsg+0x315/0x4b0 net/socket.c:2532
 __do_sys_sendmmsg net/socket.c:2561 [inline]
 __se_sys_sendmmsg net/socket.c:2558 [inline]
 __x64_sys_sendmmsg+0x53/0x60 net/socket.c:2558
 do_syscall_x64 arch/x86/entry/common.c:50 [inline]
 do_syscall_64+0x3d/0x90 arch/x86/entry/common.c:80
 entry_SYSCALL_64_after_hwframe+0x44/0xae

read to 0xffff88814eeb24e0 of 4 bytes by task 25834 on cpu 1:
 skb_queue_len include/linux/skbuff.h:1869 [inline]
 unix_recvq_full net/unix/af_unix.c:194 [inline]
 unix_dgram_poll+0x2bc/0x3e0 net/unix/af_unix.c:2777
 sock_poll+0x23e/0x260 net/socket.c:1288
 vfs_poll include/linux/poll.h:90 [inline]
 ep_item_poll fs/eventpoll.c:846 [inline]
 ep_send_events fs/eventpoll.c:1683 [inline]
 ep_poll fs/eventpoll.c:1798 [inline]
 do_epoll_wait+0x6ad/0xf00 fs/eventpoll.c:2226
 __do_sys_epoll_wait fs/eventpoll.c:2238 [inline]
 __se_sys_epoll_wait fs/eventpoll.c:2233 [inline]
 __x64_sys_epoll_wait+0xf6/0x120 fs/eventpoll.c:2233
 do_syscall_x64 arch/x86/entry/common.c:50 [inline]
 do_syscall_64+0x3d/0x90 arch/x86/entry/common.c:80
 entry_SYSCALL_64_after_hwframe+0x44/0xae

value changed: 0x0000001b -> 0x00000001

Reported by Kernel Concurrency Sanitizer on:
CPU: 1 PID: 25834 Comm: syz-executor.1 Tainted: G        W         5.14.0-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011

Fixes: 86b18aaa2b ("skbuff: fix a data race in skb_queue_len()")
Cc: Qian Cai <cai@lca.pw>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-09-09 10:57:52 +01:00
Tong Zhang
d82d5303c4 net: macb: fix use after free on rmmod
plat_dev->dev->platform_data is released by platform_device_unregister(),
use of pclk and hclk is a use-after-free. Since device unregister won't
need a clk device we adjust the function call sequence to fix this issue.

[   31.261225] BUG: KASAN: use-after-free in macb_remove+0x77/0xc6 [macb_pci]
[   31.275563] Freed by task 306:
[   30.276782]  platform_device_release+0x25/0x80

Suggested-by: Nicolas Ferre <Nicolas.Ferre@microchip.com>
Signed-off-by: Tong Zhang <ztong0001@gmail.com>
Acked-by: Nicolas Ferre <nicolas.ferre@microchip.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-09-09 10:55:44 +01:00
Sukadev Bhattiprolu
273c29e944 ibmvnic: check failover_pending in login response
If a failover occurs before a login response is received, the login
response buffer maybe undefined. Check that there was no failover
before accessing the login response buffer.

Fixes: 032c5e8284 ("Driver for IBM System i/p VNIC protocol")
Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-09-09 10:54:55 +01:00
Paolo Abeni
3c4cea8fa7 vhost_net: fix OoB on sendmsg() failure.
If the sendmsg() call in vhost_tx_batch() fails, both the 'batched_xdp'
and 'done_idx' indexes are left unchanged. If such failure happens
when batched_xdp == VHOST_NET_BATCH, the next call to
vhost_net_build_xdp() will access and write memory outside the xdp
buffers area.

Since sendmsg() can only error with EBADFD, this change addresses the
issue explicitly freeing the XDP buffers batch on error.

Fixes: 0a0be13b8f ("vhost_net: batch submitting XDP buffers to underlayer sockets")
Suggested-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-09-09 10:52:12 +01:00
Thomas Gleixner
868ad33bfa sched: Prevent balance_push() on remote runqueues
sched_setscheduler() and rt_mutex_setprio() invoke the run-queue balance
callback after changing priorities or the scheduling class of a task. The
run-queue for which the callback is invoked can be local or remote.

That's not a problem for the regular rq::push_work which is serialized with
a busy flag in the run-queue struct, but for the balance_push() work which
is only valid to be invoked on the outgoing CPU that's wrong. It not only
triggers the debug warning, but also leaves the per CPU variable push_work
unprotected, which can result in double enqueues on the stop machine list.

Remove the warning and validate that the function is invoked on the
outgoing CPU.

Fixes: ae79270232 ("sched: Optimize finish_lock_switch()")
Reported-by: Sebastian Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/87zgt1hdw7.ffs@tglx
2021-09-09 11:27:23 +02:00
Sebastian Andrzej Siewior
9848417926 sched/idle: Make the idle timer expire in hard interrupt context
The intel powerclamp driver will setup a per-CPU worker with RT
priority. The worker will then invoke play_idle() in which it remains in
the idle poll loop until it is stopped by the timer it started earlier.

That timer needs to expire in hard interrupt context on PREEMPT_RT.
Otherwise the timer will expire in ksoftirqd as a SOFT timer but that task
won't be scheduled on the CPU because its priority is lower than the
priority of the worker which is in the idle loop.

Always expire the idle timer in hard interrupt context.

Reported-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20210906113034.jgfxrjdvxnjqgtmc@linutronix.de
2021-09-09 10:36:16 +02:00
Peter Zijlstra
e548057270 locking/rtmutex: Fix ww_mutex deadlock check
Dan reported that rt_mutex_adjust_prio_chain() can be called with
.orig_waiter == NULL however commit a055fcc132 ("locking/rtmutex: Return
success on deadlock for ww_mutex waiters") unconditionally dereferences it.

Since both call-sites that have .orig_waiter == NULL don't care for the
return value, simply disable the deadlock squash by adding the NULL check.

Notably, both callers use the deadlock condition as a termination condition
for the iteration; once detected, it is sure that (de)boosting is done.
Arguably step [3] would be a more natural termination point, but it's
dubious whether adding a third deadlock detection state would improve the
code.

Fixes: a055fcc132 ("locking/rtmutex: Return success on deadlock for ww_mutex waiters")
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Link: https://lore.kernel.org/r/YS9La56fHMiCCo75@hirez.programming.kicks-ass.net
2021-09-09 10:31:22 +02:00
Yu-Tung Chang
0c45d3e24e rtc: rx8010: select REGMAP_I2C
The rtc-rx8010 uses the I2C regmap but doesn't select it in Kconfig so
depending on the configuration the build may fail. Fix it.

Signed-off-by: Yu-Tung Chang <mtwget@gmail.com>
Signed-off-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
Link: https://lore.kernel.org/r/20210830052532.40356-1-mtwget@gmail.com
2021-09-09 10:18:40 +02:00
Guenter Roeck
0c5483a577 Input: analog - always use ktime functions
m68k, mips, s390, and sparc allmodconfig images fail to build with the
following error.

drivers/input/joystick/analog.c:160:2: error:
	#warning Precise timer not defined for this architecture.

Remove architecture specific time handling code and always use ktime
functions to determine time deltas. Also remove the now useless use_ktime
kernel parameter.

Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Reviewed-by: Geert Uytterhoeven <geert+renesas@glider.be>
Acked-by: Randy Dunlap <rdunlap@infradead.org> # build-tested
Link: https://lore.kernel.org/r/20210907123734.21520-1-linux@roeck-us.net
Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
2021-09-08 23:39:47 -07:00
Steve French
8d014f5fe9 cifs: move SMB FSCTL definitions to common code
The FSCTL definitions are in smbfsctl.h which should be
shared by client and server.  Move the updated version of
smbfsctl.h into smbfs_common and have the client code use
it (subsequent patch will change the server to use this
common version of the header).

Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2021-09-09 00:09:20 -05:00
Steve French
23e91d8b7c cifs: rename cifs_common to smbfs_common
As we move to common code between client and server, we have
been asked to make the names less confusing, and refer less
to "cifs" and more to words which include "smb" instead to
e.g. "smbfs" for the client (we already have "ksmbd" for the
kernel server, and "smbd" for the user space Samba daemon).
So to be more consistent in the naming of common code between
client and server and reduce the risk of merge conflicts as
more common code is added - rename "cifs_common" to
"smbfs_common" (in future releases we also will rename
the fs/cifs directory to fs/smbfs)

Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2021-09-08 23:59:26 -05:00
Steve French
fc111fb9a6 cifs: update FSCTL definitions
Add some missing defines used by ksmbd to the client
version of smbfsctl.h, and add a missing newer define
mentioned in the protocol definitions (MS-FSCC).

This will also make it easier to move to common code.

Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2021-09-08 23:36:31 -05:00
Dave Airlie
de04744d65 drm-misc-next-fixes for v5.15:
- Fix ttm_bo_move_memcpy() when ttm_resource is subclassed.
 - Small fixes to panfrost, mgag200, vc4.
 - Small ttm compilation fixes.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEuXvWqAysSYEJGuVH/lWMcqZwE8MFAmEx7NgACgkQ/lWMcqZw
 E8OFdQ/8DqIpEr1E4ED/UA99KTUV073je8Oo4/hMKOOPFrCP/8N6+DT0VcsSwWee
 FK4zJhYJsMaLlmBFqJQSmDIop29FyxUzX+bPD0o14Np+0SV/tXWwxQPy5+1h81YY
 0dhUXMRbehnhthU7+36CgGQR1DXZSnR4Af9qGkUUzfnfGBY4qYFiSYv9gwQaxNez
 0YPlwjXcabCclpo51KkJTAZN390r6FoHI86bEDnzfyEfed0fKanuSFMzahSHWV1o
 jE25hofuxjHWaGmeDeGyspLHrU0scxwCIWLp8W05gnCHtI6YINbVBNmLmA/YchRc
 BCqxptdGbFl5raitrcF/mHK1zI6wOzP8I16tYvWZugGkuRrsmfuqb9c9+CfGwSYp
 mAkKXpkaVoi179gXUk7Qqf1Z/cge6mAgRgSD9lB4wG6P6TBYiFxZPJRSF3NAqLke
 9811SELBSyOvLXzqgjpEVdF6JKL7G+sl0DjHwR8JR5++i73o0Gux96Ufg3vNRg09
 4whGCCTNZq4wj/apty8ZK/hhGJGDu1B8Yrs7xpeXHEk3XbmwB2jA7qh3/SzFJ4tF
 bRIl5USiZTH2lWtOLswo2JT4Yk/EJsxsSoz6u7HJYzPz9QsRGcNs762sQobY1fKd
 KcdT1Vx5i5j226zaGe/VyooQk/9Z046NTaH4QRuBWUgsFSut41M=
 =2OzY
 -----END PGP SIGNATURE-----

Merge tag 'drm-misc-next-fixes-2021-09-03' of git://anongit.freedesktop.org/drm/drm-misc into drm-next

drm-misc-next-fixes for v5.15:
- Fix ttm_bo_move_memcpy() when ttm_resource is subclassed.
- Small fixes to panfrost, mgag200, vc4.
- Small ttm compilation fixes.

Signed-off-by: Dave Airlie <airlied@redhat.com>

From: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/41ff5e54-0837-2226-a182-97ffd11ef01e@linux.intel.com
2021-09-09 13:35:54 +10:00
Dave Airlie
06b224d516 Merge tag 'amd-drm-next-5.15-2021-09-01' of https://gitlab.freedesktop.org/agd5f/linux into drm-next
amd-drm-next-5.15-2021-09-01:

amdgpu:
- Misc cleanups, typo fixes
- EEPROM fix
- Add some new PCI IDs
- Scatter/Gather display support for Yellow Carp
- PCIe DPM fix for RKL platforms
- RAS fix

amdkfd:
- SVM fix

Signed-off-by: Dave Airlie <airlied@redhat.com>
From: Alex Deucher <alexander.deucher@amd.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20210901214015.4488-1-alexander.deucher@amd.com
2021-09-09 13:33:49 +10:00
Jens Axboe
3b33e3f4a6 io-wq: fix silly logic error in io_task_work_match()
We check for the func with an OR condition, which means it always ends
up being false and we never match the task_work we want to cancel. In
the unexpected case that we do exit with that pending, we can trigger
a hang waiting for a worker to exit, but it was never created. syzbot
reports that as such:

INFO: task syz-executor687:8514 blocked for more than 143 seconds.
      Not tainted 5.14.0-syzkaller #0
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:syz-executor687 state:D stack:27296 pid: 8514 ppid:  8479 flags:0x00024004
Call Trace:
 context_switch kernel/sched/core.c:4940 [inline]
 __schedule+0x940/0x26f0 kernel/sched/core.c:6287
 schedule+0xd3/0x270 kernel/sched/core.c:6366
 schedule_timeout+0x1db/0x2a0 kernel/time/timer.c:1857
 do_wait_for_common kernel/sched/completion.c:85 [inline]
 __wait_for_common kernel/sched/completion.c:106 [inline]
 wait_for_common kernel/sched/completion.c:117 [inline]
 wait_for_completion+0x176/0x280 kernel/sched/completion.c:138
 io_wq_exit_workers fs/io-wq.c:1162 [inline]
 io_wq_put_and_exit+0x40c/0xc70 fs/io-wq.c:1197
 io_uring_clean_tctx fs/io_uring.c:9607 [inline]
 io_uring_cancel_generic+0x5fe/0x740 fs/io_uring.c:9687
 io_uring_files_cancel include/linux/io_uring.h:16 [inline]
 do_exit+0x265/0x2a30 kernel/exit.c:780
 do_group_exit+0x125/0x310 kernel/exit.c:922
 get_signal+0x47f/0x2160 kernel/signal.c:2868
 arch_do_signal_or_restart+0x2a9/0x1c40 arch/x86/kernel/signal.c:865
 handle_signal_work kernel/entry/common.c:148 [inline]
 exit_to_user_mode_loop kernel/entry/common.c:172 [inline]
 exit_to_user_mode_prepare+0x17d/0x290 kernel/entry/common.c:209
 __syscall_exit_to_user_mode_work kernel/entry/common.c:291 [inline]
 syscall_exit_to_user_mode+0x19/0x60 kernel/entry/common.c:302
 do_syscall_64+0x42/0xb0 arch/x86/entry/common.c:86
 entry_SYSCALL_64_after_hwframe+0x44/0xae
RIP: 0033:0x445cd9
RSP: 002b:00007fc657f4b308 EFLAGS: 00000246 ORIG_RAX: 00000000000000ca
RAX: 0000000000000001 RBX: 00000000004cb448 RCX: 0000000000445cd9
RDX: 00000000000f4240 RSI: 0000000000000081 RDI: 00000000004cb44c
RBP: 00000000004cb440 R08: 000000000000000e R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 000000000049b154
R13: 0000000000000003 R14: 00007fc657f4b400 R15: 0000000000022000

While in there, also decrement accr->nr_workers. This isn't strictly
needed as we're exiting, but let's make sure the accounting matches up.

Fixes: 3146cba99a ("io-wq: make worker creation resilient against signals")
Reported-by: syzbot+f62d3e0a4ea4f38f5326@syzkaller.appspotmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-09-08 19:57:26 -06:00
Linus Torvalds
a3fa7a101d Merge branches 'akpm' and 'akpm-hotfixes' (patches from Andrew)
Merge yet more updates and hotfixes from Andrew Morton:
 "Post-linux-next material, based upon latest upstream to catch the
  now-merged dependencies:

   - 10 patches.

     Subsystems affected by this patch series: mm (vmstat and migration)
     and compat.

  And bunch of hotfixes, mostly cc:stable:

   - 8 patches.

     Subsystems affected by this patch series: mm (hmm, hugetlb, vmscan,
     pagealloc, pagemap, kmemleak, mempolicy, and memblock)"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
  arch: remove compat_alloc_user_space
  compat: remove some compat entry points
  mm: simplify compat numa syscalls
  mm: simplify compat_sys_move_pages
  kexec: avoid compat_alloc_user_space
  kexec: move locking into do_kexec_load
  mm: migrate: change to use bool type for 'page_was_mapped'
  mm: migrate: fix the incorrect function name in comments
  mm: migrate: introduce a local variable to get the number of pages
  mm/vmstat: protect per cpu variables with preempt disable on RT

* emailed hotfixes from Andrew Morton <akpm@linux-foundation.org>:
  nds32/setup: remove unused memblock_region variable in setup_memory()
  mm/mempolicy: fix a race between offset_il_node and mpol_rebind_task
  mm/kmemleak: allow __GFP_NOLOCKDEP passed to kmemleak's gfp
  mmap_lock: change trace and locking order
  mm/page_alloc.c: avoid accessing uninitialized pcp page migratetype
  mm,vmscan: fix divide by zero in get_scan_count
  mm/hugetlb: initialize hugetlb_usage in mm_init
  mm/hmm: bypass devmap pte when all pfn requested flags are fulfilled
2021-09-08 18:52:05 -07:00
Mike Rapoport
ddb13122aa nds32/setup: remove unused memblock_region variable in setup_memory()
kernel test robot reports unused variable warning:

   arch/nds32/kernel/setup.c:247:26: warning: Unused variable: region
   [unusedVariable]
    struct memblock_region *region;
                            ^

Remove the unused variable.

Link: https://lkml.kernel.org/r/20210712125218.28951-1-rppt@kernel.org
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Reported-by: kernel test robot <lkp@intel.com>
Reviewed-by: Guenter Roeck <linux@roeck-us.net>
Tested-by: Guenter Roeck <linux@roeck-us.net>
Cc: Greentime Hu <green.hu@gmail.com>
Cc: Nick Hu <nickhu@andestech.com>
Cc: Vincent Chen <deanbo422@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-08 18:45:53 -07:00
yanghui
276aeee1c5 mm/mempolicy: fix a race between offset_il_node and mpol_rebind_task
Servers happened below panic:

  Kernel version:5.4.56
  BUG: unable to handle page fault for address: 0000000000002c48
  RIP: 0010:__next_zones_zonelist+0x1d/0x40
  Call Trace:
    __alloc_pages_nodemask+0x277/0x310
    alloc_page_interleave+0x13/0x70
    handle_mm_fault+0xf99/0x1390
    __do_page_fault+0x288/0x500
    do_page_fault+0x30/0x110
    page_fault+0x3e/0x50

The reason for the panic is that MAX_NUMNODES is passed in the third
parameter in __alloc_pages_nodemask(preferred_nid).  So access to
zonelist->zoneref->zone_idx in __next_zones_zonelist will cause a panic.

In offset_il_node(), first_node() returns nid from pol->v.nodes, after
this other threads may chang pol->v.nodes before next_node().  This race
condition will let next_node return MAX_NUMNODES.  So put pol->nodes in
a local variable.

The race condition is between offset_il_node and cpuset_change_task_nodemask:

  CPU0:                                     CPU1:
  alloc_pages_vma()
    interleave_nid(pol,)
      offset_il_node(pol,)
        first_node(pol->v.nodes)            cpuset_change_task_nodemask
                        //nodes==0xc          mpol_rebind_task
                                                mpol_rebind_policy
                                                  mpol_rebind_nodemask(pol,nodes)
                        //nodes==0x3
        next_node(nid, pol->v.nodes)//return MAX_NUMNODES

Link: https://lkml.kernel.org/r/20210906034658.48721-1-yanghui.def@bytedance.com
Signed-off-by: yanghui <yanghui.def@bytedance.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-08 18:45:53 -07:00
Naohiro Aota
79d3705040 mm/kmemleak: allow __GFP_NOLOCKDEP passed to kmemleak's gfp
In a memory pressure situation, I'm seeing the lockdep WARNING below.
Actually, this is similar to a known false positive which is already
addressed by commit 6dcde60efd ("xfs: more lockdep whackamole with
kmem_alloc*").

This warning still persists because it's not from kmalloc() itself but
from an allocation for kmemleak object.  While kmalloc() itself suppress
the warning with __GFP_NOLOCKDEP, gfp_kmemleak_mask() is dropping the
flag for the kmemleak's allocation.

Allow __GFP_NOLOCKDEP to be passed to kmemleak's allocation, so that the
warning for it is also suppressed.

  ======================================================
  WARNING: possible circular locking dependency detected
  5.14.0-rc7-BTRFS-ZNS+ #37 Not tainted
  ------------------------------------------------------
  kswapd0/288 is trying to acquire lock:
  ffff88825ab45df0 (&xfs_nondir_ilock_class){++++}-{3:3}, at: xfs_ilock+0x8a/0x250

  but task is already holding lock:
  ffffffff848cc1e0 (fs_reclaim){+.+.}-{0:0}, at: __fs_reclaim_acquire+0x5/0x30

  which lock already depends on the new lock.

  the existing dependency chain (in reverse order) is:

  -> #1 (fs_reclaim){+.+.}-{0:0}:
         fs_reclaim_acquire+0x112/0x160
         kmem_cache_alloc+0x48/0x400
         create_object.isra.0+0x42/0xb10
         kmemleak_alloc+0x48/0x80
         __kmalloc+0x228/0x440
         kmem_alloc+0xd3/0x2b0
         kmem_alloc_large+0x5a/0x1c0
         xfs_attr_copy_value+0x112/0x190
         xfs_attr_shortform_getvalue+0x1fc/0x300
         xfs_attr_get_ilocked+0x125/0x170
         xfs_attr_get+0x329/0x450
         xfs_get_acl+0x18d/0x430
         get_acl.part.0+0xb6/0x1e0
         posix_acl_xattr_get+0x13a/0x230
         vfs_getxattr+0x21d/0x270
         getxattr+0x126/0x310
         __x64_sys_fgetxattr+0x1a6/0x2a0
         do_syscall_64+0x3b/0x90
         entry_SYSCALL_64_after_hwframe+0x44/0xae

  -> #0 (&xfs_nondir_ilock_class){++++}-{3:3}:
         __lock_acquire+0x2c0f/0x5a00
         lock_acquire+0x1a1/0x4b0
         down_read_nested+0x50/0x90
         xfs_ilock+0x8a/0x250
         xfs_can_free_eofblocks+0x34f/0x570
         xfs_inactive+0x411/0x520
         xfs_fs_destroy_inode+0x2c8/0x710
         destroy_inode+0xc5/0x1a0
         evict+0x444/0x620
         dispose_list+0xfe/0x1c0
         prune_icache_sb+0xdc/0x160
         super_cache_scan+0x31e/0x510
         do_shrink_slab+0x337/0x8e0
         shrink_slab+0x362/0x5c0
         shrink_node+0x7a7/0x1a40
         balance_pgdat+0x64e/0xfe0
         kswapd+0x590/0xa80
         kthread+0x38c/0x460
         ret_from_fork+0x22/0x30

  other info that might help us debug this:
   Possible unsafe locking scenario:
         CPU0                    CPU1
         ----                    ----
    lock(fs_reclaim);
                                 lock(&xfs_nondir_ilock_class);
                                 lock(fs_reclaim);
    lock(&xfs_nondir_ilock_class);

   *** DEADLOCK ***
  3 locks held by kswapd0/288:
   #0: ffffffff848cc1e0 (fs_reclaim){+.+.}-{0:0}, at: __fs_reclaim_acquire+0x5/0x30
   #1: ffffffff848a08d8 (shrinker_rwsem){++++}-{3:3}, at: shrink_slab+0x269/0x5c0
   #2: ffff8881a7a820e8 (&type->s_umount_key#60){++++}-{3:3}, at: super_cache_scan+0x5a/0x510

Link: https://lkml.kernel.org/r/20210907055659.3182992-1-naohiro.aota@wdc.com
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: "Darrick J . Wong" <djwong@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-08 18:45:53 -07:00
Liam Howlett
1099431608 mmap_lock: change trace and locking order
Print to the trace log before releasing the lock to avoid racing with
other trace log printers of the same lock type.

Link: https://lkml.kernel.org/r/20210903022041.1843024-1-Liam.Howlett@oracle.com
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Suggested-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Michel Lespinasse <walken.cr@gmail.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-08 18:45:53 -07:00
Miaohe Lin
053cfda102 mm/page_alloc.c: avoid accessing uninitialized pcp page migratetype
If it's not prepared to free unref page, the pcp page migratetype is
unset.  Thus we will get rubbish from get_pcppage_migratetype() and
might list_del(&page->lru) again after it's already deleted from the list
leading to grumble about data corruption.

Link: https://lkml.kernel.org/r/20210902115447.57050-1-linmiaohe@huawei.com
Fixes: df1acc8569 ("mm/page_alloc: avoid conflating IRQs disabled with zone->lock")
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-08 18:45:53 -07:00
Rik van Riel
32d4f4b782 mm,vmscan: fix divide by zero in get_scan_count
Commit f56ce412a5 ("mm: memcontrol: fix occasional OOMs due to
proportional memory.low reclaim") introduced a divide by zero corner
case when oomd is being used in combination with cgroup memory.low
protection.

When oomd decides to kill a cgroup, it will force the cgroup memory to
be reclaimed after killing the tasks, by writing to the memory.max file
for that cgroup, forcing the remaining page cache and reclaimable slab
to be reclaimed down to zero.

Previously, on cgroups with some memory.low protection that would result
in the memory being reclaimed down to the memory.low limit, or likely
not at all, having the page cache reclaimed asynchronously later.

With f56ce412a5 the oomd write to memory.max tries to reclaim all the
way down to zero, which may race with another reclaimer, to the point of
ending up with the divide by zero below.

This patch implements the obvious fix.

Link: https://lkml.kernel.org/r/20210826220149.058089c6@imladris.surriel.com
Fixes: f56ce412a5 ("mm: memcontrol: fix occasional OOMs due to proportional memory.low reclaim")
Signed-off-by: Rik van Riel <riel@surriel.com>
Acked-by: Roman Gushchin <guro@fb.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Chris Down <chris@chrisdown.name>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-08 18:45:53 -07:00
Liu Zixian
13db8c5047 mm/hugetlb: initialize hugetlb_usage in mm_init
After fork, the child process will get incorrect (2x) hugetlb_usage.  If
a process uses 5 2MB hugetlb pages in an anonymous mapping,

	HugetlbPages:	   10240 kB

and then forks, the child will show,

	HugetlbPages:	   20480 kB

The reason for double the amount is because hugetlb_usage will be copied
from the parent and then increased when we copy page tables from parent
to child.  Child will have 2x actual usage.

Fix this by adding hugetlb_count_init in mm_init.

Link: https://lkml.kernel.org/r/20210826071742.877-1-liuzixian4@huawei.com
Fixes: 5d317b2b65 ("mm: hugetlb: proc: add HugetlbPages field to /proc/PID/status")
Signed-off-by: Liu Zixian <liuzixian4@huawei.com>
Reviewed-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-08 18:45:53 -07:00