linux/drivers
Zhao Heming 05cafe5ad8 md/cluster: fix deadlock when node is doing resync job
commit bca5b06580 upstream.

md-cluster uses MD_CLUSTER_SEND_LOCK so that a node can send messages
exclusively. While sending a message, a node can concurrently receive a
message from another node. When a node runs a resync job, grabbing
token_lockres:EX may trigger a deadlock:
```
nodeA                       nodeB
--------------------     --------------------
a.
send METADATA_UPDATED
held token_lockres:EX
                         b.
                         md_do_sync
                          resync_info_update
                            send RESYNCING
                             + set MD_CLUSTER_SEND_LOCK
                             + wait for holding token_lockres:EX

                         c.
                         mdadm /dev/md0 --remove /dev/sdg
                          + held reconfig_mutex
                          + send REMOVE
                             + wait_event(MD_CLUSTER_SEND_LOCK)

                         d.
                         recv_daemon //METADATA_UPDATED from A
                          process_metadata_update
                           + (mddev_trylock(mddev) ||
                              MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD)
                             //this time, both return false forever
```
Explanation:
a. A sends METADATA_UPDATED.
   This blocks other nodes from sending messages.

b. B runs a sync job, which sends RESYNCING at intervals.
   This blocks while waiting to hold the token_lockres:EX lock.

c. B runs "mdadm --remove", which sends REMOVE.
   This is blocked by step <b>: MD_CLUSTER_SEND_LOCK is 1.

d. B receives the METADATA_UPDATED msg that A sent in step <a>.
   This is blocked by step <c>: mdadm holds the mddev lock, so the
   wait_event can never take the mddev lock. (btw,
   MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD stays ZERO in this scenario.)
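
The four steps above form a circular wait. The following is a minimal
userspace sketch in C of that wait-for chain, not kernel code. The task
labels and the helpers deadlock_scenario and has_wait_cycle are
hypothetical. The edge from node A back to B's receive thread (A keeps
token_lockres:EX until B has processed its message) is an assumption
that the steps imply but do not spell out.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical labels for the four parties in the diagram. */
enum task { NODE_A_SENDER, B_RESYNC, B_MDADM, B_RECVD, NTASKS };

#define RUNNABLE (-1)	/* marks a task that is not blocked on anyone */

/* Fill waits_for[] so that task t is blocked on task waits_for[t]. */
static void deadlock_scenario(int waits_for[NTASKS])
{
	/* Assumption: A holds the token until B's recv thread handles the msg. */
	waits_for[NODE_A_SENDER] = B_RECVD;
	/* Step b: resync holds SEND_LOCK and waits for token_lockres:EX. */
	waits_for[B_RESYNC] = NODE_A_SENDER;
	/* Step c: mdadm holds reconfig_mutex and waits for SEND_LOCK. */
	waits_for[B_MDADM] = B_RESYNC;
	/* Step d: the recv thread needs the mddev lock that mdadm holds. */
	waits_for[B_RECVD] = B_MDADM;
}

/* Follow the wait-for chain from start; true if it leads back to start. */
static bool has_wait_cycle(const int waits_for[], int ntasks, int start)
{
	int cur = start;

	assert(ntasks > 0 && start >= 0 && start < ntasks);
	for (int steps = 0; steps < ntasks; steps++) {
		cur = waits_for[cur];
		if (cur == RUNNABLE)
			return false;
		if (cur == start)
			return true;
	}
	return false;
}
```

Calling deadlock_scenario on an int array of NTASKS entries and then
has_wait_cycle(w, NTASKS, B_RECVD) reports the cycle. Marking B_RECVD
as RUNNABLE, which is what a set MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD
would allow, removes it.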

There is a similar deadlock in commit 0ba959774e
("md-cluster: use sync way to handle METADATA_UPDATED msg").
In that commit, step c is "update sb". In this patch, step c is
"mdadm --remove".

To fix this issue, we can follow the approach used by
metadata_update_start, which performs the same lock_token action.
lock_comm can use the same steps to avoid the deadlock, by moving
MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD from lock_token to lock_comm.
This slightly enlarges the window in which
MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD is set, but it is safe and breaks
the deadlock.
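
As a rough illustration of why setting the flag earlier helps, here is
a single-threaded userspace model in C. It is a sketch under stated
assumptions, not the patch itself: struct cluster_model,
recvd_can_proceed, lock_comm_begin and lock_comm_end are hypothetical
names, the bit values are arbitrary, and the real code uses atomic bit
operations and wait queues that this model leaves out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Arbitrary bit positions for this model only. */
#define SEND_LOCK               (1u << 0)
#define HOLDING_MUTEX_FOR_RECVD (1u << 1)

struct cluster_model {
	unsigned int state;		/* flag bits defined above */
	bool reconfig_mutex_held;	/* true while mdadm holds the mddev lock */
};

/*
 * The condition the receive thread waits on in the diagram:
 * mddev_trylock() succeeds, or the HOLDING flag is set.
 */
static bool recvd_can_proceed(const struct cluster_model *m)
{
	assert(m != NULL);
	return !m->reconfig_mutex_held ||
	       (m->state & HOLDING_MUTEX_FOR_RECVD) != 0;
}

/*
 * Modelled start of lock_comm: publish the flag before waiting for
 * SEND_LOCK, but only when the caller already holds the mddev lock.
 * Returns true if this call set the flag and must clear it later.
 */
static bool lock_comm_begin(struct cluster_model *m, bool mddev_locked)
{
	assert(m != NULL);
	if (mddev_locked && !(m->state & HOLDING_MUTEX_FOR_RECVD)) {
		m->state |= HOLDING_MUTEX_FOR_RECVD;
		return true;
	}
	return false;
}

/* Modelled end of lock_comm: drop the flag once the token step is over. */
static void lock_comm_end(struct cluster_model *m, bool set_here)
{
	assert(m != NULL);
	if (set_here)
		m->state &= ~HOLDING_MUTEX_FOR_RECVD;
}
```

With state = SEND_LOCK and reconfig_mutex_held = true (the situation in
step c), recvd_can_proceed is false until lock_comm_begin(&m, true)
runs, after which the receive thread can handle METADATA_UPDATED and
node A can release the token.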

Repro steps (I triggered it only 3 times in hundreds of test runs):

Two nodes share 3 iSCSI LUNs: sdg/sdh/sdi. Each LUN is 1GB.
```
ssh root@node2 "mdadm -S --scan"
mdadm -S --scan
for i in {g,h,i};do dd if=/dev/zero of=/dev/sd$i oflag=direct bs=1M \
count=20; done

mdadm -C /dev/md0 -b clustered -e 1.2 -n 2 -l mirror /dev/sdg /dev/sdh \
 --bitmap-chunk=1M
ssh root@node2 "mdadm -A /dev/md0 /dev/sdg /dev/sdh"

sleep 5

mkfs.xfs /dev/md0
mdadm --manage --add /dev/md0 /dev/sdi
mdadm --wait /dev/md0
mdadm --grow --raid-devices=3 /dev/md0

mdadm /dev/md0 --fail /dev/sdg
mdadm /dev/md0 --remove /dev/sdg
mdadm --grow --raid-devices=2 /dev/md0
```

The test script hangs when executing "mdadm --remove".

```
 # dump stacks by "echo t > /proc/sysrq-trigger"
md0_cluster_rec D    0  5329      2 0x80004000
Call Trace:
 __schedule+0x1f6/0x560
 ? _cond_resched+0x2d/0x40
 ? schedule+0x4a/0xb0
 ? process_metadata_update.isra.0+0xdb/0x140 [md_cluster]
 ? wait_woken+0x80/0x80
 ? process_recvd_msg+0x113/0x1d0 [md_cluster]
 ? recv_daemon+0x9e/0x120 [md_cluster]
 ? md_thread+0x94/0x160 [md_mod]
 ? wait_woken+0x80/0x80
 ? md_congested+0x30/0x30 [md_mod]
 ? kthread+0x115/0x140
 ? __kthread_bind_mask+0x60/0x60
 ? ret_from_fork+0x1f/0x40

mdadm           D    0  5423      1 0x00004004
Call Trace:
 __schedule+0x1f6/0x560
 ? __schedule+0x1fe/0x560
 ? schedule+0x4a/0xb0
 ? lock_comm.isra.0+0x7b/0xb0 [md_cluster]
 ? wait_woken+0x80/0x80
 ? remove_disk+0x4f/0x90 [md_cluster]
 ? hot_remove_disk+0xb1/0x1b0 [md_mod]
 ? md_ioctl+0x50c/0xba0 [md_mod]
 ? wait_woken+0x80/0x80
 ? blkdev_ioctl+0xa2/0x2a0
 ? block_ioctl+0x39/0x40
 ? ksys_ioctl+0x82/0xc0
 ? __x64_sys_ioctl+0x16/0x20
 ? do_syscall_64+0x5f/0x150
 ? entry_SYSCALL_64_after_hwframe+0x44/0xa9

md0_resync      D    0  5425      2 0x80004000
Call Trace:
 __schedule+0x1f6/0x560
 ? schedule+0x4a/0xb0
 ? dlm_lock_sync+0xa1/0xd0 [md_cluster]
 ? wait_woken+0x80/0x80
 ? lock_token+0x2d/0x90 [md_cluster]
 ? resync_info_update+0x95/0x100 [md_cluster]
 ? raid1_sync_request+0x7d3/0xa40 [raid1]
 ? md_do_sync.cold+0x737/0xc8f [md_mod]
 ? md_thread+0x94/0x160 [md_mod]
 ? md_congested+0x30/0x30 [md_mod]
 ? kthread+0x115/0x140
 ? __kthread_bind_mask+0x60/0x60
 ? ret_from_fork+0x1f/0x40
```

Finally, thanks to Xiao for the solution.

Cc: stable@vger.kernel.org
Signed-off-by: Zhao Heming <heming.zhao@suse.com>
Suggested-by: Xiao Ni <xni@redhat.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-12-30 11:26:16 +01:00
accessibility
acpi ACPI: PNP: compare the string length in the matching_id() 2020-12-30 11:26:08 +01:00
amba
android binder: fix UAF when releasing todo list 2020-10-29 09:54:56 +01:00
ata ata: sata_nv: Fix retrieving of active qcs 2020-11-05 11:08:38 +01:00
atm atm: nicstar: Unmap DMA on send error 2020-11-24 13:27:15 +01:00
auxdisplay
base PM: runtime: Resume the device earlier in __device_release_driver() 2020-11-10 12:36:01 +01:00
bcma
block nbd: fix a block_device refcount leak in nbd_release 2020-11-18 19:18:47 +01:00
bluetooth Bluetooth: hci_h5: fix memory leak in h5_close 2020-12-30 11:25:52 +01:00
bus bus: fsl-mc: fix error return code in fsl_mc_object_allocate() 2020-12-30 11:26:02 +01:00
cdrom
char random32: make prandom_u32() output unpredictable 2020-11-18 19:18:52 +01:00
clk clk: sunxi-ng: Make sure divider tables have sentinel 2020-12-30 11:26:06 +01:00
clocksource clocksource/drivers/arm_arch_timer: Correct fault programming of CNTKCTL_EL1.EVNTI 2020-12-30 11:26:00 +01:00
connector
cpufreq cpufreq: scpi: Add missing MODULE_ALIAS 2020-12-30 11:26:01 +01:00
cpuidle cpuidle: Fixup IRQ state 2020-09-09 19:04:23 +02:00
crypto crypto: omap-aes - Fix PM disable depth imbalance in omap_aes_probe 2020-12-30 11:25:55 +01:00
dax
dca
devfreq PM / devfreq: tegra30: Fix integer overflow on CPU's freq max out 2020-10-01 13:14:26 +02:00
dio
dma dmaengine: mv_xor_v2: Fix error return code in mv_xor_v2_probe() 2020-12-30 11:25:56 +01:00
dma-buf dma-fence: Serialise signal enabling (dma_fence_enable_sw_signaling) 2020-10-01 13:14:24 +02:00
edac EDAC/amd64: Fix PCI component registration 2020-12-30 11:26:10 +01:00
eisa
extcon extcon: max77693: Fix modalias string 2020-12-30 11:26:03 +01:00
firewire
firmware firmware: arm_sdei: Use cpus_read_lock() to avoid races with cpuhp 2020-10-01 13:14:35 +02:00
fmc
fpga fpga: dfl: fix bug in port reset handshake 2020-07-29 10:16:48 +02:00
fsi
gnss gnss: sirf: fix error return code in sirf_probe() 2020-06-22 09:05:28 +02:00
gpio gpio: eic-sprd: break loop when getting NULL device resource 2020-12-30 11:25:45 +01:00
gpu drm/dp_aux_dev: check aux_dev before use in drm_dp_aux_dev_get_by_minor() 2020-12-30 11:26:13 +01:00
hid HID: i2c-hid: add Vero K147 to descriptor override 2020-12-30 11:25:48 +01:00
hsi HSI: omap_ssi: Don't jump to free ID in ssi_add_controller() 2020-12-30 11:25:57 +01:00
hv hv_balloon: disable warning when floor reached 2020-11-18 19:18:41 +01:00
hwmon hwmon: (pmbus/max34440) Fix status register reads for MAX344{51,60,61} 2020-10-29 09:55:02 +01:00
hwspinlock
hwtracing coresight: tmc-etr: Check if page is valid before dma_map_page() 2020-12-30 11:25:48 +01:00
i2c i2c: qup: Fix error return code in qup_i2c_bam_schedule_desc() 2020-12-11 13:25:04 +01:00
ide
idle
iio iio:imu:bmi160: Fix too large a buffer. 2020-12-30 11:26:16 +01:00
infiniband RDMA/cxgb4: Validate the number of CQEs 2020-12-30 11:25:56 +01:00
input Input: cyapa_gen6 - fix out-of-bounds stack access 2020-12-30 11:26:07 +01:00
iommu iommu/amd: Set DTE[IntTabLen] to represent 512 IRTEs 2020-12-11 13:25:03 +01:00
ipack
irqchip irqchip/alpine-msi: Fix freeing of interrupts on allocation error path 2020-12-30 11:26:03 +01:00
isdn PCI: add USR vendor id and use it in r8169 and w6692 driver 2020-06-22 09:05:23 +02:00
leds leds: bcm6328, bcm6358: use devres LED registering function 2020-11-05 11:08:46 +01:00
lightnvm
macintosh drivers/macintosh: Fix memleak in windfarm_pm112 driver 2020-06-22 09:05:29 +02:00
mailbox mailbox: avoid timer start from callback 2020-10-30 10:38:21 +01:00
mcb
md md/cluster: fix deadlock when node is doing resync job 2020-12-30 11:26:16 +01:00
media media: ipu3-cio2: Make the field on subdev format V4L2_FIELD_NONE 2020-12-30 11:26:07 +01:00
memory memory: emif: Remove bogus debugfs error handling 2020-11-05 11:08:45 +01:00
memstick memstick: r592: Fix error return in r592_probe() 2020-12-30 11:26:00 +01:00
message scsi: mptfusion: Fix null pointer dereferences in mptscsih_remove() 2020-11-05 11:08:47 +01:00
mfd mfd: sprd: Add wakeup capability for PMIC IRQ 2020-11-18 19:18:46 +01:00
misc mei: protect mei_cl_mtu from null dereference 2020-11-18 19:18:49 +01:00
mmc mmc: block: Fixup condition for CMD13 polling for RPMB requests 2020-12-30 11:25:39 +01:00
mtd mtd: rawnand: qcom: Fix DMA sync on FLASH_STATUS register read 2020-12-30 11:26:15 +01:00
mux
net qlcnic: Fix error code in probe 2020-12-30 11:26:05 +01:00
nfc nfc: s3fwrn5: Release the nfc firmware 2020-12-30 11:26:04 +01:00
ntb NTB: hw: amd: fix an issue about leak system resources 2020-10-30 10:38:25 +01:00
nubus
nvdimm libnvdimm/label: Return -ENXIO for no slot in __blk_label_update 2020-12-30 11:26:05 +01:00
nvme nvme: free sq/cq dbbuf pointers when dbbuf set fails 2020-12-02 08:48:09 +01:00
nvmem
of of/address: Fix of_node memory leak in of_dma_is_coherent 2020-11-18 19:18:48 +01:00
opp
oprofile
parisc parisc: mask out enable and reserved bits from sba imask 2020-08-19 08:15:07 +02:00
parport
pci PM: ACPI: PCI: Drop acpi_pm_set_bridge_wakeup() 2020-12-30 11:26:08 +01:00
pcmcia
perf drivers/perf: xgene_pmu: Fix uninitialized resource struct 2020-10-29 09:55:00 +01:00
phy phy: tegra: xusb: Fix dangling pointer on probe failure 2020-12-02 08:48:10 +01:00
pinctrl pinctrl: falcon: add missing put_device() call in pinctrl_falcon_probe() 2020-12-30 11:26:00 +01:00
platform platform/x86: mlx-platform: Fix item counter assignment for MSN2700, MSN24xx systems 2020-12-30 11:26:01 +01:00
pnp
power power: supply: bq24190_charger: fix reference leak 2020-12-30 11:25:57 +01:00
powercap powercap: restrict energy meter to root access 2020-11-10 21:11:27 +01:00
pps
ps3 powerpc/ps3: use dma_mapping_error() 2020-12-30 11:26:04 +01:00
ptp
pwm pwm: lp3943: Dynamically allocate PWM chip base 2020-12-30 11:26:05 +01:00
rapidio rapidio: fix the missed put_device() for rio_mport_add_riodev 2020-10-30 10:38:21 +01:00
ras
regulator regulator: workaround self-referent regulators 2020-11-24 13:27:25 +01:00
remoteproc remoteproc: qcom: q6v5: Update running state before requesting stop 2020-08-21 11:05:34 +02:00
reset
rpmsg rpmsg: glink: Use complete_all for open states 2020-11-05 11:08:43 +01:00
rtc rtc: rx8010: don't modify the global rtc ops 2020-11-05 11:08:54 +01:00
s390 s390/dasd: fix list corruption of lcu list 2020-12-30 11:26:10 +01:00
sbus
scsi scsi: lpfc: Re-fix use after free in lpfc_rq_buf_free() 2020-12-30 11:26:15 +01:00
sfi
sh
siox
slimbus slimbus: qcom-ngd-ctrl: Avoid sending power requests without QMI 2020-12-30 11:25:57 +01:00
sn
soc soc: qcom: smp2p: Safely acquire spinlock without IRQs 2020-12-30 11:26:14 +01:00
soundwire
spi spi: st-ssc4: Fix unbalanced pm_runtime_disable() in probe error path 2020-12-30 11:26:14 +01:00
spmi
ssb
staging spi: mt7621: fix missing clk_disable_unprepare() on error in mt7621_spi_probe 2020-12-30 11:26:14 +01:00
target scsi: target: iscsi: Fix cmd abort fabric stop race 2020-12-02 08:48:10 +01:00
tc
tee optee: add writeback to valid memory type 2020-12-02 08:48:12 +01:00
thermal thermal: rcar_thermal: Handle probe error gracefully 2020-10-01 13:14:39 +02:00
thunderbolt thunderbolt: Add the missed ida_simple_remove() in ring_request_msix() 2020-11-18 19:18:49 +01:00
tty serial_core: Check for port state when tty is in error state 2020-12-30 11:25:48 +01:00
uio uio: Fix use-after-free in uio_unregister_device() 2020-11-18 19:18:49 +01:00
usb USB: serial: keyspan_pda: fix write unthrottling 2020-12-30 11:26:11 +01:00
uwb
vfio vfio-pci: Use io_remap_pfn_range() for PCI IO memory 2020-12-30 11:25:59 +01:00
vhost vringh: fix __vringh_iov() when riov and wiov are different 2020-11-05 11:08:53 +01:00
video video: fbdev: atmel_lcdfb: fix return error code in atmel_lcdfb_of_init() 2020-12-30 11:25:54 +01:00
virt drivers/virt/fsl_hypervisor: Fix error handling path 2020-10-29 09:55:09 +01:00
virtio virtio_ring: Avoid loop when vq is broken in virtqueue_poll 2020-08-26 10:31:01 +02:00
visorbus
vlynq
vme
w1 w1: mxc_w1: Fix timeout resolution problem leading to bus error 2020-11-05 11:08:47 +01:00
watchdog watchdog: coh901327: add COMMON_CLK dependency 2020-12-30 11:26:05 +01:00
xen xen/events: block rogue events for some time 2020-11-05 11:08:37 +01:00
zorro
Kconfig
Makefile