linux/drivers/md
Coly Li b113f98432 bcache: fix race in btree_flush_write()
[ Upstream commit 50a260e859 ]

There is a race between mca_reap(), btree_node_free() and journal code
btree_flush_write(), which results very rare and strange deadlock or
panic and are very hard to reproduce.

Let me explain how the race happens. In btree_flush_write() one btree
node with oldest journal pin is selected, then it is flushed to cache
device, the select-and-flush is a two steps operation. Between these two
steps, there are something may happen inside the race window,
- The selected btree node was reaped by mca_reap() and allocated to
  other requesters for other btree node.
- The slected btree node was selected, flushed and released by mca
  shrink callback bch_mca_scan().
When btree_flush_write() tries to flush the selected btree node, firstly
b->write_lock is held by mutex_lock(). If the race happens and the
memory of selected btree node is allocated to other btree node, if that
btree node's write_lock is held already, a deadlock very probably
happens here. A worse case is the memory of the selected btree node is
released, then all references to this btree node (e.g. b->write_lock)
will trigger NULL pointer deference panic.

This race was introduced in commit cafe563591 ("bcache: A block layer
cache"), and enlarged by commit c4dc2497d5 ("bcache: fix high CPU
occupancy during journal"), which selected 128 btree nodes and flushed
them one-by-one in a quite long time period.

Such race is not easy to reproduce before. On a Lenovo SR650 server with
48 Xeon cores, and configure 1 NVMe SSD as cache device, a MD raid0
device assembled by 3 NVMe SSDs as backing device, this race can be
observed around every 10,000 times btree_flush_write() gets called. Both
deadlock and kernel panic all happened as aftermath of the race.

The idea of the fix is to add a btree flag BTREE_NODE_journal_flush. It
is set when selecting btree nodes, and cleared after btree nodes
flushed. Then when mca_reap() selects a btree node with this bit set,
this btree node will be skipped. Since mca_reap() only reaps btree node
without BTREE_NODE_journal_flush flag, such race is avoided.

Once corner case should be noticed, that is btree_node_free(). It might
be called in some error handling code path. For example the following
code piece from btree_split(),
        2149 err_free2:
        2150         bkey_put(b->c, &n2->key);
        2151         btree_node_free(n2);
        2152         rw_unlock(true, n2);
        2153 err_free1:
        2154         bkey_put(b->c, &n1->key);
        2155         btree_node_free(n1);
        2156         rw_unlock(true, n1);
At line 2151 and 2155, the btree node n2 and n1 are released without
mac_reap(), so BTREE_NODE_journal_flush also needs to be checked here.
If btree_node_free() is called directly in such error handling path,
and the selected btree node has BTREE_NODE_journal_flush bit set, just
delay for 1 us and retry again. In this case this btree node won't
be skipped, just retry until the BTREE_NODE_journal_flush bit cleared,
and free the btree node memory.

Fixes: cafe563591 ("bcache: A block layer cache")
Signed-off-by: Coly Li <colyli@suse.de>
Reported-and-tested-by: kbuild test robot <lkp@intel.com>
Cc: stable@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-09-16 08:22:23 +02:00
..
bcache bcache: fix race in btree_flush_write() 2019-09-16 08:22:23 +02:00
persistent-data dm space map metadata: fix missing store of apply_bops() return value 2019-08-29 08:28:56 +02:00
dm-bio-prison-v1.c dm: adjust structure members to improve alignment 2018-06-08 11:53:14 -04:00
dm-bio-prison-v1.h
dm-bio-prison-v2.c dm: adjust structure members to improve alignment 2018-06-08 11:53:14 -04:00
dm-bio-prison-v2.h
dm-bio-record.h
dm-bufio.c Revert "dm bufio: fix deadlock with loop device" 2019-08-29 08:28:49 +02:00
dm-builtin.c
dm-cache-background-tracker.c dm cache background tracker: fix sparse warning 2018-04-30 15:40:40 -04:00
dm-cache-background-tracker.h
dm-cache-block-types.h
dm-cache-metadata.c dm cache metadata: Fix loading discard bitset 2019-05-25 18:23:38 +02:00
dm-cache-metadata.h
dm-cache-policy-internal.h
dm-cache-policy-smq.c treewide: Use array_size() in vzalloc() 2018-06-12 16:19:22 -07:00
dm-cache-policy.c
dm-cache-policy.h
dm-cache-target.c dm cache: destroy migration_cache if cache target registration failed 2018-10-09 13:53:03 -04:00
dm-core.h dm: disable DISCARD if the underlying storage no longer supports it 2019-08-25 10:48:01 +02:00
dm-crypt.c dm crypt: move detailed message into debug level 2019-09-16 08:22:14 +02:00
dm-delay.c dm delay: fix a crash when invalid device is specified 2019-05-25 18:23:39 +02:00
dm-era-target.c dm: allow targets to return output from messages they are sent 2018-04-03 15:04:10 -04:00
dm-exception-store.c
dm-exception-store.h
dm-flakey.c dm: Check for device sector overflow if CONFIG_LBDAF is not set 2019-01-26 09:32:42 +01:00
dm-integrity.c dm integrity: fix a crash due to BUG_ON in __journal_read_write() 2019-08-29 08:28:55 +02:00
dm-io.c dm: Use kzalloc for all structs with embedded biosets/mempools 2018-06-05 08:47:43 -06:00
dm-ioctl.c dm ioctl: harden copy_params()'s copy_from_user() from malicious users 2018-11-13 11:08:49 -08:00
dm-kcopyd.c dm kcopyd: always complete failed jobs 2019-08-29 08:28:55 +02:00
dm-linear.c dm: Check for device sector overflow if CONFIG_LBDAF is not set 2019-01-26 09:32:42 +01:00
dm-log-userspace-base.c dm: convert to bioset_init()/mempool_init() 2018-05-30 15:33:32 -06:00
dm-log-userspace-transfer.c
dm-log-userspace-transfer.h
dm-log-writes.c dm log writes: make sure super sector log updates are written in order 2019-07-03 13:14:45 +02:00
dm-log.c
dm-mpath.c dm mpath: fix missing call of path selector type->end_io 2019-09-16 08:22:12 +02:00
dm-mpath.h
dm-path-selector.c
dm-path-selector.h
dm-queue-length.c dm mpath selector: more evenly distribute ties 2018-01-29 13:44:58 -05:00
dm-raid.c dm raid: add missing cleanup in raid_ctr() 2019-08-29 08:28:55 +02:00
dm-raid1.c dm: Check for device sector overflow if CONFIG_LBDAF is not set 2019-01-26 09:32:42 +01:00
dm-region-hash.c - Error path bug fix for overflow tests (Dan) 2018-06-12 18:28:00 -07:00
dm-round-robin.c
dm-rq.c dm mpath: fix missing call of path selector type->end_io 2019-09-16 08:22:12 +02:00
dm-rq.h
dm-service-time.c dm mpath selector: more evenly distribute ties 2018-01-29 13:44:58 -05:00
dm-snap-persistent.c dm bufio: move dm-bufio.h to include/linux/ 2018-04-03 15:04:23 -04:00
dm-snap-transient.c
dm-snap.c dm snapshot: Fix excessive memory usage and workqueue stalls 2019-01-26 09:32:41 +01:00
dm-stats.c treewide: kmalloc() -> kmalloc_array() 2018-06-12 16:19:22 -07:00
dm-stats.h
dm-stripe.c dax: Introduce a ->copy_to_iter dax operation 2018-05-22 23:18:31 -07:00
dm-switch.c treewide: Use array_size() in vmalloc() 2018-06-12 16:19:22 -07:00
dm-sysfs.c
dm-table.c dm table: fix invalid memory accesses with too high sector number 2019-08-29 08:28:56 +02:00
dm-target.c dm mpath: fix missing call of path selector type->end_io 2019-09-16 08:22:12 +02:00
dm-thin-metadata.c dm thin metadata: check if in fail_io mode when setting needs_check 2019-09-16 08:22:21 +02:00
dm-thin-metadata.h dm thin: fix passdown_double_checking_shared_status() 2019-01-31 08:14:38 +01:00
dm-thin.c dm thin: add sanity checks to thin-pool and external snapshot creation 2019-04-05 22:32:59 +02:00
dm-uevent.c
dm-uevent.h
dm-unstripe.c dm: Check for device sector overflow if CONFIG_LBDAF is not set 2019-01-26 09:32:42 +01:00
dm-verity-fec.c Refactors rslib and callers to provide a per-instance allocation area 2018-06-05 10:48:05 -07:00
dm-verity-fec.h dm: convert to bioset_init()/mempool_init() 2018-05-30 15:33:32 -06:00
dm-verity-target.c dm verity: use message limit for data block corruption message 2019-07-21 09:03:08 +02:00
dm-verity.h dm verity: add 'check_at_most_once' option to only validate hashes once 2018-04-03 15:04:29 -04:00
dm-writecache.c libnvdimm-for-4.19_misc 2018-08-25 18:13:10 -07:00
dm-zero.c
dm-zoned-metadata.c dm zoned: fix potential NULL dereference in dmz_do_reclaim() 2019-08-29 08:28:59 +02:00
dm-zoned-reclaim.c dm zoned: properly handle backing device failure 2019-08-29 08:28:56 +02:00
dm-zoned-target.c dm zoned: properly handle backing device failure 2019-08-29 08:28:56 +02:00
dm-zoned.h dm zoned: properly handle backing device failure 2019-08-29 08:28:56 +02:00
dm.c dm: disable DISCARD if the underlying storage no longer supports it 2019-08-25 10:48:01 +02:00
dm.h dm: move dm_table_destroy() to same header as dm_table_create() 2018-01-17 09:16:06 -05:00
Kconfig dm: add writecache target 2018-06-08 11:59:51 -04:00
Makefile dm: add writecache target 2018-06-08 11:59:51 -04:00
md-bitmap.c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input 2018-08-18 16:48:07 -07:00
md-bitmap.h md: Avoid namespace collision with bitmap API 2018-08-01 15:49:39 -07:00
md-cluster.c md-cluster: release RESYNC lock after the last resync message 2018-08-31 17:38:10 -07:00
md-cluster.h
md-faulty.c md: convert to bioset_init()/mempool_init() 2018-05-30 15:33:32 -06:00
md-linear.c md: convert to bioset_init()/mempool_init() 2018-05-30 15:33:32 -06:00
md-linear.h Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/md 2017-11-14 16:07:26 -08:00
md-multipath.c treewide: kzalloc() -> kcalloc() 2018-06-12 16:19:22 -07:00
md-multipath.h md: convert to bioset_init()/mempool_init() 2018-05-30 15:33:32 -06:00
md.c md: fix for divide error in status_resync 2019-07-14 08:11:13 +02:00
md.h md: batch flush requests. 2019-05-25 18:23:25 +02:00
raid1-10.c
raid1.c md/raid1: don't clear bitmap bits on interrupted recovery. 2019-02-20 10:25:48 +01:00
raid1.h md: convert to bioset_init()/mempool_init() 2018-05-30 15:33:32 -06:00
raid5-cache.c md/raid5: fix 'out of memory' during raid cache recovery 2019-02-06 17:30:16 +01:00
raid5-log.h md/raid5-cache: disable reshape completely 2018-08-31 17:38:09 -07:00
raid5-ppl.c md: convert to bioset_init()/mempool_init() 2018-05-30 15:33:32 -06:00
raid5.c raid5-cache: Need to do start() part job after adding journal device 2019-07-26 09:14:23 +02:00
raid5.h Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/md 2018-06-09 12:01:36 -07:00
raid10.c md: Fix failed allocation of md_register_thread 2019-03-23 20:10:11 +01:00
raid10.h md: convert to bioset_init()/mempool_init() 2018-05-30 15:33:32 -06:00
raid0.c md/raid0: Do not bypass blocking queue entered for raid0 bios 2019-07-10 09:53:30 +02:00
raid0.h