linux/include
Greg Thelen 6f051f8986 writeback: safer lock nesting
commit 2e898e4c0a upstream.

lock_page_memcg()/unlock_page_memcg() use spin_lock_irqsave/restore() if
the page's memcg is undergoing move accounting, which occurs when a
process leaves its memcg for a new one that has
memory.move_charge_at_immigrate set.

unlocked_inode_to_wb_begin,end() use spin_lock_irq/spin_unlock_irq() if
the given inode is switching writeback domains.  Switches occur when
enough writes are issued from a new domain.

This existing pattern is thus suspicious:
    lock_page_memcg(page);
    unlocked_inode_to_wb_begin(inode, &locked);
    ...
    unlocked_inode_to_wb_end(inode, locked);
    unlock_page_memcg(page);

If both inode switch and process memcg migration are both in-flight then
unlocked_inode_to_wb_end() will unconditionally enable interrupts while
still holding the lock_page_memcg() irq spinlock.  This suggests the
possibility of deadlock if an interrupt occurs before unlock_page_memcg().

    truncate
    __cancel_dirty_page
    lock_page_memcg
    unlocked_inode_to_wb_begin
    unlocked_inode_to_wb_end
    <interrupts mistakenly enabled>
                                    <interrupt>
                                    end_page_writeback
                                    test_clear_page_writeback
                                    lock_page_memcg
                                    <deadlock>
    unlock_page_memcg

Due to configuration limitations this deadlock is not currently possible
because we don't mix cgroup writeback (a cgroupv2 feature) and
memory.move_charge_at_immigrate (a cgroupv1 feature).

If the kernel is hacked to always claim inode switching and memcg
moving_account, then this script triggers lockup in less than a minute:

  cd /mnt/cgroup/memory
  mkdir a b
  echo 1 > a/memory.move_charge_at_immigrate
  echo 1 > b/memory.move_charge_at_immigrate
  (
    echo $BASHPID > a/cgroup.procs
    while true; do
      dd if=/dev/zero of=/mnt/big bs=1M count=256
    done
  ) &
  while true; do
    sync
  done &
  sleep 1h &
  SLEEP=$!
  while true; do
    echo $SLEEP > a/cgroup.procs
    echo $SLEEP > b/cgroup.procs
  done

The deadlock does not seem possible, so it's debatable if there's any
reason to modify the kernel.  I suggest we should to prevent future
surprises.  And Wang Long said "this deadlock occurs three times in our
environment", so there's more reason to apply this, even to stable.
Stable 4.4 has minor conflicts applying this patch.  For a clean 4.4 patch
see "[PATCH for-4.4] writeback: safer lock nesting"
https://lkml.org/lkml/2018/4/11/146

Wang Long said "this deadlock occurs three times in our environment"

[gthelen@google.com: v4]
  Link: http://lkml.kernel.org/r/20180411084653.254724-1-gthelen@google.com
[akpm@linux-foundation.org: comment tweaks, struct initialization simplification]
Change-Id: Ibb773e8045852978f6207074491d262f1b3fb613
Link: http://lkml.kernel.org/r/20180410005908.167976-1-gthelen@google.com
Fixes: 682aa8e1a6 ("writeback: implement unlocked_inode_to_wb transaction and use it for stat updates")
Signed-off-by: Greg Thelen <gthelen@google.com>
Reported-by: Wang Long <wanglong19@meituan.com>
Acked-by: Wang Long <wanglong19@meituan.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: <stable@vger.kernel.org>	[v4.2+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
[natechancellor: Applied to 4.4 based on Greg's backport on lkml.org]
Signed-off-by: Nathan Chancellor <natechancellor@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-04-24 09:32:12 +02:00
..
acpi
asm-generic mm/vmalloc: add interfaces to free unmapped page table 2018-03-28 18:40:14 +02:00
clocksource
crypto crypto: poly1305 - remove ->setkey() method 2018-02-16 20:09:43 +01:00
drm drm: Allow determining if current task is output poll worker 2018-03-18 11:17:48 +01:00
dt-bindings ARM: dts: Fix omap3 off mode pull defines 2017-11-21 09:21:19 +01:00
keys
kvm KVM: arm/arm64: arch_timer: Preserve physical dist. active state on LR.active 2015-11-24 18:07:40 +01:00
linux writeback: safer lock nesting 2018-04-24 09:32:12 +02:00
math-emu
media videobuf2-core: Check user space planes array in dqbuf 2016-05-04 14:48:50 -07:00
memory
misc
net slip: Check if rstate is initialized before uncompressing 2018-04-24 09:32:04 +02:00
pcmcia
ras
rdma RDMA/ucma: Introduce safer rdma_addr_size() variants 2018-04-08 11:51:59 +02:00
rxrpc
scsi scsi: sg: disable SET_FORCE_LOW_DMA 2018-01-23 19:50:14 +01:00
soc ARM: at91: define LPDDR types 2017-03-12 06:37:24 +01:00
sound ALSA: pcm: Return -EBUSY for OSS ioctls changing busy streams 2018-04-24 09:32:09 +02:00
target target: Avoid early CMD_T_PRE_EXECUTE failures during ABORT_TASK 2018-01-17 09:35:31 +01:00
trace clk: fix a panic error caused by accessing NULL pointer 2018-02-25 11:03:41 +01:00
uapi PCI: Make PCI_ROM_ADDRESS_MASK a 32-bit constant 2018-04-08 11:51:57 +02:00
video drm/imx: Match imx-ipuv3-crtc components using device node in platform data 2016-06-07 18:14:37 -07:00
xen fix xen_swiotlb_dma_mmap prototype 2017-10-05 09:41:48 +02:00
Kbuild