* android12-5.10-2021-08: (429 commits)
ANDROID: Update symbol list for mtk
ANDROID: scheduler: export task_sched_runtime
FROMLIST: mm: slub: fix slub_debug disabling for list of slabs
FROMLIST: mm/madvise: add MADV_WILLNEED to process_madvise()
ANDROID: Update the exynos symbol list
FROMGIT: firmware: arm_scmi: Free mailbox channels if probe fails
ANDROID: GKI: gki_defconfig: Enable CONFIG_NFC
ANDROID: sched: Make uclamp changes depend on CAP_SYS_NICE
ANDROID: GKI: update xiaomi symbol list and ABI XML
ANDROID: ABI: update generic symbol list
ANDROID: scsi: ufs: Enable CONFIG_SCSI_UFS_HPB
ANDROID: scsi: ufs: Make CONFIG_SCSI_UFS_HPB compatible with the GKI
UPSTREAM: arm64: vdso: Avoid ISB after reading from cntvct_el0
ANDROID: GKI: Disable X86_MCE drivers
ANDROID: GKI: Update symbols to symbol list
ANDROID: ABI: update allowed list for exynos
FROMGIT: sched: Skip priority checks with SCHED_FLAG_KEEP_PARAMS
FROMGIT: sched: Don't report SCHED_FLAG_SUGOV in sched_getattr()
FROMGIT: sched/deadline: Fix reset_on_fork reporting of DL tasks
BACKPORT: FROMGIT: sched: Fix UCLAMP_FLAG_IDLE setting
...
Change-Id: I5e0600bb4ccd0333366b016b42332e1e79e56b61
Conflicts:
drivers/usb/gadget/configfs.c
include/linux/usb/gadget.h
Since commit 1b6b26ae70 ("pipe: fix and clarify pipe write wakeup
logic") we have sanitized the pipe write logic, and would only try to
wake up readers if they needed it.
In particular, if the pipe already had data in it before the write,
there was no point in trying to wake up a reader, since any existing
readers must have been aware of the pre-existing data already. Doing
extraneous wakeups will only cause potential thundering herd problems.
However, it turns out that some Android libraries have misused the EPOLL
interface, and expected "edge triggered" be to "any new write will
trigger it". Even if there was no edge in sight.
Quoting Sandeep Patil:
"The commit 1b6b26ae70 ('pipe: fix and clarify pipe write wakeup
logic') changed pipe write logic to wakeup readers only if the pipe
was empty at the time of write. However, there are libraries that
relied upon the older behavior for notification scheme similar to
what's described in [1]
One such library 'realm-core'[2] is used by numerous Android
applications. The library uses a similar notification mechanism as GNU
Make but it never drains the pipe until it is full. When Android moved
to v5.10 kernel, all applications using this library stopped working.
The library has since been fixed[3] but it will be a while before all
applications incorporate the updated library"
Our regression rule for the kernel is that if applications break from
new behavior, it's a regression, even if it was because the application
did something patently wrong. Also note the original report [4] by
Michal Kerrisk about a test for this epoll behavior - but at that point
we didn't know of any actual broken use case.
So add the extraneous wakeup, to approximate the old behavior.
[ I say "approximate", because the exact old behavior was to do a wakeup
not for each write(), but for each pipe buffer chunk that was filled
in. The behavior introduced by this change is not that - this is just
"every write will cause a wakeup, whether necessary or not", which
seems to be sufficient for the broken library use. ]
It's worth noting that this adds the extraneous wakeup only for the
write side, while the read side still considers the "edge" to be purely
about reading enough from the pipe to allow further writes.
See commit f467a6a664 ("pipe: fix and clarify pipe read wakeup logic")
for the pipe read case, which remains that "only wake up if the pipe was
full, and we read something from it".
Link: https://lore.kernel.org/lkml/CAHk-=wjeG0q1vgzu4iJhW5juPkTsjTYmiqiMUYAebWW+0bam6w@mail.gmail.com/ [1]
Link: https://github.com/realm/realm-core [2]
Link: https://github.com/realm/realm-core/issues/4666 [3]
Link: https://lore.kernel.org/lkml/CAKgNAkjMBGeAwF=2MKK758BhxvW58wYTgYKB2V-gY1PwXxrH+Q@mail.gmail.com/ [4]
Link: https://lore.kernel.org/lkml/20210729222635.2937453-1-sspatil@android.com/
Bug: 193851993
Reported-by: Sandeep Patil <sspatil@android.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 3a34b13a88)
Signed-off-by: Sandeep Patil <sspatil@android.com>
Change-Id: Idcf3e8faa31bff47ada4b815237a355e0757b964
The file permissions on the fdinfo dir from were changed from
S_IRUSR|S_IXUSR to S_IRUGO|S_IXUGO, and a PTRACE_MODE_READ check was added
for opening the fdinfo files [1]. However, the ptrace permission check
was not added to the directory, allowing anyone to get the open FD numbers
by reading the fdinfo directory.
Add the missing ptrace permission check for opening the fdinfo directory.
[1] https://lkml.kernel.org/r/20210308170651.919148-1-kaleshsingh@google.com
Link: https://lkml.kernel.org/r/20210713162008.1056986-1-kaleshsingh@google.com
Fixes: 7bc3fa0172 ("procfs: allow reading fdinfo with PTRACE_MODE_READ")
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Christian Brauner <christian.brauner@ubuntu.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Hridya Valsaraju <hridya@google.com>
Cc: Jann Horn <jannh@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Mark Brown <broonie@kernel.org>
Bug: 151772539
Change-Id: I274b30aa0a5ce8412eae7161d31c6ee955035da9
(cherry picked from commit fc73829fa54b0c7af32d6da7c972eb3390957da4
git://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git master)
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
This tries to fix priority inversion in the below condition resulting in
long checkpoint delay.
f2fs_get_node_info()
- nat_tree_lock
-> sleep to grab journal_rwsem by contention
checkpoint
- waiting for nat_tree_lock
In order to let checkpoint go, let's release nat_tree_lock, if there's a
journal_rwsem contention.
Signed-off-by: Daeho Jeong <daehojeong@google.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Bug: 191987855
(cherry picked from commit 2eeb0dce72
git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs.git dev)
Change-Id: I97ac4f9d3bde399ab4f17f5b3a6e949ae9b79f0f
Signed-off-by: Daeho Jeong <daehojeong@google.com>
commit '1b6b26ae7053 ("pipe: fix and clarify pipe write wakeup logic")'
change `pipe_write()` wakeup logic to wakeup readers only if the pipe
was empty.
This meant that applications that are not draining the pipe
before each write were exposed to unexpected timeouts / hangs in
epoll_wait() waiting for data in a pipe using EPOLLIN | EPOLLET flags.
This behaviour can be easily tested with android12-5.4 kernel where
the test that uses pipes for notifications in this way works while it
fails 100% with android12-5.10.
This change restores the old behavior to wakeup all pipe_readers if any
new data is written to the pipe.
Bug: 193851993
Bug: 193846582
Change-Id: If0c5a844091ccf16d5236bd072326325d4d5447a
Signed-off-by: Sandeep Patil <sspatil@google.com>
SBI_NEED_FSCK is an indicator that fsck.f2fs needs to be triggered, so it
is not fully critical to stop any IO writes. So, let's allow to write data
instead of reporting EIO forever given SBI_NEED_FSCK, but do keep OPU.
Bug: 193659742
Fixes: 9557727876 ("f2fs: drop inplace IO if fs status is abnormal")
Cc: <stable@kernel.org> # v5.13+
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
(cherry picked from commit 1ffc8f5f77
git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs.git dev)
Change-Id: I9585358c1cee864064d9d210ee643f49ff1e6749
This effectively locks down OWNERS approval to a small group to guard
the code base against unintentional breakages.
Bug: 194314089
Signed-off-by: Matthias Maennich <maennich@google.com>
Change-Id: Ifd1ea97639a622320ea83f901f6451e2e52b38d4
Added gc_reclaimed_segments and gc_segment_mode sysfs nodes.
1) "gc_reclaimed_segments" shows how many segments have been
reclaimed by GC during a specific GC mode.
2) "gc_segment_mode" is used to control for which gc mode
the "gc_reclaimed_segments" node shows.
Signed-off-by: Daeho Jeong <daehojeong@google.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Bug: 182708936
(cherry picked from commit 07c6b5933e
git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs.git dev)
Change-Id: Ie8c2ccf5d36ce2f388c98b77e22848f3ff6645c3
Signed-off-by: Daeho Jeong <daehojeong@google.com>
Some of the platforms might be still expecting dedicated memory region
for ramoops node. So add logic to detect the start and size of the
ramoops memory region by looking up reserved memory region with
of_reserved_mem_lookup() when platform_get_resource() failed.
Bug: 191636717
Change-Id: Idc479b45fb3f637f7235efd6eabac62059d5e92b
Signed-off-by: Prasad Sodagudi <psodagud@codeaurora.org>
And 'ino' field to /proc/<pid>/fdinfo/<FD> and
/proc/<pid>/task/<tid>/fdinfo/<FD>.
The inode numbers can be used to uniquely identify DMA buffers in user
space and avoids a dependency on /proc/<pid>/fd/* when accounting
per-process DMA buffer sizes.
Link: https://lkml.kernel.org/r/20210308170651.919148-2-kaleshsingh@google.com
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
Acked-by: Randy Dunlap <rdunlap@infradead.org>
Acked-by: Christian König <christian.koenig@amd.com>
Cc: Jann Horn <jannh@google.com>
Cc: Jeff Vander Stoep <jeffv@google.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Hridya Valsaraju <hridya@google.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Kalesh Singh <kaleshsingh@google.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Alexey Gladkov <gladkov.alexey@gmail.com>
Cc: Szabolcs Nagy <szabolcs.nagy@arm.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Christian Brauner <christian.brauner@ubuntu.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Bernd Edlinger <bernd.edlinger@hotmail.de>
Cc: Andrei Vagin <avagin@gmail.com>
Cc: Helge Deller <deller@gmx.de>
Cc: James Morris <jamorris@linux.microsoft.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 3845f256a8)
Bug: 159126739
Bug: 167141117
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
Change-Id: Id07beac3edcc95c0b42805e24e5486965acbb46e
Android captures per-process system memory state when certain low memory
events (e.g a foreground app kill) occur, to identify potential memory
hoggers. In order to measure how much memory a process actually consumes,
it is necessary to include the DMA buffer sizes for that process in the
memory accounting. Since the handle to DMA buffers are raw FDs, it is
important to be able to identify which processes have FD references to a
DMA buffer.
Currently, DMA buffer FDs can be accounted using /proc/<pid>/fd/* and
/proc/<pid>/fdinfo -- both are only readable by the process owner, as
follows:
1. Do a readlink on each FD.
2. If the target path begins with "/dmabuf", then the FD is a dmabuf FD.
3. stat the file to get the dmabuf inode number.
4. Read/ proc/<pid>/fdinfo/<fd>, to get the DMA buffer size.
Accessing other processes' fdinfo requires root privileges. This limits
the use of the interface to debugging environments and is not suitable for
production builds. Granting root privileges even to a system process
increases the attack surface and is highly undesirable.
Since fdinfo doesn't permit reading process memory and manipulating
process state, allow accessing fdinfo under PTRACE_MODE_READ_FSCRED.
Link: https://lkml.kernel.org/r/20210308170651.919148-1-kaleshsingh@google.com
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
Suggested-by: Jann Horn <jannh@google.com>
Acked-by: Christian König <christian.koenig@amd.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Alexey Gladkov <gladkov.alexey@gmail.com>
Cc: Andrei Vagin <avagin@gmail.com>
Cc: Bernd Edlinger <bernd.edlinger@hotmail.de>
Cc: Christian Brauner <christian.brauner@ubuntu.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Helge Deller <deller@gmx.de>
Cc: Hridya Valsaraju <hridya@google.com>
Cc: James Morris <jamorris@linux.microsoft.com>
Cc: Jeff Vander Stoep <jeffv@google.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Kees Cook <keescook@chromium.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Szabolcs Nagy <szabolcs.nagy@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 7bc3fa0172)
Bug: 159126739
Bug: 167141117
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
Change-Id: I842b689670f731138592f45c7124ef446d9aa59a
Through this vendor hook, we can get the timing to check
current running task for the validation of its credential
and open operation.
Bug: 191291287
Signed-off-by: Kuan-Ying Lee <Kuan-Ying.Lee@mediatek.com>
Change-Id: Ia644ceb02dbc230ee1d25cad3630c2c3f908e41a
The reserved memory region for ramoops is assumed to be at a fixed
and known location when read from the devicetree. This is not desirable
in environments where it is preferred for the region to be dynamically
allocated at runtime, as opposed to it being fixed at compile time.
Change the logic for detecting the start and size of the ramoops
memory region by looking up the reserved memory region instead of
using platform_get_resource(), which assumes that the location
of the memory is known ahead of time.
Bug: 191636717
Link: https://lore.kernel.org/patchwork/patch/1451704/
Change-Id: I24066de9f4fe1f1575cb1bbb1687c37a2b1938a4
Signed-off-by: Isaac J. Manjarres <isaacm@codeaurora.org>
Signed-off-by: Mukesh Ojha <mojha@codeaurora.org>
Signed-off-by: Prasad Sodagudi <psodagud@codeaurora.org>
* android12-5.10: (2274 commits)
FROMGIT: mm: slub: move sysfs slab alloc/free interfaces to debugfs
ANDROID: gki - CONFIG_NET_SCH_FQ=y
ANDROID: GKI: Kconfig.gki: Add GKI_HIDDEN_ETHERNET_CONFIGS
FROMLIST: media: Kconfig: Fix DVB_CORE can't be selected as module
ANDROID: Update ABI and symbol list
Revert "net: usb: cdc_ncm: don't spew notifications"
ANDROID: Fips 140: move fips symbols entirely in own list
ANDROID: core of xt_IDLETIMER send_nl_msg support
ANDROID: start to re-add xt_IDLETIMER send_nl_msg support
ANDROID: add fips140.ko symbols to module ABI
ANDROID: inject correct HMAC digest into fips140.ko at build time
ANDROID: crypto: fips140 - perform load time integrity check
FROMLIST: crypto: shash - stop comparing function pointers to avoid breaking CFI
ANDROID: arm64: module: preserve RELA sections for FIPS140 integrity selfcheck
ANDROID: arm64: simd: omit capability check in may_use_simd()
ANDROID: kbuild: lto: permit the use of .a archives in LTO modules
ANDROID: arm64: only permit certain alternatives in the FIPS140 module
ANDROID: crypto: lib/aes - add vendor hooks for AES library routines
ANDROID: crypto: lib/sha256 - add vendor hook for sha256() routine
UPSTREAM: KVM: arm64: Mark the host stage-2 memory pools static
...
Conflicts:
drivers/mmc/core/mmc_ops.c
drivers/usb/gadget/function/f_uac1.c
drivers/usb/gadget/function/f_uac2.c
drivers/usb/gadget/function/f_uvc.c
After power lost, spinand may work in a unkonw state and result in
bit flip, including:
1.Write to cache invalid and dirty cache data write to page's array
which result in node CRC error for some pages.
2.One page write fail but the next page write success result in
empty space corruption.
Change-Id: I212c237202b32de0217efc8dd5a4e84174953a3f
Signed-off-by: Jon Lin <jon.lin@rock-chips.com>
* aosp/upstream-f2fs-stable-linux-5.10.y:
Revert "f2fs: avoid attaching SB_ACTIVE flag during mount/remount"
f2fs: remove false alarm on iget failure during GC
f2fs: enable extent cache for compression files in read-only
f2fs: fix to avoid adding tab before doc section
f2fs: introduce f2fs_casefolded_name slab cache
f2fs: swap: support migrating swapfile in aligned write mode
f2fs: swap: remove dead codes
f2fs: compress: add compress_inode to cache compressed blocks
f2fs: clean up /sys/fs/f2fs/<disk>/features
f2fs: add pin_file in feature list
f2fs: Advertise encrypted casefolding in sysfs
f2fs: Show casefolding support only when supported
f2fs: support RO feature
f2fs: logging neatening
Bug: 186107892
Bug: 190759634
Bug: 190517210
Signed-off-by: Jaegeuk Kim <jaegeuk@google.com>
Change-Id: I8eb93d8a43304b98166676da52a9c2434b15b942
This patch removes setting SBI_NEED_FSCK when GC gets an error on f2fs_iget,
since f2fs_iget can give ENOMEM and others by race condition.
If we set this critical fsck flag, we'll get EIO during fsync via the below
code path.
In f2fs_inplace_write_data(),
if (is_sbi_flag_set(sbi, SBI_NEED_FSCK) || f2fs_cp_error(sbi)) {
err = -EIO;
goto drop_bio;
}
Fixes: 9557727876 ("f2fs: drop inplace IO if fs status is abnormal")
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Add a slab cache: "f2fs_casefolded_name" for memory allocation
of casefold name.
Reviewed-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Changes in 5.10.43
btrfs: tree-checker: do not error out if extent ref hash doesn't match
net: usb: cdc_ncm: don't spew notifications
hwmon: (dell-smm-hwmon) Fix index values
hwmon: (pmbus/isl68137) remove READ_TEMPERATURE_3 for RAA228228
netfilter: conntrack: unregister ipv4 sockopts on error unwind
efi/fdt: fix panic when no valid fdt found
efi: Allow EFI_MEMORY_XP and EFI_MEMORY_RO both to be cleared
efi/libstub: prevent read overflow in find_file_option()
efi: cper: fix snprintf() use in cper_dimm_err_location()
vfio/pci: Fix error return code in vfio_ecap_init()
vfio/pci: zap_vma_ptes() needs MMU
samples: vfio-mdev: fix error handing in mdpy_fb_probe()
vfio/platform: fix module_put call in error flow
ipvs: ignore IP_VS_SVC_F_HASHED flag when adding service
HID: logitech-hidpp: initialize level variable
HID: pidff: fix error return code in hid_pidff_init()
HID: i2c-hid: fix format string mismatch
devlink: Correct VIRTUAL port to not have phys_port attributes
net/sched: act_ct: Offload connections with commit action
net/sched: act_ct: Fix ct template allocation for zone 0
mptcp: always parse mptcp options for MPC reqsk
nvme-rdma: fix in-casule data send for chained sgls
ACPICA: Clean up context mutex during object deletion
perf probe: Fix NULL pointer dereference in convert_variable_location()
net: dsa: tag_8021q: fix the VLAN IDs used for encoding sub-VLANs
net: sock: fix in-kernel mark setting
net/tls: Replace TLS_RX_SYNC_RUNNING with RCU
net/tls: Fix use-after-free after the TLS device goes down and up
net/mlx5e: Fix incompatible casting
net/mlx5: Check firmware sync reset requested is set before trying to abort it
net/mlx5e: Check for needed capability for cvlan matching
net/mlx5: DR, Create multi-destination flow table with level less than 64
nvmet: fix freeing unallocated p2pmem
netfilter: nft_ct: skip expectations for confirmed conntrack
netfilter: nfnetlink_cthelper: hit EBUSY on updates if size mismatches
drm/i915/selftests: Fix return value check in live_breadcrumbs_smoketest()
bpf: Simplify cases in bpf_base_func_proto
bpf, lockdown, audit: Fix buggy SELinux lockdown permission checks
ieee802154: fix error return code in ieee802154_add_iface()
ieee802154: fix error return code in ieee802154_llsec_getparams()
igb: add correct exception tracing for XDP
ixgbevf: add correct exception tracing for XDP
cxgb4: fix regression with HASH tc prio value update
ipv6: Fix KASAN: slab-out-of-bounds Read in fib6_nh_flush_exceptions
ice: Fix allowing VF to request more/less queues via virtchnl
ice: Fix VFR issues for AVF drivers that expect ATQLEN cleared
ice: handle the VF VSI rebuild failure
ice: report supported and advertised autoneg using PHY capabilities
ice: Allow all LLDP packets from PF to Tx
i2c: qcom-geni: Add shutdown callback for i2c
cxgb4: avoid link re-train during TC-MQPRIO configuration
i40e: optimize for XDP_REDIRECT in xsk path
i40e: add correct exception tracing for XDP
ice: simplify ice_run_xdp
ice: optimize for XDP_REDIRECT in xsk path
ice: add correct exception tracing for XDP
ixgbe: optimize for XDP_REDIRECT in xsk path
ixgbe: add correct exception tracing for XDP
arm64: dts: ti: j7200-main: Mark Main NAVSS as dma-coherent
optee: use export_uuid() to copy client UUID
bus: ti-sysc: Fix am335x resume hang for usb otg module
arm64: dts: ls1028a: fix memory node
arm64: dts: zii-ultra: fix 12V_MAIN voltage
arm64: dts: freescale: sl28: var4: fix RGMII clock and voltage
ARM: dts: imx7d-meerkat96: Fix the 'tuning-step' property
ARM: dts: imx7d-pico: Fix the 'tuning-step' property
ARM: dts: imx: emcon-avari: Fix nxp,pca8574 #gpio-cells
bus: ti-sysc: Fix flakey idling of uarts and stop using swsup_sidle_act
tipc: add extack messages for bearer/media failure
tipc: fix unique bearer names sanity check
serial: stm32: fix threaded interrupt handling
riscv: vdso: fix and clean-up Makefile
io_uring: fix link timeout refs
io_uring: use better types for cflags
drm/amdgpu/vcn3: add cancel_delayed_work_sync before power gate
drm/amdgpu/jpeg2.5: add cancel_delayed_work_sync before power gate
drm/amdgpu/jpeg3: add cancel_delayed_work_sync before power gate
Bluetooth: fix the erroneous flush_work() order
Bluetooth: use correct lock to prevent UAF of hdev object
wireguard: do not use -O3
wireguard: peer: allocate in kmem_cache
wireguard: use synchronize_net rather than synchronize_rcu
wireguard: selftests: remove old conntrack kconfig value
wireguard: selftests: make sure rp_filter is disabled on vethc
wireguard: allowedips: initialize list head in selftest
wireguard: allowedips: remove nodes in O(1)
wireguard: allowedips: allocate nodes in kmem_cache
wireguard: allowedips: free empty intermediate nodes when removing single node
net: caif: added cfserl_release function
net: caif: add proper error handling
net: caif: fix memory leak in caif_device_notify
net: caif: fix memory leak in cfusbl_device_notify
HID: i2c-hid: Skip ELAN power-on command after reset
HID: magicmouse: fix NULL-deref on disconnect
HID: multitouch: require Finger field to mark Win8 reports as MT
gfs2: fix scheduling while atomic bug in glocks
ALSA: timer: Fix master timer notification
ALSA: hda: Fix for mute key LED for HP Pavilion 15-CK0xx
ALSA: hda: update the power_state during the direct-complete
ARM: dts: imx6dl-yapp4: Fix RGMII connection to QCA8334 switch
ARM: dts: imx6q-dhcom: Add PU,VDD1P1,VDD2P5 regulators
ext4: fix memory leak in ext4_fill_super
ext4: fix bug on in ext4_es_cache_extent as ext4_split_extent_at failed
ext4: fix fast commit alignment issues
ext4: fix memory leak in ext4_mb_init_backend on error path.
ext4: fix accessing uninit percpu counter variable with fast_commit
usb: dwc2: Fix build in periphal-only mode
pid: take a reference when initializing `cad_pid`
ocfs2: fix data corruption by fallocate
mm/debug_vm_pgtable: fix alignment for pmd/pud_advanced_tests()
mm/page_alloc: fix counting of free pages after take off from buddy
x86/cpufeatures: Force disable X86_FEATURE_ENQCMD and remove update_pasid()
x86/sev: Check SME/SEV support in CPUID first
nfc: fix NULL ptr dereference in llcp_sock_getname() after failed connect
drm/amdgpu: Don't query CE and UE errors
drm/amdgpu: make sure we unpin the UVD BO
x86/apic: Mark _all_ legacy interrupts when IO/APIC is missing
powerpc/kprobes: Fix validation of prefixed instructions across page boundary
btrfs: mark ordered extent and inode with error if we fail to finish
btrfs: fix error handling in btrfs_del_csums
btrfs: return errors from btrfs_del_csums in cleanup_ref_head
btrfs: fixup error handling in fixup_inode_link_counts
btrfs: abort in rename_exchange if we fail to insert the second ref
btrfs: fix deadlock when cloning inline extents and low on available space
mm, hugetlb: fix simple resv_huge_pages underflow on UFFDIO_COPY
drm/msm/dpu: always use mdp device to scale bandwidth
btrfs: fix unmountable seed device after fstrim
KVM: SVM: Truncate GPR value for DR and CR accesses in !64-bit mode
KVM: arm64: Fix debug register indexing
x86/kvm: Teardown PV features on boot CPU as well
x86/kvm: Disable kvmclock on all CPUs on shutdown
x86/kvm: Disable all PV features on crash
lib/lz4: explicitly support in-place decompression
i2c: qcom-geni: Suspend and resume the bus during SYSTEM_SLEEP_PM ops
netfilter: nf_tables: missing error reporting for not selected expressions
xen-netback: take a reference to the RX task thread
neighbour: allow NUD_NOARP entries to be forced GCed
Linux 5.10.43
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: I8d7ec0878193e4e454076809b7fb71fcc4e3d810
commit 5e753a817b upstream.
The following test case reproduces an issue of wrongly freeing in-use
blocks on the readonly seed device when fstrim is called on the rw sprout
device. As shown below.
Create a seed device and add a sprout device to it:
$ mkfs.btrfs -fq -dsingle -msingle /dev/loop0
$ btrfstune -S 1 /dev/loop0
$ mount /dev/loop0 /btrfs
$ btrfs dev add -f /dev/loop1 /btrfs
BTRFS info (device loop0): relocating block group 290455552 flags system
BTRFS info (device loop0): relocating block group 1048576 flags system
BTRFS info (device loop0): disk added /dev/loop1
$ umount /btrfs
Mount the sprout device and run fstrim:
$ mount /dev/loop1 /btrfs
$ fstrim /btrfs
$ umount /btrfs
Now try to mount the seed device, and it fails:
$ mount /dev/loop0 /btrfs
mount: /btrfs: wrong fs type, bad option, bad superblock on /dev/loop0, missing codepage or helper program, or other error.
Block 5292032 is missing on the readonly seed device:
$ dmesg -kt | tail
<snip>
BTRFS error (device loop0): bad tree block start, want 5292032 have 0
BTRFS warning (device loop0): couldn't read-tree root
BTRFS error (device loop0): open_ctree failed
>From the dump-tree of the seed device (taken before the fstrim). Block
5292032 belonged to the block group starting at 5242880:
$ btrfs inspect dump-tree -e /dev/loop0 | grep -A1 BLOCK_GROUP
<snip>
item 3 key (5242880 BLOCK_GROUP_ITEM 8388608) itemoff 16169 itemsize 24
block group used 114688 chunk_objectid 256 flags METADATA
<snip>
>From the dump-tree of the sprout device (taken before the fstrim).
fstrim used block-group 5242880 to find the related free space to free:
$ btrfs inspect dump-tree -e /dev/loop1 | grep -A1 BLOCK_GROUP
<snip>
item 1 key (5242880 BLOCK_GROUP_ITEM 8388608) itemoff 16226 itemsize 24
block group used 32768 chunk_objectid 256 flags METADATA
<snip>
BPF kernel tracing the fstrim command finds the missing block 5292032
within the range of the discarded blocks as below:
kprobe:btrfs_discard_extent {
printf("freeing start %llu end %llu num_bytes %llu:\n",
arg1, arg1+arg2, arg2);
}
freeing start 5259264 end 5406720 num_bytes 147456
<snip>
Fix this by avoiding the discard command to the readonly seed device.
Reported-by: Chris Murphy <lists@colorremedies.com>
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sudip Mukherjee <sudipm.mukherjee@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 76a6d5cd74 upstream.
There are a few cases where cloning an inline extent requires copying data
into a page of the destination inode. For these cases we are allocating
the required data and metadata space while holding a leaf locked. This can
result in a deadlock when we are low on available space because allocating
the space may flush delalloc and two deadlock scenarios can happen:
1) When starting writeback for an inode with a very small dirty range that
fits in an inline extent, we deadlock during the writeback when trying
to insert the inline extent, at cow_file_range_inline(), if the extent
is going to be located in the leaf for which we are already holding a
read lock;
2) After successfully starting writeback, for non-inline extent cases,
the async reclaim thread will hang waiting for an ordered extent to
complete if the ordered extent completion needs to modify the leaf
for which the clone task is holding a read lock (for adding or
replacing file extent items). So the cloning task will wait forever
on the async reclaim thread to make progress, which in turn is
waiting for the ordered extent completion which in turn is waiting
to acquire a write lock on the same leaf.
So fix this by making sure we release the path (and therefore the leaf)
every time we need to copy the inline extent's data into a page of the
destination inode, as by that time we do not need to have the leaf locked.
Fixes: 05a5a7621c ("Btrfs: implement full reflink support for inline extents")
CC: stable@vger.kernel.org # 5.10+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit dc09ef3562 upstream.
Error injection stress uncovered a problem where we'd leave a dangling
inode ref if we failed during a rename_exchange. This happens because
we insert the inode ref for one side of the rename, and then for the
other side. If this second inode ref insert fails we'll leave the first
one dangling and leave a corrupt file system behind. Fix this by
aborting if we did the insert for the first inode ref.
CC: stable@vger.kernel.org # 4.9+
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 011b28acf9 upstream.
This function has the following pattern
while (1) {
ret = whatever();
if (ret)
goto out;
}
ret = 0
out:
return ret;
However several places in this while loop we simply break; when there's
a problem, thus clearing the return value, and in one case we do a
return -EIO, and leak the memory for the path.
Fix this by re-arranging the loop to deal with ret == 1 coming from
btrfs_search_slot, and then simply delete the
ret = 0;
out:
bit so everybody can break if there is an error, which will allow for
proper error handling to occur.
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 856bd270dc upstream.
We are unconditionally returning 0 in cleanup_ref_head, despite the fact
that btrfs_del_csums could fail. We need to return the error so the
transaction gets aborted properly, fix this by returning ret from
btrfs_del_csums in cleanup_ref_head.
Reviewed-by: Qu Wenruo <wqu@suse.com>
CC: stable@vger.kernel.org # 4.19+
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit b86652be7c upstream.
Error injection stress would sometimes fail with checksums on disk that
did not have a corresponding extent. This occurred because the pattern
in btrfs_del_csums was
while (1) {
ret = btrfs_search_slot();
if (ret < 0)
break;
}
ret = 0;
out:
btrfs_free_path(path);
return ret;
If we got an error from btrfs_search_slot we'd clear the error because
we were breaking instead of goto out. Instead of using goto out, simply
handle the cases where we may leave a random value in ret, and get rid
of the
ret = 0;
out:
pattern and simply allow break to have the proper error reporting. With
this fix we properly abort the transaction and do not commit thinking we
successfully deleted the csum.
Reviewed-by: Qu Wenruo <wqu@suse.com>
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit d61bec08b9 upstream.
While doing error injection testing I saw that sometimes we'd get an
abort that wouldn't stop the current transaction commit from completing.
This abort was coming from finish ordered IO, but at this point in the
transaction commit we should have gotten an error and stopped.
It turns out the abort came from finish ordered io while trying to write
out the free space cache. It occurred to me that any failure inside of
finish_ordered_io isn't actually raised to the person doing the writing,
so we could have any number of failures in this path and think the
ordered extent completed successfully and the inode was fine.
Fix this by marking the ordered extent with BTRFS_ORDERED_IOERR, and
marking the mapping of the inode with mapping_set_error, so any callers
that simply call fdatawait will also get the error.
With this we're seeing the IO error on the free space inode when we fail
to do the finish_ordered_io.
CC: stable@vger.kernel.org # 4.19+
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 6bba4471f0 upstream.
When fallocate punches holes out of inode size, if original isize is in
the middle of last cluster, then the part from isize to the end of the
cluster will be zeroed with buffer write, at that time isize is not yet
updated to match the new size, if writeback is kicked in, it will invoke
ocfs2_writepage()->block_write_full_page() where the pages out of inode
size will be dropped. That will cause file corruption. Fix this by
zero out eof blocks when extending the inode size.
Running the following command with qemu-image 4.2.1 can get a corrupted
coverted image file easily.
qemu-img convert -p -t none -T none -f qcow2 $qcow_image \
-O qcow2 -o compat=1.1 $qcow_image.conv
The usage of fallocate in qemu is like this, it first punches holes out
of inode size, then extend the inode size.
fallocate(11, FALLOC_FL_KEEP_SIZE|FALLOC_FL_PUNCH_HOLE, 2276196352, 65536) = 0
fallocate(11, 0, 2276196352, 65536) = 0
v1: https://www.spinics.net/lists/linux-fsdevel/msg193999.html
v2: https://lore.kernel.org/linux-fsdevel/20210525093034.GB4112@quack2.suse.cz/T/
Link: https://lkml.kernel.org/r/20210528210648.9124-1-junxiao.bi@oracle.com
Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Cc: Jun Piao <piaojun@huawei.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit b45f189a19 upstream.
When running generic/527 with fast_commit configuration, the following
issue is seen on Power. With fast_commit, during ext4_fc_replay()
(which can be called from ext4_fill_super()), if inode eviction
happens then it can access an uninitialized percpu counter variable.
This patch adds the check before accessing the counters in
ext4_free_inode() path.
[ 321.165371] run fstests generic/527 at 2021-04-29 08:38:43
[ 323.027786] EXT4-fs (dm-0): mounted filesystem with ordered data mode. Opts: block_validity. Quota mode: none.
[ 323.618772] BUG: Unable to handle kernel data access on read at 0x1fbd80000
[ 323.619767] Faulting instruction address: 0xc000000000bae78c
cpu 0x1: Vector: 300 (Data Access) at [c000000010706ef0]
pc: c000000000bae78c: percpu_counter_add_batch+0x3c/0x100
lr: c0000000006d0bb0: ext4_free_inode+0x780/0xb90
pid = 5593, comm = mount
ext4_free_inode+0x780/0xb90
ext4_evict_inode+0xa8c/0xc60
evict+0xfc/0x1e0
ext4_fc_replay+0xc50/0x20f0
do_one_pass+0xfe0/0x1350
jbd2_journal_recover+0x184/0x2e0
jbd2_journal_load+0x1c0/0x4a0
ext4_fill_super+0x2458/0x4200
mount_bdev+0x1dc/0x290
ext4_mount+0x28/0x40
legacy_get_tree+0x4c/0xa0
vfs_get_tree+0x4c/0x120
path_mount+0xcf8/0xd70
do_mount+0x80/0xd0
sys_mount+0x3fc/0x490
system_call_exception+0x384/0x3d0
system_call_common+0xec/0x278
Cc: stable@kernel.org
Fixes: 8016e29f43 ("ext4: fast commit recovery path")
Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Reviewed-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
Link: https://lore.kernel.org/r/6cceb9a75c54bef8fa9696c1b08c8df5ff6169e2.1619692410.git.riteshh@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit a7ba36bc94 upstream.
Fast commit recovery data on disk may not be aligned. So, when the
recovery code reads it, this patch makes sure that fast commit info
found on-disk is first memcpy-ed into an aligned variable before
accessing it. As a consequence of it, we also remove some macros that
could resulted in unaligned accesses.
Cc: stable@kernel.org
Fixes: 8016e29f43 ("ext4: fast commit recovery path")
Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
Link: https://lore.kernel.org/r/20210519215920.2037527-1-harshads@google.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 082cd4ec24 upstream.
We got follow bug_on when run fsstress with injecting IO fault:
[130747.323114] kernel BUG at fs/ext4/extents_status.c:762!
[130747.323117] Internal error: Oops - BUG: 0 [#1] SMP
......
[130747.334329] Call trace:
[130747.334553] ext4_es_cache_extent+0x150/0x168 [ext4]
[130747.334975] ext4_cache_extents+0x64/0xe8 [ext4]
[130747.335368] ext4_find_extent+0x300/0x330 [ext4]
[130747.335759] ext4_ext_map_blocks+0x74/0x1178 [ext4]
[130747.336179] ext4_map_blocks+0x2f4/0x5f0 [ext4]
[130747.336567] ext4_mpage_readpages+0x4a8/0x7a8 [ext4]
[130747.336995] ext4_readpage+0x54/0x100 [ext4]
[130747.337359] generic_file_buffered_read+0x410/0xae8
[130747.337767] generic_file_read_iter+0x114/0x190
[130747.338152] ext4_file_read_iter+0x5c/0x140 [ext4]
[130747.338556] __vfs_read+0x11c/0x188
[130747.338851] vfs_read+0x94/0x150
[130747.339110] ksys_read+0x74/0xf0
This patch's modification is according to Jan Kara's suggestion in:
https://patchwork.ozlabs.org/project/linux-ext4/patch/20210428085158.3728201-1-yebin10@huawei.com/
"I see. Now I understand your patch. Honestly, seeing how fragile is trying
to fix extent tree after split has failed in the middle, I would probably
go even further and make sure we fix the tree properly in case of ENOSPC
and EDQUOT (those are easily user triggerable). Anything else indicates a
HW problem or fs corruption so I'd rather leave the extent tree as is and
don't try to fix it (which also means we will not create overlapping
extents)."
Cc: stable@kernel.org
Signed-off-by: Ye Bin <yebin10@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20210506141042.3298679-1-yebin10@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit afd09b617d upstream.
Buffer head references must be released before calling kill_bdev();
otherwise the buffer head (and its page referenced by b_data) will not
be freed by kill_bdev, and subsequently that bh will be leaked.
If blocksizes differ, sb_set_blocksize() will kill current buffers and
page cache by using kill_bdev(). And then super block will be reread
again but using correct blocksize this time. sb_set_blocksize() didn't
fully free superblock page and buffer head, and being busy, they were
not freed and instead leaked.
This can easily be reproduced by calling an infinite loop of:
systemctl start <ext4_on_lvm>.mount, and
systemctl stop <ext4_on_lvm>.mount
... since systemd creates a cgroup for each slice which it mounts, and
the bh leak get amplified by a dying memory cgroup that also never
gets freed, and memory consumption is much more easily noticed.
Fixes: ce40733ce9 ("ext4: Check for return value from sb_set_blocksize")
Fixes: ac27a0ec11 ("ext4: initial copy of files from ext3")
Link: https://lore.kernel.org/r/20210521075533.95732-1-amakhalov@vmware.com
Signed-off-by: Alexey Makhalov <amakhalov@vmware.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 20265d9a67 upstream.
Before this patch, in the unlikely event that gfs2_glock_dq encountered
a withdraw, it would do a wait_on_bit to wait for its journal to be
recovered, but it never released the glock's spin_lock, which caused a
scheduling-while-atomic error.
This patch unlocks the lockref spin_lock before waiting for recovery.
Fixes: 601ef0d52e ("gfs2: Force withdraw to replay journals and wait for it to finish")
Cc: stable@vger.kernel.org # v5.7+
Reported-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 8c3f9cd160 ]
__io_cqring_fill_event() takes cflags as long to squeeze it into u32 in
an CQE, awhile all users pass int or unsigned. Replace it with unsigned
int and store it as u32 in struct io_completion to match CQE.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
commit 1119a72e22 upstream.
The tree checker checks the extent ref hash at read and write time to
make sure we do not corrupt the file system. Generally extent
references go inline, but if we have enough of them we need to make an
item, which looks like
key.objectid = <bytenr>
key.type = <BTRFS_EXTENT_DATA_REF_KEY|BTRFS_TREE_BLOCK_REF_KEY>
key.offset = hash(tree, owner, offset)
However if key.offset collide with an unrelated extent reference we'll
simply key.offset++ until we get something that doesn't collide.
Obviously this doesn't match at tree checker time, and thus we error
while writing out the transaction. This is relatively easy to
reproduce, simply do something like the following
xfs_io -f -c "pwrite 0 1M" file
offset=2
for i in {0..10000}
do
xfs_io -c "reflink file 0 ${offset}M 1M" file
offset=$(( offset + 2 ))
done
xfs_io -c "reflink file 0 17999258914816 1M" file
xfs_io -c "reflink file 0 35998517829632 1M" file
xfs_io -c "reflink file 0 53752752058368 1M" file
btrfs filesystem sync
And the sync will error out because we'll abort the transaction. The
magic values above are used because they generate hash collisions with
the first file in the main subvol.
The fix for this is to remove the hash value check from tree checker, as
we have no idea which offset ours should belong to.
Reported-by: Tuomas Lähdekorpi <tuomas.lahdekorpi@gmail.com>
Fixes: 0785a9aacf ("btrfs: tree-checker: Add EXTENT_DATA_REF check")
CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add comment]
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
lz4 uses LZ4_DISTANCE_MAX to record history preservation. When
using rolling decompression, a block with a higher compression
ratio will cause a larger memory allocation (up to 64k). It may
cause a large resource burden in extreme cases on devices with
small memory and a large number of concurrent IOs. So appropriately
reducing this value can improve performance.
Decreasing this value will reduce the compression ratio (except
when input_size <LZ4_DISTANCE_MAX). But considering that erofs
currently only supports 4k output, reducing this value will not
significantly reduce the compression benefits.
The maximum value of LZ4_DISTANCE_MAX defined by lz4 is 64k, and
we can only reduce this value. For the old kernel, it just can't
reduce the memory allocation during rolling decompression without
affecting the decompression result.
Link: https://lore.kernel.org/r/20210329012308.28743-3-hsiangkao@aol.com
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Huang Jianan <huangjianan@oppo.com>
Signed-off-by: Guo Weichao <guoweichao@oppo.com>
[ Gao Xiang: introduce struct erofs_sb_lz4_info for configurations. ]
Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
Bug: 190585249
Change-Id: Ia1dbe56677a0bd2a388ae6b484f4c4d40170bf1c
(cherry picked from commit 5d50538fc5)
Signed-off-by: Huang Jianan <huangjianan@oppo.com>
Sync decompression was introduced to get rid of additional kworker
scheduling overhead. But there is no such overhead in non-atomic
contexts. Therefore, it should be better to turn off sync decompression
to avoid the current thread waiting in z_erofs_runqueue.
Link: https://lore.kernel.org/r/20210317035448.13921-3-huangjianan@oppo.com
Reviewed-by: Gao Xiang <hsiangkao@redhat.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Huang Jianan <huangjianan@oppo.com>
Signed-off-by: Guo Weichao <guoweichao@oppo.com>
Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
Bug: 190585249
Change-Id: I7e03053d690b9cda4fe78c0ed0e2fbb0ba188d38
(cherry picked from commit 30048cdac4)
Signed-off-by: Huang Jianan <huangjianan@oppo.com>
z_erofs_decompressqueue_endio may not be executed in the atomic
context, for example, when dm-verity is turned on. In this scenario,
data can be decompressed directly to get rid of additional kworker
scheduling overhead.
Link: https://lore.kernel.org/r/20210317035448.13921-2-huangjianan@oppo.com
Reviewed-by: Gao Xiang <hsiangkao@redhat.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Huang Jianan <huangjianan@oppo.com>
Signed-off-by: Guo Weichao <guoweichao@oppo.com>
Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
Bug: 190585249
Change-Id: I369268e398ce8b16362ace2e6a03a5061caebd90
(cherry picked from commit 648f2de053)
Signed-off-by: Huang Jianan <huangjianan@oppo.com>
Currently, err would be treated as io error. Therefore, it'd be
better to ensure memory allocation during rolling decompression
to avoid such io error.
In the long term, we might consider adding another !Uptodate case
for such case.
Link: https://lore.kernel.org/r/20210316031515.90954-1-huangjianan@oppo.com
Reviewed-by: Gao Xiang <hsiangkao@redhat.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Huang Jianan <huangjianan@oppo.com>
Signed-off-by: Guo Weichao <guoweichao@oppo.com>
Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
Bug: 190585249
Change-Id: I8bab4ab3e0d26b698fc0054517065872eafa5d15
(cherry picked from commit b4892fa3e7)
Signed-off-by: Huang Jianan <huangjianan@oppo.com>
Try to forcely switch to inplace I/O under low memory scenario in
order to avoid direct memory reclaim due to cached page allocation.
Link: https://lore.kernel.org/r/20201209123717.12430-1-hsiangkao@aol.com
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
Signed-off-by: Huang Jianan <huangjianan@oppo.com>
Bug: 190585249
Change-Id: I877e97f493a74c94e94c3671d2cd8a102ad5ed54
(cherry picked from commit 1825c8d7ce)
Signed-off-by: Huang Jianan <huangjianan@oppo.com>
Previously, it could be some concern to call add_to_page_cache_lru()
with page->mapping == Z_EROFS_MAPPING_STAGING (!= NULL).
In contrast, page->private is used instead now, so partially revert
commit 5ddcee1f3a ("erofs: get rid of __stagingpage_alloc helper")
with some adaption for simplicity.
Link: https://lore.kernel.org/r/20201208095834.3133565-2-hsiangkao@redhat.com
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
Signed-off-by: Huang Jianan <huangjianan@oppo.com>
Bug: 190585249
Change-Id: Ia67c77a85f1e22bce2b1c0f238639fbb615d613a
(cherry picked from commit bf225074ff)
Signed-off-by: Huang Jianan <huangjianan@oppo.com>
Previously, we played around with magical page->mapping for short-lived
temporary pages since we need to identify different types of pages in
the same pcluster but both invalidated and short-lived temporary pages
can have page->mapping == NULL. It was considered as safe because that
temporary pages are all non-LRU / non-movable pages.
This patch tends to use specific page->private to identify short-lived
pages instead so it won't rely on page->mapping anymore. Details are
described in "compress.h" as well.
Link: https://lore.kernel.org/r/20201208095834.3133565-1-hsiangkao@redhat.com
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
Signed-off-by: Huang Jianan <huangjianan@oppo.com>
Bug: 190585249
Change-Id: I0beb89854846dc5fe9bf167bc9ce311659003dc3
(cherry picked from commit 6aaa7b0664)
Signed-off-by: Huang Jianan <huangjianan@oppo.com>
There is a deadlock when the reply of FUSE_CANONICAL_PATH from user-
space client, because the kern_path function will issue a new request
and wait the respond from client which has been in wait state. The ba-
cktrace is like this:
<6>[ 518.977731] ntfs-3g S 0 2138 1 0x04000000
<4>[ 518.977745] Call trace:
<4>[ 518.977757] __switch_to+0x130/0x13c
<4>[ 518.977767] __schedule+0x740/0x964
<4>[ 518.977777] schedule+0x70/0x90
<4>[ 518.977794] __fuse_request_send+0x1a0/0x340
<4>[ 518.977808] fuse_simple_request+0x178/0x1c8
<4>[ 518.977818] fuse_lookup_name+0xfc/0x220
<4>[ 518.977829] fuse_lookup+0x48/0x134
<4>[ 518.977842] __lookup_slow+0xc8/0x154
<4>[ 518.977853] walk_component+0x1c0/0x728
<4>[ 518.977863] path_lookupat+0xa8/0x208
<4>[ 518.977875] filename_lookup+0x8c/0x190
<4>[ 518.977887] kern_path+0x30/0x3c
<4>[ 518.977901] fuse_dev_do_write+0x79c/0x114c
<4>[ 518.977914] fuse_dev_write+0x60/0x84
<4>[ 518.977928] do_iter_readv_writev+0x11c/0x158
<4>[ 518.977941] do_iter_write+0x7c/0x1b8
<4>[ 518.977953] vfs_writev+0x84/0xe8
<4>[ 518.977966] do_writev+0x78/0x114
<4>[ 518.977979] __arm64_sys_writev+0x1c/0x24
<4>[ 518.977992] el0_svc_common+0x98/0x160
<4>[ 518.978005] el0_svc_handler+0x5c/0x64
<4>[ 518.978015] el0_svc+0x8/0xc
Fixes: fa199896a3 ("ANDROID: fuse: Add support for d_canonical_path")
Signed-off-by: Cliff Chen <cliff.chen@rock-chips.com>
Change-Id: I13487e5c956c4537c2554a44208d6664653ef4f1