Commit Graph

154 Commits

Author SHA1 Message Date
Greg Kroah-Hartman
8735c21738 This is the 4.19.14 stable release
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAlw2Jd8ACgkQONu9yGCS
 aT5DIw//RlX7Djwh9VnEEgggVpPxzIDfO8BcIR5EvSpHoci2skeD6/M5a+xiKKLk
 HOuH/cqBobkifnCzHwHLQYP9rIbkRceW0wDU2tdaecTf6G82TPoa5rQzG0rMMTM4
 HFrMlMXvQoWSlaALBi5xkGGa7AGOVcmiJBaIkbqNST4Ah8KMBRxEqDvnbh/ALXCe
 qLRc7lDf/WRoN9GBzoCJwuaF9EcDW/C3EyHowVroDkN3UobzfdFSmrjkteFbkIkp
 9rMzoyIXmKAe762ggkQTk8hEaVHqs7YxWlq53cym6NBtiBgfjqIKtT6tEtGs5U3i
 sA+YK6PzCfwp4I0ffXVqUoFi3WfJ4Ist+co8e8Uu0+taRDzahBkxtxxmNb6URU64
 1sosY0YyG7k72OYp9J4mYhCAbxUKC8S80TWjwPlyaVaUDWDHAbOQk5HDJ9wIERmN
 PltF9wQ7ZQrha4v4nafPYJn/FmQuDCfDA78vOJ09PEbNZoNBhqXbHJGx/GEShdDE
 /ZzoVigpN2tqIvXFM99rVPRDaTsWlCSiorOvn8vTyqv64EaGO2qZUDmvaReEbUxy
 i1jJ5YcQoPk4GbNI8hfShGOhT+eAtw/KW5pHwqHbEle6jyeK+7KIdBmzw5ZXQIM6
 4tzDOgn7yIpkMc+qyj3n3WE1LqRLt/cbOoxMu85jHDf5LgrtF50=
 =Gqyx
 -----END PGP SIGNATURE-----

Merge 4.19.14 into android-4.19

Changes in 4.19.14
	ax25: fix a use-after-free in ax25_fillin_cb()
	gro_cell: add napi_disable in gro_cells_destroy
	ibmveth: fix DMA unmap error in ibmveth_xmit_start error path
	ieee802154: lowpan_header_create check must check daddr
	ip6mr: Fix potential Spectre v1 vulnerability
	ipv4: Fix potential Spectre v1 vulnerability
	ipv6: explicitly initialize udp6_addr in udp_sock_create6()
	ipv6: tunnels: fix two use-after-free
	ip: validate header length on virtual device xmit
	isdn: fix kernel-infoleak in capi_unlocked_ioctl
	net: clear skb->tstamp in forwarding paths
	net/hamradio/6pack: use mod_timer() to rearm timers
	net: ipv4: do not handle duplicate fragments as overlapping
	net: macb: restart tx after tx used bit read
	net: mvpp2: 10G modes aren't supported on all ports
	net: phy: Fix the issue that netif always links up after resuming
	netrom: fix locking in nr_find_socket()
	net/smc: fix TCP fallback socket release
	net: stmmac: Fix an error code in probe()
	net/tls: allocate tls context using GFP_ATOMIC
	net/wan: fix a double free in x25_asy_open_tty()
	packet: validate address length
	packet: validate address length if non-zero
	ptr_ring: wrap back ->producer in __ptr_ring_swap_queue()
	qmi_wwan: Added support for Fibocom NL668 series
	qmi_wwan: Added support for Telit LN940 series
	qmi_wwan: Add support for Fibocom NL678 series
	sctp: initialize sin6_flowinfo for ipv6 addrs in sctp_inet6addr_event
	sock: Make sock->sk_stamp thread-safe
	tcp: fix a race in inet_diag_dump_icsk()
	tipc: check tsk->group in tipc_wait_for_cond()
	tipc: compare remote and local protocols in tipc_udp_enable()
	tipc: fix a double free in tipc_enable_bearer()
	tipc: fix a double kfree_skb()
	tipc: use lock_sock() in tipc_sk_reinit()
	vhost: make sure used idx is seen before log in vhost_add_used_n()
	VSOCK: Send reset control packet when socket is partially bound
	xen/netfront: tolerate frags with no data
	net/mlx5: Typo fix in del_sw_hw_rule
	tipc: check group dests after tipc_wait_for_cond()
	net/mlx5e: Remove the false indication of software timestamping support
	ipv6: frags: Fix bogus skb->sk in reassembled packets
	net/ipv6: Fix a test against 'ipv6_find_idev()' return value
	nfp: flower: ensure TCP flags can be placed in IPv6 frame
	ipv6: route: Fix return value of ip6_neigh_lookup() on neigh_create() error
	mscc: Configured MAC entries should be locked.
	net/mlx5e: Cancel DIM work on close SQ
	net/mlx5e: RX, Verify MPWQE stride size is in range
	net: mvpp2: fix the phylink mode validation
	qed: Fix command number mismatch between driver and the mfw
	mlxsw: core: Increase timeout during firmware flash process
	net/mlx5e: Remove unused UDP GSO remaining counter
	net/mlx5e: RX, Fix wrong early return in receive queue poll
	net: mvneta: fix operation for 64K PAGE_SIZE
	net: Use __kernel_clockid_t in uapi net_stamp.h
	r8169: fix WoL device wakeup enable
	IB/hfi1: Incorrect sizing of sge for PIO will OOPs
	ALSA: rme9652: Fix potential Spectre v1 vulnerability
	ALSA: emu10k1: Fix potential Spectre v1 vulnerabilities
	ALSA: pcm: Fix potential Spectre v1 vulnerability
	ALSA: emux: Fix potential Spectre v1 vulnerabilities
	powerpc/fsl: Fix spectre_v2 mitigations reporting
	mtd: atmel-quadspi: disallow building on ebsa110
	mtd: rawnand: marvell: prevent timeouts on a loaded machine
	mtd: rawnand: omap2: Pass the parent of pdev to dma_request_chan()
	ALSA: hda: add mute LED support for HP EliteBook 840 G4
	ALSA: hda/realtek: Enable audio jacks of ASUS UX391UA with ALC294
	ALSA: fireface: fix for state to fetch PCM frames
	ALSA: firewire-lib: fix wrong handling payload_length as payload_quadlet
	ALSA: firewire-lib: fix wrong assignment for 'out_packet_without_header' tracepoint
	ALSA: firewire-lib: use the same print format for 'without_header' tracepoints
	ALSA: hda/realtek: Enable the headset mic auto detection for ASUS laptops
	ALSA: hda/tegra: clear pending irq handlers
	usb: dwc2: host: use hrtimer for NAK retries
	USB: serial: pl2303: add ids for Hewlett-Packard HP POS pole displays
	USB: serial: option: add Fibocom NL678 series
	usb: r8a66597: Fix a possible concurrency use-after-free bug in r8a66597_endpoint_disable()
	usb: dwc2: disable power_down on Amlogic devices
	Revert "usb: dwc3: pci: Use devm functions to get the phy GPIOs"
	usb: roles: Add a description for the class to Kconfig
	media: dvb-usb-v2: Fix incorrect use of transfer_flags URB_FREE_BUFFER
	staging: wilc1000: fix missing read_write setting when reading data
	ASoC: intel: cht_bsw_max98090_ti: Add pmc_plt_clk_0 quirk for Chromebook Clapper
	ASoC: intel: cht_bsw_max98090_ti: Add pmc_plt_clk_0 quirk for Chromebook Gnawty
	s390/pci: fix sleeping in atomic during hotplug
	Input: atmel_mxt_ts - don't try to free unallocated kernel memory
	Input: elan_i2c - add ACPI ID for touchpad in ASUS Aspire F5-573G
	x86/speculation/l1tf: Drop the swap storage limit restriction when l1tf=off
	x86/mm: Drop usage of __flush_tlb_all() in kernel_physical_mapping_init()
	KVM: x86: Use jmp to invoke kvm_spurious_fault() from .fixup
	arm64: KVM: Make VHE Stage-2 TLB invalidation operations non-interruptible
	KVM: nVMX: Free the VMREAD/VMWRITE bitmaps if alloc_kvm_area() fails
	platform-msi: Free descriptors in platform_msi_domain_free()
	drm/v3d: Skip debugfs dumping GCA on platforms without GCA.
	DRM: UDL: get rid of useless vblank initialization
	clocksource/drivers/arc_timer: Utilize generic sched_clock
	perf machine: Record if a arch has a single user/kernel address space
	perf thread: Add fallback functions for cases where cpumode is insufficient
	perf tools: Use fallback for sample_addr_correlates_sym() cases
	perf script: Use fallbacks for branch stacks
	perf pmu: Suppress potential format-truncation warning
	perf env: Also consider env->arch == NULL as local operation
	ocxl: Fix endiannes bug in ocxl_link_update_pe()
	ocxl: Fix endiannes bug in read_afu_name()
	ext4: add ext4_sb_bread() to disambiguate ENOMEM cases
	ext4: fix possible use after free in ext4_quota_enable
	ext4: missing unlock/put_page() in ext4_try_to_write_inline_data()
	ext4: fix EXT4_IOC_GROUP_ADD ioctl
	ext4: include terminating u32 in size of xattr entries when expanding inodes
	ext4: avoid declaring fs inconsistent due to invalid file handles
	ext4: force inode writes when nfsd calls commit_metadata()
	ext4: check for shutdown and r/o file system in ext4_write_inode()
	spi: bcm2835: Fix race on DMA termination
	spi: bcm2835: Fix book-keeping of DMA termination
	spi: bcm2835: Avoid finishing transfer prematurely in IRQ mode
	clk: rockchip: fix typo in rk3188 spdif_frac parent
	clk: sunxi-ng: Use u64 for calculation of NM rate
	crypto: cavium/nitrox - fix a DMA pool free failure
	crypto: chcr - small packet Tx stalls the queue
	crypto: testmgr - add AES-CFB tests
	crypto: cfb - fix decryption
	cgroup: fix CSS_TASK_ITER_PROCS
	cdc-acm: fix abnormal DATA RX issue for Mediatek Preloader.
	btrfs: dev-replace: go back to suspended state if target device is missing
	btrfs: dev-replace: go back to suspend state if another EXCL_OP is running
	btrfs: skip file_extent generation check for free_space_inode in run_delalloc_nocow
	Btrfs: fix fsync of files with multiple hard links in new directories
	btrfs: run delayed items before dropping the snapshot
	Btrfs: send, fix race with transaction commits that create snapshots
	brcmfmac: fix roamoff=1 modparam
	brcmfmac: Fix out of bounds memory access during fw load
	powerpc/tm: Unset MSR[TS] if not recheckpointing
	dax: Don't access a freed inode
	dax: Use non-exclusive wait in wait_entry_unlocked()
	f2fs: read page index before freeing
	f2fs: fix validation of the block count in sanity_check_raw_super
	f2fs: sanity check of xattr entry size
	serial: uartps: Fix interrupt mask issue to handle the RX interrupts properly
	media: cec: keep track of outstanding transmits
	media: cec-pin: fix broken tx_ignore_nack_until_eom error injection
	media: rc: cec devices do not have a lirc chardev
	media: imx274: fix stack corruption in imx274_read_reg
	media: vivid: free bitmap_cap when updating std/timings/etc.
	media: vb2: check memory model for VIDIOC_CREATE_BUFS
	media: v4l2-tpg: array index could become negative
	tools lib traceevent: Fix processing of dereferenced args in bprintk events
	MIPS: math-emu: Write-protect delay slot emulation pages
	MIPS: c-r4k: Add r4k_blast_scache_node for Loongson-3
	MIPS: Ensure pmd_present() returns false after pmd_mknotpresent()
	MIPS: Align kernel load address to 64KB
	MIPS: Expand MIPS32 ASIDs to 64 bits
	MIPS: OCTEON: mark RGMII interface disabled on OCTEON III
	MIPS: Fix a R10000_LLSC_WAR logic in atomic.h
	CIFS: Fix error mapping for SMB2_LOCK command which caused OFD lock problem
	smb3: fix large reads on encrypted connections
	arm64: KVM: Avoid setting the upper 32 bits of VTCR_EL2 to 1
	arm/arm64: KVM: vgic: Force VM halt when changing the active state of GICv3 PPIs/SGIs
	ARM: dts: exynos: Specify I2S assigned clocks in proper node
	rtc: m41t80: Correct alarm month range with RTC reads
	KVM: arm/arm64: vgic: Do not cond_resched_lock() with IRQs disabled
	KVM: arm/arm64: vgic: Cap SPIs to the VM-defined maximum
	KVM: arm/arm64: vgic-v2: Set active_source to 0 when restoring state
	KVM: arm/arm64: vgic: Fix off-by-one bug in vgic_get_irq()
	iommu/arm-smmu-v3: Fix big-endian CMD_SYNC writes
	arm64: compat: Avoid sending SIGILL for unallocated syscall numbers
	tpm: tpm_try_transmit() refactor error flow.
	tpm: tpm_i2c_nuvoton: use correct command duration for TPM 2.x
	spi: bcm2835: Unbreak the build of esoteric configs
	MIPS: Only include mmzone.h when CONFIG_NEED_MULTIPLE_NODES=y
	Linux 4.19.14

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
2019-01-09 18:55:03 +01:00
Tejun Heo
8a2fbdd5b0 cgroup: fix CSS_TASK_ITER_PROCS
commit e9d81a1bc2 upstream.

CSS_TASK_ITER_PROCS implements process-only iteration by making
css_task_iter_advance() skip tasks which aren't threadgroup leaders;
however, when an iteration is started css_task_iter_start() calls the
inner helper function css_task_iter_advance_css_set() instead of
css_task_iter_advance().  As the helper doesn't have the skip logic,
when the first task to visit is a non-leader thread, it doesn't get
skipped correctly as shown in the following example.

  # ps -L 2030
    PID   LWP TTY      STAT   TIME COMMAND
   2030  2030 pts/0    Sl+    0:00 ./test-thread
   2030  2031 pts/0    Sl+    0:00 ./test-thread
  # mkdir -p /sys/fs/cgroup/x/a/b
  # echo threaded > /sys/fs/cgroup/x/a/cgroup.type
  # echo threaded > /sys/fs/cgroup/x/a/b/cgroup.type
  # echo 2030 > /sys/fs/cgroup/x/a/cgroup.procs
  # cat /sys/fs/cgroup/x/a/cgroup.threads
  2030
  2031
  # cat /sys/fs/cgroup/x/cgroup.procs
  2030
  # echo 2030 > /sys/fs/cgroup/x/a/b/cgroup.threads
  # cat /sys/fs/cgroup/x/cgroup.procs
  2031
  2030

The last read of cgroup.procs is incorrectly showing non-leader 2031
in cgroup.procs output.

This can be fixed by updating css_task_iter_advance() to handle the
first advance and css_task_iters_tart() to call
css_task_iter_advance() instead of the inner helper.  After the fix,
the same commands result in the following (correct) result:

  # ps -L 2062
    PID   LWP TTY      STAT   TIME COMMAND
   2062  2062 pts/0    Sl+    0:00 ./test-thread
   2062  2063 pts/0    Sl+    0:00 ./test-thread
  # mkdir -p /sys/fs/cgroup/x/a/b
  # echo threaded > /sys/fs/cgroup/x/a/cgroup.type
  # echo threaded > /sys/fs/cgroup/x/a/b/cgroup.type
  # echo 2062 > /sys/fs/cgroup/x/a/cgroup.procs
  # cat /sys/fs/cgroup/x/a/cgroup.threads
  2062
  2063
  # cat /sys/fs/cgroup/x/cgroup.procs
  2062
  # echo 2062 > /sys/fs/cgroup/x/a/b/cgroup.threads
  # cat /sys/fs/cgroup/x/cgroup.procs
  2062

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: "Michael Kerrisk (man-pages)" <mtk.manpages@gmail.com>
Fixes: 8cfd8147df ("cgroup: implement cgroup v2 thread support")
Cc: stable@vger.kernel.org # v4.14+
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-01-09 17:38:45 +01:00
Dmitry Torokhov
b27b4492b7 CHROMIUM: cgroups: relax permissions on moving tasks between cgroups
Android expects system_server to be able to move tasks between different
cgroups/cpusets, but does not want to be running as root. Let's relax
permission check so that processes can move other tasks if they have
CAP_SYS_NICE in the affected task's user namespace.

BUG=b:31790445,chromium:647994
TEST=Boot android container, examine logcat

Signed-off-by: Dmitry Torokhov <dtor@chromium.org>
Reviewed-on: https://chromium-review.googlesource.com/394927
Reviewed-by: Ricky Zhou <rickyz@chromium.org>
[AmitP: Refactored original changes to align with upstream commit
        201af4c0fa ("cgroup: move cgroup files under kernel/cgroup/")]
Bug: 120445593
Change-Id: Ia919c66ab6ed6a6daf7c4cf67feb38b13b1ad09b
Signed-off-by: Amit Pundir <amit.pundir@linaro.org>
2018-12-05 09:48:12 -08:00
Riley Andrews
ea7de80972 ANDROID: cpuset: Make cpusets restore on hotplug
This deliberately changes the behavior of the per-cpuset
cpus file to not be effected by hotplug. When a cpu is offlined,
it will be removed from the cpuset/cpus file. When a cpu is onlined,
if the cpuset originally requested that that cpu was part of the cpuset,
that cpu will be restored to the cpuset. The cpus files still
have to be hierachical, but the ranges no longer have to be out of
the currently online cpus, just the physically present cpus.

Bug: 120444281
Change-Id: I22cdf33e7d312117bcefba1aeb0125e1ada289a9
Signed-off-by: Dmitry Shmidt <dimitrysh@google.com>
[AmitP: Refactored original changes to align with upstream commit
        201af4c0fa ("cgroup: move cgroup files under kernel/cgroup/")]
Signed-off-by: Amit Pundir <amit.pundir@linaro.org>
2018-12-05 09:48:12 -08:00
Tejun Heo
479adb89a9 cgroup: Fix dom_cgrp propagation when enabling threaded mode
A cgroup which is already a threaded domain may be converted into a
threaded cgroup if the prerequisite conditions are met.  When this
happens, all threaded descendant should also have their ->dom_cgrp
updated to the new threaded domain cgroup.  Unfortunately, this
propagation was missing leading to the following failure.

  # cd /sys/fs/cgroup/unified
  # cat cgroup.subtree_control    # show that no controllers are enabled

  # mkdir -p mycgrp/a/b/c
  # echo threaded > mycgrp/a/b/cgroup.type

  At this point, the hierarchy looks as follows:

      mycgrp [d]
	  a [dt]
	      b [t]
		  c [inv]

  Now let's make node "a" threaded (and thus "mycgrp" s made "domain threaded"):

  # echo threaded > mycgrp/a/cgroup.type

  By this point, we now have a hierarchy that looks as follows:

      mycgrp [dt]
	  a [t]
	      b [t]
		  c [inv]

  But, when we try to convert the node "c" from "domain invalid" to
  "threaded", we get ENOTSUP on the write():

  # echo threaded > mycgrp/a/b/c/cgroup.type
  sh: echo: write error: Operation not supported

This patch fixes the problem by

* Moving the opencoded ->dom_cgrp save and restoration in
  cgroup_enable_threaded() into cgroup_{save|restore}_control() so
  that mulitple cgroups can be handled.

* Updating all threaded descendants' ->dom_cgrp to point to the new
  dom_cgrp when enabling threaded mode.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-and-tested-by: "Michael Kerrisk (man-pages)" <mtk.manpages@gmail.com>
Reported-by: Amin Jamali <ajamali@pivotal.io>
Reported-by: Joao De Almeida Pereira <jpereira@pivotal.io>
Link: https://lore.kernel.org/r/CAKgNAkhHYCMn74TCNiMJ=ccLd7DcmXSbvw3CbZ1YREeG7iJM5g@mail.gmail.com
Fixes: 454000adaa ("cgroup: introduce cgroup->dom_cgrp and threaded css_set handling")
Cc: stable@vger.kernel.org # v4.14+
2018-10-04 13:28:08 -07:00
Linus Torvalds
596766102a Merge branch 'for-4.19' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup updates from Tejun Heo:
 "Just one commit from Steven to take out spin lock from trace event
  handlers"

* 'for-4.19' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
  cgroup/tracing: Move taking of spin lock out of trace event handlers
2018-08-24 13:19:27 -07:00
Dmitry Torokhov
488dee96bb kernfs: allow creating kernfs objects with arbitrary uid/gid
This change allows creating kernfs files and directories with arbitrary
uid/gid instead of always using GLOBAL_ROOT_UID/GID by extending
kernfs_create_dir_ns() and kernfs_create_file_ns() with uid/gid arguments.
The "simple" kernfs_create_file() and kernfs_create_dir() are left alone
and always create objects belonging to the global root.

When creating symlinks ownership (uid/gid) is taken from the target kernfs
object.

Co-Developed-by: Tyler Hicks <tyhicks@canonical.com>
Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
Signed-off-by: Tyler Hicks <tyhicks@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-20 23:44:35 -07:00
Steven Rostedt (VMware)
e4f8d81c73 cgroup/tracing: Move taking of spin lock out of trace event handlers
It is unwise to take spin locks from the handlers of trace events.
Mainly, because they can introduce lockups, because it introduces locks
in places that are normally not tested. Worse yet, because trace events
are tucked away in the include/trace/events/ directory, locks that are
taken there are forgotten about.

As a general rule, I tell people never to take any locks in a trace
event handler.

Several cgroup trace event handlers call cgroup_path() which eventually
takes the kernfs_rename_lock spinlock. This injects the spinlock in the
code without people realizing it. It also can cause issues for the
PREEMPT_RT patch, as the spinlock becomes a mutex, and the trace event
handlers are called with preemption disabled.

By moving the calculation of the cgroup_path() out of the trace event
handlers and into a macro (surrounded by a
trace_cgroup_##type##_enabled()), then we could place the cgroup_path
into a string, and pass that to the trace event. Not only does this
remove the taking of the spinlock out of the trace event handler, but
it also means that the cgroup_path() only needs to be called once (it
is currently called twice, once to get the length to reserver the
buffer for, and once again to get the path itself. Now it only needs to
be done once.

Reported-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
2018-07-11 10:48:47 -07:00
Mauro Carvalho Chehab
5fb94e9ca3 docs: Fix some broken references
As we move stuff around, some doc references are broken. Fix some of
them via this script:
	./scripts/documentation-file-ref-check --fix

Manually checked if the produced result is valid, removing a few
false-positives.

Acked-by: Takashi Iwai <tiwai@suse.de>
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Acked-by: Stephen Boyd <sboyd@kernel.org>
Acked-by: Charles Keepax <ckeepax@opensource.wolfsonmicro.com>
Acked-by: Mathieu Poirier <mathieu.poirier@linaro.org>
Reviewed-by: Coly Li <colyli@suse.de>
Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
Acked-by: Jonathan Corbet <corbet@lwn.net>
2018-06-15 18:10:01 -03:00
Kees Cook
42bc47b353 treewide: Use array_size() in vmalloc()
The vmalloc() function has no 2-factor argument form, so multiplication
factors need to be wrapped in array_size(). This patch replaces cases of:

        vmalloc(a * b)

with:
        vmalloc(array_size(a, b))

as well as handling cases of:

        vmalloc(a * b * c)

with:

        vmalloc(array3_size(a, b, c))

This does, however, attempt to ignore constant size factors like:

        vmalloc(4 * 1024)

though any constants defined via macros get caught up in the conversion.

Any factors with a sizeof() of "unsigned char", "char", and "u8" were
dropped, since they're redundant.

The Coccinelle script used for this was:

// Fix redundant parens around sizeof().
@@
type TYPE;
expression THING, E;
@@

(
  vmalloc(
-	(sizeof(TYPE)) * E
+	sizeof(TYPE) * E
  , ...)
|
  vmalloc(
-	(sizeof(THING)) * E
+	sizeof(THING) * E
  , ...)
)

// Drop single-byte sizes and redundant parens.
@@
expression COUNT;
typedef u8;
typedef __u8;
@@

(
  vmalloc(
-	sizeof(u8) * (COUNT)
+	COUNT
  , ...)
|
  vmalloc(
-	sizeof(__u8) * (COUNT)
+	COUNT
  , ...)
|
  vmalloc(
-	sizeof(char) * (COUNT)
+	COUNT
  , ...)
|
  vmalloc(
-	sizeof(unsigned char) * (COUNT)
+	COUNT
  , ...)
|
  vmalloc(
-	sizeof(u8) * COUNT
+	COUNT
  , ...)
|
  vmalloc(
-	sizeof(__u8) * COUNT
+	COUNT
  , ...)
|
  vmalloc(
-	sizeof(char) * COUNT
+	COUNT
  , ...)
|
  vmalloc(
-	sizeof(unsigned char) * COUNT
+	COUNT
  , ...)
)

// 2-factor product with sizeof(type/expression) and identifier or constant.
@@
type TYPE;
expression THING;
identifier COUNT_ID;
constant COUNT_CONST;
@@

(
  vmalloc(
-	sizeof(TYPE) * (COUNT_ID)
+	array_size(COUNT_ID, sizeof(TYPE))
  , ...)
|
  vmalloc(
-	sizeof(TYPE) * COUNT_ID
+	array_size(COUNT_ID, sizeof(TYPE))
  , ...)
|
  vmalloc(
-	sizeof(TYPE) * (COUNT_CONST)
+	array_size(COUNT_CONST, sizeof(TYPE))
  , ...)
|
  vmalloc(
-	sizeof(TYPE) * COUNT_CONST
+	array_size(COUNT_CONST, sizeof(TYPE))
  , ...)
|
  vmalloc(
-	sizeof(THING) * (COUNT_ID)
+	array_size(COUNT_ID, sizeof(THING))
  , ...)
|
  vmalloc(
-	sizeof(THING) * COUNT_ID
+	array_size(COUNT_ID, sizeof(THING))
  , ...)
|
  vmalloc(
-	sizeof(THING) * (COUNT_CONST)
+	array_size(COUNT_CONST, sizeof(THING))
  , ...)
|
  vmalloc(
-	sizeof(THING) * COUNT_CONST
+	array_size(COUNT_CONST, sizeof(THING))
  , ...)
)

// 2-factor product, only identifiers.
@@
identifier SIZE, COUNT;
@@

  vmalloc(
-	SIZE * COUNT
+	array_size(COUNT, SIZE)
  , ...)

// 3-factor product with 1 sizeof(type) or sizeof(expression), with
// redundant parens removed.
@@
expression THING;
identifier STRIDE, COUNT;
type TYPE;
@@

(
  vmalloc(
-	sizeof(TYPE) * (COUNT) * (STRIDE)
+	array3_size(COUNT, STRIDE, sizeof(TYPE))
  , ...)
|
  vmalloc(
-	sizeof(TYPE) * (COUNT) * STRIDE
+	array3_size(COUNT, STRIDE, sizeof(TYPE))
  , ...)
|
  vmalloc(
-	sizeof(TYPE) * COUNT * (STRIDE)
+	array3_size(COUNT, STRIDE, sizeof(TYPE))
  , ...)
|
  vmalloc(
-	sizeof(TYPE) * COUNT * STRIDE
+	array3_size(COUNT, STRIDE, sizeof(TYPE))
  , ...)
|
  vmalloc(
-	sizeof(THING) * (COUNT) * (STRIDE)
+	array3_size(COUNT, STRIDE, sizeof(THING))
  , ...)
|
  vmalloc(
-	sizeof(THING) * (COUNT) * STRIDE
+	array3_size(COUNT, STRIDE, sizeof(THING))
  , ...)
|
  vmalloc(
-	sizeof(THING) * COUNT * (STRIDE)
+	array3_size(COUNT, STRIDE, sizeof(THING))
  , ...)
|
  vmalloc(
-	sizeof(THING) * COUNT * STRIDE
+	array3_size(COUNT, STRIDE, sizeof(THING))
  , ...)
)

// 3-factor product with 2 sizeof(variable), with redundant parens removed.
@@
expression THING1, THING2;
identifier COUNT;
type TYPE1, TYPE2;
@@

(
  vmalloc(
-	sizeof(TYPE1) * sizeof(TYPE2) * COUNT
+	array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
  , ...)
|
  vmalloc(
-	sizeof(TYPE1) * sizeof(THING2) * (COUNT)
+	array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
  , ...)
|
  vmalloc(
-	sizeof(THING1) * sizeof(THING2) * COUNT
+	array3_size(COUNT, sizeof(THING1), sizeof(THING2))
  , ...)
|
  vmalloc(
-	sizeof(THING1) * sizeof(THING2) * (COUNT)
+	array3_size(COUNT, sizeof(THING1), sizeof(THING2))
  , ...)
|
  vmalloc(
-	sizeof(TYPE1) * sizeof(THING2) * COUNT
+	array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
  , ...)
|
  vmalloc(
-	sizeof(TYPE1) * sizeof(THING2) * (COUNT)
+	array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
  , ...)
)

// 3-factor product, only identifiers, with redundant parens removed.
@@
identifier STRIDE, SIZE, COUNT;
@@

(
  vmalloc(
-	(COUNT) * STRIDE * SIZE
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  vmalloc(
-	COUNT * (STRIDE) * SIZE
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  vmalloc(
-	COUNT * STRIDE * (SIZE)
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  vmalloc(
-	(COUNT) * (STRIDE) * SIZE
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  vmalloc(
-	COUNT * (STRIDE) * (SIZE)
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  vmalloc(
-	(COUNT) * STRIDE * (SIZE)
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  vmalloc(
-	(COUNT) * (STRIDE) * (SIZE)
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  vmalloc(
-	COUNT * STRIDE * SIZE
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
)

// Any remaining multi-factor products, first at least 3-factor products
// when they're not all constants...
@@
expression E1, E2, E3;
constant C1, C2, C3;
@@

(
  vmalloc(C1 * C2 * C3, ...)
|
  vmalloc(
-	E1 * E2 * E3
+	array3_size(E1, E2, E3)
  , ...)
)

// And then all remaining 2 factors products when they're not all constants.
@@
expression E1, E2;
constant C1, C2;
@@

(
  vmalloc(C1 * C2, ...)
|
  vmalloc(
-	E1 * E2
+	array_size(E1, E2)
  , ...)
)

Signed-off-by: Kees Cook <keescook@chromium.org>
2018-06-12 16:19:22 -07:00
Kees Cook
6da2ec5605 treewide: kmalloc() -> kmalloc_array()
The kmalloc() function has a 2-factor argument form, kmalloc_array(). This
patch replaces cases of:

        kmalloc(a * b, gfp)

with:
        kmalloc_array(a * b, gfp)

as well as handling cases of:

        kmalloc(a * b * c, gfp)

with:

        kmalloc(array3_size(a, b, c), gfp)

as it's slightly less ugly than:

        kmalloc_array(array_size(a, b), c, gfp)

This does, however, attempt to ignore constant size factors like:

        kmalloc(4 * 1024, gfp)

though any constants defined via macros get caught up in the conversion.

Any factors with a sizeof() of "unsigned char", "char", and "u8" were
dropped, since they're redundant.

The tools/ directory was manually excluded, since it has its own
implementation of kmalloc().

The Coccinelle script used for this was:

// Fix redundant parens around sizeof().
@@
type TYPE;
expression THING, E;
@@

(
  kmalloc(
-	(sizeof(TYPE)) * E
+	sizeof(TYPE) * E
  , ...)
|
  kmalloc(
-	(sizeof(THING)) * E
+	sizeof(THING) * E
  , ...)
)

// Drop single-byte sizes and redundant parens.
@@
expression COUNT;
typedef u8;
typedef __u8;
@@

(
  kmalloc(
-	sizeof(u8) * (COUNT)
+	COUNT
  , ...)
|
  kmalloc(
-	sizeof(__u8) * (COUNT)
+	COUNT
  , ...)
|
  kmalloc(
-	sizeof(char) * (COUNT)
+	COUNT
  , ...)
|
  kmalloc(
-	sizeof(unsigned char) * (COUNT)
+	COUNT
  , ...)
|
  kmalloc(
-	sizeof(u8) * COUNT
+	COUNT
  , ...)
|
  kmalloc(
-	sizeof(__u8) * COUNT
+	COUNT
  , ...)
|
  kmalloc(
-	sizeof(char) * COUNT
+	COUNT
  , ...)
|
  kmalloc(
-	sizeof(unsigned char) * COUNT
+	COUNT
  , ...)
)

// 2-factor product with sizeof(type/expression) and identifier or constant.
@@
type TYPE;
expression THING;
identifier COUNT_ID;
constant COUNT_CONST;
@@

(
- kmalloc
+ kmalloc_array
  (
-	sizeof(TYPE) * (COUNT_ID)
+	COUNT_ID, sizeof(TYPE)
  , ...)
|
- kmalloc
+ kmalloc_array
  (
-	sizeof(TYPE) * COUNT_ID
+	COUNT_ID, sizeof(TYPE)
  , ...)
|
- kmalloc
+ kmalloc_array
  (
-	sizeof(TYPE) * (COUNT_CONST)
+	COUNT_CONST, sizeof(TYPE)
  , ...)
|
- kmalloc
+ kmalloc_array
  (
-	sizeof(TYPE) * COUNT_CONST
+	COUNT_CONST, sizeof(TYPE)
  , ...)
|
- kmalloc
+ kmalloc_array
  (
-	sizeof(THING) * (COUNT_ID)
+	COUNT_ID, sizeof(THING)
  , ...)
|
- kmalloc
+ kmalloc_array
  (
-	sizeof(THING) * COUNT_ID
+	COUNT_ID, sizeof(THING)
  , ...)
|
- kmalloc
+ kmalloc_array
  (
-	sizeof(THING) * (COUNT_CONST)
+	COUNT_CONST, sizeof(THING)
  , ...)
|
- kmalloc
+ kmalloc_array
  (
-	sizeof(THING) * COUNT_CONST
+	COUNT_CONST, sizeof(THING)
  , ...)
)

// 2-factor product, only identifiers.
@@
identifier SIZE, COUNT;
@@

- kmalloc
+ kmalloc_array
  (
-	SIZE * COUNT
+	COUNT, SIZE
  , ...)

// 3-factor product with 1 sizeof(type) or sizeof(expression), with
// redundant parens removed.
@@
expression THING;
identifier STRIDE, COUNT;
type TYPE;
@@

(
  kmalloc(
-	sizeof(TYPE) * (COUNT) * (STRIDE)
+	array3_size(COUNT, STRIDE, sizeof(TYPE))
  , ...)
|
  kmalloc(
-	sizeof(TYPE) * (COUNT) * STRIDE
+	array3_size(COUNT, STRIDE, sizeof(TYPE))
  , ...)
|
  kmalloc(
-	sizeof(TYPE) * COUNT * (STRIDE)
+	array3_size(COUNT, STRIDE, sizeof(TYPE))
  , ...)
|
  kmalloc(
-	sizeof(TYPE) * COUNT * STRIDE
+	array3_size(COUNT, STRIDE, sizeof(TYPE))
  , ...)
|
  kmalloc(
-	sizeof(THING) * (COUNT) * (STRIDE)
+	array3_size(COUNT, STRIDE, sizeof(THING))
  , ...)
|
  kmalloc(
-	sizeof(THING) * (COUNT) * STRIDE
+	array3_size(COUNT, STRIDE, sizeof(THING))
  , ...)
|
  kmalloc(
-	sizeof(THING) * COUNT * (STRIDE)
+	array3_size(COUNT, STRIDE, sizeof(THING))
  , ...)
|
  kmalloc(
-	sizeof(THING) * COUNT * STRIDE
+	array3_size(COUNT, STRIDE, sizeof(THING))
  , ...)
)

// 3-factor product with 2 sizeof(variable), with redundant parens removed.
@@
expression THING1, THING2;
identifier COUNT;
type TYPE1, TYPE2;
@@

(
  kmalloc(
-	sizeof(TYPE1) * sizeof(TYPE2) * COUNT
+	array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
  , ...)
|
  kmalloc(
-	sizeof(TYPE1) * sizeof(THING2) * (COUNT)
+	array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
  , ...)
|
  kmalloc(
-	sizeof(THING1) * sizeof(THING2) * COUNT
+	array3_size(COUNT, sizeof(THING1), sizeof(THING2))
  , ...)
|
  kmalloc(
-	sizeof(THING1) * sizeof(THING2) * (COUNT)
+	array3_size(COUNT, sizeof(THING1), sizeof(THING2))
  , ...)
|
  kmalloc(
-	sizeof(TYPE1) * sizeof(THING2) * COUNT
+	array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
  , ...)
|
  kmalloc(
-	sizeof(TYPE1) * sizeof(THING2) * (COUNT)
+	array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
  , ...)
)

// 3-factor product, only identifiers, with redundant parens removed.
@@
identifier STRIDE, SIZE, COUNT;
@@

(
  kmalloc(
-	(COUNT) * STRIDE * SIZE
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  kmalloc(
-	COUNT * (STRIDE) * SIZE
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  kmalloc(
-	COUNT * STRIDE * (SIZE)
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  kmalloc(
-	(COUNT) * (STRIDE) * SIZE
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  kmalloc(
-	COUNT * (STRIDE) * (SIZE)
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  kmalloc(
-	(COUNT) * STRIDE * (SIZE)
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  kmalloc(
-	(COUNT) * (STRIDE) * (SIZE)
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  kmalloc(
-	COUNT * STRIDE * SIZE
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
)

// Any remaining multi-factor products, first at least 3-factor products,
// when they're not all constants...
@@
expression E1, E2, E3;
constant C1, C2, C3;
@@

(
  kmalloc(C1 * C2 * C3, ...)
|
  kmalloc(
-	(E1) * E2 * E3
+	array3_size(E1, E2, E3)
  , ...)
|
  kmalloc(
-	(E1) * (E2) * E3
+	array3_size(E1, E2, E3)
  , ...)
|
  kmalloc(
-	(E1) * (E2) * (E3)
+	array3_size(E1, E2, E3)
  , ...)
|
  kmalloc(
-	E1 * E2 * E3
+	array3_size(E1, E2, E3)
  , ...)
)

// And then all remaining 2 factors products when they're not all constants,
// keeping sizeof() as the second factor argument.
@@
expression THING, E1, E2;
type TYPE;
constant C1, C2, C3;
@@

(
  kmalloc(sizeof(THING) * C2, ...)
|
  kmalloc(sizeof(TYPE) * C2, ...)
|
  kmalloc(C1 * C2 * C3, ...)
|
  kmalloc(C1 * C2, ...)
|
- kmalloc
+ kmalloc_array
  (
-	sizeof(TYPE) * (E2)
+	E2, sizeof(TYPE)
  , ...)
|
- kmalloc
+ kmalloc_array
  (
-	sizeof(TYPE) * E2
+	E2, sizeof(TYPE)
  , ...)
|
- kmalloc
+ kmalloc_array
  (
-	sizeof(THING) * (E2)
+	E2, sizeof(THING)
  , ...)
|
- kmalloc
+ kmalloc_array
  (
-	sizeof(THING) * E2
+	E2, sizeof(THING)
  , ...)
|
- kmalloc
+ kmalloc_array
  (
-	(E1) * E2
+	E1, E2
  , ...)
|
- kmalloc
+ kmalloc_array
  (
-	(E1) * (E2)
+	E1, E2
  , ...)
|
- kmalloc
+ kmalloc_array
  (
-	E1 * E2
+	E1, E2
  , ...)
)

Signed-off-by: Kees Cook <keescook@chromium.org>
2018-06-12 16:19:22 -07:00
Linus Torvalds
2857676045 - Introduce arithmetic overflow test helper functions (Rasmus)
- Use overflow helpers in 2-factor allocators (Kees, Rasmus)
 - Introduce overflow test module (Rasmus, Kees)
 - Introduce saturating size helper functions (Matthew, Kees)
 - Treewide use of struct_size() for allocators (Kees)
 -----BEGIN PGP SIGNATURE-----
 Comment: Kees Cook <kees@outflux.net>
 
 iQJKBAABCgA0FiEEpcP2jyKd1g9yPm4TiXL039xtwCYFAlsYJ1gWHGtlZXNjb29r
 QGNocm9taXVtLm9yZwAKCRCJcvTf3G3AJlCTEACwdEeriAd2VwxknnsstojGD/3g
 8TTFA19vSu4Gxa6WiDkjGoSmIlfhXTlZo1Nlmencv16ytSvIVDNLUIB3uDxUIv1J
 2+dyHML9JpXYHHR7zLXXnGFJL0wazqjbsD3NYQgXqmun7EVVYnOsAlBZ7h/Lwiej
 jzEJd8DaHT3TA586uD3uggiFvQU0yVyvkDCDONIytmQx+BdtGdg9TYCzkBJaXuDZ
 YIthyKDvxIw5nh/UaG3L+SKo73tUr371uAWgAfqoaGQQCWe+mxnWL4HkCKsjFzZL
 u9ouxxF/n6pij3E8n6rb0i2fCzlsTDdDF+aqV1rQ4I4hVXCFPpHUZgjDPvBWbj7A
 m6AfRHVNnOgI8HGKqBGOfViV+2kCHlYeQh3pPW33dWzy/4d/uq9NIHKxE63LH+S4
 bY3oO2ela8oxRyvEgXLjqmRYGW1LB/ZU7FS6Rkx2gRzo4k8Rv+8K/KzUHfFVRX61
 jEbiPLzko0xL9D53kcEn0c+BhofK5jgeSWxItdmfuKjLTW4jWhLRlU+bcUXb6kSS
 S3G6aF+L+foSUwoq63AS8QxCuabuhreJSB+BmcGUyjthCbK/0WjXYC6W/IJiRfBa
 3ZTxBC/2vP3uq/AGRNh5YZoxHL8mSxDfn62F+2cqlJTTKR/O+KyDb1cusyvk3H04
 KCDVLYPxwQQqK1Mqig==
 =/3L8
 -----END PGP SIGNATURE-----

Merge tag 'overflow-v4.18-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux

Pull overflow updates from Kees Cook:
 "This adds the new overflow checking helpers and adds them to the
  2-factor argument allocators. And this adds the saturating size
  helpers and does a treewide replacement for the struct_size() usage.
  Additionally this adds the overflow testing modules to make sure
  everything works.

  I'm still working on the treewide replacements for allocators with
  "simple" multiplied arguments:

     *alloc(a * b, ...) -> *alloc_array(a, b, ...)

  and

     *zalloc(a * b, ...) -> *calloc(a, b, ...)

  as well as the more complex cases, but that's separable from this
  portion of the series. I expect to have the rest sent before -rc1
  closes; there are a lot of messy cases to clean up.

  Summary:

   - Introduce arithmetic overflow test helper functions (Rasmus)

   - Use overflow helpers in 2-factor allocators (Kees, Rasmus)

   - Introduce overflow test module (Rasmus, Kees)

   - Introduce saturating size helper functions (Matthew, Kees)

   - Treewide use of struct_size() for allocators (Kees)"

* tag 'overflow-v4.18-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
  treewide: Use struct_size() for devm_kmalloc() and friends
  treewide: Use struct_size() for vmalloc()-family
  treewide: Use struct_size() for kmalloc()-family
  device: Use overflow helpers for devm_kmalloc()
  mm: Use overflow helpers in kvmalloc()
  mm: Use overflow helpers in kmalloc_array*()
  test_overflow: Add memory allocation overflow tests
  overflow.h: Add allocation size calculation helpers
  test_overflow: Report test failures
  test_overflow: macrofy some more, do more tests for free
  lib: add runtime test of check_*_overflow functions
  compiler.h: enable builtin overflow checkers and add fallback code
2018-06-06 17:27:14 -07:00
Kees Cook
acafe7e302 treewide: Use struct_size() for kmalloc()-family
One of the more common cases of allocation size calculations is finding
the size of a structure that has a zero-sized array at the end, along
with memory for some number of elements for that array. For example:

struct foo {
    int stuff;
    void *entry[];
};

instance = kmalloc(sizeof(struct foo) + sizeof(void *) * count, GFP_KERNEL);

Instead of leaving these open-coded and prone to type mistakes, we can
now use the new struct_size() helper:

instance = kmalloc(struct_size(instance, entry, count), GFP_KERNEL);

This patch makes the changes for kmalloc()-family (and kvmalloc()-family)
uses. It was done via automatic conversion with manual review for the
"CHECKME" non-standard cases noted below, using the following Coccinelle
script:

// pkey_cache = kmalloc(sizeof *pkey_cache + tprops->pkey_tbl_len *
//                      sizeof *pkey_cache->table, GFP_KERNEL);
@@
identifier alloc =~ "kmalloc|kzalloc|kvmalloc|kvzalloc";
expression GFP;
identifier VAR, ELEMENT;
expression COUNT;
@@

- alloc(sizeof(*VAR) + COUNT * sizeof(*VAR->ELEMENT), GFP)
+ alloc(struct_size(VAR, ELEMENT, COUNT), GFP)

// mr = kzalloc(sizeof(*mr) + m * sizeof(mr->map[0]), GFP_KERNEL);
@@
identifier alloc =~ "kmalloc|kzalloc|kvmalloc|kvzalloc";
expression GFP;
identifier VAR, ELEMENT;
expression COUNT;
@@

- alloc(sizeof(*VAR) + COUNT * sizeof(VAR->ELEMENT[0]), GFP)
+ alloc(struct_size(VAR, ELEMENT, COUNT), GFP)

// Same pattern, but can't trivially locate the trailing element name,
// or variable name.
@@
identifier alloc =~ "kmalloc|kzalloc|kvmalloc|kvzalloc";
expression GFP;
expression SOMETHING, COUNT, ELEMENT;
@@

- alloc(sizeof(SOMETHING) + COUNT * sizeof(ELEMENT), GFP)
+ alloc(CHECKME_struct_size(&SOMETHING, ELEMENT, COUNT), GFP)

Signed-off-by: Kees Cook <keescook@chromium.org>
2018-06-06 11:15:43 -07:00
Linus Torvalds
9f25a8da42 Merge branch 'for-4.18' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup updates from Tejun Heo:

 - For cpustat, cgroup has a percpu hierarchical stat mechanism which
   propagates up the hierarchy lazily.

   This contains commits to factor out and generalize the mechanism so
   that it can be used for other cgroup stats too.

   The original intention was to update memcg stats to use it but memcg
   went for a different approach, so still the only user is cpustat. The
   factoring out and generalization still make sense and it's likely
   that this can be used for other purposes in the future.

 - cgroup uses kernfs_notify() (which uses fsnotify()) to inform user
   space of certain events. A rate limiting mechanism is added.

 - Other misc changes.

* 'for-4.18' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
  cgroup: css_set_lock should nest inside tasklist_lock
  rdmacg: Convert to use match_string() helper
  cgroup: Make cgroup_rstat_updated() ready for root cgroup usage
  cgroup: Add memory barriers to plug cgroup_rstat_updated() race window
  cgroup: Add cgroup_subsys->css_rstat_flush()
  cgroup: Replace cgroup_rstat_mutex with a spinlock
  cgroup: Factor out and expose cgroup_rstat_*() interface functions
  cgroup: Reorganize kernel/cgroup/rstat.c
  cgroup: Distinguish base resource stat implementation from rstat
  cgroup: Rename stat to rstat
  cgroup: Rename kernel/cgroup/stat.c to kernel/cgroup/rstat.c
  cgroup: Limit event generation frequency
  cgroup: Explicitly remove core interface files
2018-06-05 17:08:45 -07:00
Tejun Heo
d8742e2290 cgroup: css_set_lock should nest inside tasklist_lock
cgroup_enable_task_cg_lists() incorrectly nests non-irq-safe
tasklist_lock inside irq-safe css_set_lock triggering the following
lockdep warning.

  WARNING: possible irq lock inversion dependency detected
  4.17.0-rc1-00027-gb37d049 #6 Not tainted
  --------------------------------------------------------
  systemd/1 just changed the state of lock:
  00000000fe57773b (css_set_lock){..-.}, at: cgroup_free+0xf2/0x12a
  but this lock took another, SOFTIRQ-unsafe lock in the past:
   (tasklist_lock){.+.+}

  and interrupts could create inverse lock ordering between them.

  other info that might help us debug this:
   Possible interrupt unsafe locking scenario:

	 CPU0                    CPU1
	 ----                    ----
    lock(tasklist_lock);
				 local_irq_disable();
				 lock(css_set_lock);
				 lock(tasklist_lock);
    <Interrupt>
      lock(css_set_lock);

   *** DEADLOCK ***

The condition is highly unlikely to actually happen especially given
that the path is executed only once per boot.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Boqun Feng <boqun.feng@gmail.com>
2018-05-23 11:04:54 -07:00
Christoph Hellwig
3f3942aca6 proc: introduce proc_create_single{,_data}
Variants of proc_create{,_data} that directly take a seq_file show
callback and drastically reduces the boilerplate code in the callers.

All trivial callers converted over.

Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-05-16 07:23:35 +02:00
Andy Shevchenko
cc659e76f3 rdmacg: Convert to use match_string() helper
The new helper returns index of the matching string in an array.
We are going to use it here.

Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2018-05-07 09:27:26 -07:00
Tejun Heo
c43c5ea75f cgroup: Make cgroup_rstat_updated() ready for root cgroup usage
cgroup_rstat_updated() ensures that the cgroup's rstat is linked to
the parent.  If there's no parent, it never gets linked and the
function ends up grabbing and releasing the cgroup_rstat_lock each
time for no reason which can be expensive.

This hasn't been a problem till now because nobody was calling the
function for the root cgroup but rstat is gonna be exposed to
controllers and use cases, so let's get ready.  Make
cgroup_rstat_updated() an no-op for the root cgroup.

Signed-off-by: Tejun Heo <tj@kernel.org>
2018-04-26 14:29:06 -07:00
Tejun Heo
9a9e97b2f1 cgroup: Add memory barriers to plug cgroup_rstat_updated() race window
cgroup_rstat_updated() has a small race window where an updated
signaling can race with flush and could be lost till the next update.
This wasn't a problem for the existing usages, but we plan to use
rstat to track counters which need to be accurate.

This patch plugs the race window by synchronizing
cgroup_rstat_updated() and flush path with memory barriers around
cgroup_rstat_cpu->updated_next pointer.

Signed-off-by: Tejun Heo <tj@kernel.org>
2018-04-26 14:29:05 -07:00
Tejun Heo
8f53470bab cgroup: Add cgroup_subsys->css_rstat_flush()
This patch adds cgroup_subsys->css_rstat_flush().  If a subsystem has
this callback, its csses are linked on cgrp->css_rstat_list and rstat
will call the function whenever the associated cgroup is flushed.
Flush is also performed when such csses are released so that residual
counts aren't lost.

Combined with the rstat API previous patches factored out, this allows
controllers to plug into rstat to manage their statistics in a
scalable way.

Signed-off-by: Tejun Heo <tj@kernel.org>
2018-04-26 14:29:05 -07:00
Tejun Heo
0fa294fb19 cgroup: Replace cgroup_rstat_mutex with a spinlock
Currently, rstat flush path is protected with a mutex which is fine as
all the existing users are from interface file show path.  However,
rstat is being generalized for use by controllers and flushing from
atomic contexts will be necessary.

This patch replaces cgroup_rstat_mutex with a spinlock and adds a
irq-safe flush function - cgroup_rstat_flush_irqsafe().  Explicit
yield handling is added to the flush path so that other flush
functions can yield to other threads and flushers.

Signed-off-by: Tejun Heo <tj@kernel.org>
2018-04-26 14:29:05 -07:00
Tejun Heo
6162cef0f7 cgroup: Factor out and expose cgroup_rstat_*() interface functions
cgroup_rstat is being generalized so that controllers can use it too.
This patch factors out and exposes the following interface functions.

* cgroup_rstat_updated(): Renamed from cgroup_rstat_cpu_updated() for
  consistency.

* cgroup_rstat_flush_hold/release(): Factored out from base stat
  implementation.

* cgroup_rstat_flush(): Verbatim expose.

While at it, drop assert on cgroup_rstat_mutex in
cgroup_base_stat_flush() as it crosses layers and make a minor comment
update.

v2: Added EXPORT_SYMBOL_GPL(cgroup_rstat_updated) to fix a build bug.

Signed-off-by: Tejun Heo <tj@kernel.org>
2018-04-26 14:29:05 -07:00
Tejun Heo
a17556f8d9 cgroup: Reorganize kernel/cgroup/rstat.c
Currently, rstat.c has rstat and base stat implementations intermixed.
Collect base stat implementation at the end of the file.  Also,
reorder the prototypes.

This patch doesn't make any functional changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
2018-04-26 14:29:05 -07:00
Tejun Heo
d4ff749b5e cgroup: Distinguish base resource stat implementation from rstat
Base resource stat accounts universial (not specific to any
controller) resource consumptions on top of rstat.  Currently, its
implementation is intermixed with rstat implementation making the code
confusing to follow.

This patch clarifies the distintion by doing the followings.

* Encapsulate base resource stat counters, currently only cputime, in
  struct cgroup_base_stat.

* Move prev_cputime into struct cgroup and initialize it with cgroup.

* Rename the related functions so that they start with cgroup_base_stat.

* Prefix the related variables and field names with b.

This patch doesn't make any functional changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
2018-04-26 14:29:04 -07:00
Tejun Heo
c58632b363 cgroup: Rename stat to rstat
stat is too generic a name and ends up causing subtle confusions.
It'll be made generic so that controllers can plug into it, which will
make the problem worse.  Let's rename it to something more specific -
cgroup_rstat for cgroup recursive stat.

This patch does the following renames.  No other changes.

* cpu_stat	-> rstat_cpu
* stat		-> rstat
* ?cstat	-> ?rstatc

Note that the renames are selective.  The unrenamed are the ones which
implement basic resource statistics on top of rstat.  This will be
further cleaned up in the following patches.

Signed-off-by: Tejun Heo <tj@kernel.org>
2018-04-26 14:29:04 -07:00
Tejun Heo
a5c2b93f79 cgroup: Rename kernel/cgroup/stat.c to kernel/cgroup/rstat.c
stat is too generic a name and ends up causing subtle confusions.
It'll be made generic so that controllers can plug into it, which will
make the problem worse.  Let's rename it to something more specific -
cgroup_rstat for cgroup recursive stat.

First, rename kernel/cgroup/stat.c to kernel/cgroup/rstat.c.  No
content changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
2018-04-26 14:29:04 -07:00
Tejun Heo
b12e358328 cgroup: Limit event generation frequency
".events" files generate file modified event to notify userland of
possible new events.  Some of the events can be quite bursty
(e.g. memory high event) and generating notification each time is
costly and pointless.

This patch implements a event rate limit mechanism.  If a new
notification is requested before 10ms has passed since the previous
notification, the new notification is delayed till then.

As this only delays from the second notification on in a given close
cluster of notifications, userland reactions to notifications
shouldn't be delayed at all in most cases while avoiding notification
storms.

Signed-off-by: Tejun Heo <tj@kernel.org>
2018-04-26 14:29:04 -07:00
Tejun Heo
5faaf05f29 cgroup: Explicitly remove core interface files
The "cgroup." core interface files bypass the usual interface removal
path and get removed recursively along with the cgroup itself.  While
this works now, the subtle discrepancy gets in the way of implementing
common mechanisms.

This patch updates cgroup core interface file handling so that it's
consistent with controller interface files.  When added, the css is
marked CSS_VISIBLE and they're explicitly removed before the cgroup is
destroyed.

This doesn't cause user-visible behavior changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
2018-04-26 14:29:04 -07:00
Linus Torvalds
d92cd810e6 Merge branch 'for-4.17' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq
Pull workqueue updates from Tejun Heo:
 "rcu_work addition and a couple trivial changes"

* 'for-4.17' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
  workqueue: remove the comment about the old manager_arb mutex
  workqueue: fix the comments of nr_idle
  fs/aio: Use rcu_work instead of explicit rcu and work item
  cgroup: Use rcu_work instead of explicit rcu and work item
  RCU, workqueue: Implement rcu_work
2018-04-03 18:00:13 -07:00
Tejun Heo
8f36aaec9c cgroup: Use rcu_work instead of explicit rcu and work item
Workqueue now has rcu_work.  Use it instead of open-coding rcu -> work
item bouncing.

Signed-off-by: Tejun Heo <tj@kernel.org>
2018-03-19 10:12:03 -07:00
Tejun Heo
d1897c9538 cgroup: fix rule checking for threaded mode switching
A domain cgroup isn't allowed to be turned threaded if its subtree is
populated or domain controllers are enabled.  cgroup_enable_threaded()
depended on cgroup_can_be_thread_root() test to enforce this rule.  A
parent which has populated domain descendants or have domain
controllers enabled can't become a thread root, so the above rules are
enforced automatically.

However, for the root cgroup which can host mixed domain and threaded
children, cgroup_can_be_thread_root() doesn't check any of those
conditions and thus first level cgroups ends up escaping those rules.

This patch fixes the bug by adding explicit checks for those rules in
cgroup_enable_threaded().

Reported-by: Michael Kerrisk (man-pages) <mtk.manpages@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Fixes: 8cfd8147df ("cgroup: implement cgroup v2 thread support")
Cc: stable@vger.kernel.org # v4.14+
2018-02-21 11:39:22 -08:00
Yaowei Bai
77ef80c65a kernel/cpuset: current_cpuset_is_being_rebound can be boolean
Make current_cpuset_is_being_rebound return bool due to this particular
function only using either one or zero as its return value.

No functional change.

Link: http://lkml.kernel.org/r/1513266622-15860-4-git-send-email-baiyaowei@cmss.chinamobile.com
Signed-off-by: Yaowei Bai <baiyaowei@cmss.chinamobile.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-02-06 18:32:47 -08:00
Tejun Heo
08a77676f9 string: drop __must_check from strscpy() and restore strscpy() usages in cgroup
e7fd37ba12 ("cgroup: avoid copying strings longer than the buffers")
converted possibly unsafe strncpy() usages in cgroup to strscpy().
However, although the callsites are completely fine with truncated
copied, because strscpy() is marked __must_check, it led to the
following warnings.

  kernel/cgroup/cgroup.c: In function ‘cgroup_file_name’:
  kernel/cgroup/cgroup.c:1400:10: warning: ignoring return value of ‘strscpy’, declared with attribute warn_unused_result [-Wunused-result]
     strscpy(buf, cft->name, CGROUP_FILE_NAME_MAX);
	       ^

To avoid the warnings, 50034ed496 ("cgroup: use strlcpy() instead of
strscpy() to avoid spurious warning") switched them to strlcpy().

strlcpy() is worse than strlcpy() because it unconditionally runs
strlen() on the source string, and the only reason we switched to
strlcpy() here was because it was lacking __must_check, which doesn't
reflect any material differences between the two function.  It's just
that someone added __must_check to strscpy() and not to strlcpy().

These basic string copy operations are used in variety of ways, and
one of not-so-uncommon use cases is safely handling truncated copies,
where the caller naturally doesn't care about the return value.  The
__must_check doesn't match the actual use cases and forces users to
opt for inferior variants which lack __must_check by happenstance or
spread ugly (void) casts.

Remove __must_check from strscpy() and restore strscpy() usages in
cgroup.

Signed-off-by: Tejun Heo <tj@kernel.org>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Ma Shimiao <mashimiao.fnst@cn.fujitsu.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Chris Metcalf <cmetcalf@ezchip.com>
2018-01-19 08:51:36 -08:00
Roman Gushchin
4f58424da3 cgroup: make cgroup.threads delegatable
Make cgroup.threads file delegatable.
The behavior of cgroup.threads should follow the behavior of cgroup.procs.

Signed-off-by: Roman Gushchin <guro@fb.com>
Discovered-by: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
2018-01-10 09:42:32 -08:00
Tejun Heo
74d0833c65 cgroup: fix css_task_iter crash on CSS_TASK_ITER_PROC
While teaching css_task_iter to handle skipping over tasks which
aren't group leaders, bc2fb7ed08 ("cgroup: add @flags to
css_task_iter_start() and implement CSS_TASK_ITER_PROCS") introduced a
silly bug.

CSS_TASK_ITER_PROCS is implemented by repeating
css_task_iter_advance() while the advanced cursor is pointing to a
non-leader thread.  However, the cursor variable, @l, wasn't updated
when the iteration has to advance to the next css_set and the
following repetition would operate on the terminal @l from the
previous iteration which isn't pointing to a valid task leading to
oopses like the following or infinite looping.

  BUG: unable to handle kernel NULL pointer dereference at 0000000000000254
  IP: __task_pid_nr_ns+0xc7/0xf0
  PGD 0 P4D 0
  Oops: 0000 [#1] SMP
  ...
  CPU: 2 PID: 1 Comm: systemd Not tainted 4.14.4-200.fc26.x86_64 #1
  Hardware name: System manufacturer System Product Name/PRIME B350M-A, BIOS 3203 11/09/2017
  task: ffff88c4baee8000 task.stack: ffff96d5c3158000
  RIP: 0010:__task_pid_nr_ns+0xc7/0xf0
  RSP: 0018:ffff96d5c315bd50 EFLAGS: 00010206
  RAX: 0000000000000000 RBX: ffff88c4b68c6000 RCX: 0000000000000250
  RDX: ffffffffa5e47960 RSI: 0000000000000000 RDI: ffff88c490f6ab00
  RBP: ffff96d5c315bd50 R08: 0000000000001000 R09: 0000000000000005
  R10: ffff88c4be006b80 R11: ffff88c42f1b8004 R12: ffff96d5c315bf18
  R13: ffff88c42d7dd200 R14: ffff88c490f6a510 R15: ffff88c4b68c6000
  FS:  00007f9446f8ea00(0000) GS:ffff88c4be680000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 0000000000000254 CR3: 00000007f956f000 CR4: 00000000003406e0
  Call Trace:
   cgroup_procs_show+0x19/0x30
   cgroup_seqfile_show+0x4c/0xb0
   kernfs_seq_show+0x21/0x30
   seq_read+0x2ec/0x3f0
   kernfs_fop_read+0x134/0x180
   __vfs_read+0x37/0x160
   ? security_file_permission+0x9b/0xc0
   vfs_read+0x8e/0x130
   SyS_read+0x55/0xc0
   entry_SYSCALL_64_fastpath+0x1a/0xa5
  RIP: 0033:0x7f94455f942d
  RSP: 002b:00007ffe81ba2d00 EFLAGS: 00000293 ORIG_RAX: 0000000000000000
  RAX: ffffffffffffffda RBX: 00005574e2233f00 RCX: 00007f94455f942d
  RDX: 0000000000001000 RSI: 00005574e2321a90 RDI: 000000000000002b
  RBP: 0000000000000000 R08: 00005574e2321a90 R09: 00005574e231de60
  R10: 00007f94458c8b38 R11: 0000000000000293 R12: 00007f94458c8ae0
  R13: 00007ffe81ba3800 R14: 0000000000000000 R15: 00005574e2116560
  Code: 04 74 0e 89 f6 48 8d 04 76 48 8d 04 c5 f0 05 00 00 48 8b bf b8 05 00 00 48 01 c7 31 c0 48 8b 0f 48 85 c9 74 18 8b b2 30 08 00 00 <3b> 71 04 77 0d 48 c1 e6 05 48 01 f1 48 3b 51 38 74 09 5d c3 8b
  RIP: __task_pid_nr_ns+0xc7/0xf0 RSP: ffff96d5c315bd50

Fix it by moving the initialization of the cursor below the repeat
label.  While at it, rename it to @next for readability.

Signed-off-by: Tejun Heo <tj@kernel.org>
Fixes: bc2fb7ed08 ("cgroup: add @flags to css_task_iter_start() and implement CSS_TASK_ITER_PROCS")
Cc: stable@vger.kernel.org # v4.14+
Reported-by: Laura Abbott <labbott@redhat.com>
Reported-by: Bronek Kozicki <brok@incorrekt.com>
Reported-by: George Amanakis <gamanakis@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2017-12-20 07:09:19 -08:00
Prateek Sood
116d2f7496 cgroup: Fix deadlock in cpu hotplug path
Deadlock during cgroup migration from cpu hotplug path when a task T is
being moved from source to destination cgroup.

kworker/0:0
cpuset_hotplug_workfn()
   cpuset_hotplug_update_tasks()
      hotplug_update_tasks_legacy()
        remove_tasks_in_empty_cpuset()
          cgroup_transfer_tasks() // stuck in iterator loop
            cgroup_migrate()
              cgroup_migrate_add_task()

In cgroup_migrate_add_task() it checks for PF_EXITING flag of task T.
Task T will not migrate to destination cgroup. css_task_iter_start()
will keep pointing to task T in loop waiting for task T cg_list node
to be removed.

Task T
do_exit()
  exit_signals() // sets PF_EXITING
  exit_task_namespaces()
    switch_task_namespaces()
      free_nsproxy()
        put_mnt_ns()
          drop_collected_mounts()
            namespace_unlock()
              synchronize_rcu()
                _synchronize_rcu_expedited()
                  schedule_work() // on cpu0 low priority worker pool
                  wait_event() // waiting for work item to execute

Task T inserted a work item in the worklist of cpu0 low priority
worker pool. It is waiting for expedited grace period work item
to execute. This work item will only be executed once kworker/0:0
complete execution of cpuset_hotplug_workfn().

kworker/0:0 ==> Task T ==>kworker/0:0

In case of PF_EXITING task being migrated from source to destination
cgroup, migrate next available task in source cgroup.

Signed-off-by: Prateek Sood <prsood@codeaurora.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
2017-12-19 05:38:47 -08:00
Arnd Bergmann
50034ed496 cgroup: use strlcpy() instead of strscpy() to avoid spurious warning
As long as cft->name is guaranteed to be NUL-terminated, using strlcpy() would
work just as well and avoid that warning, so the change below could be folded
into that commit.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Tejun Heo <tj@kernel.org>
2017-12-15 05:09:47 -08:00
Ma Shimiao
e7fd37ba12 cgroup: avoid copying strings longer than the buffers
cgroup root name and file name have max length limit, we should
avoid copying longer name than that to the name.

tj: minor update to $SUBJ.

Signed-off-by: Ma Shimiao <mashimiao.fnst@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2017-12-12 07:53:29 -08:00
Tejun Heo
bdfbbda90a Revert "cgroup/cpuset: remove circular dependency deadlock"
This reverts commit aa24163b2e.

This and the following commit led to another circular locking scenario
and the scenario which is fixed by this commit no longer exists after
e8b3f8db7a ("workqueue/hotplug: simplify workqueue_offline_cpu()")
which removes work item flushing from hotplug path.

Revert it for now.

Signed-off-by: Tejun Heo <tj@kernel.org>
2017-12-04 14:55:59 -08:00
Tejun Heo
11db855c3d Revert "cpuset: Make cpuset hotplug synchronous"
This reverts commit 1599a185f0.

This and the previous commit led to another circular locking scenario
and the scenario which is fixed by this commit no longer exists after
e8b3f8db7a ("workqueue/hotplug: simplify workqueue_offline_cpu()")
which removes work item flushing from hotplug path.

Revert it for now.

Signed-off-by: Tejun Heo <tj@kernel.org>
2017-12-04 14:41:11 -08:00
Lucas Stach
52cf373c37 cgroup: properly init u64_stats
Lockdep complains that the stats update is trying to register a non-static
key. This is because u64_stats are using a seqlock on 32bit arches, which
needs to be initialized before usage.

Fixes: 041cd640b2 (cgroup: Implement cgroup2 basic CPU usage accounting)
Signed-off-by: Lucas Stach <l.stach@pengutronix.de>
Signed-off-by: Tejun Heo <tj@kernel.org>
2017-11-28 07:16:08 -08:00
Wang Long
ddf7005f32 debug cgroup: use task_css_set instead of rcu_dereference
This macro `task_css_set` verifies that the caller is
inside proper critical section if the kernel set CONFIG_PROVE_RCU=y.

Signed-off-by: Wang Long <wanglong19@meituan.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2017-11-27 11:37:33 -08:00
Prateek Sood
1599a185f0 cpuset: Make cpuset hotplug synchronous
Convert cpuset_hotplug_workfn() into synchronous call for cpu hotplug
path. For memory hotplug path it still gets queued as a work item.

Since cpuset_hotplug_workfn() can be made synchronous for cpu hotplug
path, it is not required to wait for cpuset hotplug while thawing
processes.

Signed-off-by: Prateek Sood <prsood@codeaurora.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
2017-11-27 08:48:10 -08:00
Prateek Sood
aa24163b2e cgroup/cpuset: remove circular dependency deadlock
Remove circular dependency deadlock in a scenario where hotplug of CPU is
being done while there is updation in cgroup and cpuset triggered from
userspace.

Process A => kthreadd => Process B => Process C => Process A

Process A
cpu_subsys_offline();
  cpu_down();
    _cpu_down();
      percpu_down_write(&cpu_hotplug_lock); //held
      cpuhp_invoke_callback();
	     workqueue_offline_cpu();
            queue_work_on(); // unbind_work on system_highpri_wq
               __queue_work();
                 insert_work();
                    wake_up_worker();
            flush_work();
               wait_for_completion();

worker_thread();
   manage_workers();
      create_worker();
	     kthread_create_on_node();
		    wake_up_process(kthreadd_task);

kthreadd
kthreadd();
  kernel_thread();
    do_fork();
      copy_process();
        percpu_down_read(&cgroup_threadgroup_rwsem);
          __rwsem_down_read_failed_common(); //waiting

Process B
kernfs_fop_write();
  cgroup_file_write();
    cgroup_procs_write();
      percpu_down_write(&cgroup_threadgroup_rwsem); //held
      cgroup_attach_task();
        cgroup_migrate();
          cgroup_migrate_execute();
            cpuset_can_attach();
              mutex_lock(&cpuset_mutex); //waiting

Process C
kernfs_fop_write();
  cgroup_file_write();
    cpuset_write_resmask();
      mutex_lock(&cpuset_mutex); //held
      update_cpumask();
        update_cpumasks_hier();
          rebuild_sched_domains_locked();
            get_online_cpus();
              percpu_down_read(&cpu_hotplug_lock); //waiting

Eliminating deadlock by reversing the locking order for cpuset_mutex and
cpu_hotplug_lock.

Signed-off-by: Prateek Sood <prsood@codeaurora.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
2017-11-27 08:48:10 -08:00
Linus Torvalds
22714a2ba4 Merge branch 'for-4.15' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup updates from Tejun Heo:
 "Cgroup2 cpu controller support is finally merged.

   - Basic cpu statistics support to allow monitoring by default without
     the CPU controller enabled.

   - cgroup2 cpu controller support.

   - /sys/kernel/cgroup files to help dealing with new / optional
     features"

* 'for-4.15' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
  cgroup: export list of cgroups v2 features using sysfs
  cgroup: export list of delegatable control files using sysfs
  cgroup: mark @cgrp __maybe_unused in cpu_stat_show()
  MAINTAINERS: relocate cpuset.c
  cgroup, sched: Move basic cpu stats from cgroup.stat to cpu.stat
  sched: Implement interface for cgroup unified hierarchy
  sched: Misc preps for cgroup unified hierarchy interface
  sched/cputime: Add dummy cputime_adjust() implementation for CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
  cgroup: statically initialize init_css_set->dfl_cgrp
  cgroup: Implement cgroup2 basic CPU usage accounting
  cpuacct: Introduce cgroup_account_cputime[_field]()
  sched/cputime: Expose cputime_adjust()
2017-11-15 14:29:44 -08:00
Linus Torvalds
5bbcc0f595 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next
Pull networking updates from David Miller:
 "Highlights:

   1) Maintain the TCP retransmit queue using an rbtree, with 1GB
      windows at 100Gb this really has become necessary. From Eric
      Dumazet.

   2) Multi-program support for cgroup+bpf, from Alexei Starovoitov.

   3) Perform broadcast flooding in hardware in mv88e6xxx, from Andrew
      Lunn.

   4) Add meter action support to openvswitch, from Andy Zhou.

   5) Add a data meta pointer for BPF accessible packets, from Daniel
      Borkmann.

   6) Namespace-ify almost all TCP sysctl knobs, from Eric Dumazet.

   7) Turn on Broadcom Tags in b53 driver, from Florian Fainelli.

   8) More work to move the RTNL mutex down, from Florian Westphal.

   9) Add 'bpftool' utility, to help with bpf program introspection.
      From Jakub Kicinski.

  10) Add new 'cpumap' type for XDP_REDIRECT action, from Jesper
      Dangaard Brouer.

  11) Support 'blocks' of transformations in the packet scheduler which
      can span multiple network devices, from Jiri Pirko.

  12) TC flower offload support in cxgb4, from Kumar Sanghvi.

  13) Priority based stream scheduler for SCTP, from Marcelo Ricardo
      Leitner.

  14) Thunderbolt networking driver, from Amir Levy and Mika Westerberg.

  15) Add RED qdisc offloadability, and use it in mlxsw driver. From
      Nogah Frankel.

  16) eBPF based device controller for cgroup v2, from Roman Gushchin.

  17) Add some fundamental tracepoints for TCP, from Song Liu.

  18) Remove garbage collection from ipv6 route layer, this is a
      significant accomplishment. From Wei Wang.

  19) Add multicast route offload support to mlxsw, from Yotam Gigi"

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (2177 commits)
  tcp: highest_sack fix
  geneve: fix fill_info when link down
  bpf: fix lockdep splat
  net: cdc_ncm: GetNtbFormat endian fix
  openvswitch: meter: fix NULL pointer dereference in ovs_meter_cmd_reply_start
  netem: remove unnecessary 64 bit modulus
  netem: use 64 bit divide by rate
  tcp: Namespace-ify sysctl_tcp_default_congestion_control
  net: Protect iterations over net::fib_notifier_ops in fib_seq_sum()
  ipv6: set all.accept_dad to 0 by default
  uapi: fix linux/tls.h userspace compilation error
  usbnet: ipheth: prevent TX queue timeouts when device not ready
  vhost_net: conditionally enable tx polling
  uapi: fix linux/rxrpc.h userspace compilation errors
  net: stmmac: fix LPI transitioning for dwmac4
  atm: horizon: Fix irq release error
  net-sysfs: trigger netlink notification on ifalias change via sysfs
  openvswitch: Using kfree_rcu() to simplify the code
  openvswitch: Make local function ovs_nsh_key_attr_size() static
  openvswitch: Fix return value check in ovs_meter_cmd_features()
  ...
2017-11-15 11:56:19 -08:00
Ingo Molnar
8a103df440 Merge branch 'linus' into sched/core, to pick up fixes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-11-08 10:17:15 +01:00
Roman Gushchin
5f2e673405 cgroup: export list of cgroups v2 features using sysfs
The active development of cgroups v2 sometimes leads to a creation
of interfaces, which are not turned on by default (to provide
backward compatibility). It's handy to know from userspace, which
cgroup v2 features are supported without calculating it based
on the kernel version. So, let's export the list of such features
using /sys/kernel/cgroup/features pseudo-file.

The list is hardcoded and has to be extended when new functionality
is added. Each feature is printed on a new line.

Example:
  $ cat /sys/kernel/cgroup/features
  nsdelegate

Signed-off-by: Roman Gushchin <guro@fb.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: kernel-team@fb.com
Signed-off-by: Tejun Heo <tj@kernel.org>
2017-11-06 12:01:57 -08:00
Roman Gushchin
01ee6cfb14 cgroup: export list of delegatable control files using sysfs
Delegatable cgroup v2 control files may require special handling
(e.g. chowning), and the exact list of such files varies between
kernel versions (and likely to be extended in the future).

To guarantee correctness of this list and simplify the life
of userspace (systemd, first of all), let's export the list
via /sys/kernel/cgroup/delegate pseudo-file.

Format is siple: each control file name is printed on a new line.
Example:
  $ cat /sys/kernel/cgroup/delegate
  cgroup.procs
  cgroup.subtree_control

Signed-off-by: Roman Gushchin <guro@fb.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: kernel-team@fb.com
Signed-off-by: Tejun Heo <tj@kernel.org>
2017-11-06 12:01:54 -08:00
David S. Miller
2a171788ba Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Files removed in 'net-next' had their license header updated
in 'net'.  We take the remove from 'net-next'.

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-04 09:26:51 +09:00