linux

mirror of https://github.com/torvalds/linux.git synced 2026-06-11 08:03:05 +02:00

Author	SHA1	Message	Date
Greg Kroah-Hartman	b1a6760ddf	Merge branch 'android12-5.10' into `android12-5.10-lts` Sync up with android12-5.10 for the following commits: `dd139186ef` ANDROID: usb: gadget: fix NULL pointer dereference in android_setup `07f65598af` ANDROID: GKI: Disable kmem cgroup accounting `309aa7e7a2` FROMLIST: mm, memcg: inline swap-related functions to improve disabled memcg config `3ae8e2f183` BACKPORT: FROMLIST: mm, memcg: inline mem_cgroup_{charge/uncharge} to improve disabled memcg config `f73d029485` FROMLIST: mm, memcg: add mem_cgroup_disabled checks in vmpressure and swap-related functions `669df367a9` UPSTREAM: mm/memcg: bail early from swap accounting if memcg disabled `1f0c32a667` UPSTREAM: procfs/dmabuf: add inode number to /proc/*/fdinfo `0c8c125f57` UPSTREAM: procfs: allow reading fdinfo with PTRACE_MODE_READ `2e0476a465` Revert "FROMLIST: procfs: Allow reading fdinfo with PTRACE_MODE_READ" `5ded961aa2` Revert "FROMLIST: BACKPORT: procfs/dmabuf: Add inode number to /..." `3ee5565017` UPSTREAM: f2fs: initialize page->private when using for our internal use `dba79c3af3` ANDROID: mm: page_pinner: report test_page_isolation_failure `13362ab28e` ANDROID: mm: page_pinner: add state of page_pinner `3254948484` ANDROID: mm: page_pinner: add more struct page fields `0445b67bee` ANDROID: mm: page_pinner: change timestamp format `71da06728c` ANDROID: mm: page_pinner: print_page_pinner refactoring `b83e564914` ANDROID: mm: page_pinner: remove shared_count `849f048050` ANDROID: mm: page_pinner: remove WARN_ON_ONCE `9a453100fc` ANDROID: mm: page_pinner: fix typos `d012783a86` ANDROID: mm: page_pinner: reset migration failed page `470cce5085` ANDROID: mm: page_pinner: record every put_page `9f47e5fdda` ANDROID: mm: page_pinner: change function names `a8385d61f2` ANDROID: Allow vendor module to reclaim a memcg `f41a95eadc` ANDROID: Export memcg functions to allow module to add new files `46bf3b94e7` FROMGIT: dt-bindings: usb: dwc3: Update dwc3 TX fifo properties `b36b813e39` UPSTREAM: dt-bindings: usb: Convert DWC USB3 bindings to DT schema `9a80b7b728` FROMGIT: of: Add stub for of_add_property() `2742be5903` ANDROID: fips140: define fips_enabled to 1 to enable FIPS behavior `e886dd4c33` ANDROID: fips140: unregister existing DRBG algorithms `634445a640` ANDROID: fips140: fix deadlock in unregister_existing_fips140_algos() `0af06624ea` ANDROID: fips140: check for errors from initcalls `92de53472e` ANDROID: fips140: log already-live algorithms `0a7da21583` ANDROID: Update new mtk gki symbol `98085b5dd8` ANDROID: usb: Add vendor hook for usb suspend and resume `956db89e71` BACKPORT: FROMLIST: dma-heap: Let dma heap use dma_map_attrs to map & unmap iova `749d6e7f2c` ANDROID: abi_gki_aarch64_qcom: Add vendor hook for shmem_alloc_page `b05bbe48be` ANDROID: abi_gki_aarch64_qcom: Add reclaim_shmem_address_space `d80c70d7a8` ANDROID: android: export kernel function arch_mmap_rnd `25c7eb4932` ANDROID: mm: shmem: Fix build break with allnoconfig `1cdcf76b15` ANDROID: vendor_hooks: add hooks in mem_cgroup subsystem `726468dd4a` ANDROID: GKI: add vendor padding variable in struct skb_shared_info `fc79c93657` FROMLIST: scsi: ufs: add quirk to enable host controller without interface configuration `2d5ae6b787` FROMLIST: scsi: ufs: add quirk to handle broken UIC command `38abaebab7` ANDROID: syscall_check: add vendor hook for bpf syscall `a7a3b31d58` ANDROID: syscall_check: add vendor hook for open syscall `a5543c9cd7` ANDROID: syscall_check: add vendor hook for mmap syscall `1f0769279f` ANDROID: GKI: Add symbol to symbol list `2cff74e08c` ANDROID: vendor_hooks: Add vendor hook to the net `25edba0d4d` FROMLIST: scsi: ufs: Fix the SCSI abort handler `c0efdc4a5e` ANDROID: android: export kernel function vm_unmapped_area `964220d080` ANDROID: shmem: vendor hook in shmem_alloc_page `bd2ca0ba5b` FROMLIST: pstore/ram: Rework logic for detecting ramoops reserved memory region `daeabfe7fa` ANDROID: mm: add reclaim_shmem_address_space() for faster reclaims `4c3dddf408` ANDROID: Update the generic ABI symbol list `4c4d8cbdef` ANDROID: GKI: refresh ABI XML `01e4a037d8` ANDROID: GKI: turn on TIDY_ABI `edf973fd24` ANDROID: Update symbol list for VIVO `1702d2c8b7` FROMGIT: net: cdc_ncm: switch to eth%d interface naming `f4d6e8324c` ANDROID: GKI: add allowed GKI symbol for Exynosauto SoC `444a0b7752` ANDROID: mm: add vendor hook for vmpressure `c799c6644b` ANDROID: fips140: adjust some log messages `091338cb39` ANDROID: fips140: add missing static keyword to fips140_init() `70bfd6a7e0` ANDROID: GKI: update allowed list for exynosauto SoC `3e3147b280` UPSTREAM: scsi: ufs: ufshcd: Fix some function doc-rot `2c553e754f` UPSTREAM: scsi: ufs: Adjust ufshcd_hold() during sending attribute requests `52ccdf90b9` FROMLIST: lockdep: Remove console_verbose when disable lock debugging `4458494476` ANDROID: ABI: qcom: Add symbols for 80211 `5c51579fde` ANDROID: fork: Export task_newtask tracepoint `e2a90797e8` ANDROID: Fix kernelci warnings for indentation in smp.c `bac33eaebf` ANDROID: irqchip: gic-v3: Move struct gic_chip_data to header `bdac4418bf` ANDROID: abi_gki_aarch64_qcom: Add android_vh_ufs_clock_scaling `65c1de0f06` ANDROID: Update symbol list for mtk `d4d02ab9b0` UPSTREAM: swiotlb: manipulate orig_addr when tlb_addr has offset `58aa0f2832` ANDROID: qcom: Add net related symbol `2f9f816445` ANDROID: Update the exynos symbol list `b2a9471239` ANDROID: Update symbol list for mtk `7c9599e204` FROMGIT: usb: dwc3: Create helper function getting MDWIDTH `0a24affb86` ANDROID: vendor_hooks: modify the function name `d686d5ffc6` ANDROID: GKI: Add some symbols to symbol list `bdfb11230b` ANDROID: cpuidle: Allow for an early exit from cpuidle_enter_state() `f0b280c395` ANDROID: cpuidle: Update cpuidle_uninstall_idle_handler() to wakeup all online CPUs `14dd90ab37` ANDROID: scsi: ufs: Add hook to influence the UFS clock scaling policy `00aec39e2e` FROMGIT: bpf: Support all gso types in bpf_skb_change_proto() Signed-off-by: Greg Kroah-Hartman <gregkh@google.com> Change-Id: I6e5be89f3f02c420237a549f4c6a08b5ed434581	2021-07-13 15:00:50 +02:00
Liujie Xie	f41a95eadc	ANDROID: Export memcg functions to allow module to add new files Export cgroup_add_legacy_cftypes and a helper function to allow vendor module to expose additional files in the memory cgroup hierarchy. Bug: 192052083 Signed-off-by: Liujie Xie <xieliujie@oppo.com> Change-Id: Ie2b936b3e77c7ab6d740d1bb6d70e03c70a326a7	2021-07-12 18:53:29 +00:00
Greg Kroah-Hartman	bc9699030e	Merge branch 'android12-5.10' into `android12-5.10-lts` Sync up with android12-5.10 for the following commits: `de1a0ea811` ANDROID: logbuf: Remove if directive for vendor hooks `3b6916b4d4` ANDROID: iommu/io-pgtable-arm: Add IOMMU_CACHE_ICACHE_OCACHE_NWA `2e289f3641` FROMGIT: mac80211_hwsim: add concurrent channels scanning support over virtio `5b1baee639` ANDROID: GKI: update allowed symbols for exynosauto soc `961be31178` ANDROID: GKI: initial upload list for exynosauto soc `01f2392e13` ANDROID: logbuf: Add new logbuf vendor hook to support pr_cont() `3cd04ea95a` ANDROID: lib: Export show_mem() for vendor module usage `ba085dd70a` FROMLIST: remoteproc: core: Export the rproc coredump APIs `1093a9bfdb` ANDROID: sched: select fallback rq must check for allowed cpus `8943a2e7a3` ANDROID: android: Export symbols for invoking cpufreq_update_util() `e7cf28a1a4` ANDROID: Update symbol list for mtk `c685777105` ANDROID: kernel: add module info for debug_kinfo `ea53c24cb0` FROMGIT: bpf: Do not change gso_size during bpf_skb_change_proto() `b007e43692` FROMGIT: mm: slub: fix the leak of alloc/free traces debugfs interface `850f11aa85` Revert "KMI: BACKPORT: FROMGIT: scsi: ufs: Optimize host lock on transfer requests send/compl paths" `46575badbb` Revert "BACKPORT: FROMGIT: scsi: ufs: Optimize host lock on transfer requests send/compl paths" `83d653257a` Revert "FROMGIT: scsi: ufs: Utilize Transfer Request List Completion Notification Register" `9f8f2ea03e` ANDROID: power: wakeup_reason: change abort log `d5a092726b` ANDROID: GKI: Update abi_gki_aarch64_qcom list for rwsem list add `eabe9707f2` ANDROID: Add hook to show vendor info for transactions `12902c9996` ANDROID: vendor_hooks: Export direct reclaim trace points `cea24faf98` ANDROID: Update the ABI representation `c0e8aae5c5` ANDROID: qcom: Add xfrm and skb related symbols `8102df91f2` Merge remote-tracking branch 'aosp/upstream-f2fs-stable-linux-5.10.y' into android12-5.10 `b70055f85a` ANDROID: iommu: Revise vendor hook param for iova free tracking `a985701859` ANDROID: abi_gki_aarch64_qcom: Add additional symbols for 32bit execve `c7c351ab3f` ANDROID: sched: add restricted tracehooks for 32bit execve `f502bc761a` ANDROID: GKI: Update symbols to symbol list `4198bdb2cf` ANDROID: coresight: Update ETE DT yaml file `6f1bd7583f` ANDROID: coresight: Update ETE/TRBE to v6 merged upstream `ce71010347` ANDROID: kvm: arm64: Clarify the comment for SPE save context `ee7e80c81b` BACKPORT: arm64: KVM: Enable access to TRBE support for host `b6b0927eac` BACKPORT: KVM: arm64: Move SPE availability check to VCPU load `dde40c1089` UPSTREAM: KVM: arm64: Handle access to TRFCR_EL1 `ee682ec766` ANDROID: GKI: Enable ARCH_SPRD and SPRD_TIMER `39147716e8` UPSTREAM: x86, lto: Pass -stack-alignment only on LLD < 13.0.0 `7d3618b8b9` ANDROID: fix permission error of page_pinner `cfe1f5aea6` ANDROID: gki_config: disable per-cgroup pressure tracking `c7186c2c46` FROMGIT: cgroup: make per-cgroup pressure stall tracking configurable `0d054fc5d7` Revert "ANDROID: make per-cgroup PSI tracking configurable" `951bdfd077` FROMLIST: arm: Mark the recheduling IPI as raw interrupt `42f775df72` FROMLIST: arm64: Mark the recheduling IPI as raw interrupt `3f9d45d802` FROMLIST: genirq: Allow an interrupt to be marked as 'raw' `08327b9007` FROMLIST: genirq: Add __irq_modify_status() helper to clear/set special flags `77c9f446b6` ANDROID: GKI: Update abi_gki_aarch64_qcom list for shmem allocations `728626cb04` ANDROID: power: Add vendor hook to qos for GKI purpose. `9d55580966` ANDROID: selinux: modify RTM_GETNEIGH{TBL} `bb51a33182` Revert "f2fs: avoid attaching SB_ACTIVE flag during mount/remount" `c81ac64da1` f2fs: remove false alarm on iget failure during GC `cdeff03989` f2fs: enable extent cache for compression files in read-only `15a475975e` f2fs: fix to avoid adding tab before doc section `44e0be85eb` f2fs: introduce f2fs_casefolded_name slab cache `eaef955b91` f2fs: swap: support migrating swapfile in aligned write mode `34c703ff04` f2fs: swap: remove dead codes `ec3ea14d2f` f2fs: compress: add compress_inode to cache compressed blocks `d90505e519` f2fs: clean up /sys/fs/f2fs/<disk>/features `95f2afc02d` f2fs: add pin_file in feature list `d8755821c4` f2fs: Advertise encrypted casefolding in sysfs `7004c47db2` f2fs: Show casefolding support only when supported `832ee33262` f2fs: support RO feature `b5a393c8a8` f2fs: logging neatening Change-Id: I42f9108240ebf0d1e75b5d2ee24644cb995f8b3c Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>	2021-06-30 15:34:53 +02:00
Suren Baghdasaryan	c7186c2c46	FROMGIT: cgroup: make per-cgroup pressure stall tracking configurable PSI accounts stalls for each cgroup separately and aggregates it at each level of the hierarchy. This causes additional overhead with psi_avgs_work being called for each cgroup in the hierarchy. psi_avgs_work has been highly optimized, however on systems with large number of cgroups the overhead becomes noticeable. Systems which use PSI only at the system level could avoid this overhead if PSI can be configured to skip per-cgroup stall accounting. Add "cgroup_disable=pressure" kernel command-line option to allow requesting system-wide only pressure stall accounting. When set, it keeps system-wide accounting under /proc/pressure/ but skips accounting for individual cgroups and does not expose PSI nodes in cgroup hierarchy. Signed-off-by: Suren Baghdasaryan <surenb@google.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/patchwork/patch/1435705 (cherry picked from commit `3958e2d0c3` https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git tj) Bug: 178872719 Bug: 191734423 Signed-off-by: Suren Baghdasaryan <surenb@google.com> Change-Id: Ifc8fbc52f9a1131d7c2668edbb44c525c76c3360	2021-06-23 18:35:35 +00:00
Greg Kroah-Hartman	82658bfd88	This is the 5.10.44 stable release -----BEGIN PGP SIGNATURE----- iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAmDJzHwACgkQONu9yGCS aT6opRAAuTY0BewZFxfx+tMNplEo6Z/AnerfZN5UxjmIWhvE4NBoIhxgj7ZKHzKE 5xBP53Dunqa6MVrLv3VCyYstUMHO3qlDmYU6tGz/omstaQxqxB0y6b/8+Q3hkzkK SjpLeeMIzbpleZXt+zvi5LIwMb7WM2bVLmH4kPVEAW0+GmPWZRentaF7LtHNOOfU LBgAcRH2emxGZ3ucHJDI0xsrkdoWfe+sPZqkAiRI06wlK5EZEcc55uCy3OWzO47Z 579j4s0GX1RP6iuC8tVfoPFPPATRGAHJABX46w2xqs34Ni4PCA9FZ+cTbXgJyZoV ZzMnjSvmtM6SjdJDz5JucUS70zU2ailHuTnjZk41XUUYoY5reG1DWSVxPP5LVh3e 1eC9P5RTHjCKt/oHA+xvfJJzKL3VyBFLpkssJkh/LOjj1yCCMj2XK7u1q+swCd7O mJgkZS30c5bIVgV8tHjL2HPG8HnPRqR2+3vZ1yMCOSEDjQs3Qxhjlhohq33Jlfa3 wHOuuDvJyxgx+c/G4fgofWjU1eYpCltUthDiATl4w9+sACKm0FRm+ZVUlX/xr6WI aCBEuk9hFbXQF6Jfbmg3RrhiyF1BTZJ/MKzxaOmEk8HLLEtW249qz5Up651+caqy cfppiiV9M/QWB/soemK9uLnoBNjxEdvP00KI362ED99cNwF3eZc= =LmrZ -----END PGP SIGNATURE----- Merge 5.10.44 into android12-5.10-lts Changes in 5.10.44 proc: Track /proc/$pid/attr/ opener mm_struct ASoC: max98088: fix ni clock divider calculation ASoC: amd: fix for pcm_read() error spi: Fix spi device unregister flow spi: spi-zynq-qspi: Fix stack violation bug bpf: Forbid trampoline attach for functions with variable arguments net/nfc/rawsock.c: fix a permission check bug usb: cdns3: Fix runtime PM imbalance on error ASoC: Intel: bytcr_rt5640: Add quirk for the Glavey TM800A550L tablet ASoC: Intel: bytcr_rt5640: Add quirk for the Lenovo Miix 3-830 tablet vfio-ccw: Reset FSM state to IDLE inside FSM vfio-ccw: Serialize FSM IDLE state with I/O completion ASoC: sti-sas: add missing MODULE_DEVICE_TABLE spi: sprd: Add missing MODULE_DEVICE_TABLE usb: chipidea: udc: assign interrupt number to USB gadget structure isdn: mISDN: netjet: Fix crash in nj_probe: bonding: init notify_work earlier to avoid uninitialized use netlink: disable IRQs for netlink_lock_table() net: mdiobus: get rid of a BUG_ON() cgroup: disable controllers at parse time wq: handle VM suspension in stall detection net/qla3xxx: fix schedule while atomic in ql_sem_spinlock RDS tcp loopback connection can hang net:sfc: fix non-freed irq in legacy irq mode scsi: bnx2fc: Return failure if io_req is already in ABTS processing scsi: vmw_pvscsi: Set correct residual data length scsi: hisi_sas: Drop free_irq() of devm_request_irq() allocated irq scsi: target: qla2xxx: Wait for stop_phase1 at WWN removal net: macb: ensure the device is available before accessing GEMGXL control registers net: appletalk: cops: Fix data race in cops_probe1 net: dsa: microchip: enable phy errata workaround on 9567 nvme-fabrics: decode host pathing error for connect MIPS: Fix kernel hang under FUNCTION_GRAPH_TRACER and PREEMPT_TRACER dm verity: fix require_signatures module_param permissions bnx2x: Fix missing error code in bnx2x_iov_init_one() nvme-tcp: remove incorrect Kconfig dep in BLK_DEV_NVME nvmet: fix false keep-alive timeout when a controller is torn down powerpc/fsl: set fsl,i2c-erratum-a004447 flag for P2041 i2c controllers powerpc/fsl: set fsl,i2c-erratum-a004447 flag for P1010 i2c controllers spi: Don't have controller clean up spi device before driver unbind spi: Cleanup on failure of initial setup i2c: mpc: Make use of i2c_recover_bus() i2c: mpc: implement erratum A-004447 workaround ALSA: seq: Fix race of snd_seq_timer_open() ALSA: firewire-lib: fix the context to call snd_pcm_stop_xrun() ALSA: hda/realtek: headphone and mic don't work on an Acer laptop ALSA: hda/realtek: fix mute/micmute LEDs and speaker for HP Elite Dragonfly G2 ALSA: hda/realtek: fix mute/micmute LEDs and speaker for HP EliteBook x360 1040 G8 ALSA: hda/realtek: fix mute/micmute LEDs for HP EliteBook 840 Aero G8 ALSA: hda/realtek: fix mute/micmute LEDs for HP ZBook Power G8 spi: bcm2835: Fix out-of-bounds access with more than 4 slaves Revert "ACPI: sleep: Put the FACS table after using it" drm: Fix use-after-free read in drm_getunique() drm: Lock pointer access in drm_master_release() perf/x86/intel/uncore: Fix M2M event umask for Ice Lake server KVM: X86: MMU: Use the correct inherited permissions to get shadow page kvm: avoid speculation-based attacks from out-of-range memslot accesses staging: rtl8723bs: Fix uninitialized variables async_xor: check src_offs is not NULL before updating it btrfs: return value from btrfs_mark_extent_written() in case of error btrfs: promote debugging asserts to full-fledged checks in validate_super cgroup1: don't allow '\n' in renaming ftrace: Do not blindly read the ip address in ftrace_bug() mmc: renesas_sdhi: abort tuning when timeout detected mmc: renesas_sdhi: Fix HS400 on R-Car M3-W+ USB: f_ncm: ncm_bitrate (speed) is unsigned usb: f_ncm: only first packet of aggregate needs to start timer usb: pd: Set PD_T_SINK_WAIT_CAP to 310ms usb: dwc3-meson-g12a: fix usb2 PHY glue init when phy0 is disabled usb: dwc3: meson-g12a: Disable the regulator in the error handling path of the probe usb: dwc3: gadget: Bail from dwc3_gadget_exit() if dwc->gadget is NULL usb: dwc3: ep0: fix NULL pointer exception usb: musb: fix MUSB_QUIRK_B_DISCONNECT_99 handling usb: typec: wcove: Use LE to CPU conversion when accessing msg->header usb: typec: ucsi: Clear PPM capability data in ucsi_init() error path usb: typec: intel_pmc_mux: Put fwnode in error case during ->probe() usb: typec: intel_pmc_mux: Add missed error check for devm_ioremap_resource() usb: gadget: f_fs: Ensure io_completion_wq is idle during unbind USB: serial: ftdi_sio: add NovaTech OrionMX product ID USB: serial: omninet: add device id for Zyxel Omni 56K Plus USB: serial: quatech2: fix control-request directions USB: serial: cp210x: fix alternate function for CP2102N QFN20 usb: gadget: eem: fix wrong eem header operation usb: fix various gadgets null ptr deref on 10gbps cabling. usb: fix various gadget panics on 10gbps cabling usb: typec: tcpm: cancel vdm and state machine hrtimer when unregister tcpm port usb: typec: tcpm: cancel frs hrtimer when unregister tcpm port regulator: core: resolve supply for boot-on/always-on regulators regulator: max77620: Use device_set_of_node_from_dev() regulator: bd718x7: Fix the BUCK7 voltage setting on BD71837 regulator: fan53880: Fix missing n_voltages setting regulator: bd71828: Fix .n_voltages settings regulator: rtmv20: Fix .set_current_limit/.get_current_limit callbacks phy: usb: Fix misuse of IS_ENABLED usb: dwc3: gadget: Disable gadget IRQ during pullup disable usb: typec: mux: Fix copy-paste mistake in typec_mux_match drm/mcde: Fix off by 10^3 in calculation drm/msm/a6xx: fix incorrectly set uavflagprd_inv field for A650 drm/msm/a6xx: update/fix CP_PROTECT initialization drm/msm/a6xx: avoid shadow NULL reference in failure path RDMA/ipoib: Fix warning caused by destroying non-initial netns RDMA/mlx4: Do not map the core_clock page to user space unless enabled ARM: cpuidle: Avoid orphan section warning vmlinux.lds.h: Avoid orphan section with !SMP tools/bootconfig: Fix error return code in apply_xbc() phy: cadence: Sierra: Fix error return code in cdns_sierra_phy_probe() ASoC: core: Fix Null-point-dereference in fmt_single_name() ASoC: meson: gx-card: fix sound-dai dt schema phy: ti: Fix an error code in wiz_probe() gpio: wcd934x: Fix shift-out-of-bounds error perf: Fix data race between pin_count increment/decrement sched/fair: Keep load_avg and load_sum synced sched/fair: Make sure to update tg contrib for blocked load sched/fair: Fix util_est UTIL_AVG_UNCHANGED handling x86/nmi_watchdog: Fix old-style NMI watchdog regression on old Intel CPUs KVM: x86: Ensure liveliness of nested VM-Enter fail tracepoint message IB/mlx5: Fix initializing CQ fragments buffer NFS: Fix a potential NULL dereference in nfs_get_client() NFSv4: Fix deadlock between nfs4_evict_inode() and nfs4_opendata_get_inode() perf session: Correct buffer copying when peeking events kvm: fix previous commit for 32-bit builds NFS: Fix use-after-free in nfs4_init_client() NFSv4: Fix second deadlock in nfs4_evict_inode() NFSv4: nfs4_proc_set_acl needs to restore NFS_CAP_UIDGID_NOMAP on error. scsi: core: Fix error handling of scsi_host_alloc() scsi: core: Fix failure handling of scsi_add_host_with_dma() scsi: core: Put .shost_dev in failure path if host state changes to RUNNING scsi: core: Only put parent device if host state differs from SHOST_CREATED tracing: Correct the length check which causes memory corruption proc: only require mm_struct for writing Linux 5.10.44 Signed-off-by: Greg Kroah-Hartman <gregkh@google.com> Change-Id: Ic64172b4e72ccb54d96000b3065dd8b33aa9fef5	2021-06-16 13:14:03 +02:00
Shakeel Butt	5ca472d40e	cgroup: disable controllers at parse time [ Upstream commit `45e1ba4083` ] This patch effectively reverts the commit `a3e72739b7` ("cgroup: fix too early usage of static_branch_disable()"). The commit `6041186a32` ("init: initialize jump labels before command line option parsing") has moved the jump_label_init() before parse_args() which has made the commit `a3e72739b7` unnecessary. On the other hand there are consequences of disabling the controllers later as there are subsystems doing the controller checks for different decisions. One such incident is reported [1] regarding the memory controller and its impact on memory reclaim code. [1] https://lore.kernel.org/linux-mm/921e53f3-4b13-aab8-4a9e-e83ff15371e4@nec.com Signed-off-by: Shakeel Butt <shakeelb@google.com> Reported-by: NOMURA JUNICHI(野村　淳一) <junichi.nomura@nec.com> Signed-off-by: Tejun Heo <tj@kernel.org> Tested-by: Jun'ichi Nomura <junichi.nomura@nec.com> Signed-off-by: Sasha Levin <sashal@kernel.org>	2021-06-16 12:01:36 +02:00
Jing-Ting Wu	2bb462a3af	ANDROID: cgroup: add vendor hook to cgroup .attach() Add a vendor hook when a set of tasks change the cgroups in the controller for performance tuning. Bug: 187972571 Signed-off-by: Jing-Ting Wu <Jing-Ting.Wu@mediatek.com> Change-Id: If0ea3d089e077a4650507c69b6e60952d102fa1d	2021-05-18 15:41:48 +00:00
Pavankumar Kondeti	29203f8c8f	ANDROID: cgroup: Add android_rvh_cgroup_force_kthread_migration In Android GKI, CONFIG_FAIR_GROUP_SCHED is enabled [1] to help prioritize important work. Given that CPU shares of root cgroup can't be changed, leaving the tasks inside root cgroup will give them higher share compared to the other tasks inside important cgroups. This is mitigated by moving all tasks inside root cgroup to a different cgroup after Android is booted. However, there are many kernel tasks stuck in the root cgroup after the boot. It is possible to relax kernel threads and kworkers migrations under certain scenarios. However the patch [2] posted at upstream is not accepted. Hence add a restricted vendor hook to notify modules when a kernel thread is requested for cgroup migration. The modules can relax the restrictions forced by the kernel and allow the cgroup migration. [1] `f08f049de1` [2] https://lore.kernel.org/lkml/1617714261-18111-1-git-send-email-pkondeti@codeaurora.org Bug: 184594949 Change-Id: I445a170ba797c8bece3b4b59b7a42cdd85438f1f Signed-off-by: Pavankumar Kondeti <quic_pkondeti@quicinc.com>	2021-05-04 20:13:09 +00:00
Greg Kroah-Hartman	b129c98dc6	Merge 5.10.17 into android12-5.10 Changes in 5.10.17 objtool: Fix seg fault with Clang non-section symbols Revert "dts: phy: add GPIO number and active state used for phy reset" gpio: mxs: GPIO_MXS should not default to y unconditionally gpio: ep93xx: fix BUG_ON port F usage gpio: ep93xx: Fix single irqchip with multi gpiochips tracing: Do not count ftrace events in top level enable output tracing: Check length before giving out the filter buffer drm/i915: Fix overlay frontbuffer tracking arm/xen: Don't probe xenbus as part of an early initcall cgroup: fix psi monitor for root cgroup Revert "drm/amd/display: Update NV1x SR latency values" drm/i915/tgl+: Make sure TypeC FIA is powered up when initializing it drm/dp_mst: Don't report ports connected if nothing is attached to them dmaengine: move channel device_node deletion to driver tmpfs: disallow CONFIG_TMPFS_INODE64 on s390 tmpfs: disallow CONFIG_TMPFS_INODE64 on alpha soc: ti: omap-prm: Fix boot time errors for rst_map_012 bits 0 and 1 arm64: dts: rockchip: Fix PCIe DT properties on rk3399 arm64: dts: qcom: sdm845: Reserve LPASS clocks in gcc ARM: OMAP2+: Fix suspcious RCU usage splats for omap_enter_idle_coupled arm64: dts: rockchip: remove interrupt-names property from rk3399 vdec node platform/x86: hp-wmi: Disable tablet-mode reporting by default arm64: dts: rockchip: Disable display for NanoPi R2S ovl: perform vfs_getxattr() with mounter creds cap: fix conversions on getxattr ovl: skip getxattr of security labels scsi: lpfc: Fix EEH encountering oops with NVMe traffic x86/split_lock: Enable the split lock feature on another Alder Lake CPU nvme-pci: ignore the subsysem NQN on Phison E16 drm/amd/display: Fix DPCD translation for LTTPR AUX_RD_INTERVAL drm/amd/display: Add more Clock Sources to DCN2.1 drm/amd/display: Release DSC before acquiring drm/amd/display: Fix dc_sink kref count in emulated_link_detect drm/amd/display: Free atomic state after drm_atomic_commit drm/amd/display: Decrement refcount of dc_sink before reassignment riscv: virt_addr_valid must check the address belongs to linear mapping bfq-iosched: Revert "bfq: Fix computation of shallow depth" ARM: dts: lpc32xx: Revert set default clock rate of HCLK PLL kallsyms: fix nonconverging kallsyms table with lld ARM: ensure the signal page contains defined contents ARM: kexec: fix oops after TLB are invalidated ubsan: implement __ubsan_handle_alignment_assumption Revert "lib: Restrict cpumask_local_spread to houskeeping CPUs" x86/efi: Remove EFI PGD build time checks lkdtm: don't move ctors to .rodata KVM: x86: cleanup CR3 reserved bits checks cgroup-v1: add disabled controller check in cgroup1_parse_param() dmaengine: idxd: fix misc interrupt completion ath9k: fix build error with LEDS_CLASS=m mt76: dma: fix a possible memory leak in mt76_add_fragment() drm/vc4: hvs: Fix buffer overflow with the dlist handling dmaengine: idxd: check device state before issue command bpf: Unbreak BPF_PROG_TYPE_KPROBE when kprobe is called via do_int3 bpf: Check for integer overflow when using roundup_pow_of_two() netfilter: xt_recent: Fix attempt to update deleted entry selftests: netfilter: fix current year netfilter: nftables: fix possible UAF over chains from packet path in netns netfilter: flowtable: fix tcp and udp header checksum update xen/netback: avoid race in xenvif_rx_ring_slots_available() net: hdlc_x25: Return meaningful error code in x25_open net: ipa: set error code in gsi_channel_setup() hv_netvsc: Reset the RSC count if NVSP_STAT_FAIL in netvsc_receive() net: enetc: initialize the RFS and RSS memories selftests: txtimestamp: fix compilation issue net: stmmac: set TxQ mode back to DCB after disabling CBS ibmvnic: Clear failover_pending if unable to schedule netfilter: conntrack: skip identical origin tuple in same zone only scsi: scsi_debug: Fix a memory leak x86/build: Disable CET instrumentation in the kernel for 32-bit too net: dsa: felix: implement port flushing on .phylink_mac_link_down net: hns3: add a check for queue_id in hclge_reset_vf_queue() net: hns3: add a check for tqp_index in hclge_get_ring_chain_from_mbx() net: hns3: add a check for index in hclge_get_rss_key() firmware_loader: align .builtin_fw to 8 drm/sun4i: tcon: set sync polarity for tcon1 channel drm/sun4i: dw-hdmi: always set clock rate drm/sun4i: Fix H6 HDMI PHY configuration drm/sun4i: dw-hdmi: Fix max. frequency for H6 clk: sunxi-ng: mp: fix parent rate change flag check i2c: stm32f7: fix configuration of the digital filter h8300: fix PREEMPTION build, TI_PRE_COUNT undefined scripts: set proper OpenSSL include dir also for sign-file x86/pci: Create PCI/MSI irqdomain after x86_init.pci.arch_init() arm64: mte: Allow PTRACE_PEEKMTETAGS access to the zero page rxrpc: Fix clearance of Tx/Rx ring when releasing a call udp: fix skb_copy_and_csum_datagram with odd segment sizes net: dsa: call teardown method on probe failure cpufreq: ACPI: Extend frequency tables to cover boost frequencies cpufreq: ACPI: Update arch scale-invariance max perf ratio if CPPC is not there net: gro: do not keep too many GRO packets in napi->rx_list net: fix iteration for sctp transport seq_files net/vmw_vsock: fix NULL pointer dereference net/vmw_vsock: improve locking in vsock_connect_timeout() net: watchdog: hold device global xmit lock during tx disable bridge: mrp: Fix the usage of br_mrp_port_switchdev_set_state switchdev: mrp: Remove SWITCHDEV_ATTR_ID_MRP_PORT_STAT vsock/virtio: update credit only if socket is not closed vsock: fix locking in vsock_shutdown() net/rds: restrict iovecs length for RDS_CMSG_RDMA_ARGS net/qrtr: restrict user-controlled length in qrtr_tun_write_iter() ovl: expand warning in ovl_d_real() kcov, usb: only collect coverage from __usb_hcd_giveback_urb in softirq Linux 5.10.17 Signed-off-by: Greg Kroah-Hartman <gregkh@google.com> Change-Id: Id0300681f52b51d3f466f1e66ec3a6c25f65f4d3	2021-02-18 11:21:01 +01:00
Odin Ugedal	e72a65802a	cgroup: fix psi monitor for root cgroup commit `385aac1519` upstream. Fix NULL pointer dereference when adding new psi monitor to the root cgroup. PSI files for root cgroup was introduced in `df5ba5be74` by using system wide psi struct when reading, but file write/monitor was not properly fixed. Since the PSI config for the root cgroup isn't initialized, the current implementation tries to lock a NULL ptr, resulting in a crash. Can be triggered by running this as root: $ tee /sys/fs/cgroup/cpu.pressure <<< "some 10000 1000000" Signed-off-by: Odin Ugedal <odin@uged.al> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Acked-by: Dan Schatzberg <dschatzberg@fb.com> Fixes: `df5ba5be74` ("kernel/sched/psi.c: expose pressure metrics on root cgroup") Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: stable@vger.kernel.org # 5.2+ Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2021-02-17 11:02:21 +01:00
Abhijeet Dharmapurikar	3975c98baf	ANDROID: cgroup: Export functions used by modules Below symbols would be used by vendor modules for tracking tasks in various cgroups 1. cgroup_taskset_first 2. cgroup_taskset_next Bug: 175045928 Change-Id: I6d89148eb4c71174a02a27acab196ff940be9082 Signed-off-by: Abhijeet Dharmapurikar <adharmap@codeaurora.org> (cherry picked from commit `7333bd73c0`)	2020-12-14 11:22:22 +00:00
Shaleen Agrawal	5a920a6503	ANDROID: Sched: Export scheduler symbols needed by vendor modules Need to export internal scheduler symbols to facilitate vendor module with scheduler based value-adds. Bug: 173725277 Change-Id: I021f09097dfc1480abcc998cc8e05e75b2ee828b Signed-off-by: Shaleen Agrawal <shalagra@codeaurora.org> Signed-off-by: Pavankumar Kondeti <pkondeti@codeaurora.org>	2020-12-03 16:50:04 +00:00
Jouni Roivas	65026da59c	cgroup: Zero sized write should be no-op Do not report failure on zero sized writes, and handle them as no-op. There's issues for example in case of writev() when there's iovec containing zero buffer as a first one. It's expected writev() on below example to successfully perform the write to specified writable cgroup file expecting integer value, and to return 2. For now it's returning value -1, and skipping the write: int writetest(int fd) { const char buf1 = ""; const char buf2 = "1\n"; struct iovec iov[2] = { { .iov_base = (void)buf1, .iov_len = 0 }, { .iov_base = (void)buf2, .iov_len = 2 } }; return writev(fd, iov, 2); } This patch fixes the issue by checking if there's nothing to write, and handling the write as no-op by just returning 0. Signed-off-by: Jouni Roivas <jouni.roivas@tuxera.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2020-09-30 13:52:06 -04:00
Wei Yang	95d325185c	cgroup: remove redundant kernfs_activate in cgroup_setup_root() This step is already done in rebind_subsystems(). Not necessary to do it again. Signed-off-by: Wei Yang <richard.weiyang@linux.alibaba.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2020-09-30 12:03:10 -04:00
Cong Wang	ad0f75e5f5	cgroup: fix cgroup_sk_alloc() for sk_clone_lock() When we clone a socket in sk_clone_lock(), its sk_cgrp_data is copied, so the cgroup refcnt must be taken too. And, unlike the sk_alloc() path, sock_update_netprioidx() is not called here. Therefore, it is safe and necessary to grab the cgroup refcnt even when cgroup_sk_alloc is disabled. sk_clone_lock() is in BH context anyway, the in_interrupt() would terminate this function if called there. And for sk_alloc() skcd->val is always zero. So it's safe to factor out the code to make it more readable. The global variable 'cgroup_sk_alloc_disabled' is used to determine whether to take these reference counts. It is impossible to make the reference counting correct unless we save this bit of information in skcd->val. So, add a new bit there to record whether the socket has already taken the reference counts. This obviously relies on kmalloc() to align cgroup pointers to at least 4 bytes, ARCH_KMALLOC_MINALIGN is certainly larger than that. This bug seems to be introduced since the beginning, commit `d979a39d72` ("cgroup: duplicate cgroup reference when cloning sockets") tried to fix it but not compeletely. It seems not easy to trigger until the recent commit `090e28b229` ("netprio_cgroup: Fix unlimited memory leak of v2 cgroups") was merged. Fixes: `bd1060a1d6` ("sock, cgroup: add sock->sk_cgroup") Reported-by: Cameron Berkenpas <cam@neo-zeon.de> Reported-by: Peter Geis <pgwipeout@gmail.com> Reported-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com> Reported-by: Daniël Sonck <dsonck92@gmail.com> Reported-by: Zhang Qiang <qiang.zhang@windriver.com> Tested-by: Cameron Berkenpas <cam@neo-zeon.de> Tested-by: Peter Geis <pgwipeout@gmail.com> Tested-by: Thomas Lamprecht <t.lamprecht@proxmox.com> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: Zefan Li <lizefan@huawei.com> Cc: Tejun Heo <tj@kernel.org> Cc: Roman Gushchin <guro@fb.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2020-07-07 13:34:11 -07:00
Linus Torvalds	4a7e89c5ec	Merge branch 'for-5.8' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup updates from Tejun Heo: "Just two patches: one to add system-level cpu.stat to the root cgroup for convenience and a trivial comment update" * 'for-5.8' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: cgroup: add cpu.stat file to root cgroup cgroup: Remove stale comments	2020-06-06 09:59:34 -07:00
Boris Burkov	936f2a70f2	cgroup: add cpu.stat file to root cgroup Currently, the root cgroup does not have a cpu.stat file. Add one which is consistent with /proc/stat to capture global cpu statistics that might not fall under cgroup accounting. We haven't done this in the past because the data are already presented in /proc/stat and we didn't want to add overhead from collecting root cgroup stats when cgroups are configured, but no cgroups have been created. By keeping the data consistent with /proc/stat, I think we avoid the first problem, while improving the usability of cgroups stats. We avoid the second problem by computing the contents of cpu.stat from existing data collected for /proc/stat anyway. Signed-off-by: Boris Burkov <boris@bur.io> Suggested-by: Tejun Heo <tj@kernel.org> Signed-off-by: Tejun Heo <tj@kernel.org>	2020-05-28 10:06:35 -04:00
Zefan Li	6b6ebb3474	cgroup: Remove stale comments - The default root is where we can create v2 cgroups. - The __DEVEL__sane_behavior mount option has been removed long long ago. Signed-off-by: Li Zefan <lizefan@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2020-05-26 13:20:24 -04:00
Andrii Nakryiko	f9d041271c	bpf: Refactor bpf_link update handling Make bpf_link update support more generic by making it into another bpf_link_ops methods. This allows generic syscall handling code to be agnostic to various conditionally compiled features (e.g., the case of CONFIG_CGROUP_BPF). This also allows to keep link type-specific code to remain static within respective code base. Refactor existing bpf_cgroup_link code and take advantage of this. Signed-off-by: Andrii Nakryiko <andriin@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20200429001614.1544-2-andriin@fb.com	2020-04-28 17:27:07 -07:00
Linus Torvalds	d883600523	Merge branch 'for-5.7' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup updates from Tejun Heo: - Christian extended clone3 so that processes can be spawned into cgroups directly. This is not only neat in terms of semantics but also avoids grabbing the global cgroup_threadgroup_rwsem for migration. - Daniel added !root xattr support to cgroupfs. Userland already uses xattrs on cgroupfs for bookkeeping. This will allow delegated cgroups to support such usages. - Prateek tried to make cpuset hotplug handling synchronous but that led to possible deadlock scenarios. Reverted. - Other minor changes including release_agent_path handling cleanup. * 'for-5.7' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: docs: cgroup-v1: Document the cpuset_v2_mode mount option Revert "cpuset: Make cpuset hotplug synchronous" cgroupfs: Support user xattrs kernfs: Add option to enable user xattrs kernfs: Add removed_size out param for simple_xattr_set kernfs: kvmalloc xattr value instead of kmalloc cgroup: Restructure release_agent_path handling selftests/cgroup: add tests for cloning into cgroups clone3: allow spawning processes into cgroups cgroup: add cgroup_may_write() helper cgroup: refactor fork helpers cgroup: add cgroup_get_from_file() helper cgroup: unify attach permission checking cpuset: Make cpuset hotplug synchronous cgroup.c: Use built-in RCU list checking kselftest/cgroup: add cgroup destruction test cgroup: Clean up css_set task traversal	2020-04-03 11:30:20 -07:00
Johannes Weiner	8a931f8013	mm: memcontrol: recursive memory.low protection Right now, the effective protection of any given cgroup is capped by its own explicit memory.low setting, regardless of what the parent says. The reasons for this are mostly historical and ease of implementation: to make delegation of memory.low safe, effective protection is the min() of all memory.low up the tree. Unfortunately, this limitation makes it impossible to protect an entire subtree from another without forcing the user to make explicit protection allocations all the way to the leaf cgroups - something that is highly undesirable in real life scenarios. Consider memory in a data center host. At the cgroup top level, we have a distinction between system management software and the actual workload the system is executing. Both branches are further subdivided into individual services, job components etc. We want to protect the workload as a whole from the system management software, but that doesn't mean we want to protect and prioritize individual workload wrt each other. Their memory demand can vary over time, and we'd want the VM to simply cache the hottest data within the workload subtree. Yet, the current memory.low limitations force us to allocate a fixed amount of protection to each workload component in order to get protection from system management software in general. This results in very inefficient resource distribution. Another concern with mandating downward allocation is that, as the complexity of the cgroup tree grows, it gets harder for the lower levels to be informed about decisions made at the host-level. Consider a container inside a namespace that in turn creates its own nested tree of cgroups to run multiple workloads. It'd be extremely difficult to configure memory.low parameters in those leaf cgroups that on one hand balance pressure among siblings as the container desires, while also reflecting the host-level protection from e.g. rpm upgrades, that lie beyond one or more delegation and namespacing points in the tree. It's highly unusual from a cgroup interface POV that nested levels have to be aware of and reflect decisions made at higher levels for them to be effective. To enable such use cases and scale configurability for complex trees, this patch implements a resource inheritance model for memory that is similar to how the CPU and the IO controller implement work-conserving resource allocations: a share of a resource allocated to a subree always applies to the entire subtree recursively, while allowing, but not mandating, children to further specify distribution rules. That means that if protection is explicitly allocated among siblings, those configured shares are being followed during page reclaim just like they are now. However, if the memory.low set at a higher level is not fully claimed by the children in that subtree, the "floating" remainder is applied to each cgroup in the tree in proportion to its size. Since reclaim pressure is applied in proportion to size as well, each child in that tree gets the same boost, and the effect is neutral among siblings - with respect to each other, they behave as if no memory control was enabled at all, and the VM simply balances the memory demands optimally within the subtree. But collectively those cgroups enjoy a boost over the cgroups in neighboring trees. E.g. a leaf cgroup with a memory.low setting of 0 no longer means that it's not getting a share of the hierarchically assigned resource, just that it doesn't claim a fixed amount of it to protect from its siblings. This allows us to recursively protect one subtree (workload) from another (system management), while letting subgroups compete freely among each other - without having to assign fixed shares to each leaf, and without nested groups having to echo higher-level settings. The floating protection composes naturally with fixed protection. Consider the following example tree: A A: low = 2G / \ A1: low = 1G A1 A2 A2: low = 0G As outside pressure is applied to this tree, A1 will enjoy a fixed protection from A2 of 1G, but the remaining, unclaimed 1G from A is split evenly among A1 and A2, coming out to 1.5G and 0.5G. There is a slight risk of regressing theoretical setups where the top-level cgroups don't know about the true budgeting and set bogusly high "bypass" values that are meaningfully allocated down the tree. Such setups would rely on unclaimed protection to be discarded, and distributing it would change the intended behavior. Be safe and hide the new behavior behind a mount option, 'memory_recursiveprot'. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Tejun Heo <tj@kernel.org> Acked-by: Roman Gushchin <guro@fb.com> Acked-by: Chris Down <chris@chrisdown.name> Cc: Michal Hocko <mhocko@suse.com> Cc: Michal Koutný <mkoutny@suse.com> Link: http://lkml.kernel.org/r/20200227195606.46212-4-hannes@cmpxchg.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2020-04-02 09:35:28 -07:00
Andrii Nakryiko	0c991ebc8c	bpf: Implement bpf_prog replacement for an active bpf_cgroup_link Add new operation (LINK_UPDATE), which allows to replace active bpf_prog from under given bpf_link. Currently this is only supported for bpf_cgroup_link, but will be extended to other kinds of bpf_links in follow-up patches. For bpf_cgroup_link, implemented functionality matches existing semantics for direct bpf_prog attachment (including BPF_F_REPLACE flag). User can either unconditionally set new bpf_prog regardless of which bpf_prog is currently active under given bpf_link, or, optionally, can specify expected active bpf_prog. If active bpf_prog doesn't match expected one, no changes are performed, old bpf_link stays intact and attached, operation returns a failure. cgroup_bpf_replace() operation is resolving race between auto-detachment and bpf_prog update in the same fashion as it's done for bpf_link detachment, except in this case update has no way of succeeding because of target cgroup marked as dying. So in this case error is returned. Signed-off-by: Andrii Nakryiko <andriin@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20200330030001.2312810-3-andriin@fb.com	2020-03-30 17:36:33 -07:00
Andrii Nakryiko	af6eea5743	bpf: Implement bpf_link-based cgroup BPF program attachment Implement new sub-command to attach cgroup BPF programs and return FD-based bpf_link back on success. bpf_link, once attached to cgroup, cannot be replaced, except by owner having its FD. Cgroup bpf_link supports only BPF_F_ALLOW_MULTI semantics. Both link-based and prog-based BPF_F_ALLOW_MULTI attachments can be freely intermixed. To prevent bpf_cgroup_link from keeping cgroup alive past the point when no BPF program can be executed, implement auto-detachment of link. When cgroup_bpf_release() is called, all attached bpf_links are forced to release cgroup refcounts, but they leave bpf_link otherwise active and allocated, as well as still owning underlying bpf_prog. This is because user-space might still have FDs open and active, so bpf_link as a user-referenced object can't be freed yet. Once last active FD is closed, bpf_link will be freed and underlying bpf_prog refcount will be dropped. But cgroup refcount won't be touched, because cgroup is released already. The inherent race between bpf_cgroup_link release (from closing last FD) and cgroup_bpf_release() is resolved by both operations taking cgroup_mutex. So the only additional check required is when bpf_cgroup_link attempts to detach itself from cgroup. At that time we need to check whether there is still cgroup associated with that link. And if not, exit with success, because bpf_cgroup_link was already successfully detached. Signed-off-by: Andrii Nakryiko <andriin@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Roman Gushchin <guro@fb.com> Link: https://lore.kernel.org/bpf/20200330030001.2312810-2-andriin@fb.com	2020-03-30 17:35:59 -07:00
Daniel Xu	38aca3071c	cgroupfs: Support user xattrs This patch turns on xattr support for cgroupfs. This is useful for letting non-root owners of delegated subtrees attach metadata to cgroups. One use case is for subtree owners to tell a userspace out of memory killer to bias away from killing specific subtrees. Tests: [/sys/fs/cgroup]# for i in $(seq 0 130); \ do setfattr workload.slice -n user.name$i -v wow; done setfattr: workload.slice: No space left on device setfattr: workload.slice: No space left on device setfattr: workload.slice: No space left on device [/sys/fs/cgroup]# for i in $(seq 0 130); \ do setfattr workload.slice --remove user.name$i; done setfattr: workload.slice: No such attribute setfattr: workload.slice: No such attribute setfattr: workload.slice: No such attribute [/sys/fs/cgroup]# for i in $(seq 0 130); \ do setfattr workload.slice -n user.name$i -v wow; done setfattr: workload.slice: No space left on device setfattr: workload.slice: No space left on device setfattr: workload.slice: No space left on device `seq 0 130` is inclusive, and 131 - 128 = 3, which is the number of errors we expect to see. [/data]# cat testxattr.c #include <sys/types.h> #include <sys/xattr.h> #include <stdio.h> #include <stdlib.h> int main() { char name[256]; char *buf = malloc(64 << 10); if (!buf) { perror("malloc"); return 1; } for (int i = 0; i < 4; ++i) { snprintf(name, 256, "user.bigone%d", i); if (setxattr("/sys/fs/cgroup/system.slice", name, buf, 64 << 10, 0)) { printf("setxattr failed on iteration=%d\n", i); return 1; } } return 0; } [/data]# ./a.out setxattr failed on iteration=2 [/data]# ./a.out setxattr failed on iteration=0 [/sys/fs/cgroup]# setfattr -x user.bigone0 system.slice/ [/sys/fs/cgroup]# setfattr -x user.bigone1 system.slice/ [/data]# ./a.out setxattr failed on iteration=2 Signed-off-by: Daniel Xu <dxu@dxuuu.xyz> Acked-by: Chris Down <chris@chrisdown.name> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Tejun Heo <tj@kernel.org>	2020-03-16 15:53:47 -04:00
Linus Torvalds	1b51f69461	Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from David Miller: "It looks like a decent sized set of fixes, but a lot of these are one liner off-by-one and similar type changes: 1) Fix netlink header pointer to calcular bad attribute offset reported to user. From Pablo Neira Ayuso. 2) Don't double clear PHY interrupts when ->did_interrupt is set, from Heiner Kallweit. 3) Add missing validation of various (devlink, nl802154, fib, etc.) attributes, from Jakub Kicinski. 4) Missing pos increments in various netfilter seq_next ops, from Vasily Averin. 5) Missing break in of_mdiobus_register() loop, from Dajun Jin. 6) Don't double bump tx_dropped in veth driver, from Jiang Lidong. 7) Work around FMAN erratum A050385, from Madalin Bucur. 8) Make sure ARP header is pulled early enough in bonding driver, from Eric Dumazet. 9) Do a cond_resched() during multicast processing of ipvlan and macvlan, from Mahesh Bandewar. 10) Don't attach cgroups to unrelated sockets when in interrupt context, from Shakeel Butt. 11) Fix tpacket ring state management when encountering unknown GSO types. From Willem de Bruijn. 12) Fix MDIO bus PHY resume by checking mdio_bus_phy_may_suspend() only in the suspend context. From Heiner Kallweit" git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (112 commits) net: systemport: fix index check to avoid an array out of bounds access tc-testing: add ETS scheduler to tdc build configuration net: phy: fix MDIO bus PM PHY resuming net: hns3: clear port base VLAN when unload PF net: hns3: fix RMW issue for VLAN filter switch net: hns3: fix VF VLAN table entries inconsistent issue net: hns3: fix "tc qdisc del" failed issue taprio: Fix sending packets without dequeueing them net: mvmdio: avoid error message for optional IRQ net: dsa: mv88e6xxx: Add missing mask of ATU occupancy register net: memcg: fix lockdep splat in inet_csk_accept() s390/qeth: implement smarter resizing of the RX buffer pool s390/qeth: refactor buffer pool code s390/qeth: use page pointers to manage RX buffer pool seg6: fix SRv6 L2 tunnels to use IANA-assigned protocol number net: dsa: Don't instantiate phylink for CPU/DSA ports unless needed net/packet: tpacket_rcv: do not increment ring index on drop sxgbe: Fix off by one in samsung driver strncpy size arg net: caif: Add lockdep expression to RCU traversal primitive MAINTAINERS: remove Sathya Perla as Emulex NIC maintainer ...	2020-03-12 16:19:19 -07:00
Tejun Heo	a09833f7cd	Merge branch 'for-5.6-fixes' into for-5.7	2020-03-12 16:44:18 -04:00
Shakeel Butt	e876ecc67d	cgroup: memcg: net: do not associate sock with unrelated cgroup We are testing network memory accounting in our setup and noticed inconsistent network memory usage and often unrelated cgroups network usage correlates with testing workload. On further inspection, it seems like mem_cgroup_sk_alloc() and cgroup_sk_alloc() are broken in irq context specially for cgroup v1. mem_cgroup_sk_alloc() and cgroup_sk_alloc() can be called in irq context and kind of assumes that this can only happen from sk_clone_lock() and the source sock object has already associated cgroup. However in cgroup v1, where network memory accounting is opt-in, the source sock can be unassociated with any cgroup and the new cloned sock can get associated with unrelated interrupted cgroup. Cgroup v2 can also suffer if the source sock object was created by process in the root cgroup or if sk_alloc() is called in irq context. The fix is to just do nothing in interrupt. WARNING: Please note that about half of the TCP sockets are allocated from the IRQ context, so, memory used by such sockets will not be accouted by the memcg. The stack trace of mem_cgroup_sk_alloc() from IRQ-context: CPU: 70 PID: 12720 Comm: ssh Tainted: 5.6.0-smp-DEV #1 Hardware name: ... Call Trace: <IRQ> dump_stack+0x57/0x75 mem_cgroup_sk_alloc+0xe9/0xf0 sk_clone_lock+0x2a7/0x420 inet_csk_clone_lock+0x1b/0x110 tcp_create_openreq_child+0x23/0x3b0 tcp_v6_syn_recv_sock+0x88/0x730 tcp_check_req+0x429/0x560 tcp_v6_rcv+0x72d/0xa40 ip6_protocol_deliver_rcu+0xc9/0x400 ip6_input+0x44/0xd0 ? ip6_protocol_deliver_rcu+0x400/0x400 ip6_rcv_finish+0x71/0x80 ipv6_rcv+0x5b/0xe0 ? ip6_sublist_rcv+0x2e0/0x2e0 process_backlog+0x108/0x1e0 net_rx_action+0x26b/0x460 __do_softirq+0x104/0x2a6 do_softirq_own_stack+0x2a/0x40 </IRQ> do_softirq.part.19+0x40/0x50 __local_bh_enable_ip+0x51/0x60 ip6_finish_output2+0x23d/0x520 ? ip6table_mangle_hook+0x55/0x160 __ip6_finish_output+0xa1/0x100 ip6_finish_output+0x30/0xd0 ip6_output+0x73/0x120 ? __ip6_finish_output+0x100/0x100 ip6_xmit+0x2e3/0x600 ? ipv6_anycast_cleanup+0x50/0x50 ? inet6_csk_route_socket+0x136/0x1e0 ? skb_free_head+0x1e/0x30 inet6_csk_xmit+0x95/0xf0 __tcp_transmit_skb+0x5b4/0xb20 __tcp_send_ack.part.60+0xa3/0x110 tcp_send_ack+0x1d/0x20 tcp_rcv_state_process+0xe64/0xe80 ? tcp_v6_connect+0x5d1/0x5f0 tcp_v6_do_rcv+0x1b1/0x3f0 ? tcp_v6_do_rcv+0x1b1/0x3f0 __release_sock+0x7f/0xd0 release_sock+0x30/0xa0 __inet_stream_connect+0x1c3/0x3b0 ? prepare_to_wait+0xb0/0xb0 inet_stream_connect+0x3b/0x60 __sys_connect+0x101/0x120 ? __sys_getsockopt+0x11b/0x140 __x64_sys_connect+0x1a/0x20 do_syscall_64+0x51/0x200 entry_SYSCALL_64_after_hwframe+0x44/0xa9 The stack trace of mem_cgroup_sk_alloc() from IRQ-context: Fixes: `2d75807383` ("mm: memcontrol: consolidate cgroup socket tracking") Fixes: `d979a39d72` ("cgroup: duplicate cgroup reference when cloning sockets") Signed-off-by: Shakeel Butt <shakeelb@google.com> Reviewed-by: Roman Gushchin <guro@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2020-03-10 15:33:05 -07:00
Qian Cai	190ecb190a	cgroup: fix psi_show() crash on 32bit ino archs Similar to the commit `d749534322` ("cgroup: fix incorrect WARN_ON_ONCE() in cgroup_setup_root()"), cgroup_id(root_cgrp) does not equal to 1 on 32bit ino archs which triggers all sorts of issues with psi_show() on s390x. For example, BUG: KASAN: slab-out-of-bounds in collect_percpu_times+0x2d0/ Read of size 4 at addr 000000001e0ce000 by task read_all/3667 collect_percpu_times+0x2d0/0x798 psi_show+0x7c/0x2a8 seq_read+0x2ac/0x830 vfs_read+0x92/0x150 ksys_read+0xe2/0x188 system_call+0xd8/0x2b4 Fix it by using cgroup_ino(). Fixes: `743210386c` ("cgroup: use cgrp->kn->id as the cgroup ID") Signed-off-by: Qian Cai <cai@lca.pw> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Tejun Heo <tj@kernel.org> Cc: stable@vger.kernel.org # v5.5	2020-03-04 11:15:58 -05:00
Christian Brauner	ef2c41cf38	clone3: allow spawning processes into cgroups This adds support for creating a process in a different cgroup than its parent. Callers can limit and account processes and threads right from the moment they are spawned: - A service manager can directly spawn new services into dedicated cgroups. - A process can be directly created in a frozen cgroup and will be frozen as well. - The initial accounting jitter experienced by process supervisors and daemons is eliminated with this. - Threaded applications or even thread implementations can choose to create a specific cgroup layout where each thread is spawned directly into a dedicated cgroup. This feature is limited to the unified hierarchy. Callers need to pass a directory file descriptor for the target cgroup. The caller can choose to pass an O_PATH file descriptor. All usual migration restrictions apply, i.e. there can be no processes in inner nodes. In general, creating a process directly in a target cgroup adheres to all migration restrictions. One of the biggest advantages of this feature is that CLONE_INTO_GROUP does not need to grab the write side of the cgroup cgroup_threadgroup_rwsem. This global lock makes moving tasks/threads around super expensive. With clone3() this lock is avoided. Cc: Tejun Heo <tj@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Li Zefan <lizefan@huawei.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: cgroups@vger.kernel.org Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2020-02-12 17:57:51 -05:00
Christian Brauner	f3553220d4	cgroup: add cgroup_may_write() helper Add a cgroup_may_write() helper which we can use in the CLONE_INTO_CGROUP patch series to verify that we can write to the destination cgroup. Cc: Tejun Heo <tj@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Li Zefan <lizefan@huawei.com> Cc: cgroups@vger.kernel.org Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2020-02-12 17:57:51 -05:00
Christian Brauner	5a5cf5cb30	cgroup: refactor fork helpers This refactors the fork helpers so they can be easily modified in the next patches. The patch just moves the cgroup threadgroup rwsem grab and release into the helpers. They don't need to be directly exposed in fork.c. Cc: Tejun Heo <tj@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Li Zefan <lizefan@huawei.com> Cc: cgroups@vger.kernel.org Acked-by: Michal Koutný <mkoutny@suse.com> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2020-02-12 17:57:51 -05:00
Christian Brauner	17703097f3	cgroup: add cgroup_get_from_file() helper Add a helper cgroup_get_from_file(). The helper will be used in subsequent patches to retrieve a cgroup while holding a reference to the struct file it was taken from. Cc: Tejun Heo <tj@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Li Zefan <lizefan@huawei.com> Cc: cgroups@vger.kernel.org Acked-by: Michal Koutný <mkoutny@suse.com> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2020-02-12 17:57:51 -05:00
Christian Brauner	6df970e4f5	cgroup: unify attach permission checking The core codepaths to check whether a process can be attached to a cgroup are the same for threads and thread-group leaders. Only a small piece of code verifying that source and destination cgroup are in the same domain differentiates the thread permission checking from thread-group leader permission checking. Since cgroup_migrate_vet_dst() only matters cgroup2 - it is a noop on cgroup1 - we can move it out of cgroup_attach_task(). All checks can now be consolidated into a new helper cgroup_attach_permissions() callable from both cgroup_procs_write() and cgroup_threads_write(). Cc: Tejun Heo <tj@kernel.org> Cc: Li Zefan <lizefan@huawei.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: cgroups@vger.kernel.org Acked-by: Michal Koutný <mkoutny@suse.com> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2020-02-12 17:57:51 -05:00
Madhuparna Bhowmik	3010c5b9f5	cgroup.c: Use built-in RCU list checking list_for_each_entry_rcu has built-in RCU and lock checking. Pass cond argument to list_for_each_entry_rcu() to silence false lockdep warning when CONFIG_PROVE_RCU_LIST is enabled by default. Even though the function css_next_child() already checks if cgroup_mutex or rcu_read_lock() is held using cgroup_assert_mutex_or_rcu_locked(), there is a need to pass cond to list_for_each_entry_rcu() to avoid false positive lockdep warning. Signed-off-by: Madhuparna Bhowmik <madhuparnabhowmik10@gmail.com> Acked-by: Michal Koutný <mkoutny@suse.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2020-02-12 17:12:22 -05:00
Michal Koutný	f43caa2adc	cgroup: Clean up css_set task traversal css_task_iter stores pointer to head of each iterable list, this dates back to commit `0f0a2b4fa6` ("cgroup: reorganize css_task_iter") when we did not store cur_cset. Let us utilize list heads directly in cur_cset and streamline css_task_iter_advance_css_set a bit. This is no intentional function change. Signed-off-by: Michal Koutný <mkoutny@suse.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2020-02-12 17:11:52 -05:00
Michal Koutný	9c974c7724	cgroup: Iterate tasks that did not finish do_exit() PF_EXITING is set earlier than actual removal from css_set when a task is exitting. This can confuse cgroup.procs readers who see no PF_EXITING tasks, however, rmdir is checking against css_set membership so it can transitionally fail with EBUSY. Fix this by listing tasks that weren't unlinked from css_set active lists. It may happen that other users of the task iterator (without CSS_TASK_ITER_PROCS) spot a PF_EXITING task before cgroup_exit(). This is equal to the state before commit `c03cd7738a` ("cgroup: Include dying leaders with live threads in PROCS iterations") but it may be reviewed later. Reported-by: Suren Baghdasaryan <surenb@google.com> Fixes: `c03cd7738a` ("cgroup: Include dying leaders with live threads in PROCS iterations") Signed-off-by: Michal Koutný <mkoutny@suse.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2020-02-12 17:02:53 -05:00
Vasily Averin	2d4ecb030d	cgroup: cgroup_procs_next should increase position index If seq_file .next fuction does not change position index, read after some lseek can generate unexpected output: 1) dd bs=1 skip output of each 2nd elements $ dd if=/sys/fs/cgroup/cgroup.procs bs=8 count=1 2 3 4 5 1+0 records in 1+0 records out 8 bytes copied, 0,000267297 s, 29,9 kB/s [test@localhost ~]$ dd if=/sys/fs/cgroup/cgroup.procs bs=1 count=8 2 4 <<< NB! 3 was skipped 6 <<< ... and 5 too 8 <<< ... and 7 8+0 records in 8+0 records out 8 bytes copied, 5,2123e-05 s, 153 kB/s This happen because __cgroup_procs_start() makes an extra extra cgroup_procs_next() call 2) read after lseek beyond end of file generates whole last line. 3) read after lseek into middle of last line generates expected rest of last line and unexpected whole line once again. Additionally patch removes an extra position index changes in __cgroup_procs_start() Cc: stable@vger.kernel.org https://bugzilla.kernel.org/show_bug.cgi?id=206283 Signed-off-by: Vasily Averin <vvs@virtuozzo.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2020-02-12 17:00:04 -05:00
Linus Torvalds	0a679e13ea	Merge branch 'for-5.6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup fix from Tejun Heo: "I made a mistake while removing cgroup task list lazy init optimization making the root cgroup.procs show entries for the init_tasks. The zero entries doesn't cause critical failures but does make systemd print out warning messages during boot. Fix it by omitting init_tasks as they should be" * 'for-5.6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: cgroup: init_tasks shouldn't be linked to the root cgroup	2020-02-10 17:07:05 -08:00
Linus Torvalds	c9d35ee049	Merge branch 'merge.nfs-fs_parse.1' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs file system parameter updates from Al Viro: "Saner fs_parser.c guts and data structures. The system-wide registry of syntax types (string/enum/int32/oct32/.../etc.) is gone and so is the horror switch() in fs_parse() that would have to grow another case every time something got added to that system-wide registry. New syntax types can be added by filesystems easily now, and their namespace is that of functions - not of system-wide enum members. IOW, they can be shared or kept private and if some turn out to be widely useful, we can make them common library helpers, etc., without having to do anything whatsoever to fs_parse() itself. And we already get that kind of requests - the thing that finally pushed me into doing that was "oh, and let's add one for timeouts - things like 15s or 2h". If some filesystem really wants that, let them do it. Without somebody having to play gatekeeper for the variants blessed by direct support in fs_parse(), TYVM. Quite a bit of boilerplate is gone. And IMO the data structures make a lot more sense now. -200LoC, while we are at it" * 'merge.nfs-fs_parse.1' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (25 commits) tmpfs: switch to use of invalfc() cgroup1: switch to use of errorfc() et.al. procfs: switch to use of invalfc() hugetlbfs: switch to use of invalfc() cramfs: switch to use of errofc() et.al. gfs2: switch to use of errorfc() et.al. fuse: switch to use errorfc() et.al. ceph: use errorfc() and friends instead of spelling the prefix out prefix-handling analogues of errorf() and friends turn fs_param_is_... into functions fs_parse: handle optional arguments sanely fs_parse: fold fs_parameter_desc/fs_parameter_spec fs_parser: remove fs_parameter_description name field add prefix to fs_context->log ceph_parse_param(), ceph_parse_mon_ips(): switch to passing fc_log new primitive: __fs_parse() switch rbd and libceph to p_log-based primitives struct p_log, variants of warnf() et.al. taking that one instead teach logfc() to handle prefices, give it saner calling conventions get rid of cg_invalf() ...	2020-02-08 13:26:41 -08:00
Al Viro	d7167b1499	fs_parse: fold fs_parameter_desc/fs_parameter_spec The former contains nothing but a pointer to an array of the latter... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2020-02-07 14:48:37 -05:00
Eric Sandeen	96cafb9ccb	fs_parser: remove fs_parameter_description name field Unused now. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Acked-by: David Howells <dhowells@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2020-02-07 14:48:36 -05:00
Tejun Heo	0cd9d33ace	cgroup: init_tasks shouldn't be linked to the root cgroup `5153faac18` ("cgroup: remove cgroup_enable_task_cg_lists() optimization") removed lazy initialization of css_sets so that new tasks are always lniked to its css_set. In the process, it incorrectly ended up adding init_tasks to root css_set. They show up as PID 0's in root's cgroup.procs triggering warnings in systemd and generally confusing people. Fix it by skip css_set linking for init_tasks. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: https://github.com/joanbm Link: https://github.com/systemd/systemd/issues/14682 Fixes: `5153faac18` ("cgroup: remove cgroup_enable_task_cg_lists() optimization") Cc: stable@vger.kernel.org # v5.5+	2020-01-30 11:37:33 -05:00
Linus Torvalds	bd2463ac7d	Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next Pull networking updates from David Miller: 1) Add WireGuard 2) Add HE and TWT support to ath11k driver, from John Crispin. 3) Add ESP in TCP encapsulation support, from Sabrina Dubroca. 4) Add variable window congestion control to TIPC, from Jon Maloy. 5) Add BCM84881 PHY driver, from Russell King. 6) Start adding netlink support for ethtool operations, from Michal Kubecek. 7) Add XDP drop and TX action support to ena driver, from Sameeh Jubran. 8) Add new ipv4 route notifications so that mlxsw driver does not have to handle identical routes itself. From Ido Schimmel. 9) Add BPF dynamic program extensions, from Alexei Starovoitov. 10) Support RX and TX timestamping in igc, from Vinicius Costa Gomes. 11) Add support for macsec HW offloading, from Antoine Tenart. 12) Add initial support for MPTCP protocol, from Christoph Paasch, Matthieu Baerts, Florian Westphal, Peter Krystad, and many others. 13) Add Octeontx2 PF support, from Sunil Goutham, Geetha sowjanya, Linu Cherian, and others. * git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1469 commits) net: phy: add default ARCH_BCM_IPROC for MDIO_BCM_IPROC udp: segment looped gso packets correctly netem: change mailing list qed: FW 8.42.2.0 debug features qed: rt init valid initialization changed qed: Debug feature: ilt and mdump qed: FW 8.42.2.0 Add fw overlay feature qed: FW 8.42.2.0 HSI changes qed: FW 8.42.2.0 iscsi/fcoe changes qed: Add abstraction for different hsi values per chip qed: FW 8.42.2.0 Additional ll2 type qed: Use dmae to write to widebus registers in fw_funcs qed: FW 8.42.2.0 Parser offsets modified qed: FW 8.42.2.0 Queue Manager changes qed: FW 8.42.2.0 Expose new registers and change windows qed: FW 8.42.2.0 Internal ram offsets modifications MAINTAINERS: Add entry for Marvell OcteonTX2 Physical Function driver Documentation: net: octeontx2: Add RVU HW and drivers overview octeontx2-pf: ethtool RSS config support octeontx2-pf: Add basic ethtool support ...	2020-01-28 16:02:33 -08:00
Michal Koutný	3bc0bb36fa	cgroup: Prevent double killing of css when enabling threaded cgroup The test_cgcore_no_internal_process_constraint_on_threads selftest when running with subsystem controlling noise triggers two warnings: > [ 597.443115] WARNING: CPU: 1 PID: 28167 at kernel/cgroup/cgroup.c:3131 cgroup_apply_control_enable+0xe0/0x3f0 > [ 597.443413] WARNING: CPU: 1 PID: 28167 at kernel/cgroup/cgroup.c:3177 cgroup_apply_control_disable+0xa6/0x160 Both stem from a call to cgroup_type_write. The first warning was also triggered by syzkaller. When we're switching cgroup to threaded mode shortly after a subsystem was disabled on it, we can see the respective subsystem css dying there. The warning in cgroup_apply_control_enable is harmless in this case since we're not adding new subsys anyway. The warning in cgroup_apply_control_disable indicates an attempt to kill css of recently disabled subsystem repeatedly. The commit prevents these situations by making cgroup_type_write wait for all dying csses to go away before re-applying subtree controls. When at it, the locations of WARN_ON_ONCE calls are moved so that warning is triggered only when we are about to misuse the dying css. Reported-by: syzbot+5493b2a54d31d6aea629@syzkaller.appspotmail.com Reported-by: Christian Brauner <christian.brauner@ubuntu.com> Signed-off-by: Michal Koutný <mkoutny@suse.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2020-01-15 08:04:29 -08:00
Andrey Ignatov	7dd68b3279	bpf: Support replacing cgroup-bpf program in MULTI mode The common use-case in production is to have multiple cgroup-bpf programs per attach type that cover multiple use-cases. Such programs are attached with BPF_F_ALLOW_MULTI and can be maintained by different people. Order of programs usually matters, for example imagine two egress programs: the first one drops packets and the second one counts packets. If they're swapped the result of counting program will be different. It brings operational challenges with updating cgroup-bpf program(s) attached with BPF_F_ALLOW_MULTI since there is no way to replace a program: * One way to update is to detach all programs first and then attach the new version(s) again in the right order. This introduces an interruption in the work a program is doing and may not be acceptable (e.g. if it's egress firewall); * Another way is attach the new version of a program first and only then detach the old version. This introduces the time interval when two versions of same program are working, what may not be acceptable if a program is not idempotent. It also imposes additional burden on program developers to make sure that two versions of their program can co-exist. Solve the problem by introducing a "replace" mode in BPF_PROG_ATTACH command for cgroup-bpf programs being attached with BPF_F_ALLOW_MULTI flag. This mode is enabled by newly introduced BPF_F_REPLACE attach flag and bpf_attr.replace_bpf_fd attribute to pass fd of the old program to replace That way user can replace any program among those attached with BPF_F_ALLOW_MULTI flag without the problems described above. Details of the new API: * If BPF_F_REPLACE is set but replace_bpf_fd doesn't have valid descriptor of BPF program, BPF_PROG_ATTACH will return corresponding error (EINVAL or EBADF). * If replace_bpf_fd has valid descriptor of BPF program but such a program is not attached to specified cgroup, BPF_PROG_ATTACH will return ENOENT. BPF_F_REPLACE is introduced to make the user intent clear, since replace_bpf_fd alone can't be used for this (its default value, 0, is a valid fd). BPF_F_REPLACE also makes it possible to extend the API in the future (e.g. add BPF_F_BEFORE and BPF_F_AFTER if needed). Signed-off-by: Andrey Ignatov <rdna@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Martin KaFai Lau <kafai@fb.com> Acked-by: Andrii Narkyiko <andriin@fb.com> Link: https://lore.kernel.org/bpf/30cd850044a0057bdfcaaf154b7d2f39850ba813.1576741281.git.rdna@fb.com	2019-12-19 21:22:25 -08:00
Linus Torvalds	1b96a41b42	Merge branch 'for-5.5' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup updates from Tejun Heo: "There are several notable changes here: - Single thread migrating itself has been optimized so that it doesn't need threadgroup rwsem anymore. - Freezer optimization to avoid unnecessary frozen state changes. - cgroup ID unification so that cgroup fs ino is the only unique ID used for the cgroup and can be used to directly look up live cgroups through filehandle interface on 64bit ino archs. On 32bit archs, cgroup fs ino is still the only ID in use but it is only unique when combined with gen. - selftest and other changes" * 'for-5.5' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (24 commits) writeback: fix -Wformat compilation warnings docs: cgroup: mm: Fix spelling of "list" cgroup: fix incorrect WARN_ON_ONCE() in cgroup_setup_root() cgroup: use cgrp->kn->id as the cgroup ID kernfs: use 64bit inos if ino_t is 64bit kernfs: implement custom exportfs ops and fid type kernfs: combine ino/id lookup functions into kernfs_find_and_get_node_by_id() kernfs: convert kernfs_node->id from union kernfs_node_id to u64 kernfs: kernfs_find_and_get_node_by_ino() should only look up activated nodes kernfs: use dumber locking for kernfs_find_and_get_node_by_ino() netprio: use css ID instead of cgroup ID writeback: use ino_t for inodes in tracepoints kernfs: fix ino wrap-around detection kselftests: cgroup: Avoid the reuse of fd after it is deallocated cgroup: freezer: don't change task and cgroups status unnecessarily cgroup: use cgroup->last_bstat instead of cgroup->bstat_pending for consistency cgroup: remove cgroup_enable_task_cg_lists() optimization cgroup: pids: use atomic64_t for pids->limit selftests: cgroup: Run test_core under interfering stress selftests: cgroup: Add task migration tests ...	2019-11-25 19:23:46 -08:00
Tejun Heo	d749534322	cgroup: fix incorrect WARN_ON_ONCE() in cgroup_setup_root() `743210386c` ("cgroup: use cgrp->kn->id as the cgroup ID") added WARN which triggers if cgroup_id(root_cgrp) is not 1. This is fine on 64bit ino archs but on 32bit archs cgroup ID is ((gen << 32) \| ino) and gen starts at 1, so the root id is 0x1_0000_0001 instead of 1 always triggering the WARN. What we wanna make sure is that the ino part is 1. Fix it. Reported-by: Naresh Kamboju <naresh.kamboju@linaro.org> Fixes: `743210386c` ("cgroup: use cgrp->kn->id as the cgroup ID") Signed-off-by: Tejun Heo <tj@kernel.org>	2019-11-14 14:46:51 -08:00
Tejun Heo	743210386c	cgroup: use cgrp->kn->id as the cgroup ID cgroup ID is currently allocated using a dedicated per-hierarchy idr and used internally and exposed through tracepoints and bpf. This is confusing because there are tracepoints and other interfaces which use the cgroupfs ino as IDs. The preceding changes made kn->id exposed as ino as 64bit ino on supported archs or ino+gen (low 32bits as ino, high gen). There's no reason for cgroup to use different IDs. The kernfs IDs are unique and userland can easily discover them and map them back to paths using standard file operations. This patch replaces cgroup IDs with kernfs IDs. * cgroup_id() is added and all cgroup ID users are converted to use it. * kernfs_node creation is moved to earlier during cgroup init so that cgroup_id() is available during init. * While at it, s/cgroup/cgrp/ in psi helpers for consistency. * Fallback ID value is changed to 1 to be consistent with root cgroup ID. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Namhyung Kim <namhyung@kernel.org>	2019-11-12 08:18:04 -08:00
Tejun Heo	fe0f726c9f	kernfs: combine ino/id lookup functions into kernfs_find_and_get_node_by_id() kernfs_find_and_get_node_by_ino() looks the kernfs_node matching the specified ino. On top of that, kernfs_get_node_by_id() and kernfs_fh_get_inode() implement full ID matching by testing the rest of ID. On surface, confusingly, the two are slightly different in that the latter uses 0 gen as wildcard while the former doesn't - does it mean that the latter can't uniquely identify inodes w/ 0 gen? In practice, this is a distinction without a difference because generation number starts at 1. There are no actual IDs with 0 gen, so it can always safely used as wildcard. Let's simplify the code by renaming kernfs_find_and_get_node_by_ino() to kernfs_find_and_get_node_by_id(), moving all lookup logics into it, and removing now unnecessary kernfs_get_node_by_id(). Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2019-11-12 08:18:04 -08:00
Tejun Heo	67c0496e87	kernfs: convert kernfs_node->id from union kernfs_node_id to u64 kernfs_node->id is currently a union kernfs_node_id which represents either a 32bit (ino, gen) pair or u64 value. I can't see much value in the usage of the union - all that's needed is a 64bit ID which the current code is already limited to. Using a union makes the code unnecessarily complicated and prevents using 64bit ino without adding practical benefits. This patch drops union kernfs_node_id and makes kernfs_node->id a u64. ino is stored in the lower 32bits and gen upper. Accessors - kernfs[_id]_ino() and kernfs[_id]_gen() - are added to retrieve the ino and gen. This simplifies ID handling less cumbersome and will allow using 64bit inos on supported archs. This patch doesn't make any functional changes. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Alexei Starovoitov <ast@kernel.org>	2019-11-12 08:18:03 -08:00

1 2 3 4 5

214 Commits