linux

mirror of https://github.com/torvalds/linux.git synced 2026-06-04 12:35:52 +02:00

Author	SHA1	Message	Date
Linus Torvalds	31e62c2ebb	ptrace: slightly saner 'get_dumpable()' logic The 'dumpability' of a task is fundamentally about the memory image of the task - the concept comes from whether it can core dump or not - and makes no sense when you don't have an associated mm. And almost all users do in fact use it only for the case where the task has a mm pointer. But we have one odd special case: ptrace_may_access() uses 'dumpable' to check various other things entirely independently of the MM (typically explicitly using flags like PTRACE_MODE_READ_FSCREDS). Including for threads that no longer have a VM (and maybe never did, like most kernel threads). It's not what this flag was designed for, but it is what it is. The ptrace code does check that the uid/gid matches, so you do have to be uid-0 to see kernel thread details, but this means that the traditional "drop capabilities" model doesn't make any difference for this all. Make it all make a bit more sense by saying that if you don't have a MM pointer, we'll use a cached "last dumpability" flag if the thread ever had a MM (it will be zero for kernel threads since it is never set), and require a proper CAP_SYS_PTRACE capability to override. Reported-by: Qualys Security Advisory <qsa@qualys.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Kees Cook <kees@kernel.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2026-05-14 08:32:11 -07:00
Linus Torvalds	59a62ea458	sched_ext: Fixes for v7.1-rc3 Bulk is hardening of the new sub-scheduler infrastructure. - UAFs and lifecycle bugs on the sub-sched attach/detach paths: parent sub_kset freed under a racing child, list_del_rcu on an uninitialized list head, ops->priv stomped by concurrent attach/detach, and a UAF in the init-failure error path. - Task state-machine reorg closing concurrent enable-vs-dead races: a task exiting during the unlocked init window could trip NULL ops derefs or skip exit_task() cleanup. - A scx_link_sched() self-deadlock on scx_sched_lock. - isolcpus: stop dereferencing the now-RCU-protected HK_TYPE_DOMAIN cpumask without RCU, and stop rejecting BPF schedulers when only cpuset isolated partitions are active. - PREEMPT_RT: disable irq_work runs in hardirq context so dumps show the failing task rather than the irq_work kthread. - Assorted !CONFIG_EXT_SUB_SCHED, randconfig, and selftest build fixes. -----BEGIN PGP SIGNATURE----- iIQEABYKACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCagTk1g4cdGpAa2VybmVs Lm9yZwAKCRCxYfJx3gVYGT6TAP0ZbRHz9ViligecZXIHjEvZQjEV4sn1NLpGi4og V0Ol2AD/RzqHQZo5+HpMz4hPrcZdkAWcr74cLrNTJ2WQjOk4RgE= =6Mbx -----END PGP SIGNATURE----- Merge tag 'sched_ext-for-7.1-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext Pull sched_ext fixes from Tejun Heo: "The bulk of this is hardening of the new sub-scheduler infrastructure. - UAFs and lifecycle bugs on the sub-sched attach/detach paths: parent sub_kset freed under a racing child, list_del_rcu on an uninitialized list head, ops->priv stomped by concurrent attach/detach, and a UAF in the init-failure error path - Task state-machine reorg closing concurrent enable-vs-dead races: a task exiting during the unlocked init window could trip NULL ops derefs or skip exit_task() cleanup - A scx_link_sched() self-deadlock on scx_sched_lock - isolcpus: stop dereferencing the now-RCU-protected HK_TYPE_DOMAIN cpumask without RCU, and stop rejecting BPF schedulers when only cpuset isolated partitions are active - PREEMPT_RT: disable irq_work runs in hardirq context so dumps show the failing task rather than the irq_work kthread - Assorted !CONFIG_EXT_SUB_SCHED, randconfig, and selftest build fixes" * tag 'sched_ext-for-7.1-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext: sched_ext: Use HK_TYPE_DOMAIN_BOOT to detect isolcpus= domain isolation sched_ext: Defer sub_kset base put to scx_sched_free_rcu_work sched_ext: INIT_LIST_HEAD() &sch->all in scx_alloc_and_add_sched() sched_ext: Drop NONE early return in scx_disable_and_exit_task() sched_ext: Avoid UAF in scx_root_enable_workfn() init failure path sched_ext: Clear ops->priv on scx_alloc_and_add_sched() error paths sched_ext: Fix ops->priv clobber on concurrent attach/detach selftests/sched_ext: Fix build error in dequeue selftest sched_ext: Handle SCX_TASK_NONE in disable/switched_from paths sched_ext: Close sub-sched init race with post-init DEAD recheck sched_ext: Close root-enable vs sched_ext_dead() race with SCX_TASK_INIT_BEGIN sched_ext: Replace SCX_TASK_OFF_TASKS flag with SCX_TASK_DEAD state sched_ext: Inline scx_init_task() and move RESET_RUNNABLE_AT into scx_set_task_state() sched_ext: Cleanups in preparation for the SCX_TASK_INIT_BEGIN/DEAD work sched_ext: Use IRQ_WORK_INIT_HARD() to initialize sch->disable_irq_work sched_ext: Fix !CONFIG_EXT_SUB_SCHED build warnings sched_ext: Drop unused scx_find_sub_sched() stub sched_ext: Move scx_error() out of scx_link_sched()'s lock region	2026-05-13 15:00:40 -07:00
Linus Torvalds	0913b580f8	cgroup: Fixes for v7.1-rc3 - cpuset fixes: - Partition invalidation could return CPUs still in use by sibling partitions, producing overlapping effective_cpus. - cpuset_can_attach() over-reserved DL bandwidth on moves that stayed within the same root domain. - Pending DL migration state leaked into later attaches when a later can_attach() check failed. - Reorder PF_EXITING and __GFP_HARDWALL checks so dying tasks can allocate from any node and exit quickly. - dmem: propagate -ENOMEM instead of spinning forever when the fallback pool allocation also fails. - selftests/cgroup: percpu test error-path leak, bogus numeric comparison of cpuset strings, and a zero-length read() that silently passed OOM-kill tests. -----BEGIN PGP SIGNATURE----- iIQEABYKACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCagTkzw4cdGpAa2VybmVs Lm9yZwAKCRCxYfJx3gVYGR+AAQCcYEGJ+yNAzzrTcY8xy7333rorMckSmZt18jzv 1KSqEQD+KjindGNcWP/meQBPnEjcBjix6i961mgnQ99e/UD2HQ4= =4pT3 -----END PGP SIGNATURE----- Merge tag 'cgroup-for-7.1-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup fixes from Tejun Heo: - cpuset fixes: - Partition invalidation could return CPUs still in use by sibling partitions, producing overlapping effective_cpus - cpuset_can_attach() over-reserved DL bandwidth on moves that stayed within the same root domain - Pending DL migration state leaked into later attaches when a later can_attach() check failed - Reorder PF_EXITING and __GFP_HARDWALL checks so dying tasks can allocate from any node and exit quickly - dmem: propagate -ENOMEM instead of spinning forever when the fallback pool allocation also fails - selftests/cgroup: percpu test error-path leak, bogus numeric comparison of cpuset strings, and a zero-length read() that silently passed OOM-kill tests * tag 'cgroup-for-7.1-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: cgroup/cpuset: Return only actually allocated CPUs during partition invalidation selftests/cgroup: Fix error path leaks in test_percpu_basic cgroup/cpuset: Reserve DL bandwidth only for root-domain moves cgroup/cpuset: Reset DL migration state on can_attach() failure selftests/cgroup: Fix string comparison in write_test selftests/cgroup: Fix cg_read_strcmp() empty string comparison cgroup/dmem: Return -ENOMEM on failed pool preallocation cgroup/cpuset: move PF_EXITING check before __GFP_HARDWALL in cpuset_current_node_allowed()	2026-05-13 14:56:31 -07:00
Linus Torvalds	50599e4c68	workqueue: Fixes for v7.1-rc3 - Plug a wq->cpu_pwq leak on the WQ_UNBOUND allocation failure path. - Fix a cancel_delayed_work_sync() livelock against drain_workqueue() caused by the drain/destroy reject path leaving WORK_STRUCT_PENDING set with no owner. -----BEGIN PGP SIGNATURE----- iIQEABYKACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCagTkwA4cdGpAa2VybmVs Lm9yZwAKCRCxYfJx3gVYGXGNAQDarHcCjUzjddPY1drGJz73LIsfAhU1haDWYQgD Ssd/ZgD/fYP0Gp6GwbFF/n9JAo48Y2P29PF4lOfVagv1Md0SeAM= =NMbR -----END PGP SIGNATURE----- Merge tag 'wq-for-7.1-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq Pull workqueue fixes from Tejun Heo: - Plug a wq->cpu_pwq leak on the WQ_UNBOUND allocation failure path - Fix a cancel_delayed_work_sync() livelock against drain_workqueue() caused by the drain/destroy reject path leaving WORK_STRUCT_PENDING set with no owner * tag 'wq-for-7.1-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: workqueue: Fix wq->cpu_pwq leak in alloc_and_link_pwqs() WQ_UNBOUND path workqueue: Release PENDING in __queue_work() drain/destroy reject path	2026-05-13 14:49:13 -07:00
Andrea Righi	6ae315d379	sched_ext: Use HK_TYPE_DOMAIN_BOOT to detect isolcpus= domain isolation scx_enable() refuses to attach a BPF scheduler when isolcpus=domain is in effect by comparing housekeeping_cpumask(HK_TYPE_DOMAIN) against cpu_possible_mask. Since commit `27c3a5967f` ("sched/isolation: Convert housekeeping cpumasks to rcu pointers"), HK_TYPE_DOMAIN's cpumask is RCU protected and dereferencing it requires either RCU read lock, the cpu_hotplug write lock, or the cpuset lock; scx_enable() holds none of these, so booting with isolcpus=domain and attaching any BPF scheduler triggers the following lockdep splat: ============================= WARNING: suspicious RCU usage ----------------------------- kernel/sched/isolation.c:60 suspicious rcu_dereference_check() usage! 1 lock held by scx_flash/281: #0: ffffffff8379fce0 (update_mutex){+.+.}-{4:4}, at: bpf_struct_ops_link_create+0x134/0x1c0 Call Trace: dump_stack_lvl+0x6f/0xb0 lockdep_rcu_suspicious.cold+0x37/0x70 housekeeping_cpumask+0xcd/0xe0 scx_enable.isra.0+0x17/0x120 bpf_scx_reg+0x5e/0x80 bpf_struct_ops_link_create+0x151/0x1c0 __sys_bpf+0x1e4b/0x33c0 __x64_sys_bpf+0x21/0x30 do_syscall_64+0x117/0xf80 entry_SYSCALL_64_after_hwframe+0x77/0x7f In addition, commit `03ff735101` ("cpuset: Update HK_TYPE_DOMAIN cpumask from cpuset") made HK_TYPE_DOMAIN include cpuset isolated partitions as well, which means the current check also rejects BPF schedulers when a cpuset partition is active. That contradicts the original intent of commit `9f391f94a1` ("sched_ext: Disallow loading BPF scheduler if isolcpus= domain isolation is in effect"), which explicitly noted that cpuset partitions are honored through per-task cpumasks and should not be rejected. Switch to housekeeping_enabled(HK_TYPE_DOMAIN_BOOT), which reads only the housekeeping flag bit (no RCU dereference) and reflects exactly the boot-time isolcpus= configuration that the error message refers to. Fixes: `27c3a5967f` ("sched/isolation: Convert housekeeping cpumasks to rcu pointers") Cc: stable@vger.kernel.org # v7.0+ Signed-off-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Frederic Weisbecker <frederic@kernel.org>	2026-05-13 10:02:57 -10:00
sunshaojie	345f401666	cgroup/cpuset: Return only actually allocated CPUs during partition invalidation In update_parent_effective_cpumask() with partcmd_invalidate, the CPUs to return to the parent are computed as: adding = cpumask_and(tmp->addmask, xcpus, parent->effective_xcpus); where xcpus = user_xcpus(cs) which returns cs->exclusive_cpus (if set) or cs->cpus_allowed. When exclusive_cpus is not set, user_xcpus(cs) can contain CPUs that were never actually granted to the partition due to sibling exclusion in compute_excpus(). Consequently, the invalidation may return CPUs to the parent that remain in use by sibling partitions, causing overlapping effective_cpus and triggering the WARN_ON_ONCE(1) in generate_sched_domains(). Use cs->effective_xcpus instead, which reflects the CPUs actually granted to this partition. Reproducer (on a 4-CPU machine): cd /sys/fs/cgroup mkdir a1 b1 # a1 becomes partition root with CPUs 0-1 echo "0-1" > a1/cpuset.cpus echo "root" > a1/cpuset.cpus.partition # b1 becomes partition root with CPUs 1-2, but sibling exclusion # reduces its effective_xcpus to CPU 2 only echo "1-2" > b1/cpuset.cpus echo "root" > b1/cpuset.cpus.partition # b1 changes cpus_allowed to 0-1 -> partition invalidation echo "0-1" > b1/cpuset.cpus # Expected: CPUs 2-3 (only CPU 2 returned from b1) # Actual: CPUs 1-3 (CPU 0-1 returned, overlapping with a1) cat cpuset.cpus.effective dmesg will also show a WARNING from generate_sched_domains() reporting overlapping partition root effective_cpus. Fixes: `2a3602030d` ("cgroup/cpuset: Don't invalidate sibling partitions on cpuset.cpus conflict") Cc: stable@vger.kernel.org # v7.0+ Signed-off-by: sunshaojie <sunshaojie@kylinos.cn> Tested-by: Chen Ridong <chenridong@huaweicloud.com> Reviewed-by: Chen Ridong <chenridong@huaweicloud.com> Reviewed-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2026-05-13 08:54:53 -10:00
Linus Torvalds	e1914add27	Arm: * Add the pKVM side of the workaround for ARM's erratum 4193714, provided that the EL3 firmware does its part of the job. KVM will refuse to initialise otherwise. - Correctly handle 52bit VAs for guest EL2 stage-1 translations when running under NV with E2H==0. * Correctly deal with permission faults in guest_memfd memslots. * Fix the steal-time selftest after the infrastructure was reworked. * Make sure the host cannot pass a non-sensical clock update to the EL2 tracing infrastructure. * Appoint Steffen Eiden as a reviewer in anticipation of the KVM/s390 ability to run arm64 guests, which will inevitably lead to arm64 code being directly used on s390. * Make sure that EL2 is configured with both exception entry and exit being Context Synchronization Events. * Handle the current vcpu being NULL on EL2 panic. * Fix the selftest_vcpu memcache being empty at the point of donation or sharing. * Check that the memcache has enough capacity before engaging on the share/donate path. * Fix __deactivate_fgt() to use its parameter rather than a variable in the macro context. s390: * Fix array overrun with large amounts of PCI devices. x86: * Never use L0's PAUSE loop exiting while L2 is running, since it's unlikely that a nested guest will help solving the hypervisor's spinlock contention * Fix emulation of MOVNTDQA. * Fix typo in Xen hypercall tracepoint * Add back an optimization that was left behind when recently fixing a bug. * Add module parameter to disable CET, whose implementation seems to have issues. For now it remains enabled by default. Generic: * Reject offset causing an unsigned overflow in kvm_reset_dirty_gfn() Documentation: * Update stale links Selftests: * Fix guest_memfd_test with host page size > guest page size. -----BEGIN PGP SIGNATURE----- iQFIBAABCgAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmoEnNgUHHBib256aW5p QHJlZGhhdC5jb20ACgkQv/vSX3jHroPOeAgArZ60yQGH0TJipyNsaPt+m+IEGMZ/ UC1tRd384EJnwpjFfZOvwluNNxeFlSXlku7iEXHHveK1qqFXnh+WBXJ91ftfDK/+ OOqVBBziOyxI6Mbsm2S415kzOQ15atsrclrcGC4emSydgX+JASZ4nsGx6MDRPu/8 p4TNy3vD5wxe3UGttYElMoFcgT0N/HepMyvUlXohjcjl/hkgf5GL4yPc/TGuvdtz EJfmDRhJEwyzf4/Ut8tzX+LhNxSY2iBr5XBvC8XQMSJBVbU/CRGxUk28fEzo7ykx EHVOlkxgUN1zO0xh/8aMgRIZNDMveWupR2sJe6StCqOlcbBMI2oYFNnLfQ== =f8oe -----END PGP SIGNATURE----- Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm Pull kvm fixes from Paolo Bonzini: "arm64: - Add the pKVM side of the workaround for ARM's erratum 4193714, provided that the EL3 firmware does its part of the job. KVM will refuse to initialise otherwise - Correctly handle 52bit VAs for guest EL2 stage-1 translations when running under NV with E2H==0 - Correctly deal with permission faults in guest_memfd memslots - Fix the steal-time selftest after the infrastructure was reworked - Make sure the host cannot pass a non-sensical clock update to the EL2 tracing infrastructure - Appoint Steffen Eiden as a reviewer in anticipation of the KVM/s390 ability to run arm64 guests, which will inevitably lead to arm64 code being directly used on s390 - Make sure that EL2 is configured with both exception entry and exit being Context Synchronization Events - Handle the current vcpu being NULL on EL2 panic - Fix the selftest_vcpu memcache being empty at the point of donation or sharing - Check that the memcache has enough capacity before engaging on the share/donate path - Fix __deactivate_fgt() to use its parameter rather than a variable in the macro context s390: - Fix array overrun with large amounts of PCI devices x86: - Never use L0's PAUSE loop exiting while L2 is running, since it's unlikely that a nested guest will help solving the hypervisor's spinlock contention - Fix emulation of MOVNTDQA - Fix typo in Xen hypercall tracepoint - Add back an optimization that was left behind when recently fixing a bug - Add module parameter to disable CET, whose implementation seems to have issues. For now it remains enabled by default Generic: - Reject offset causing an unsigned overflow in kvm_reset_dirty_gfn() Documentation: - Update stale links Selftests: - Fix guest_memfd_test with host page size > guest page size" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (22 commits) KVM: VMX: introduce module parameter to disable CET KVM: x86: Swap the dst and src operand for MOVNTDQA KVM: x86: use again the flush argument of __link_shadow_page() KVM: selftests: Ensure gmem file sizes are multiple of host page size Documentation: kvm: update links in the references section of AMD Memory Encryption KVM: nSVM: Never use L0's PAUSE loop exiting while L2 is running KVM: x86: Fix Xen hypercall tracepoint argument assignment KVM: Reject wrapped offset in kvm_reset_dirty_gfn() KVM: arm64: Pre-check vcpu memcache for host->guest donate KVM: arm64: Pre-check vcpu memcache for host->guest share KVM: arm64: Seed pkvm_ownership_selftest vcpu memcache KVM: arm64: Fix __deactivate_fgt macro parameter typo KVM: arm64: Guard against NULL vcpu on VHE hyp panic path KVM: arm64: Make EL2 exception entry and exit context-synchronization events MAINTAINERS: Add Steffen as reviewer for KVM/arm64 KVM: arm64: Remove potential UB on nvhe tracing clock update KVM: selftests: arm64: Fix steal_time test after UAPI refactoring KVM: arm64: Handle permission faults with guest_memfd KVM: arm64: nv: Consider the DS bit when translating TCR_EL2 KVM: arm64: Work around C1-Pro erratum 4193714 for protected guests ...	2026-05-13 11:53:51 -07:00
Yu Miao	7d8f3158a5	selftests/cgroup: Fix error path leaks in test_percpu_basic When cg_name_indexed() returns NULL partway through the child creation loop, the code returned -1 without running cleanup_children and cleanup. That left the `parent` pathname allocation unreleased and did not remove child cgroup directories already created under the parent. Fix by jumping to cleanup_children instead of returning. When cg_create() fails, `child` (the pathname from cg_name_indexed()) was not freed before cleanup_children. Fix by freeing `child` before branching to cleanup_children. Fixes: `90631e1dea` ("kselftests: cgroup: add perpcu memory accounting test") Signed-off-by: Yu Miao <yumiao@kylinos.cn> Signed-off-by: Tejun Heo <tj@kernel.org>	2026-05-13 08:40:52 -10:00
Linus Torvalds	1f63dd8ca0	liveupdate fixes for v7.1-rc4 A few fixes for kexec handover and liveupdate: * make sure KHO is skipped for crash kernel * fix error reporting in memfd preservation if it fails mid-loop * don't allow preserving memfds whose page count exceeds UINT_MAX * fix documentation of memfd seals preservation to match the code -----BEGIN PGP SIGNATURE----- iQFEBAABCgAuFiEEeOVYVaWZL5900a/pOQOGJssO/ZEFAmoET5gQHHJwcHRAa2Vy bmVsLm9yZwAKCRA5A4Ymyw79kVfdB/99gLJy40MO9ZCHSxRQD9TE7Fbuv71flVuD wmDz43UOyDIEp+qCB0VcNQPG3v+UINygUMGHkhOG4fgKLm0bEORXIJHNr8sTXYYk LuxN8g+Xv1P/qkucEIXy1oB38okg9cORhlfrCOiwpWBjNt5/AqZYKWttDshuZiIM kjIKEDtTZ/nDLXjkWAa4Qs4MtBjqTVCrG3glSNHT0yiFDEkAejXbr4RZ/Ght/9pz FwHzTfdIOnecvOCD2OHVQx9TJluaP57mlxTkOXJV6OApg0wiHjohl0Xcerh+JfB4 HAdF7xpr5Sk/BQVc3ygsDKwfTVfB/eYMfCoyUkXg9AVhcoBXmhD0 =luI7 -----END PGP SIGNATURE----- Merge tag 'fixes-2026-05-13' of git://git.kernel.org/pub/scm/linux/kernel/git/liveupdate/linux Pull liveupdate fixes from Mike Rapoport: "A few fixes for kexec handover and liveupdate: - make sure KHO is skipped for crash kernel - fix error reporting in memfd preservation if it fails mid-loop - don't allow preserving memfds whose page count exceeds UINT_MAX - fix documentation of memfd seals preservation to match the code" * tag 'fixes-2026-05-13' of git://git.kernel.org/pub/scm/linux/kernel/git/liveupdate/linux: mm/memfd_luo: document preservation of file seals mm/memfd_luo: reject memfds whose page count exceeds UINT_MAX mm/memfd_luo: report error when restoring a folio fails mid-loop kho: skip KHO for crash kernel	2026-05-13 08:24:50 -07:00
Paolo Bonzini	2d5d3fc593	KVM: VMX: introduce module parameter to disable CET There have been reports of host hangs caused by CET virtualization. Until these are analyzed further, introduce a module parameter that makes it possible to easily disable it. Link: https://lore.kernel.org/all/85548beb-1486-40f9-beb4-632c78e3360b@proxmox.com/ Cc: David Riley <d.riley@proxmox.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2026-05-13 15:38:22 +02:00
Tejun Heo	cceb874eee	sched_ext: Defer sub_kset base put to scx_sched_free_rcu_work scx_sub_enable_workfn() pins parent->kobj before dropping scx_sched_lock, but that does not pin parent->sub_kset. Concurrent disable can kset_unregister and free sub_kset before scx_alloc_and_add_sched() dereferences it. Split sub_kset teardown: kobject_del() at disable keeps sysfs removal; defer kobject_put() to scx_sched_free_rcu_work so the memory survives. A racing child sees state_in_sysfs=0 with valid memory, sysfs_create_dir() fails, and the existing exit_kind gate in scx_link_sched() turns it away with -ENOENT. Fixes: `411d3ef1a7` ("sched_ext: Unregister sub_kset on scheduler disable") Signed-off-by: Tejun Heo <tj@kernel.org>	2026-05-12 11:28:56 -10:00
Tejun Heo	b273b75b8d	sched_ext: INIT_LIST_HEAD() &sch->all in scx_alloc_and_add_sched() On scx_link_sched() error paths (parent disabled, hash insert failure), &sch->all is never added to scx_sched_all. The cleanup path runs scx_unlink_sched() unconditionally, which calls list_del_rcu(&sch->all) on a list_head that was never initialized triggering a corruption warning. Initialize &sch->all. Fixes: `54be8de423` ("sched_ext: Factor out scx_link_sched() and scx_unlink_sched()") Signed-off-by: Tejun Heo <tj@kernel.org>	2026-05-12 11:28:56 -10:00
Paolo Bonzini	ef7e0c51d9	KVM: s390: pci: fix array indexing For large amounts of PCI devices its possible to overrun the arrays as the index was miscalculated in 2 places. -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEE+SKTgaM0CPnbq/vKEXu8gLWmHHwFAmn4p9UACgkQEXu8gLWm HHw7kA//cr8wtdVq2CWwvLpHIvpjRYmQDCApB2vIYPE1AECJqtddiJhq9TolT5rw +kqn3hcYmjVhqgqay2IukbuJXFfruPPK2UrF46NmGSxsc/iCglcefRoTOkvJsOXo wNzJ/Y7AzZNT1vTTm396vdb/8ACv2zuh073iowDFdRSDLMLt087rJNPf8MQkfxhj ZwIqOsGsl1p4WYnnwSy3E5ZsRxPK3kV/JGYvLQyAtx0PGMaTkbAB7KR/PmaxJPal IeawpKrpsGzvajXV2EfGVpisTSdKvJ2dwM7NQtX8Q0qVuDufYfsSNlbDKKl3lWIq 8y5wA2z9oAumZynejBeG46b/nq6Sbeq9lyTNk/52u9ED4RoNmh0s5K0FajD/f229 xx2XAwsLTrF3ojb0ynHfXKfyzBKMjYu/Y4LtE8bL/wi/BKfs9puoBFixnbvwMrDe J8zhxlQyLeZ7Z/hjSWP3UI6w+idA72Z0thf9Nrh0MjhqsKOW4TAD2ZRh/9KQ6B66 TcmCVe57ehp0aMJ/cqNhXBrvVSH7HL31F/g6Qj8CMiZzJlq3+mlPbWULlqiiBGLr Aoxytqlg6YquB8T7SPuopWNkmEU3B9edAn35sqz7Q6/1kzyWdLKMpJLZvU5gtQC1 KTxpm8aeLdmAzXcwuakdGin9wCT6VfLDEj+wo9qGSLr58wQdwco= =CGuC -----END PGP SIGNATURE----- Merge tag 'kvm-s390-master-7.1-1' of git://git.kernel.org/pub/scm/linux/kernel/git/kvms390/linux into HEAD KVM: s390: pci: fix array indexing For large amounts of PCI devices its possible to overrun the arrays as the index was miscalculated in 2 places.	2026-05-12 23:15:38 +02:00
Tejun Heo	39e25a2100	sched_ext: Drop NONE early return in scx_disable_and_exit_task() `d3e73a0808` ("sched_ext: Handle SCX_TASK_NONE in disable/switched_from paths") skipped the trailing scx_set_task_sched(p, NULL) on NONE tasks. After scx_fail_parent() parks a task at NONE/sched=parent and the parent is later freed via queue_rcu_work() during root_disable, the preserved p->scx.sched dangles - print_scx_info() from sched_show_task() reads sch->ops.name from freed memory. Drop the early return. __scx_disable_and_exit_task() already short- circuits on NONE and the SUB_INIT block was cleared by scx_fail_parent()'s earlier call, so clearing p->scx.sched is the only work left - and the one thing the path actually needs. v2: Extend the SUB_INIT block comment to note that the flag is only set on the sub-enable path, so it's always clear on the NONE re-entry (Andrea). Fixes: `d3e73a0808` ("sched_ext: Handle SCX_TASK_NONE in disable/switched_from paths") Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com>	2026-05-12 11:13:58 -10:00
Sean Christopherson	3098c076c8	KVM: x86: Swap the dst and src operand for MOVNTDQA Swap the MOVNTDQA operands, as MOVNTDQA does NOT in fact have "the same characteristics as 0F E7 (MOVNTDQ)"; MOVNTDQA loads from memory and stores to registers, while MOVNTDQ loads from registers and stores to memory. Per the SDM: MOVNTDQ - Move packed integer values in xmm1 to m128 using non-temporal hint. MOVNTDQA - Move double quadword from m128 to xmm1 using non-temporal hint if WC memory type. Reported-by: Josh Eads <josheads@google.com> Fixes: `c57d9bafbd` ("KVM: x86: Add support for emulating MOVNTDQA") Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com> Message-ID: <20260506213514.2781948-1-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2026-05-12 23:12:32 +02:00
Paolo Bonzini	6b72d0578c	KVM: x86: use again the flush argument of __link_shadow_page() Except in the case of parentless nested-TDP pages, mmu_page_zap_pte() clears the SPTE but leaves the invalid_list empty. In this case, using kvm_flush_remote_tlbs() as kvm_mmu_remote_flush_or_zap() does is overkill. Avoid flushing the entirety of the remote TLBs unless the invalid_list was populated: instead, use a more efficient gfn-targeting flush (if available) and skip it altogether if the caller guarantees that a TLB flush is not necessary. Based-on: <20260503201029.106481-1-pbonzini@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-ID: <20260503210917.121840-1-pbonzini@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2026-05-12 23:12:31 +02:00
Sean Christopherson	87c810160e	KVM: selftests: Ensure gmem file sizes are multiple of host page size When creating a guest_memfd file and associated memslot to validate shared guest memory, size the file+memslot to the maximum of the host or guest page size. Attempting to allocate a single guest page will fail if the host page size is greater than the guest page size, as KVM requires that the size of memslots and guest_memfd files are a multiple of the host page size. For simplicity, verify the entire file can be shared between guest and host, e.g. instead of trying to validate "partial" mappings. Fixes: `42188667be` ("KVM: selftests: Add guest_memfd testcase to fault-in on !mmap()'d memory") Reported-by: Zenghui Yu <zenghui.yu@linux.dev> Closes: https://lore.kernel.org/all/0064952b-048c-455d-ad89-e27e5cb82591@linux.dev Signed-off-by: Sean Christopherson <seanjc@google.com> Message-ID: <20260512155634.772602-1-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2026-05-12 22:26:10 +02:00
Paolo Bonzini	4a9ee4fc79	KVM/arm64 fixes for 7.1, take #2 - Add the pKVM side of the workaround for ARM's erratum 4193714, provided that the EL3 firmware does its part of the job. KVM will refuse to initialise otherwise. - Correctly handle 52bit VAs for guest EL2 stage-1 translations when running under NV with E2H==0. - Correctly deal with permission faults in guest_memfd memslots. - Fix the steal-time selftest after the infrastructure was reworked. - Make sure the host cannot pass a non-sensical clock update to the EL2 tracing infrastructure. - Appoint Steffen Eiden as a reviewer in anticipation of the KVM/s390 ability to run arm64 guests, which will inevitably lead to arm64 code being directly used on s390. - Make sure that EL2 is configured with both exception entry and exit being Context Synchronization Events. - Handle the current vcpu being NULL on EL2 panic. - Fix the selftest_vcpu memcache being empty at the point of donation or sharing. - Check that the memcache has enough capacity before engaging on the share/donate path. - Fix __deactivate_fgt() to use its parameter rather than a variable in the macro context. -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEn9UcU+C1Yxj9lZw9I9DQutE9ekMFAmn8sE0ACgkQI9DQutE9 ekOtxw/+Je7H1k+JyRfu8MNG3cFa8by+sQa3C1IAQfqdXaMV7nJZItFdd1eUvBW5 f3eyoi28j9zS+jr7uuvg6R9GjTQqkoWEVdPLWUgg9eeBatFAh2IroPf1a4md18d9 oJcR4fJrDfYYYkoppdIc6TG+t9R+y47ebE3RlBLXxiSlElHjKYt+hv2mrBs0v583 pW1Vm3BhfbmHExYPidCNkpGXdamzBZxbEx5hzzW/B3VQllxwlvG/B60BQOGxC1qt 1irGFSGmUJMOptHB2XZkWyXysMwwTPJrAiZ2e2oU3DiwTaPS+JOHa8G7oZy8CkEP h6dAWlW45bLWyxWoWxXkGdcRwvuIW1Vdd6SIiJN8SHoqSOoNIQdw4n0x8CghWHu7 FHKSBD8C4iX0Oa2WXlhUNqvM51cHNb72JAGp5UAOE3ffYRTap6/AcMV9WdCEdTCQ LKlqxj9JPgmf8CxZitWvwdGCgGCHr9hGTsIa71N1lql1Imw4vxNbCA8xt1AXqakp 0JpXdRLif6NgJ2dPd99W8aR3cbWxCu/sj23U3xug7cfr6Py8dRVydijmSWZeRonY XuzYJONQlHPmkbaEN4YTwtp5/stWn6+QghVKYt5X19+Nx8rr1gyRCEL38HmR38m0 Sc5QoymiLjW8IR/+9sjF7JbKFP510gMCQibRraDb62k9qbVsDmY= =7AW5 -----END PGP SIGNATURE----- Merge tag 'kvmarm-fixes-7.1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD KVM/arm64 fixes for 7.1, take #2 - Add the pKVM side of the workaround for ARM's erratum 4193714, provided that the EL3 firmware does its part of the job. KVM will refuse to initialise otherwise. - Correctly handle 52bit VAs for guest EL2 stage-1 translations when running under NV with E2H==0. - Correctly deal with permission faults in guest_memfd memslots. - Fix the steal-time selftest after the infrastructure was reworked. - Make sure the host cannot pass a non-sensical clock update to the EL2 tracing infrastructure. - Appoint Steffen Eiden as a reviewer in anticipation of the KVM/s390 ability to run arm64 guests, which will inevitably lead to arm64 code being directly used on s390. - Make sure that EL2 is configured with both exception entry and exit being Context Synchronization Events. - Handle the current vcpu being NULL on EL2 panic. - Fix the selftest_vcpu memcache being empty at the point of donation or sharing. - Check that the memcache has enough capacity before engaging on the share/donate path. - Fix __deactivate_fgt() to use its parameter rather than a variable in the macro context.	2026-05-12 22:19:20 +02:00
Ninad Naik	80f4a7b8ce	Documentation: kvm: update links in the references section of AMD Memory Encryption Replace non-working links in the reference section with the working ones. Signed-off-by: Ninad Naik <ninadnaik07@gmail.com> Link: https://patch.msgid.link/20260511174302.811918-1-ninadnaik07@gmail.com/ Reviewed-by: Liam Merwick <liam.merwick@oracle.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2026-05-12 22:17:42 +02:00
Sean Christopherson	5bd1ddb791	KVM: nSVM: Never use L0's PAUSE loop exiting while L2 is running Never use L0's (KVM's) PAUSE loop exiting controls while L2 is running, and instead always configure vmcb02 according to L1's exact capabilities and desires. The purpose of intercepting PAUSE after N attempts is to detect when the vCPU may be stuck waiting on a lock, so that KVM can schedule in a different vCPU that may be holding said lock. Barring a very interesting setup, L1 and L2 do not share locks, and it's extremely unlikely that an L1 vCPU would hold a spinlock while running L2. I.e. having a vCPU executing in L1 yield to a vCPU running in L2 will not allow the L1 vCPU to make forward progress, and vice versa. While teaching KVM's "on spin" logic to only yield to other vCPUs in L2 is doable, in all likelihood it would do more harm than good for most setups. KVM has limited visibility into which L2 "vCPUs" belong to the same VM, and thus share a locking domain. And even if L2 vCPUs are in the same VM, KVM has no visilibity into L2 vCPU's that are scheduled out by the L1 hypervisor. Furthermore, KVM doesn't actually steal PAUSE exits from L1. If L1 is intercepting PAUSE, KVM will route PAUSE exits to L1, not L0, as nested_svm_intercept() gives priority to the vmcb12 intercept. As such, overriding the count/threshold fields in vmcb02 with vmcb01's values is nonsensical, as doing so clobbers all the training/learning that has been done in L1. Even worse, if L1 is not intercepting PAUSE, i.e. KVM is handling PAUSE exits, then KVM will adjust the PLE knobs based on L2 behavior, which could very well be detrimental to L1, e.g. due to essentially poisoning L1 PLE training with bad data. And copying the count from vmcb02 to vmcb01 on a nested VM-Exit makes even less sense, because again, the purpose of PLE is to detect spinning vCPUs. Whether or not a vCPU is spinning in L2 at the time of a nested VM-Exit has no relevance as to the behavior of the vCPU when it executes in L1. The only scenarios where any of this actually works is if at least one of KVM or L1 is NOT intercepting PAUSE for the guest. Per the original changelog, those were the only scenarios considered to be supported. Disabling KVM's use of PLE makes it so the VM is always in a "supported" mode. Last, but certainly not least, using KVM's count/threshold instead of the values provided by L1 is a blatant violation of the SVM architecture. Fixes: `74fd41ed16` ("KVM: x86: nSVM: support PAUSE filtering when L0 doesn't intercept PAUSE") Cc: Maxim Levitsky <mlevitsk@redhat.com> Tested-by: David Kaplan <david.kaplan@amd.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Link: https://patch.msgid.link/20260508213321.373309-1-seanjc@google.com/ Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2026-05-12 22:17:28 +02:00
Qiang Ma	2b72f1674e	KVM: x86: Fix Xen hypercall tracepoint argument assignment TRACE_EVENT(kvm_xen_hypercall) stores a5 in __entry->a4 instead of __entry->a5. That overwrites the recorded a4 argument and leaves a5 unset in the trace entry. Fix the typo so both arguments are captured correctly. Signed-off-by: Qiang Ma <maqianga@uniontech.com> Link: https://patch.msgid.link/20260512015313.1685784-1-maqianga@uniontech.com/ Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2026-05-12 22:16:26 +02:00
Aaron Sacks	577a8d3bae	KVM: Reject wrapped offset in kvm_reset_dirty_gfn() kvm_reset_dirty_gfn() guards the gfn range with if (!memslot \|\| (offset + __fls(mask)) >= memslot->npages) return; but offset is u64 and the addition is unchecked. The check can be silently bypassed by a u64 wrap. The dirty ring backing those entries is MAP_SHARED at KVM_DIRTY_LOG_PAGE_OFFSET of the vcpu fd, so the VMM can rewrite the slot and offset fields of any entry between when the kernel pushes them and when KVM_RESET_DIRTY_RINGS consumes them. On reset, kvm_dirty_ring_reset() re-reads the values via READ_ONCE() and feeds them straight back into this check; only the flags handshake is treated as the handover, the slot/offset payload is taken on trust. Crafting two entries entry[i].offset = 0xffffffffffffffc1 entry[i+1].offset = 0 makes the coalescing loop in kvm_dirty_ring_reset() compute delta = (s64)(0 - 0xffffffffffffffc1) = 63 which falls in [0, BITS_PER_LONG), so it folds entry[i+1] into the existing mask by setting bit 63. The trailing kvm_reset_dirty_gfn() call then sees offset = 0xffffffffffffffc1 and __fls(mask) = 63; the sum is 0 in u64 and the bounds check passes. That offset propagates into kvm_arch_mmu_enable_log_dirty_pt_masked() unchanged. On the legacy MMU path -- kvm_memslots_have_rmaps() == true, i.e. shadow paging, any VM that has allocated shadow roots, or a write-tracked slot -- it reaches gfn_to_rmap(), which indexes slot->arch.rmap[0][] with a near-U64_MAX gfn. That is an out-of-bounds load of a kvm_rmap_head, followed by a conditional clear of PT_WRITABLE_MASK in whatever the loaded pointer points at. The path is reachable from any process holding /dev/kvm. Range-check offset on its own first, so the addition cannot wrap. memslot->npages is bounded well below U64_MAX, so once offset < npages holds, offset + __fls(mask) (with __fls(mask) < BITS_PER_LONG) stays in range. Fixes: `fb04a1eddb` ("KVM: X86: Implement ring-based dirty memory tracking") Cc: stable@vger.kernel.org Signed-off-by: Aaron Sacks <contact@xchglabs.com> Link: https://patch.msgid.link/20260512060742.1628959-1-contact@xchglabs.com/ Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2026-05-12 22:16:16 +02:00
Linus Torvalds	1d5dcaa3bd	Probes fixes for v7.1-rc3 - kprobes: skip non-symbol addresses in kprobe_add_ksym_blacklist() Since the ftrace adds its NOPs at .kprobes.text section (which stores an array), a wrong entry is added when loading a module which uses "__kprobes" attribute. To solve this, add "notrace" to __kprobes functions. - test_kprobes: clear kprobes between test runs Clear all kprobes in the test program after running a test set, because Kunit test can run several times. - fprobe: Fix unregister_fprobe() to wait for RCU grace period Since the fprobe data structure is removed with hlist_del_rcu(), it should wait for the RCU grace period. If the caller waits for RCU, we can use the async variant (e.g. eBPF) -----BEGIN PGP SIGNATURE----- iQFPBAABCgA5FiEEh7BulGwFlgAOi5DV2/sHvwUrPxsFAmoCf4QbHG1hc2FtaS5o aXJhbWF0c3VAZ21haWwuY29tAAoJENv7B78FKz8bAt8H/RiNH4k/20YKE2Z56GLy N+qCb8CO8L+AroNGCAj4KRVYtBLVzxBLf+Fcdfz6UM/jQ/k2UTeh6ysIt8iWCZYA 2vJBlVDvvjWPpEZW6yCxlpEAgU2B/Xv/92ZnQjW7sGvL75+gsA1dLu1Gt6lqM5zS X335PrIN3c4g+zhwCwW8wLCpMJvyk0qnXiN3thfXTCT/P9GPZldMEAAOecyLl7C3 Y/Zc8Af3xbMdqplIoYoKRWr0uzYBb1NB2FZR7Dp6i5/5MAhVYobd23s6VXWXZwxV FHRJ6R16vCK/ftnwtOiUeuiC3iXn21XQdma6pr2nI6bRhr5v/NBXxmh5U2+tRHeF /I4= =E/h6 -----END PGP SIGNATURE----- Merge tag 'probes-fixes-v7.1-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull probes fixes from Masami Hiramatsu: - kprobes: skip non-symbol addresses in kprobe_add_ksym_blacklist() Since the ftrace adds its NOPs at .kprobes.text section (which stores an array), a wrong entry is added when loading a module which uses "__kprobes" attribute. To solve this, add "notrace" to __kprobes functions - test_kprobes: clear kprobes between test runs Clear all kprobes in the test program after running a test set, because Kunit test can run several times - fprobe: Fix unregister_fprobe() to wait for RCU grace period Since the fprobe data structure is removed with hlist_del_rcu(), it should wait for the RCU grace period. If the caller waits for RCU, we can use the async variant (e.g. eBPF) * tag 'probes-fixes-v7.1-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: fprobe: Fix unregister_fprobe() to wait for RCU grace period test_kprobes: clear kprobes between test runs kprobes: skip non-symbol addresses in kprobe_add_ksym_blacklist()	2026-05-12 10:18:02 -07:00
Prathyushi Nangia	c21b90f776	x86/CPU/AMD: Prevent improper isolation of shared resources in Zen2's op cache Make sure resources are not improperly shared in the op cache and cause instruction corruption this way. Signed-off-by: Prathyushi Nangia <prathyushi.nangia@amd.com> Co-developed-by: Borislav Petkov (AMD) <bp@alien8.de> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Cc: stable@vger.kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2026-05-11 20:06:36 -07:00
Linus Torvalds	50897c9559	linux_kselftest-kunit-fixes-7.1-rc4 Fix to decouple KUNIT_DEBUGFS and KUNIT_ALL_TESTS options and fix KUNIT_DEBUGFS dependencies so it depends on DEBUG_FS without which it will not be useful. -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEPZKym/RZuOCGeA/kCwJExA0NQxwFAmoCUf0ACgkQCwJExA0N Qxxsog//Wd/HweNWaLJFG3z9QL4bCAl7cfDTV0uNAobnHge5ymqaKgWmXxi6Xpnt CVYOxDsqJ+Pt4AZedOZ8ZAx09rw0NkfbkRRdJMSvh+7EBJHNwguKxaWa99zcz96p O4qppKOWOeLWP5YDIDlDLK7atiDhBQjm/p5LdtDBa1oxnD3g2ETvu4mUmPwpNT1W t07duEbjLblvOL+hAZ9oRn56638pAyFQG3B9Gs7pjY/YsVWwEIpJnRmFjVu0ZdwU gSbxXYgsc8L/EKFWqsz1kfrZjb24s6z4amYCg12UMGBr1YVfSWp0CLkPjm+BbTjx SzSoPmUSg4g5m+nLwvGkcK9cGhEvAoyPsDdywcSPtE66vOdbkzdeRQ/ukzwcA+5o hm2eC4Z3GXIj+lw2yXzeLflrupUBzi2mZbj0Rcm/rn3h0fUmmDhQf9AoKbH4q/Q+ UKT/2U0R39WO/e7LTYMpcaajXCNFiAwl3eFpoJfYlWWYV56YgqIp4Q2/B9lmRsi1 YNTVJUNtesgtzVOUsRjlgjjKMJjXF6wZRVf8YU4e0Uj95gxEc9pxcilmoOnMqWFr VvutsERfhaCLbfRnEKx2MP9hsora66aqF3fdMaseM9JoNiuXdlXONIQSD8vAuOkR fVp38l2v/SvOb6zyVfAURFYPlPaXUfPGRHraeyQcBcXfew+B8vg= =QdSk -----END PGP SIGNATURE----- Merge tag 'linux_kselftest-kunit-fixes-7.1-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest Pull kunit fixes from Shuah Khan: "Fix to decouple KUNIT_DEBUGFS and KUNIT_ALL_TESTS options and fix KUNIT_DEBUGFS dependencies so it depends on DEBUG_FS without which it will not be useful" * tag 'linux_kselftest-kunit-fixes-7.1-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest: kunit: config: KUNIT_DEBUGFS should depend on DEBUG_FS kunit: config: Enable KUNIT_DEBUGFS by default	2026-05-11 15:38:49 -07:00
Tejun Heo	9a415cc537	sched_ext: Avoid UAF in scx_root_enable_workfn() init failure path In scx_root_enable_workfn(), put_task_struct(p) is called before scx_error() dereferences p->comm and p->pid. If the iterator's reference is the last drop, the task is freed synchronously and the deref becomes a UAF. Move put_task_struct() past scx_error(). Reported-by: Sashiko <sashiko-bot@kernel.org> Closes: https://lore.kernel.org/all/20260511214031.AF5E9C2BCB0@smtp.kernel.org/ Fixes: `f0e1a0643a` ("sched_ext: Implement BPF extensible scheduler class") Cc: stable@vger.kernel.org # v6.12+ Signed-off-by: Tejun Heo <tj@kernel.org>	2026-05-11 12:05:48 -10:00
Guopeng Zhang	5dd74441cb	cgroup/cpuset: Reserve DL bandwidth only for root-domain moves cpuset_can_attach() currently adds the bandwidth of all migrating SCHED_DEADLINE tasks to sum_migrate_dl_bw. If the source and destination cpuset effective CPU masks do not overlap, the whole sum is then reserved in the destination root domain. set_cpus_allowed_dl(), however, subtracts bandwidth from the source root domain only when the affinity change really moves the task between root domains. A DL task can move between cpusets that are still in the same root domain, so including that task in sum_migrate_dl_bw can reserve destination bandwidth without a matching source-side subtraction. Share the root-domain move test with set_cpus_allowed_dl(). Keep nr_migrate_dl_tasks counting all migrating deadline tasks for cpuset DL task accounting, but add to sum_migrate_dl_bw only for tasks that need a root-domain bandwidth move. Keep using the destination cpuset effective CPU mask and leave the broader can_attach()/attach() transaction model unchanged. Fixes: `2ef269ef1a` ("cgroup/cpuset: Free DL BW in case can_attach() fails") Cc: stable@vger.kernel.org # v6.10+ Signed-off-by: Guopeng Zhang <zhangguopeng@kylinos.cn> Reviewed-by: Waiman Long <longman@redhat.com> Acked-by: Juri Lelli <juri.lelli@redhat.com> Tested-by: Juri Lelli <juri.lelli@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2026-05-11 10:27:14 -10:00
Jann Horn	c1fa0bb633	exit: prevent preemption of oopsing TASK_DEAD task When an already-exiting task oopses, make_task_dead() currently calls do_task_dead() with preemption enabled. That is forbidden: do_task_dead() calls __schedule(), which has a comment saying "WARNING: must be called with preemption disabled!". If an oopsing task is preempted in do_task_dead(), between becoming TASK_DEAD and entering the scheduler explicitly, bad things happen: finish_task_switch() assumes that once the scheduler has switched away from a TASK_DEAD task, the task can never run again and its stack is no longer needed; but that assumption apparently doesn't hold if the dead task was preempted (the SM_PREEMPT case). This means that the scheduler ends up repeatedly dropping references on the dead task's stack, which can lead to use-after-free or double-free of the entire task stack; in other words, two tasks can end up running on the same stack, resulting in various kinds of memory corruption. (This does not just affect "recursively oopsing" tasks; it is enough to oops once during task exit, for example in a file_operations::release handler) Fixes: `7f80a2fd7d` ("exit: Stop poorly open coding do_task_dead in make_task_dead") Cc: stable@kernel.org Signed-off-by: Jann Horn <jannh@google.com> Acked-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2026-05-11 08:55:11 -07:00
Masami Hiramatsu (Google)	657b594b20	fprobe: Fix unregister_fprobe() to wait for RCU grace period Commit `4346ba1604` ("fprobe: Rewrite fprobe on function-graph tracer") changed fprobe to register struct fprobe to an rcu-hlist, but it forgot to wait for RCU GP. Thus there can be use-after-free if the fprobe is released right after unregistering. This can be happened on fprobe event and sample module code. To fix this issue, add synchronize_rcu() in unregister_fprobe(). Note that BPF is OK because fprobe is used as a part of bpf_kprobe_multi_link. This unregisters its fprobe in bpf_kprobe_multi_link_release() and it is deallocated via bpf_kprobe_multi_link_dealloc(), which is invoked from bpf_link_defer_dealloc_rcu_gp() RCU callback. For BPF, this also introduced unregister_fprobe_async() which does NOT wait for RCU grace priod. Link: https://lore.kernel.org/all/177813998919.256460.2809243930741138224.stgit@mhiramat.tok.corp.google.com/ Fixes: `4346ba1604` ("fprobe: Rewrite fprobe on function-graph tracer") Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>	2026-05-11 19:04:46 +09:00
Andrea Righi	86ecb1c1a1	sched_ext: Clear ops->priv on scx_alloc_and_add_sched() error paths scx_alloc_and_add_sched() can fail after @sch has been assigned to ops->priv. In those cases @sch is torn down (either via kfree() through the err_free_* chain or via kobject_put() -> scx_kobj_release() -> RCU work), but @ops->priv is left pointing at the about-to-be-freed pointer. With the recent -EBUSY gate in scx_root_enable_workfn() and scx_sub_enable_workfn() that rejects an attach when @ops->priv is still non-NULL, see commit `bbf30b383c` ("sched_ext: Fix ops->priv clobber on concurrent attach/detach"), a dangling @ops->priv permanently locks the kdata out: every future attach attempt sees a stale binding and returns -EBUSY even though no scheduler is actually attached. Clear @ops->priv on the post-assign failure paths so that the kdata returns to its pre-attach state when the function returns ERR_PTR(). Fixes: `bbf30b383c` ("sched_ext: Fix ops->priv clobber on concurrent attach/detach") Suggested-by: Tejun Heo <tj@kernel.org> Signed-off-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2026-05-10 22:50:31 -10:00
Guopeng Zhang	4a39eda5fd	cgroup/cpuset: Reset DL migration state on can_attach() failure cpuset_can_attach() accumulates temporary SCHED_DEADLINE migration state in the destination cpuset while walking the taskset. If a later task_can_attach() or security_task_setscheduler() check fails, cgroup_migrate_execute() treats cpuset as the failing subsystem and does not call cpuset_cancel_attach() for it. The partially accumulated state is then left behind and can be consumed by a later attach, corrupting cpuset DL task accounting and pending DL bandwidth accounting. Reset the pending DL migration state from the common error exit when ret is non-zero. Successful can_attach() keeps the state for cpuset_attach() or cpuset_cancel_attach(). Fixes: `2ef269ef1a` ("cgroup/cpuset: Free DL BW in case can_attach() fails") Cc: stable@vger.kernel.org # v6.10+ Signed-off-by: Guopeng Zhang <zhangguopeng@kylinos.cn> Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Chen Ridong <chenridong@huaweicloud.com> Reviewed-by: Waiman Long <longman@redhat.com>	2026-05-10 22:14:49 -10:00
Andrea Righi	bbf30b383c	sched_ext: Fix ops->priv clobber on concurrent attach/detach Under heavy concurrent attach/detach operations, scx_claim_exit() can trigger a NULL pointer dereference. This can be reproduced running the reload_loop kselftests inside a virtme-ng session: $ vng -v -- ./tools/testing/selftests/sched_ext/runner -t reload_loop ... BUG: kernel NULL pointer dereference, address: 0000000000000400 RIP: 0010:scx_claim_exit+0x3b/0x120 Call Trace: <TASK> bpf_scx_unreg+0x45/0xb0 bpf_struct_ops_map_link_dealloc+0x39/0x50 bpf_link_release+0x18/0x20 __fput+0x10b/0x2e0 __x64_sys_close+0x47/0xa0 The underlying race (diagnosed by Tejun Heo) is a stomp of @ops->priv, not a missing NULL check: T2 unreg(K) T1 reg(K) ----------- --------- sch = ops->priv = sch_b800 scx_disable; flush_disable_work [scx_root_disable: scx_root=NULL, mutex_unlock, state=DISABLED] mutex_lock; state ok scx_alloc_and_add_sched: ops->priv = sch_a800 scx_root = sch_a800; init=0 state=ENABLED; mutex_unlock [flush returns] RCU_INIT_POINTER(ops->priv, NULL) <-- clobbers sch_a800 kobject_put(sch_b800) T1 acquires scx_enable_mutex inside scx_root_disable()'s mutex_unlock window and starts a fresh attach on the same kdata, assigning sch_a800 to @ops->priv. T2 then continues out of scx_disable()/flush_disable_work and clobbers @ops->priv to NULL, leaking sch_a800; the bpf_link is gone but state stays SCX_ENABLED, so all future attaches fail with -EBUSY permanently. The next bpf_scx_unreg() on that kdata then reads NULL @ops->priv and dereferences it in scx_claim_exit(). Make @ops->priv the lifecycle binding: in scx_root_enable_workfn() and scx_sub_enable_workfn(), after the existing state check and still under scx_enable_mutex, refuse with -EBUSY if @ops->priv is non-NULL. This rejects an attempt to reuse a kdata that is still bound to a previous scheduler instance, closing the race without changing the unreg side. Fixes: `105dcd005b` ("sched_ext: Introduce scx_prog_sched()") Suggested-by: Tejun Heo <tj@kernel.org> Signed-off-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2026-05-10 21:40:03 -10:00
Andrea Righi	3788e32516	selftests/sched_ext: Fix build error in dequeue selftest Building the dequeue selftest with newer compilers (e.g., gcc 16) triggers the following error: dequeue.c:28:22: error: variable 'sum' set but not used The 'volatile' qualifier prevents the writes from being optimized away, but does not silence the unused variable 'sum' is indeed only written and never read. Consume 'sum' via an empty asm() with a register input constraint. This forces the compiler to keep the accumulated value (preserving the CPU stress loop) and avoiding the build error. Fixes: `658ad2259b` ("selftests/sched_ext: Add test to validate ops.dequeue() semantics") Signed-off-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2026-05-10 16:03:05 -10:00
Hongfu Li	2a3d7256fa	selftests/cgroup: Fix string comparison in write_test Use string comparison (!=) instead of numeric comparison (-ne) for cpuset values like "0-1". For example: $ [[ "0-1" != "2-3" ]] && echo "true" \|\| echo "false" true $ [[ "0-1" -ne "2-3" ]] && echo "true" \|\| echo "false" false Signed-off-by: Hongfu Li <lihongfu@kylinos.cn> Signed-off-by: Tejun Heo <tj@kernel.org>	2026-05-10 15:54:12 -10:00
Hongfu Li	e32e6f0216	selftests/cgroup: Fix cg_read_strcmp() empty string comparison cg_read_strcmp() allocated a buffer sized to strlen(expected) + 1, then passed it to read_text() which calls read(fd, buf, size-1). When comparing against an empty string (""), strlen("") = 0 gives a 1-byte buffer, and read() is asked to read 0 bytes. The file content is never actually read, so strcmp("", buf) always returns 0 regardless of the real content. This caused cg_test_proc_killed() to always report the cgroup as empty immediately, making OOM tests pass without verifying that processes were killed. Signed-off-by: Hongfu Li <lihongfu@kylinos.cn> Signed-off-by: Tejun Heo <tj@kernel.org>	2026-05-10 15:53:44 -10:00
Guopeng Zhang	796ad62204	cgroup/dmem: Return -ENOMEM on failed pool preallocation get_cg_pool_unlocked() handles allocation failures under dmemcg_lock by dropping the lock, preallocating a pool with GFP_KERNEL, and retrying the locked lookup and creation path. If the fallback allocation fails too, pool remains NULL. Since the loop condition is while (!pool), the function can keep retrying instead of propagating the allocation failure to the caller. Set pool to ERR_PTR(-ENOMEM) when the fallback allocation fails so the loop exits through the existing common return path. The callers already handle ERR_PTR() from get_cg_pool_unlocked(), so this restores the expected error path. Fixes: `b168ed458d` ("kernel/cgroup: Add "dmem" memory accounting cgroup") Cc: stable@vger.kernel.org # v6.14+ Signed-off-by: Guopeng Zhang <zhangguopeng@kylinos.cn> Signed-off-by: Tejun Heo <tj@kernel.org>	2026-05-10 15:43:46 -10:00
Linus Torvalds	5d6919055d	Linux 7.1-rc3	2026-05-10 14:08:09 -07:00
Tejun Heo	d3e73a0808	sched_ext: Handle SCX_TASK_NONE in disable/switched_from paths scx_fail_parent() leaves cgroup tasks at (state=NONE, sched=parent, sched_class=ext) until the parent itself is torn down by the scx_error() it raised. When the later root_disable iterates them, two paths trip on NONE. scx_disable_and_exit_task() re-enters the wrapper at NONE: the inner switch returns early but the trailing scx_set_task_sched(p, NULL) clobbers the parent sched left by scx_fail_parent(), and scx_set_task_state(p, NONE) wastes a write on an already-NONE task. switched_from_scx() then calls scx_disable_task(), which WARNs on non-ENABLED state and writes state=READY, producing a NONE -> READY transition the validation matrix rejects. Treat NONE as "nothing to do" in both paths. Add a NONE early-return at the top of scx_disable_and_exit_task() and a parallel NONE check in switched_from_scx() next to task_dead_and_done(). Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com>	2026-05-10 10:08:16 -10:00
Tejun Heo	cd6aab7367	sched_ext: Close sub-sched init race with post-init DEAD recheck scx_sub_enable_workfn()'s init pass and scx_sub_disable() migration both drop the rq lock to call __scx_init_task() against the other sched. A TASK_DEAD @p can fall through sched_ext_dead() in that window. sched_ext_dead() runs ops.exit_task() on the sched @p was attached to, not on the sched whose init just completed, so the new allocation leaks. Reuse the DEAD signal set by sched_ext_dead(). After __scx_init_task() returns, take task_rq_lock(p) and check for DEAD; on hit, call scx_sub_init_cancel_task() against the sub sched the init ran for and drop @p; on miss, proceed as before. Reported-by: zhidao su <suzhidao@xiaomi.com> Link: https://lore.kernel.org/all/20260429133155.3825247-1-suzhidao@xiaomi.com/ Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com>	2026-05-10 10:08:16 -10:00
Tejun Heo	c941d7391f	sched_ext: Close root-enable vs sched_ext_dead() race with SCX_TASK_INIT_BEGIN scx_root_enable_workfn() drops the iter rq lock for ops.init_task() and a TASK_DEAD @p can fall through sched_ext_dead() in that window. The race hits when sched_ext_dead() observes SCX_TASK_INIT (the intermediate state before @p->scx.sched is published) and dereferences NULL via SCX_HAS_OP(NULL, exit_task), or observes SCX_TASK_NONE during the unlocked init window and skips cleanup so exit_task() never runs. Add SCX_TASK_INIT_BEGIN. The enable path writes NONE -> INIT_BEGIN under the iter rq lock, then takes the rq lock again after init to walk INIT_BEGIN -> INIT -> READY. sched_ext_dead() that wins the rq-lock race observes INIT_BEGIN and sets DEAD without calling into ops; the post-init recheck unwinds via scx_sub_init_cancel_task(). scx_fork() runs single-threaded against sched_ext_dead() (the task is not on scx_tasks until scx_post_fork() adds it) so its INIT_BEGIN -> INIT walk needs no rq-lock pairing; it rolls back to NONE on ops.init_task() failure. The validation matrix grows the INIT_BEGIN row and the INIT_BEGIN -> DEAD edge; INIT now requires INIT_BEGIN as the predecessor. scx_sub_disable()'s migration writes INIT_BEGIN as a synthetic predecessor to satisfy the tightened verification. The sub-sched paths still race with sched_ext_dead() during the unlocked init window. This will be fixed by the next patch. Reported-by: zhidao su <suzhidao@xiaomi.com> Link: https://lore.kernel.org/all/20260429133155.3825247-1-suzhidao@xiaomi.com/ Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com>	2026-05-10 10:08:16 -10:00
Tejun Heo	cceb8fa9cb	sched_ext: Replace SCX_TASK_OFF_TASKS flag with SCX_TASK_DEAD state SCX_TASK_OFF_TASKS marked tasks already through sched_ext_dead() so cgroup task iteration would skip them. This can be expressed better with a task state. Replace the flag with SCX_TASK_DEAD. scx_disable_and_exit_task() resets state to NONE on its way out, so sched_ext_dead() now sets DEAD after the wrapper returns. The validation matrix grows NONE -> DEAD, warns on DEAD -> NONE, and tightens READY's predecessor to INIT or ENABLED so the new DEAD value cannot silently transition to READY. Prepares for the following enable vs dead race fix. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com>	2026-05-10 10:08:16 -10:00
Tejun Heo	938dd9ab2b	sched_ext: Inline scx_init_task() and move RESET_RUNNABLE_AT into scx_set_task_state() Prepare for the SCX_TASK_INIT_BEGIN/DEAD work that follows by collapsing the scx_init_task() helper. Move the SCX_TASK_RESET_RUNNABLE_AT setting into scx_set_task_state() on the INIT transition (it was set unconditionally at every INIT site through the scx_init_task() helper), inline scx_init_task() into scx_fork() and scx_root_enable_workfn(), and drop the helper. As a side effect, scx_sub_disable() migration sequence now also sets RESET_RUNNABLE_AT (it previously wrote INIT directly without going through scx_init_task()). The flag triggers a runnable_at reset on the next set_task_runnable(), which is harmless on a task that has just been moved between scheds. On root-enable, p->scx.flags is written without the task's rq lock. The task isn't visible to scx yet, and a follow-up patch restores the lock-held write. v2: Note p->scx.flags rq-lock relaxation on root-enable path. (Andrea) Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com>	2026-05-10 10:08:16 -10:00
Tejun Heo	6947bea4b7	sched_ext: Cleanups in preparation for the SCX_TASK_INIT_BEGIN/DEAD work Cleanups in preparation for the state-machine work that follows: - Convert three sub-sched call sites that open-code rcu_assign_pointer(p->scx.sched, ...) to scx_set_task_sched(). - Move scx_get_task_state()/scx_set_task_state() above the SCX task iter section so scx_task_iter_next_locked() can use them without a forward declaration. No functional change. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com>	2026-05-10 10:08:16 -10:00
Linus Torvalds	afaa0a4770	- Fix a string leak in the versalnet driver -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmoA1dYACgkQEsHwGGHe VUrDjQ/+LiHor5aRMUwG96jtBAyxAPhFVYFwienHALDAdz44vZzvmmR2/6jJzQAb xjlbhNxNCCP5HnTqq/nt2bjfx5re9KIMr55HShDhFsdphGbdSd70/4ErRHKTrMxE 1i6gemqFH10bgj0bdb3nLdqL1xx/vNSBuBbLilzj6MQzgX/tgv/CqvOw+ZwLGCQc 4kMHPRA6H/wDe2A+S+WOSoKCwcQ5GuQSTYWSiBZkODUw+BTpQlcsFwfy1ZcR/YTl NAXX9JmyzlC2R3yrkVzh/6c1fT7ir+tOH99hjnKleI3krcSgDhCsqr4okguUcm5p aFAHghmu8D/8z/5qxIimCXUa90QQMrPLZkPd+iPJipKFMbYFN0NC2xcBLnKw3oQf AF9af3ww3U4qBseCNEE40gWaibalVFtpRYIRDIbuDt62/U5ngd8TlLC45wNUcDLa w+QogyPYCRDfelKy0uysoIrUJcKBEVMfuXk3ddlZF4BUEAih2EQFF+CWbyKUqQXn HmltHhveRIfFC4FsDFIu3qUEH92sugcFpcTl1Cj21LjeKtfP7Pxr5roJaZ8uqVKC INEv6zBhhcQh5OJdtrhDIUk5PH/oAdniH7OaaxM1qBOMuHk6fleLQGHRo7pZSMrQ +yXsmwLl4FAD18IVrUXWPr/iqmKUvFnjISYbflpv+DmqdkObrIE= =24jB -----END PGP SIGNATURE----- Merge tag 'edac_urgent_for_v7.1_rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras Pull EDAC fix from Borislav Petkov: - Fix a string leak in the versalnet driver * tag 'edac_urgent_for_v7.1_rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras: EDAC/versalnet: Fix device name memory leak	2026-05-10 12:21:57 -07:00
Hyunwoo Kim	aa54b1d27f	rxrpc: Also unshare DATA/RESPONSE packets when paged frags are present The DATA-packet handler in rxrpc_input_call_event() and the RESPONSE handler in rxrpc_verify_response() copy the skb to a linear one before calling into the security ops only when skb_cloned() is true. An skb that is not cloned but still carries externally-owned paged fragments (e.g. SKBFL_SHARED_FRAG set by splice() into a UDP socket via __ip_append_data, or a chained skb_has_frag_list()) falls through to the in-place decryption path, which binds the frag pages directly into the AEAD/skcipher SGL via skb_to_sgvec(). Extend the gate to also unshare when skb_has_frag_list() or skb_has_shared_frag() is true. This catches the splice-loopback vector and other externally-shared frag sources while preserving the zero-copy fast path for skbs whose frags are kernel-private (e.g. NIC page_pool RX, GRO). The OOM/trace handling already in place is reused. Fixes: `d0d5c0cd1e` ("rxrpc: Use skb_unshare() rather than skb_cow_data()") Cc: stable@vger.kernel.org Signed-off-by: Hyunwoo Kim <imv4bel@gmail.com> Reviewed-by: Jiayuan Chen <jiayuan.chen@linux.dev> Acked-by: David Howells <dhowells@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2026-05-10 08:15:57 -07:00
Linus Torvalds	a1a10cdbc6	Fixes for clk drivers: - Mark the DDR bus clk critical in the SpaceMiT driver so that boot doesn't fail - Fix boot on Mobile EyeQ by creating the auxiliary device for the ethernet PHY - Plug an OF node leak in Rockchip rk808 clk driver -----BEGIN PGP SIGNATURE----- iQJIBAABCAAyFiEE9L57QeeUxqYDyoaDrQKIl8bklSUFAmoAnnsUHHN3Ym95ZEBj aHJvbWl1bS5vcmcACgkQrQKIl8bklSXTgA//VV/2VvDcWpXLa+Jr05D5j6/JgkSx P70ssFqVc3mHHfCtlrAHmU+xaLI37C2jDMotOMt1hfwy8CTs9BpP8L5IMbIo0tpg m7agI4fSUnwZdU5hCh6o9BEcY3KHEWqMeXdXbFXuwqS+/+4pTNOYVpGYLfB8rgZo qW2VpsK0rrhNC82V3C86pdoC99gHK+fu1+MeKrh9DcNL1+wt89Eh60Fl0G+UfrjJ 0fuIohtsp8W+ciQHg70oBRurmePLoWvWFqmH/kEUvftNU38SnjVT4V7FY6DBDOp3 9sAl3sHsnaWoXIt6fx6YujFXiOUgN5hMSaXQ+uGcH9t+6qxNUtlh0hAEolAvEPe8 SfjByQ3PClUCSu0Gnf6gPu9IBFXTDfWPH6tCk7Du3CY5HnISdQXdagpElhjP6N3B PGUQJF4oK7W1bs0ryYh3OYHG94nybncz1tJrCipPxmrY1PzZAbvdT7E0lickO35F MeEeg2xx3iALhK6koMaOuCEobrxeq5aG52qVqnKixupm1vLwPMxBtxhaEIUkjBZR I7k/qcZoDFXxSnzXdk6TXjbZl6JVJUy0tl3yxIwqVkZVapnGNylsS85psNhy8ovg PoJUENmKN9AjevtqW2THy77kaqutanYsd6AqMWqpvlux5scQBwJXXVzQTDW0yf1a LrXCrJQQFmJgPJs= =yAV8 -----END PGP SIGNATURE----- Merge tag 'clk-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux Pull clk driver fixes from Stephen Boyd: - Mark the DDR bus clk critical in the SpaceMiT driver so that boot doesn't fail - Fix boot on Mobile EyeQ by creating the auxiliary device for the ethernet PHY - Plug an OF node leak in Rockchip rk808 clk driver * tag 'clk-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux: clk: rk808: fix OF node reference imbalance MAINTAINERS: add myself as a reviewer for the clk subsystem reset: eyeq: drop device_set_of_node_from_dev() done by parent clk: eyeq: add EyeQ5 children auxiliary device for generic PHYs clk: eyeq: use the auxiliary device creation helper clk: spacemit: k3: mark top_dclk as CLK_IS_CRITICAL	2026-05-10 08:10:47 -07:00
Linus Torvalds	515186b7be	bpf-fixes -----BEGIN PGP SIGNATURE----- iQIzBAABCAAdFiEE+soXsSLHKoYyzcli6rmadz2vbToFAmn/wmgACgkQ6rmadz2v bTosmhAAgYkQLg7zVQdruoSYb7Vzjz1Di4tM2rBXNIX4S7dvfZUGGBNzFV1lWobk /r6269llSnPKXofs+69LDVCpdvUXmGRmS7+bq+bxV7WVmg7JruVOTWg839jValJK cY3IQi0lZ9GVKaePI5C2XxBS3rCrdQmby91fcfp5C6A/gR6m7PzAlnoIuJ2SQx6A 7tsxxJb4wRtFWPBp7ClbBo7MAMIzPse/6CzsA2eP+icyJC+De9WGYs6bTDNi7vpY +eul0HMyHLTszJe/AGrsu5Ky3S6l+CTydi1fAUSOnk1pYHHhRvvD2WV8ix05/0rO 2looZl6ogpcisCm1i8HN8g1ST0tS74x3bL9kjvB/hhKGh6K1QpU6/drEvmJqKMAu fspYHD3qO+OXN7EV7tFZ1ErJvJZ7zT7UP0JxirAK1DFQZWrki/tJKehSD6gbir8R GwwZctXDOPTGADBsdqbxEPEAp1gVTvDXf04k6GOCLkzqqYBMVKdW/8GXN+6Itr+O nxxoC0SOOkW7rRlJaxuJd5+kpaCKOuK9FaXWONOn7HPzBgK0E0CL9g3+cZcS1QvI 2/5utfFj0gMeo40ZDjCyDWXm7w+AnTSKMMapB5pyi0FY3AVtroSV88HNbpm7DJrs xp9jO5ZD6EQ9Wn1cufOYAkrgZYwTZL5Z2EqyKcoJUIk1ZjpQbXg= =x/fg -----END PGP SIGNATURE----- Merge tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf Pull bpf fixes from Alexei Starovoitov: - Fix sk_local_storage diag dump via netlink (Amery Hung) - Fix off-by-one in arena direct-value access (Junyoung Jang) - Reject TCP_NODELAY in bpf-tcp congestion control (KaFai Wan) - Fix type confusion in bpf__sock() (Kuniyuki Iwashima) - Reject TX-only AF_XDP sockets (Linpu Yu) - Don't run arg-tracking analysis twice on main subprog (Paul Chaignon) - Fix NULL pointer dereference in bpf_sk_storage_clone and fib lookup (Weiming Shi) tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf: bpf: Fix off-by-one boundary validation in arena direct-value access xskmap: reject TX-only AF_XDP sockets bpf: Don't run arg-tracking analysis twice on main subprog bpf: Free reuseport cBPF prog after RCU grace period. bpf: tcp: Fix type confusion in sol_tcp_sockopt(). bpf: tcp: Fix type confusion in bpf_skc_to_tcp6_sock(). bpf: tcp: Fix type confusion in bpf_skc_to_tcp_sock(). mptcp: bpf: Fix type confusion in bpf_mptcp_sock_from_subflow() selftest: bpf: Add test for bpf_tcp_sock() and RAW socket. bpf: tcp: Fix type confusion in bpf_tcp_sock(). tools/headers: Regenerate stddef.h to fix BPF selftests bpf: Fix sk_local_storage diag dumping uninitialized special fields bpf: Fix NULL pointer dereference in bpf_skb_fib_lookup() sockmap: Fix sk_psock_drop() race vs sock_map_{unhash,close,destroy}(). bpf: Fix NULL pointer dereference in bpf_sk_storage_clone and diag paths selftests/bpf: Verify bpf-tcp-cc rejects TCP_NODELAY selftests/bpf: Test TCP_NODELAY in TCP hdr opt callbacks bpf: Reject TCP_NODELAY in bpf-tcp-cc bpf: Reject TCP_NODELAY in TCP header option callbacks	2026-05-09 18:42:54 -07:00
Junyoung Jang	3ac1a467e3	bpf: Fix off-by-one boundary validation in arena direct-value access BPF_MAP_TYPE_ARENA accepts BPF_PSEUDO_MAP_VALUE offsets at exactly the end of the arena mapping (off == arena_size). The boundary check in arena_map_direct_value_addr() uses `>` instead of `>=`, which incorrectly allows a one-past-end pointer to be accepted. Change the condition to `>=` to correctly reject offsets that fall outside the valid arena user_vm range. Fixes: `317460317a` ("bpf: Introduce bpf_arena.") Signed-off-by: Junyoung Jang <graypanda.inzag@gmail.com> Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com> Link: https://lore.kernel.org/r/20260426172505.1947915-1-graypanda.inzag@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-05-09 16:18:39 -07:00
Linpu Yu	bf6d507f7e	xskmap: reject TX-only AF_XDP sockets XSKMAP entries are used as redirect targets for incoming XDP frames. A TX-only AF_XDP socket lacks an Rx ring and cannot handle redirected traffic, but xsk_map_update_elem() currently allows such sockets to be inserted into the map. Redirecting packets to such a socket on the veth generic-XDP path causes a kernel crash in xsk_generic_rcv(). This became possible after xsk_is_setup_for_bpf_map() was removed from the XSKMAP update path, which allowed bound TX-only sockets to be inserted into the map. Reject TX-only sockets during XSKMAP updates to avoid the crash. They remain fully operational for pure Tx purposes outside XSKMAP. Fixes: `968be23cea` ("xsk: Fix possible segfault at xskmap entry insertion") Reported-by: Juefei Pu <tomapufckgml@gmail.com> Reported-by: Yuan Tan <yuantan098@gmail.com> Reported-by: Xin Liu <bird@lzu.edu.cn> Signed-off-by: Yifan Wu <yifanwucs@gmail.com> Signed-off-by: Linpu Yu <linpu5433@gmail.com> Reviewed-by: Jason Xing <kerneljasonxing@gmail.com> Link: https://lore.kernel.org/r/20260508144344.694-1-linpu5433@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-05-09 16:17:01 -07:00
Paul Chaignon	512809bb8a	bpf: Don't run arg-tracking analysis twice on main subprog Because subprog 0, the main subprog, is considered a global function, we end up running the arg-tracking dataflow analysis twice on it. That results in slightly longer verification but mostly in more verbose verifier logs. This patch fixes it by keeping only the iteration over global subprogs. When running over all of Cilium's programs with BPF_LOG_LEVEL2, this reduces verbosity by ~20% on average. Fixes: `bf0c571f7f` ("bpf: introduce forward arg-tracking dataflow analysis") Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/e4d7b53d4963ef520541a782f5fc8108a168877c.1778176504.git.paul.chaignon@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-05-09 16:12:40 -07:00

1 2 3 4 5 ...

1445145 Commits