Commit Graph

1445145 Commits

Author SHA1 Message Date
Linus Torvalds
31e62c2ebb ptrace: slightly saner 'get_dumpable()' logic
The 'dumpability' of a task is fundamentally about the memory image of
the task - the concept comes from whether it can core dump or not - and
makes no sense when you don't have an associated mm.

And almost all users do in fact use it only for the case where the task
has a mm pointer.

But we have one odd special case: ptrace_may_access() uses 'dumpable' to
check various other things entirely independently of the MM (typically
explicitly using flags like PTRACE_MODE_READ_FSCREDS).  Including for
threads that no longer have a VM (and maybe never did, like most kernel
threads).

It's not what this flag was designed for, but it is what it is.

The ptrace code does check that the uid/gid matches, so you do have to
be uid-0 to see kernel thread details, but this means that the
traditional "drop capabilities" model doesn't make any difference for
this all.

Make it all make a *bit* more sense by saying that if you don't have a
MM pointer, we'll use a cached "last dumpability" flag if the thread
ever had a MM (it will be zero for kernel threads since it is never
set), and require a proper CAP_SYS_PTRACE capability to override.
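
The rule described above can be sketched as a small userspace model. This is an illustration only, not the kernel's code: `struct mm_model`, `struct task_model`, the `last_dumpable` field name and `model_may_access()` are hypothetical, and the uid/gid matching that the ptrace code also performs is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for kernel state; none of these are real kernel names. */
struct mm_model {
	bool dumpable;           /* dumpability of the live memory image */
};

struct task_model {
	struct mm_model *mm;     /* NULL for kernel threads and for tasks without a VM */
	bool last_dumpable;      /* cached while the task had an mm; stays false if never set */
};

/*
 * Dumpability part of the access decision in this model:
 *  - with an mm, use the mm's dumpable state;
 *  - without an mm, fall back to the cached value;
 *  - a non-dumpable target can only be accessed with CAP_SYS_PTRACE.
 */
static bool model_may_access(const struct task_model *t, bool has_cap_sys_ptrace)
{
	assert(t != NULL);

	bool dumpable = t->mm ? t->mm->dumpable : t->last_dumpable;

	return dumpable || has_cap_sys_ptrace;
}
```

In this model a kernel thread (no mm, cached flag never set) is only accessible to a caller holding the capability, while a task that once had a dumpable mm keeps its last cached answer.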

Reported-by: Qualys Security Advisory <qsa@qualys.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Kees Cook <kees@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2026-05-14 08:32:11 -07:00
Linus Torvalds
59a62ea458 sched_ext: Fixes for v7.1-rc3
Bulk is hardening of the new sub-scheduler infrastructure.
 
 - UAFs and lifecycle bugs on the sub-sched attach/detach paths: parent
   sub_kset freed under a racing child, list_del_rcu on an uninitialized
   list head, ops->priv stomped by concurrent attach/detach, and a UAF in
   the init-failure error path.
 
 - Task state-machine reorg closing concurrent enable-vs-dead races: a
   task exiting during the unlocked init window could trip NULL ops
   derefs or skip exit_task() cleanup.
 
 - A scx_link_sched() self-deadlock on scx_sched_lock.
 
 - isolcpus: stop dereferencing the now-RCU-protected HK_TYPE_DOMAIN
   cpumask without RCU, and stop rejecting BPF schedulers when only
   cpuset isolated partitions are active.
 
 - PREEMPT_RT: disable irq_work runs in hardirq context so dumps show the
   failing task rather than the irq_work kthread.
 
 - Assorted !CONFIG_EXT_SUB_SCHED, randconfig, and selftest build fixes.

Merge tag 'sched_ext-for-7.1-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext

Pull sched_ext fixes from Tejun Heo:
 "The bulk of this is hardening of the new sub-scheduler infrastructure.

   - UAFs and lifecycle bugs on the sub-sched attach/detach paths:
     parent sub_kset freed under a racing child, list_del_rcu on an
     uninitialized list head, ops->priv stomped by concurrent
     attach/detach, and a UAF in the init-failure error path

   - Task state-machine reorg closing concurrent enable-vs-dead races: a
     task exiting during the unlocked init window could trip NULL ops
     derefs or skip exit_task() cleanup

   - A scx_link_sched() self-deadlock on scx_sched_lock

   - isolcpus: stop dereferencing the now-RCU-protected HK_TYPE_DOMAIN
     cpumask without RCU, and stop rejecting BPF schedulers when only
     cpuset isolated partitions are active

   - PREEMPT_RT: disable irq_work runs in hardirq context so dumps show
     the failing task rather than the irq_work kthread

   - Assorted !CONFIG_EXT_SUB_SCHED, randconfig, and selftest build
     fixes"

* tag 'sched_ext-for-7.1-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext:
  sched_ext: Use HK_TYPE_DOMAIN_BOOT to detect isolcpus= domain isolation
  sched_ext: Defer sub_kset base put to scx_sched_free_rcu_work
  sched_ext: INIT_LIST_HEAD() &sch->all in scx_alloc_and_add_sched()
  sched_ext: Drop NONE early return in scx_disable_and_exit_task()
  sched_ext: Avoid UAF in scx_root_enable_workfn() init failure path
  sched_ext: Clear ops->priv on scx_alloc_and_add_sched() error paths
  sched_ext: Fix ops->priv clobber on concurrent attach/detach
  selftests/sched_ext: Fix build error in dequeue selftest
  sched_ext: Handle SCX_TASK_NONE in disable/switched_from paths
  sched_ext: Close sub-sched init race with post-init DEAD recheck
  sched_ext: Close root-enable vs sched_ext_dead() race with SCX_TASK_INIT_BEGIN
  sched_ext: Replace SCX_TASK_OFF_TASKS flag with SCX_TASK_DEAD state
  sched_ext: Inline scx_init_task() and move RESET_RUNNABLE_AT into scx_set_task_state()
  sched_ext: Cleanups in preparation for the SCX_TASK_INIT_BEGIN/DEAD work
  sched_ext: Use IRQ_WORK_INIT_HARD() to initialize sch->disable_irq_work
  sched_ext: Fix !CONFIG_EXT_SUB_SCHED build warnings
  sched_ext: Drop unused scx_find_sub_sched() stub
  sched_ext: Move scx_error() out of scx_link_sched()'s lock region
2026-05-13 15:00:40 -07:00
Linus Torvalds
0913b580f8 cgroup: Fixes for v7.1-rc3
- cpuset fixes:
   - Partition invalidation could return CPUs still in use by sibling
     partitions, producing overlapping effective_cpus.
   - cpuset_can_attach() over-reserved DL bandwidth on moves that stayed
     within the same root domain.
   - Pending DL migration state leaked into later attaches when a later
     can_attach() check failed.
   - Reorder PF_EXITING and __GFP_HARDWALL checks so dying tasks can
     allocate from any node and exit quickly.
 
 - dmem: propagate -ENOMEM instead of spinning forever when the fallback
   pool allocation also fails.
 
 - selftests/cgroup: percpu test error-path leak, bogus numeric
   comparison of cpuset strings, and a zero-length read() that silently
   passed OOM-kill tests.

Merge tag 'cgroup-for-7.1-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup

Pull cgroup fixes from Tejun Heo:

 - cpuset fixes:
     - Partition invalidation could return CPUs still in use by sibling
       partitions, producing overlapping effective_cpus
     - cpuset_can_attach() over-reserved DL bandwidth on moves that
       stayed within the same root domain
     - Pending DL migration state leaked into later attaches when a
       later can_attach() check failed
     - Reorder PF_EXITING and __GFP_HARDWALL checks so dying tasks can
       allocate from any node and exit quickly

 - dmem: propagate -ENOMEM instead of spinning forever when the fallback
   pool allocation also fails

 - selftests/cgroup: percpu test error-path leak, bogus numeric
   comparison of cpuset strings, and a zero-length read() that silently
   passed OOM-kill tests

* tag 'cgroup-for-7.1-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
  cgroup/cpuset: Return only actually allocated CPUs during partition invalidation
  selftests/cgroup: Fix error path leaks in test_percpu_basic
  cgroup/cpuset: Reserve DL bandwidth only for root-domain moves
  cgroup/cpuset: Reset DL migration state on can_attach() failure
  selftests/cgroup: Fix string comparison in write_test
  selftests/cgroup: Fix cg_read_strcmp() empty string comparison
  cgroup/dmem: Return -ENOMEM on failed pool preallocation
  cgroup/cpuset: move PF_EXITING check before __GFP_HARDWALL in cpuset_current_node_allowed()
2026-05-13 14:56:31 -07:00
Linus Torvalds
50599e4c68 workqueue: Fixes for v7.1-rc3
- Plug a wq->cpu_pwq leak on the WQ_UNBOUND allocation failure path.
 
 - Fix a cancel_delayed_work_sync() livelock against drain_workqueue()
   caused by the drain/destroy reject path leaving WORK_STRUCT_PENDING set
   with no owner.

Merge tag 'wq-for-7.1-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq

Pull workqueue fixes from Tejun Heo:

 - Plug a wq->cpu_pwq leak on the WQ_UNBOUND allocation failure path

 - Fix a cancel_delayed_work_sync() livelock against drain_workqueue()
   caused by the drain/destroy reject path leaving WORK_STRUCT_PENDING
   set with no owner

* tag 'wq-for-7.1-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
  workqueue: Fix wq->cpu_pwq leak in alloc_and_link_pwqs() WQ_UNBOUND path
  workqueue: Release PENDING in __queue_work() drain/destroy reject path
2026-05-13 14:49:13 -07:00
Andrea Righi
6ae315d379 sched_ext: Use HK_TYPE_DOMAIN_BOOT to detect isolcpus= domain isolation
scx_enable() refuses to attach a BPF scheduler when isolcpus=domain is
in effect by comparing housekeeping_cpumask(HK_TYPE_DOMAIN) against
cpu_possible_mask.

Since commit 27c3a5967f ("sched/isolation: Convert housekeeping
cpumasks to rcu pointers"), HK_TYPE_DOMAIN's cpumask is RCU protected
and dereferencing it requires either RCU read lock, the cpu_hotplug
write lock, or the cpuset lock; scx_enable() holds none of these, so
booting with isolcpus=domain and attaching any BPF scheduler triggers
the following lockdep splat:

  =============================
  WARNING: suspicious RCU usage
  -----------------------------
  kernel/sched/isolation.c:60 suspicious rcu_dereference_check() usage!

  1 lock held by scx_flash/281:
   #0: ffffffff8379fce0 (update_mutex){+.+.}-{4:4}, at:
       bpf_struct_ops_link_create+0x134/0x1c0

  Call Trace:
   dump_stack_lvl+0x6f/0xb0
   lockdep_rcu_suspicious.cold+0x37/0x70
   housekeeping_cpumask+0xcd/0xe0
   scx_enable.isra.0+0x17/0x120
   bpf_scx_reg+0x5e/0x80
   bpf_struct_ops_link_create+0x151/0x1c0
   __sys_bpf+0x1e4b/0x33c0
   __x64_sys_bpf+0x21/0x30
   do_syscall_64+0x117/0xf80
   entry_SYSCALL_64_after_hwframe+0x77/0x7f

In addition, commit 03ff735101 ("cpuset: Update HK_TYPE_DOMAIN cpumask
from cpuset") made HK_TYPE_DOMAIN include cpuset isolated partitions as
well, which means the current check also rejects BPF schedulers when a
cpuset partition is active. That contradicts the original intent of
commit 9f391f94a1 ("sched_ext: Disallow loading BPF scheduler if
isolcpus= domain isolation is in effect"), which explicitly noted that
cpuset partitions are honored through per-task cpumasks and should not
be rejected.

Switch to housekeeping_enabled(HK_TYPE_DOMAIN_BOOT), which reads only
the housekeeping flag bit (no RCU dereference) and reflects exactly the
boot-time isolcpus= configuration that the error message refers to.
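
The difference between the two checks can be sketched with a toy model. Everything here is hypothetical (`struct hk_model`, `HK_MODEL_DOMAIN_BOOT`, and both `model_reject_*()` helpers); the real housekeeping data structures differ, and the plain `domain_cpus` word only stands in for the RCU-protected cpumask.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model; the real kernel enum and storage differ. */
enum hk_model_type {
	HK_MODEL_DOMAIN_BOOT,       /* set only by boot-time isolcpus= domain isolation */
};

struct hk_model {
	unsigned long flags;        /* one bit per housekeeping type */
	unsigned long domain_cpus;  /* stands in for the HK_TYPE_DOMAIN cpumask */
};

/* Old-style check: compare the domain mask against all possible CPUs. */
static bool model_reject_old(const struct hk_model *hk, unsigned long possible)
{
	assert(hk);
	return hk->domain_cpus != possible;
}

/* New-style check: test only the boot-time flag bit, never touching the mask. */
static bool model_reject_new(const struct hk_model *hk)
{
	assert(hk);
	return (hk->flags >> HK_MODEL_DOMAIN_BOOT) & 1UL;
}
```

With a cpuset isolated partition active but no isolcpus= on the command line, the mask in this model is narrowed while the flag bit stays clear, so only the old check rejects the scheduler.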

Fixes: 27c3a5967f ("sched/isolation: Convert housekeeping cpumasks to rcu pointers")
Cc: stable@vger.kernel.org # v7.0+
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Frederic Weisbecker <frederic@kernel.org>
2026-05-13 10:02:57 -10:00
sunshaojie
345f401666 cgroup/cpuset: Return only actually allocated CPUs during partition invalidation
In update_parent_effective_cpumask() with partcmd_invalidate, the CPUs
to return to the parent are computed as:

    adding = cpumask_and(tmp->addmask, xcpus, parent->effective_xcpus);

where xcpus = user_xcpus(cs) which returns cs->exclusive_cpus (if set)
or cs->cpus_allowed. When exclusive_cpus is not set, user_xcpus(cs) can
contain CPUs that were never actually granted to the partition due to
sibling exclusion in compute_excpus(). Consequently, the invalidation
may return CPUs to the parent that remain in use by sibling partitions,
causing overlapping effective_cpus and triggering the
WARN_ON_ONCE(1) in generate_sched_domains().

Use cs->effective_xcpus instead, which reflects the CPUs actually
granted to this partition.

Reproducer (on a 4-CPU machine):

    cd /sys/fs/cgroup
    mkdir a1 b1

    # a1 becomes partition root with CPUs 0-1
    echo "0-1" > a1/cpuset.cpus
    echo "root" > a1/cpuset.cpus.partition

    # b1 becomes partition root with CPUs 1-2, but sibling exclusion
    # reduces its effective_xcpus to CPU 2 only
    echo "1-2" > b1/cpuset.cpus
    echo "root" > b1/cpuset.cpus.partition

    # b1 changes cpus_allowed to 0-1 -> partition invalidation
    echo "0-1" > b1/cpuset.cpus

    # Expected: CPUs 2-3  (only CPU 2 returned from b1)
    # Actual:   CPUs 1-3  (CPU 0-1 returned, overlapping with a1)
    cat cpuset.cpus.effective

dmesg will also show a WARNING from generate_sched_domains() reporting
overlapping partition root effective_cpus.
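
The arithmetic behind the expected and actual results can be sketched with plain bitmasks. This is a toy model, not the cpuset code: `cpumask_model`, `cpus()` and `model_cpus_returned()` are hypothetical, and it assumes that b1's requested mask is still 1-2 at the moment of invalidation, which is the value that reproduces the numbers quoted above.

```c
#include <assert.h>

typedef unsigned int cpumask_model;   /* bit n set means CPU n is in the mask */

/* Build a mask covering CPUs lo..hi inclusive (hi must be below 32). */
static cpumask_model cpus(unsigned int lo, unsigned int hi)
{
	cpumask_model m = 0;

	assert(lo <= hi && hi < 32);
	for (unsigned int c = lo; c <= hi; c++)
		m |= 1u << c;
	return m;
}

/* CPUs handed back to the parent: the source mask ANDed with the parent's exclusive CPUs. */
static cpumask_model model_cpus_returned(cpumask_model src,
					 cpumask_model parent_effective_xcpus)
{
	return src & parent_effective_xcpus;
}
```

Starting from a parent that holds only CPU 3, returning the requested mask (1-2) yields 1-3, which overlaps a1; returning only the granted mask (CPU 2) yields 2-3.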

Fixes: 2a3602030d ("cgroup/cpuset: Don't invalidate sibling partitions on cpuset.cpus conflict")
Cc: stable@vger.kernel.org # v7.0+
Signed-off-by: sunshaojie <sunshaojie@kylinos.cn>
Tested-by: Chen Ridong <chenridong@huaweicloud.com>
Reviewed-by: Chen Ridong <chenridong@huaweicloud.com>
Reviewed-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2026-05-13 08:54:53 -10:00
Linus Torvalds
e1914add27 Arm:
* Add the pKVM side of the workaround for ARM's erratum 4193714, provided
   that the EL3 firmware does its part of the job. KVM will refuse to
   initialise otherwise.
 
 * Correctly handle 52bit VAs for guest EL2 stage-1 translations when
   running under NV with E2H==0.
 
 * Correctly deal with permission faults in guest_memfd memslots.
 
 * Fix the steal-time selftest after the infrastructure was reworked.
 
 * Make sure the host cannot pass a non-sensical clock update to the
   EL2 tracing infrastructure.
 
 * Appoint Steffen Eiden as a reviewer in anticipation of the KVM/s390
   ability to run arm64 guests, which will inevitably lead to arm64
   code being directly used on s390.
 
 * Make sure that EL2 is configured with both exception entry and exit
   being Context Synchronization Events.
 
 * Handle the current vcpu being NULL on EL2 panic.
 
 * Fix the selftest_vcpu memcache being empty at the point of donation or
   sharing.
 
 * Check that the memcache has enough capacity before engaging on the
   share/donate path.
 
 * Fix __deactivate_fgt() to use its parameter rather than a variable
   in the macro context.
 
 s390:
 
 * Fix array overrun with large amounts of PCI devices.
 
 x86:
 
 * Never use L0's PAUSE loop exiting while L2 is running, since it's
   unlikely that a nested guest will help solving the hypervisor's
   spinlock contention
 
 * Fix emulation of MOVNTDQA.
 
 * Fix typo in Xen hypercall tracepoint
 
 * Add back an optimization that was left behind when recently
   fixing a bug.
 
 * Add module parameter to disable CET, whose implementation seems
   to have issues.  For now it remains enabled by default.
 
 Generic:
 
 * Reject offset causing an unsigned overflow in kvm_reset_dirty_gfn()
 
 Documentation:
 
 * Update stale links
 
 Selftests:
 
 * Fix guest_memfd_test with host page size > guest page size.

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull kvm fixes from Paolo Bonzini:
 "arm64:

   - Add the pKVM side of the workaround for ARM's erratum 4193714,
     provided that the EL3 firmware does its part of the job. KVM will
     refuse to initialise otherwise

   - Correctly handle 52bit VAs for guest EL2 stage-1 translations when
     running under NV with E2H==0

   - Correctly deal with permission faults in guest_memfd memslots

   - Fix the steal-time selftest after the infrastructure was reworked

   - Make sure the host cannot pass a non-sensical clock update to the
     EL2 tracing infrastructure

   - Appoint Steffen Eiden as a reviewer in anticipation of the KVM/s390
     ability to run arm64 guests, which will inevitably lead to arm64
     code being directly used on s390

   - Make sure that EL2 is configured with both exception entry and exit
     being Context Synchronization Events

   - Handle the current vcpu being NULL on EL2 panic

   - Fix the selftest_vcpu memcache being empty at the point of donation
     or sharing

   - Check that the memcache has enough capacity before engaging on the
     share/donate path

   - Fix __deactivate_fgt() to use its parameter rather than a variable
     in the macro context

  s390:

   - Fix array overrun with large amounts of PCI devices

  x86:

   - Never use L0's PAUSE loop exiting while L2 is running, since it's
     unlikely that a nested guest will help solving the hypervisor's
     spinlock contention

   - Fix emulation of MOVNTDQA

   - Fix typo in Xen hypercall tracepoint

   - Add back an optimization that was left behind when recently fixing
     a bug

   - Add module parameter to disable CET, whose implementation seems to
     have issues. For now it remains enabled by default

  Generic:

   - Reject offset causing an unsigned overflow in kvm_reset_dirty_gfn()

  Documentation:

   - Update stale links

  Selftests:

   - Fix guest_memfd_test with host page size > guest page size"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (22 commits)
  KVM: VMX: introduce module parameter to disable CET
  KVM: x86: Swap the dst and src operand for MOVNTDQA
  KVM: x86: use again the flush argument of __link_shadow_page()
  KVM: selftests: Ensure gmem file sizes are multiple of host page size
  Documentation: kvm: update links in the references section of AMD Memory Encryption
  KVM: nSVM: Never use L0's PAUSE loop exiting while L2 is running
  KVM: x86: Fix Xen hypercall tracepoint argument assignment
  KVM: Reject wrapped offset in kvm_reset_dirty_gfn()
  KVM: arm64: Pre-check vcpu memcache for host->guest donate
  KVM: arm64: Pre-check vcpu memcache for host->guest share
  KVM: arm64: Seed pkvm_ownership_selftest vcpu memcache
  KVM: arm64: Fix __deactivate_fgt macro parameter typo
  KVM: arm64: Guard against NULL vcpu on VHE hyp panic path
  KVM: arm64: Make EL2 exception entry and exit context-synchronization events
  MAINTAINERS: Add Steffen as reviewer for KVM/arm64
  KVM: arm64: Remove potential UB on nvhe tracing clock update
  KVM: selftests: arm64: Fix steal_time test after UAPI refactoring
  KVM: arm64: Handle permission faults with guest_memfd
  KVM: arm64: nv: Consider the DS bit when translating TCR_EL2
  KVM: arm64: Work around C1-Pro erratum 4193714 for protected guests
  ...
2026-05-13 11:53:51 -07:00
Yu Miao
7d8f3158a5 selftests/cgroup: Fix error path leaks in test_percpu_basic
When cg_name_indexed() returns NULL partway through the child creation
loop, the code returned -1 without running cleanup_children and cleanup.
That left the `parent` pathname allocation unreleased and did not remove
child cgroup directories already created under the parent. Fix by jumping
to cleanup_children instead of returning.

When cg_create() fails, `child` (the pathname from cg_name_indexed())
was not freed before cleanup_children. Fix by freeing `child` before
branching to cleanup_children.
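
The cleanup pattern described above can be sketched in a self-contained form. This is not the selftest itself: `model_name()`, `model_free()`, `model_percpu_test()` and the two counters are hypothetical, `fail_at` simulates cg_create() failing for one child, and the single `cleanup_children` label folds the selftest's two cleanup labels into one.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

static int live_names;   /* outstanding pathname allocations */
static int live_dirs;    /* child "directories" created and not yet removed */

static char *model_name(const char *s)
{
	char *p = malloc(strlen(s) + 1);

	if (p) {
		strcpy(p, s);
		live_names++;
	}
	return p;
}

static void model_free(char *p)
{
	if (p) {
		free(p);
		live_names--;
	}
}

/* fail_at: index of the child whose creation fails, or -1 for no failure. */
static int model_percpu_test(int nr_children, int fail_at)
{
	int ret = -1;
	char *parent = model_name("parent");

	assert(nr_children >= 0);
	if (!parent)
		return -1;

	for (int i = 0; i < nr_children; i++) {
		char *child = model_name("child");

		if (!child)
			goto cleanup_children;   /* jump to cleanup, never a bare return */

		if (i == fail_at) {              /* stands in for cg_create() failing */
			model_free(child);       /* free the name before branching */
			goto cleanup_children;
		}
		live_dirs++;
		model_free(child);
	}
	ret = 0;

cleanup_children:
	live_dirs = 0;           /* stands in for removing the child cgroups */
	model_free(parent);
	return ret;
}
```

Both the success path and the failure path leave the two counters at zero, which is the property the fix restores.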

Fixes: 90631e1dea ("kselftests: cgroup: add perpcu memory accounting test")
Signed-off-by: Yu Miao <yumiao@kylinos.cn>
Signed-off-by: Tejun Heo <tj@kernel.org>
2026-05-13 08:40:52 -10:00
Linus Torvalds
1f63dd8ca0 liveupdate fixes for v7.1-rc4
A few fixes for kexec handover and liveupdate:
 
 * make sure KHO is skipped for crash kernel
 * fix error reporting in memfd preservation if it fails mid-loop
 * don't allow preserving memfds whose page count exceeds UINT_MAX
 * fix documentation of memfd seals preservation to match the code

Merge tag 'fixes-2026-05-13' of git://git.kernel.org/pub/scm/linux/kernel/git/liveupdate/linux

Pull liveupdate fixes from Mike Rapoport:
 "A few fixes for kexec handover and liveupdate:

   - make sure KHO is skipped for crash kernel

   - fix error reporting in memfd preservation if it fails mid-loop

   - don't allow preserving memfds whose page count exceeds UINT_MAX

   - fix documentation of memfd seals preservation to match the code"

* tag 'fixes-2026-05-13' of git://git.kernel.org/pub/scm/linux/kernel/git/liveupdate/linux:
  mm/memfd_luo: document preservation of file seals
  mm/memfd_luo: reject memfds whose page count exceeds UINT_MAX
  mm/memfd_luo: report error when restoring a folio fails mid-loop
  kho: skip KHO for crash kernel
2026-05-13 08:24:50 -07:00
Paolo Bonzini
2d5d3fc593 KVM: VMX: introduce module parameter to disable CET
There have been reports of host hangs caused by CET virtualization.
Until these are analyzed further, introduce a module parameter that
makes it possible to easily disable it.

Link: https://lore.kernel.org/all/85548beb-1486-40f9-beb4-632c78e3360b@proxmox.com/
Cc: David Riley <d.riley@proxmox.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2026-05-13 15:38:22 +02:00
Tejun Heo
cceb874eee sched_ext: Defer sub_kset base put to scx_sched_free_rcu_work
scx_sub_enable_workfn() pins parent->kobj before dropping scx_sched_lock,
but that does not pin parent->sub_kset. Concurrent disable can
kset_unregister and free sub_kset before scx_alloc_and_add_sched()
dereferences it.

Split sub_kset teardown: kobject_del() at disable keeps sysfs removal; defer
kobject_put() to scx_sched_free_rcu_work so the memory survives. A racing
child sees state_in_sysfs=0 with valid memory, sysfs_create_dir() fails, and
the existing exit_kind gate in scx_link_sched() turns it away with -ENOENT.

Fixes: 411d3ef1a7 ("sched_ext: Unregister sub_kset on scheduler disable")
Signed-off-by: Tejun Heo <tj@kernel.org>
2026-05-12 11:28:56 -10:00
Tejun Heo
b273b75b8d sched_ext: INIT_LIST_HEAD() &sch->all in scx_alloc_and_add_sched()
On scx_link_sched() error paths (parent disabled, hash insert failure),
&sch->all is never added to scx_sched_all. The cleanup path runs
scx_unlink_sched() unconditionally, which calls list_del_rcu(&sch->all) on a
list_head that was never initialized, triggering a corruption warning.

Initialize &sch->all.
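
Why initialization makes the unconditional unlink safe can be sketched with a minimal userspace list. This is a simplified model, not the kernel's list implementation: `struct model_list_head` and the `model_*()` helpers are hypothetical, and the neighbour check only loosely imitates the kernel's list corruption checks.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct model_list_head {
	struct model_list_head *next, *prev;
};

/* An initialized, empty node points at itself in both directions. */
static void model_init_list_head(struct model_list_head *h)
{
	h->next = h;
	h->prev = h;
}

static void model_list_add(struct model_list_head *node, struct model_list_head *head)
{
	node->next = head->next;
	node->prev = head;
	head->next->prev = node;
	head->next = node;
}

/* Returns false, without unlinking, when the neighbours do not point back at entry. */
static bool model_list_del(struct model_list_head *entry)
{
	assert(entry != NULL);

	if (!entry->next || !entry->prev ||
	    entry->next->prev != entry || entry->prev->next != entry)
		return false;   /* uninitialized or corrupted node */

	entry->next->prev = entry->prev;
	entry->prev->next = entry->next;
	return true;
}
```

Deleting a self-linked node only rewrites the node's own pointers, so the cleanup path can run regardless of whether the node was ever added to a list.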

Fixes: 54be8de423 ("sched_ext: Factor out scx_link_sched() and scx_unlink_sched()")
Signed-off-by: Tejun Heo <tj@kernel.org>
2026-05-12 11:28:56 -10:00
Paolo Bonzini
ef7e0c51d9 KVM: s390: pci: fix array indexing
For large amounts of PCI devices it's possible to overrun the arrays as
 the index was miscalculated in 2 places.

Merge tag 'kvm-s390-master-7.1-1' of git://git.kernel.org/pub/scm/linux/kernel/git/kvms390/linux into HEAD

KVM: s390: pci: fix array indexing

For large amounts of PCI devices it's possible to overrun the arrays as
the index was miscalculated in 2 places.
2026-05-12 23:15:38 +02:00
Tejun Heo
39e25a2100 sched_ext: Drop NONE early return in scx_disable_and_exit_task()
d3e73a0808 ("sched_ext: Handle SCX_TASK_NONE in disable/switched_from
paths") skipped the trailing scx_set_task_sched(p, NULL) on NONE tasks.
After scx_fail_parent() parks a task at NONE/sched=parent and the parent
is later freed via queue_rcu_work() during root_disable, the preserved
p->scx.sched dangles - print_scx_info() from sched_show_task() reads
sch->ops.name from freed memory.

Drop the early return. __scx_disable_and_exit_task() already short-
circuits on NONE and the SUB_INIT block was cleared by
scx_fail_parent()'s earlier call, so clearing p->scx.sched is the only
work left - and the one thing the path actually needs.

v2: Extend the SUB_INIT block comment to note that the flag is only
    set on the sub-enable path, so it's always clear on the NONE
    re-entry (Andrea).

Fixes: d3e73a0808 ("sched_ext: Handle SCX_TASK_NONE in disable/switched_from paths")
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-05-12 11:13:58 -10:00
Sean Christopherson
3098c076c8 KVM: x86: Swap the dst and src operand for MOVNTDQA
Swap the MOVNTDQA operands, as MOVNTDQA does NOT in fact have "the same
characteristics as 0F E7 (MOVNTDQ)"; MOVNTDQA loads from memory and stores
to registers, while MOVNTDQ loads from registers and stores to memory.

Per the SDM:

 MOVNTDQ - Move packed integer values in xmm1 to m128 using non-temporal
           hint.

 MOVNTDQA - Move double quadword from m128 to xmm1 using non-temporal hint
            if WC memory type.

Reported-by: Josh Eads <josheads@google.com>
Fixes: c57d9bafbd ("KVM: x86: Add support for emulating MOVNTDQA")
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-ID: <20260506213514.2781948-1-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2026-05-12 23:12:32 +02:00
Paolo Bonzini
6b72d0578c KVM: x86: use again the flush argument of __link_shadow_page()
Except in the case of parentless nested-TDP pages, mmu_page_zap_pte()
clears the SPTE but leaves the invalid_list empty.  In this case, using
kvm_flush_remote_tlbs() as kvm_mmu_remote_flush_or_zap() does is overkill.
Avoid flushing the entirety of the remote TLBs unless the invalid_list
was populated: instead, use a more efficient gfn-targeting flush (if
available) and skip it altogether if the caller guarantees that a TLB
flush is not necessary.

Based-on: <20260503201029.106481-1-pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20260503210917.121840-1-pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2026-05-12 23:12:31 +02:00
Sean Christopherson
87c810160e KVM: selftests: Ensure gmem file sizes are multiple of host page size
When creating a guest_memfd file and associated memslot to validate shared
guest memory, size the file+memslot to the maximum of the host or guest
page size.  Attempting to allocate a single guest page will fail if the
host page size is greater than the guest page size, as KVM requires that
the size of memslots and guest_memfd files are a multiple of the host page
size.

For simplicity, verify the entire file can be shared between guest and host,
e.g. instead of trying to validate "partial" mappings.
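
The sizing rule can be sketched as follows. The helper names `model_gmem_size()` and `model_size_ok()` are hypothetical and are not taken from the selftest; the sketch also assumes both page sizes are powers of two, so the larger one is always a multiple of the smaller.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Size the file and memslot to the larger of the host and guest page sizes. */
static size_t model_gmem_size(size_t host_page_size, size_t guest_page_size)
{
	assert(host_page_size != 0 && guest_page_size != 0);
	return host_page_size > guest_page_size ? host_page_size : guest_page_size;
}

/* The constraint from the text: sizes must be a non-zero multiple of the host page size. */
static bool model_size_ok(size_t size, size_t host_page_size)
{
	return size != 0 && size % host_page_size == 0;
}
```

With a 64 KiB host and a 4 KiB guest, a single guest page (4096 bytes) fails the check, while the maximum of the two sizes (65536 bytes) passes.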

Fixes: 42188667be ("KVM: selftests: Add guest_memfd testcase to fault-in on !mmap()'d memory")
Reported-by: Zenghui Yu <zenghui.yu@linux.dev>
Closes: https://lore.kernel.org/all/0064952b-048c-455d-ad89-e27e5cb82591@linux.dev
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-ID: <20260512155634.772602-1-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2026-05-12 22:26:10 +02:00
Paolo Bonzini
4a9ee4fc79 KVM/arm64 fixes for 7.1, take #2
- Add the pKVM side of the workaround for ARM's erratum 4193714, provided
   that the EL3 firmware does its part of the job. KVM will refuse to
   initialise otherwise.
 
 - Correctly handle 52bit VAs for guest EL2 stage-1 translations when
   running under NV with E2H==0.
 
 - Correctly deal with permission faults in guest_memfd memslots.
 
 - Fix the steal-time selftest after the infrastructure was reworked.
 
 - Make sure the host cannot pass a non-sensical clock update to the
   EL2 tracing infrastructure.
 
 - Appoint Steffen Eiden as a reviewer in anticipation of the KVM/s390
   ability to run arm64 guests, which will inevitably lead to arm64
   code being directly used on s390.
 
 - Make sure that EL2 is configured with both exception entry and exit
   being Context Synchronization Events.
 
 - Handle the current vcpu being NULL on EL2 panic.
 
 - Fix the selftest_vcpu memcache being empty at the point of donation or
   sharing.
 
 - Check that the memcache has enough capacity before engaging on the
   share/donate path.
 
 - Fix __deactivate_fgt() to use its parameter rather than a variable
   in the macro context.

Merge tag 'kvmarm-fixes-7.1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD

KVM/arm64 fixes for 7.1, take #2

- Add the pKVM side of the workaround for ARM's erratum 4193714, provided
  that the EL3 firmware does its part of the job. KVM will refuse to
  initialise otherwise.

- Correctly handle 52bit VAs for guest EL2 stage-1 translations when
  running under NV with E2H==0.

- Correctly deal with permission faults in guest_memfd memslots.

- Fix the steal-time selftest after the infrastructure was reworked.

- Make sure the host cannot pass a non-sensical clock update to the
  EL2 tracing infrastructure.

- Appoint Steffen Eiden as a reviewer in anticipation of the KVM/s390
  ability to run arm64 guests, which will inevitably lead to arm64
  code being directly used on s390.

- Make sure that EL2 is configured with both exception entry and exit
  being Context Synchronization Events.

- Handle the current vcpu being NULL on EL2 panic.

- Fix the selftest_vcpu memcache being empty at the point of donation or
  sharing.

- Check that the memcache has enough capacity before engaging on the
  share/donate path.

- Fix __deactivate_fgt() to use its parameter rather than a variable
  in the macro context.
2026-05-12 22:19:20 +02:00
Ninad Naik
80f4a7b8ce Documentation: kvm: update links in the references section of AMD Memory Encryption
Replace non-working links in the reference section with the working ones.

Signed-off-by: Ninad Naik <ninadnaik07@gmail.com>
Link: https://patch.msgid.link/20260511174302.811918-1-ninadnaik07@gmail.com/
Reviewed-by: Liam Merwick <liam.merwick@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2026-05-12 22:17:42 +02:00
Sean Christopherson
5bd1ddb791 KVM: nSVM: Never use L0's PAUSE loop exiting while L2 is running
Never use L0's (KVM's) PAUSE loop exiting controls while L2 is running,
and instead always configure vmcb02 according to L1's exact capabilities
and desires.

The purpose of intercepting PAUSE after N attempts is to detect when the
vCPU may be stuck waiting on a lock, so that KVM can schedule in a
different vCPU that may be holding said lock.  Barring a very interesting
setup, L1 and L2 do not share locks, and it's extremely unlikely that an
L1 vCPU would hold a spinlock while running L2.  I.e. having a vCPU
executing in L1 yield to a vCPU running in L2 will not allow the L1 vCPU
to make forward progress, and vice versa.

While teaching KVM's "on spin" logic to only yield to other vCPUs in L2 is
doable, in all likelihood it would do more harm than good for most setups.
KVM has limited visibility into which L2 "vCPUs" belong to the same VM,
and thus share a locking domain.  And even if L2 vCPUs are in the same
VM, KVM has no visibility into L2 vCPUs that are scheduled out by the
L1 hypervisor.

Furthermore, KVM doesn't actually steal PAUSE exits from L1. If L1 is
intercepting PAUSE, KVM will route PAUSE exits to L1, not L0, as
nested_svm_intercept() gives priority to the vmcb12 intercept.  As such,
overriding the count/threshold fields in vmcb02 with vmcb01's values is
nonsensical, as doing so clobbers all the training/learning that has been
done in L1.

Even worse, if L1 is not intercepting PAUSE, i.e. KVM is handling PAUSE
exits, then KVM will adjust the PLE knobs based on L2 behavior, which could
very well be detrimental to L1, e.g. due to essentially poisoning L1 PLE
training with bad data.

And copying the count from vmcb02 to vmcb01 on a nested VM-Exit makes even
less sense, because again, the purpose of PLE is to detect spinning vCPUs.
Whether or not a vCPU is spinning in L2 at the time of a nested VM-Exit
has no relevance as to the behavior of the vCPU when it executes in L1.

The only scenarios where any of this actually works are those where at
least one of KVM or L1 is NOT intercepting PAUSE for the guest.  Per the original
changelog, those were the only scenarios considered to be supported.
Disabling KVM's use of PLE makes it so the VM is always in a "supported"
mode.

Last, but certainly not least, using KVM's count/threshold instead of the
values provided by L1 is a blatant violation of the SVM architecture.

Fixes: 74fd41ed16 ("KVM: x86: nSVM: support PAUSE filtering when L0 doesn't intercept PAUSE")
Cc: Maxim Levitsky <mlevitsk@redhat.com>
Tested-by: David Kaplan <david.kaplan@amd.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Link: https://patch.msgid.link/20260508213321.373309-1-seanjc@google.com/
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2026-05-12 22:17:28 +02:00
Qiang Ma
2b72f1674e KVM: x86: Fix Xen hypercall tracepoint argument assignment
TRACE_EVENT(kvm_xen_hypercall) stores a5 in __entry->a4 instead of
__entry->a5.

That overwrites the recorded a4 argument and leaves a5 unset in the
trace entry. Fix the typo so both arguments are captured correctly.
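
The effect of the typo can be reproduced outside the kernel with a plain
struct. This is only a sketch: struct xen_hcall_entry and the two record
functions are hypothetical stand-ins for the generated trace entry and
its assignment body, not the TRACE_EVENT() macro itself.

```c
#include <stdint.h>

/* Hypothetical stand-in for the trace entry; not the kernel's layout. */
struct xen_hcall_entry {
	uint64_t a4;
	uint64_t a5;
};

/* Buggy assignment as described above: a5 lands in the a4 slot. */
static void record_buggy(struct xen_hcall_entry *e, uint64_t a4, uint64_t a5)
{
	e->a4 = a4;
	e->a4 = a5;	/* overwrites a4; e->a5 is never written */
}

/* Fixed assignment: each argument goes to its own slot. */
static void record_fixed(struct xen_hcall_entry *e, uint64_t a4, uint64_t a5)
{
	e->a4 = a4;
	e->a5 = a5;
}
```

On a zeroed entry, record_buggy(&e, 4, 5) leaves e.a4 == 5 and e.a5 == 0,
while record_fixed(&e, 4, 5) stores 4 and 5 in their own fields.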

Signed-off-by: Qiang Ma <maqianga@uniontech.com>
Link: https://patch.msgid.link/20260512015313.1685784-1-maqianga@uniontech.com/
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2026-05-12 22:16:26 +02:00
Aaron Sacks
577a8d3bae KVM: Reject wrapped offset in kvm_reset_dirty_gfn()
kvm_reset_dirty_gfn() guards the gfn range with

	if (!memslot || (offset + __fls(mask)) >= memslot->npages)
		return;

but offset is u64 and the addition is unchecked.  The check can be
silently bypassed by a u64 wrap.

The dirty ring backing those entries is MAP_SHARED at
KVM_DIRTY_LOG_PAGE_OFFSET of the vcpu fd, so the VMM can rewrite the
slot and offset fields of any entry between when the kernel pushes
them and when KVM_RESET_DIRTY_RINGS consumes them.  On reset,
kvm_dirty_ring_reset() re-reads the values via READ_ONCE() and feeds
them straight back into this check; only the flags handshake is
treated as the handover, the slot/offset payload is taken on trust.

Crafting two entries

	entry[i].offset   = 0xffffffffffffffc1
	entry[i+1].offset = 0

makes the coalescing loop in kvm_dirty_ring_reset() compute

	delta = (s64)(0 - 0xffffffffffffffc1) = 63

which falls in [0, BITS_PER_LONG), so it folds entry[i+1] into the
existing mask by setting bit 63.  The trailing kvm_reset_dirty_gfn()
call then sees offset = 0xffffffffffffffc1 and __fls(mask) = 63;
the sum is 0 in u64 and the bounds check passes.

That offset propagates into kvm_arch_mmu_enable_log_dirty_pt_masked()
unchanged.  On the legacy MMU path -- kvm_memslots_have_rmaps() ==
true, i.e. shadow paging, any VM that has allocated shadow roots, or
a write-tracked slot -- it reaches gfn_to_rmap(), which indexes
slot->arch.rmap[0][] with a near-U64_MAX gfn.  That is an
out-of-bounds load of a kvm_rmap_head, followed by a conditional
clear of PT_WRITABLE_MASK in whatever the loaded pointer points at.
The path is reachable from any process holding /dev/kvm.

Range-check offset on its own first, so the addition cannot wrap.
memslot->npages is bounded well below U64_MAX, so once offset <
npages holds, offset + __fls(mask) (with __fls(mask) < BITS_PER_LONG)
stays in range.
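
The wrap and the reordered check can be modelled in userspace. The
following is a minimal sketch, not the kernel code: fls_index(),
in_range_old() and in_range_new() are hypothetical names, and the exact
expression used by the patch is an assumption based on the description
above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Index of the most significant set bit; mask must be non-zero. */
static unsigned int fls_index(uint64_t mask)
{
	unsigned int bit = 0;

	while (mask >>= 1)
		bit++;
	return bit;
}

/* Old check: the u64 addition can wrap, so a huge offset slips through. */
static bool in_range_old(uint64_t offset, uint64_t mask, uint64_t npages)
{
	return !((offset + fls_index(mask)) >= npages);
}

/* New check: range-check offset on its own first, then the sum. */
static bool in_range_new(uint64_t offset, uint64_t mask, uint64_t npages)
{
	if (offset >= npages)
		return false;
	return (offset + fls_index(mask)) < npages;
}
```

With offset = 0xffffffffffffffc1, a mask that has bit 63 set and
npages = 512, in_range_old() accepts the entry because the sum wraps to
0, while in_range_new() rejects it on the first comparison.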

Fixes: fb04a1eddb ("KVM: X86: Implement ring-based dirty memory tracking")
Cc: stable@vger.kernel.org
Signed-off-by: Aaron Sacks <contact@xchglabs.com>
Link: https://patch.msgid.link/20260512060742.1628959-1-contact@xchglabs.com/
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2026-05-12 22:16:16 +02:00
Linus Torvalds
1d5dcaa3bd Probes fixes for v7.1-rc3
- kprobes: skip non-symbol addresses in kprobe_add_ksym_blacklist()
   Since the ftrace adds its NOPs at .kprobes.text section (which stores
   an array), a wrong entry is added when loading a module which uses
   "__kprobes" attribute. To solve this, add "notrace" to __kprobes
   functions.
 - test_kprobes: clear kprobes between test runs
   Clear all kprobes in the test program after running a test set,
   because Kunit test can run several times.
 - fprobe: Fix unregister_fprobe() to wait for RCU grace period
   Since the fprobe data structure is removed with hlist_del_rcu(), it
   should wait for the RCU grace period. If the caller waits for RCU,
   we can use the async variant (e.g. eBPF)
 -----BEGIN PGP SIGNATURE-----
 
 iQFPBAABCgA5FiEEh7BulGwFlgAOi5DV2/sHvwUrPxsFAmoCf4QbHG1hc2FtaS5o
 aXJhbWF0c3VAZ21haWwuY29tAAoJENv7B78FKz8bAt8H/RiNH4k/20YKE2Z56GLy
 N+qCb8CO8L+AroNGCAj4KRVYtBLVzxBLf+Fcdfz6UM/jQ/k2UTeh6ysIt8iWCZYA
 2vJBlVDvvjWPpEZW6yCxlpEAgU2B/Xv/92ZnQjW7sGvL75+gsA1dLu1Gt6lqM5zS
 X335PrIN3c4g+zhwCwW8wLCpMJvyk0qnXiN3thfXTCT/P9GPZldMEAAOecyLl7C3
 Y/Zc8Af3xbMdqplIoYoKRWr0uzYBb1NB2FZR7Dp6i5/5MAhVYobd23s6VXWXZwxV
 FHRJ6R16vCK/ftnwtOiUeuiC3iXn21XQdma6pr2nI6bRhr5v/NBXxmh5U2+tRHeF
 /I4=
 =E/h6
 -----END PGP SIGNATURE-----

Merge tag 'probes-fixes-v7.1-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull probes fixes from Masami Hiramatsu:

 - kprobes: skip non-symbol addresses in kprobe_add_ksym_blacklist()

   Since the ftrace adds its NOPs at .kprobes.text section (which stores
   an array), a wrong entry is added when loading a module which uses
   "__kprobes" attribute.

   To solve this, add "notrace" to __kprobes functions

 - test_kprobes: clear kprobes between test runs

   Clear all kprobes in the test program after running a test set,
   because Kunit test can run several times

 - fprobe: Fix unregister_fprobe() to wait for RCU grace period

   Since the fprobe data structure is removed with hlist_del_rcu(), it
   should wait for the RCU grace period. If the caller waits for RCU, we
   can use the async variant (e.g. eBPF)

* tag 'probes-fixes-v7.1-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  fprobe: Fix unregister_fprobe() to wait for RCU grace period
  test_kprobes: clear kprobes between test runs
  kprobes: skip non-symbol addresses in kprobe_add_ksym_blacklist()
2026-05-12 10:18:02 -07:00
Prathyushi Nangia
c21b90f776 x86/CPU/AMD: Prevent improper isolation of shared resources in Zen2's op cache
Make sure resources are not improperly shared in the op cache and
cause instruction corruption this way.

Signed-off-by: Prathyushi Nangia <prathyushi.nangia@amd.com>
Co-developed-by: Borislav Petkov (AMD) <bp@alien8.de>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Cc: stable@vger.kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2026-05-11 20:06:36 -07:00
Linus Torvalds
50897c9559 linux_kselftest-kunit-fixes-7.1-rc4
Fix to decouple KUNIT_DEBUGFS and KUNIT_ALL_TESTS options and fix
 KUNIT_DEBUGFS dependencies so it depends on DEBUG_FS without which
 it will not be useful.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEPZKym/RZuOCGeA/kCwJExA0NQxwFAmoCUf0ACgkQCwJExA0N
 Qxxsog//Wd/HweNWaLJFG3z9QL4bCAl7cfDTV0uNAobnHge5ymqaKgWmXxi6Xpnt
 CVYOxDsqJ+Pt4AZedOZ8ZAx09rw0NkfbkRRdJMSvh+7EBJHNwguKxaWa99zcz96p
 O4qppKOWOeLWP5YDIDlDLK7atiDhBQjm/p5LdtDBa1oxnD3g2ETvu4mUmPwpNT1W
 t07duEbjLblvOL+hAZ9oRn56638pAyFQG3B9Gs7pjY/YsVWwEIpJnRmFjVu0ZdwU
 gSbxXYgsc8L/EKFWqsz1kfrZjb24s6z4amYCg12UMGBr1YVfSWp0CLkPjm+BbTjx
 SzSoPmUSg4g5m+nLwvGkcK9cGhEvAoyPsDdywcSPtE66vOdbkzdeRQ/ukzwcA+5o
 hm2eC4Z3GXIj+lw2yXzeLflrupUBzi2mZbj0Rcm/rn3h0fUmmDhQf9AoKbH4q/Q+
 UKT/2U0R39WO/e7LTYMpcaajXCNFiAwl3eFpoJfYlWWYV56YgqIp4Q2/B9lmRsi1
 YNTVJUNtesgtzVOUsRjlgjjKMJjXF6wZRVf8YU4e0Uj95gxEc9pxcilmoOnMqWFr
 VvutsERfhaCLbfRnEKx2MP9hsora66aqF3fdMaseM9JoNiuXdlXONIQSD8vAuOkR
 fVp38l2v/SvOb6zyVfAURFYPlPaXUfPGRHraeyQcBcXfew+B8vg=
 =QdSk
 -----END PGP SIGNATURE-----

Merge tag 'linux_kselftest-kunit-fixes-7.1-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest

Pull kunit fixes from Shuah Khan:
 "Fix to decouple KUNIT_DEBUGFS and KUNIT_ALL_TESTS options and fix
  KUNIT_DEBUGFS dependencies so it depends on DEBUG_FS without which it
  will not be useful"

* tag 'linux_kselftest-kunit-fixes-7.1-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest:
  kunit: config: KUNIT_DEBUGFS should depend on DEBUG_FS
  kunit: config: Enable KUNIT_DEBUGFS by default
2026-05-11 15:38:49 -07:00
Tejun Heo
9a415cc537 sched_ext: Avoid UAF in scx_root_enable_workfn() init failure path
In scx_root_enable_workfn(), put_task_struct(p) is called before scx_error()
dereferences p->comm and p->pid. If the iterator's reference is the last
drop, the task is freed synchronously and the deref becomes a UAF.

Move put_task_struct() past scx_error().
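
The ordering issue is not specific to the scheduler. A userspace sketch
of the fixed ordering, assuming a hypothetical refcounted struct task in
place of task_struct and snprintf() in place of scx_error():

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical refcounted object standing in for task_struct. */
struct task {
	int refs;
	int pid;
	char comm[16];
};

static struct task *task_new(int pid, const char *comm)
{
	struct task *p = calloc(1, sizeof(*p));

	if (!p)
		return NULL;
	p->refs = 1;
	p->pid = pid;
	strncpy(p->comm, comm, sizeof(p->comm) - 1);
	return p;
}

/* Drop one reference; the object is freed when the last one goes away. */
static void task_put(struct task *p)
{
	if (--p->refs == 0)
		free(p);
}

/*
 * Fixed ordering: format the error while the reference is still held,
 * and only then drop it. Swapping the two statements would read freed
 * memory whenever this was the last reference.
 */
static void report_init_failure(struct task *p, char *msg, size_t len)
{
	snprintf(msg, len, "init failed for %s[%d]", p->comm, p->pid);
	task_put(p);
}
```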

Reported-by: Sashiko <sashiko-bot@kernel.org>
Closes: https://lore.kernel.org/all/20260511214031.AF5E9C2BCB0@smtp.kernel.org/
Fixes: f0e1a0643a ("sched_ext: Implement BPF extensible scheduler class")
Cc: stable@vger.kernel.org # v6.12+
Signed-off-by: Tejun Heo <tj@kernel.org>
2026-05-11 12:05:48 -10:00
Guopeng Zhang
5dd74441cb cgroup/cpuset: Reserve DL bandwidth only for root-domain moves
cpuset_can_attach() currently adds the bandwidth of all migrating
SCHED_DEADLINE tasks to sum_migrate_dl_bw. If the source and destination
cpuset effective CPU masks do not overlap, the whole sum is then
reserved in the destination root domain.

set_cpus_allowed_dl(), however, subtracts bandwidth from the source
root domain only when the affinity change really moves the task between
root domains. A DL task can move between cpusets that are still in the
same root domain, so including that task in sum_migrate_dl_bw can reserve
destination bandwidth without a matching source-side subtraction.

Share the root-domain move test with set_cpus_allowed_dl(). Keep
nr_migrate_dl_tasks counting all migrating deadline tasks for cpuset DL
task accounting, but add to sum_migrate_dl_bw only for tasks that need a
root-domain bandwidth move. Keep using the destination cpuset effective
CPU mask and leave the broader can_attach()/attach() transaction model
unchanged.

Fixes: 2ef269ef1a ("cgroup/cpuset: Free DL BW in case can_attach() fails")
Cc: stable@vger.kernel.org # v6.10+
Signed-off-by: Guopeng Zhang <zhangguopeng@kylinos.cn>
Reviewed-by: Waiman Long <longman@redhat.com>
Acked-by: Juri Lelli <juri.lelli@redhat.com>
Tested-by: Juri Lelli <juri.lelli@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2026-05-11 10:27:14 -10:00
Jann Horn
c1fa0bb633 exit: prevent preemption of oopsing TASK_DEAD task
When an already-exiting task oopses, make_task_dead() currently calls
do_task_dead() with preemption enabled.  That is forbidden:
do_task_dead() calls __schedule(), which has a comment saying "WARNING:
must be called with preemption disabled!".

If an oopsing task is preempted in do_task_dead(), between becoming
TASK_DEAD and entering the scheduler explicitly, bad things happen:
finish_task_switch() assumes that once the scheduler has switched away
from a TASK_DEAD task, the task can never run again and its stack is no
longer needed; but that assumption apparently doesn't hold if the dead
task was preempted (the SM_PREEMPT case).

This means that the scheduler ends up repeatedly dropping references on
the dead task's stack, which can lead to use-after-free or double-free
of the entire task stack; in other words, two tasks can end up running
on the same stack, resulting in various kinds of memory corruption.

(This does not just affect "recursively oopsing" tasks; it is enough to
oops once during task exit, for example in a file_operations::release
handler)

Fixes: 7f80a2fd7d ("exit: Stop poorly open coding do_task_dead in make_task_dead")
Cc: stable@kernel.org
Signed-off-by: Jann Horn <jannh@google.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2026-05-11 08:55:11 -07:00
Masami Hiramatsu (Google)
657b594b20 fprobe: Fix unregister_fprobe() to wait for RCU grace period
Commit 4346ba1604 ("fprobe: Rewrite fprobe on function-graph tracer")
changed fprobe to register struct fprobe to an rcu-hlist, but it forgot
to wait for RCU GP. Thus there can be use-after-free if the fprobe is
released right after unregistering. This can be happened on fprobe
event and sample module code.

To fix this issue, add synchronize_rcu() in unregister_fprobe().

Note that BPF is OK because fprobe is used as a part of
bpf_kprobe_multi_link. This unregisters its fprobe in
bpf_kprobe_multi_link_release() and it is deallocated via
bpf_kprobe_multi_link_dealloc(), which is invoked from
bpf_link_defer_dealloc_rcu_gp() RCU callback.

For BPF, this also introduces unregister_fprobe_async(), which does
NOT wait for an RCU grace period.

Link: https://lore.kernel.org/all/177813998919.256460.2809243930741138224.stgit@mhiramat.tok.corp.google.com/

Fixes: 4346ba1604 ("fprobe: Rewrite fprobe on function-graph tracer")
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2026-05-11 19:04:46 +09:00
Andrea Righi
86ecb1c1a1 sched_ext: Clear ops->priv on scx_alloc_and_add_sched() error paths
scx_alloc_and_add_sched() can fail after @sch has been assigned to
ops->priv. In those cases @sch is torn down (either via kfree() through
the err_free_* chain or via kobject_put() -> scx_kobj_release() -> RCU
work), but @ops->priv is left pointing at the about-to-be-freed pointer.

With the recent -EBUSY gate in scx_root_enable_workfn() and
scx_sub_enable_workfn() that rejects an attach when @ops->priv is still
non-NULL, see commit bbf30b383c ("sched_ext: Fix ops->priv clobber on
concurrent attach/detach"), a dangling @ops->priv permanently locks the
kdata out: every future attach attempt sees a stale binding and returns
-EBUSY even though no scheduler is actually attached.

Clear @ops->priv on the post-assign failure paths so that the kdata
returns to its pre-attach state when the function returns ERR_PTR().
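
How a dangling binding interacts with the -EBUSY gate can be shown with
a small model. This is a sketch under stated assumptions: struct sched,
struct ops, alloc_and_add() and attach() are hypothetical, the
caller-provided slot replaces real allocation, and clearing slot->alive
models the teardown via kfree() or kobject_put().

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the scheduler instance and the ops kdata. */
struct sched { bool alive; };
struct ops { struct sched *priv; };

/*
 * Bind slot to ops->priv. fail_late simulates a failure after the
 * binding has been published. Returns 0 or a negative errno.
 */
static int alloc_and_add(struct ops *ops, struct sched *slot,
			 bool fail_late, bool clear_on_error)
{
	slot->alive = true;
	ops->priv = slot;		/* binding published */

	if (fail_late) {
		slot->alive = false;	/* models the teardown of @sch */
		if (clear_on_error)
			ops->priv = NULL;	/* the fix: undo the binding */
		return -EINVAL;
	}
	return 0;
}

/* Attach path with the -EBUSY gate: refuse a kdata that is still bound. */
static int attach(struct ops *ops, struct sched *slot,
		  bool fail_late, bool clear_on_error)
{
	if (ops->priv)
		return -EBUSY;
	return alloc_and_add(ops, slot, fail_late, clear_on_error);
}
```

Without clear_on_error, one failed attach leaves ops->priv pointing at a
dead instance and every later attach() returns -EBUSY. With it, the next
attach() on the same ops succeeds.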

Fixes: bbf30b383c ("sched_ext: Fix ops->priv clobber on concurrent attach/detach")
Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2026-05-10 22:50:31 -10:00
Guopeng Zhang
4a39eda5fd cgroup/cpuset: Reset DL migration state on can_attach() failure
cpuset_can_attach() accumulates temporary SCHED_DEADLINE migration
state in the destination cpuset while walking the taskset.

If a later task_can_attach() or security_task_setscheduler() check
fails, cgroup_migrate_execute() treats cpuset as the failing subsystem
and does not call cpuset_cancel_attach() for it. The partially
accumulated state is then left behind and can be consumed by a later
attach, corrupting cpuset DL task accounting and pending DL bandwidth
accounting.

Reset the pending DL migration state from the common error exit when
ret is non-zero. Successful can_attach() keeps the state for
cpuset_attach() or cpuset_cancel_attach().

Fixes: 2ef269ef1a ("cgroup/cpuset: Free DL BW in case can_attach() fails")
Cc: stable@vger.kernel.org # v6.10+
Signed-off-by: Guopeng Zhang <zhangguopeng@kylinos.cn>
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Chen Ridong <chenridong@huaweicloud.com>
Reviewed-by: Waiman Long <longman@redhat.com>
2026-05-10 22:14:49 -10:00
Andrea Righi
bbf30b383c sched_ext: Fix ops->priv clobber on concurrent attach/detach
Under heavy concurrent attach/detach operations, scx_claim_exit() can
trigger a NULL pointer dereference. This can be reproduced by running the
reload_loop kselftest inside a virtme-ng session:

 $ vng -v -- ./tools/testing/selftests/sched_ext/runner -t reload_loop
 ...
 BUG: kernel NULL pointer dereference, address: 0000000000000400
 RIP: 0010:scx_claim_exit+0x3b/0x120
 Call Trace:
  <TASK>
  bpf_scx_unreg+0x45/0xb0
  bpf_struct_ops_map_link_dealloc+0x39/0x50
  bpf_link_release+0x18/0x20
  __fput+0x10b/0x2e0
  __x64_sys_close+0x47/0xa0

The underlying race (diagnosed by Tejun Heo) is a stomp of @ops->priv,
not a missing NULL check:

  T2 unreg(K)                       T1 reg(K)
  -----------                       ---------
  sch = ops->priv = sch_b800
  scx_disable; flush_disable_work
    [scx_root_disable: scx_root=NULL,
     mutex_unlock, state=DISABLED]
                                    mutex_lock; state ok
                                    scx_alloc_and_add_sched:
                                      ops->priv = sch_a800
                                    scx_root = sch_a800; init=0
                                    state=ENABLED; mutex_unlock
    [flush returns]
  RCU_INIT_POINTER(ops->priv, NULL) <-- clobbers sch_a800
  kobject_put(sch_b800)

T1 acquires scx_enable_mutex inside scx_root_disable()'s mutex_unlock
window and starts a fresh attach on the same kdata, assigning sch_a800
to @ops->priv. T2 then continues out of scx_disable()/flush_disable_work
and clobbers @ops->priv to NULL, leaking sch_a800; the bpf_link is gone
but state stays SCX_ENABLED, so all future attaches fail with -EBUSY
permanently. The next bpf_scx_unreg() on that kdata then reads NULL
@ops->priv and dereferences it in scx_claim_exit().

Make @ops->priv the lifecycle binding: in scx_root_enable_workfn() and
scx_sub_enable_workfn(), after the existing state check and still under
scx_enable_mutex, refuse with -EBUSY if @ops->priv is non-NULL. This
rejects an attempt to reuse a kdata that is still bound to a previous
scheduler instance, closing the race without changing the unreg side.

Fixes: 105dcd005b ("sched_ext: Introduce scx_prog_sched()")
Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2026-05-10 21:40:03 -10:00
Andrea Righi
3788e32516 selftests/sched_ext: Fix build error in dequeue selftest
Building the dequeue selftest with newer compilers (e.g., gcc 16)
triggers the following error:

 dequeue.c:28:22: error: variable 'sum' set but not used

The 'volatile' qualifier prevents the writes from being optimized away,
but does not silence the warning: 'sum' is indeed only written and
never read.

Consume 'sum' via an empty asm() with a register input constraint. This
forces the compiler to keep the accumulated value (preserving the CPU
stress loop) and avoids the build error.
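
The technique looks roughly as follows. This is a sketch, not the
selftest code: burn_cpu() and its loop body are hypothetical, and only
the empty asm with an "r" input follows the description above. It relies
on the GCC/Clang extended asm syntax.

```c
/*
 * Busy loop whose result is otherwise unused. The empty asm takes sum
 * as a register input, so the compiler has to load the value and treats
 * the variable as read. No instruction is emitted for the asm itself.
 */
static void burn_cpu(unsigned long iterations)
{
	volatile unsigned long sum = 0;
	unsigned long i;

	for (i = 0; i < iterations; i++)
		sum += i;

	__asm__ __volatile__("" : : "r"(sum));
}
```

The __asm__ spelling is used here so the sketch also builds in strict
ISO C modes; the plain asm keyword behaves the same with GNU extensions
enabled.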

Fixes: 658ad2259b ("selftests/sched_ext: Add test to validate ops.dequeue() semantics")
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2026-05-10 16:03:05 -10:00
Hongfu Li
2a3d7256fa selftests/cgroup: Fix string comparison in write_test
Use string comparison (!=) instead of numeric comparison (-ne) for
cpuset values like "0-1".
For example:
$ [[ "0-1" != "2-3" ]] && echo "true" || echo "false"
true
$ [[ "0-1" -ne "2-3" ]] && echo "true" || echo "false"
false

Signed-off-by: Hongfu Li <lihongfu@kylinos.cn>
Signed-off-by: Tejun Heo <tj@kernel.org>
2026-05-10 15:54:12 -10:00
Hongfu Li
e32e6f0216 selftests/cgroup: Fix cg_read_strcmp() empty string comparison
cg_read_strcmp() allocated a buffer sized to strlen(expected) + 1,
then passed it to read_text() which calls read(fd, buf, size-1).

When comparing against an empty string (""), strlen("") = 0 gives a
1-byte buffer, and read() is asked to read 0 bytes.  The file content
is never actually read, so strcmp("", buf) always returns 0 regardless
of the real content.  This caused cg_test_proc_killed() to always
report the cgroup as empty immediately, making OOM tests pass without
verifying that processes were killed.
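
The sizing problem can be reproduced without a cgroup file. In the
sketch below, read_text_sim() is a hypothetical stand-in that copies
from a string instead of calling read(), and the "extra" parameter is an
assumption used to contrast the old sizing (extra = 0) with a buffer
that has one spare byte (extra = 1); it is not necessarily how the patch
fixes it.

```c
#include <stdlib.h>
#include <string.h>

/*
 * Stand-in for read_text(): copies at most size - 1 bytes of content
 * into buf and NUL-terminates, mirroring read(fd, buf, size - 1).
 */
static void read_text_sim(const char *content, char *buf, size_t size)
{
	size_t n = strlen(content);

	if (n > size - 1)
		n = size - 1;
	memcpy(buf, content, n);
	buf[n] = '\0';
}

/* Compare content against expected using `extra` spare buffer bytes. */
static int read_strcmp_sim(const char *content, const char *expected,
			   size_t extra)
{
	size_t size = strlen(expected) + 1 + extra;
	char *buf = malloc(size);
	int ret;

	if (!buf)
		return -1;
	read_text_sim(content, buf, size);
	ret = strcmp(expected, buf);
	free(buf);
	return ret;
}
```

With extra = 0, read_strcmp_sim("1234\n", "", 0) returns 0 even though
the content is not empty, because zero bytes are copied. With extra = 1
the same call returns non-zero, and an empty content still compares
equal to "".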

Signed-off-by: Hongfu Li <lihongfu@kylinos.cn>
Signed-off-by: Tejun Heo <tj@kernel.org>
2026-05-10 15:53:44 -10:00
Guopeng Zhang
796ad62204 cgroup/dmem: Return -ENOMEM on failed pool preallocation
get_cg_pool_unlocked() handles allocation failures under dmemcg_lock by
dropping the lock, preallocating a pool with GFP_KERNEL, and retrying the
locked lookup and creation path.

If the fallback allocation fails too, pool remains NULL. Since the loop
condition is while (!pool), the function can keep retrying instead of
propagating the allocation failure to the caller.

Set pool to ERR_PTR(-ENOMEM) when the fallback allocation fails so the
loop exits through the existing common return path. The callers already
handle ERR_PTR() from get_cg_pool_unlocked(), so this restores the
expected error path.
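
The shape of the loop and of the fix can be sketched in userspace. All
names below are hypothetical: err_ptr(), is_err() and ptr_err() imitate
the kernel's ERR_PTR() helpers, and the two boolean parameters simulate
whether the allocation under the lock and the fallback preallocation
succeed.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Minimal userspace imitations of the kernel's ERR_PTR() helpers. */
static void *err_ptr(long err) { return (void *)(intptr_t)err; }
static bool is_err(const void *p) { return (uintptr_t)p >= (uintptr_t)-4095; }
static long ptr_err(const void *p) { return (long)(intptr_t)p; }

struct pool { int unused; };

/* Hypothetical model of the retry loop described above. */
static struct pool *get_pool(bool locked_ok, bool fallback_ok)
{
	struct pool *pool = NULL, *prealloc = NULL;

	while (!pool) {
		/* "Locked" attempt: consume the preallocation if present. */
		if (prealloc) {
			pool = prealloc;
			prealloc = NULL;
		} else if (locked_ok) {
			pool = malloc(sizeof(*pool));
		}
		if (pool)
			break;

		/* "Lock dropped": fallback preallocation, then retry. */
		prealloc = fallback_ok ? malloc(sizeof(*prealloc)) : NULL;
		if (!prealloc)
			pool = err_ptr(-ENOMEM);	/* the fix: exit the loop */
	}
	return pool;
}
```

Without the err_ptr() assignment, get_pool(false, false) would never
leave the loop, since pool stays NULL on every pass. With it, the call
returns an error pointer carrying -ENOMEM.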

Fixes: b168ed458d ("kernel/cgroup: Add "dmem" memory accounting cgroup")
Cc: stable@vger.kernel.org # v6.14+
Signed-off-by: Guopeng Zhang <zhangguopeng@kylinos.cn>
Signed-off-by: Tejun Heo <tj@kernel.org>
2026-05-10 15:43:46 -10:00
Linus Torvalds
5d6919055d Linux 7.1-rc3 2026-05-10 14:08:09 -07:00
Tejun Heo
d3e73a0808 sched_ext: Handle SCX_TASK_NONE in disable/switched_from paths
scx_fail_parent() leaves cgroup tasks at (state=NONE, sched=parent,
sched_class=ext) until the parent itself is torn down by the scx_error() it
raised. When the later root_disable iterates them, two paths trip on NONE.

scx_disable_and_exit_task() re-enters the wrapper at NONE: the inner switch
returns early but the trailing scx_set_task_sched(p, NULL) clobbers the
parent sched left by scx_fail_parent(), and scx_set_task_state(p, NONE)
wastes a write on an already-NONE task. switched_from_scx() then calls
scx_disable_task(), which WARNs on non-ENABLED state and writes state=READY,
producing a NONE -> READY transition the validation matrix rejects.

Treat NONE as "nothing to do" in both paths. Add a NONE early-return at the
top of scx_disable_and_exit_task() and a parallel NONE check in
switched_from_scx() next to task_dead_and_done().

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-05-10 10:08:16 -10:00
Tejun Heo
cd6aab7367 sched_ext: Close sub-sched init race with post-init DEAD recheck
scx_sub_enable_workfn()'s init pass and scx_sub_disable() migration both
drop the rq lock to call __scx_init_task() against the other sched. A
TASK_DEAD @p can fall through sched_ext_dead() in that window.
sched_ext_dead() runs ops.exit_task() on the sched @p was attached to, not
on the sched whose init just completed, so the new allocation leaks.

Reuse the DEAD signal set by sched_ext_dead(). After __scx_init_task()
returns, take task_rq_lock(p) and check for DEAD; on hit, call
scx_sub_init_cancel_task() against the sub sched the init ran for and drop
@p; on miss, proceed as before.

Reported-by: zhidao su <suzhidao@xiaomi.com>
Link: https://lore.kernel.org/all/20260429133155.3825247-1-suzhidao@xiaomi.com/
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-05-10 10:08:16 -10:00
Tejun Heo
c941d7391f sched_ext: Close root-enable vs sched_ext_dead() race with SCX_TASK_INIT_BEGIN
scx_root_enable_workfn() drops the iter rq lock for ops.init_task() and a
TASK_DEAD @p can fall through sched_ext_dead() in that window. The race hits
when sched_ext_dead() observes SCX_TASK_INIT (the intermediate state before
@p->scx.sched is published) and dereferences NULL via SCX_HAS_OP(NULL,
exit_task), or observes SCX_TASK_NONE during the unlocked init window and
skips cleanup so exit_task() never runs.

Add SCX_TASK_INIT_BEGIN. The enable path writes NONE -> INIT_BEGIN under the
iter rq lock, then takes the rq lock again after init to walk INIT_BEGIN ->
INIT -> READY. sched_ext_dead() that wins the rq-lock race observes
INIT_BEGIN and sets DEAD without calling into ops; the post-init recheck
unwinds via scx_sub_init_cancel_task().

scx_fork() runs single-threaded against sched_ext_dead() (the task is not on
scx_tasks until scx_post_fork() adds it) so its INIT_BEGIN -> INIT walk
needs no rq-lock pairing; it rolls back to NONE on ops.init_task() failure.

The validation matrix grows the INIT_BEGIN row and the INIT_BEGIN -> DEAD
edge; INIT now requires INIT_BEGIN as the predecessor. scx_sub_disable()'s
migration writes INIT_BEGIN as a synthetic predecessor to satisfy the
tightened verification.
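
The resulting transition rules can be summarized as a predecessor check.
This is a rough sketch, not the kernel's validation code. The INIT,
READY and DEAD rows follow the edges named in this series. The
INIT_BEGIN, ENABLED and NONE rows are assumptions: INIT_BEGIN is taken
to require NONE, ENABLED to require READY, and NONE to be accepted from
anything but DEAD.

```c
#include <stdbool.h>

/* Hypothetical enum; the kernel uses SCX_TASK_* values. */
enum task_state {
	ST_NONE, ST_INIT_BEGIN, ST_INIT, ST_READY, ST_ENABLED, ST_DEAD,
};

/* Returns true if the transition from -> to passes validation. */
static bool transition_ok(enum task_state from, enum task_state to)
{
	switch (to) {
	case ST_NONE:
		return from != ST_DEAD;		/* DEAD -> NONE warns */
	case ST_INIT_BEGIN:
		return from == ST_NONE;		/* assumption */
	case ST_INIT:
		return from == ST_INIT_BEGIN;	/* tightened predecessor */
	case ST_READY:
		return from == ST_INIT || from == ST_ENABLED;
	case ST_ENABLED:
		return from == ST_READY;	/* assumption */
	case ST_DEAD:
		return from == ST_NONE || from == ST_INIT_BEGIN;
	}
	return false;
}
```

Under these rules the enable walk NONE -> INIT_BEGIN -> INIT -> READY is
accepted, a racing INIT_BEGIN -> DEAD is accepted, and both NONE -> INIT
and DEAD -> READY are rejected.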

The sub-sched paths still race with sched_ext_dead() during the unlocked
init window. This will be fixed by the next patch.

Reported-by: zhidao su <suzhidao@xiaomi.com>
Link: https://lore.kernel.org/all/20260429133155.3825247-1-suzhidao@xiaomi.com/
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-05-10 10:08:16 -10:00
Tejun Heo
cceb8fa9cb sched_ext: Replace SCX_TASK_OFF_TASKS flag with SCX_TASK_DEAD state
SCX_TASK_OFF_TASKS marked tasks already through sched_ext_dead() so cgroup
task iteration would skip them. This can be expressed better with a task
state. Replace the flag with SCX_TASK_DEAD.

scx_disable_and_exit_task() resets state to NONE on its way out, so
sched_ext_dead() now sets DEAD after the wrapper returns. The validation
matrix grows NONE -> DEAD, warns on DEAD -> NONE, and tightens READY's
predecessor to INIT or ENABLED so the new DEAD value cannot silently
transition to READY.

Prepares for the following enable vs dead race fix.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-05-10 10:08:16 -10:00
Tejun Heo
938dd9ab2b sched_ext: Inline scx_init_task() and move RESET_RUNNABLE_AT into scx_set_task_state()
Prepare for the SCX_TASK_INIT_BEGIN/DEAD work that follows by collapsing the
scx_init_task() helper. Move the SCX_TASK_RESET_RUNNABLE_AT setting into
scx_set_task_state() on the INIT transition (it was set unconditionally at
every INIT site through the scx_init_task() helper), inline scx_init_task()
into scx_fork() and scx_root_enable_workfn(), and drop the helper.

As a side effect, scx_sub_disable() migration sequence now also sets
RESET_RUNNABLE_AT (it previously wrote INIT directly without going through
scx_init_task()). The flag triggers a runnable_at reset on the next
set_task_runnable(), which is harmless on a task that has just been moved
between scheds.

On root-enable, p->scx.flags is written without the task's rq lock. The task
isn't visible to scx yet, and a follow-up patch restores the lock-held
write.

v2: Note p->scx.flags rq-lock relaxation on root-enable path. (Andrea)

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-05-10 10:08:16 -10:00
Tejun Heo
6947bea4b7 sched_ext: Cleanups in preparation for the SCX_TASK_INIT_BEGIN/DEAD work
Cleanups in preparation for the state-machine work that follows:

- Convert three sub-sched call sites that open-code
  rcu_assign_pointer(p->scx.sched, ...) to scx_set_task_sched().

- Move scx_get_task_state()/scx_set_task_state() above the SCX task iter
  section so scx_task_iter_next_locked() can use them without a forward
  declaration.

No functional change.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-05-10 10:08:16 -10:00
Linus Torvalds
afaa0a4770 - Fix a string leak in the versalnet driver
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmoA1dYACgkQEsHwGGHe
 VUrDjQ/+LiHor5aRMUwG96jtBAyxAPhFVYFwienHALDAdz44vZzvmmR2/6jJzQAb
 xjlbhNxNCCP5HnTqq/nt2bjfx5re9KIMr55HShDhFsdphGbdSd70/4ErRHKTrMxE
 1i6gemqFH10bgj0bdb3nLdqL1xx/vNSBuBbLilzj6MQzgX/tgv/CqvOw+ZwLGCQc
 4kMHPRA6H/wDe2A+S+WOSoKCwcQ5GuQSTYWSiBZkODUw+BTpQlcsFwfy1ZcR/YTl
 NAXX9JmyzlC2R3yrkVzh/6c1fT7ir+tOH99hjnKleI3krcSgDhCsqr4okguUcm5p
 aFAHghmu8D/8z/5qxIimCXUa90QQMrPLZkPd+iPJipKFMbYFN0NC2xcBLnKw3oQf
 AF9af3ww3U4qBseCNEE40gWaibalVFtpRYIRDIbuDt62/U5ngd8TlLC45wNUcDLa
 w+QogyPYCRDfelKy0uysoIrUJcKBEVMfuXk3ddlZF4BUEAih2EQFF+CWbyKUqQXn
 HmltHhveRIfFC4FsDFIu3qUEH92sugcFpcTl1Cj21LjeKtfP7Pxr5roJaZ8uqVKC
 INEv6zBhhcQh5OJdtrhDIUk5PH/oAdniH7OaaxM1qBOMuHk6fleLQGHRo7pZSMrQ
 +yXsmwLl4FAD18IVrUXWPr/iqmKUvFnjISYbflpv+DmqdkObrIE=
 =24jB
 -----END PGP SIGNATURE-----

Merge tag 'edac_urgent_for_v7.1_rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras

Pull EDAC fix from Borislav Petkov:

 - Fix a string leak in the versalnet driver

* tag 'edac_urgent_for_v7.1_rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras:
  EDAC/versalnet: Fix device name memory leak
2026-05-10 12:21:57 -07:00
Hyunwoo Kim
aa54b1d27f rxrpc: Also unshare DATA/RESPONSE packets when paged frags are present
The DATA-packet handler in rxrpc_input_call_event() and the RESPONSE
handler in rxrpc_verify_response() copy the skb to a linear one before
calling into the security ops only when skb_cloned() is true.  An skb
that is not cloned but still carries externally-owned paged fragments
(e.g. SKBFL_SHARED_FRAG set by splice() into a UDP socket via
__ip_append_data, or a chained skb_has_frag_list()) falls through to
the in-place decryption path, which binds the frag pages directly into
the AEAD/skcipher SGL via skb_to_sgvec().

Extend the gate to also unshare when skb_has_frag_list() or
skb_has_shared_frag() is true.  This catches the splice-loopback vector
and other externally-shared frag sources while preserving the
zero-copy fast path for skbs whose frags are kernel-private (e.g. NIC
page_pool RX, GRO).  The OOM/trace handling already in place is reused.
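
The change to the gate amounts to widening one predicate. A minimal
sketch, assuming a hypothetical struct skb_props that records the
results of the three skb helpers named above instead of inspecting a
real sk_buff:

```c
#include <stdbool.h>

/* Hypothetical summary of the skb properties the gate looks at. */
struct skb_props {
	bool cloned;		/* result of skb_cloned() */
	bool has_frag_list;	/* result of skb_has_frag_list() */
	bool has_shared_frag;	/* result of skb_has_shared_frag() */
};

/* Old gate: only a cloned skb is copied before decryption. */
static bool must_unshare_old(const struct skb_props *s)
{
	return s->cloned;
}

/* New gate: externally owned frags also force a private copy. */
static bool must_unshare_new(const struct skb_props *s)
{
	return s->cloned || s->has_frag_list || s->has_shared_frag;
}
```

An skb that is not cloned but has has_shared_frag set, as in the splice
case, passes the old gate untouched and is caught by the new one. An skb
with all three properties false keeps the zero-copy path under both.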

Fixes: d0d5c0cd1e ("rxrpc: Use skb_unshare() rather than skb_cow_data()")
Cc: stable@vger.kernel.org
Signed-off-by: Hyunwoo Kim <imv4bel@gmail.com>
Reviewed-by: Jiayuan Chen <jiayuan.chen@linux.dev>
Acked-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2026-05-10 08:15:57 -07:00
Linus Torvalds
a1a10cdbc6 Fixes for clk drivers:
- Mark the DDR bus clk critical in the SpaceMiT driver so that
    boot doesn't fail
  - Fix boot on Mobile EyeQ by creating the auxiliary device for
    the ethernet PHY
  - Plug an OF node leak in Rockchip rk808 clk driver

Merge tag 'clk-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux

Pull clk driver fixes from Stephen Boyd:

 - Mark the DDR bus clk critical in the SpaceMiT driver so that
   boot doesn't fail

 - Fix boot on Mobile EyeQ by creating the auxiliary device for
   the ethernet PHY

 - Plug an OF node leak in Rockchip rk808 clk driver

* tag 'clk-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux:
  clk: rk808: fix OF node reference imbalance
  MAINTAINERS: add myself as a reviewer for the clk subsystem
  reset: eyeq: drop device_set_of_node_from_dev() done by parent
  clk: eyeq: add EyeQ5 children auxiliary device for generic PHYs
  clk: eyeq: use the auxiliary device creation helper
  clk: spacemit: k3: mark top_dclk as CLK_IS_CRITICAL
2026-05-10 08:10:47 -07:00
Linus Torvalds
515186b7be bpf-fixes

Merge tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf

Pull bpf fixes from Alexei Starovoitov:

 - Fix sk_local_storage diag dump via netlink (Amery Hung)

 - Fix off-by-one in arena direct-value access (Junyoung Jang)

 - Reject TCP_NODELAY in bpf-tcp congestion control (KaFai Wan)

 - Fix type confusion in bpf_*_sock() (Kuniyuki Iwashima)

 - Reject TX-only AF_XDP sockets (Linpu Yu)

 - Don't run arg-tracking analysis twice on main subprog (Paul Chaignon)

 - Fix NULL pointer dereference in bpf_sk_storage_clone and fib lookup
   (Weiming Shi)

* tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
  bpf: Fix off-by-one boundary validation in arena direct-value access
  xskmap: reject TX-only AF_XDP sockets
  bpf: Don't run arg-tracking analysis twice on main subprog
  bpf: Free reuseport cBPF prog after RCU grace period.
  bpf: tcp: Fix type confusion in sol_tcp_sockopt().
  bpf: tcp: Fix type confusion in bpf_skc_to_tcp6_sock().
  bpf: tcp: Fix type confusion in bpf_skc_to_tcp_sock().
  mptcp: bpf: Fix type confusion in bpf_mptcp_sock_from_subflow()
  selftest: bpf: Add test for bpf_tcp_sock() and RAW socket.
  bpf: tcp: Fix type confusion in bpf_tcp_sock().
  tools/headers: Regenerate stddef.h to fix BPF selftests
  bpf: Fix sk_local_storage diag dumping uninitialized special fields
  bpf: Fix NULL pointer dereference in bpf_skb_fib_lookup()
  sockmap: Fix sk_psock_drop() race vs sock_map_{unhash,close,destroy}().
  bpf: Fix NULL pointer dereference in bpf_sk_storage_clone and diag paths
  selftests/bpf: Verify bpf-tcp-cc rejects TCP_NODELAY
  selftests/bpf: Test TCP_NODELAY in TCP hdr opt callbacks
  bpf: Reject TCP_NODELAY in bpf-tcp-cc
  bpf: Reject TCP_NODELAY in TCP header option callbacks
2026-05-09 18:42:54 -07:00
Junyoung Jang
3ac1a467e3 bpf: Fix off-by-one boundary validation in arena direct-value access
BPF_MAP_TYPE_ARENA accepts BPF_PSEUDO_MAP_VALUE offsets at exactly
the end of the arena mapping (off == arena_size). The boundary check
in arena_map_direct_value_addr() uses `>` instead of `>=`, which
incorrectly allows a one-past-end pointer to be accepted.

Change the condition to `>=` to correctly reject offsets that fall
outside the valid arena user_vm range.
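
The difference between the two comparisons can be shown with a minimal standalone sketch. The helper names below are hypothetical and this is not the body of arena_map_direct_value_addr(); it only assumes that valid offsets run from 0 to arena_size - 1.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helpers; valid offsets are assumed to be 0 .. arena_size - 1. */

/* Old check: rejects only offsets strictly beyond the end. */
static bool off_rejected_old(uint64_t off, uint64_t arena_size)
{
	return off > arena_size; /* off == arena_size slips through */
}

/* Fixed check: the one-past-end offset is rejected as well. */
static bool off_rejected_new(uint64_t off, uint64_t arena_size)
{
	return off >= arena_size;
}
```

With arena_size = 4096, the old check accepts off = 4096 while the fixed check rejects it; both accept off = 4095.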

Fixes: 317460317a ("bpf: Introduce bpf_arena.")
Signed-off-by: Junyoung Jang <graypanda.inzag@gmail.com>
Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>
Link: https://lore.kernel.org/r/20260426172505.1947915-1-graypanda.inzag@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-05-09 16:18:39 -07:00
Linpu Yu
bf6d507f7e xskmap: reject TX-only AF_XDP sockets
XSKMAP entries are used as redirect targets for incoming XDP frames.
A TX-only AF_XDP socket lacks an Rx ring and cannot handle redirected
traffic, but xsk_map_update_elem() currently allows such sockets to
be inserted into the map.

Redirecting packets to such a socket on the veth generic-XDP path
causes a kernel crash in xsk_generic_rcv().

This became possible after xsk_is_setup_for_bpf_map() was removed from
the XSKMAP update path, which allowed bound TX-only sockets to be
inserted into the map.

Reject TX-only sockets during XSKMAP updates to avoid the crash.
They remain fully operational for pure Tx purposes outside XSKMAP.
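
A rough userspace model of the rejection, under stated assumptions: struct model_xsk, struct model_ring and model_xskmap_update() are hypothetical names, and the -EINVAL return value is an assumption for illustration rather than the error code the patch uses.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-ins for an AF_XDP socket and its rings. */
struct model_ring { int unused; };

struct model_xsk {
	struct model_ring *rx; /* NULL for a TX-only socket */
	struct model_ring *tx;
};

/*
 * Model of the map-update check: a socket without an Rx ring cannot
 * receive redirected frames, so it is refused as a map entry.
 * The error code is an assumption.
 */
static int model_xskmap_update(const struct model_xsk *xs)
{
	if (!xs->rx)
		return -EINVAL;
	return 0;
}
```

The check only guards insertion into the map; nothing in the model stops a TX-only socket from being used for transmission.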

Fixes: 968be23cea ("xsk: Fix possible segfault at xskmap entry insertion")
Reported-by: Juefei Pu <tomapufckgml@gmail.com>
Reported-by: Yuan Tan <yuantan098@gmail.com>
Reported-by: Xin Liu <bird@lzu.edu.cn>
Signed-off-by: Yifan Wu <yifanwucs@gmail.com>
Signed-off-by: Linpu Yu <linpu5433@gmail.com>
Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
Link: https://lore.kernel.org/r/20260508144344.694-1-linpu5433@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-05-09 16:17:01 -07:00
Paul Chaignon
512809bb8a bpf: Don't run arg-tracking analysis twice on main subprog
Because subprog 0, the main subprog, is considered a global function,
we end up running the arg-tracking dataflow analysis twice on it. That
results in slightly longer verification but mostly in more verbose
verifier logs. This patch fixes it by keeping only the iteration over
global subprogs.

When running over all of Cilium's programs with BPF_LOG_LEVEL2, this
reduces verbosity by ~20% on average.
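
The double visit can be illustrated with a toy model. Everything here is hypothetical (struct model_subprog, analyze(), run_old(), run_new()); it assumes the old flow analyzed subprog 0 explicitly and then again in the loop over global subprogs, and it leaves out how non-global subprogs are handled.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a subprog; the counter records analysis runs. */
struct model_subprog {
	bool is_global;
	int times_analyzed;
};

static void analyze(struct model_subprog *sp)
{
	sp->times_analyzed++;
}

/* Assumed old flow: explicit pass on subprog 0, then all global subprogs. */
static void run_old(struct model_subprog *sp, size_t n)
{
	analyze(&sp[0]);
	for (size_t i = 0; i < n; i++)
		if (sp[i].is_global)
			analyze(&sp[i]);
}

/* New flow: only the loop, which already covers subprog 0. */
static void run_new(struct model_subprog *sp, size_t n)
{
	for (size_t i = 0; i < n; i++)
		if (sp[i].is_global)
			analyze(&sp[i]);
}
```

Because subprog 0 counts as global, the loop alone visits it exactly once in this model.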

Fixes: bf0c571f7f ("bpf: introduce forward arg-tracking dataflow analysis")
Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/e4d7b53d4963ef520541a782f5fc8108a168877c.1778176504.git.paul.chaignon@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-05-09 16:12:40 -07:00