Commit Graph

51779 Commits

Author SHA1 Message Date
Qi Zheng
8285917d6f mm: memcontrol: prepare for reparenting non-hierarchical stats
To resolve the dying memcg issue, we need to reparent LRU folios of child
memcg to its parent memcg.  This could cause problems for non-hierarchical
stats.

As Yosry Ahmed pointed out:

In short, if memory is charged to a dying cgroup at the time of
reparenting, when the memory gets uncharged the stats updates will occur
at the parent. This will update both hierarchical and non-hierarchical
stats of the parent, which would corrupt the parent's non-hierarchical
stats (because those counters were never incremented when the memory was
charged).

Now we have the following two types of non-hierarchical stats, and they
are only used in CONFIG_MEMCG_V1:

a. memcg->vmstats->state_local[i]
b. pn->lruvec_stats->state_local[i]

To ensure that these non-hierarchical stats work properly, we need to
reparent these non-hierarchical stats after reparenting LRU folios. To
this end, this commit makes the following preparations:

1. implement reparent_state_local() to reparent non-hierarchical stats
2. make css_killed_work_fn() to be called in rcu work, and implement
   get_non_dying_memcg_start() and get_non_dying_memcg_end() to avoid race
   between mod_memcg_state()/mod_memcg_lruvec_state()
   and reparent_state_local()

Link: https://lore.kernel.org/e862995c45a7101a541284b6ebee5e5c32c89066.1772711148.git.zhengqi.arch@bytedance.com
Co-developed-by: Yosry Ahmed <yosry@kernel.org>
Signed-off-by: Yosry Ahmed <yosry@kernel.org>
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Allen Pais <apais@linux.microsoft.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Chengming Zhou <chengming.zhou@linux.dev>
Cc: Chen Ridong <chenridong@huawei.com>
Cc: David Hildenbrand <david@kernel.org>
Cc: Hamza Mahfooz <hamzamahfooz@linux.microsoft.com>
Cc: Harry Yoo <harry.yoo@oracle.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Imran Khan <imran.f.khan@oracle.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kamalesh Babulal <kamalesh.babulal@oracle.com>
Cc: Lance Yang <lance.yang@linux.dev>
Cc: Liam Howlett <Liam.Howlett@oracle.com>
Cc: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Michal Koutný <mkoutny@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Nhat Pham <nphamcs@gmail.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Usama Arif <usamaarif642@gmail.com>
Cc: Vlastimil Babka <vbabka@kernel.org>
Cc: Wei Xu <weixugc@google.com>
Cc: Yuanchu Xie <yuanchu@google.com>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-18 00:10:47 -07:00
Linus Torvalds
eb0d6d97c2 bpf-fixes
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE+soXsSLHKoYyzcli6rmadz2vbToFAmnihOkACgkQ6rmadz2v
 bTqjQA/+K6R/teQRwVmP1GDrfBjz2TXUzCN1WQQLzbnJNR96Mzq72+aTWjza89BK
 yEUP379qiOeUfEyyV7DNfHh8hAclUAMKuvI3T3pshLQhpOS0+YcpfbakEZbos+My
 AzEGhGl2nhT7S5twHFznCpuSaLgqldHkdAy4BZIiFkOS5lPBX9CU++OAslFPM+f8
 R28JQYWuv2/b1mRsz8zDmQQXxwH/Rpz9hdJKcpm/kCYYBay3cAFV7ArFJfn+Y5se
 9I6mTwNQ+xtSxtsmR/lftlGo1Vv9ah6qM9gKwgju0SkNrS+9UBlNUSmTrJk1fz+d
 SxdppCrqxwHY3UVd62eF4fWWgusC+oMuKzTh6d+D/ZkKvnEjdAx5XQ7uUQyYhKil
 G12vvKWcHit0Qz9RAhqlEEZ+GIpFTtLql6aW7pRmQKE8/vmQwAD1HBqNqWYKjokW
 btlJ3fUOGu8VHtnYbI3FN6VsK8BU9t/xMny9Fys9X4KmtWBLsm4udmiorV9uC+w6
 xV2s+x+ahythTEzVICB6BlQotSRyMd9kR5qisJsetWk+7NBY0Bwn7C0kfVGepHh0
 WerFSYdSifTvBWQjXnvqmAX7YspmpZvevw8PCtoPq1xq5d1FrYu1K5GX/xzpy+pH
 p13afkbN7Mk6OwteFefD1B0ofug3V9sx3HBI72ENs1Z+hh1KdOQ=
 =79I2
 -----END PGP SIGNATURE-----

Merge tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf

Pull bpf fixes from Alexei Starovoitov:
 "Most of the diff stat comes from Xu Kuohai's fix to emit ENDBR/BTI,
  since all JITs had to be touched to move constant blinding out and
  pass bpf_verifier_env in.

   - Fix use-after-free in arena_vm_close on fork (Alexei Starovoitov)

   - Dissociate struct_ops program with map if map_update fails (Amery
     Hung)

   - Fix out-of-range and off-by-one bugs in arm64 JIT (Daniel Borkmann)

   - Fix precedence bug in convert_bpf_ld_abs alignment check (Daniel
     Borkmann)

   - Fix arg tracking for imprecise/multi-offset in BPF_ST/STX insns
     (Eduard Zingerman)

   - Copy token from main to subprogs to fix missing kallsyms (Eduard
     Zingerman)

   - Prevent double close and leak of btf objects in libbpf (Jiri Olsa)

   - Fix af_unix null-ptr-deref in sockmap (Michal Luczaj)

   - Fix NULL deref in map_kptr_match_type for scalar regs (Mykyta
     Yatsenko)

   - Avoid unnecessary IPIs. Remove redundant bpf_flush_icache() in
     arm64 and riscv JITs (Puranjay Mohan)

   - Fix out of bounds access. Validate node_id in arena_alloc_pages()
     (Puranjay Mohan)

   - Reject BPF-to-BPF calls and callbacks in arm32 JIT (Puranjay Mohan)

   - Refactor all JITs to pass bpf_verifier_env to emit ENDBR/BTI for
     indirect jump targets on x86-64, arm64 JITs (Xu Kuohai)

   - Allow UTF-8 literals in bpf_bprintf_prepare() (Yihan Ding)"

* tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf: (32 commits)
  bpf, arm32: Reject BPF-to-BPF calls and callbacks in the JIT
  bpf: Dissociate struct_ops program with map if map_update fails
  bpf: Validate node_id in arena_alloc_pages()
  libbpf: Prevent double close and leak of btf objects
  selftests/bpf: cover UTF-8 trace_printk output
  bpf: allow UTF-8 literals in bpf_bprintf_prepare()
  selftests/bpf: Reject scalar store into kptr slot
  bpf: Fix NULL deref in map_kptr_match_type for scalar regs
  bpf: Fix precedence bug in convert_bpf_ld_abs alignment check
  bpf, arm64: Emit BTI for indirect jump target
  bpf, x86: Emit ENDBR for indirect jump targets
  bpf: Add helper to detect indirect jump targets
  bpf: Pass bpf_verifier_env to JIT
  bpf: Move constants blinding out of arch-specific JITs
  bpf, sockmap: Take state lock for af_unix iter
  bpf, sockmap: Fix af_unix null-ptr-deref in proto update
  selftests/bpf: Extend bpf_iter_unix to attempt deadlocking
  bpf, sockmap: Fix af_unix iter deadlock
  bpf, sockmap: Annotate af_unix sock:: Sk_state data-races
  selftests/bpf: verify kallsyms entries for token-loaded subprograms
  ...
2026-04-17 15:58:22 -07:00
Amery Hung
f75aeb2de8 bpf: Dissociate struct_ops program with map if map_update fails
Currently, when bpf_struct_ops_map_update_elem() fails, the programs'
st_ops_assoc will remain set. They may become dangling pointers if the
map is freed later, but they will never be dereferenced since the
struct_ops attachment did not succeed. However, if one of the programs
is subsequently attached as part of another struct_ops map, its
st_ops_assoc will be poisoned even though its old st_ops_assoc was stale
from a failed attachment.

Fix the spurious poisoned st_ops_assoc by dissociating struct_ops
programs with a map if the attachment fails. Move
bpf_prog_assoc_struct_ops() to after *plink++ to make sure
bpf_prog_disassoc_struct_ops() will not miss a program when iterating
st_map->links.

Note that, dissociating a program from a map requires some attention as
it must not reset a poisoned st_ops_assoc or a st_ops_assoc pointing to
another map. The former is already guarded in
bpf_prog_disassoc_struct_ops(). The latter also will not happen since
st_ops_assoc of programs in st_map->links are set by
bpf_prog_assoc_struct_ops(), which can only be poisoned or pointing to
the current map.

Signed-off-by: Amery Hung <ameryhung@gmail.com>
Link: https://lore.kernel.org/r/20260417174900.2895486-1-ameryhung@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-04-17 12:04:14 -07:00
Guopeng Zhang
41d701ddc3 cgroup/cpuset: record DL BW alloc CPU for attach rollback
cpuset_can_attach() allocates DL bandwidth only when migrating
deadline tasks to a disjoint CPU mask, but cpuset_cancel_attach()
rolls back based only on nr_migrate_dl_tasks. This makes the DL
bandwidth alloc/free paths asymmetric: rollback can call dl_bw_free()
even when no dl_bw_alloc() was done.

Rollback also needs to undo the reservation against the same CPU/root
domain that was charged. Record the CPU used by dl_bw_alloc() and use
that state in cpuset_cancel_attach(). If no allocation happened,
dl_bw_cpu stays at -1 and rollback skips dl_bw_free(). If allocation
did happen, bandwidth is returned to the same CPU/root domain.

Successful attach paths are unchanged. This only fixes failed attach
rollback accounting.

Fixes: 2ef269ef1a ("cgroup/cpuset: Free DL BW in case can_attach() fails")
Signed-off-by: Guopeng Zhang <zhangguopeng@kylinos.cn>
Reviewed-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2026-04-17 08:57:37 -10:00
Linus Torvalds
87768582a4 dma-mapping updates for Linux 7.0:
- added support for batched cache sync, what improves performance of
   dma_map/unmap_sg() operations on ARM64 architecture (Barry Song)
 
 - introduced DMA_ATTR_CC_SHARED attribute for explicitly shared memory
   used in confidential computing (Jiri Pirko)
 
 - refactored spaghetti-like code in drivers/of/of_reserved_mem.c and its
   clients (Marek Szyprowski, shared branch with device-tree updates to
   avoid merge conflicts)
 
 - prepared Contiguous Memory Allocator related code for making dma-buf
   drivers modularized (Maxime Ripard)
 
 - added support for benchmarking dma_map_sg() calls to tools/dma utility
   (Qinxin Xia)
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQSrngzkoBtlA8uaaJ+Jp1EFxbsSRAUCaeCbdQAKCRCJp1EFxbsS
 RHbWAQCt70dzrU0lu0omTR1HdDP4GTYfuM6nZR91e8/itGN1+QD/XH4I/0wuybzk
 v5uxbIC6lR3abQRc3YNRXfi+i5j26A4=
 =Oee2
 -----END PGP SIGNATURE-----

Merge tag 'dma-mapping-7.1-2026-04-16' of git://git.kernel.org/pub/scm/linux/kernel/git/mszyprowski/linux

Pull dma-mapping updates from Marek Szyprowski:

 - added support for batched cache sync, what improves performance of
   dma_map/unmap_sg() operations on ARM64 architecture (Barry Song)

 - introduced DMA_ATTR_CC_SHARED attribute for explicitly shared memory
   used in confidential computing (Jiri Pirko)

 - refactored spaghetti-like code in drivers/of/of_reserved_mem.c and
   its clients (Marek Szyprowski, shared branch with device-tree updates
   to avoid merge conflicts)

 - prepared Contiguous Memory Allocator related code for making dma-buf
   drivers modularized (Maxime Ripard)

 - added support for benchmarking dma_map_sg() calls to tools/dma
   utility (Qinxin Xia)

* tag 'dma-mapping-7.1-2026-04-16' of git://git.kernel.org/pub/scm/linux/kernel/git/mszyprowski/linux: (24 commits)
  dma-buf: heaps: system: document system_cc_shared heap
  dma-buf: heaps: system: add system_cc_shared heap for explicitly shared memory
  dma-mapping: introduce DMA_ATTR_CC_SHARED for shared memory
  mm: cma: Export cma_alloc(), cma_release() and cma_get_name()
  dma: contiguous: Export dev_get_cma_area()
  dma: contiguous: Make dma_contiguous_default_area static
  dma: contiguous: Make dev_get_cma_area() a proper function
  dma: contiguous: Turn heap registration logic around
  of: reserved_mem: rework fdt_init_reserved_mem_node()
  of: reserved_mem: clarify fdt_scan_reserved_mem*() functions
  of: reserved_mem: rearrange code a bit
  of: reserved_mem: replace CMA quirks by generic methods
  of: reserved_mem: switch to ops based OF_DECLARE()
  of: reserved_mem: use -ENODEV instead of -ENOENT
  of: reserved_mem: remove fdt node from the structure
  dma-mapping: fix false kernel-doc comment marker
  dma-mapping: Support batch mode for dma_direct_{map,unmap}_sg
  dma-mapping: Separate DMA sync issuing and completion waiting
  arm64: Provide dcache_inval_poc_nosync helper
  arm64: Provide dcache_clean_poc_nosync helper
  ...
2026-04-17 11:12:42 -07:00
cuitao
c802f460dd cgroup/rdma: fix integer overflow in rdmacg_try_charge()
The expression `rpool->resources[index].usage + 1` is computed in int
arithmetic before being assigned to s64 variable `new`. When usage equals
INT_MAX (the default "max" value), the addition overflows to INT_MIN.
This negative value then passes the `new > max` check incorrectly,
allowing a charge that should be rejected and corrupting usage to
negative.

Fix by casting usage to s64 before the addition so the arithmetic is
done in 64-bit.

Fixes: 39d3e7584a ("rdmacg: Added rdma cgroup controller")
Signed-off-by: cuitao <cuitao@kylinos.cn>
Reviewed-by: Michal Koutný <mkoutny@suse.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2026-04-17 07:25:27 -10:00
Edward Adam Davis
a5b98009f1 sched/psi: fix race between file release and pressure write
A potential race condition exists between pressure write and cgroup file
release regarding the priv member of struct kernfs_open_file, which
triggers the uaf reported in [1].

Consider the following scenario involving execution on two separate CPUs:

   CPU0					CPU1
   ====					====
					vfs_rmdir()
					kernfs_iop_rmdir()
					cgroup_rmdir()
					cgroup_kn_lock_live()
					cgroup_destroy_locked()
					cgroup_addrm_files()
					cgroup_rm_file()
					kernfs_remove_by_name()
					kernfs_remove_by_name_ns()
 vfs_write()				__kernfs_remove()
 new_sync_write()			kernfs_drain()
 kernfs_fop_write_iter()		kernfs_drain_open_files()
 cgroup_file_write()			kernfs_release_file()
 pressure_write()			cgroup_file_release()
 ctx = of->priv;
					kfree(ctx);
 					of->priv = NULL;
					cgroup_kn_unlock()
 cgroup_kn_lock_live()
 cgroup_get(cgrp)
 cgroup_kn_unlock()
 if (ctx->psi.trigger)  // here, trigger uaf for ctx, that is of->priv

The cgroup_rmdir() is protected by the cgroup_mutex, it also safeguards
the memory deallocation of of->priv performed within cgroup_file_release().
However, the operations involving of->priv executed within pressure_write()
are not entirely covered by the protection of cgroup_mutex. Consequently,
if the code in pressure_write(), specifically the section handling the
ctx variable executes after cgroup_file_release() has completed, a uaf
vulnerability involving of->priv is triggered.

Therefore, the issue can be resolved by extending the scope of the
cgroup_mutex lock within pressure_write() to encompass all code paths
involving of->priv, thereby properly synchronizing the race condition
occurring between cgroup_file_release() and pressure_write().

And, if an live kn lock can be successfully acquired while executing
the pressure write operation, it indicates that the cgroup deletion
process has not yet reached its final stage; consequently, the priv
pointer within open_file cannot be NULL. Therefore, the operation to
retrieve the ctx value must be moved to a point *after* the live kn
lock has been successfully acquired.

In another situation, specifically after entering cgroup_kn_lock_live()
but before acquiring cgroup_mutex, there exists a different class of
race condition:

CPU0: write memory.pressure               CPU1: write cgroup.pressure=0
===========================		  =============================

kernfs_fop_write_iter()
 kernfs_get_active_of(of)
 pressure_write()
   cgroup_kn_lock_live(memory.pressure)
     cgroup_tryget(cgrp)
     kernfs_break_active_protection(kn)
     ... blocks on cgroup_mutex

                                     	  cgroup_pressure_write()
                                     	  cgroup_kn_lock_live(cgroup.pressure)
                                     	  cgroup_file_show(memory.pressure, false)
                                     	    kernfs_show(false)
                                     	      kernfs_drain_open_files()
                                     	        cgroup_file_release(of)
                                     	          kfree(ctx)
                                     	            of->priv = NULL
                                     	  cgroup_kn_unlock()

   ... acquires cgroup_mutex
   ctx = of->priv;        // may now be NULL
   if (ctx->psi.trigger)  // NULL dereference

Consequently, there is a possibility that of->priv is NULL, the pressure
write needs to check for this.

Now that the scope of the cgroup_mutex has been expanded, the original
explicit cgroup_get/put operations are no longer necessary, this is
because acquiring/releasing the live kn lock inherently executes a
cgroup get/put operation.

[1]
BUG: KASAN: slab-use-after-free in pressure_write+0xa4/0x210 kernel/cgroup/cgroup.c:4011
Call Trace:
 pressure_write+0xa4/0x210 kernel/cgroup/cgroup.c:4011
 cgroup_file_write+0x36f/0x790 kernel/cgroup/cgroup.c:4311
 kernfs_fop_write_iter+0x3b0/0x540 fs/kernfs/file.c:352

Allocated by task 9352:
 cgroup_file_open+0x90/0x3a0 kernel/cgroup/cgroup.c:4256
 kernfs_fop_open+0x9eb/0xcb0 fs/kernfs/file.c:724
 do_dentry_open+0x83d/0x13e0 fs/open.c:949

Freed by task 9353:
 cgroup_file_release+0xd6/0x100 kernel/cgroup/cgroup.c:4283
 kernfs_release_file fs/kernfs/file.c:764 [inline]
 kernfs_drain_open_files+0x392/0x720 fs/kernfs/file.c:834
 kernfs_drain+0x470/0x600 fs/kernfs/dir.c:525

Fixes: 0e94682b73 ("psi: introduce psi monitor")
Reported-by: syzbot+33e571025d88efd1312c@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=33e571025d88efd1312c
Tested-by: syzbot+33e571025d88efd1312c@syzkaller.appspotmail.com
Signed-off-by: Edward Adam Davis <eadavis@qq.com>
Reviewed-by: Chen Ridong <chenridong@huaweicloud.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2026-04-17 07:25:09 -10:00
Puranjay Mohan
2845989f2e bpf: Validate node_id in arena_alloc_pages()
arena_alloc_pages() accepts a plain int node_id and forwards it through
the entire allocation chain without any bounds checking.

Validate node_id before passing it down the allocation chain in
arena_alloc_pages().

Fixes: 317460317a ("bpf: Introduce bpf_arena.")
Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>
Link: https://lore.kernel.org/r/20260417152135.1383754-1-puranjay@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-04-17 10:12:55 -07:00
Linus Torvalds
0b6bc3dbe6 tracing latency updates for 7.1:
- Add TIMERLAT_ALIGN osnoise option
 
   Add a timer alignment option for timerlat that makes it work like the
   cyclictest -A option. timelat creates threads to test the latency of the
   kernel. The alignment option will have these threads trigger at the
   alignment offsets from each other. Instead of having each thread wake up
   at the exact same time, if the alignment is set to "20" each thread will
   wake up at 20 microseconds from the previous one.
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYKADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCaeIBaBQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qqM8AQCyz4uL1lCLQtJb5thG8V9QFRkYh5F3
 +DyuRNoht3ijyAD+K4nZPAu4F09feWuHkssONtSZECKEuN6EhiHE4XX6Pgc=
 =WYYW
 -----END PGP SIGNATURE-----

Merge tag 'trace-latency-v7.1' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull tracing latency update from Steven Rostedt:

 - Add TIMERLAT_ALIGN osnoise option

   Add a timer alignment option for timerlat that makes it work like the
   cyclictest -A option. timelat creates threads to test the latency of
   the kernel. The alignment option will have these threads trigger at
   the alignment offsets from each other. Instead of having each thread
   wake up at the exact same time, if the alignment is set to "20" each
   thread will wake up at 20 microseconds from the previous one.

* tag 'trace-latency-v7.1' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  tracing/osnoise: Add option to align tlat threads
2026-04-17 10:12:11 -07:00
Linus Torvalds
cb30bf881c tracing updates for v7.1:
- Fix printf format warning for bprintf
 
   sunrpc uses a trace_printk() that triggers a printf warning during the
   compile. Move the __printf() attribute around for when debugging is not
   enabled the warning will go away.
 
 - Remove redundant check for EVENT_FILE_FL_FREED in event_filter_write()
 
   The FREED flag is checked in the call to event_file_file() and then
   checked again right afterward, which is unneeded.
 
 - Clean up event_file_file() and event_file_data() helpers
 
   These helper functions played a different role in the past, but now with
   eventfs, the READ_ONCE() isn't needed. Simplify the code a bit and also
   add a warning to event_file_data() if the file or its data is not present.
 
 - Remove updating file->private_data in tracing open
 
   All access to the file private data is handled by the helper functions,
   which do not use file->private_data. Stop updating it on open.
 
 - Show ENUM names in function arguments via BTF in function tracing
 
   When showing the function arguments when func-args option is set for
   function tracing, if one of the arguments is found to be an enum, show the
   name of the enum instead of its number.
 
 - Add new trace_call__##name() API for tracepoints
 
   Tracepoints are enabled via static_branch() blocks, where when not
   enabled, there's only a nop that is in the code where the execution will
   just skip over it. When tracing is enabled, the nop is converted to a
   direct jump to the tracepoint code. Sometimes more calculations are
   required to be performed to update the parameters of the tracepoint. In
   this case, trace_##name##_enabled() is called which is a static_branch()
   that gets enabled only when the tracepoint is enabled. This allows the
   extra calculations to also be skipped by the nop:
 
   if (trace_foo_enabled()) {
       x = bar();
       trace_foo(x);
   }
 
   Where the x=bar() is only performed when foo is enabled. The problem with
   this approach is that there's now two static_branch() calls. One for
   checking if the tracepoint is enabled, and then again to know if the
   tracepoint should be called. The second one is redundant.
 
   Introduce trace_call__foo() that will call the foo() tracepoint directly
   without doing a static_branch():
 
   if (trace_foo_enabled()) {
       x = bar();
       trace_call__foo();
   }
 
 - Update various locations to use the new trace_call__##name() API
 
 - Move snapshot code out of trace.c
 
   Cleaning up trace.c to not be a "dump all", move the snapshot code out of
   it and into a new trace_snapshot.c file.
 
 - Clean up some "%*.s" to "%*s"
 
 - Allow boot kernel command line options to be called multiple times
 
   Have options like:
 
     ftrace_filter=foo ftrace_filter=bar ftrace_filter=zoo
 
   Equal to:
 
     ftrace_filter=foo,bar,zoo
 
 - Fix ipi_raise event CPU field to be a CPU field
 
   The ipi_raise target_cpus field is defined as a __bitmask(). There is now a
   __cpumask() field definition. Update the field to use that.
 
 - Have hist_field_name() use a snprintf() and not a series of strcat()
 
   It's safer to use snprintf() that a series of strcat().
 
 - Fix tracepoint regfunc balancing
 
   A tracepoint can define a "reg" and "unreg" function that gets called
   before the tracepoint is enabled, and after it is disabled respectively.
   But on error, after the "reg" func is called and the tracepoint is not
   enabled, the "unreg" function is not called to tear down what the "reg"
   function performed.
 
 - Fix output that shows what histograms are enabled
 
   Event variables are displayed incorrectly in the histogram output.
 
   Instead of "sched.sched_wakeup.$var", it is showing
   "$sched.sched_wakeup.var" where the '$' is in the incorrect location.
 
 - Some other simple cleanups.
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYKADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCaeCpvxQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qt2WAP44m85BbAjBqJe4WR103eOXV+bREBta
 dRoReKJOMe519gEAp0rK/HoCvHgHhIGe3gaGdIsNhnaxoFyNWMG/wokoLAY=
 =Hg6+
 -----END PGP SIGNATURE-----

Merge tag 'trace-v7.1' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull tracing updates from Steven Rostedt:

 - Fix printf format warning for bprintf

   sunrpc uses a trace_printk() that triggers a printf warning during
   the compile. Move the __printf() attribute around for when debugging
   is not enabled the warning will go away

 - Remove redundant check for EVENT_FILE_FL_FREED in
   event_filter_write()

   The FREED flag is checked in the call to event_file_file() and then
   checked again right afterward, which is unneeded

 - Clean up event_file_file() and event_file_data() helpers

   These helper functions played a different role in the past, but now
   with eventfs, the READ_ONCE() isn't needed. Simplify the code a bit
   and also add a warning to event_file_data() if the file or its data
   is not present

 - Remove updating file->private_data in tracing open

   All access to the file private data is handled by the helper
   functions, which do not use file->private_data. Stop updating it on
   open

 - Show ENUM names in function arguments via BTF in function tracing

   When showing the function arguments when func-args option is set for
   function tracing, if one of the arguments is found to be an enum,
   show the name of the enum instead of its number

 - Add new trace_call__##name() API for tracepoints

   Tracepoints are enabled via static_branch() blocks, where when not
   enabled, there's only a nop that is in the code where the execution
   will just skip over it. When tracing is enabled, the nop is converted
   to a direct jump to the tracepoint code. Sometimes more calculations
   are required to be performed to update the parameters of the
   tracepoint. In this case, trace_##name##_enabled() is called which is
   a static_branch() that gets enabled only when the tracepoint is
   enabled. This allows the extra calculations to also be skipped by the
   nop:

	if (trace_foo_enabled()) {
		x = bar();
		trace_foo(x);
	}

   Where the x=bar() is only performed when foo is enabled. The problem
   with this approach is that there's now two static_branch() calls. One
   for checking if the tracepoint is enabled, and then again to know if
   the tracepoint should be called. The second one is redundant

   Introduce trace_call__foo() that will call the foo() tracepoint
   directly without doing a static_branch():

	if (trace_foo_enabled()) {
		x = bar();
		trace_call__foo();
	}

 - Update various locations to use the new trace_call__##name() API

 - Move snapshot code out of trace.c

   Cleaning up trace.c to not be a "dump all", move the snapshot code
   out of it and into a new trace_snapshot.c file

 - Clean up some "%*.s" to "%*s"

 - Allow boot kernel command line options to be called multiple times

   Have options like:

	ftrace_filter=foo ftrace_filter=bar ftrace_filter=zoo

   Equal to:

	ftrace_filter=foo,bar,zoo

 - Fix ipi_raise event CPU field to be a CPU field

   The ipi_raise target_cpus field is defined as a __bitmask(). There is
   now a __cpumask() field definition. Update the field to use that

 - Have hist_field_name() use a snprintf() and not a series of strcat()

   It's safer to use snprintf() that a series of strcat()

 - Fix tracepoint regfunc balancing

   A tracepoint can define a "reg" and "unreg" function that gets called
   before the tracepoint is enabled, and after it is disabled
   respectively. But on error, after the "reg" func is called and the
   tracepoint is not enabled, the "unreg" function is not called to tear
   down what the "reg" function performed

 - Fix output that shows what histograms are enabled

   Event variables are displayed incorrectly in the histogram output

   Instead of "sched.sched_wakeup.$var", it is showing
   "$sched.sched_wakeup.var" where the '$' is in the incorrect location

 - Some other simple cleanups

* tag 'trace-v7.1' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: (24 commits)
  selftests/ftrace: Add test case for fully-qualified variable references
  tracing: Fix fully-qualified variable reference printing in histograms
  tracepoint: balance regfunc() on func_add() failure in tracepoint_add_func()
  tracing: Rebuild full_name on each hist_field_name() call
  tracing: Report ipi_raise target CPUs as cpumask
  tracing: Remove duplicate latency_fsnotify() stub
  tracing: Preserve repeated trace_trigger boot parameters
  tracing: Append repeated boot-time tracing parameters
  tracing: Remove spurious default precision from show_event_trigger/filter formats
  cpufreq: Use trace_call__##name() at guarded tracepoint call sites
  tracing: Remove tracing_alloc_snapshot() when snapshot isn't defined
  tracing: Move snapshot code out of trace.c and into trace_snapshot.c
  mm: damon: Use trace_call__##name() at guarded tracepoint call sites
  btrfs: Use trace_call__##name() at guarded tracepoint call sites
  spi: Use trace_call__##name() at guarded tracepoint call sites
  i2c: Use trace_call__##name() at guarded tracepoint call sites
  kernel: Use trace_call__##name() at guarded tracepoint call sites
  tracepoint: Add trace_call__##name() API
  tracing: trace_mmap.h: fix a kernel-doc warning
  tracing: Pretty-print enum parameters in function arguments
  ...
2026-04-17 09:43:12 -07:00
Linus Torvalds
c9e03d5948 Probes updates for v7.1
- fprobe: do not zero out unused fgraph_data. This removes unneeded
   memset of fgraph_data in fprobe entry handler.
 -----BEGIN PGP SIGNATURE-----
 
 iQFPBAABCgA5FiEEh7BulGwFlgAOi5DV2/sHvwUrPxsFAmnhhjEbHG1hc2FtaS5o
 aXJhbWF0c3VAZ21haWwuY29tAAoJENv7B78FKz8bfBMH/21EM4dL2oDrsR3Ud7G+
 R1LYB959jkztJlIKQ9xGBdUpyMd97JYszknpTG4QHb7XW4FrcvZMiHeT9AW6ZBzw
 zJkWVbSCst7sNYpjmqFcf/p3uUJp3/jHNG6byQ2/oWW9bilWPitY1LDL8u/VwzWN
 ZN35B09bw2IgKXbxCNoNlASdV8w/nnVI+CEjKSbLJBVQffBzU9losLt5Qu1i5w/A
 Q5O5ZLkeawQhRzilxVAqae7U949MzRvn28EiLyMJ4rJA0vWH+ijg7vFYmC4hnFde
 GNzJ7q7S87BmApTsSVT+QBCpIGd2XjLYfiYZBBoZIZE2wXI4fxgF+cd05XE51hrG
 mTs=
 =ycSD
 -----END PGP SIGNATURE-----

Merge tag 'probes-v7.1' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull fprobe update from Masami Hiramatsu:

 - do not zero out unused fgraph_data. This removes unneeded memset of
   fgraph_data in fprobe entry handler.

* tag 'probes-v7.1' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  tracing: fprobe: do not zero out unused fgraph_data
2026-04-17 09:18:32 -07:00
Linus Torvalds
01f492e181 Arm:
- Add support for tracing in the standalone EL2 hypervisor code, which
   should help both debugging and performance analysis.  This uses the
   new infrastructure for 'remote' trace buffers that can be exposed
   by non-kernel entities such as firmware, and which came through the
   tracing tree.
 
 - Add support for GICv5 Per Processor Interrupts (PPIs), as the starting
   point for supporting the new GIC architecture in KVM.
 
 - Finally add support for pKVM protected guests, where pages are unmapped
   from the host as they are faulted into the guest and can be shared back
   from the guest using pKVM hypercalls.  Protected guests are created
   using a new machine type identifier.  As the elusive guestmem has not
   yet delivered on its promises, anonymous memory is also supported.
 
   This is only a first step towards full isolation from the host; for
   example, the CPU register state and DMA accesses are not yet isolated.
   Because this does not really yet bring fully what it promises, it is
   hidden behind CONFIG_ARM_PKVM_GUEST + 'kvm-arm.mode=protected', and
   also triggers TAINT_USER when a VM is created.  Caveat emptor.
 
 - Rework the dreaded user_mem_abort() function to make it more
   maintainable, reducing the amount of state being exposed to the
   various helpers and rendering a substantial amount of state immutable.
 
 - Expand the Stage-2 page table dumper to support NV shadow page tables
   on a per-VM basis.
 
 - Tidy up the pKVM PSCI proxy code to be slightly less hard to follow.
 
 - Fix both SPE and TRBE in non-VHE configurations so that they do not
   generate spurious, out of context table walks that ultimately lead
   to very bad HW lockups.
 
 - A small set of patches fixing the Stage-2 MMU freeing in error cases.
 
 - Tighten-up accepted SMC immediate value to be only #0 for host
   SMCCC calls.
 
 - The usual cleanups and other selftest churn.
 
 LoongArch:
 
 - Use CSR_CRMD_PLV for kvm_arch_vcpu_in_kernel().
 
 - Add DMSINTC irqchip in kernel support.
 
 RISC-V:
 
 - Fix steal time shared memory alignment checks
 
 - Fix vector context allocation leak
 
 - Fix array out-of-bounds in pmu_ctr_read() and pmu_fw_ctr_read_hi()
 
 - Fix double-free of sdata in kvm_pmu_clear_snapshot_area()
 
 - Fix integer overflow in kvm_pmu_validate_counter_mask()
 
 - Fix shift-out-of-bounds in make_xfence_request()
 
 - Fix lost write protection on huge pages during dirty logging
 
 - Split huge pages during fault handling for dirty logging
 
 - Skip CSR restore if VCPU is reloaded on the same core
 
 - Implement kvm_arch_has_default_irqchip() for KVM selftests
 
 - Factored-out ISA checks into separate sources
 
 - Added hideleg to struct kvm_vcpu_config
 
 - Factored-out VCPU config into separate sources
 
 - Support configuration of per-VM HGATP mode from KVM user space
 
 s390:
 
 - Support for ESA (31-bit) guests inside nested hypervisors.
 
 - Remove restriction on memslot alignment, which is not needed anymore with
   the new gmap code.
 
 - Fix LPSW/E to update the bear (which of course is the breaking event
   address register).
 
 x86:
 
 - Shut up various UBSAN warnings on reading module parameter before they
   were initialized.
 
 - Don't zero-allocate page tables that are used for splitting hugepages in
   the TDP MMU, as KVM is guaranteed to set all SPTEs in the page table and
   thus write all bytes.
 
 - As an optimization, bail early when trying to unsync 4KiB mappings if the
   target gfn can just be mapped with a 2MiB hugepage.
 
 x86 generic:
 
 - Copy single-chunk MMIO write values into struct kvm_vcpu (more precisely
   struct kvm_mmio_fragment) to fix use-after-free stack bugs where KVM
   would dereference stack pointer after an exit to userspace.
 
 - Clean up and comment the emulated MMIO code to try to make it easier to
   maintain (not necessarily "easy", but "easier").
 
 - Move VMXON+VMXOFF and EFER.SVME toggling out of KVM (not *all* of VMX
   and SVM enabling) as it is needed for trusted I/O.
 
 - Advertise support for AVX512 Bit Matrix Multiply (BMM) instructions
 
 - Immediately fail the build if a required #define is missing in one of
   KVM's headers that is included multiple times.
 
 - Reject SET_GUEST_DEBUG with -EBUSY if there's an already injected
   exception, mostly to prevent syzkaller from abusing the uAPI to
   trigger WARNs, but also because it can help prevent userspace from
   unintentionally crashing the VM.
 
 - Exempt SMM from CPUID faulting on Intel, as per the spec.
 
 - Misc hardening and cleanup changes.
 
 x86 (AMD):
 
 - Fix and optimize IRQ window inhibit handling for AVIC; make it per-vCPU
   so that KVM doesn't prematurely re-enable AVIC if multiple
   vCPUs have to-be-injected IRQs.
 
 - Clean up and optimize the OSVW handling, avoiding a bug in which KVM would
   overwrite state when enabling virtualization on multiple CPUs in parallel.
   This should not be a problem because OSVW should usually be the same for
   all CPUs.
 
 - Drop a WARN in KVM_MEMORY_ENCRYPT_REG_REGION where KVM complains about a
   "too large" size based purely on user input.
 
 - Clean up and harden the pinning code for KVM_MEMORY_ENCRYPT_REG_REGION.
 
 - Disallow synchronizing a VMSA of an already-launched/encrypted vCPU, as
   doing so for an SNP guest will crash the host due to an RMP violation
   page fault.
 
 - Overhaul KVM's APIs for detecting SEV+ guests so that VM-scoped queries
   are required to hold kvm->lock, and enforce it by lockdep.  Fix various
   bugs where sev_guest() was not ensured to be stable for the whole
   duration of a function or ioctl.
 
 - Convert a pile of kvm->lock SEV code to guard().
 
 - Play nicer with userspace that does not enable KVM_CAP_EXCEPTION_PAYLOAD,
   for which KVM needs to set CR2 and DR6 as a response to ioctls such as
   KVM_GET_VCPU_EVENTS (even if the payload would end up in EXITINFO2
   rather than CR2, for example).  Only set CR2 and DR6 when consumption of
   the payload is imminent, but on the other hand force delivery of the
   payload in all paths where userspace retrieves CR2 or DR6.
 
 - Use vcpu->arch.cr2 when updating vmcb12's CR2 on nested #VMEXIT instead
   of vmcb02->save.cr2.  The value is out of sync after a save/restore
   or after a #PF is injected into L2.
 
 - Fix a class of nSVM bugs where some fields written by the CPU are not
   synchronized from vmcb02 to cached vmcb12 after VMRUN, and so are not
   up-to-date when saved by KVM_GET_NESTED_STATE.
 
 - Fix a class of bugs where the ordering between KVM_SET_NESTED_STATE and
   KVM_SET_{S}REGS could cause vmcb02 to be incorrectly initialized after
   save+restore.
 
 - Add a variety of missing nSVM consistency checks.
 
 - Fix several bugs where KVM failed to correctly update VMCB fields on
   nested #VMEXIT.
 
 - Fix several bugs where KVM failed to correctly synthesize #UD or #GP for
   SVM-related instructions.
 
 - Add support for save+restore of virtualized LBRs (on SVM).
 
 - Refactor various helpers and macros to improve clarity and (hopefully)
   make the code easier to maintain.
 
 - Aggressively sanitize fields when copying from vmcb12, to guard against
   unintentionally allowing L1 to utilize yet-to-be-defined features.
 
 - Fix several bugs where KVM botched rAX legality checks when emulating SVM
   instructions.  There are remaining issues in that KVM doesn't handle size
   prefix overrides for 64-bit guests.
 
 - Fail emulation of VMRUN/VMLOAD/VMSAVE if mapping vmcb12 fails instead of
   somewhat arbitrarily synthesizing #GP (i.e. don't double down on AMD's
   architectural but sketchy behavior of generating #GP for "unsupported"
   addresses).
 
 - Cache all used vmcb12 fields to further harden against TOCTOU bugs.
 
 x86 (Intel):
 
 - Drop obsolete branch hint prefixes from the VMX instruction macros.
 
 - Use ASM_INPUT_RM() in __vmcs_writel() to coerce clang into using a
   register input when appropriate.
 
 - Code cleanups.
 
 guest_memfd:
 
 - Don't mark guest_memfd folios as accessed, as guest_memfd doesn't support
   reclaim, the memory is unevictable, and there is no storage to write
   back to.
 
 LoongArch selftests:
 
 - Add KVM PMU test cases
 
 s390 selftests:
 
 - Enable more memory selftests.
 
 x86 selftests:
 
 - Add support for Hygon CPUs in KVM selftests.
 
 - Fix a bug in the MSR test where it would get false failures on AMD/Hygon
   CPUs with exactly one of RDPID or RDTSCP.
 
 - Add an MADV_COLLAPSE testcase for guest_memfd as a regression test for a
   bug where the kernel would attempt to collapse guest_memfd folios against
   KVM's will.
 -----BEGIN PGP SIGNATURE-----
 
 iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmnftRQUHHBib256aW5p
 QHJlZGhhdC5jb20ACgkQv/vSX3jHroPAzwf+NKO4Ktv+7A22ImN0SBl0nlUuulsz
 vTcw3+hxdRoIw83GdNS+hG5js0wrpMDnbv3t4+VliDNBSSxrBzcSWX2wpilW0Xtw
 qGo1MWhs2lKPy1NlaRVOwPS6j7uF3AR0TQ1iQLGMedQuCU9WpiKJxyhNXJdbLrt3
 8EgFzsvtEsv+jKNRUNDf9+d0j4gZsFyIe+Brhianbw+u3/UCiUClLCdsKPc4+5ZX
 08otYXytacGNIf/5Ev1vT4pHkHL0yqKXAtX7LEtaS3+0KrPuLjV4slemivzE9vf5
 Evafm5AhA4wpaNMb1ZerhY3T94lsMaJpWxotjR//0Q7C9B59pCQnXCm8mg==
 =CcE0
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull kvm updates from Paolo Bonzini:
 "Arm:

   - Add support for tracing in the standalone EL2 hypervisor code,
     which should help both debugging and performance analysis. This
     uses the new infrastructure for 'remote' trace buffers that can be
     exposed by non-kernel entities such as firmware, and which came
     through the tracing tree

   - Add support for GICv5 Per Processor Interrupts (PPIs), as the
     starting point for supporting the new GIC architecture in KVM

   - Finally add support for pKVM protected guests, where pages are
     unmapped from the host as they are faulted into the guest and can
     be shared back from the guest using pKVM hypercalls. Protected
     guests are created using a new machine type identifier. As the
     elusive guestmem has not yet delivered on its promises, anonymous
     memory is also supported

     This is only a first step towards full isolation from the host; for
     example, the CPU register state and DMA accesses are not yet
     isolated. Because this does not really yet bring fully what it
     promises, it is hidden behind CONFIG_ARM_PKVM_GUEST +
     'kvm-arm.mode=protected', and also triggers TAINT_USER when a VM is
     created. Caveat emptor

   - Rework the dreaded user_mem_abort() function to make it more
     maintainable, reducing the amount of state being exposed to the
     various helpers and rendering a substantial amount of state
     immutable

   - Expand the Stage-2 page table dumper to support NV shadow page
     tables on a per-VM basis

   - Tidy up the pKVM PSCI proxy code to be slightly less hard to
     follow

   - Fix both SPE and TRBE in non-VHE configurations so that they do not
     generate spurious, out of context table walks that ultimately lead
     to very bad HW lockups

   - A small set of patches fixing the Stage-2 MMU freeing in error
     cases

   - Tighten-up accepted SMC immediate value to be only #0 for host
     SMCCC calls

   - The usual cleanups and other selftest churn

  LoongArch:

   - Use CSR_CRMD_PLV for kvm_arch_vcpu_in_kernel()

   - Add DMSINTC irqchip in kernel support

  RISC-V:

   - Fix steal time shared memory alignment checks

   - Fix vector context allocation leak

   - Fix array out-of-bounds in pmu_ctr_read() and pmu_fw_ctr_read_hi()

   - Fix double-free of sdata in kvm_pmu_clear_snapshot_area()

   - Fix integer overflow in kvm_pmu_validate_counter_mask()

   - Fix shift-out-of-bounds in make_xfence_request()

   - Fix lost write protection on huge pages during dirty logging

   - Split huge pages during fault handling for dirty logging

   - Skip CSR restore if VCPU is reloaded on the same core

   - Implement kvm_arch_has_default_irqchip() for KVM selftests

   - Factored-out ISA checks into separate sources

   - Added hideleg to struct kvm_vcpu_config

   - Factored-out VCPU config into separate sources

   - Support configuration of per-VM HGATP mode from KVM user space

  s390:

   - Support for ESA (31-bit) guests inside nested hypervisors

   - Remove restriction on memslot alignment, which is not needed
     anymore with the new gmap code

   - Fix LPSW/E to update the bear (which of course is the breaking
     event address register)

  x86:

   - Shut up various UBSAN warnings on reading module parameter before
     they were initialized

   - Don't zero-allocate page tables that are used for splitting
     hugepages in the TDP MMU, as KVM is guaranteed to set all SPTEs in
     the page table and thus write all bytes

   - As an optimization, bail early when trying to unsync 4KiB mappings
     if the target gfn can just be mapped with a 2MiB hugepage

  x86 generic:

   - Copy single-chunk MMIO write values into struct kvm_vcpu (more
     precisely struct kvm_mmio_fragment) to fix use-after-free stack
     bugs where KVM would dereference stack pointer after an exit to
     userspace

   - Clean up and comment the emulated MMIO code to try to make it
     easier to maintain (not necessarily "easy", but "easier")

   - Move VMXON+VMXOFF and EFER.SVME toggling out of KVM (not *all* of
     VMX and SVM enabling) as it is needed for trusted I/O

   - Advertise support for AVX512 Bit Matrix Multiply (BMM) instructions

   - Immediately fail the build if a required #define is missing in one
     of KVM's headers that is included multiple times

   - Reject SET_GUEST_DEBUG with -EBUSY if there's an already injected
     exception, mostly to prevent syzkaller from abusing the uAPI to
     trigger WARNs, but also because it can help prevent userspace from
     unintentionally crashing the VM

   - Exempt SMM from CPUID faulting on Intel, as per the spec

   - Misc hardening and cleanup changes

  x86 (AMD):

   - Fix and optimize IRQ window inhibit handling for AVIC; make it
     per-vCPU so that KVM doesn't prematurely re-enable AVIC if multiple
     vCPUs have to-be-injected IRQs

   - Clean up and optimize the OSVW handling, avoiding a bug in which
     KVM would overwrite state when enabling virtualization on multiple
     CPUs in parallel. This should not be a problem because OSVW should
     usually be the same for all CPUs

   - Drop a WARN in KVM_MEMORY_ENCRYPT_REG_REGION where KVM complains
     about a "too large" size based purely on user input

   - Clean up and harden the pinning code for KVM_MEMORY_ENCRYPT_REG_REGION

   - Disallow synchronizing a VMSA of an already-launched/encrypted
     vCPU, as doing so for an SNP guest will crash the host due to an
     RMP violation page fault

   - Overhaul KVM's APIs for detecting SEV+ guests so that VM-scoped
     queries are required to hold kvm->lock, and enforce it by lockdep.
     Fix various bugs where sev_guest() was not ensured to be stable for
     the whole duration of a function or ioctl

   - Convert a pile of kvm->lock SEV code to guard()

   - Play nicer with userspace that does not enable
     KVM_CAP_EXCEPTION_PAYLOAD, for which KVM needs to set CR2 and DR6
     as a response to ioctls such as KVM_GET_VCPU_EVENTS (even if the
     payload would end up in EXITINFO2 rather than CR2, for example).
     Only set CR2 and DR6 when consumption of the payload is imminent,
     but on the other hand force delivery of the payload in all paths
     where userspace retrieves CR2 or DR6

   - Use vcpu->arch.cr2 when updating vmcb12's CR2 on nested #VMEXIT
     instead of vmcb02->save.cr2. The value is out of sync after a
     save/restore or after a #PF is injected into L2

   - Fix a class of nSVM bugs where some fields written by the CPU are
     not synchronized from vmcb02 to cached vmcb12 after VMRUN, and so
     are not up-to-date when saved by KVM_GET_NESTED_STATE

   - Fix a class of bugs where the ordering between KVM_SET_NESTED_STATE
     and KVM_SET_{S}REGS could cause vmcb02 to be incorrectly
     initialized after save+restore

   - Add a variety of missing nSVM consistency checks

   - Fix several bugs where KVM failed to correctly update VMCB fields
     on nested #VMEXIT

   - Fix several bugs where KVM failed to correctly synthesize #UD or
     #GP for SVM-related instructions

   - Add support for save+restore of virtualized LBRs (on SVM)

   - Refactor various helpers and macros to improve clarity and
     (hopefully) make the code easier to maintain

   - Aggressively sanitize fields when copying from vmcb12, to guard
     against unintentionally allowing L1 to utilize yet-to-be-defined
     features

   - Fix several bugs where KVM botched rAX legality checks when
     emulating SVM instructions. There are remaining issues in that KVM
     doesn't handle size prefix overrides for 64-bit guests

   - Fail emulation of VMRUN/VMLOAD/VMSAVE if mapping vmcb12 fails
     instead of somewhat arbitrarily synthesizing #GP (i.e. don't double
     down on AMD's architectural but sketchy behavior of generating #GP
     for "unsupported" addresses)

   - Cache all used vmcb12 fields to further harden against TOCTOU bugs

  x86 (Intel):

   - Drop obsolete branch hint prefixes from the VMX instruction macros

   - Use ASM_INPUT_RM() in __vmcs_writel() to coerce clang into using a
     register input when appropriate

   - Code cleanups

  guest_memfd:

   - Don't mark guest_memfd folios as accessed, as guest_memfd doesn't
     support reclaim, the memory is unevictable, and there is no storage
     to write back to

  LoongArch selftests:

   - Add KVM PMU test cases

  s390 selftests:

   - Enable more memory selftests

  x86 selftests:

   - Add support for Hygon CPUs in KVM selftests

   - Fix a bug in the MSR test where it would get false failures on
     AMD/Hygon CPUs with exactly one of RDPID or RDTSCP

   - Add an MADV_COLLAPSE testcase for guest_memfd as a regression test
     for a bug where the kernel would attempt to collapse guest_memfd
     folios against KVM's will"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (373 commits)
  KVM: x86: use inlines instead of macros for is_sev_*guest
  x86/virt: Treat SVM as unsupported when running as an SEV+ guest
  KVM: SEV: Goto an existing error label if charging misc_cg for an ASID fails
  KVM: SVM: Move lock-protected allocation of SEV ASID into a separate helper
  KVM: SEV: use mutex guard in snp_handle_guest_req()
  KVM: SEV: use mutex guard in sev_mem_enc_unregister_region()
  KVM: SEV: use mutex guard in sev_mem_enc_ioctl()
  KVM: SEV: use mutex guard in snp_launch_update()
  KVM: SEV: Assert that kvm->lock is held when querying SEV+ support
  KVM: SEV: Document that checking for SEV+ guests when reclaiming memory is "safe"
  KVM: SEV: Hide "struct kvm_sev_info" behind CONFIG_KVM_AMD_SEV=y
  KVM: SEV: WARN on unhandled VM type when initializing VM
  KVM: LoongArch: selftests: Add PMU overflow interrupt test
  KVM: LoongArch: selftests: Add basic PMU event counting test
  KVM: LoongArch: selftests: Add cpucfg read/write helpers
  LoongArch: KVM: Add DMSINTC inject msi to vCPU
  LoongArch: KVM: Add DMSINTC device support
  LoongArch: KVM: Make vcpu_is_preempted() as a macro rather than function
  LoongArch: KVM: Move host CSR_GSTAT save and restore in context switch
  LoongArch: KVM: Move host CSR_EENTRY save and restore in context switch
  ...
2026-04-17 07:18:03 -07:00
Linus Torvalds
440d6635b2 mm.git review status for linus..mm-nonmm-stable
Total patches:       126
 Reviews/patch:       0.92
 Reviewed rate:       76%
 
 - The 2 patch series "pid: make sub-init creation retryable" from Oleg
   Nesterov increases the robustness of our creation of init in a new
   namespace.  By clearing away some historical cruft which is no longer
   needed.  Also some documentation fixups are provided.
 
 - The 2 patch series "selftests/fchmodat2: Error handling and general"
   from Mark Brown has a fixup and a cleanup for the fchmodat2() syscall
   selftest.
 
 - The 3 patch series "lib: polynomial: Move to math/ and clean up" from
   Andy Shevchenko does as advertised.
 
 - The 3 patch series "hung_task: Provide runtime reset interface for
   hung task detector" from Aaron Tomlin gives administrators the ability
   to zero out /proc/sys/kernel/hung_task_detect_count.
 
 - The 2 patch series "tools/getdelays: use the static UAPI headers from
   tools/include/uapi" from Thomas Weißschuh teaches getdelays to use the
   in-kernel UAPI headers rather than the system-provided ones.
 
 - The 5 patch series "watchdog/hardlockup: Improvements to hardlockup"
   from Mayank Rungta provides several cleanups and fixups to the
   hardlockup detector code and its documentation.
 
 - The 2 patch series "lib/bch: fix undefined behavior from signed
   left-shifts" from Josh Law provides a couple of small/theoretical fixes
   in the bch code.
 
 - The 2 patch series "ocfs2/dlm: fix two bugs in dlm_match_regions()"
   from Junrui Luo does what is claims.
 
 - The 27 patch series "cleanup the RAID5 XOR library" from Christoph
   Hellwig is a quite far-reaching cleanup to this code.  I can't do better
   than to quote Christoph:
 
     The XOR library used for the RAID5 parity is a bit of a mess right
     now.  The main file sits in crypto/ despite not being cryptography and
     not using the crypto API, with the generic implementations sitting in
     include/asm-generic and the arch implementations sitting in an asm/
     header in theory.  The latter doesn't work for many cases, so
     architectures often build the code directly into the core kernel, or
     create another module for the architecture code.
 
     Change this to a single module in lib/ that also contains the
     architecture optimizations, similar to the library work Eric Biggers
     has done for the CRC and crypto libraries later.  After that it
     changes to better calling conventions that allow for smarter
     architecture implementations (although none is contained here yet),
     and uses static_call to avoid indirection function call overhead.
 
 - The 2 patch series "lib/list_sort: Clean up list_sort() scheduling
   workarounds" from Kuan-Wei Chiu cleans up this library code by removing
   a hacky thing which was added for UBIFS, which UBIFS doesn't actually
   need.
 
 - The 5 patch series "Fix bugs in extract_iter_to_sg()" from Christian
   Ehrhardt fixes a few bugs in the scatterlist code, adds in-kernel tests
   for the now-fixed bugs and fixes a leak in the test itself.
 
 - The 3 patch series "kdump: Enable LUKS-encrypted dump target support
   in ARM64 and PowerPC" from Coiby Xu eenables support of the
   LUKS-encrypted device dump target on arm64 and powerpc.
 
 - The 4 patch series "ocfs2: consolidate extent list validation into
   block read callbacks" from Joseph Qi addresses ocfs2's validation of
   extent list fields - cleanup, simplification, robustness.  (Kernel test
   robot loves mounting corrupted fs images!)
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCad90rQAKCRDdBJ7gKXxA
 jl7rAQD4/Rq7ZSSnEv6FS4gOwc3MgTdWcZZaXkqL1KiWyYhRwAEA+cVCO344+AKb
 znBOjet/hUr+/kBwyViifiC8LHzchwM=
 =Nfnf
 -----END PGP SIGNATURE-----

Merge tag 'mm-nonmm-stable-2026-04-15-04-20' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull non-MM updates from Andrew Morton:

 - "pid: make sub-init creation retryable" (Oleg Nesterov)

   Make creation of init in a new namespace more robust by clearing away
   some historical cruft which is no longer needed. Also some
   documentation fixups

 - "selftests/fchmodat2: Error handling and general" (Mark Brown)

   Fix and a cleanup for the fchmodat2() syscall selftest

 - "lib: polynomial: Move to math/ and clean up" (Andy Shevchenko)

 - "hung_task: Provide runtime reset interface for hung task detector"
   (Aaron Tomlin)

   Give administrators the ability to zero out
   /proc/sys/kernel/hung_task_detect_count

 - "tools/getdelays: use the static UAPI headers from
   tools/include/uapi" (Thomas Weißschuh)

   Teach getdelays to use the in-kernel UAPI headers rather than the
   system-provided ones

 - "watchdog/hardlockup: Improvements to hardlockup" (Mayank Rungta)

   Several cleanups and fixups to the hardlockup detector code and its
   documentation

 - "lib/bch: fix undefined behavior from signed left-shifts" (Josh Law)

   A couple of small/theoretical fixes in the bch code

 - "ocfs2/dlm: fix two bugs in dlm_match_regions()" (Junrui Luo)

 - "cleanup the RAID5 XOR library" (Christoph Hellwig)

   A quite far-reaching cleanup to this code. I can't do better than to
   quote Christoph:

     "The XOR library used for the RAID5 parity is a bit of a mess right
      now. The main file sits in crypto/ despite not being cryptography
      and not using the crypto API, with the generic implementations
      sitting in include/asm-generic and the arch implementations
      sitting in an asm/ header in theory. The latter doesn't work for
      many cases, so architectures often build the code directly into
      the core kernel, or create another module for the architecture
      code.

      Change this to a single module in lib/ that also contains the
      architecture optimizations, similar to the library work Eric
      Biggers has done for the CRC and crypto libraries later. After
      that it changes to better calling conventions that allow for
      smarter architecture implementations (although none is contained
      here yet), and uses static_call to avoid indirection function call
      overhead"

 - "lib/list_sort: Clean up list_sort() scheduling workarounds"
   (Kuan-Wei Chiu)

   Clean up this library code by removing a hacky thing which was added
   for UBIFS, which UBIFS doesn't actually need

 - "Fix bugs in extract_iter_to_sg()" (Christian Ehrhardt)

   Fix a few bugs in the scatterlist code, add in-kernel tests for the
   now-fixed bugs and fix a leak in the test itself

 - "kdump: Enable LUKS-encrypted dump target support in ARM64 and
   PowerPC" (Coiby Xu)

   Enable support of the LUKS-encrypted device dump target on arm64 and
   powerpc

 - "ocfs2: consolidate extent list validation into block read callbacks"
   (Joseph Qi)

   Cleanup, simplify, and make more robust ocfs2's validation of extent
   list fields (Kernel test robot loves mounting corrupted fs images!)

* tag 'mm-nonmm-stable-2026-04-15-04-20' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (127 commits)
  ocfs2: validate group add input before caching
  ocfs2: validate bg_bits during freefrag scan
  ocfs2: fix listxattr handling when the buffer is full
  doc: watchdog: fix typos etc
  update Sean's email address
  ocfs2: use get_random_u32() where appropriate
  ocfs2: split transactions in dio completion to avoid credit exhaustion
  ocfs2: remove redundant l_next_free_rec check in __ocfs2_find_path()
  ocfs2: validate extent block list fields during block read
  ocfs2: remove empty extent list check in ocfs2_dx_dir_lookup_rec()
  ocfs2: validate dx_root extent list fields during block read
  ocfs2: fix use-after-free in ocfs2_fault() when VM_FAULT_RETRY
  ocfs2: handle invalid dinode in ocfs2_group_extend
  .get_maintainer.ignore: add Askar
  ocfs2: validate bg_list extent bounds in discontig groups
  checkpatch: exclude forward declarations of const structs
  tools/accounting: handle truncated taskstats netlink messages
  taskstats: set version in TGID exit notifications
  ocfs2/heartbeat: fix slot mapping rollback leaks on error paths
  arm64,ppc64le/kdump: pass dm-crypt keys to kdump kernel
  ...
2026-04-16 20:11:56 -07:00
Yihan Ding
b960430ea8 bpf: allow UTF-8 literals in bpf_bprintf_prepare()
bpf_bprintf_prepare() only needs ASCII parsing for conversion
specifiers. Plain text can safely carry bytes >= 0x80, so allow
UTF-8 literals outside '%' sequences while keeping ASCII control
bytes rejected and format specifiers ASCII-only.

This keeps existing parsing rules for format directives unchanged,
while allowing helpers such as bpf_trace_printk() to emit UTF-8
literal text.

Update test_snprintf_negative() in the same commit so selftests keep
matching the new plain-text vs format-specifier split during bisection.

Fixes: 48cac3f4a9 ("bpf: Implement formatted output helpers with bstr_printf")
Signed-off-by: Yihan Ding <dingyihan@uniontech.com>
Acked-by: Paul Chaignon <paul.chaignon@gmail.com>
Link: https://lore.kernel.org/r/20260416120142.1420646-2-dingyihan@uniontech.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-04-16 15:53:32 -07:00
Mykyta Yatsenko
4d0a375887 bpf: Fix NULL deref in map_kptr_match_type for scalar regs
Commit ab6c637ad0 ("bpf: Fix a bpf_kptr_xchg() issue with local
kptr") refactored map_kptr_match_type() to branch on btf_is_kernel()
before checking base_type(). A scalar register stored into a kptr
slot has no btf, so the btf_is_kernel(reg->btf) call dereferences
NULL.

Move the base_type() != PTR_TO_BTF_ID guard before any reg->btf
access.

Fixes: ab6c637ad0 ("bpf: Fix a bpf_kptr_xchg() issue with local kptr")
Reported-by: Hiker Cl <clhiker365@gmail.com>
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=221372
Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
Acked-by: Paul Chaignon <paul.chaignon@gmail.com>
Link: https://lore.kernel.org/r/20260416-kptr_crash-v1-1-5589356584b4@meta.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-04-16 15:20:26 -07:00
Thomas Gleixner
4096fd0e8e clockevents: Add missing resets of the next_event_forced flag
The prevention mechanism against timer interrupt starvation missed to reset
the next_event_forced flag in a couple of places:

    - When the clock event state changes. That can cause the flag to be
      stale over a shutdown/startup sequence

    - When a non-forced event is armed, which then prevents rearming before
      that event. If that event is far out in the future this will cause
      missed timer interrupts.

    - In the suspend wakeup handler.

That led to stalls which have been reported by several people.

Add the missing resets, which fixes the problems for the reporters.

Fixes: d6e152d905 ("clockevents: Prevent timer interrupt starvation")
Reported-by: Hanabishi <i.r.e.c.c.a.k.u.n+kernel.org@gmail.com>
Reported-by: Eric Naim <dnaim@cachyos.org>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Tested-by: Hanabishi <i.r.e.c.c.a.k.u.n+kernel.org@gmail.com>
Tested-by: Eric Naim <dnaim@cachyos.org>
Cc: stable@vger.kernel.org
Closes: https://lore.kernel.org/68d1e9ac-2780-4be3-8ee3-0788062dd3a4@gmail.com
Link: https://patch.msgid.link/87340xfeje.ffs@tglx
2026-04-16 21:22:04 +02:00
Tomas Glozar
4245bf4dc5 tracing/osnoise: Add option to align tlat threads
Add an option called TIMERLAT_ALIGN to osnoise/options, together with a
corresponding setting osnoise/timerlat_align_us.

This option sets the alignment of wakeup times between different
timerlat threads, similarly to cyclictest's -A/--aligned option. If
TIMERLAT_ALIGN is set, the first thread that reaches the first cycle
records its first wake-up time. Each following thread sets its first
wake-up time to a fixed offset from the recorded time, and increments
it by the same offset.

Example:

osnoise/timerlat_period is set to 1000, osnoise/timerlat_align_us is
set to 20. There are four threads, on CPUs 1 to 4.

- CPU 4 enters first cycle first. The current time is 20000us, so
the wake-up of the first cycle is set to 21000us. This time is recorded.
- CPU 2 enter first cycle next. It reads the recorded time, increments
it to 21020us, and uses this value as its own wake-up time for the first
cycle.
- CPU 3 enters first cycle next. It reads the recorded time, increments
it to 21040 us, and uses the value as its own wake-up time.
- CPU 1 proceeds analogically.

In each next cycle, the wake-up time (called "absolute period" in
timerlat code) is incremented by the (relative) period of 1000us. Thus,
the wake-ups in the following cycles (provided the times are reached and
not in the past) will be as follows:

CPU 1		CPU 2		CPU 3	 	CPU 4
21080us		21020us		21040us		21000us
22080us		22020us		22040us		22000us
...		...		...		...

Even if any cycle is skipped due to e.g. the first cycle calculation
happening later, the alignment stays in place.

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: John Kacur <jkacur@redhat.com>
Cc: Luis Goncalves <lgoncalv@redhat.com>
Cc: Costa Shulyupin <costa.shul@redhat.com>
Link: https://patch.msgid.link/20260416115942.544032-1-tglozar@redhat.com
Signed-off-by: Tomas Glozar <tglozar@redhat.com>
Reviewed-by: Wander Lairson Costa <wander@redhat.com>
Reviewed-by: Crystal Wood <crwood@redhat.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2026-04-16 10:42:30 -04:00
Xu Kuohai
07ae6c130b bpf: Add helper to detect indirect jump targets
Introduce helper bpf_insn_is_indirect_target to check whether a BPF
instruction is an indirect jump target.

Since the verifier knows which instructions are indirect jump targets,
add a new flag indirect_target to struct bpf_insn_aux_data to mark
them. The verifier sets this flag when verifying an indirect jump target
instruction, and the helper checks the flag to determine whether an
instruction is an indirect jump target.

Reviewed-by: Anton Protopopov <a.s.protopopov@gmail.com> #v8
Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com> #v12
Signed-off-by: Xu Kuohai <xukuohai@huawei.com>
Link: https://lore.kernel.org/r/20260416064341.151802-4-xukuohai@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-04-16 07:03:40 -07:00
Xu Kuohai
d9ef13f727 bpf: Pass bpf_verifier_env to JIT
Pass bpf_verifier_env to bpf_int_jit_compile(). The follow-up patch will
use env->insn_aux_data in the JIT stage to detect indirect jump targets.

Since bpf_prog_select_runtime() can be called by cbpf and lib/test_bpf.c
code without verifier, introduce helper __bpf_prog_select_runtime()
to accept the env parameter.

Remove the call to bpf_prog_select_runtime() in bpf_prog_load(), and
switch to call __bpf_prog_select_runtime() in the verifier, with env
variable passed. The original bpf_prog_select_runtime() is preserved for
cbpf and lib/test_bpf.c, where env is NULL.

Now all constants blinding calls are moved into the verifier, except
the cbpf and lib/test_bpf.c cases. The instructions arrays are adjusted
by bpf_patch_insn_data() function for normal cases, so there is no need
to call adjust_insn_arrays() in bpf_jit_blind_constants(). Remove it.

Reviewed-by: Anton Protopopov <a.s.protopopov@gmail.com> # v8
Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com> # v12
Acked-by: Hengqi Chen <hengqi.chen@gmail.com> # v14
Signed-off-by: Xu Kuohai <xukuohai@huawei.com>
Link: https://lore.kernel.org/r/20260416064341.151802-3-xukuohai@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-04-16 07:03:40 -07:00
Xu Kuohai
d3e945223e bpf: Move constants blinding out of arch-specific JITs
During the JIT stage, constants blinding rewrites instructions but only
rewrites the private instruction copy of the JITed subprog, leaving the
global env->prog->insnsi and env->insn_aux_data untouched. This causes a
mismatch between subprog instructions and the global state, making it
difficult to use the global data in the JIT.

To avoid this mismatch, and given that all arch-specific JITs already
support constants blinding, move it to the generic verifier code, and
switch to rewrite the global env->prog->insnsi with the global states
adjusted, as other rewrites in the verifier do.

This removes the constants blinding calls in each JIT, which are largely
duplicated code across architectures.

Since constants blinding is only required for JIT, and there are two
JIT entry functions, jit_subprogs() for BPF programs with multiple
subprogs and bpf_prog_select_runtime() for programs with no subprogs,
move the constants blinding invocation into these two functions.

In the verifier path, bpf_patch_insn_data() is used to keep global
verifier auxiliary data in sync with patched instructions. A key
question is whether this global auxiliary data should be restored
on the failure path.

Besides instructions, bpf_patch_insn_data() adjusts:
  - prog->aux->poke_tab
  - env->insn_array_maps
  - env->subprog_info
  - env->insn_aux_data

For prog->aux->poke_tab, it is only used by JIT or only meaningful after
JIT succeeds, so it does not need to be restored on the failure path.

For env->insn_array_maps, when JIT fails, programs using insn arrays
are rejected by bpf_insn_array_ready() due to missing JIT addresses.
Hence, env->insn_array_maps is only meaningful for JIT and does not need
to be restored.

For subprog_info, if jit_subprogs fails and CONFIG_BPF_JIT_ALWAYS_ON
is not enabled, kernel falls back to interpreter. In this case,
env->subprog_info is used to determine subprogram stack depth. So it
must be restored on failure.

For env->insn_aux_data, it is freed by clear_insn_aux_data() at the
end of bpf_check(). Before freeing, clear_insn_aux_data() loops over
env->insn_aux_data to release jump targets recorded in it. The loop
uses env->prog->len as the array length, but this length no longer
matches the actual size of the adjusted env->insn_aux_data array after
constants blinding.

To address it, a simple approach is to keep insn_aux_data as adjusted
after failure, since it will be freed shortly, and record its actual size
for the loop in clear_insn_aux_data(). But since clear_insn_aux_data()
uses the same index to loop over both env->prog->insnsi and env->insn_aux_data,
this approach results in incorrect index for the insnsi array. So an
alternative approach is adopted: clone the original env->insn_aux_data
before blinding and restore it after failure, similar to env->prog.

For classic BPF programs, constants blinding works as before since it
is still invoked from bpf_prog_select_runtime().

Reviewed-by: Anton Protopopov <a.s.protopopov@gmail.com> # v8
Reviewed-by: Hari Bathini <hbathini@linux.ibm.com> # powerpc jit
Reviewed-by: Pu Lehui <pulehui@huawei.com> # riscv jit
Acked-by: Hengqi Chen <hengqi.chen@gmail.com> # loongarch jit
Signed-off-by: Xu Kuohai <xukuohai@huawei.com>
Link: https://lore.kernel.org/r/20260416064341.151802-2-xukuohai@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-04-16 07:03:40 -07:00
Linus Torvalds
fdbfee9fc5 Runtime Verification updates for 7.1:
- Refactor da_monitor header to share handlers across monitor types
 
   No functional changes, only less code duplication.
 
 - Add Hybrid Automata model class
 
   Add a new model class that extends deterministic automata by adding
   constraints on transitions and states. Those constraints can take into
   account wall-clock time and as such allow RV monitor to make
   assertions on real time. Add documentation and code generation
   scripts.
 
 - Add stall monitor as hybrid automaton example
 
   Add a monitor that triggers a violation when a task is stalling as an
   example of automaton working with real time variables.
 
 - Convert the opid monitor to a hybrid automaton
 
   The opid monitor can be heavily simplified if written as a hybrid
   automaton: instead of tracking preempt and interrupt enable/disable
   events, it can just run constraints on the preemption/interrupt
   states when events like wakeup and need_resched verify.
 
 - Add support for per-object monitors in DA/HA
 
   Allow writing deterministic and hybrid automata monitors for generic
   objects (e.g. any struct), by exploiting a hash table where objects
   are saved. This allows to track more than just tasks in RV. For
   instance it will be used to track deadline entities in deadline
   monitors.
 
 - Add deadline tracepoints and move some deadline utilities
 
   Prepare the ground for deadline monitors by defining events and
   exporting helpers.
 
 - Add nomiss deadline monitor
 
   Add first example of deadline monitor asserting all entities complete
   before their deadline.
 
 - Improve rvgen error handling
 
   Introduce AutomataError exception class and better handle expected
   exceptions while showing a backtrace for unexpected ones.
 
 - Improve python code quality in rvgen
 
   Refactor the rvgen generation scripts to align with python best
   practices: use f-strings instead of %, use len() instead of __len__(),
   remove semicolons, use context managers for file operations, fix
   whitespace violations, extract magic strings into constants, remove
   unused imports and methods.
 
 - Fix small bugs in rvgen
 
   The generator scripts presented some corner case bugs: logical error in
   validating what a correct dot file looks like, fix an isinstance()
   check, enforce a dot file has an initial state, fix type annotations
   and typos in comments.
 
 - rvgen refactoring
 
   Refactor automata.py to use iterator-based parsing and handle required
   arguments directly in argparse.
 
 - Allow epoll in rtapp-sleep monitor
 
   The epoll_wait call is now rt-friendly so it should be allowed in the
   sleep monitor as a valid sleep method.
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYKADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCad4kUxQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qiJgAPsGJ0TCTubdn2x6fpwLDKUcSsNsYrok
 m8gLHeK0rmkKNQD/ajUW+RBsOufXAsnSzUoUcl7sw1TSiQJ085U+0+ZeDgM=
 =WZqk
 -----END PGP SIGNATURE-----

Merge tag 'trace-rv-v7.1' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull runtime verification updates from Steven Rostedt:

 - Refactor da_monitor header to share handlers across monitor types

   No functional changes, only less code duplication.

 - Add Hybrid Automata model class

   Add a new model class that extends deterministic automata by adding
   constraints on transitions and states. Those constraints can take
   into account wall-clock time and as such allow RV monitor to make
   assertions on real time. Add documentation and code generation
   scripts.

 - Add stall monitor as hybrid automaton example

   Add a monitor that triggers a violation when a task is stalling as an
   example of automaton working with real time variables.

 - Convert the opid monitor to a hybrid automaton

   The opid monitor can be heavily simplified if written as a hybrid
   automaton: instead of tracking preempt and interrupt enable/disable
   events, it can just run constraints on the preemption/interrupt
   states when events like wakeup and need_resched verify.

 - Add support for per-object monitors in DA/HA

   Allow writing deterministic and hybrid automata monitors for generic
   objects (e.g. any struct), by exploiting a hash table where objects
   are saved. This allows to track more than just tasks in RV. For
   instance it will be used to track deadline entities in deadline
   monitors.

 - Add deadline tracepoints and move some deadline utilities

   Prepare the ground for deadline monitors by defining events and
   exporting helpers.

 - Add nomiss deadline monitor

   Add first example of deadline monitor asserting all entities complete
   before their deadline.

 - Improve rvgen error handling

   Introduce AutomataError exception class and better handle expected
   exceptions while showing a backtrace for unexpected ones.

 - Improve python code quality in rvgen

   Refactor the rvgen generation scripts to align with python best
   practices: use f-strings instead of %, use len() instead of
   __len__(), remove semicolons, use context managers for file
   operations, fix whitespace violations, extract magic strings into
   constants, remove unused imports and methods.

 - Fix small bugs in rvgen

   The generator scripts presented some corner case bugs: logical error
   in validating what a correct dot file looks like, fix an isinstance()
   check, enforce a dot file has an initial state, fix type annotations
   and typos in comments.

 - rvgen refactoring

   Refactor automata.py to use iterator-based parsing and handle
   required arguments directly in argparse.

 - Allow epoll in rtapp-sleep monitor

   The epoll_wait call is now rt-friendly so it should be allowed in the
   sleep monitor as a valid sleep method.

* tag 'trace-rv-v7.1' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: (32 commits)
  rv: Allow epoll in rtapp-sleep monitor
  rv/rvgen: fix _fill_states() return type annotation
  rv/rvgen: fix unbound loop variable warning
  rv/rvgen: enforce presence of initial state
  rv/rvgen: extract node marker string to class constant
  rv/rvgen: fix isinstance check in Variable.expand()
  rv/rvgen: make monitor arguments required in rvgen
  rv/rvgen: remove unused __get_main_name method
  rv/rvgen: remove unused sys import from dot2c
  rv/rvgen: refactor automata.py to use iterator-based parsing
  rv/rvgen: use class constant for init marker
  rv/rvgen: fix DOT file validation logic error
  rv/rvgen: fix PEP 8 whitespace violations
  rv/rvgen: fix typos in automata and generator docstring and comments
  rv/rvgen: use context managers for file operations
  rv/rvgen: remove unnecessary semicolons
  rv/rvgen: replace __len__() calls with len()
  rv/rvgen: replace % string formatting with f-strings
  rv/rvgen: remove bare except clauses in generator
  rv/rvgen: introduce AutomataError exception class
  ...
2026-04-15 17:15:18 -07:00
Eduard Zingerman
0251e40c48 bpf: copy BPF token from main program to subprograms
bpf_jit_subprogs() copies various fields from the main program's aux to
each subprogram's aux, but omits the BPF token. This causes
bpf_prog_kallsyms_add() to fail for subprograms loaded via BPF token,
as bpf_token_capable() falls back to capable() in init_user_ns when
token is NULL.

Copy prog->aux->token to func[i]->aux->token so that subprograms
inherit the same capability delegation as the main program.

Fixes: d79a354975 ("bpf: Consistently use BPF token throughout BPF verifier logic")
Signed-off-by: Tao Chen <ctao@meta.com>
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20260415-subprog-token-fix-v4-1-9bd000e8b068@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-04-15 16:46:47 -07:00
Linus Torvalds
e4bf304f00 ring-buffer updates for 7.1:
- Add remote buffers for pKVM
 
   pKVM has a hypervisor component that is used to protect the guest from the
   host kernel. This hypervisor is a black box to the kernel as the kernel is
   to user space. The remote buffers are used to have a memory mapping
   between the hypervisor and the kernel where kernel may send commands to
   enable tracing within the hypervisor. Then the kernel will read this
   memory mapping just like user space can read the memory mapped ring buffer
   of the kernel tracing system.
 
   Since the hypervisor only has a single context, it doesn't need to worry
   about races between normal context, interrupt context and NMIs like the
   kernel does. The ring buffer it uses doesn't need to be as complex. The
   remote buffers are a simple version of the ring buffer that works in a
   single context. They are still per-CPU and use sub buffers. The data
   layout is the same as the kernel's ring buffer to share the same parsing.
 
   Currently, only ARM64 implements pKVM, but there's work to implement it
   also in x86. The remote buffer code is separated out from the ARM
   implementation so that it can be used in the future by x86.
 
   The ARM64 updates for pKVM is in the ARM/KVM tree and it merged in the
   remote buffers of this tree.
 
 - Merge commit f35dbac694 ("ring-buffer: Fix to update per-subbuf entries of persistent ring buffer")`
 
   A fix was merged upstream that some new changes depended on. The upstream
   commit was merged into the ring buffer branch to fulfil the dependency.
 
 - Make the backup instance non reusable
 
   The backup instance is a copy of the persistent ring buffer so that the
   persistent ring buffer could start recording again without using the data
   from the previous boot. The backup isn't for normal tracing. It is made
   read-only, and after it is consumed, it is automatically removed.
 
 - Have backup copy persistent instance before it starts recording
 
   To allow the persistent ring buffer to start recording from the kernel
   command line commands, move the copy of the backup instance to before the
   the command line options start recording.
 
 - Report header_page overwrite field as "char" and not "int'
 
   The rust parser of the header_page file was triggering a warning when it
   defined the overwrite variable as "int" but it was only a single byte in
   size.
 
 - Fix memory barriers for the trace_buffer CPU mask
 
   When a CPU comes online, the bit is set to allow readers to know that the
   CPU buffer is allocated. The bit is set after the allocation is done, and
   a smp_wmb() is performed after the allocation and before the setting of
   the bit. But instead of adding a smp_rmb() to all readers, since once a
   buffer is created for a CPU it is not deleted if that CPU goes offline, so
   this allocation is almost always done at boot up before any readers exist.
 
   If for the unlikely case where a CPU comes online for the first time after
   the system boot has finished, send an IPI to all CPUs to force the
   smp_rmb() for each CPU.
 
 - Show clock function being used in debugging ring buffer data
 
   When the ring buffer checks are enabled and the ring buffer detects an
   inconsistency in the times of the invents, print out the clock being used
   when the error occurred. There was a very hard to hit bug that would
   happen every so often and it ended up being only triggered when the jiffies
   clock was being used. If the bug showed the clock being used, it would
   have been much easier to find the problem (which was an internal function
   was being traced which caused the clock accounting to go off).
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYKADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCad9S9xQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qpO8AQDwq2GKeups4aMfOjsUAXX9pIGWI9O1
 eerhmazgi1LJLQEAtKRNSddOj/7nOJ5hLillJH4uAQOnJBkAWtjlTSUnLg8=
 =HXkO
 -----END PGP SIGNATURE-----

Merge tag 'trace-ringbuffer-v7.1' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull ring-buffer updates from Steven Rostedt:

 - Add remote buffers for pKVM

   pKVM has a hypervisor component that is used to protect the guest
   from the host kernel. This hypervisor is a black box to the kernel as
   the kernel is to user space. The remote buffers are used to have a
   memory mapping between the hypervisor and the kernel where kernel may
   send commands to enable tracing within the hypervisor. Then the
   kernel will read this memory mapping just like user space can read
   the memory mapped ring buffer of the kernel tracing system.

   Since the hypervisor only has a single context, it doesn't need to
   worry about races between normal context, interrupt context and NMIs
   like the kernel does. The ring buffer it uses doesn't need to be as
   complex. The remote buffers are a simple version of the ring buffer
   that works in a single context. They are still per-CPU and use sub
   buffers. The data layout is the same as the kernel's ring buffer to
   share the same parsing.

   Currently, only ARM64 implements pKVM, but there's work to implement
   it also in x86. The remote buffer code is separated out from the ARM
   implementation so that it can be used in the future by x86.

   The ARM64 updates for pKVM is in the ARM/KVM tree and it merged in
   the remote buffers of this tree.

 - Make the backup instance non reusable

   The backup instance is a copy of the persistent ring buffer so that
   the persistent ring buffer could start recording again without using
   the data from the previous boot. The backup isn't for normal tracing.
   It is made read-only, and after it is consumed, it is automatically
   removed.

 - Have backup copy persistent instance before it starts recording

   To allow the persistent ring buffer to start recording from the
   kernel command line commands, move the copy of the backup instance to
   before the the command line options start recording.

 - Report header_page overwrite field as "char" and not "int'

   The rust parser of the header_page file was triggering a warning when
   it defined the overwrite variable as "int" but it was only a single
   byte in size.

 - Fix memory barriers for the trace_buffer CPU mask

   When a CPU comes online, the bit is set to allow readers to know that
   the CPU buffer is allocated. The bit is set after the allocation is
   done, and a smp_wmb() is performed after the allocation and before
   the setting of the bit. But instead of adding a smp_rmb() to all
   readers, since once a buffer is created for a CPU it is not deleted
   if that CPU goes offline, so this allocation is almost always done at
   boot up before any readers exist.

   If for the unlikely case where a CPU comes online for the first time
   after the system boot has finished, send an IPI to all CPUs to force
   the smp_rmb() for each CPU.

 - Show clock function being used in debugging ring buffer data

   When the ring buffer checks are enabled and the ring buffer detects
   an inconsistency in the times of the invents, print out the clock
   being used when the error occurred. There was a very hard to hit bug
   that would happen every so often and it ended up being only triggered
   when the jiffies clock was being used. If the bug showed the clock
   being used, it would have been much easier to find the problem (which
   was an internal function was being traced which caused the clock
   accounting to go off).

* tag 'trace-ringbuffer-v7.1' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: (26 commits)
  ring-buffer: Prevent off-by-one array access in ring_buffer_desc_page()
  ring-buffer: Report header_page overwrite as char
  tracing: Allow backup to save persistent ring buffer before it starts
  tracing/Documentation: Add a section about backup instance
  tracing: Remove the backup instance automatically after read
  tracing: Make the backup instance non-reusable
  ring-buffer: Enforce read ordering of trace_buffer cpumask and buffers
  ring-buffer: Show what clock function is used on timestamp errors
  tracing: Check for undefined symbols in simple_ring_buffer
  tracing: load/unload page callbacks for simple_ring_buffer
  Documentation: tracing: Add tracing remotes
  tracing: selftests: Add trace remote tests
  tracing: Add a trace remote module for testing
  tracing: Introduce simple_ring_buffer
  ring-buffer: Export buffer_data_page and macros
  tracing: Add helpers to create trace remote events
  tracing: Add events/ root files to trace remotes
  tracing: Add events to trace remotes
  tracing: Add init callback to trace remotes
  tracing: Add non-consuming read to trace remotes
  ...
2026-04-15 15:59:46 -07:00
Linus Torvalds
1521829632 ftrace updates for 7.1:
- Speed up ftrace_lookup_symbols() for single lookups
 
   The kallsyms lookup in ftrace_lookup_symbols() does a linear search over
   each symbol. This is fine when it must match multiple strings, but when
   there's only a single string being searched for, using a binary search is
   much more efficient. When a single string is passed in to search, use the
   binary search method.
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYKADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCad4RehQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6quZeAQD0rsWOGF0D970asst2WtCOoQF5G2ao
 l0ED7QZFoB1qcAEA7ZAuV9+zJwFOvhAekifuDEArPwpadlcUC4dU7tJU1w0=
 =cMjY
 -----END PGP SIGNATURE-----

Merge tag 'ftrace-v7.1' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull ftrace update from Steven Rostedt:

 - Speed up ftrace_lookup_symbols() for single lookups

   The kallsyms lookup in ftrace_lookup_symbols() does a linear search
   over each symbol. This is fine when it must match multiple strings,
   but when there's only a single string being searched for, using a
   binary search is much more efficient. When a single string is passed
   in to search, use the binary search method.

* tag 'ftrace-v7.1' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  ftrace: Use kallsyms binary search for single-symbol lookup
2026-04-15 15:57:11 -07:00
Linus Torvalds
aec2f682d4 This update includes the following changes:
API:
 
 - Replace crypto_get_default_rng with crypto_stdrng_get_bytes.
 - Remove simd skcipher support.
 - Allow algorithm types to be disabled when CRYPTO_SELFTESTS is off.
 
 Algorithms:
 
 - Remove CPU-based des/3des acceleration.
 - Add test vectors for authenc(hmac(md5),cbc(aes)).
 - Add test vectors for authenc(hmac(md5),cbc(des)).
 - Add test vectors for authenc(hmac(md5),rfc3686(ctr(aes))).
 - Add test vectors for authenc(hmac(sha1),rfc3686(ctr(aes))).
 - Add test vectors for authenc(hmac(sha224),rfc3686(ctr(aes))).
 - Add test vectors for authenc(hmac(sha256),rfc3686(ctr(aes))).
 - Add test vectors for authenc(hmac(sha384),rfc3686(ctr(aes))).
 - Add test vectors for authenc(hmac(sha512),rfc3686(ctr(aes))).
 - Replace spin lock with mutex in jitterentropy.
 
 Drivers:
 
 - Add authenc algorithms to safexcel.
 - Add support for zstd in qat.
 - Add wireless mode support for QAT GEN6.
 - Add anti-rollback support for QAT GEN6.
 - Add support for ctr(aes), gcm(aes), and ccm(aes) in dthev2.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEn51F/lCuNhUwmDeSxycdCkmxi6cFAmne7qgACgkQxycdCkmx
 i6cm0w/9HNFzIWuZWh4Q8k1d/SX32/2p40EMvlw9QFO8wt0gsMtbk6NN5G3sIfhL
 36+rT8Vo5yg9MahTqAspXKjP+QTev5D7/nsDa/FzOSA1JxyvBbgV7X33k8EZjcgT
 +ffuh0WbaWlutYw07o2h4cNPz1Yp4M0hp2IdzvY0Y3q9D05eiwis1SQzUVPmTs6K
 I6OP+4JjJbqubOgJxsltEoeCH9ZP0fObRWmAiVm6rwk9uX4CY32nzi3QOttXQ0su
 4F/useoRwWQ1t7FTy8/fcVtFpL/G8hAFSQ4un5ODhDWL7taV5sZPXQBwXUuoVQM6
 aNjZlaju/MB7gnAOrBvSsniohAAqRUNR8O7P8QW6mDrFmDhUZ3ZILmCKW+VwF5SG
 a4fV94XgBVOnKIqD01cc++8mb6keX/88KJW79AEWLeJ9YZ9BuyFphr9OEBFAIHqx
 xG+iEg4uoVxwC52//oGt/yZaZKK3C1y/Zey5bOjfErKq3ATXGIvawaAzdvB9mh6Q
 iAnl71JpR4mrs++fAyUCKM+dfvdmQYDq6HJayMdg+IHAIeIvyMnPjsGigdVJvE65
 RpBKW4aclfiYaDwX9Jf703mHR1uuKGP1GKpz8U+JXN4Ax2JPg0maC1N3wFkDypYO
 HUNKgEk/173f1HTjU0JjbqvqJh+rKQ3ZbHpLxZrYtnSMukDwRO0=
 =KoAB
 -----END PGP SIGNATURE-----

Merge tag 'v7.1-p1' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6

Pull crypto update from Herbert Xu:
 "API:
   - Replace crypto_get_default_rng with crypto_stdrng_get_bytes
   - Remove simd skcipher support
   - Allow algorithm types to be disabled when CRYPTO_SELFTESTS is off

  Algorithms:
   - Remove CPU-based des/3des acceleration
   - Add test vectors for authenc(hmac(md5),cbc({aes,des})) and
     authenc(hmac({md5,sha1,sha224,sha256,sha384,sha512}),rfc3686(ctr(aes)))
   - Replace spin lock with mutex in jitterentropy

  Drivers:
   - Add authenc algorithms to safexcel
   - Add support for zstd in qat
   - Add wireless mode support for QAT GEN6
   - Add anti-rollback support for QAT GEN6
   - Add support for ctr(aes), gcm(aes), and ccm(aes) in dthev2"

* tag 'v7.1-p1' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: (129 commits)
  crypto: af_alg - use sock_kmemdup in alg_setkey_by_key_serial
  crypto: vmx - remove CRYPTO_DEV_VMX from Kconfig
  crypto: omap - convert reqctx buffer to fixed-size array
  crypto: atmel-sha204a - add Thorsten Blum as maintainer
  crypto: atmel-ecc - add Thorsten Blum as maintainer
  crypto: qat - fix IRQ cleanup on 6xxx probe failure
  crypto: geniv - Remove unused spinlock from struct aead_geniv_ctx
  crypto: qce - simplify qce_xts_swapiv()
  crypto: hisilicon - Fix dma_unmap_single() direction
  crypto: talitos - rename first/last to first_desc/last_desc
  crypto: talitos - fix SEC1 32k ahash request limitation
  crypto: jitterentropy - replace long-held spinlock with mutex
  crypto: hisilicon - remove unused and non-public APIs for qm and sec
  crypto: hisilicon/qm - drop redundant variable initialization
  crypto: hisilicon/qm - remove else after return
  crypto: hisilicon/qm - add const qualifier to info_name in struct qm_cmd_dump_item
  crypto: hisilicon - fix the format string type error
  crypto: ccree - fix a memory leak in cc_mac_digest()
  crypto: qat - add support for zstd
  crypto: qat - use swab32 macro
  ...
2026-04-15 15:22:26 -07:00
Linus Torvalds
40286d6379 pci-v7.1-changes
-----BEGIN PGP SIGNATURE-----
 
 iQJIBAABCgAyFiEEgMe7l+5h9hnxdsnuWYigwDrT+vwFAmnfwfMUHGJoZWxnYWFz
 QGdvb2dsZS5jb20ACgkQWYigwDrT+vxwIRAAlN1h5er8aFDbjON5YMXBZqlQmzaC
 bjlUHgwm7HkdErTFozyuqhE8QUO1kCm4uMQzeyJdfY9nRWqMDOuKYxMD5j0exk+o
 4tbbJg6Xx4dq7Qrawy9PhxyQm/PDAcvs+FRRlGala+qq9o3fxPDOAZVDE/1C8qFQ
 Jd7GGd7NZn/NN4xrqST4RQHjO8fwaMwmksWCStsb79kfesQWP6kLADGfIMcWxNUB
 2s+oTnK6Hw0tkBv56n6i8mbb0EzS3/RN1daTevGAta1rmfUVVtWGRZ4paMvv0Owi
 Rl5+O5Jz6/c1qiXZbUqu5CRQPIy7Dr3JPvURcZX6qbsV8PzWXZr0Wi+geWefGOnp
 55y+3OT0vdBGAuXLJhrcU7Clzq9D/TZOt8oTI8IFArUfDlmrAIdozPn7gr+VGre5
 QuKymSk3XWtyIbe4o8UeZ4f9g0y6ZY1XvtvB7K1tze+OOmqlkfq966+z8aZuGOKx
 ZvAU/NIat5H02EgB4dEVOP8R5vPZlXGT0RLGl1JWRypPWyZDbVVA3z927qRQG5md
 IsVq8WaIrB1zyl9g37lZeEaYwP/qCIQsHkMGPYcP4wdOQEV9AQqi5pmjMXnWyQJD
 PR1nvmTKW7USRCJ+pz8xPhZh0cj3ENaddORTD3I/0CGVV0y452bU/5rr4T+K04bK
 PCJBpxTIDuWDwXc=
 =FFRz
 -----END PGP SIGNATURE-----

Merge tag 'pci-v7.1-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/pci/pci

Pull pci updates from Bjorn Helgaas:
 "Enumeration:

   - Allow TLP Processing Hints to be enabled for RCiEPs (George Abraham
     P)

   - Enable AtomicOps only if we know the Root Port supports them (Gerd
     Bayer)

   - Don't enable AtomicOps for RCiEPs since none of them need Atomic
     Ops and we can't tell whether the Root Complex would support them
     (Gerd Bayer)

   - Leave Precision Time Measurement disabled until a driver enables it
     to avoid PCIe errors (Mika Westerberg)

   - Make pci_set_vga_state() fail if bridge doesn't support VGA
     routing, i.e., PCI_BRIDGE_CTL_VGA is not writable, and return
     errors to vga_get() callers including userspace via
     /dev/vga_arbiter (Simon Richter)

   - Validate max-link-speed from DT in j721e, brcmstb, mediatek-gen3,
     rzg3s drivers (where the actual controller constraints are known),
     and remove validation from the generic OF DT accessor (Hans Zhang)

   - Remove pc110pad driver (no longer useful after 486 CPU support
     removed) and no_pci_devices() (pc110pad was the last user) (Dmitry
     Torokhov, Heiner Kallweit)

  Resource management:

   - Prevent assigning space to unimplemented bridge windows; previously
     we mistakenly assumed prefetchable window existed and assigned
     space and put a BAR there (Ahmed Naseef)

   - Avoid shrinking bridge windows to fit in the initial Root Port
     window; fixes one problem with devices with large BARs connected
     via switches, e.g., Thunderbolt (Ilpo Järvinen)

   - Pass full extent of empty space, not just the aligned space, to
     resource_alignf callback so free space before the requested
     alignment can be used (Ilpo Järvinen)

   - Place small resources before larger ones for better utilization of
     address space (Ilpo Järvinen)

   - Fix alignment calculation for resource size larger than align,
     e.g., bridge windows larger than the 1MB required alignment (Ilpo
     Järvinen)

  Reset:

   - Update slot handling so all ARI functions are treated as being in
     the same slot. They're all reset by Secondary Bus Reset, but
     previously drivers of ARI functions that appeared to be on a
     non-zero device weren't notified and fatal hardware errors could
     result (Keith Busch)

   - Make sysfs reset_subordinate hotplug safe to avoid spurious hotplug
     events (Keith Busch)

   - Hide Secondary Bus Reset ('bus') from sysfs reset_methods if masked
     by CXL because it has no effect (Vidya Sagar)

   - Avoid FLR for AMD NPU device, where it causes the device to hang
     (Lizhi Hou)

  Error handling:

   - Clear only error bits in PCIe Device Status to avoid accidentally
     clearing Emergency Power Reduction Detected (Shuai Xue)

   - Check for AER errors even in devices without drivers (Lukas Wunner)

   - Initialize ratelimit info so DPC and EDR paths log AER error
     information (Kuppuswamy Sathyanarayanan)

  Power control:

   - Add UPD720201/UPD720202 USB 3.0 xHCI Host Controller .compatible so
     generic pwrctrl driver can control it (Neil Armstrong)

  Hotplug:

   - Set LED_HW_PLUGGABLE for NPEM hotplug-capable ports so LED core
     doesn't complain when setting brightness fails because the endpoint
     is gone (Richard Cheng)

  Peer-to-peer DMA:

   - Allow wildcards in list of host bridges that support peer-to-peer
     DMA between hierarchy domains and add all Google SoCs (Jacob
     Moroni)

  Endpoint framework:

   - Advertise dynamic inbound mapping support in pci-epf-test and
     update host pci_endpoint_test to skip doorbell testing if not
     advertised by endpoint (Koichiro Den)

   - Return 0, not remaining timeout, when MHI eDMA ops complete so
     mhi_ep_ring_add_element() doesn't interpret non-zero as failure
     (Daniel Hodges)

   - Remove vntb and ntb duplicate resource teardown that leads to oops
     when .allow_link() fails or .drop_link() is called (Koichiro Den)

   - Disable vntb delayed work before clearing BAR mappings and
     doorbells to avoid oops caused by doing the work after resources
     have been torn down (Koichiro Den)

   - Add a way to describe reserved subregions within BARs, e.g.,
     platform-owned fixed register windows, and use it for the RK3588
     BAR4 DMA ctrl window (Koichiro Den)

   - Add BAR_DISABLED for BARs that will never be available to an EPF
     driver, and change some BAR_RESERVED annotations to BAR_DISABLED
     (Niklas Cassel)

   - Add NTB .get_dma_dev() callback for cases where DMA API requires a
     different device, e.g., vNTB devices (Koichiro Den)

   - Add reserved region types for MSI-X Table and PBA so Endpoint
     controllers can them as describe hardware-owned regions in a
     BAR_RESERVED BAR (Manikanta Maddireddy)

   - Make Tegra194/234 BAR0 programmable and remove 1MB size limit
     (Manikanta Maddireddy)

   - Expose Tegra BAR2 (MSI-X) and BAR4 (DMA) as 64-bit BAR_RESERVED
     (Manikanta Maddireddy)

   - Add Tegra194 and Tegra234 device table entries to pci_endpoint_test
     (Manikanta Maddireddy)

   - Skip the BAR subrange selftest if there are not enough inbound
     window resources to run the test (Christian Bruel)

  New native PCIe controller drivers:

   - Add DT binding and driver for Andes QiLai SoC PCIe host controller
     (Randolph Lin)

   - Add DT binding and driver for ESWIN PCIe Root Complex (Senchuan
     Zhang)

  Baikal T-1 PCIe controller driver:

   - Remove driver since it never quite became usable (Andy Shevchenko)

  Cadence PCIe controller driver:

   - Implement byte/word config reads with dword (32-bit) reads because
     some Cadence controllers don't support sub-dword accesses (Aksh
     Garg)

  CIX Sky1 PCIe controller driver:

   - Add 'power-domains' to DT binding for SCMI power domain (Gary Yang)

  Freescale i.MX6 PCIe controller driver:

   - Add i.MX94 and i.MX943 to fsl,imx6q-pcie-ep DT binding (Richard
     Zhu)

   - Delay instead of polling for L2/L3 Ready after PME_Turn_off when
     suspending i.MX6SX because LTSSM registers are inaccessible
     (Richard Zhu)

   - Separate PERST# assertion (for resetting endpoints) from core reset
     (for resetting the RC itself) to prepare for new DTs with PERST#
     GPIO in per-Root Port nodes (Sherry Sun)

   - Retain Root Port MSI capability on i.MX7D, i.MX8MM, and i.MX8MQ so
     MSI from downstream devices will work (Richard Zhu)

   - Fix i.MX95 reference clock source selection when internal refclk is
     used (Franz Schnyder)

  Freescale Layerscape PCIe controller driver:

   - Allow building as a removable module (Sascha Hauer)

  MediaTek PCIe Gen3 controller driver:

   - Use dev_err_probe() to simplify error paths and make deferred probe
     messages visible in /sys/kernel/debug/devices_deferred (Chen-Yu
     Tsai)

   - Power off device if setup fails (Chen-Yu Tsai)

   - Integrate new pwrctrl API to enable power control for WiFi/BT
     adapters on mainboard or in PCIe or M.2 slots (Chen-Yu Tsai)

  NVIDIA Tegra194 PCIe controller driver:

   - Poll less aggressively and non-atomically for PME_TO_Ack during
     transition to L2 (Vidya Sagar)

   - Disable LTSSM after transition to Detect on surprise link down to
     stop toggling between Polling and Detect (Manikanta Maddireddy)

   - Don't force the device into the D0 state before L2 when suspending
     or shutting down the controller (Vidya Sagar)

   - Disable PERST# IRQ only in Endpoint mode because it's not
     registered in Root Port mode (Manikanta Maddireddy)

   - Handle 'nvidia,refclk-select' as optional (Vidya Sagar)

   - Disable direct speed change in Endpoint mode so link speed change
     is controlled by the host (Vidya Sagar)

   - Set LTR values before link up to avoid bogus LTR messages with 0
     latency (Vidya Sagar)

   - Allow system suspend when the Endpoint link is down (Vidya Sagar)

   - Use DWC IP core version, not Tegra custom values, to avoid DWC core
     version check warnings (Manikanta Maddireddy)

   - Apply ECRC workaround to devices based on DesignWare 5.00a as well
     as 4.90a (Manikanta Maddireddy)

   - Disable PM Substate L1.2 in Endpoint mode to work around Tegra234
     erratum (Vidya Sagar)

   - Delay post-PERST# cleanup until core is powered on to avoid CBB
     timeout (Manikanta Maddireddy)

   - Assert CLKREQ# so switches that forward it to their downstream side
     can bring up those links successfully (Vidya Sagar)

   - Calibrate pipe to UPHY for Endpoint mode to reset stale PLL state
     from any previous bad link state (Vidya Sagar)

   - Remove IRQF_ONESHOT flag from Endpoint interrupt registration so
     DMA driver and Endpoint controller driver can share the interrupt
     line (Vidya Sagar)

   - Enable DMA interrupt to support DMA in both Root Port and Endpoint
     modes (Vidya Sagar)

   - Enable hardware link retraining after link goes down in Endpoint
     mode (Vidya Sagar)

   - Add DT binding and driver support for core clock monitoring (Vidya
     Sagar)

  Qualcomm PCIe controller driver:

   - Advertise 'Hot-Plug Capable' and set 'No Command Completed Support'
     since Qcom Root Ports support hotplug events like DL_Up/Down and
     can accept writes to Slot Control without delays between writes
     (Krishna Chaitanya Chundru)

  Renesas R-Car PCIe controller driver:

   - Mark Endpoint BAR0 and BAR2 as Resizable (Koichiro Den)

   - Reduce EPC BAR alignment requirement to 4K (Koichiro Den)

  Renesas RZ/G3S PCIe controller driver:

   - Add RZ/G3E to DT binding and to driver (John Madieu)

   - Assert (not deassert) resets in probe error path (John Madieu)

   - Assert resets in suspend path in reverse order they were deasserted
     during probe (John Madieu)

   - Rework inbound window algorithm to prevent mapping more than
     intended region and enforce alignment on size, to prepare for
     RZ/G3E support (John Madieu)

  Rockchip DesignWare PCIe controller driver:

   - Add tracepoints for PCIe controller LTSSM transitions and link rate
     changes (Shawn Lin)

   - Trace LTSSM events collected by the dw-rockchip debug FIFO (Shawn
     Lin)

  SOPHGO PCIe controller driver:

   - Disable ASPM L0s and L1 on Sophgo 2042 PCIe Root Ports that
     advertise support for them (Yao Zi)

  Synopsys DesignWare PCIe controller driver:

   - Continue with system suspend even if an Endpoint doesn't respond
     with PME_TO_Ack message (Manivannan Sadhasivam)

   - Set Endpoint MSI-X Table Size in the correct function of a
     multi-function device when configuring MSI-X, not in Function 0
     (Aksh Garg)

   - Set Max Link Width and Max Link Speed for all functions of a
     multi-function device, not just Function 0 (Aksh Garg)

   - Expose PCIe event counters in groups 5-7 in debugfs (Hans Zhang)

  Miscellaneous:

   - Warn only once about invalid ACS kernel parameter format (Richard
     Cheng)

   - Suppress FW_BUG warning when writing sysfs 'numa_node' with the
     current value (Li RongQing)

   - Drop redundant 'depends on PCI' from Kconfig (Julian Braha)"

* tag 'pci-v7.1-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/pci/pci: (165 commits)
  PCI/P2PDMA: Add Google SoCs to the P2P DMA host bridge list
  PCI/P2PDMA: Allow wildcard Device IDs in host bridge list
  PCI: sg2042: Avoid L0s and L1 on Sophgo 2042 PCIe Root Ports
  PCI: cadence: Add flags for disabling ASPM capability for broken Root Ports
  PCI: tegra194: Add core monitor clock support
  dt-bindings: PCI: tegra194: Add monitor clock support
  PCI: tegra194: Enable hardware hot reset mode in Endpoint mode
  PCI: tegra194: Enable DMA interrupt
  PCI: tegra194: Remove IRQF_ONESHOT flag during Endpoint interrupt registration
  PCI: tegra194: Calibrate pipe to UPHY for Endpoint mode
  PCI: tegra194: Assert CLKREQ# explicitly by default
  PCI: tegra194: Fix CBB timeout caused by DBI access before core power-on
  PCI: tegra194: Disable L1.2 capability of Tegra234 EP
  PCI: dwc: Apply ECRC workaround to DesignWare 5.00a as well
  PCI: tegra194: Use DWC IP core version
  PCI: tegra194: Free up Endpoint resources during remove()
  PCI: tegra194: Allow system suspend when the Endpoint link is not up
  PCI: tegra194: Set LTR message request before PCIe link up in Endpoint mode
  PCI: tegra194: Disable direct speed change for Endpoint mode
  PCI: tegra194: Use devm_gpiod_get_optional() to parse "nvidia,refclk-select"
  ...
2026-04-15 14:41:21 -07:00
Linus Torvalds
334fbe734e mm.git review status for linus..mm-stable
Everything:
 
 Total patches:       368
 Reviews/patch:       1.56
 Reviewed rate:       74%
 
 Excluding DAMON:
 
 Total patches:       316
 Reviews/patch:       1.77
 Reviewed rate:       81%
 
 Excluding DAMON and zram:
 
 Total patches:       306
 Reviews/patch:       1.81
 Reviewed rate:       82%
 
 Excluding DAMON, zram and maple_tree:
 
 Total patches:       276
 Reviews/patch:       2.01
 Reviewed rate:       91%
 
 Significant patch series in this merge:
 
 - The 30 patch series "maple_tree: Replace big node with maple copy"
   from Liam Howlett is mainly prepararatory work for ongoing development
   but it does reduce stack usage and is an improvement.
 
 - The 12 patch series "mm, swap: swap table phase III: remove swap_map"
   from Kairui Song offers memory savings by removing the static swap_map.
   It also yields some CPU savings and implements several cleanups.
 
 - The 2 patch series "mm: memfd_luo: preserve file seals" from Pratyush
   Yadav adds file seal preservation to LUO's memfd code.
 
 - The 2 patch series "mm: zswap: add per-memcg stat for incompressible
   pages" from Jiayuan Chen adds additional userspace stats reportng to
   zswap.
 
 - The 4 patch series "arch, mm: consolidate empty_zero_page" from Mike
   Rapoport implements some cleanups for our handling of ZERO_PAGE() and
   zero_pfn.
 
 - The 2 patch series "mm/kmemleak: Improve scan_should_stop()
   implementation" from Zhongqiu Han provides an robustness improvement and
   some cleanups in the kmemleak code.
 
 - The 4 patch series "Improve khugepaged scan logic" from Vernon Yang
   "improves the khugepaged scan logic and reduces CPU consumption by
   prioritizing scanning tasks that access memory frequently".
 
 - The 2 patch series "Make KHO Stateless" from Jason Miu simplifies
   Kexec Handover by "transitioning KHO from an xarray-based metadata
   tracking system with serialization to a radix tree data structure that
   can be passed directly to the next kernel"
 
 - The 3 patch series "mm: vmscan: add PID and cgroup ID to vmscan
   tracepoints" from Thomas Ballasi and Steven Rostedt enhances vmscan's
   tracepointing.
 
 - The 5 patch series "mm: arch/shstk: Common shadow stack mapping helper
   and VM_NOHUGEPAGE" from Catalin Marinas is a cleanup for the shadow
   stack code: remove per-arch code in favour of a generic implementation.
 
 - The 2 patch series "Fix KASAN support for KHO restored vmalloc
   regions" from Pasha Tatashin fixes a WARN() which can be emitted the KHO
   restores a vmalloc area.
 
 - The 4 patch series "mm: Remove stray references to pagevec" from Tal
   Zussman provides several cleanups, mainly udpating references to "struct
   pagevec", which became folio_batch three years ago.
 
 - The 17 patch series "mm: Eliminate fake head pages from vmemmap
   optimization" from Kiryl Shutsemau simplifies the HugeTLB vmemmap
   optimization (HVO) by changing how tail pages encode their relationship
   to the head page.
 
 - The 2 patch series "mm/damon/core: improve DAMOS quota efficiency for
   core layer filters" from SeongJae Park improves two problematic
   behaviors of DAMOS that makes it less efficient when core layer filters
   are used.
 
 - The 3 patch series "mm/damon: strictly respect min_nr_regions" from
   SeongJae Park improves DAMON usability by extending the treatment of the
   min_nr_regions user-settable parameter.
 
 - The 3 patch series "mm/page_alloc: pcp locking cleanup" from Vlastimil
   Babka is a proper fix for a previously hotfixed SMP=n issue.  Code
   simplifications and cleanups ennsed.
 
 - The 16 patch series "mm: cleanups around unmapping / zapping" from
   David Hildenbrand implements "a bunch of cleanups around unmapping and
   zapping.  Mostly simplifications, code movements, documentation and
   renaming of zapping functions".
 
 - The 6 patch series "support batched checking of the young flag for
   MGLRU" from Baolin Wang supports batched checking of the young flag for
   MGLRU.  It's part cleanups; one benchmark shows large performance
   benefits for arm64.
 
 - The 5 patch series "memcg: obj stock and slab stat caching cleanups"
   from Johannes Weiner provides memcg cleanup and robustness improvements.
 
 - The 5 patch series "Allow order zero pages in page reporting" from
   Yuvraj Sakshith enhances page_reporting's free page reporting - it is
   presently and undesirably order-0 pages when reporting free memory.
 
 - The 6 patch series "mm: vma flag tweaks" from Lorenzo Stoakes is
   cleanup work following from the recent conversion of the VMA flags to a
   bitmap.
 
 - The 10 patch series "mm/damon: add optional debugging-purpose sanity
   checks" from SeongJae Park adds some more developer-facing debug checks
   into DAMON core.
 
 - The 2 patch series "mm/damon: test and document power-of-2
   min_region_sz requirement" from SeongJae Park adds an additional DAMON
   kunit test and makes some adjustments to the addr_unit parameter
   handling.
 
 - The 3 patch series "mm/damon/core: make passed_sample_intervals
   comparisons overflow-safe" from SeongJae Park fixes a hard-to-hit time
   overflow issue in DAMON core.
 
 - The 7 patch series "mm/damon: improve/fixup/update ratio calculation,
   test and documentation" from SeongJae Park is a "batch of misc/minor
   improvements and fixups" for DAMON.
 
 - The 4 patch series "mm: move vma_(kernel|mmu)_pagesize() out of
   hugetlb.c" from David Hildenbrand fixes a possible issue with dax-device
   when CONFIG_HUGETLB=n.  Some code movement was required.
 
 - The 6 patch series "zram: recompression cleanups and tweaks" from
   Sergey Senozhatsky provides "a somewhat random mix of fixups,
   recompression cleanups and improvements" in the zram code.
 
 - The 11 patch series "mm/damon: support multiple goal-based quota
   tuning algorithms" from SeongJae Park extend DAMOS quotas goal
   auto-tuning to support multiple tuning algorithms that users can select.
 
 - The 4 patch series "mm: thp: reduce unnecessary
   start_stop_khugepaged()" from Breno Leitao fixes the khugpaged sysfs
   handling so we no longer spam the logs with reams of junk when
   starting/stopping khugepaged.
 
 - The 3 patch series "mm: improve map count checks" from Lorenzo Stoakes
   provides some cleanups and slight fixes in the mremap, mmap and vma
   code.
 
 - The 5 patch series "mm/damon: support addr_unit on default monitoring
   targets for modules" from SeongJae Park extends the use of DAMON core's
   addr_unit tunable.
 
 - The 5 patch series "mm: khugepaged cleanups and mTHP prerequisites"
   from Nico Pache provides cleanups in the khugepaged and is a base for
   Nico's planned khugepaged mTHP support.
 
 - The 15 patch series "mm: memory hot(un)plug and SPARSEMEM cleanups"
   from David Hildenbrand implements code movement and cleanups in the
   memhotplug and sparsemem code.
 
 - The 2 patch series "mm: remove CONFIG_ARCH_ENABLE_MEMORY_HOTREMOVE and
   cleanup CONFIG_MIGRATION" from David Hildenbrand rationalizes some
   memhotplug Kconfig support.
 
 - The 6 patch series "change young flag check functions to return bool"
   from Baolin Wang is "a cleanup patchset to change all young flag check
   functions to return bool".
 
 - The 3 patch series "mm/damon/sysfs: fix memory leak and NULL
   dereference issues" from Josh Law and SeongJae Park fixes a few
   potential DAMON bugs.
 
 - The 25 patch series "mm/vma: convert vm_flags_t to vma_flags_t in vma
   code" from "converts a lot of the existing use of the legacy vm_flags_t
   data type to the new vma_flags_t type which replaces it".  Mainly in the
   vma code.
 
 - The 21 patch series "mm: expand mmap_prepare functionality and usage"
   from Lorenzo Stoakes "expands the mmap_prepare functionality, which is
   intended to replace the deprecated f_op->mmap hook which has been the
   source of bugs and security issues for some time".  Cleanups,
   documentation, extension of mmap_prepare into filesystem drivers.
 
 - The 13 patch series "mm/huge_memory: refactor zap_huge_pmd()" from
   Lorenzo Stoakes simplifies and cleans up zap_huge_pmd().  Additional
   cleanups around vm_normal_folio_pmd() and the softleaf functionality are
   performed.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCad3HDQAKCRDdBJ7gKXxA
 jrUQAPwNhPk5nPSxnyxjAeQtOBHqgCdnICeEismLajPKd9aYRgEA0s2XAu3tSUYi
 GrBnWImHG3s4ePQxVcPCegWTsOUrXgQ=
 =1Q7o
 -----END PGP SIGNATURE-----

Merge tag 'mm-stable-2026-04-13-21-45' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull MM updates from Andrew Morton:

 - "maple_tree: Replace big node with maple copy" (Liam Howlett)

   Mainly prepararatory work for ongoing development but it does reduce
   stack usage and is an improvement.

 - "mm, swap: swap table phase III: remove swap_map" (Kairui Song)

   Offers memory savings by removing the static swap_map. It also yields
   some CPU savings and implements several cleanups.

 - "mm: memfd_luo: preserve file seals" (Pratyush Yadav)

   File seal preservation to LUO's memfd code

 - "mm: zswap: add per-memcg stat for incompressible pages" (Jiayuan
   Chen)

   Additional userspace stats reportng to zswap

 - "arch, mm: consolidate empty_zero_page" (Mike Rapoport)

   Some cleanups for our handling of ZERO_PAGE() and zero_pfn

 - "mm/kmemleak: Improve scan_should_stop() implementation" (Zhongqiu
   Han)

   A robustness improvement and some cleanups in the kmemleak code

 - "Improve khugepaged scan logic" (Vernon Yang)

   Improve khugepaged scan logic and reduce CPU consumption by
   prioritizing scanning tasks that access memory frequently

 - "Make KHO Stateless" (Jason Miu)

   Simplify Kexec Handover by transitioning KHO from an xarray-based
   metadata tracking system with serialization to a radix tree data
   structure that can be passed directly to the next kernel

 - "mm: vmscan: add PID and cgroup ID to vmscan tracepoints" (Thomas
   Ballasi and Steven Rostedt)

   Enhance vmscan's tracepointing

 - "mm: arch/shstk: Common shadow stack mapping helper and
   VM_NOHUGEPAGE" (Catalin Marinas)

   Cleanup for the shadow stack code: remove per-arch code in favour of
   a generic implementation

 - "Fix KASAN support for KHO restored vmalloc regions" (Pasha Tatashin)

   Fix a WARN() which can be emitted the KHO restores a vmalloc area

 - "mm: Remove stray references to pagevec" (Tal Zussman)

   Several cleanups, mainly udpating references to "struct pagevec",
   which became folio_batch three years ago

 - "mm: Eliminate fake head pages from vmemmap optimization" (Kiryl
   Shutsemau)

   Simplify the HugeTLB vmemmap optimization (HVO) by changing how tail
   pages encode their relationship to the head page

 - "mm/damon/core: improve DAMOS quota efficiency for core layer
   filters" (SeongJae Park)

   Improve two problematic behaviors of DAMOS that makes it less
   efficient when core layer filters are used

 - "mm/damon: strictly respect min_nr_regions" (SeongJae Park)

   Improve DAMON usability by extending the treatment of the
   min_nr_regions user-settable parameter

 - "mm/page_alloc: pcp locking cleanup" (Vlastimil Babka)

   The proper fix for a previously hotfixed SMP=n issue. Code
   simplifications and cleanups ensued

 - "mm: cleanups around unmapping / zapping" (David Hildenbrand)

   A bunch of cleanups around unmapping and zapping. Mostly
   simplifications, code movements, documentation and renaming of
   zapping functions

 - "support batched checking of the young flag for MGLRU" (Baolin Wang)

   Batched checking of the young flag for MGLRU. It's part cleanups; one
   benchmark shows large performance benefits for arm64

 - "memcg: obj stock and slab stat caching cleanups" (Johannes Weiner)

   memcg cleanup and robustness improvements

 - "Allow order zero pages in page reporting" (Yuvraj Sakshith)

   Enhance free page reporting - it is presently and undesirably order-0
   pages when reporting free memory.

 - "mm: vma flag tweaks" (Lorenzo Stoakes)

   Cleanup work following from the recent conversion of the VMA flags to
   a bitmap

 - "mm/damon: add optional debugging-purpose sanity checks" (SeongJae
   Park)

   Add some more developer-facing debug checks into DAMON core

 - "mm/damon: test and document power-of-2 min_region_sz requirement"
   (SeongJae Park)

   An additional DAMON kunit test and makes some adjustments to the
   addr_unit parameter handling

 - "mm/damon/core: make passed_sample_intervals comparisons
   overflow-safe" (SeongJae Park)

   Fix a hard-to-hit time overflow issue in DAMON core

 - "mm/damon: improve/fixup/update ratio calculation, test and
   documentation" (SeongJae Park)

   A batch of misc/minor improvements and fixups for DAMON

 - "mm: move vma_(kernel|mmu)_pagesize() out of hugetlb.c" (David
   Hildenbrand)

   Fix a possible issue with dax-device when CONFIG_HUGETLB=n. Some code
   movement was required.

 - "zram: recompression cleanups and tweaks" (Sergey Senozhatsky)

   A somewhat random mix of fixups, recompression cleanups and
   improvements in the zram code

 - "mm/damon: support multiple goal-based quota tuning algorithms"
   (SeongJae Park)

   Extend DAMOS quotas goal auto-tuning to support multiple tuning
   algorithms that users can select

 - "mm: thp: reduce unnecessary start_stop_khugepaged()" (Breno Leitao)

   Fix the khugpaged sysfs handling so we no longer spam the logs with
   reams of junk when starting/stopping khugepaged

 - "mm: improve map count checks" (Lorenzo Stoakes)

   Provide some cleanups and slight fixes in the mremap, mmap and vma
   code

 - "mm/damon: support addr_unit on default monitoring targets for
   modules" (SeongJae Park)

   Extend the use of DAMON core's addr_unit tunable

 - "mm: khugepaged cleanups and mTHP prerequisites" (Nico Pache)

   Cleanups to khugepaged and is a base for Nico's planned khugepaged
   mTHP support

 - "mm: memory hot(un)plug and SPARSEMEM cleanups" (David Hildenbrand)

   Code movement and cleanups in the memhotplug and sparsemem code

 - "mm: remove CONFIG_ARCH_ENABLE_MEMORY_HOTREMOVE and cleanup
   CONFIG_MIGRATION" (David Hildenbrand)

   Rationalize some memhotplug Kconfig support

 - "change young flag check functions to return bool" (Baolin Wang)

   Cleanups to change all young flag check functions to return bool

 - "mm/damon/sysfs: fix memory leak and NULL dereference issues" (Josh
   Law and SeongJae Park)

   Fix a few potential DAMON bugs

 - "mm/vma: convert vm_flags_t to vma_flags_t in vma code" (Lorenzo
   Stoakes)

   Convert a lot of the existing use of the legacy vm_flags_t data type
   to the new vma_flags_t type which replaces it. Mainly in the vma
   code.

 - "mm: expand mmap_prepare functionality and usage" (Lorenzo Stoakes)

   Expand the mmap_prepare functionality, which is intended to replace
   the deprecated f_op->mmap hook which has been the source of bugs and
   security issues for some time. Cleanups, documentation, extension of
   mmap_prepare into filesystem drivers

 - "mm/huge_memory: refactor zap_huge_pmd()" (Lorenzo Stoakes)

   Simplify and clean up zap_huge_pmd(). Additional cleanups around
   vm_normal_folio_pmd() and the softleaf functionality are performed.

* tag 'mm-stable-2026-04-13-21-45' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (369 commits)
  mm: fix deferred split queue races during migration
  mm/khugepaged: fix issue with tracking lock
  mm/huge_memory: add and use has_deposited_pgtable()
  mm/huge_memory: add and use normal_or_softleaf_folio_pmd()
  mm: add softleaf_is_valid_pmd_entry(), pmd_to_softleaf_folio()
  mm/huge_memory: separate out the folio part of zap_huge_pmd()
  mm/huge_memory: use mm instead of tlb->mm
  mm/huge_memory: remove unnecessary sanity checks
  mm/huge_memory: deduplicate zap deposited table call
  mm/huge_memory: remove unnecessary VM_BUG_ON_PAGE()
  mm/huge_memory: add a common exit path to zap_huge_pmd()
  mm/huge_memory: handle buggy PMD entry in zap_huge_pmd()
  mm/huge_memory: have zap_huge_pmd return a boolean, add kdoc
  mm/huge: avoid big else branch in zap_huge_pmd()
  mm/huge_memory: simplify vma_is_specal_huge()
  mm: on remap assert that input range within the proposed VMA
  mm: add mmap_action_map_kernel_pages[_full]()
  uio: replace deprecated mmap hook with mmap_prepare in uio_info
  drivers: hv: vmbus: replace deprecated mmap hook with mmap_prepare
  mm: allow handling of stacked mmap_prepare hooks in more drivers
  ...
2026-04-15 12:59:16 -07:00
Alexei Starovoitov
4fddde2a73 bpf: Fix use-after-free in arena_vm_close on fork
arena_vm_open() only bumps vml->mmap_count but never registers the
child VMA in arena->vma_list. The vml->vma always points at the
parent VMA, so after parent munmap the pointer dangles. If the child
then calls bpf_arena_free_pages(), zap_pages() reads the stale
vml->vma triggering use-after-free.

Fix this by preventing the arena VMA from being inherited across
fork with VM_DONTCOPY, and preventing VMA splits via the may_split
callback.

Also reject mremap with a .mremap callback returning -EINVAL. A
same-size mremap(MREMAP_FIXED) on the full arena VMA reaches
copy_vma() through the following path:

  check_prep_vma()       - returns 0 early: new_len == old_len
                           skips VM_DONTEXPAND check
  prep_move_vma()        - vm_start == old_addr and
                           vm_end == old_addr + old_len
                           so may_split is never called
  move_vma()
    copy_vma_and_data()
      copy_vma()
        vm_area_dup()    - copies vm_private_data (vml pointer)
        vm_ops->open()   - bumps vml->mmap_count
      vm_ops->mremap()   - returns -EINVAL, rollback unmaps new VMA

The refcount ensures the rollback's arena_vm_close does not free
the vml shared with the original VMA.

Reported-by: Weiming Shi <bestswngs@gmail.com>
Reported-by: Xiang Mei <xmei5@asu.edu>
Fixes: 317460317a ("bpf: Introduce bpf_arena.")
Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>
Link: https://lore.kernel.org/r/20260413194245.21449-1-alexei.starovoitov@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-04-15 12:08:55 -07:00
Linus Torvalds
5bdb4078e1 sched_ext: Changes for v7.1
- Cgroup sub-scheduler groundwork. Multiple BPF schedulers can be
   attached to cgroups and the dispatch path is made hierarchical. This
   involves substantial restructuring of the core dispatch, bypass,
   watchdog, and dump paths to be per-scheduler, along with new
   infrastructure for scheduler ownership enforcement, lifecycle
   management, and cgroup subtree iteration. The enqueue path is not yet
   updated and will follow in a later cycle.
 
 - scx_bpf_dsq_reenq() generalized to support any DSQ including remote
   local DSQs and user DSQs. Built on top of this, SCX_ENQ_IMMED
   guarantees that tasks dispatched to local DSQs either run immediately
   or get reenqueued back through ops.enqueue(), giving schedulers tighter
   control over queueing latency. Also useful for opportunistic CPU
   sharing across sub-schedulers.
 
 - ops.dequeue() was only invoked when the core knew a task was in BPF
   data structures, missing scheduling property change events and skipping
   callbacks for non-local DSQ dispatches from ops.select_cpu(). Fixed to
   guarantee exactly one ops.dequeue() call when a task leaves BPF
   scheduler custody.
 
 - Kfunc access validation moved from runtime to BPF verifier time,
   removing runtime mask enforcement.
 
 - Idle SMT sibling prioritization in the idle CPU selection path.
 
 - Documentation, selftest, and tooling updates. Misc bug fixes and
   cleanups.
 
 - Merges from tip/sched-core, cgroup/for-7.1, and for-7.0-fixes to
   resolve dependencies and conflicts for the above changes.
 -----BEGIN PGP SIGNATURE-----
 
 iIQEABYKACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCad0uaA4cdGpAa2VybmVs
 Lm9yZwAKCRCxYfJx3gVYGbktAQD2HrKdydyEfefz/n4mNpIXh/DFYX49NgKYcgUh
 sKy4ngD/Sy7nAZS2zwM+36PN6jBV7+cfuoaiKPgCstPFeGsvPwU=
 =fsgj
 -----END PGP SIGNATURE-----

Merge tag 'sched_ext-for-7.1' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext

Pull sched_ext updates from Tejun Heo:

 - cgroup sub-scheduler groundwork

   Multiple BPF schedulers can be attached to cgroups and the dispatch
   path is made hierarchical. This involves substantial restructuring of
   the core dispatch, bypass, watchdog, and dump paths to be
   per-scheduler, along with new infrastructure for scheduler ownership
   enforcement, lifecycle management, and cgroup subtree iteration

   The enqueue path is not yet updated and will follow in a later cycle

 - scx_bpf_dsq_reenq() generalized to support any DSQ including remote
   local DSQs and user DSQs

   Built on top of this, SCX_ENQ_IMMED guarantees that tasks dispatched
   to local DSQs either run immediately or get reenqueued back through
   ops.enqueue(), giving schedulers tighter control over queueing
   latency

   Also useful for opportunistic CPU sharing across sub-schedulers

 - ops.dequeue() was only invoked when the core knew a task was in BPF
   data structures, missing scheduling property change events and
   skipping callbacks for non-local DSQ dispatches from ops.select_cpu()

   Fixed to guarantee exactly one ops.dequeue() call when a task leaves
   BPF scheduler custody

 - Kfunc access validation moved from runtime to BPF verifier time,
   removing runtime mask enforcement

 - Idle SMT sibling prioritization in the idle CPU selection path

 - Documentation, selftest, and tooling updates. Misc bug fixes and
   cleanups

* tag 'sched_ext-for-7.1' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext: (134 commits)
  tools/sched_ext: Add explicit cast from void* in RESIZE_ARRAY()
  sched_ext: Make string params of __ENUM_set() const
  tools/sched_ext: Kick home CPU for stranded tasks in scx_qmap
  sched_ext: Drop spurious warning on kick during scheduler disable
  sched_ext: Warn on task-based SCX op recursion
  sched_ext: Rename scx_kf_allowed_on_arg_tasks() to scx_kf_arg_task_ok()
  sched_ext: Remove runtime kfunc mask enforcement
  sched_ext: Add verifier-time kfunc context filter
  sched_ext: Drop redundant rq-locked check from scx_bpf_task_cgroup()
  sched_ext: Decouple kfunc unlocked-context check from kf_mask
  sched_ext: Fix ops.cgroup_move() invocation kf_mask and rq tracking
  sched_ext: Track @p's rq lock across set_cpus_allowed_scx -> ops.set_cpumask
  sched_ext: Add select_cpu kfuncs to scx_kfunc_ids_unlocked
  sched_ext: Drop TRACING access to select_cpu kfuncs
  selftests/sched_ext: Fix wrong DSQ ID in peek_dsq error message
  sched_ext: Documentation: improve accuracy of task lifecycle pseudo-code
  selftests/sched_ext: Improve runner error reporting for invalid arguments
  sched_ext: Documentation: Fix scx_bpf_move_to_local kfunc name
  sched_ext: Documentation: Add ops.dequeue() to task lifecycle
  tools/sched_ext: Fix off-by-one in scx_sdt payload zeroing
  ...
2026-04-15 10:54:24 -07:00
Linus Torvalds
7de6b4a246 workqueue: Changes for v7.1
- New default WQ_AFFN_CACHE_SHARD affinity scope subdivides LLCs into
   smaller shards to improve scalability on machines with many CPUs per
   LLC.
 
 - Misc: system_dfl_long_wq for long unbound works, devm_alloc_workqueue()
   for device-managed allocation, sysfs exposure for ordered workqueues and
   the EFI workqueue, removal of HK_TYPE_WQ from wq_unbound_cpumask, and
   various small fixes.
 -----BEGIN PGP SIGNATURE-----
 
 iIQEABYKACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCad0npw4cdGpAa2VybmVs
 Lm9yZwAKCRCxYfJx3gVYGdZmAQD4BbIhTGKGcq89jwQRQpmUIZK6yIIWwd0cSvLC
 Biko2AD9FP2M9bqUzo2cZ83AfSC4LTK020e9VmsZStkw+u0s3ws=
 =cSEW
 -----END PGP SIGNATURE-----

Merge tag 'wq-for-7.1' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq

Pull workqueue updates from Tejun Heo:

 - New default WQ_AFFN_CACHE_SHARD affinity scope subdivides LLCs into
   smaller shards to improve scalability on machines with many CPUs per
   LLC

 - Misc:
    - system_dfl_long_wq for long unbound works
    - devm_alloc_workqueue() for device-managed allocation
    - sysfs exposure for ordered workqueues and the EFI workqueue
    - removal of HK_TYPE_WQ from wq_unbound_cpumask
    - various small fixes

* tag 'wq-for-7.1' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: (21 commits)
  workqueue: validate cpumask_first() result in llc_populate_cpu_shard_id()
  workqueue: use NR_STD_WORKER_POOLS instead of hardcoded value
  workqueue: avoid unguarded 64-bit division
  docs: workqueue: document WQ_AFFN_CACHE_SHARD affinity scope
  workqueue: add test_workqueue benchmark module
  tools/workqueue: add CACHE_SHARD support to wq_dump.py
  workqueue: set WQ_AFFN_CACHE_SHARD as the default affinity scope
  workqueue: add WQ_AFFN_CACHE_SHARD affinity scope
  workqueue: fix typo in WQ_AFFN_SMT comment
  workqueue: Remove HK_TYPE_WQ from affecting wq_unbound_cpumask
  workqueue: unlink pwqs from wq->pwqs list in alloc_and_link_pwqs() error path
  workqueue: Remove NULL wq WARN in __queue_delayed_work()
  workqueue: fix parse_affn_scope() prefix matching bug
  workqueue: devres: Add device-managed allocate workqueue
  workqueue: Add system_dfl_long_wq for long unbound works
  tools/workqueue/wq_dump.py: add NODE prefix to all node columns
  tools/workqueue/wq_dump.py: fix column alignment in node_nr/max_active section
  tools/workqueue/wq_dump.py: remove backslash separator from node_nr/max_active header
  efi: Allow to expose the workqueue via sysfs
  workqueue: Allow to expose ordered workqueues via sysfs
  ...
2026-04-15 10:32:08 -07:00
Linus Torvalds
b71f0be2d2 cgroup: Changes for v7.1
- cgroup_file_notify() locking converted from a global lock to
   per-cgroup_file spinlock with a lockless fast-path when no notification
   is needed.
 
 - Misc changes including exposing cgroup helpers for sched_ext and minor
   fixes.
 -----BEGIN PGP SIGNATURE-----
 
 iIQEABYKACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCad0heg4cdGpAa2VybmVs
 Lm9yZwAKCRCxYfJx3gVYGTFVAP0USl50aZ1SA7Gq84Qp/5v2EN5oH4lVqTlEbPti
 AMOV5wD+JpYS0BnLhj+Q2jElu3Jyb4drf3h5xYHhf5NS2O60EAE=
 =j2ad
 -----END PGP SIGNATURE-----

Merge tag 'cgroup-for-7.1' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup

Pull cgroup updates from Tejun Heo:

 - cgroup_file_notify() locking converted from a global lock to
   per-cgroup_file spinlock with a lockless fast-path when no
   notification is needed

 - Misc changes including exposing cgroup helpers for sched_ext and
   minor fixes

* tag 'cgroup-for-7.1' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
  cgroup/rdma: fix swapped arguments in pr_warn() format string
  cgroup/dmem: remove region parameter from dmemcg_parse_limit
  cgroup: replace global cgroup_file_kn_lock with per-cgroup_file lock
  cgroup: add lockless fast-path checks to cgroup_file_notify()
  cgroup: reduce cgroup_file_kn_lock hold time in cgroup_file_notify()
  cgroup: Expose some cgroup helpers
2026-04-15 10:18:49 -07:00
Eduard Zingerman
ecdd4fd8a5 bpf: fix arg tracking for imprecise/multi-offset BPF_ST/STX
BPF_STX through ARG_IMPRECISE dst should be recognized as a local
spill and join at_stack with the written value. For example,
consider the following situation:

   // r1 = ARG_IMPRECISE{mask=BIT(0)|BIT(1)}
   *(u64 *)(r1 + 0) = r8

Here the analysis should produce an equivalent of

  at_stack[*] = join(old, r8)

BPF_ST through multi-offset or imprecise dst should join at_stack with
none instead of overwriting the slots. For example, consider the
following situation:

   // r1 = ARG_IMPRECISE{mask=BIT(0)|BIT(1)}
   *(u64 *)(r1 + 0) = 0

Here the analysis should produce an equivalent of

  at_stack[*r1] = join(old, none).

Move the definition of the clear_overlapping_stack_slots() in order to
have __arg_track_join() visible. Remove the OFF_IMPRECISE constant to
avoid having two ways to express imprecise offset.
Only 'offset-imprecise {frame=N, cnt=0}' remains.

Fixes: bf0c571f7f ("bpf: introduce forward arg-tracking dataflow analysis")
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20260413-stacklive-fixes-v2-1-398e126e5cf3@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-04-15 08:40:47 -07:00
Yiyang Chen
16c4f0211a taskstats: set version in TGID exit notifications
delay accounting started populating taskstats records with a valid version
field via fill_pid() and fill_tgid().

Later, commit ad4ecbcba7 ("[PATCH] delay accounting taskstats interface
send tgid once") changed the TGID exit path to send the cached
signal->stats aggregate directly instead of building the outgoing record
through fill_tgid().  Unlike fill_tgid(), fill_tgid_exit() only
accumulates accounting data and never initializes stats->version.

As a result, TGID exit notifications can reach userspace with version == 0
even though PID exit notifications and TASKSTATS_CMD_GET replies carry a
valid taskstats version.

This is easy to reproduce with `tools/accounting/getdelays.c`.

I have a small follow-up patch for that tool which:

1. increases the receive buffer/message size so the pid+tgid
   combined exit notification is not dropped/truncated

2. prints `stats->version`.

With that patch, the reproducer is:

  Terminal 1:
    ./getdelays -d -v -l -m 0

  Terminal 2:
    taskset -c 0 python3 -c 'import threading,time; t=threading.Thread(target=time.sleep,args=(0.1,)); t.start(); t.join()'

That produces both PID and TGID exit notifications for the same
process.  The PID exit record reports a valid taskstats version, while
the TGID exit record reports `version 0`.


This patch (of 2):

Set stats->version = TASKSTATS_VERSION after copying the cached TGID
aggregate into the outgoing netlink payload so all taskstats records are
self-describing again.

Link: https://lkml.kernel.org/r/ba83d934e59edd431b693607de573eb9ca059309.1774810498.git.cyyzero16@gmail.com
Fixes: ad4ecbcba7 ("[PATCH] delay accounting taskstats interface send tgid once")
Signed-off-by: Yiyang Chen <cyyzero16@gmail.com>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: Dr. Thomas Orgis <thomas.orgis@uni-hamburg.de>
Cc: Fan Yu <fan.yu9@zte.com.cn>
Cc: Wang Yaxin <wang.yaxin@zte.com.cn>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-15 02:15:02 -07:00
Linus Torvalds
5c0f43e853 kernel-7.1-rc1.misc
Please consider pulling these changes from the signed kernel-7.1-rc1.misc tag.
 
 Thanks!
 Christian
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCad41SAAKCRCRxhvAZXjc
 ou3SAQD3QUQObaY7NvJIxwm72jtO2lY6jF03Pyimt4J9yicXZQEAxkHvPpSwBoLx
 n5lnzBJy9t8JQxMEkw+IL8vmGsSMZw0=
 =RGh3
 -----END PGP SIGNATURE-----

Merge tag 'kernel-7.1-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull pid_namespace updates from Christian Brauner:

 - pid_namespace: make init creation more flexible

   Annotate ->child_reaper accesses with {READ,WRITE}_ONCE() to protect
   the unlocked readers from cpu/compiler reordering, and enforce that
   pid 1 in a pid namespace is always the first allocated pid (the
   set_tid path already required this).

   On top of that, allow opening pid_for_children before the pid
   namespace init has been created. This lets one process create the pid
   namespace and a different process create the init via setns(), which
   makes clone3(set_tid) usable in all cases evenly and is particularly
   useful to CRIU when restoring nested containers.

   A new selftest covers both the basic create-pidns-then-init flow and
   the cross-process variant, and a MAINTAINERS entry for the pid
   namespace code is added.

 - unrelated signal cleanup: update outdated comment for the removed
   freezable_schedule()

* tag 'kernel-7.1-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  signal: update outdated comment for removed freezable_schedule()
  MAINTAINERS: add a pid namespace entry
  selftests: Add tests for creating pidns init via setns
  pid_namespace: allow opening pid_for_children before init was created
  pid: check init is created first after idr alloc
  pid_namespace: avoid optimization of accesses to ->child_reaper
2026-04-14 20:28:40 -07:00
Linus Torvalds
7c8a4671dc vfs-7.1-rc1.mount.v2
Please consider pulling these changes from the signed vfs-7.1-rc1.mount.v2 tag.
 
 Thanks!
 Christian
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCad3vFgAKCRCRxhvAZXjc
 onXwAQDwEGvpMUUiuI/JWFqCA5vY5LXXr/36wdcs0iUL1uy9IgEAyOdnYhYkcaX1
 3lm87f6OmYkhlq6enJbco7uT4CUzlQA=
 =1Ls8
 -----END PGP SIGNATURE-----

Merge tag 'vfs-7.1-rc1.mount.v2' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull vfs mount updates from Christian Brauner:

 - Add FSMOUNT_NAMESPACE flag to fsmount() that creates a new mount
   namespace with the newly created filesystem attached to a copy of the
   real rootfs. This returns a namespace file descriptor instead of an
   O_PATH mount fd, similar to how OPEN_TREE_NAMESPACE works for
   open_tree().

   This allows creating a new filesystem and immediately placing it in a
   new mount namespace in a single operation, which is useful for
   container runtimes and other namespace-based isolation mechanisms.

   This accompanies OPEN_TREE_NAMESPACE and avoids a needless detour via
   OPEN_TREE_NAMESPACE to get the same effect. Will be especially useful
   when you mount an actual filesystem to be used as the container
   rootfs.

 - Currently, creating a new mount namespace always copies the entire
   mount tree from the caller's namespace. For containers and sandboxes
   that intend to build their mount table from scratch this is wasteful:
   they inherit a potentially large mount tree only to immediately tear
   it down.

   This series adds support for creating a mount namespace that contains
   only a clone of the root mount, with none of the child mounts. Two
   new flags are introduced:

     - CLONE_EMPTY_MNTNS (0x400000000) for clone3(), using the 64-bit flag space
     - UNSHARE_EMPTY_MNTNS (0x00100000) for unshare()

   Both flags imply CLONE_NEWNS. The resulting namespace contains a
   single nullfs root mount with an immutable empty directory. The
   intended workflow is to then mount a real filesystem (e.g., tmpfs)
   over the root and build the mount table from there.

 - Allow MOVE_MOUNT_BENEATH to target the caller's rootfs, allowing to
   switch out the rootfs without pivot_root(2).

   The traditional approach to switching the rootfs involves
   pivot_root(2) or a chroot_fs_refs()-based mechanism that atomically
   updates fs->root for all tasks sharing the same fs_struct. This has
   consequences for fork(), unshare(CLONE_FS), and setns().

   This series instead decomposes root-switching into individually
   atomic, locally-scoped steps:

	fd_tree = open_tree(-EBADF, "/newroot", OPEN_TREE_CLONE | OPEN_TREE_CLOEXEC);
	fchdir(fd_tree);
	move_mount(fd_tree, "", AT_FDCWD, "/", MOVE_MOUNT_BENEATH | MOVE_MOUNT_F_EMPTY_PATH);
	chroot(".");
	umount2(".", MNT_DETACH);

   Since each step only modifies the caller's own state, the
   fork/unshare/setns races are eliminated by design.

   A key step to making this possible is to remove the locked mount
   restriction. Originally MOVE_MOUNT_BENEATH doesn't support mounting
   beneath a mount that is locked. The locked mount protects the
   underlying mount from being revealed. This is a core mechanism of
   unshare(CLONE_NEWUSER | CLONE_NEWNS). The mounts in the new mount
   namespace become locked. That effectively makes the new mount table
   useless as the caller cannot ever get rid of any of the mounts no
   matter how useless they are.

   We can lift this restriction though. We simply transfer the locked
   property from the top mount to the mount beneath. This works because
   what we care about is to protect the underlying mount aka the parent.
   The mount mounted between the parent and the top mount takes over the
   job of protecting the parent mount from the top mount mount. This
   leaves us free to remove the locked property from the top mount which
   can consequently be unmounted:

	unshare(CLONE_NEWUSER | CLONE_NEWNS)

   and we inherit a clone of procfs on /proc then currently we cannot
   unmount it as:

	umount -l /proc

   will fail with EINVAL because the procfs mount is locked.

   After this series we can now do:

	mount --beneath -t tmpfs tmpfs /proc
	umount -l /proc

   after which a tmpfs mount has been placed beneath the procfs mount.
   The tmpfs mount has become locked and the procfs mount has become
   unlocked.

   This means you can safely modify an inherited mount table after
   unprivileged namespace creation.

   Afterwards we simply make it possible to move a mount beneath the
   rootfs allowing to upgrade the rootfs.

   Removing the locked restriction makes this very useful for containers
   created with unshare(CLONE_NEWUSER | CLONE_NEWNS) to reshuffle an
   inherited mount table safely and MOVE_MOUNT_BENEATH makes it possible
   to switch out the rootfs instead of using the costly pivot_root(2).

* tag 'vfs-7.1-rc1.mount.v2' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  selftests/namespaces: remove unused utils.h include from listns_efault_test
  selftests/fsmount_ns: add missing TARGETS and fix cap test
  selftests/empty_mntns: fix wrong CLONE_EMPTY_MNTNS hex value in comment
  selftests/empty_mntns: fix statmount_alloc() signature mismatch
  selftests/statmount: remove duplicate wait_for_pid()
  mount: always duplicate mount
  selftests/filesystems: add MOVE_MOUNT_BENEATH rootfs tests
  move_mount: allow MOVE_MOUNT_BENEATH on the rootfs
  move_mount: transfer MNT_LOCKED
  selftests/filesystems: add clone3 tests for empty mount namespaces
  selftests/filesystems: add tests for empty mount namespaces
  namespace: allow creating empty mount namespaces
  selftests: add FSMOUNT_NAMESPACE tests
  selftests/statmount: add statmount_alloc() helper
  tools: update mount.h header
  mount: add FSMOUNT_NAMESPACE
  mount: simplify __do_loopback()
  mount: start iterating from start of rbtree
2026-04-14 19:59:25 -07:00
Linus Torvalds
f5ad410100 bpf-next-7.1
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE+soXsSLHKoYyzcli6rmadz2vbToFAmndDWsACgkQ6rmadz2v
 bTr/jw//WQ+IowvstytntSbZFhSSKjwUP1J0oz/wAyKxvly+sBQADBQkljqNaEju
 Kq48CPWftJXG45x3O5P4GSYOuBnd9nwDS/hM6jA9f3Ok4IEOHAHCxLot0uq52iJa
 ieGeJTUEGKFUUEiTuImt/0+Y3aeRQFV0f484+WcmCpdm+cqIXxRnxsMMFuovM4Uj
 VUgYaooZteaOcnhZpaX/4bWiXM7x7FibLu9gPu9fyyHJIiVrJD+sMhb/UZtsODZO
 gywy9GNs93Xm9ZoRSTpWA4pAvRajqa8DEtLlV8fx4LpvYdHIjdByiTR9CeKHYxrB
 vcV1Ty6dGTd6ifFtW6ul1qaF9KeZXQBHxCTmhj4ITek1TMNDfJJD+Iwgc1ll9RL4
 RoZ8DJC8Qp2RDH+3b/ptBgfROw1nrwQLuw5cG7mj5mhQdu/z9AMI2ifPk9wv56Zj
 OV6wRnDcwFu5SLBUNCMd/ypnigKdWcSHCNvWo2HTtcy771b/fqz60K8dMcIWKH5B
 3qvXEBHbSdf48D6t64nOyVuo8RKSIizER5Mj/baabcJqZKoAtVUo2l2vd63hX/OD
 v/y51NvI0lH6cOMLka3LHVIVJInOFSKgOUa1aaKQ0KDjQDRRmmy8yY9h6RZ+aHWb
 78K7oCNRx/SCLdslYFGSTQdbiI4/JVoDc6cWtHy413m5+L1447A=
 =k6te
 -----END PGP SIGNATURE-----

Merge tag 'bpf-next-7.1' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next

Pull bpf updates from Alexei Starovoitov:

 - Welcome new BPF maintainers: Kumar Kartikeya Dwivedi, Eduard
   Zingerman while Martin KaFai Lau reduced his load to Reviwer.

 - Lots of fixes everywhere from many first time contributors. Thank you
   All.

 - Diff stat is dominated by mechanical split of verifier.c into
   multiple components:

    - backtrack.c: backtracking logic and jump history
    - states.c:    state equivalence
    - cfg.c:       control flow graph, postorder, strongly connected
                   components
    - liveness.c:  register and stack liveness
    - fixups.c:    post-verification passes: instruction patching, dead
                   code removal, bpf_loop inlining, finalize fastcall

   8k line were moved. verifier.c still stands at 20k lines.

   Further refactoring is planned for the next release.

 - Replace dynamic stack liveness with static stack liveness based on
   data flow analysis.

   This improved the verification time by 2x for some programs and
   equally reduced memory consumption. New logic is in liveness.c and
   supported by constant folding in const_fold.c (Eduard Zingerman,
   Alexei Starovoitov)

 - Introduce BTF layout to ease addition of new BTF kinds (Alan Maguire)

 - Use kmalloc_nolock() universally in BPF local storage (Amery Hung)

 - Fix several bugs in linked registers delta tracking (Daniel Borkmann)

 - Improve verifier support of arena pointers (Emil Tsalapatis)

 - Improve verifier tracking of register bounds in min/max and tnum
   domains (Harishankar Vishwanathan, Paul Chaignon, Hao Sun)

 - Further extend support for implicit arguments in the verifier (Ihor
   Solodrai)

 - Add support for nop,nop5 instruction combo for USDT probes in libbpf
   (Jiri Olsa)

 - Support merging multiple module BTFs (Josef Bacik)

 - Extend applicability of bpf_kptr_xchg (Kaitao Cheng)

 - Retire rcu_trace_implies_rcu_gp() (Kumar Kartikeya Dwivedi)

 - Support variable offset context access for 'syscall' programs (Kumar
   Kartikeya Dwivedi)

 - Migrate bpf_task_work and dynptr to kmalloc_nolock() (Mykyta
   Yatsenko)

 - Fix UAF in in open-coded task_vma iterator (Puranjay Mohan)

* tag 'bpf-next-7.1' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (241 commits)
  selftests/bpf: cover short IPv4/IPv6 inputs with adjust_room
  bpf: reject short IPv4/IPv6 inputs in bpf_prog_test_run_skb
  selftests/bpf: Use memfd_create instead of shm_open in cgroup_iter_memcg
  selftests/bpf: Add test for cgroup storage OOB read
  bpf: Fix OOB in pcpu_init_value
  selftests/bpf: Fix reg_bounds to match new tnum-based refinement
  selftests/bpf: Add tests for non-arena/arena operations
  bpf: Allow instructions with arena source and non-arena dest registers
  bpftool: add missing fsession to the usage and docs of bpftool
  docs/bpf: add missing fsession attach type to docs
  bpf: add missing fsession to the verifier log
  bpf: Move BTF checking logic into check_btf.c
  bpf: Move backtracking logic to backtrack.c
  bpf: Move state equivalence logic to states.c
  bpf: Move check_cfg() into cfg.c
  bpf: Move compute_insn_live_regs() into liveness.c
  bpf: Move fixup/post-processing logic from verifier.c into fixups.c
  bpf: Simplify do_check_insn()
  bpf: Move checks for reserved fields out of the main pass
  bpf: Delete unused variable
  ...
2026-04-14 18:04:04 -07:00
Linus Torvalds
88b29f3f57 Modules changes for v7.1-rc1
Kernel symbol flags:
 
   - Replace the separate *_gpl symbol sections (__ksymtab_gpl and
     __kcrctab_gpl) with a unified symbol table and a new
     __kflagstab section. This section stores symbol flags, such as
     the GPL-only flag, as an 8-bit bitset for each exported symbol.
     This is a cleanup that simplifies symbol lookup in the module
     loader by avoiding table fragmentation and will allow a cleaner
     way to add more flags later if needed.
 
 Module signature UAPI:
 
   - Move struct module_signature to the UAPI headers to allow reuse
     by tools outside the kernel proper, such as kmod and
     scripts/sign-file. This also renames a few constants for clarity
     and drops unused signature types as preparation for hash-based
     module integrity checking work that's in progress.
 
 Sysfs:
 
   - Add a /sys/module/<module>/import_ns sysfs attribute to show
     the symbol namespaces imported by loaded modules. This makes it
     easier to verify driver API access at runtime on systems that
     care about such things (e.g. Android).
 
 Cleanups and fixes:
 
   - Force sh_addr to 0 for all sections in module.lds. This prevents
     non-zero section addresses when linking modules with ld.bfd -r,
     which confused elfutils.
 
   - Fix a memory leak of charp module parameters on module unload
     when the kernel is configured with CONFIG_SYSFS=n.
 
   - Override the -EEXIST error code returned by module_init() to
     userspace. This prevents confusion with the errno reserved by
     the module loader to indicate that a module is already loaded.
 
   - Simplify the warning message and drop the stack dump on positive
     returns from module_init().
 
   - Drop unnecessary extern keywords from function declarations and
     synchronize parse_args() arguments with their implementation.
 
 Signed-off-by: Sami Tolvanen <samitolvanen@google.com>
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQSE9au1u/dCZerzchhaByWrOaGnegUCadmI0gAKCRBaByWrOaGn
 euC6AQCpeQGQv/Z1Pu9DmBRaRD1MjXg1K1J8DN3qH7L8FbWDwAD9FtzAHw9GPOOP
 0aQpDvcYKjdrU8OiuqtENvhzCV1RTA4=
 =YaHp
 -----END PGP SIGNATURE-----

Merge tag 'modules-7.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/modules/linux

Pull module updates from Sami Tolvanen:
 "Kernel symbol flags:

   - Replace the separate *_gpl symbol sections (__ksymtab_gpl and
     __kcrctab_gpl) with a unified symbol table and a new __kflagstab
     section.

     This section stores symbol flags, such as the GPL-only flag, as an
     8-bit bitset for each exported symbol. This is a cleanup that
     simplifies symbol lookup in the module loader by avoiding table
     fragmentation and will allow a cleaner way to add more flags later
     if needed.

  Module signature UAPI:

   - Move struct module_signature to the UAPI headers to allow reuse by
     tools outside the kernel proper, such as kmod and
     scripts/sign-file.

     This also renames a few constants for clarity and drops unused
     signature types as preparation for hash-based module integrity
     checking work that's in progress.

  Sysfs:

   - Add a /sys/module/<module>/import_ns sysfs attribute to show the
     symbol namespaces imported by loaded modules.

     This makes it easier to verify driver API access at runtime on
     systems that care about such things (e.g. Android).

  Cleanups and fixes:

   - Force sh_addr to 0 for all sections in module.lds. This prevents
     non-zero section addresses when linking modules with 'ld.bfd -r',
     which confused elfutils.

   - Fix a memory leak of charp module parameters on module unload when
     the kernel is configured with CONFIG_SYSFS=n.

   - Override the -EEXIST error code returned by module_init() to
     userspace. This prevents confusion with the errno reserved by the
     module loader to indicate that a module is already loaded.

   - Simplify the warning message and drop the stack dump on positive
     returns from module_init().

   - Drop unnecessary extern keywords from function declarations and
     synchronize parse_args() arguments with their implementation"

* tag 'modules-7.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/modules/linux: (23 commits)
  module: Simplify warning on positive returns from module_init()
  module: Override -EEXIST module return
  documentation: remove references to *_gpl sections
  module: remove *_gpl sections from vmlinux and modules
  module: deprecate usage of *_gpl sections in module loader
  module: use kflagstab instead of *_gpl sections
  module: populate kflagstab in modpost
  module: add kflagstab section to vmlinux and modules
  module: define ksym_flags enumeration to represent kernel symbol flags
  selftests/bpf: verify_pkcs7_sig: Use 'struct module_signature' from the UAPI headers
  sign-file: use 'struct module_signature' from the UAPI headers
  tools uapi headers: add linux/module_signature.h
  module: Move 'struct module_signature' to UAPI
  module: Give MODULE_SIG_STRING a more descriptive name
  module: Give 'enum pkey_id_type' a more specific name
  module: Drop unused signature types
  extract-cert: drop unused definition of PKEY_ID_PKCS7
  docs: symbol-namespaces: mention sysfs attribute
  module: expose imported namespaces via sysfs
  module: Remove extern keyword from param prototypes
  ...
2026-04-14 17:16:38 -07:00
Linus Torvalds
c43267e679 arm64 updates for 7.1:
Core features:
 
  - Add support for FEAT_LSUI, allowing futex atomic operations without
    toggling Privileged Access Never (PAN)
 
  - Further refactor the arm64 exception handling code towards the
    generic entry infrastructure
 
  - Optimise __READ_ONCE() with CONFIG_LTO=y and allow alias analysis
    through it
 
 Memory management:
 
  - Refactor the arm64 TLB invalidation API and implementation for better
    control over barrier placement and level-hinted invalidation
 
  - Enable batched TLB flushes during memory hot-unplug
 
  - Fix rodata=full block mapping support for realm guests (when
    BBML2_NOABORT is available)
 
 Perf and PMU:
 
  - Add support for a whole bunch of system PMUs featured in NVIDIA's
    Tegra410 SoC (cspmu extensions for the fabric and PCIe, new drivers
    for CPU/C2C memory latency PMUs)
 
  - Clean up iomem resource handling in the Arm CMN driver
 
  - Fix signedness handling of AA64DFR0.{PMUVer,PerfMon}
 
 MPAM (Memory Partitioning And Monitoring):
 
  - Add architecture context-switch and hiding of the feature from KVM
 
  - Add interface to allow MPAM to be exposed to user-space using resctrl
 
  - Add errata workaround for some existing platforms
 
  - Add documentation for using MPAM and what shape of platforms can use
    resctrl
 
 Miscellaneous:
 
  - Check DAIF (and PMR, where relevant) at task-switch time
 
  - Skip TFSR_EL1 checks and barriers in synchronous MTE tag check mode
    (only relevant to asynchronous or asymmetric tag check modes)
 
  - Remove a duplicate allocation in the kexec code
 
  - Remove redundant save/restore of SCS SP on entry to/from EL0
 
  - Generate the KERNEL_HWCAP_ definitions from the arm64 hwcap
    descriptions
 
  - Add kselftest coverage for cmpbr_sigill()
 
  - Update sysreg definitions
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE5RElWfyWxS+3PLO2a9axLQDIXvEFAmnc8DEACgkQa9axLQDI
 XvFauRAAhc1cIgoRpgtdZd7+3/g457teDPYA3L/CjJzI28aesIpV/ECrEw2GL4xs
 HrQfijF4oyCDbBwh0sAascO/H7RoyOranlbuc+fVJ6Bj6gP9STzR4GmscsWkAMSJ
 vA3Jd1DREdDBO2sjw+hGhht84nRlcfY1FyORJP+1JaFH4oWTWsRNeOZIiI3BhxR8
 EtFP9E8r2Esxi/FmZb/47m7kYCEH+XsrzQvBQNLVCH899QX2Hn0kAY70ndq2ZiQl
 n+zLAe7FBFwKzUVmlgWuhjrWMmK+1TthK/XQuOtxg13dHmX+vE/j+A+dOqRWSfHY
 ktNcWaf6m4+TWKVeVTe4E1cnSuwTQTm4VQKd9zaeQxiZYyYJhCQjXuEZg3vDmDbq
 F6D3MpTaJHRRWp0rEurxnSBlmQPCBE2IxEBdSrjd/WJ6T9e1oYwWiSJSS7bGCgGr
 dd/XLsOY7Um5n4ooIFEZc1de6VO6/VTKjmxnBMgU+Sa1REbLpD438IX/6CjzG5qM
 l5Ulke/c6/a/faeVCEpZpD8JuvNOzo9RISDPrNg1KKAL+OSU+9tgmVjIFPhDDB0w
 zNTqT7YJIhxlJxnUGWDk8YNsTjT3OzyquY9UT1tBTBqC0k13J2i2ev30toUez7xj
 2aV+9qMpunbLtwYhXNun1hBFiYrCxpX7I8ha0hXiXL0CywVOPTI=
 =CnVn
 -----END PGP SIGNATURE-----

Merge tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux

Pull arm64 updates from Catalin Marinas:
 "The biggest changes are MPAM enablement in drivers/resctrl and new PMU
  support under drivers/perf.

  On the core side, FEAT_LSUI lets futex atomic operations with EL0
  permissions, avoiding PAN toggling.

  The rest is mostly TLB invalidation refactoring, further generic entry
  work, sysreg updates and a few fixes.

  Core features:

   - Add support for FEAT_LSUI, allowing futex atomic operations without
     toggling Privileged Access Never (PAN)

   - Further refactor the arm64 exception handling code towards the
     generic entry infrastructure

   - Optimise __READ_ONCE() with CONFIG_LTO=y and allow alias analysis
     through it

  Memory management:

   - Refactor the arm64 TLB invalidation API and implementation for
     better control over barrier placement and level-hinted invalidation

   - Enable batched TLB flushes during memory hot-unplug

   - Fix rodata=full block mapping support for realm guests (when
     BBML2_NOABORT is available)

  Perf and PMU:

   - Add support for a whole bunch of system PMUs featured in NVIDIA's
     Tegra410 SoC (cspmu extensions for the fabric and PCIe, new drivers
     for CPU/C2C memory latency PMUs)

   - Clean up iomem resource handling in the Arm CMN driver

   - Fix signedness handling of AA64DFR0.{PMUVer,PerfMon}

  MPAM (Memory Partitioning And Monitoring):

   - Add architecture context-switch and hiding of the feature from KVM

   - Add interface to allow MPAM to be exposed to user-space using
     resctrl

   - Add errata workaround for some existing platforms

   - Add documentation for using MPAM and what shape of platforms can
     use resctrl

  Miscellaneous:

   - Check DAIF (and PMR, where relevant) at task-switch time

   - Skip TFSR_EL1 checks and barriers in synchronous MTE tag check mode
     (only relevant to asynchronous or asymmetric tag check modes)

   - Remove a duplicate allocation in the kexec code

   - Remove redundant save/restore of SCS SP on entry to/from EL0

   - Generate the KERNEL_HWCAP_ definitions from the arm64 hwcap
     descriptions

   - Add kselftest coverage for cmpbr_sigill()

   - Update sysreg definitions"

* tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (109 commits)
  arm64: rsi: use linear-map alias for realm config buffer
  arm64: Kconfig: fix duplicate word in CMDLINE help text
  arm64: mte: Skip TFSR_EL1 checks and barriers in synchronous tag check mode
  arm64/sysreg: Update ID_AA64SMFR0_EL1 description to DDI0601 2025-12
  arm64/sysreg: Update ID_AA64ZFR0_EL1 description to DDI0601 2025-12
  arm64/sysreg: Update ID_AA64FPFR0_EL1 description to DDI0601 2025-12
  arm64/sysreg: Update ID_AA64ISAR2_EL1 description to DDI0601 2025-12
  arm64/sysreg: Update ID_AA64ISAR0_EL1 description to DDI0601 2025-12
  arm64/hwcap: Generate the KERNEL_HWCAP_ definitions for the hwcaps
  arm64: kexec: Remove duplicate allocation for trans_pgd
  ACPI: AGDI: fix missing newline in error message
  arm64: Check DAIF (and PMR) at task-switch time
  arm64: entry: Use split preemption logic
  arm64: entry: Use irqentry_{enter_from,exit_to}_kernel_mode()
  arm64: entry: Consistently prefix arm64-specific wrappers
  arm64: entry: Don't preempt with SError or Debug masked
  entry: Split preemption from irqentry_exit_to_kernel_mode()
  entry: Split kernel mode logic from irqentry_{enter,exit}()
  entry: Move irqentry_enter() prototype later
  entry: Remove local_irq_{enable,disable}_exit_to_user()
  ...
2026-04-14 16:48:56 -07:00
Linus Torvalds
1c3b68f0d5 Scheduler changes for v7.1:
Fair scheduling updates:
 
  - Skip SCHED_IDLE rq for SCHED_IDLE tasks (Christian Loehle)
  - Remove superfluous rcu_read_lock() in the wakeup path (K Prateek Nayak)
  - Simplify the entry condition for update_idle_cpu_scan() (K Prateek Nayak)
  - Simplify SIS_UTIL handling in select_idle_cpu() (K Prateek Nayak)
  - Avoid overflow in enqueue_entity() (K Prateek Nayak)
  - Update overutilized detection (Vincent Guittot)
  - Prevent negative lag increase during delayed dequeue (Vincent Guittot)
  - Clear buddies for preempt_short (Vincent Guittot)
  - Implement more complex proportional newidle balance (Peter Zijlstra)
  - Increase weight bits for avg_vruntime (Peter Zijlstra)
  - Use full weight to __calc_delta() (Peter Zijlstra)
 
 RT and DL scheduling updates:
 
  - Fix incorrect schedstats for rt and dl thread (Dengjun Su)
  - Skip group schedulable check with rt_group_sched=0 (Michal Koutný)
  - Move group schedulability check to sched_rt_global_validate()
    (Michal Koutný)
  - Add reporting of runtime left & abs deadline to sched_getattr()
    for DEADLINE tasks (Tommaso Cucinotta)
 
 Scheduling topology updates by K Prateek Nayak:
 
  - Compute sd_weight considering cpuset partitions
  - Extract "imb_numa_nr" calculation into a separate helper
  - Allocate per-CPU sched_domain_shared in s_data
  - Switch to assigning "sd->shared" from s_data
  - Remove sched_domain_shared allocation with sd_data
 
 Energy-aware scheduling updates:
 
  - Filter false overloaded_group case for EAS (Vincent Guittot)
  - PM: EM: Switch to rcu_dereference_all() in wakeup path
    (Dietmar Eggemann)
 
 Infrastructure updates:
 
  - Replace use of system_unbound_wq with system_dfl_wq (Marco Crivellari)
 
 Proxy scheduling updates by John Stultz:
 
  - Make class_schedulers avoid pushing current, and get rid of proxy_tag_curr()
  - Minimise repeated sched_proxy_exec() checking
  - Fix potentially missing balancing with Proxy Exec
  - Fix and improve task::blocked_on et al handling
  - Add assert_balance_callbacks_empty() helper
  - Add logic to zap balancing callbacks if we pick again
  - Move attach_one_task() and attach_task() helpers to sched.h
  - Handle blocked-waiter migration (and return migration)
  - Add K Prateek Nayak to scheduler reviewers for proxy execution
 
 Misc cleanups and fixes by John Stultz, Joseph Salisbury,
 Peter Zijlstra, K Prateek Nayak, Michal Koutný, Randy Dunlap,
 Shrikanth Hegde, Vincent Guittot, Zhan Xusheng, Xie Yuanbin
 and Vincent Guittot.
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmncq4oRHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1gxoA/8DD0SsMhBLaZLi+LAdY5fD6rGjOLGBtxz
 NgwN8CAvPIFH7qFzPjAk7WtVXoKjF62sRDFvUaBEsliflRzOkBkYr3SnUYRORyBB
 VRj7D6ymuWhxnhYsy8+Hviu/93c3GyEO59IYU0wIShxBzYBxqDfNxWvEUQte2Cin
 1yFy4CICJeGpsBv9Ev+0LtesxtF5bnaioawbAYcpc2IdYsK+nsMKRvkwg1YSdLmh
 v9+vIYuQBrclBn3OR7dsv2krBev5qodYtDZFwdJagE+6aaQv2zhWIfhetPpkzwrq
 zhuzVZH+E9404Pn5EqJaw7KmU9eyBBwIUVqBaQfH73eSe5PY0tiSrpPU9foocUjo
 4Td9sL11SLzjwpM4bIijW0ezZY8y+4Q0A21GwdcwAx3LPstXcF5GIjQ76dVFPRKN
 Unbt6o+9O9NvMLg8CLzwonlFzOoLOrL+5eKJs+caOuOikT+cXnBQrukgB4ck3RAD
 PIVD8XnufJTCKiDvx2vravLXsWiA2cg7citVsgc8y5FBcdhzv3YVqXd/lGkqg+09
 7rVqE6NRDlkk4G4KZACTK45YVcVwXhQlMU/qiS0IduHdD0NtL9DPnQvdfzQWQehO
 30cJ5vZ+fqbHspJ8AdPuqntUyfEvPTCbCT4Ou/AEcvO8NRQu2gplcq9mF4U46WZG
 GBPWXvGHzM8=
 =NjyS
 -----END PGP SIGNATURE-----

Merge tag 'sched-core-2026-04-13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull scheduler updates from Ingo Molnar:
 "Fair scheduling updates:
   - Skip SCHED_IDLE rq for SCHED_IDLE tasks (Christian Loehle)
   - Remove superfluous rcu_read_lock() in the wakeup path (K Prateek Nayak)
   - Simplify the entry condition for update_idle_cpu_scan() (K Prateek Nayak)
   - Simplify SIS_UTIL handling in select_idle_cpu() (K Prateek Nayak)
   - Avoid overflow in enqueue_entity() (K Prateek Nayak)
   - Update overutilized detection (Vincent Guittot)
   - Prevent negative lag increase during delayed dequeue (Vincent Guittot)
   - Clear buddies for preempt_short (Vincent Guittot)
   - Implement more complex proportional newidle balance (Peter Zijlstra)
   - Increase weight bits for avg_vruntime (Peter Zijlstra)
   - Use full weight to __calc_delta() (Peter Zijlstra)

  RT and DL scheduling updates:
   - Fix incorrect schedstats for rt and dl thread (Dengjun Su)
   - Skip group schedulable check with rt_group_sched=0 (Michal Koutný)
   - Move group schedulability check to sched_rt_global_validate()
     (Michal Koutný)
   - Add reporting of runtime left & abs deadline to sched_getattr()
     for DEADLINE tasks (Tommaso Cucinotta)

  Scheduling topology updates by K Prateek Nayak:
   - Compute sd_weight considering cpuset partitions
   - Extract "imb_numa_nr" calculation into a separate helper
   - Allocate per-CPU sched_domain_shared in s_data
   - Switch to assigning "sd->shared" from s_data
   - Remove sched_domain_shared allocation with sd_data

  Energy-aware scheduling updates:
   - Filter false overloaded_group case for EAS (Vincent Guittot)
   - PM: EM: Switch to rcu_dereference_all() in wakeup path
     (Dietmar Eggemann)

  Infrastructure updates:
   - Replace use of system_unbound_wq with system_dfl_wq (Marco Crivellari)

  Proxy scheduling updates by John Stultz:
   - Make class_schedulers avoid pushing current, and get rid of proxy_tag_curr()
   - Minimise repeated sched_proxy_exec() checking
   - Fix potentially missing balancing with Proxy Exec
   - Fix and improve task::blocked_on et al handling
   - Add assert_balance_callbacks_empty() helper
   - Add logic to zap balancing callbacks if we pick again
   - Move attach_one_task() and attach_task() helpers to sched.h
   - Handle blocked-waiter migration (and return migration)
   - Add K Prateek Nayak to scheduler reviewers for proxy execution

  Misc cleanups and fixes by John Stultz, Joseph Salisbury, Peter
  Zijlstra, K Prateek Nayak, Michal Koutný, Randy Dunlap, Shrikanth
  Hegde, Vincent Guittot, Zhan Xusheng, Xie Yuanbin and Vincent Guittot"

* tag 'sched-core-2026-04-13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (46 commits)
  sched/eevdf: Clear buddies for preempt_short
  sched/rt: Cleanup global RT bandwidth functions
  sched/rt: Move group schedulability check to sched_rt_global_validate()
  sched/rt: Skip group schedulable check with rt_group_sched=0
  sched/fair: Avoid overflow in enqueue_entity()
  sched: Use u64 for bandwidth ratio calculations
  sched/fair: Prevent negative lag increase during delayed dequeue
  sched/fair: Use sched_energy_enabled()
  sched: Handle blocked-waiter migration (and return migration)
  sched: Move attach_one_task and attach_task helpers to sched.h
  sched: Add logic to zap balance callbacks if we pick again
  sched: Add assert_balance_callbacks_empty helper
  sched/locking: Add special p->blocked_on==PROXY_WAKING value for proxy return-migration
  sched: Fix modifying donor->blocked on without proper locking
  locking: Add task::blocked_lock to serialize blocked_on state
  sched: Fix potentially missing balancing with Proxy Exec
  sched: Minimise repeated sched_proxy_exec() checking
  sched: Make class_schedulers avoid pushing current, and get rid of proxy_tag_curr()
  MAINTAINERS: Add K Prateek Nayak to scheduler reviewers
  sched/core: Get this cpu once in ttwu_queue_cond()
  ...
2026-04-14 13:33:36 -07:00
Linus Torvalds
33c66eb5e9 Performance events changes for v7.1:
Core updates:
 
  - Try to allocate task_ctx_data quickly, to optimize
    O(N^2) algorithm on large systems with O(100k) threads
    (Namhyung Kim)
 
 AMD PMU driver IBS support updates and fixes, by Ravi Bangoria:
 
  - Fix interrupt accounting for discarded samples
  - Fix a Zen5-specific quirk
  - Fix PhyAddrVal handling
  - Fix NMI-safety with perf_allow_kernel()
  - Fix a race between event add and NMIs
 
 Intel PMU driver updates:
 
  - Only check GP counters for PEBS constraints validation (Dapeng Mi)
 
 MSR driver:
 
  - Turn SMI_COUNT and PPERF on by default, instead of a long
    list of CPU models to enable them on (Kan Liang)
 
 Misc cleanups and fixes by Aldf Conte, Anshuman Khandual, Namhyung Kim,
 Ravi Bangoria and Yen-Hsiang Hsu.
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmncppoRHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1hN2hAAgd8ix2hZjT/v/wH0iIayRKPEI8KqQ0XP
 7L0nqHVrNw3gzsxkRBIBljyWdYhHmxYnV2ExgddwXdpcD2j/Vf2YUfNtrE0fXL5J
 wD8/M4WxzxA2gwcRgz3kGaeU/I0Ble9TcIdSaI3kJFJarHaDw3jEsif/gfmYbZfm
 oX+gjIUCspnUMqb5EqsHdWxPWub87NnddPI8c9hmhq/9IZ4QvhUxS+lQHc+GihpY
 MTlTxG10W/+f84w0lyG153KslV1rngIqoQ2uJTRe0fjx3VX1uhsgB3LCrkTzMOWe
 GVODiMhN9u5o0pfJLbboSDZ32z3QrsojXbS2Z+ZHqfqbgompIzH9SVh5fFSGKtfK
 64CEP4mO90JGGnDYS6vaPJhZrbusZKzuLt0tcn0aIYHD48PNJXhD2tVE76JsnmAj
 SicnL78QOQkB8Gi0LuCXhxPXY/KAqFtOgmKV9x+gqJuAFgTXEUhem6IOJjShhwOQ
 NfIkXDHz7kmMLblWRmuslGOWfWddRKheQNvuJ+YqbVto6N192PQdSjnBBZjX8GpL
 o52FYCbwGXckZ9X+SU55j3lmQbmtS5Rn8PwB7dmHnVIp8bRI62ANQVoNc1feIrAt
 7UA0SrIPz94oe8tH8lQYb5d98fv/6+Nroli8Vpik4wQZf1VUrzlEEXv/m9CTOL4G
 FA5CtFF1AmA=
 =4Vs0
 -----END PGP SIGNATURE-----

Merge tag 'perf-core-2026-04-13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull performance events updates from Ingo Molnar:
 "Core updates:

   - Try to allocate task_ctx_data quickly, to optimize O(N^2) algorithm
     on large systems with O(100k) threads (Namhyung Kim)

  AMD PMU driver IBS support updates and fixes, by Ravi Bangoria:
   - Fix interrupt accounting for discarded samples
   - Fix a Zen5-specific quirk
   - Fix PhyAddrVal handling
   - Fix NMI-safety with perf_allow_kernel()
   - Fix a race between event add and NMIs

  Intel PMU driver updates:
   - Only check GP counters for PEBS constraints validation (Dapeng Mi)

  MSR driver:
   - Turn SMI_COUNT and PPERF on by default, instead of a long list of
     CPU models to enable them on (Kan Liang)

  ... and misc cleanups and fixes by Aldf Conte, Anshuman Khandual,
  Namhyung Kim, Ravi Bangoria and Yen-Hsiang Hsu"

* tag 'perf-core-2026-04-13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf/events: Replace READ_ONCE() with standard pgtable accessors
  perf/x86/msr: Make SMI and PPERF on by default
  perf/x86/intel/p4: Fix unused variable warning in p4_pmu_init()
  perf/x86/intel: Only check GP counters for PEBS constraints validation
  perf/x86/amd/ibs: Fix comment typo in ibs_op_data
  perf/amd/ibs: Advertise remote socket capability
  perf/amd/ibs: Enable streaming store filter
  perf/amd/ibs: Enable RIP bit63 hardware filtering
  perf/amd/ibs: Enable fetch latency filtering
  perf/amd/ibs: Support IBS_{FETCH|OP}_CTL2[Dis] to eliminate RMW race
  perf/amd/ibs: Add new MSRs and CPUID bits definitions
  perf/amd/ibs: Define macro for ldlat mask and shift
  perf/amd/ibs: Avoid race between event add and NMI
  perf/amd/ibs: Avoid calling perf_allow_kernel() from the IBS NMI handler
  perf/amd/ibs: Preserve PhyAddrVal bit when clearing PhyAddr MSR
  perf/amd/ibs: Limit ldlat->l3missonly dependency to Zen5
  perf/amd/ibs: Account interrupt for discarded samples
  perf/core: Simplify __detach_global_ctx_data()
  perf/core: Try to allocate task_ctx_data quickly
  perf/core: Pass GFP flags to attach_task_ctx_data()
2026-04-14 13:22:40 -07:00
Linus Torvalds
7393febcb1 Locking updates for v7.1:
Mutexes:
 
  - Add killable flavor to guard definitions (Davidlohr Bueso)
  - Remove the list_head from struct mutex (Matthew Wilcox)
  - Rename mutex_init_lockep() (Davidlohr Bueso)
 
 rwsems:
 
  - Remove the list_head from struct rw_semaphore and
    replace it with a single pointer (Matthew Wilcox)
  - Fix logic error in rwsem_del_waiter() (Andrei Vagin)
 
 Semaphores:
 
  - Remove the list_head from struct semaphore (Matthew Wilcox)
 
 Jump labels:
 
  - Use ATOMIC_INIT() for initialization of .enabled (Thomas Weißschuh)
  - Remove workaround for old compilers in initializations
    (Thomas Weißschuh)
 
 Lock context analysis changes and improvements:
 
  - Add context analysis for rwsems (Peter Zijlstra)
  - Fix rwlock and spinlock lock context annotations (Bart Van Assche)
  - Fix rwlock support in <linux/spinlock_up.h> (Bart Van Assche)
  - Add lock context annotations in the spinlock implementation
    (Bart Van Assche)
  - signal: Fix the lock_task_sighand() annotation (Bart Van Assche)
  - ww-mutex: Fix the ww_acquire_ctx function annotations
    (Bart Van Assche)
  - Add lock context support in do_raw_{read,write}_trylock()
    (Bart Van Assche)
  - arm64, compiler-context-analysis: Permit alias analysis through
    __READ_ONCE() with CONFIG_LTO=y (Marco Elver)
  - Add __cond_releases() (Peter Zijlstra)
  - Add context analysis for mutexes (Peter Zijlstra)
  - Add context analysis for rtmutexes (Peter Zijlstra)
  - Convert futexes to compiler context analysis (Peter Zijlstra)
 
 Rust integration updates:
 
  - Add atomic fetch_sub() implementation (Andreas Hindborg)
  - Refactor various rust_helper_ methods for expansion (Boqun Feng)
  - Add Atomic<*{mut,const} T> support (Boqun Feng)
  - Add atomic operation helpers over raw pointers (Boqun Feng)
  - Add performance-optimal Flag type for atomic booleans, to avoid
    slow byte-sized RMWs on architectures that don't support them.
    (FUJITA Tomonori)
  - Misc cleanups and fixes (Andreas Hindborg, Boqun Feng,
    FUJITA Tomonori)
 
 LTO support updates:
 
  - arm64: Optimize __READ_ONCE() with CONFIG_LTO=y (Marco Elver)
  - compiler: Simplify generic RELOC_HIDE() (Marco Elver)
 
 Miscellaneous fixes and cleanups by Peter Zijlstra, Randy Dunlap,
 Thomas Weißschuh, Davidlohr Bueso and Mikhail Gavrilov.
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmncnGERHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1ig7Q//a3UgHjUe96/zuJIv/X1lt5MU0GHP/m/n
 Rf6c39P0VWV6iupJtZ6gPmtQBQDyqWsnfE9S6PFDW4P/Njn0CGEBhk5bcYiN7dc5
 pN0hfM67rY1Ids2FJo5JVIxw2pNZpZHU4v3dJC84xH1cPwmccHxt3XW67iQnJCY9
 6m7RJ3nUfmNC1qLGKtAFQp3N91hK+BYxqZQ1Wn6a0lRWfmYY7WDs8qrr5N6Ezn7W
 53ZNXXbXUC09iOO/slOZmFD5tDrp5Z1nPYTeOdFnWYC5SoTvkfauTqmfZRN5sFad
 8vRxXHuCsdBthNF+ljobBUhZx9QL4UJMGOJTFVp9dZSj13vI7UNlbfAtwMKM8lsR
 L+v+GSsGdQWwrhzaiz53k6ZuUUDECltjwKFFUBy9RPFMtKkpKsgjW+X6I+SFeTQW
 QAPzaA/fEK45bvSPUcjn09kKKC1EVIjHQ6NvByoVjPABz92PpJgOR0si2PW5F314
 7sKzk4Ra5x8NGDSEiC9uSwB7mIv56/lUq5zuVoz3CB2rehqIpPdeieE8TaXvcmdK
 8otsWYcUXDcj/d6en9XBzb0t3LpQ1TumcOGw3xUJhrSoB4DvmWgW788SbGIKklR9
 KFQLU2USlm4u8JHEFXOAeaDhWME+eCqP5FCq3YTqxLksiA+oYx3Xui1R+5L4yjRs
 bVbp+BIs3N4=
 =aw95
 -----END PGP SIGNATURE-----

Merge tag 'locking-core-2026-04-13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull locking updates from Ingo Molnar:
 "Mutexes:

   - Add killable flavor to guard definitions (Davidlohr Bueso)

   - Remove the list_head from struct mutex (Matthew Wilcox)

   - Rename mutex_init_lockep() (Davidlohr Bueso)

  rwsems:

   - Remove the list_head from struct rw_semaphore and
     replace it with a single pointer (Matthew Wilcox)

   - Fix logic error in rwsem_del_waiter() (Andrei Vagin)

  Semaphores:

   - Remove the list_head from struct semaphore (Matthew Wilcox)

  Jump labels:

   - Use ATOMIC_INIT() for initialization of .enabled (Thomas Weißschuh)

   - Remove workaround for old compilers in initializations
     (Thomas Weißschuh)

  Lock context analysis changes and improvements:

   - Add context analysis for rwsems (Peter Zijlstra)

   - Fix rwlock and spinlock lock context annotations (Bart Van Assche)

   - Fix rwlock support in <linux/spinlock_up.h> (Bart Van Assche)

   - Add lock context annotations in the spinlock implementation
     (Bart Van Assche)

   - signal: Fix the lock_task_sighand() annotation (Bart Van Assche)

   - ww-mutex: Fix the ww_acquire_ctx function annotations
     (Bart Van Assche)

   - Add lock context support in do_raw_{read,write}_trylock()
     (Bart Van Assche)

   - arm64, compiler-context-analysis: Permit alias analysis through
     __READ_ONCE() with CONFIG_LTO=y (Marco Elver)

   - Add __cond_releases() (Peter Zijlstra)

   - Add context analysis for mutexes (Peter Zijlstra)

   - Add context analysis for rtmutexes (Peter Zijlstra)

   - Convert futexes to compiler context analysis (Peter Zijlstra)

  Rust integration updates:

   - Add atomic fetch_sub() implementation (Andreas Hindborg)

   - Refactor various rust_helper_ methods for expansion (Boqun Feng)

   - Add Atomic<*{mut,const} T> support (Boqun Feng)

   - Add atomic operation helpers over raw pointers (Boqun Feng)

   - Add performance-optimal Flag type for atomic booleans, to avoid
     slow byte-sized RMWs on architectures that don't support them.
     (FUJITA Tomonori)

   - Misc cleanups and fixes (Andreas Hindborg, Boqun Feng, FUJITA
     Tomonori)

  LTO support updates:

   - arm64: Optimize __READ_ONCE() with CONFIG_LTO=y (Marco Elver)

   - compiler: Simplify generic RELOC_HIDE() (Marco Elver)

  Miscellaneous fixes and cleanups by Peter Zijlstra, Randy Dunlap,
  Thomas Weißschuh, Davidlohr Bueso and Mikhail Gavrilov"

* tag 'locking-core-2026-04-13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (39 commits)
  compiler: Simplify generic RELOC_HIDE()
  locking: Add lock context annotations in the spinlock implementation
  locking: Add lock context support in do_raw_{read,write}_trylock()
  locking: Fix rwlock support in <linux/spinlock_up.h>
  lockdep: Raise default stack trace limits when KASAN is enabled
  cleanup: Optimize guards
  jump_label: remove workaround for old compilers in initializations
  jump_label: use ATOMIC_INIT() for initialization of .enabled
  futex: Convert to compiler context analysis
  locking/rwsem: Fix logic error in rwsem_del_waiter()
  locking/rwsem: Add context analysis
  locking/rtmutex: Add context analysis
  locking/mutex: Add context analysis
  compiler-context-analysys: Add __cond_releases()
  locking/mutex: Remove the list_head from struct mutex
  locking/semaphore: Remove the list_head from struct semaphore
  locking/rwsem: Remove the list_head from struct rw_semaphore
  rust: atomic: Update a safety comment in impl of `fetch_add()`
  rust: sync: atomic: Update documentation for `fetch_add()`
  rust: sync: atomic: Add fetch_sub()
  ...
2026-04-14 12:36:25 -07:00
Linus Torvalds
e80d033851 Updates for the SMP core code:
- Switch smp_call_on_cpu() to user system_percpu_wq instead of system_wq
     a part of the ongoing workqueue restructuring
 
   - Improve the CSD-lock diagnostics for smp_call_function_single() to
     provide better debug mechanisms on weakly ordered systems.
 
   - Cache the current CPU number once in smp_call_function*() instead of
     retrieving it over and over.
 
   - Add missing kernel-doc comments all over the place
 -----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCgAuFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmnbu7EQHHRnbHhAa2Vy
 bmVsLm9yZwAKCRCmGPVMDXSYoSTiD/9OrmXH0/NqWnLG2qydztv8fTgDPFtDg7KK
 bQ6YQpsldbBFOBFqKt9mMgWPFIqhsmomVTtJ0HKL31w7tpXG84wPvuknC1WuaA8B
 l3hsslDhU/RNNNdFOmu3vHtduDTVTy+8DhGdpl06KMlvLOzyYzU9kaHFKGPZ1OZ6
 /Zhab9+QKi4uOpSbXa3Lt/xFnwZgitqqn/6RbfwvkeuYBdyVtniff0oE3BGfXuOd
 taVN5XLA4g3pD7tznLm8n+HAW6weSdu7feLt8+Q65yNcLU2ioI+mCZV+j/hHqWX2
 n0ZA9o0AoaOvQne0ABl4nkh9Wh19zyUnfAEswr3+hsNhFDLJhog6EvoYK6HL5Kso
 fqRNn+Ppsxr9aawLUQo7CZRiVOT4LOH8yE88SHELprHFw1y4b1pXXeZHndSOi4iY
 bIWQwhysCpSirT0619TXR4i4b5FUGG8PXfwgLSErY+cIVIN2JHoxoMHRgJ3tajbO
 CpE5BRSbBleX1rGtG8m1pUifb+1k5u+ktBvmz4kofX8pwEemTBK0FbppixyKenSF
 bLrhiFHRPosVmyMrKnuREFKJCZOXDOZ3qQx7WIQ31wTmNIj9O1ZvqVcvhNssjwF4
 e+ndi2ki+rP7pm6PlijNJT/ZcVQJrM/l1jvWhZOJo1yguaIngHDYR/uE5MFPjcmp
 49ZR7kPoDQ==
 =45V4
 -----END PGP SIGNATURE-----

Merge tag 'smp-core-2026-04-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull SMP core updates from Thomas Gleixner:

 - Switch smp_call_on_cpu() to user system_percpu_wq instead of
   system_wq a part of the ongoing workqueue restructuring

 - Improve the CSD-lock diagnostics for smp_call_function_single() to
   provide better debug mechanisms on weakly ordered systems.

 - Cache the current CPU number once in smp_call_function*() instead of
   retrieving it over and over.

 - Add missing kernel-doc comments all over the place

* tag 'smp-core-2026-04-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  smp: Use system_percpu_wq instead of system_wq
  smp: Improve smp_call_function_single() CSD-lock diagnostics
  smp: Get this_cpu once in smp_call_function
  smp: Add missing kernel-doc comments
2026-04-14 11:14:56 -07:00
Linus Torvalds
f21f7b5162 Update to the VDSO subsystem:
- Make the handling of compat functions consistent and more robust
 
      - Rework the underlying data store so that it is dynamically
        allocated, which allows the conversion of the last holdout SPARC64
        to the generic VDSO implementation
 
      - Rework the SPARC64 VDSO to utilize the generic implementation
 
      - Mop up the left overs of the non-generic VDSO support in the core
        code.
 
      - Expand the VDSO selftest and make them more robust
 
      - Allow time namespaces to be enabled independently of the generic
        VDSO support, which was not possible before due to SPARC64 not
        using it.
 
      - Various cleanups and improvements in the related code.
 -----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCgAuFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmnb0v8QHHRnbHhAa2Vy
 bmVsLm9yZwAKCRCmGPVMDXSYocfqD/9ywgnvwRH6B612mY4PI3qCbLHs6n9f78aH
 YwyXmmfBZ5vt1ZtptHD+BAxiIMm9GC+/exdj5zhcOWucnBVhorcloE6evxhkJAMn
 RhTQFKkEmcA/UV2Yfct9r+33kgZRyu4IIul4J7hgn2o5T1BqwZbOil0W/O5adr5P
 MDLxjT1OLV80ZZWI9qbWcR/aR7W7sHcdwfVPPqjhombRY7f391Mo3dZeM5C2y55x
 8TXCEqVpN1RJzFinWEgQN7QpP4OmF0rRuXSrDQpkH6pk/+RSqNlT/QGG7MJtmCQR
 E6CeBjNRUn318KiroaGyTKlM9xsL3gNoiCY24ZTwzZxx3g5gSAR3KTCTJhQU0hpu
 Svxj+ksqEAyW7fAOIsbce6W8fUPKC2KM+juXgPKcqZ5hjE2fALD+eEYMlq00jSiu
 sj71007cM9tZKOXPdWs3Fv7AY2Yj7iiRiRz9gv1wqS1z7ybxiaFjxjLYYakej0tr
 rmwBDEGhNow7msZZttr01BRZk9hDUWfIiJtL+0BrgRLNzst2A7WoagtZ2s0Z7Psl
 RjtWgYNBDJ878xK0J+Djqb9TyLraGWZShIIna9uYCAJX9i954xfKJ//NOnUkZhcl
 jslDLHhdttyJ+TmgIsc1ntUGvYvHqH5ywQpyDfWepMKyIYdaJLHOr2K6bwFnGHdw
 uocXvLrkXw==
 =8ixX
 -----END PGP SIGNATURE-----

Merge tag 'timers-vdso-2026-04-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull vdso updates from Thomas Gleixner:

 - Make the handling of compat functions consistent and more robust

 - Rework the underlying data store so that it is dynamically allocated,
   which allows the conversion of the last holdout SPARC64 to the
   generic VDSO implementation

 - Rework the SPARC64 VDSO to utilize the generic implementation

 - Mop up the left overs of the non-generic VDSO support in the core
   code

 - Expand the VDSO selftest and make them more robust

 - Allow time namespaces to be enabled independently of the generic VDSO
   support, which was not possible before due to SPARC64 not using it

 - Various cleanups and improvements in the related code

* tag 'timers-vdso-2026-04-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (51 commits)
  timens: Use task_lock guard in timens_get*()
  timens: Use mutex guard in proc_timens_set_offset()
  timens: Simplify some calls to put_time_ns()
  timens: Add a __free() wrapper for put_time_ns()
  timens: Remove dependency on the vDSO
  vdso/timens: Move functions to new file
  selftests: vDSO: vdso_test_correctness: Add a test for time()
  selftests: vDSO: vdso_test_correctness: Use facilities from parse_vdso.c
  selftests: vDSO: vdso_test_correctness: Handle different tv_usec types
  selftests: vDSO: vdso_test_correctness: Drop SYS_getcpu fallbacks
  selftests: vDSO: vdso_test_gettimeofday: Remove nolibc checks
  Revert "selftests: vDSO: parse_vdso: Use UAPI headers instead of libc headers"
  random: vDSO: Remove ifdeffery
  random: vDSO: Trim vDSO includes
  vdso/datapage: Trim down unnecessary includes
  vdso/datapage: Remove inclusion of gettimeofday.h
  vdso/helpers: Explicitly include vdso/processor.h
  vdso/gettimeofday: Add explicit includes
  random: vDSO: Add explicit includes
  MIPS: vdso: Explicitly include asm/vdso/vdso.h
  ...
2026-04-14 10:53:44 -07:00
Linus Torvalds
c1fe867b5b Updates for the timer/timekeeping core:
- A rework of the hrtimer subsystem to reduce the overhead for frequently
     armed timers, especially the hrtick scheduler timer.
 
       - Better timer locality decision
 
       - Simplification of the evaluation of the first expiry time by
         keeping track of the neighbor timers in the RB-tree by providing a
         RB-tree variant with neighbor links. That avoids walking the
         RB-tree on removal to find the next expiry time, but even more
         important allows to quickly evaluate whether a timer which is
         rearmed changes the position in the RB-tree with the modified
         expiry time or not. If not, the dequeue/enqueue sequence which both
         can end up in rebalancing can be completely avoided.
 
       - Deferred reprogramming of the underlying clock event device. This
         optimizes for the situation where a hrtimer callback sets the need
         resched bit. In that case the code attempts to defer the
         re-programming of the clock event device up to the point where the
         scheduler has picked the next task and has the next hrtick timer
         armed. In case that there is no immediate reschedule or soft
         interrupts have to be handled before reaching the reschedule point
         in the interrupt entry code the clock event is reprogrammed in one
         of those code paths to prevent that the timer becomes stale.
 
       - Support for clocksource coupled clockevents
 
       	The TSC deadline timer is coupled to the TSC. The next event is
       	programmed in TSC time. Currently this is done by converting the
       	CLOCK_MONOTONIC based expiry value into a relative timeout,
       	converting it into TSC ticks, reading the TSC adding the delta
       	ticks and writing the deadline MSR.
 
 	As the timekeeping core has the conversion factors for the TSC
 	already, the whole back and forth conversion can be completely
 	avoided. The timekeeping core calculates the reverse conversion
 	factors from nanoseconds to TSC ticks and utilizes the base
 	timestamps of TSC and CLOCK_MONOTONIC which are updated once per
 	tick. This allows a direct conversion into the TSC deadline value
 	without reading the time and as a bonus keeps the deadline
 	conversion in sync with the TSC conversion factors, which are
 	updated by adjtimex() on systems with NTP/PTP enabled.
 
      - Allow inlining of the clocksource read and clockevent write
        functions when they are tiny enough, e.g. on x86 RDTSC and WRMSR.
 
     With all those enhancements in place a hrtick enabled scheduler
     provides the same performance as without hrtick. But also other hrtimer
     users obviously benefit from these optimizations.
 
   - Robustness improvements and cleanups of historical sins in the hrtimer
     and timekeeping code.
 
   - Rewrite of the clocksource watchdog.
 
     The clocksource watchdog code has over time reached the state of an
     impenetrable maze of duct tape and staples. The original design, which was
     made in the context of systems far smaller than today, is based on the
     assumption that the to be monitored clocksource (TSC) can be trivially
     compared against a known to be stable clocksource (HPET/ACPI-PM timer).
 
     Over the years this rather naive approach turned out to have major
     flaws. Long delays between the watchdog invocations can cause wrap
     arounds of the reference clocksource. The access to the reference
     clocksource degrades on large multi-sockets systems dure to
     interconnect congestion. This has been addressed with various
     heuristics which degraded the accuracy of the watchdog to the point
     that it fails to detect actual TSC problems on older hardware which
     exposes slow inter CPU drifts due to firmware manipulating the TSC to
     hide SMI time.
 
     The rewrite addresses this by:
 
       - Restricting the validation against the reference clocksource to the
         boot CPU which is usually closest to the legacy block which
         contains the reference clocksource (HPET/ACPI-PM).
 
       - Do a round robin validation betwen the boot CPU and the other CPUs
         based only on the TSC with an algorithm similar to the TSC
         synchronization code during CPU hotplug.
 
       - Being more leniant versus remote timeouts
 
   - The usual tiny fixes, cleanups and enhancements all over the place
 -----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCgAuFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmnbxdMQHHRnbHhAa2Vy
 bmVsLm9yZwAKCRCmGPVMDXSYoeq6EAC4h9wuBr5yCkxmog1Bhlk9cnK0oX1THb7V
 Q4z6DrYAiDXP6z4IDwqSW+3vvmNw1QXOeqpyMTiiIcQ5mNSs1IDnCt5HOEwY+ICm
 fiSUMYkXkH6xdFWspYWFkD7aExHJRT3hd/bo+WnXGHhHclPj5NHZssLMIDboHrzX
 jLV1hljmthfwg/uOXDGmQUPRFjqr2ZQjo7zGA5SwfVg8Krz7g/qRVy2wUns9TdW/
 NYwihDm1YV7qkK/+f1GnMdd70toqb1OZo/fS9FPbBrPLdyi8V+UbnFSUeZu8kCwV
 KubAzjLZR4xYCnrlaHhoi208GMd0TOvHMOrdAA0zkQHfhmszGl4N0pbF/EI29Ft2
 tQG/FUTG+nzgNOrMCPN2nr3u/UOXLP+gO+2hkyyQjqUP35IyaTYQn10JgBmPTdJL
 Ab6E8WL9gTMCd+t/bVjdU/B8W9ruMihKBtWkTfMBCcQ9uNJFCEGzrcMF8hzFYRTs
 /4rMDr3NlGYydAnbKPj6bkC5gtjBvh/L08kOdUFyXCMSqIzvJkZJ4241ogl1Awi6
 VfdwjF5KZCQo3M1ujpep+1L010wC/yulqLt2brQMO9Nt05dRhgwM3lxy7cnlMNm3
 NdfMgi+OG0CzQ+ZUpvo20hCgTDUVgWN9g5R3rar8FJX+Ym3T+ZoEKlShZF+fSRjf
 YAUIbUyi7A==
 =2qc8
 -----END PGP SIGNATURE-----

Merge tag 'timers-core-2026-04-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull timer core updates from Thomas Gleixner:

 - A rework of the hrtimer subsystem to reduce the overhead for
   frequently armed timers, especially the hrtick scheduler timer:

     - Better timer locality decision

     - Simplification of the evaluation of the first expiry time by
       keeping track of the neighbor timers in the RB-tree by providing
       a RB-tree variant with neighbor links. That avoids walking the
       RB-tree on removal to find the next expiry time, but even more
       important allows to quickly evaluate whether a timer which is
       rearmed changes the position in the RB-tree with the modified
       expiry time or not. If not, the dequeue/enqueue sequence which
       both can end up in rebalancing can be completely avoided.

     - Deferred reprogramming of the underlying clock event device. This
       optimizes for the situation where a hrtimer callback sets the
       need resched bit. In that case the code attempts to defer the
       re-programming of the clock event device up to the point where
       the scheduler has picked the next task and has the next hrtick
       timer armed. In case that there is no immediate reschedule or
       soft interrupts have to be handled before reaching the reschedule
       point in the interrupt entry code the clock event is reprogrammed
       in one of those code paths to prevent that the timer becomes
       stale.

     - Support for clocksource coupled clockevents

       The TSC deadline timer is coupled to the TSC. The next event is
       programmed in TSC time. Currently this is done by converting the
       CLOCK_MONOTONIC based expiry value into a relative timeout,
       converting it into TSC ticks, reading the TSC adding the delta
       ticks and writing the deadline MSR.

       As the timekeeping core has the conversion factors for the TSC
       already, the whole back and forth conversion can be completely
       avoided. The timekeeping core calculates the reverse conversion
       factors from nanoseconds to TSC ticks and utilizes the base
       timestamps of TSC and CLOCK_MONOTONIC which are updated once per
       tick. This allows a direct conversion into the TSC deadline value
       without reading the time and as a bonus keeps the deadline
       conversion in sync with the TSC conversion factors, which are
       updated by adjtimex() on systems with NTP/PTP enabled.

     - Allow inlining of the clocksource read and clockevent write
       functions when they are tiny enough, e.g. on x86 RDTSC and WRMSR.

   With all those enhancements in place a hrtick enabled scheduler
   provides the same performance as without hrtick. But also other
   hrtimer users obviously benefit from these optimizations.

 - Robustness improvements and cleanups of historical sins in the
   hrtimer and timekeeping code.

 - Rewrite of the clocksource watchdog.

   The clocksource watchdog code has over time reached the state of an
   impenetrable maze of duct tape and staples. The original design,
   which was made in the context of systems far smaller than today, is
   based on the assumption that the to be monitored clocksource (TSC)
   can be trivially compared against a known to be stable clocksource
   (HPET/ACPI-PM timer).

   Over the years this rather naive approach turned out to have major
   flaws. Long delays between the watchdog invocations can cause wrap
   arounds of the reference clocksource. The access to the reference
   clocksource degrades on large multi-sockets systems dure to
   interconnect congestion. This has been addressed with various
   heuristics which degraded the accuracy of the watchdog to the point
   that it fails to detect actual TSC problems on older hardware which
   exposes slow inter CPU drifts due to firmware manipulating the TSC to
   hide SMI time.

   The rewrite addresses this by:

     - Restricting the validation against the reference clocksource to
       the boot CPU which is usually closest to the legacy block which
       contains the reference clocksource (HPET/ACPI-PM).

     - Do a round robin validation betwen the boot CPU and the other
       CPUs based only on the TSC with an algorithm similar to the TSC
       synchronization code during CPU hotplug.

     - Being more leniant versus remote timeouts

 - The usual tiny fixes, cleanups and enhancements all over the place

* tag 'timers-core-2026-04-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (75 commits)
  alarmtimer: Access timerqueue node under lock in suspend
  hrtimer: Fix incorrect #endif comment for BITS_PER_LONG check
  posix-timers: Fix stale function name in comment
  timers: Get this_cpu once while clearing the idle state
  clocksource: Rewrite watchdog code completely
  clocksource: Don't use non-continuous clocksources as watchdog
  x86/tsc: Handle CLOCK_SOURCE_VALID_FOR_HRES correctly
  MIPS: Don't select CLOCKSOURCE_WATCHDOG
  parisc: Remove unused clocksource flags
  hrtimer: Add a helper to retrieve a hrtimer from its timerqueue node
  hrtimer: Remove trailing comma after HRTIMER_MAX_CLOCK_BASES
  hrtimer: Mark index and clockid of clock base as const
  hrtimer: Drop unnecessary pointer indirection in hrtimer_expire_entry event
  hrtimer: Drop spurious space in 'enum hrtimer_base_type'
  hrtimer: Don't zero-initialize ret in hrtimer_nanosleep()
  hrtimer: Remove hrtimer_get_expires_ns()
  timekeeping: Mark offsets array as const
  timekeeping/auxclock: Consistently use raw timekeeper for tk_setup_internals()
  timer_list: Print offset as signed integer
  tracing: Use explicit array size instead of sentinel elements in symbol printing
  ...
2026-04-14 10:27:07 -07:00
Linus Torvalds
db23954eea Update for the core interrupt subsystem:
- Invoke add_interrupt_randomness() in handle_percpu_devid_irq() and
      cleanup the workaround in the Hyper-V driver, which would now invoke
      it twice on ARM64. Removing it from the driver requires to add it to
      the x86 system vector entry point.
 
    - Remove the pointles cpu_read_lock() around reading CPU possible mask,
      which is read only after init.
 
    - Add documentation for the interaction between device tree bindings and
      the interrupt type defines in irq.h.
 
    - Delete stale defines in the matrix allocator and the equivalent in
      loongarch.
 -----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCgAuFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmnbt64QHHRnbHhAa2Vy
 bmVsLm9yZwAKCRCmGPVMDXSYoY/xD/9hdmaaSXX/JySPatJPgkAhekR8cl/brK9t
 K6qhMltg/XoUhmU1y0XSfuNDJwEOa4qvW2DbwfnzknIDr0AXvL39S9oNn+1o9I0x
 BjIbWwHTAsuLVyjQesqPWwyQtZ8HJCIM+3Ju2rUYz4jYuY8q15GYsbh6QN0pYDNG
 F54/OpuNV42/UhX/O01Gxw930GMGnsMkV6ou9dquap23U4FdlIsoY+zU/b09VEU2
 MmJZPGwryPpzhSLbCdOGuWA9oM6wd/FoBFYYU31W5OSXm4B0cRs+weE31WMogYKM
 lqudBuyNAhCIuZ06sdSvBPRswdvkuTUJovrJwG2r+rV6h953ExujNn+/ZxqraPg7
 iPK70Kwp/O2uyMFhHp/6r1u4bylK/AL7q6hace4cZFLqB/Htx5BW+hh/H7Cz6Yan
 H3cyBz9XIE7K5BI3lotKnlWVFwkrHgwYXUzHHrMq/UPdQKog1IPh/E29JekqtR+p
 RS9QzSF+Gs45I7oMP+7P8o5jAZYuuW+cODnXLlQkuz/bJff3QeftFhOKZkzbFk5x
 AFT0DREttxVCGcUCQaQYrWhFshp63+eDXgKGQT1YULleFUC1f9KH2W+6KlqjDwFR
 rvrUgmpMxkH2JJz1owct6vJ6wyffFNQtPeCTmq+7GM87HNm6Ji6AaRzQjRnhVGCt
 mh9VbBTnFw==
 =FvJG
 -----END PGP SIGNATURE-----

Merge tag 'irq-core-2026-04-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull core irq updates from Thomas Gleixner:

 - Invoke add_interrupt_randomness() in handle_percpu_devid_irq() and
   cleanup the workaround in the Hyper-V driver, which would now invoke
   it twice on ARM64. Removing it from the driver requires to add it to
   the x86 system vector entry point

 - Remove the pointles cpu_read_lock() around reading CPU possible mask,
   which is read only after init

 - Add documentation for the interaction between device tree bindings
   and the interrupt type defines in irq.h

 - Delete stale defines in the matrix allocator and the equivalent in
   loongarch

* tag 'irq-core-2026-04-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  Drivers: hv: Move add_interrupt_randomness() to hypervisor callback sysvec
  genirq/chip: Invoke add_interrupt_randomness() in handle_percpu_devid_irq()
  genirq/affinity: Remove cpus_read_lock() while reading cpu_possible_mask
  genirq/matrix, LoongArch: Delete IRQ_MATRIX_BITS leftovers
  genirq: Document interaction between <linux/irq.h> and DT binding defines
2026-04-14 10:02:41 -07:00
Tom Zanussi
9236eebd13 tracing: Fix fully-qualified variable reference printing in histograms
The syntax for fully-qualified variable references in histograms is
subsys.event.$var, which is parsed correctly, but not displayed correctly
when printing a histogram spec. The current code puts the $ reference at
the beginning of the fully-qualified variable name i.e. $subsys.event.var,
which is incorrect.

Before:

trigger info: hist:keys=next_comm:vals=hitcount:wakeup_lat=common_timestamp.usecs-$sched.sched_wakeup.ts0: ...

After:

trigger info: hist:keys=next_comm:vals=hitcount:wakeup_lat=common_timestamp.usecs-sched.sched_wakeup.$ts0: ...

Link: https://patch.msgid.link/5dee9a86d062a4dd68c2214f3d90ac93811e1951.1776112478.git.zanussi@kernel.org
Signed-off-by: Tom Zanussi <zanussi@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2026-04-14 05:28:10 -04:00
David Carlier
fad217e16f tracepoint: balance regfunc() on func_add() failure in tracepoint_add_func()
When a tracepoint goes through the 0 -> 1 transition, tracepoint_add_func()
invokes the subsystem's ext->regfunc() before attempting to install the
new probe via func_add(). If func_add() then fails (for example, when
allocate_probes() cannot allocate a new probe array under memory pressure
and returns -ENOMEM), the function returns the error without calling the
matching ext->unregfunc(), leaving the side effects of regfunc() behind
with no installed probe to justify them.

For syscall tracepoints this is particularly unpleasant: syscall_regfunc()
bumps sys_tracepoint_refcount and sets SYSCALL_TRACEPOINT on every task.
After a leaked failure, the refcount is stuck at a non-zero value with no
consumer, and every task continues paying the syscall trace entry/exit
overhead until reboot. Other subsystems providing regfunc()/unregfunc()
pairs exhibit similarly scoped persistent state.

Mirror the existing 1 -> 0 cleanup and call ext->unregfunc() in the
func_add() error path, gated on the same condition used there so the
unwind is symmetric with the registration.

Fixes: 8cf868affd ("tracing: Have the reg function allow to fail")
Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20260413190601.21993-1-devnexen@gmail.com
Signed-off-by: David Carlier <devnexen@gmail.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2026-04-14 05:17:02 -04:00
Vincent Donnefort
6170922f13 ring-buffer: Prevent off-by-one array access in ring_buffer_desc_page()
As pointed out by Smatch, the ring-buffer descriptor array page_va is
counted by nr_page_va, but the accessor ring_buffer_desc_page() allows
access off by one.

Currently, this does not cause problems, as the page ID always comes
from a trusted source. Nonetheless, ensure robustness and fix the
accessor. While at it, make the page_id unsigned.

Link: https://patch.msgid.link/20260410124527.3563970-1-vdonnefort@google.com
Reported-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Vincent Donnefort <vdonnefort@google.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2026-04-14 05:13:09 -04:00
Pengpeng Hou
5ec1d1e97d tracing: Rebuild full_name on each hist_field_name() call
hist_field_name() uses a static MAX_FILTER_STR_VAL buffer for fully
qualified variable-reference names, but it currently appends into that
buffer with strcat() without rebuilding it first. As a result, repeated
calls append a new "system.event.field" name onto the previous one,
which can eventually run past the end of full_name.

Build the name with snprintf() on each call and return NULL if the fully
qualified name does not fit in MAX_FILTER_STR_VAL.

Link: https://patch.msgid.link/20260401112224.85582-1-pengpeng@iscas.ac.cn
Fixes: 067fe038e7 ("tracing: Add variable reference handling to hist triggers")
Reviewed-by: Tom Zanussi <zanussi@kernel.org>
Tested-by: Tom Zanussi <zanussi@kernel.org>
Signed-off-by: Pengpeng Hou <pengpeng@iscas.ac.cn>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2026-04-14 04:59:33 -04:00
Cao Ruichuang
1111e9bd83 ring-buffer: Report header_page overwrite as char
The header_page tracefs metadata currently reports overwrite as an
int field with size 1. That makes parsers warn about a type and
size mismatch even though the field is only used as a one-byte flag
within commit.

Keep the shared offset with commit as-is, but report overwrite as
char so the declared type matches the hardcoded size. The signedness
is already carried separately by the emitted signed field.

Link: https://patch.msgid.link/20260406165333.46052-1-create0818@163.com
Link: https://bugzilla.kernel.org/show_bug.cgi?id=216999
Signed-off-by: Cao Ruichuang <create0818@163.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2026-04-14 04:29:55 -04:00
Linus Torvalds
d7c8087a9c Power management updates for 7.1-rc1
- Update qcom-hw DT bindings to include Eliza hardware (Abel Vesa)
 
  - Update cpufreq-dt-platdev blocklist (Faruque Ansari)
 
  - Minor updates to driver and dt-bindings for Tegra (Thierry Reding,
    Rosen Penev)
 
  - Add MAINTAINERS entry for CPPC driver (Viresh Kumar)
 
  - Add support for new features: CPPC performance priority, Dynamic EPP,
    Raw EPP, and new unit tests for them to amd-pstate (Gautham Shenoy,
    Mario Limonciello)
 
  - Fix sysfs files being present when HW missing and broken/outdated
    documentation in the amd-pstate driver (Ninad Naik, Gautham Shenoy)
 
  - Pass the policy to cpufreq_driver->adjust_perf() to avoid using
    cpufreq_cpu_get() in the .adjust_perf() callback in amd-pstate which
    leads to a scheduling-while-atomic bug (K Prateek Nayak)
 
  - Clean up dead code in Kconfig for cpufreq (Julian Braha)
 
  - Remove max_freq_req update for pre-existing cpufreq policy and add a
    boost_freq_req QoS request to save the boost constraint instead of
    overwriting the last scaling_max_freq constraint (Pierre Gondois)
 
  - Embed cpufreq QoS freq_req objects in cpufreq policy so they all
    are allocated in one go along with the policy to simplify lifetime
    rules and avoid error handling issues (Viresh Kumar)
 
  - Use DMI max speed when CPPC is unavailable in the acpi-cpufreq
    scaling driver (Henry Tseng)
 
  - Switch policy_is_shared() in cpufreq to using cpumask_nth() instead
    of cpumask_weight() because the former is more efficient (Yury Norov)
 
  - Use sysfs_emit() in sysfs show functions for cpufreq governor
    attributes (Thorsten Blum)
 
  - Update intel_pstate to stop returning an error when "off" is written
    to its status sysfs attribute while the driver is already off (Fabio
    De Francesco)
 
  - Include current frequency in the debug message printed by
    __cpufreq_driver_target() (Pengjie Zhang)
 
  - Refine stopped tick handling in the menu cpuidle governor and
    rearrange stopped tick handling in the teo cpuidle governor (Rafael
    Wysocki)
 
  - Add Panther Lake C-states table to the intel_idle driver (Artem
    Bityutskiy)
 
  - Clean up dead dependencies on CPU_IDLE in Kconfig (Julian Braha)
 
  - Simplify cpuidle_register_device() with guard() (Huisong Li)
 
  - Use performance level if available to distinguish between rates in
    OPP debugfs (Manivannan Sadhasivam)
 
  - Fix scoped_guard in dev_pm_opp_xlate_required_opp() (Viresh Kumar)
 
  - Return -ENODATA if the snapshot image is not loaded (Alberto Garcia)
 
  - Remove inclusion of crypto/hash.h from hibernate_64.c on x86 (Eric
    Biggers)
 
  - Clean up and rearrange the intel_rapl power capping driver to make
    the respective interface drivers (TPMI, MSR, and MMOI) hold their
    own settings and primitives and consolidate PL4 and PMU support
    flags into rapl_defaults (Kuppuswamy Sathyanarayanan)
 
  - Correct kernel-doc function parameter names in the power capping core
    code (Randy Dunlap)
 
  - Remove unneeded casting for HZ_PER_KHZ in devfreq (Andy Shevchenko)
 
  - Use _visible attribute to replace create/remove_sysfs_files() in
    devfreq (Pengjie Zhang)
 
  - Add Tegra114 support to activity monitor device in tegra30-devfreq as
    a preparation to upcoming EMC controller support (Svyatoslav Ryhel)
 
  - Fix mistakes in cpupower man pages, add the boost and epp options to
    the cpupower-frequency-info man page, and add the perf-bias option to
    the cpupower-info man page (Roberto Ricci)
 
  - Remove unnecessary extern declarations from getopt.h in arguments
    parsing functions in cpufreq-set, cpuidle-info, cpuidle-set,
    cpupower-info, and cpupower-set utilities (Kaushlendra Kumar)
 -----BEGIN PGP SIGNATURE-----
 
 iQFGBAABCAAwFiEEcM8Aw/RY0dgsiRUR7l+9nS/U47UFAmnY9TISHHJqd0Byand5
 c29ja2kubmV0AAoJEO5fvZ0v1OO1G9gH/j5mEqfPpiwX6fQ/ZwOGdNOOPVA5w9j4
 KPHSMwMD5lZkoaZfasp2vt27KY5SOoVVvRZ2DKkFJ3Jai4I3cUPZYypga2nre1ag
 tgzX4vOjcw2r40Eda6ezWl1h4mca/xJJBX7xH2+hn1JY+Y1in37g50CqMIjKh96z
 Uugkk6UZytL1XcF55PMhIUgDf6pDtRT5UOW9xOKOkUt8FVWTJ7ei3HaWyV5kDmVq
 b5eQ42+OH7y6sWNnoKczFd8fStvh6J/avoJurBEvcOQhMcjaIaB48G19+KjDg73E
 NjrVcgG20P2rltBvV2d0J1TKskZHkaP7XjIeWfkwjGZhee3FL7ssS/g=
 =fRCO
 -----END PGP SIGNATURE-----

Merge tag 'pm-7.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull power management updates from Rafael Wysocki:
 "Once again, cpufreq is the most active development area, mostly
  because of the new feature additions and documentation updates in the
  amd-pstate driver, but there are also changes in the cpufreq core
  related to boost support and other assorted updates elsewhere.

  Next up are power capping changes due to the major cleanup of the
  Intel RAPL driver.

  On the cpuidle front, a new C-states table for Intel Panther Lake is
  added to the intel_idle driver, the stopped tick handling in the menu
  and teo governors is updated, and there are a couple of cleanups.

  Apart from the above, support for Tegra114 is added to devfreq and
  there are assorted cleanups of that code, there are also two updates
  of the operating performance points (OPP) library, two minor updates
  related to hibernation, and cpupower utility man pages updates and
  cleanups.

  Specifics:

   - Update qcom-hw DT bindings to include Eliza hardware (Abel Vesa)

   - Update cpufreq-dt-platdev blocklist (Faruque Ansari)

   - Minor updates to driver and dt-bindings for Tegra (Thierry Reding,
     Rosen Penev)

   - Add MAINTAINERS entry for CPPC driver (Viresh Kumar)

   - Add support for new features: CPPC performance priority, Dynamic
     EPP, Raw EPP, and new unit tests for them to amd-pstate (Gautham
     Shenoy, Mario Limonciello)

   - Fix sysfs files being present when HW missing and broken/outdated
     documentation in the amd-pstate driver (Ninad Naik, Gautham Shenoy)

   - Pass the policy to cpufreq_driver->adjust_perf() to avoid using
     cpufreq_cpu_get() in the .adjust_perf() callback in amd-pstate
     which leads to a scheduling-while-atomic bug (K Prateek Nayak)

   - Clean up dead code in Kconfig for cpufreq (Julian Braha)

   - Remove max_freq_req update for pre-existing cpufreq policy and add
     a boost_freq_req QoS request to save the boost constraint instead
     of overwriting the last scaling_max_freq constraint (Pierre
     Gondois)

   - Embed cpufreq QoS freq_req objects in cpufreq policy so they all
     are allocated in one go along with the policy to simplify lifetime
     rules and avoid error handling issues (Viresh Kumar)

   - Use DMI max speed when CPPC is unavailable in the acpi-cpufreq
     scaling driver (Henry Tseng)

   - Switch policy_is_shared() in cpufreq to using cpumask_nth() instead
     of cpumask_weight() because the former is more efficient (Yury
     Norov)

   - Use sysfs_emit() in sysfs show functions for cpufreq governor
     attributes (Thorsten Blum)

   - Update intel_pstate to stop returning an error when "off" is
     written to its status sysfs attribute while the driver is already
     off (Fabio De Francesco)

   - Include current frequency in the debug message printed by
     __cpufreq_driver_target() (Pengjie Zhang)

   - Refine stopped tick handling in the menu cpuidle governor and
     rearrange stopped tick handling in the teo cpuidle governor (Rafael
     Wysocki)

   - Add Panther Lake C-states table to the intel_idle driver (Artem
     Bityutskiy)

   - Clean up dead dependencies on CPU_IDLE in Kconfig (Julian Braha)

   - Simplify cpuidle_register_device() with guard() (Huisong Li)

   - Use performance level if available to distinguish between rates in
     OPP debugfs (Manivannan Sadhasivam)

   - Fix scoped_guard in dev_pm_opp_xlate_required_opp() (Viresh Kumar)

   - Return -ENODATA if the snapshot image is not loaded (Alberto
     Garcia)

   - Remove inclusion of crypto/hash.h from hibernate_64.c on x86 (Eric
     Biggers)

   - Clean up and rearrange the intel_rapl power capping driver to make
     the respective interface drivers (TPMI, MSR, and MMOI) hold their
     own settings and primitives and consolidate PL4 and PMU support
     flags into rapl_defaults (Kuppuswamy Sathyanarayanan)

   - Correct kernel-doc function parameter names in the power capping
     core code (Randy Dunlap)

   - Remove unneeded casting for HZ_PER_KHZ in devfreq (Andy Shevchenko)

   - Use _visible attribute to replace create/remove_sysfs_files() in
     devfreq (Pengjie Zhang)

   - Add Tegra114 support to activity monitor device in tegra30-devfreq
     as a preparation to upcoming EMC controller support (Svyatoslav
     Ryhel)

   - Fix mistakes in cpupower man pages, add the boost and epp options
     to the cpupower-frequency-info man page, and add the perf-bias
     option to the cpupower-info man page (Roberto Ricci)

   - Remove unnecessary extern declarations from getopt.h in arguments
     parsing functions in cpufreq-set, cpuidle-info, cpuidle-set,
     cpupower-info, and cpupower-set utilities (Kaushlendra Kumar)"

* tag 'pm-7.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (74 commits)
  cpufreq/amd-pstate: Add POWER_SUPPLY select for dynamic EPP
  cpupower: remove extern declarations in cmd functions
  cpuidle: Simplify cpuidle_register_device() with guard()
  PM / devfreq: tegra30-devfreq: add support for Tegra114
  PM / devfreq: use _visible attribute to replace create/remove_sysfs_files()
  PM / devfreq: Remove unneeded casting for HZ_PER_KHZ
  MAINTAINERS: amd-pstate: Step down as maintainer, add Prateek as reviewer
  cpufreq: Pass the policy to cpufreq_driver->adjust_perf()
  cpufreq/amd-pstate: Pass the policy to amd_pstate_update()
  cpufreq/amd-pstate-ut: Add a unit test for raw EPP
  cpufreq/amd-pstate: Add support for raw EPP writes
  cpufreq/amd-pstate: Add support for platform profile class
  cpufreq/amd-pstate: add kernel command line to override dynamic epp
  cpufreq/amd-pstate: Add dynamic energy performance preference
  Documentation: amd-pstate: fix dead links in the reference section
  cpufreq/amd-pstate: Cache the max frequency in cpudata
  Documentation/amd-pstate: Add documentation for amd_pstate_floor_{freq,count}
  Documentation/amd-pstate: List amd_pstate_prefcore_ranking sysfs file
  Documentation/amd-pstate: List amd_pstate_hw_prefcore sysfs file
  amd-pstate-ut: Add a testcase to validate the visibility of driver attributes
  ...
2026-04-13 19:47:52 -07:00
Linus Torvalds
4793dae01f Driver core changes for 7.1-rc1
- debugfs:
   - Fix NULL pointer dereference in debugfs_create_str()
   - Fix misplaced EXPORT_SYMBOL_GPL for debugfs_create_str()
   - Fix soundwire debugfs NULL pointer dereference from uninitialized
     firmware_file
 
 - device property:
   - Make fwnode flags modifications thread safe; widen the field to
     unsigned long and use set_bit() / clear_bit() based accessors
   - Document how to check for the property presence
 
 - devres:
   - Separate struct devres_node from its "subclasses" (struct devres,
     struct devres_group); give struct devres_node its own release and
     free callbacks for per-type dispatch
   - Introduce struct devres_action for devres actions, avoiding the
     ARCH_DMA_MINALIGN alignment overhead of struct devres
   - Export struct devres_node and its init/add/remove/dbginfo
     primitives for use by Rust Devres<T>
   - Fix missing node debug info in devm_krealloc()
   - Use guard(spinlock_irqsave) where applicable; consolidate unlock
     paths in devres_release_group()
 
 - driver_override:
   - Convert PCI, WMI, vdpa, s390/cio, s390/ap, and fsl-mc to the
     generic driver_override infrastructure, replacing per-bus
     driver_override strings, sysfs attributes, and match logic; fixes
     a potential UAF from unsynchronized access to driver_override in
     bus match() callbacks
   - Simplify __device_set_driver_override() logic
 
 - kernfs:
   - Send IN_DELETE_SELF and IN_IGNORED inotify events on kernfs
     file and directory removal
   - Add corresponding selftests for memcg
 
 - platform:
   - Allow attaching software nodes when creating platform devices via
     a new 'swnode' field in struct platform_device_info
   - Add kerneldoc for struct platform_device_info
 
 - software node:
   - Move software node initialization from postcore_initcall() to
     driver_init(), making it available early in the boot process
   - Move kernel_kobj initialization (ksysfs_init) earlier to support
     the above
   - Remove software_node_exit(); dead code in a built-in unit
 
 - SoC:
   - Introduce of_machine_read_compatible() and of_machine_read_model()
     OF helpers and export soc_attr_read_machine() to replace direct
     accesses to of_root from SoC drivers; also enables
     CONFIG_COMPILE_TEST coverage for these drivers
 
 - sysfs:
   - Constify attribute group array pointers to
     'const struct attribute_group *const *' in sysfs functions,
     device_add_groups() / device_remove_groups(), and struct class
 
 - Rust:
   - Devres:
     - Embed struct devres_node directly in Devres<T> instead of going
       through devm_add_action(), avoiding the extra allocation and
       the unnecessary ARCH_DMA_MINALIGN alignment
 
   - I/O:
     - Turn IoCapable from a marker trait into a functional trait
       carrying the raw I/O accessor implementation (io_read /
       io_write), providing working defaults for the per-type Io
       methods
     - Add RelaxedMmio wrapper type, making relaxed accessors usable
       in code generic over the Io trait
     - Remove overloaded per-type Io methods and per-backend macros
       from Mmio and PCI ConfigSpace
 
   - I/O (Register):
     - Add IoLoc trait and generic read/write/update methods to the Io
       trait, making I/O operations parameterizable by typed locations
     - Add register! macro for defining hardware register types with
       typed bitfield accessors backed by Bounded values; supports
       direct, relative, and array register addressing
     - Add write_reg() / try_write_reg() and LocatedRegister trait
     - Update PCI sample driver to demonstrate the register! macro
 
         Example:
 
         ```
             register! {
                 /// UART control register.
                 CTRL(u32) @ 0x18 {
                     /// Receiver enable.
                     19:19   rx_enable => bool;
                     /// Parity configuration.
                     14:13   parity ?=> Parity;
                 }
 
                 /// FIFO watermark and counter register.
                 WATER(u32) @ 0x2c {
                     /// Number of datawords in the receive FIFO.
                     26:24   rx_count;
                     /// RX interrupt threshold.
                     17:16   rx_water;
                 }
             }
 
             impl WATER {
                 fn rx_above_watermark(&self) -> bool {
                     self.rx_count() > self.rx_water()
                 }
             }
 
             fn init(bar: &pci::Bar<BAR0_SIZE>) {
                 let water = WATER::zeroed()
                     .with_const_rx_water::<1>(); // > 3 would not compile
                 bar.write_reg(water);
 
                 let ctrl = CTRL::zeroed()
                     .with_parity(Parity::Even)
                     .with_rx_enable(true);
                 bar.write_reg(ctrl);
             }
 
             fn handle_rx(bar: &pci::Bar<BAR0_SIZE>) {
                 if bar.read(WATER).rx_above_watermark() {
                     // drain the FIFO
                 }
             }
 
             fn set_parity(bar: &pci::Bar<BAR0_SIZE>, parity: Parity) {
                 bar.update(CTRL, |r| r.with_parity(parity));
             }
         ```
 
   - IRQ:
     - Move 'static bounds from where clauses to trait declarations
       for IRQ handler traits
 
   - Misc:
     - Enable the generic_arg_infer Rust feature
     - Extend Bounded with shift operations, single-bit bool conversion,
       and const get()
 
 - Misc:
   - Make deferred_probe_timeout default a Kconfig option
   - Drop auxiliary_dev_pm_ops; the PM core falls back to driver PM
     callbacks when no bus type PM ops are set
   - Add conditional guard support for device_lock()
   - Add ksysfs.c to the DRIVER CORE MAINTAINERS entry
   - Fix kernel-doc warnings in base.h
   - Fix stale reference to memory_block_add_nid() in documentation
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQS2q/xV6QjXAdC7k+1FlHeO1qrKLgUCadl5SwAKCRBFlHeO1qrK
 LpjDAQCSG3vYznwrngfpmRU5bCB9sdUy/pZiX5px1357+amJkwEA9LgIVQvtHAZW
 ZXcQ7Jr+mR3mJEdlatbkWHp3w1VHqAQ=
 =y1DV
 -----END PGP SIGNATURE-----

Merge tag 'driver-core-7.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/driver-core/driver-core

Pull driver core updates from Danilo Krummrich:
 "debugfs:
   - Fix NULL pointer dereference in debugfs_create_str()
   - Fix misplaced EXPORT_SYMBOL_GPL for debugfs_create_str()
   - Fix soundwire debugfs NULL pointer dereference from uninitialized
     firmware_file

  device property:
   - Make fwnode flags modifications thread safe; widen the field to
     unsigned long and use set_bit() / clear_bit() based accessors
   - Document how to check for the property presence

  devres:
   - Separate struct devres_node from its "subclasses" (struct devres,
     struct devres_group); give struct devres_node its own release and
     free callbacks for per-type dispatch
   - Introduce struct devres_action for devres actions, avoiding the
     ARCH_DMA_MINALIGN alignment overhead of struct devres
   - Export struct devres_node and its init/add/remove/dbginfo
     primitives for use by Rust Devres<T>
   - Fix missing node debug info in devm_krealloc()
   - Use guard(spinlock_irqsave) where applicable; consolidate unlock
     paths in devres_release_group()

  driver_override:
   - Convert PCI, WMI, vdpa, s390/cio, s390/ap, and fsl-mc to the
     generic driver_override infrastructure, replacing per-bus
     driver_override strings, sysfs attributes, and match logic; fixes a
     potential UAF from unsynchronized access to driver_override in bus
     match() callbacks
   - Simplify __device_set_driver_override() logic

  kernfs:
   - Send IN_DELETE_SELF and IN_IGNORED inotify events on kernfs file
     and directory removal
   - Add corresponding selftests for memcg

  platform:
   - Allow attaching software nodes when creating platform devices via a
     new 'swnode' field in struct platform_device_info
   - Add kerneldoc for struct platform_device_info

  software node:
   - Move software node initialization from postcore_initcall() to
     driver_init(), making it available early in the boot process
   - Move kernel_kobj initialization (ksysfs_init) earlier to support
     the above
   - Remove software_node_exit(); dead code in a built-in unit

  SoC:
   - Introduce of_machine_read_compatible() and of_machine_read_model()
     OF helpers and export soc_attr_read_machine() to replace direct
     accesses to of_root from SoC drivers; also enables
     CONFIG_COMPILE_TEST coverage for these drivers

  sysfs:
   - Constify attribute group array pointers to
     'const struct attribute_group *const *' in sysfs functions,
     device_add_groups() / device_remove_groups(), and struct class

  Rust:
   - Devres:
      - Embed struct devres_node directly in Devres<T> instead of going
        through devm_add_action(), avoiding the extra allocation and the
        unnecessary ARCH_DMA_MINALIGN alignment

   - I/O:
      - Turn IoCapable from a marker trait into a functional trait
        carrying the raw I/O accessor implementation (io_read /
        io_write), providing working defaults for the per-type Io
        methods
      - Add RelaxedMmio wrapper type, making relaxed accessors usable in
        code generic over the Io trait
      - Remove overloaded per-type Io methods and per-backend macros
        from Mmio and PCI ConfigSpace

   - I/O (Register):
      - Add IoLoc trait and generic read/write/update methods to the Io
        trait, making I/O operations parameterizable by typed locations
      - Add register! macro for defining hardware register types with
        typed bitfield accessors backed by Bounded values; supports
        direct, relative, and array register addressing
      - Add write_reg() / try_write_reg() and LocatedRegister trait
      - Update PCI sample driver to demonstrate the register! macro

         Example:

         ```
             register! {
                 /// UART control register.
                 CTRL(u32) @ 0x18 {
                     /// Receiver enable.
                     19:19   rx_enable => bool;
                     /// Parity configuration.
                     14:13   parity ?=> Parity;
                 }

                 /// FIFO watermark and counter register.
                 WATER(u32) @ 0x2c {
                     /// Number of datawords in the receive FIFO.
                     26:24   rx_count;
                     /// RX interrupt threshold.
                     17:16   rx_water;
                 }
             }

             impl WATER {
                 fn rx_above_watermark(&self) -> bool {
                     self.rx_count() > self.rx_water()
                 }
             }

             fn init(bar: &pci::Bar<BAR0_SIZE>) {
                 let water = WATER::zeroed()
                     .with_const_rx_water::<1>(); // > 3 would not compile
                 bar.write_reg(water);

                 let ctrl = CTRL::zeroed()
                     .with_parity(Parity::Even)
                     .with_rx_enable(true);
                 bar.write_reg(ctrl);
             }

             fn handle_rx(bar: &pci::Bar<BAR0_SIZE>) {
                 if bar.read(WATER).rx_above_watermark() {
                     // drain the FIFO
                 }
             }

             fn set_parity(bar: &pci::Bar<BAR0_SIZE>, parity: Parity) {
                 bar.update(CTRL, |r| r.with_parity(parity));
             }
         ```

   - IRQ:
      - Move 'static bounds from where clauses to trait declarations for
        IRQ handler traits

   - Misc:
      - Enable the generic_arg_infer Rust feature
      - Extend Bounded with shift operations, single-bit bool
        conversion, and const get()

  Misc:
   - Make deferred_probe_timeout default a Kconfig option
   - Drop auxiliary_dev_pm_ops; the PM core falls back to driver PM
     callbacks when no bus type PM ops are set
   - Add conditional guard support for device_lock()
   - Add ksysfs.c to the DRIVER CORE MAINTAINERS entry
   - Fix kernel-doc warnings in base.h
   - Fix stale reference to memory_block_add_nid() in documentation"

* tag 'driver-core-7.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/driver-core/driver-core: (67 commits)
  bus: fsl-mc: use generic driver_override infrastructure
  s390/ap: use generic driver_override infrastructure
  s390/cio: use generic driver_override infrastructure
  vdpa: use generic driver_override infrastructure
  platform/wmi: use generic driver_override infrastructure
  PCI: use generic driver_override infrastructure
  driver core: make software nodes available earlier
  software node: remove software_node_exit()
  kernel: ksysfs: initialize kernel_kobj earlier
  MAINTAINERS: add ksysfs.c to the DRIVER CORE entry
  drivers/base/memory: fix stale reference to memory_block_add_nid()
  device property: Document how to check for the property presence
  soundwire: debugfs: initialize firmware_file to empty string
  debugfs: fix placement of EXPORT_SYMBOL_GPL for debugfs_create_str()
  debugfs: check for NULL pointer in debugfs_create_str()
  driver core: Make deferred_probe_timeout default a Kconfig option
  driver core: simplify __device_set_driver_override() clearing logic
  driver core: auxiliary bus: Drop auxiliary_dev_pm_ops
  device property: Make modifications of fwnode "flags" thread safe
  rust: devres: embed struct devres_node directly
  ...
2026-04-13 19:03:11 -07:00
Linus Torvalds
d568788baa hardening updates for v7.1-rc1
- randomize_kstack: Improve implementation across arches (Ryan Roberts)
 
 - lkdtm/fortify: Drop unneeded FORTIFY_STR_OBJECT test
 
 - refcount: Remove unused __signed_wrap function annotations
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRSPkdeREjth1dHnSE2KwveOeQkuwUCad16PwAKCRA2KwveOeQk
 u7crAP4qz8gXCjes76KsZm/YQS8PtOG5JroAVu5Oa4ohw0RfaQD+K/XLow1plcNF
 4Bi8zSuv2ifcLysh9qEAbx5+wcHijgo=
 =woB3
 -----END PGP SIGNATURE-----

Merge tag 'hardening-v7.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux

Pull hardening updates from Kees Cook:

 - randomize_kstack: Improve implementation across arches (Ryan Roberts)

 - lkdtm/fortify: Drop unneeded FORTIFY_STR_OBJECT test

 - refcount: Remove unused __signed_wrap function annotations

* tag 'hardening-v7.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
  lkdtm/fortify: Drop unneeded FORTIFY_STR_OBJECT test
  refcount: Remove unused __signed_wrap function annotations
  randomize_kstack: Unify random source across arches
  randomize_kstack: Maintain kstack_offset per task
2026-04-13 17:52:29 -07:00
Linus Torvalds
de639344bb audit/stable-7.1 PR 20260410
-----BEGIN PGP SIGNATURE-----
 
 iQJIBAABCgAyFiEES0KozwfymdVUl37v6iDy2pc3iXMFAmnZegUUHHBhdWxAcGF1
 bC1tb29yZS5jb20ACgkQ6iDy2pc3iXNydxAApWBVRWp/AY7jtCQGWRYAa+6y+bQ0
 RWfu8putXaOyk3NTeWP64e87FKsdByR/yflefYxMH+bXc2mwbuUZYAreEVmLCJ1P
 QxHKuwCkCNOz90n/Y7nlDSDK1GYdzlFkCgidfr4iNSCD58WMTtNNpZREzaNiR8a1
 PZ3bFvJH+S7BRCGA6/S/20rNYeWTga56pSrWt6VpMwVHGJ1R4DsD60pT8z0NqMYI
 BTBLeZ36HlZdwUp+APldKNNDRKG1ZQVKJRO68qcSkopr4vQzK7yL/SJsCdU8MHj2
 LccXTCTHHWJbpdiE7BtzPO9UobVZIdcz2wsnJHWxzHYtXlPolgM7F31111GL4HSv
 V/mq5o7dR3h6nn+1gkWHjOpd/f3J3xl3FaJsH9FIIhPmCRHb4oZI0WG0ZH3mHZBl
 o6aaWja3PBl0XNA+q87DQVBYDOyVNB4RjuaKy+d7hm4eronTRaZkg3zutrB6/XxP
 uFbp+Q3diWNMsYO52DKFThL/sStmnnCMIRJuTxd8QaPhLVakaFSkWZycSUH4HijD
 8WMk3e4yo3TeD6rCAognwKclj0vCMHS3TLOMXlY0vMD04gwXJ2S81yfyXGT4F5De
 KkXj61TFMxPyiZ6yrxk86BmoqHL0DUiCDn1rMKbNdIncHedKZoNuy+O/XNLS6No/
 hLRvXSI7MNthJ5E=
 =1rY2
 -----END PGP SIGNATURE-----

Merge tag 'audit-pr-20260410' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit

Pull audit updates from Paul Moore:

 - Improved handling of unknown status requests from userspace

   The current kernel code ignores unknown/unused request bits sent from
   userspace and returns an error code based on the results of the
   request(s) it does understand. The patch from Ricardo fixes this so
   that unknown requests return an -EINVAL to userspace, making
   compatibility a bit easier moving forward.

 - A number of small style and formatting cleanups

* tag 'audit-pr-20260410' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit:
  audit: handle unknown status requests in audit_receive_msg()
  audit: fix coding style issues
  audit: remove redundant initialization of static variables to 0
  audit: fix whitespace alignment in include/uapi/linux/audit.h
2026-04-13 14:56:54 -07:00
Linus Torvalds
ef3da345cc vfs-7.1-rc1.misc
Please consider pulling these changes from the signed vfs-7.1-rc1.misc tag.
 
 Thanks!
 Christian
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCadjZCwAKCRCRxhvAZXjc
 ohhBAQCAmQMlMRAXAgUZFYMTZpeQlcujP5rv+/vT2Tf/xS76YwD/dRDaw1FH294+
 qtk/Z1NjleNixzE2sld1K9J32NxeyAc=
 =+g9q
 -----END PGP SIGNATURE-----

Merge tag 'vfs-7.1-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull misc vfs updates from Christian Brauner:
 "Features:
   - coredump: add tracepoint for coredump events
   - fs: hide file and bfile caches behind runtime const machinery

  Fixes:
   - fix architecture-specific compat_ftruncate64 implementations
   - dcache: Limit the minimal number of bucket to two
   - fs/omfs: reject s_sys_blocksize smaller than OMFS_DIR_START
   - fs/mbcache: cancel shrink work before destroying the cache
   - dcache: permit dynamic_dname()s up to NAME_MAX

  Cleanups:
   - remove or unexport unused fs_context infrastructure
   - trivial ->setattr cleanups
   - selftests/filesystems: Assume that TIOCGPTPEER is defined
   - writeback: fix kernel-doc function name mismatch for wb_put_many()
   - autofs: replace manual symlink buffer allocation in autofs_dir_symlink
   - init/initramfs.c: trivial fix: FSM -> Finite-state machine
   - fs: remove stale and duplicate forward declarations
   - readdir: Introduce dirent_size()
   - fs: Replace user_access_{begin/end} by scoped user access
   - kernel: acct: fix duplicate word in comment
   - fs: write a better comment in step_into() concerning .mnt assignment
   - fs: attr: fix comment formatting and spelling issues"

* tag 'vfs-7.1-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (28 commits)
  dcache: permit dynamic_dname()s up to NAME_MAX
  fs: attr: fix comment formatting and spelling issues
  fs: hide file and bfile caches behind runtime const machinery
  fs: write a better comment in step_into() concerning .mnt assignment
  proc: rename proc_notify_change to proc_setattr
  proc: rename proc_setattr to proc_nochmod_setattr
  affs: rename affs_notify_change to affs_setattr
  adfs: rename adfs_notify_change to adfs_setattr
  hfs: update comments on hfs_inode_setattr
  kernel: acct: fix duplicate word in comment
  fs: Replace user_access_{begin/end} by scoped user access
  readdir: Introduce dirent_size()
  coredump: add tracepoint for coredump events
  fs: remove do_sys_truncate
  fs: pass on FTRUNCATE_* flags to do_truncate
  fs: fix archiecture-specific compat_ftruncate64
  fs: remove stale and duplicate forward declarations
  init/initramfs.c: trivial fix: FSM -> Finite-state machine
  autofs: replace manual symlink buffer allocation in autofs_dir_symlink
  fs/mbcache: cancel shrink work before destroying the cache
  ...
2026-04-13 14:20:11 -07:00
Linus Torvalds
07c3ef5822 vfs-7.1-rc1.pidfs
Please consider pulling these changes from the signed vfs-7.1-rc1.pidfs tag.
 
 Thanks!
 Christian
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCadjZCwAKCRCRxhvAZXjc
 omfuAQDckt5g7vxBr9hKdyrq1//nsu44fst/mRqr2iSYjuKfPQD/VN6Lw9e56Y/q
 l4hHxsPPrSSxbijwng7im36iPIGdfwI=
 =BbFh
 -----END PGP SIGNATURE-----

Merge tag 'vfs-7.1-rc1.pidfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull clone and pidfs updates from Christian Brauner:
 "Add three new clone3() flags for pidfd-based process lifecycle
  management.

  CLONE_AUTOREAP:

     CLONE_AUTOREAP makes a child process auto-reap on exit without ever
     becoming a zombie. This is a per-process property in contrast to
     the existing auto-reap mechanism via SA_NOCLDWAIT or SIG_IGN for
     SIGCHLD which applies to all children of a given parent.

     Currently the only way to automatically reap children is to set
     SA_NOCLDWAIT or SIG_IGN on SIGCHLD. This is a parent-scoped
     property affecting all children which makes it unsuitable for
     libraries or applications that need selective auto-reaping of
     specific children while still being able to wait() on others.

     CLONE_AUTOREAP stores an autoreap flag in the child's
     signal_struct. When the child exits do_notify_parent() checks this
     flag and causes exit_notify() to transition the task directly to
     EXIT_DEAD. Since the flag lives on the child it survives
     reparenting: if the original parent exits and the child is
     reparented to a subreaper or init the child still auto-reaps when
     it eventually exits. This is cleaner than forcing the subreaper to
     get SIGCHLD and then reaping it. If the parent doesn't care the
     subreaper won't care. If there's a subreaper that would care it
     would be easy enough to add a prctl() that either just turns back
     on SIGCHLD and turns off auto-reaping or a prctl() that just
     notifies the subreaper whenever a child is reparented to it.

     CLONE_AUTOREAP can be combined with CLONE_PIDFD to allow the parent
     to monitor the child's exit via poll() and retrieve exit status via
     PIDFD_GET_INFO. Without CLONE_PIDFD it provides a fire-and-forget
     pattern. No exit signal is delivered so exit_signal must be zero.
     CLONE_THREAD and CLONE_PARENT are rejected: CLONE_THREAD because
     autoreap is a process-level property, and CLONE_PARENT because an
     autoreap child reparented via CLONE_PARENT could become an
     invisible zombie under a parent that never calls wait().

     The flag is not inherited by the autoreap process's own children.
     Each child that should be autoreaped must be explicitly created
     with CLONE_AUTOREAP.

  CLONE_NNP:

     CLONE_NNP sets no_new_privs on the child at clone time. Unlike
     prctl(PR_SET_NO_NEW_PRIVS) which a process sets on itself,
     CLONE_NNP allows the parent to impose no_new_privs on the child at
     creation without affecting the parent's own privileges.
     CLONE_THREAD is rejected because threads share credentials.
     CLONE_NNP is useful on its own for any spawn-and-sandbox pattern
     but was specifically introduced to enable unprivileged usage of
     CLONE_PIDFD_AUTOKILL.

  CLONE_PIDFD_AUTOKILL:

     This flag ties a child's lifetime to the pidfd returned from
     clone3(). When the last reference to the struct file created by
     clone3() is closed the kernel sends SIGKILL to the child. A pidfd
     obtained via pidfd_open() for the same process does not keep the
     child alive and does not trigger autokill - only the specific
     struct file from clone3() has this property. This is useful for
     container runtimes, service managers, and sandboxed subprocess
     execution - any scenario where the child must die if the parent
     crashes or abandons the pidfd or just wants a throwaway helper
     process.

     CLONE_PIDFD_AUTOKILL requires both CLONE_PIDFD and CLONE_AUTOREAP.
     It requires CLONE_PIDFD because the whole point is tying the
     child's lifetime to the pidfd. It requires CLONE_AUTOREAP because a
     killed child with no one to reap it would become a zombie - the
     primary use case is the parent crashing or abandoning the pidfd so
     no one is around to call waitpid(). CLONE_THREAD is rejected
     because autokill targets a process not a thread.

     If CLONE_NNP is specified together with CLONE_PIDFD_AUTOKILL an
     unprivileged user may spawn a process that is autokilled. The child
     cannot escalate privileges via setuid/setgid exec after being
     spawned. If CLONE_PIDFD_AUTOKILL is specified without CLONE_NNP the
     caller must have have CAP_SYS_ADMIN in its user namespace"

* tag 'vfs-7.1-rc1.pidfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  selftests: check pidfd_info->coredump_code correctness
  pidfds: add coredump_code field to pidfd_info
  kselftest/coredump: reintroduce null pointer dereference
  selftests/pidfd: add CLONE_PIDFD_AUTOKILL tests
  selftests/pidfd: add CLONE_NNP tests
  selftests/pidfd: add CLONE_AUTOREAP tests
  pidfd: add CLONE_PIDFD_AUTOKILL
  clone: add CLONE_NNP
  clone: add CLONE_AUTOREAP
2026-04-13 13:27:11 -07:00
Linus Torvalds
dc0dfa7338 namespaces-7.1-rc1.misc
Please consider pulling these changes from the signed namespaces-7.1-rc1.misc tag.
 
 Thanks!
 Christian
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCadjZCwAKCRCRxhvAZXjc
 ols1AP9rA4gjJOTwHg0/pc+GL4qLSqUP3O4KeuJ8qccBcEUITAD/frpUjR11Ibw/
 F78/x1QhDPI8PCcw7kEyAPTfDb9VsgU=
 =5HBm
 -----END PGP SIGNATURE-----

Merge tag 'namespaces-7.1-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull namespace update from Christian Brauner:
 "Add two simple helper macros for the namespace infrastructure"

* tag 'namespaces-7.1-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  nsproxy: Add FOR_EACH_NS_TYPE() X-macro and CLONE_NS_ALL
2026-04-13 13:02:49 -07:00
Linus Torvalds
b7d74ea0fd vfs-7.1-rc1.kino
Please consider pulling these changes from the signed vfs-7.1-rc1.kino tag.
 
 Thanks!
 Christian
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCadjZCgAKCRCRxhvAZXjc
 otmnAP4sbsxZQdz2TG2hJuOwnEZOkkxZQOUMc3ERVyZaWXIeTAEA7e5M+8FpoG9n
 8ipO76UoaXdGLESrqVdp9EOhLqOW7QY=
 =uMeJ
 -----END PGP SIGNATURE-----

Merge tag 'vfs-7.1-rc1.kino' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull vfs i_ino updates from Christian Brauner:
 "For historical reasons, the inode->i_ino field is an unsigned long,
  which means that it's 32 bits on 32 bit architectures. This has caused
  a number of filesystems to implement hacks to hash a 64-bit identifier
  into a 32-bit field, and deprives us of a universal identifier field
  for an inode.

  This changes the inode->i_ino field from an unsigned long to a u64.
  This shouldn't make any material difference on 64-bit hosts, but
  32-bit hosts will see struct inode grow by at least 4 bytes. This
  could have effects on slabcache sizes and field alignment.

  The bulk of the changes are to format strings and tracepoints, since
  the kernel itself doesn't care that much about the i_ino field. The
  first patch changes some vfs function arguments, so check that one out
  carefully.

  With this change, we may be able to shrink some inode structures. For
  instance, struct nfs_inode has a fileid field that holds the 64-bit
  inode number. With this set of changes, that field could be
  eliminated. I'd rather leave that sort of cleanups for later just to
  keep this simple"

* tag 'vfs-7.1-rc1.kino' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  nilfs2: fix 64-bit division operations in nilfs_bmap_find_target_in_group()
  EVM: add comment describing why ino field is still unsigned long
  vfs: remove externs from fs.h on functions modified by i_ino widening
  treewide: fix missed i_ino format specifier conversions
  ext4: fix signed format specifier in ext4_load_inode trace event
  treewide: change inode->i_ino from unsigned long to u64
  nilfs2: widen trace event i_ino fields to u64
  f2fs: widen trace event i_ino fields to u64
  ext4: widen trace event i_ino fields to u64
  zonefs: widen trace event i_ino fields to u64
  hugetlbfs: widen trace event i_ino fields to u64
  ext2: widen trace event i_ino fields to u64
  cachefiles: widen trace event i_ino fields to u64
  vfs: widen trace event i_ino fields to u64
  net: change sock.sk_ino and sock_i_ino() to u64
  audit: widen ino fields to u64
  vfs: widen inode hash/lookup functions to u64
2026-04-13 12:19:01 -07:00
Linus Torvalds
28483203f7 RCU changes for v7.1
NOCB CPU management:
 
 - Consolidate rcu_nocb_cpu_offload() and rcu_nocb_cpu_deoffload() to reduce
   code duplication.
 - Extract nocb_bypass_needs_flush() helper to reduce duplication in NOCB
   bypass path.
 
 rcutorture/torture infrastructure:
 
 - Add NOCB01 config for RCU_LAZY torture testing.
 - Add NOCB02 config for NOCB poll mode testing.
 - Add TRIVIAL-PREEMPT config for textbook-style preemptible RCU torture.
 - Test call_srcu() with preemption both disabled and enabled.
 - Remove kvm-check-branches.sh in favor of kvm-series.sh.
 - Make hangs more visible in torture.sh output.
 - Add informative message for tests without a recheck file.
 - Fix numeric test comparison in srcu_lockdep.sh.
 - Use torture_shutdown_init() in refscale and rcuscale instead of open-coded
   shutdown functions.
 - Fix modulo-zero error in torture_hrtimeout_ns().
 
 SRCU:
 
 - Fix SRCU read flavor macro comments.
 - Fix s/they disables/they disable/ typo in srcu_read_unlock_fast().
 
 RCU Tasks:
 
 - Document that RCU Tasks Trace grace periods now imply RCU grace periods.
 - Remove unnecessary smp_store_release() in cblist_init_generic().
 
 RCU stall:
 
 - Add BOOTPARAM_RCU_STALL_PANIC Kconfig option to allow triggering a kernel
   panic on RCU stall via kernel boot parameter.
 -----BEGIN PGP SIGNATURE-----
 
 iQJKBAABCgA0FiEEcoCIrlGe4gjE06JJqA4nf2o45hAFAmnMBEAWHGpvZWxhZ25l
 bGZAbnZpZGlhLmNvbQAKCRCoDid/ajjmEBCdD/4rUNIocKBmSJWHFeDzl8rvMGSx
 RISjeaj8m+HUsQRSq7hecygUkSbGEQuj8KwFDJOsmuzhMRg3hR7yEz1AEphQiiIR
 TN0IdGblQd57KqAk6sHsXA5mNjSWZfJQuOkvHsMxIJdhhDRDHPTGtaCBwlYoLc6b
 de31LCvCfRL1QgiNAznTlPLRyehI9/+6VASLvKbmi3sQQv9B/KBBWydkryvMJbfW
 uc05SUn2qQq2h/NNeYvo6RllXWMPqO4/0ll8Y71ltwFhg8JdS9qXDSFX5cWlLISA
 uBA4pC9bk0xi7RGE5v4n9+fNNTr64Tjtd9QoxFvd/Jb6jdKDGjabwojoAYXCsdNw
 r8Dp3fKwGofruhUqgO6LyHyG54Ro2paVYjsqr5HW9C6jalZ9+HmH5wOnKw2h5E/i
 VRAM6MpwQ869ZeHhDxF8meRKf3+E+UIR95qfNZkABcFnwxgXheT+RASP/UOnoTM7
 Zexuzp9c6GQu34MTt0Rz9vNfJta+ZmPlnccan1T3DgoHvcWip7IDhUD1Vqerd9/4
 VK4JeeKI0t/fR5ydjs5qiA/O5caIFqtQTD0Ag2vLGQiP72pe2lRroIjD3xNZ2bWS
 DN4k6ZCxR2hdPUcWzdYlGe2pYmvAlqCBxZ4fNgrs55uP2EuqW+Y8aJd9RwvRPb2B
 wmTZepOh7Ct9+cjTBA==
 =wjsx
 -----END PGP SIGNATURE-----

Merge tag 'rcu.2026.03.31a' of git://git.kernel.org/pub/scm/linux/kernel/git/rcu/linux

Pull RCU updates from Joel Fernandes:
 "NOCB CPU management:

   - Consolidate rcu_nocb_cpu_offload() and rcu_nocb_cpu_deoffload() to
     reduce code duplication

   - Extract nocb_bypass_needs_flush() helper to reduce duplication in
     NOCB bypass path

  rcutorture/torture infrastructure:

   - Add NOCB01 config for RCU_LAZY torture testing

   - Add NOCB02 config for NOCB poll mode testing

   - Add TRIVIAL-PREEMPT config for textbook-style preemptible RCU
     torture

   - Test call_srcu() with preemption both disabled and enabled

   - Remove kvm-check-branches.sh in favor of kvm-series.sh

   - Make hangs more visible in torture.sh output

   - Add informative message for tests without a recheck file

   - Fix numeric test comparison in srcu_lockdep.sh

   - Use torture_shutdown_init() in refscale and rcuscale instead of
     open-coded shutdown functions

   - Fix modulo-zero error in torture_hrtimeout_ns().

  SRCU:

   - Fix SRCU read flavor macro comments

   - Fix s/they disables/they disable/ typo in srcu_read_unlock_fast()

  RCU Tasks:

   - Document that RCU Tasks Trace grace periods now imply RCU grace
     periods

   - Remove unnecessary smp_store_release() in cblist_init_generic()"

* tag 'rcu.2026.03.31a' of git://git.kernel.org/pub/scm/linux/kernel/git/rcu/linux:
  rcutorture: Test call_srcu() with preemption disabled and not
  rcu: Add BOOTPARAM_RCU_STALL_PANIC Kconfig option
  torture: Avoid modulo-zero error in torture_hrtimeout_ns()
  rcu/nocb: Extract nocb_bypass_needs_flush() to reduce duplication
  rcu/nocb: Consolidate rcu_nocb_cpu_offload/deoffload functions
  rcu-tasks: Remove unnecessary smp_store_release() in cblist_init_generic()
  rcutorture: Add NOCB02 config for nocb poll mode testing
  rcutorture: Add NOCB01 config for RCU_LAZY torture testing
  rcu-tasks: Document that RCU Tasks Trace grace periods now imply RCU grace periods
  srcu: Fix s/they disables/they disable/ typo in srcu_read_unlock_fast()
  srcu: Fix SRCU read flavor macro comments
  rcuscale: Ditch rcu_scale_shutdown in favor of torture_shutdown_init()
  refscale: Ditch ref_scale_shutdown in favor of torture_shutdown_init()
  rcutorture: Fix numeric "test" comparison in srcu_lockdep.sh
  torture: Print informative message for test without recheck file
  torture: Make hangs more visible in torture.sh output
  kvm-check-branches.sh: Remove in favor of kvm-series.sh
  rcutorture: Add a textbook-style trivial preemptible RCU
2026-04-13 09:36:45 -07:00
Breno Leitao
76af546488 workqueue: validate cpumask_first() result in llc_populate_cpu_shard_id()
On uniprocessor (UP) configs such as nios2, NR_CPUS is 1, so
cpu_shard_id[] is a single-element array (int[1]). In
llc_populate_cpu_shard_id(), cpumask_first(sibling_cpus) returns an
unsigned int that the compiler cannot prove is always 0, triggering
a -Warray-bounds warning when the result is used to index
cpu_shard_id[]:

  kernel/workqueue.c:8321:55: warning: array subscript 1 is above
  array bounds of 'int[1]' [-Warray-bounds]
   8321 |  cpu_shard_id[c] = cpu_shard_id[cpumask_first(sibling_cpus)];
        |                    ~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~

This is a false positive: sibling_cpus can never be empty here because
'c' itself is always set in it, so cpumask_first() will always return a
valid CPU. However, the compiler cannot prove this statically, and the
warning only manifests on UP configs where the array size is 1.

Add a bounds check with WARN_ON_ONCE to silence the warning, and store
the result in a local variable to make the code clearer and avoid calling
cpumask_first() twice.

Fixes: 5920d046f7 ("workqueue: add WQ_AFFN_CACHE_SHARD affinity scope")
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202604022343.GQtkF2vO-lkp@intel.com/
Signed-off-by: Breno Leitao <leitao@debian.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
2026-04-13 06:15:26 -10:00
Paolo Bonzini
e74c3a8891 KVM/arm64 updates for 7.1
* New features:
 
 - Add support for tracing in the standalone EL2 hypervisor code,
   which should help both debugging and performance analysis.
   This comes with a full infrastructure for 'remote' trace buffers
   that can be exposed by non-kernel entities such as firmware.
 
 - Add support for GICv5 Per Processor Interrupts (PPIs), as the
   starting point for supporting the new GIC architecture in KVM.
 
 - Finally add support for pKVM protected guests, with anonymous
   memory being used as a backing store. About time!
 
 * Improvements and bug fixes:
 
 - Rework the dreaded user_mem_abort() function to make it more
   maintainable, reducing the amount of state being exposed to
   the various helpers and rendering a substantial amount of
   state immutable.
 
 - Expand the Stage-2 page table dumper to support NV shadow
   page tables on a per-VM basis.
 
 - Tidy up the pKVM PSCI proxy code to be slightly less hard
   to follow.
 
 - Fix both SPE and TRBE in non-VHE configurations so that they
   do not generate spurious, out of context table walks that
   ultimately lead to very bad HW lockups.
 
 - A small set of patches fixing the Stage-2 MMU freeing in error
   cases.
 
 - Tighten-up accepted SMC immediate value to be only #0 for host
   SMCCC calls.
 
 - The usual cleanups and other selftest churn.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEn9UcU+C1Yxj9lZw9I9DQutE9ekMFAmnWdswACgkQI9DQutE9
 ekNYvBAAxj5Zmsx8sJ2CYDTJc2w4XkEjSgDugA+J/s0TMgrzExeBlWCstdhVTncy
 68nwOjQl3TotnIrt7q36kko9u7IdD0pHNrk34NtlggLjHfB61n9SNcAA6j4F6zJa
 GFkHpJSrSnZuUPqapkDnlyhuPkgTIAkEUk2Am9siksSfY4HvRyHZJm2FTdxsdIBn
 NN9wvQqw2wefTXOQ8gS+oHbPVp1cPbwrF2a3EhzXXv/6W3mUBstXgsijgo07UzCp
 W6vHCv2wqHbHdf67z3Q3hL+VXlVH6oHlyW99/swqISvqRkH/iSB90+oUojnMRrSm
 yB6Wmhh8jboCaajWMJhG+veZw+7GMXU4nOrGd1rbnY8cwRl/TQ5YibhRm7DIdvjO
 xeUluTLJ0NdweQUwE2k4OlgKOuGang3E2p0clmkUO4SstA48MdqR/kpST6guIlWw
 U5syuNaaaiuwP5QOi9qZmMCNmQ3ZfnZG3nseJFdoyGjhVhf5jyQyv4Du9vGZQFF/
 Zkg7yTqC4OWiC+3GkW9YYAySM1MyetivLtd47PGzHPTdtaZziWhNvQ0y+8QjQ+R+
 CJNvyS/DvsT7epSya4sLgMP1ZAlih9xkz5sQ6k8NJLBYYXi0v33qwqditErgLLyj
 S4Ci4WNhHHWIusvCVM7JUBkH0AElpmi506f7F6iHoFLlkYR4t9U=
 =/SuQ
 -----END PGP SIGNATURE-----

Merge tag 'kvmarm-7.1' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD

KVM/arm64 updates for 7.1

* New features:

- Add support for tracing in the standalone EL2 hypervisor code,
  which should help both debugging and performance analysis.
  This comes with a full infrastructure for 'remote' trace buffers
  that can be exposed by non-kernel entities such as firmware.

- Add support for GICv5 Per Processor Interrupts (PPIs), as the
  starting point for supporting the new GIC architecture in KVM.

- Finally add support for pKVM protected guests, with anonymous
  memory being used as a backing store. About time!

* Improvements and bug fixes:

- Rework the dreaded user_mem_abort() function to make it more
  maintainable, reducing the amount of state being exposed to
  the various helpers and rendering a substantial amount of
  state immutable.

- Expand the Stage-2 page table dumper to support NV shadow
  page tables on a per-VM basis.

- Tidy up the pKVM PSCI proxy code to be slightly less hard
  to follow.

- Fix both SPE and TRBE in non-VHE configurations so that they
  do not generate spurious, out of context table walks that
  ultimately lead to very bad HW lockups.

- A small set of patches fixing the Stage-2 MMU freeing in error
  cases.

- Tighten-up accepted SMC immediate value to be only #0 for host
  SMCCC calls.

- The usual cleanups and other selftest churn.
2026-04-13 11:49:54 +02:00
Alexei Starovoitov
fa2942918a Merge patch series "bpf: Fix OOB in pcpu_init_value and add a test"
xulang <xulang@uniontech.com> says:
====================

Fix OOB read when copying element from a BPF_MAP_TYPE_CGROUP_STORAGE
map to another pcpu map with the same value_size that is not rounded
up to 8 bytes, and add a test case to reproduce the issue.

The root cause is that pcpu_init_value() uses copy_map_value_long() which
rounds up the copy size to 8 bytes, but CGROUP_STORAGE map values are not
8-byte aligned (e.g., 4-byte). This causes a 4-byte OOB read when
the copy is performed.
====================

Link: https://lore.kernel.org/r/7653EEEC2BAB17DF+20260402073948.2185396-1-xulang@uniontech.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-04-12 13:36:55 -07:00
Lang Xu
576afddfee bpf: Fix OOB in pcpu_init_value
An out-of-bounds read occurs when copying element from a
BPF_MAP_TYPE_CGROUP_STORAGE map to another pcpu map with the
same value_size that is not rounded up to 8 bytes.

The issue happens when:
1. A CGROUP_STORAGE map is created with value_size not aligned to
   8 bytes (e.g., 4 bytes)
2. A pcpu map is created with the same value_size (e.g., 4 bytes)
3. Update element in 2 with data in 1

pcpu_init_value assumes that all sources are rounded up to 8 bytes,
and invokes copy_map_value_long to make a data copy, However, the
assumption doesn't stand since there are some cases where the source
may not be rounded up to 8 bytes, e.g., CGROUP_STORAGE, skb->data.
the verifier verifies exactly the size that the source claims, not
the size rounded up to 8 bytes by kernel, an OOB happens when the
source has only 4 bytes while the copy size(4) is rounded up to 8.

Fixes: d3bec0138b ("bpf: Zero-fill re-used per-cpu map element")
Reported-by: Kaiyan Mei <kaiyanm@hust.edu.cn>
Closes: https://lore.kernel.org/all/14e6c70c.6c121.19c0399d948.Coremail.kaiyanm@hust.edu.cn/
Link: https://lore.kernel.org/r/420FEEDDC768A4BE+20260402074236.2187154-1-xulang@uniontech.com
Signed-off-by: Lang Xu <xulang@uniontech.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-04-12 13:35:59 -07:00
Emil Tsalapatis
ac61bffe91 bpf: Allow instructions with arena source and non-arena dest registers
The compiler sometimes stores the result of a PTR_TO_ARENA and SCALAR
operation into the scalar register rather than the pointer register.
Relax the verifier to allow operations between a source arena register
and a destination non-arena register, marking the destination's value
as a PTR_TO_ARENA.

Signed-off-by: Emil Tsalapatis <emil@etsalapatis.com>
Acked-by: Song Liu <song@kernel.org>
Fixes: 6082b6c328 ("bpf: Recognize addr_space_cast instruction in the verifier.")
Link: https://lore.kernel.org/r/20260412174546.18684-2-emil@etsalapatis.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-04-12 12:47:39 -07:00
Menglong Dong
9fd19e3ed7 bpf: add missing fsession to the verifier log
The fsession attach type is missed in the verifier log in
check_get_func_ip(), bpf_check_attach_target() and check_attach_btf_id().
Update them to make the verifier log proper. Meanwhile, update the
corresponding selftests.

Acked-by: Leon Hwang <leon.hwang@linux.dev>
Signed-off-by: Menglong Dong <dongml2@chinatelecom.cn>
Link: https://lore.kernel.org/r/20260412060346.142007-2-dongml2@chinatelecom.cn
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-04-12 12:42:38 -07:00
Alexei Starovoitov
99a832a2b5 bpf: Move BTF checking logic into check_btf.c
BTF validation logic is independent from the main verifier.
Move it into check_btf.c

Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/r/20260412152936.54262-7-alexei.starovoitov@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-04-12 12:37:04 -07:00
Alexei Starovoitov
ed0b9710bd bpf: Move backtracking logic to backtrack.c
Move precision propagation and backtracking logic to backtrack.c
to reduce verifier.c size.

No functional changes.

Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/r/20260412152936.54262-6-alexei.starovoitov@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-04-12 12:36:58 -07:00
Alexei Starovoitov
c82834a5a1 bpf: Move state equivalence logic to states.c
verifier.c is huge. Move is_state_visited() to states.c,
so that all state equivalence logic is in one file.

Mechanical move. No functional changes.

Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/r/20260412152936.54262-5-alexei.starovoitov@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-04-12 12:36:52 -07:00
Alexei Starovoitov
f8a8faceab bpf: Move check_cfg() into cfg.c
verifier.c is huge. Move check_cfg(), compute_postorder(),
compute_scc() into cfg.c

Mechanical move. No functional changes.

Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/r/20260412152936.54262-4-alexei.starovoitov@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-04-12 12:36:45 -07:00
Alexei Starovoitov
fc150cddee bpf: Move compute_insn_live_regs() into liveness.c
verifier.c is huge. Move compute_insn_live_regs() into liveness.c.

Mechanical move. No functional changes.

Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/r/20260412152936.54262-3-alexei.starovoitov@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-04-12 12:36:38 -07:00
Alexei Starovoitov
449f08fa59 bpf: Move fixup/post-processing logic from verifier.c into fixups.c
verifier.c is huge. Split fixup/post-processing logic that runs after
the verifier accepted the program into fixups.c.

Mechanical move. No functional changes.

Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/r/20260412152936.54262-2-alexei.starovoitov@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-04-12 12:35:54 -07:00
Linus Torvalds
35bdc192d8 workqueue: Fixes for v7.0-rc7
- Fix incomplete activation of multiple inactive works when unplugging a
   pool_workqueue, where the pending_pwqs list wasn't being updated for
   subsequent works.
 -----BEGIN PGP SIGNATURE-----
 
 iIQEABYKACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCadvRzg4cdGpAa2VybmVs
 Lm9yZwAKCRCxYfJx3gVYGeLdAP9SZN3b4QRE79fnP/1dzf2RR/XWlFq7x49Wv1nP
 mmpwjQEAx/B6YabLNp98bdxaoygUkdz3zBDnL7oaWx/G0p9T/QA=
 =80oo
 -----END PGP SIGNATURE-----

Merge tag 'wq-for-7.0-rc7-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq

Pull workqueue fix from Tejun Heo:
 "This is a fix for a stall which triggers on ordered workqueues when
  there are multiple inactive work items during workqueue property
  changes through sysfs, which doesn't happen that frequently.

  While really late, the fix is very low risk as it just repeats an
  operation which is already being performed:

   - Fix incomplete activation of multiple inactive works when
     unplugging a pool_workqueue, where the pending_pwqs list
     wasn't being updated for subsequent works"

* tag 'wq-for-7.0-rc7-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
  workqueue: Add pool_workqueue to pending_pwqs list when unplugging multiple inactive works
2026-04-12 10:42:40 -07:00
Linus Torvalds
ab3dee2640 Two fixes for the time/timers subsystem:
- Invert the inverted fastpath decision in check_tick_dependency(), which
     prevents NOHZ full to stop the tick. That's a regression introduced in
     the 7.0 merge window.
 
   - Prevent a unpriviledged DoS in the clockevents code, where user space
     can starve the timer interrupt by arming a timerfd or posix interval
     timer in a tight loop with an absolute expiry time in the past.  The
     fix turned out to be incomplete and was was amended yesterday to make
     it work on some 20 years old AMD machines as well. All issues with it
     have been confirmed to be resolved by various reporters.
 -----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCgAuFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmnboPYQHHRnbHhAa2Vy
 bmVsLm9yZwAKCRCmGPVMDXSYocEDD/4+aEut3QsitcZliE5KTpD4QFVo70Ky5MMe
 XvQgOJyvSJeUBjOW3UVROEZvBWxpb0bYRbXGTs37pCKK4DwjOmodAaY6OI5W/atb
 13lnVWSiCKkZoRLs3+G+PCxCD3WuS2HhZc27fiFJmXNyLGfaLBVC2trNnheEBmUs
 m3DleuJYiuL3tNbQqlKJ2nIXJFp9qpHSkZ241gxinTvFhBz6QhhAkegMbq9DmHTR
 dps4xc3dtNm1poEcTKFeZesfpqlgJaX5LoDydtOFtDomKbXX2SzD82nvCBritUty
 ICcd99kIpAgREEn+aer7hokV8X3eAOBRrzPe8tqWl2mc55jcjQJHxD8dE3zIAv5a
 AkcnIsiItfHLCPB3iP95inUQoeBCvAX7jCWaLFH8EZCz8QfLALSetjn6HSbWeN/6
 ANqFsWlKhK7b1+VcdSL8LVdPtEsE2ILEIHR8dF9GqkVXzs1OPEJJHQoqQxUfDhpa
 BdAahPCybY+7lcvZko+WJYJ2Xj0TPRbvav9Ol1o8c8xLF3itFDTCL+E2VMoLdpUn
 F8SRDo2FUwysO4NVjHzR7YRTBhlF1pC3MgvPCL2kCekPv2TDUgacHjygnd/QGQ8m
 cEOZlD/FlCVE+MNSaLYOUFYJ2a7QwO4qCXOhjy6OFMtO3ezJnol9plqCRCJfhkgy
 towg2pGIaQ==
 =cnPC
 -----END PGP SIGNATURE-----

Merge tag 'timers-urgent-2026-04-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull timer fixes from Thomas Gleixner:
 "Two fixes for the time/timers subsystem:

   - Invert the inverted fastpath decision in check_tick_dependency(),
     which prevents NOHZ full to stop the tick. That's a regression
     introduced in the 7.0 merge window.

   - Prevent a unpriviledged DoS in the clockevents code, where user
     space can starve the timer interrupt by arming a timerfd or posix
     interval timer in a tight loop with an absolute expiry time in the
     past. The fix turned out to be incomplete and was was amended
     yesterday to make it work on some 20 years old AMD machines as
     well. All issues with it have been confirmed to be resolved by
     various reporters"

* tag 'timers-urgent-2026-04-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  clockevents: Prevent timer interrupt starvation
  tick/nohz: Fix inverted return value in check_tick_dependency() fast path
2026-04-12 10:01:55 -07:00
Linus Torvalds
02640d8886 Fix DL server related slowdown to deferred fair tasks.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmnbSK8RHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1jwlg/8DwVE6c2SeQH51lvWNpzOp+1C/Q+P7Fxv
 PofW6L4AQHclj0NHSjSN+NNEuQ4y4VeaxNeJHtRrRBkAV7cPB0MOEqzRU6dpgaA+
 Tlcm54X6+9xXjWpkLqCZvJBxmK6/d6NUt4mCJbXa/x7omVDoo/8I646WTbdDRZ1p
 YIJ63x+U06+81S/3DhsUR0X4+r5GlJDB3iTlEI3wZ26XeLsT/MvWs30QEzeQ1mzM
 zqqCYZ36aq1ULAz/nRYmGDIEsbwkBU0vXWeZDNeqGvMzNSK8P+Vejdd3wXXPs4vS
 QxbVCMQv256zmK3B9l7av0aJEwszs0aMqW/PD5QVLxTrWwV35akC0QF5GdAwNVbz
 m52KFMWElfq/Bu/fc/HrYeuEUY4ViA6/5Wrq021YGctidpEr+BFcFiLumQB6LL8/
 3lOBqwHtiM2qAtEvVCPgfVf3P9KmGl1OAw5TzgzzqKjA5TcDqSAe2CI/wvDhc6WK
 sSXo6pI3IOE2Q03X5b06Lyqd/NMVhBTb6DfwHTXdYbiDloV3m92UABBFfN9t6/qh
 IIn3U5JJEydZPaSKP1N9IdYScaRgqFgA2tXcMSqD1EJUYBJjh8Sp2cnv49nlfvK8
 fYoSniKlmfnSaS4ooHGp/z44BCceY7x+zfGBgdMfgf31EKypqwn71biXxqvquJoA
 UZTOfCJ/fRc=
 =b4Dg
 -----END PGP SIGNATURE-----

Merge tag 'sched-urgent-2026-04-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull scheduler fix from Ingo Molnar:
 "Fix DL server related slowdown to deferred fair tasks"

* tag 'sched-urgent-2026-04-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/deadline: Use revised wakeup rule for dl_server
2026-04-12 08:30:20 -07:00
Alexei Starovoitov
2ec74a0536 bpf: Simplify do_check_insn()
Move env->insn_idx++ to the caller, so that most of
check_*() calls in do_check_insn() tail call into the next helper.

Link: https://lore.kernel.org/r/20260411230001.71664-1-alexei.starovoitov@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-04-11 17:43:29 -07:00
Alexei Starovoitov
ae3f8ca2ba bpf: Move checks for reserved fields out of the main pass
Check reserved fields of each insn once in a prepass
instead of repeatedly rechecking them during the main verifier pass.

Link: https://lore.kernel.org/r/20260411200932.41797-1-alexei.starovoitov@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-04-11 15:24:41 -07:00
Linus Torvalds
558b9206d5 Probes fixes for v7.0-rc7
- tracing/probe: reject non-closed empty immediate strings
   Fixed a buffer index underflow bug that occurred when passing an
   non-closed empty immediate string to the probe event.
 -----BEGIN PGP SIGNATURE-----
 
 iQFPBAABCgA5FiEEh7BulGwFlgAOi5DV2/sHvwUrPxsFAmnaJr8bHG1hc2FtaS5o
 aXJhbWF0c3VAZ21haWwuY29tAAoJENv7B78FKz8b4PcH/29I02lnjYf4IJ6oiANa
 K5+RU+Xq0z+Tn079GapHI9vmF8d9M89bI76fqKAVBGHosk+QTdYPLwupbowY72Fq
 5uZPpkMXD+cvNJz0qGEq2WOPeB+VGxq8zR/lnZjhZLupJP0DfBp6SpXS0lFx4d0Q
 x+bw6keG7Guq0ocq2mFNWPo2U8bxlh78UrX+pYoIHyjeO+LVNC7X7Ld3aRDebMck
 uDnS5oLNMojd4Vi5WYuP1GzGzIUJ0cXDDXnSVqHP5zdt7VV83eZNlfWOKe1YyHEq
 FoVTrb3dRFFWn2EdXOJB+DjQXWZeZtUsfJTxWm4UgPgV1DmLmBFurfB8Kk0JmX7S
 0XU=
 =vHwk
 -----END PGP SIGNATURE-----

Merge tag 'probes-fixes-v7.0-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull tracing probe fix from Masami Hiramatsu:
 "Reject non-closed empty immediate strings

  Fix a buffer index underflow bug that occurred when passing an
  non-closed empty immediate string to the probe event"

* tag 'probes-fixes-v7.0-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  tracing/probe: reject non-closed empty immediate strings
2026-04-11 11:33:08 -07:00
Alexei Starovoitov
57205e2dd9 bpf: Delete unused variable
'cnt' is set, but not used. Delete it.

Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202604111401.eqzyF2kx-lkp@intel.com/
Fixes: 2c167d9177 ("bpf: change logging scheme for live stack analysis")
Link: https://lore.kernel.org/r/20260411141447.45932-1-alexei.starovoitov@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-04-11 10:04:28 -07:00
Thomas Gleixner
ff1c0c5d07 Merge branch 'timers/urgent' into timers/core
to resolve the conflict with urgent fixes.
2026-04-11 07:58:33 +02:00
Amery Hung
136deea435 bpf: Remove gfp_flags plumbing from bpf_local_storage_update()
Remove the check that rejects sleepable BPF programs from doing
BPF_ANY/BPF_EXIST updates on local storage. This restriction was added
in commit b00fa38a9c ("bpf: Enable non-atomic allocations in local
storage") because kzalloc(GFP_KERNEL) could sleep inside
local_storage->lock. This is no longer a concern: all local storage
allocations now use kmalloc_nolock() which never sleeps.

In addition, since kmalloc_nolock() only accepts __GFP_ACCOUNT,
__GFP_ZERO and __GFP_NO_OBJ_EXT, the gfp_flags parameter plumbing from
bpf_*_storage_get() to bpf_local_storage_update() becomes dead code.
Remove gfp_flags from bpf_selem_alloc(), bpf_local_storage_alloc() and
bpf_local_storage_update(). Drop the hidden 5th argument from
bpf_*_storage_get helpers, and remove the verifier patching that
injected GFP_KERNEL/GFP_ATOMIC into the fifth argument.

Signed-off-by: Amery Hung <ameryhung@gmail.com>
Link: https://lore.kernel.org/r/20260411015419.114016-4-ameryhung@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-04-10 21:22:32 -07:00
Amery Hung
5063e77588 bpf: Use kmalloc_nolock() universally in local storage
Switch to kmalloc_nolock() universally in local storage. Socket local
storage didn't move to kmalloc_nolock() when BPF memory allocator was
replaced by it for performance reasons. Now that kfree_rcu() supports
freeing memory allocated by kmalloc_nolock(), we can move the remaining
local storages to use kmalloc_nolock() and cleanup the cluttered free
paths.

Use kfree() instead of kfree_nolock() in bpf_selem_free_trace_rcu() and
bpf_local_storage_free_trace_rcu(). Both callbacks run in process context
where spinning is allowed, so kfree_nolock() is unnecessary.

Benchmark:

./bench -p 1 local-storage-create --storage-type socket \
  --batch-size {16,32,64}

The benchmark is a microbenchmark stress-testing how fast local storage
can be created. There is no measurable throughput change for socket local
storage after switching from kzalloc() to kmalloc_nolock().

Socket local storage

                 batch  creation speed              diff
---------------  ----   ------------------          ----
Baseline          16    433.9 ± 0.6 k/s
                  32    434.3 ± 1.4 k/s
                  64    434.2 ± 0.7 k/s

After             16    439.0 ± 1.9 k/s             +1.2%
                  32    437.3 ± 2.0 k/s             +0.7%
                  64    435.8 ± 2.5k/s              +0.4%

Also worth noting that the baseline got a 5% throughput boost when sheaf
replaces percpu partial slab recently [0].

[0] https://lore.kernel.org/bpf/20260123-sheaves-for-all-v4-0-041323d506f7@suse.cz/

Signed-off-by: Amery Hung <ameryhung@gmail.com>
Link: https://lore.kernel.org/r/20260411015419.114016-3-ameryhung@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-04-10 21:22:32 -07:00
Tejun Heo
49d78adf95 sched_ext: Drop spurious warning on kick during scheduler disable
kick_cpus_irq_workfn() warns when scx_kick_syncs is NULL, but this can
legitimately happen when a BPF timer or other kick source races with
free_kick_syncs() during scheduler disable. Drop the pr_warn_once() and
add a comment explaining the race.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Zhao Mengmeng <zhaomengmeng@kylinos.cn>
2026-04-10 16:38:25 -10:00
Daniel Borkmann
2f2ec8e773 bpf: Enforce regsafe base id consistency for BPF_ADD_CONST scalars
When regsafe() compares two scalar registers that both carry
BPF_ADD_CONST, check_scalar_ids() maps their full compound id
(aka base | BPF_ADD_CONST flag) as one idmap entry. However,
it never verifies that the underlying base ids, that is, with
the flag stripped are consistent with existing idmap mappings.

This allows construction of two verifier states where the old
state has R3 = R2 + 10 (both sharing base id A) while the current
state has R3 = R4 + 10 (base id C, unrelated to R2). The idmap
creates two independent entries: A->B (for R2) and A|flag->C|flag
(for R3), without catching that A->C conflicts with A->B. State
pruning then incorrectly succeeds.

Fix this by additionally verifying base ID mapping consistency
whenever BPF_ADD_CONST is set: after mapping the compound ids,
also invoke check_ids() on the base IDs (flag bits stripped).
This ensures that if A was already mapped to B from comparing
the source register, any ADD_CONST derivative must also derive
from B, not an unrelated C.

Fixes: 98d7ca374b ("bpf: Track delta between "linked" registers.")
Reported-by: STAR Labs SG <info@starlabs.sg>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/r/20260410232651.559778-1-daniel@iogearbox.net
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-04-10 17:39:09 -07:00
Linus Torvalds
e774d5f1bc RISC-V updates for v7.0-rc8
Before v7.0 is released, fix a few issues with the CFI patchset,
 merged earlier in v7.0-rc, that primarily affect interfaces to
 non-kernel code:
 
 - Improve the prctl() interface for per-task indirect branch landing
   pad control to expand abbreviations and to resemble the speculation
   control prctl() interface
 
 - Expand the "LP" and "SS" abbreviations in the ptrace uapi header
   file to "branch landing pad" and "shadow stack", to improve
   readability
 
 - Fix a typo in a CFI-related macro name in the ptrace uapi header
   file
 
 - Ensure that the indirect branch tracking state and shadow stack
   state are unlocked immediately after an exec() on the new task so
   that libc subsequently can control it
 
 - While working in this area, clean up the kernel-internal,
   cross-architecture prctl() function names by expanding the
   abbreviations mentioned above
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEElRDoIDdEz9/svf2Kx4+xDQu9KksFAmnYP5YACgkQx4+xDQu9
 KkuoPQ//Yye5D+35EqfA12yP96Vrtg0QCKiqMotz3yLo0T7zh5KosAs/QIE5eQi7
 vWRnCld5PsFa0ZS2822oPfQo8pKVO1y7M2ecFWSwaOWq865Xs82M/puqEQF3GFCS
 219cg1dTVBGvvKSf4MINUBRprfZmZRT9pzhSk79qHEbHKzwCDk7uah51iUdyPJyd
 KX3hshYMLq3rooTHR2wD/ChTpV+pCrt2rSUVbW8+sTUWDfv2sTLauHmemKw7LpdW
 C0SulXvcYkGyiqsB5AXW9x2ttJ5hX9diPb73XS6eBCU0CaMl9BVZWNKeqhEMJxKR
 wmqIadD8pelf7Jh7wGAbNW4hWqTsO3xRpZH38Y/cGLdhs3cqvKjEmT3fOFWUP9bP
 hWv5027gVXVSOmvxhPiUJs7D5WWAz4Q64JZfdJSmDdEWVXcI0v/hzdukuPw4iiT6
 DaqOyClTcwc+j1jawFTICXTF7wXfvZT5sjulrmPk1HX4nZ5padKpfQ77AdKHF9Q6
 9pC25QHQk42h/R4ynA4lm15YnCOfYvjP25hU7K64gQnqO6qBrolfrA4kJOmdYv/g
 1IXsA2YZafJbcXwyFZjWy50uu5gaCM5JhRRFdUrjmB6j3gv9HfBlWJXQywReUjPo
 Kq4tnFppxzFVm23COj9j5kyjsFjUhZ8KCft3+n7lrndeOCk5Z3E=
 =5/Ct
 -----END PGP SIGNATURE-----

Merge tag 'riscv-for-linus-v7.0-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux

Pull RISC-V updates from Paul Walmsley:
 "Before v7.0 is released, fix a few issues with the CFI patchset,
  merged earlier in v7.0-rc, that primarily affect interfaces to
  non-kernel code:

   - Improve the prctl() interface for per-task indirect branch landing
     pad control to expand abbreviations and to resemble the speculation
     control prctl() interface

   - Expand the "LP" and "SS" abbreviations in the ptrace uapi header
     file to "branch landing pad" and "shadow stack", to improve
     readability

   - Fix a typo in a CFI-related macro name in the ptrace uapi header
     file

   - Ensure that the indirect branch tracking state and shadow stack
     state are unlocked immediately after an exec() on the new task so
     that libc subsequently can control it

   - While working in this area, clean up the kernel-internal,
     cross-architecture prctl() function names by expanding the
     abbreviations mentioned above"

* tag 'riscv-for-linus-v7.0-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux:
  prctl: cfi: change the branch landing pad prctl()s to be more descriptive
  riscv: ptrace: cfi: expand "SS" references to "shadow stack" in uapi headers
  prctl: rename branch landing pad implementation functions to be more explicit
  riscv: ptrace: expand "LP" references to "branch landing pads" in uapi headers
  riscv: cfi: clear CFI lock status in start_thread()
  riscv: ptrace: cfi: fix "PRACE" typo in uapi header
2026-04-10 17:27:08 -07:00
Alexei Starovoitov
2cb27158ad bpf: poison dead stack slots
As a sanity check poison stack slots that stack liveness determined
to be dead, so that any read from such slots will cause program rejection.
If stack liveness logic is incorrect the poison can cause
valid program to be rejected, but it also will prevent unsafe program
to be accepted.

Allow global subprogs "read" poisoned stack slots.
The static stack liveness determined that subprog doesn't read certain
stack slots, but sizeof(arg_type) based global subprog validation
isn't accurate enough to know which slots will actually be read by
the callee, so it needs to check full sizeof(arg_type) at the caller.

Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20260410-patch-set-v4-14-5d4eecb343db@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-04-10 15:13:38 -07:00
Eduard Zingerman
2c167d9177 bpf: change logging scheme for live stack analysis
Instead of breadcrumbs like:

  (d2,cs15) frame 0 insn 18 +live -16
  (d2,cs15) frame 0 insn 17 +live -16

Print final accumulated stack use/def data per-func_instance
per-instruction. printed func_instance's are ordered by callsite and
depth. For example:

  stack use/def subprog#0 shared_instance_must_write_overwrite (d0,cs0):
    0: (b7) r1 = 1
    1: (7b) *(u64 *)(r10 -8) = r1        ; def: fp0-8
    2: (7b) *(u64 *)(r10 -16) = r1       ; def: fp0-16
    3: (bf) r1 = r10
    4: (07) r1 += -8
    5: (bf) r2 = r10
    6: (07) r2 += -16
    7: (85) call pc+7                    ; use: fp0-8 fp0-16
    8: (bf) r1 = r10
    9: (07) r1 += -16
   10: (bf) r2 = r10
   11: (07) r2 += -8
   12: (85) call pc+2                    ; use: fp0-8 fp0-16
   13: (b7) r0 = 0
   14: (95) exit
  stack use/def subprog#1 forwarding_rw (d1,cs7):
   15: (85) call pc+1                    ; use: fp0-8 fp0-16
   16: (95) exit
  stack use/def subprog#1 forwarding_rw (d1,cs12):
   15: (85) call pc+1                    ; use: fp0-8 fp0-16
   16: (95) exit
  stack use/def subprog#2 write_first_read_second (d2,cs15):
   17: (7a) *(u64 *)(r1 +0) = 42
   18: (79) r0 = *(u64 *)(r2 +0)         ; use: fp0-8 fp0-16
   19: (95) exit

For groups of three or more consecutive stack slots, abbreviate as
follows:

   25: (85) call bpf_loop#181            ; use: fp2-8..-512 fp1-8..-512 fp0-8..-512

Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20260410-patch-set-v4-10-5d4eecb343db@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-04-10 15:13:37 -07:00
Eduard Zingerman
6762e3a0bc bpf: simplify liveness to use (callsite, depth) keyed func_instances
Rework func_instance identification and remove the dynamic liveness
API, completing the transition to fully static stack liveness analysis.

Replace callchain-based func_instance keys with (callsite, depth)
pairs. The full callchain (all ancestor callsites) is no longer part
of the hash key; only the immediate callsite and the call depth
matter. This does not lose precision in practice and simplifies the
data structure significantly: struct callchain is removed entirely,
func_instance stores just callsite, depth.

Drop must_write_acc propagation. Previously, must_write marks were
accumulated across successors and propagated to the caller via
propagate_to_outer_instance(). Instead, callee entry liveness
(live_before at subprog start) is pulled directly back to the
caller's callsite in analyze_subprog() after each callee returns.

Since (callsite, depth) instances are shared across different call
chains that invoke the same subprog at the same depth, must_write
marks from one call may be stale for another. To handle this,
analyze_subprog() records into a fresh_instance() when the instance
was already visited (must_write_initialized), then merge_instances()
combines the results: may_read is unioned, must_write is intersected.
This ensures only slots written on ALL paths through all call sites
are marked as guaranteed writes.
This replaces commit_stack_write_marks() logic.

Skip recursive descent into callees that receive no FP-derived
arguments (has_fp_args() check). This is needed because global
subprogram calls can push depth beyond MAX_CALL_FRAMES (max depth
is 64 for global calls but only 8 frames are accommodated for FP
passing). It also handles the case where a callback subprog cannot be
determined by argument tracking: such callbacks will be processed by
analyze_subprog() at depth 0 independently.

Update lookup_instance() (used by is_live_before queries) to search
for the func_instance with maximal depth at the corresponding
callsite, walking depth downward from frameno to 0. This accounts for
the fact that instance depth no longer corresponds 1:1 to
bpf_verifier_state->curframe, since skipped non-FP calls create gaps.

Remove the dynamic public liveness API from verifier.c:
  - bpf_mark_stack_{read,write}(), bpf_reset/commit_stack_write_marks()
  - bpf_update_live_stack(), bpf_reset_live_stack_callchain()
  - All call sites in check_stack_{read,write}_fixed_off(),
    check_stack_range_initialized(), mark_stack_slot_obj_read(),
    mark/unmark_stack_slots_{dynptr,iter,irq_flag}()
  - The per-instruction write mark accumulation in do_check()
  - The bpf_update_live_stack() call in prepare_func_exit()

mark_stack_read() and mark_stack_write() become static functions in
liveness.c, called only from the static analysis pass. The
func_instance->updated and must_write_dropped flags are removed.
Remove spis_single_slot(), spis_one_bit() helpers from bpf_verifier.h
as they are no longer used.

Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Tested-by: Paul Chaignon <paul.chaignon@gmail.com>
Link: https://lore.kernel.org/r/20260410-patch-set-v4-9-5d4eecb343db@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-04-10 15:13:20 -07:00
Eduard Zingerman
fed53dbcdb bpf: record arg tracking results in bpf_liveness masks
After arg tracking reaches a fixed point, perform a single linear scan
over the converged at_in[] state and translate each memory access into
liveness read/write masks on the func_instance:

- Load/store instructions: FP-derived pointer's frame and offset(s)
  are converted to half-slot masks targeting
  per_frame_masks->{may_read,must_write}

- Helper/kfunc calls: record_call_access() queries
  bpf_helper_stack_access_bytes() / bpf_kfunc_stack_access_bytes()
  for each FP-derived argument to determine access size and direction.
  Unknown access size (S64_MIN) conservatively marks all slots from
  fp_off to fp+0 as read.

- Imprecise pointers (frame == ARG_IMPRECISE): conservatively mark
  all slots in every frame covered by the pointer's frame bitmask
  as fully read.

- Static subprog calls with unresolved arguments: conservatively mark
  all frames as fully read.

Instead of a call to clean_live_states(), start cleaning the current
state continuously as registers and stack become dead since the static
analysis provides complete liveness information. This makes
clean_live_states() and bpf_verifier_state->cleaned unnecessary.

Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20260410-patch-set-v4-8-5d4eecb343db@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-04-10 15:06:14 -07:00
Eduard Zingerman
bf0c571f7f bpf: introduce forward arg-tracking dataflow analysis
The analysis is a basis for static liveness tracking mechanism
introduced by the next two commits.

A forward fixed-point analysis that tracks which frame's FP each
register value is derived from, and at what byte offset. This is
needed because a callee can receive a pointer to its caller's stack
frame (e.g. r1 = fp-16 at the call site), then do *(u64 *)(r1 + 0)
inside the callee — a cross-frame stack access that the callee's local
liveness must attribute to the caller's stack.

Each register holds an arg_track value from a three-level lattice:
- Precise {frame=N, off=[o1,o2,...]} — known frame index and
  up to 4 concrete byte offsets
- Offset-imprecise {frame=N, off_cnt=0} — known frame, unknown offset
- Fully-imprecise {frame=ARG_IMPRECISE, mask=bitmask} — unknown frame,
   mask says which frames might be involved

At CFG merge points the lattice moves toward imprecision (same
frame+offset stays precise, same frame different offsets merges offset
sets or becomes offset-imprecise, different frames become
fully-imprecise with OR'd bitmask).

The analysis also tracks spills/fills to the callee's own stack
(at_stack_in/out), so FP derived values spilled and reloaded.

This pass is run recursively per call site: when subprog A calls B
with specific FP-derived arguments, B is re-analyzed with those entry
args. The recursion follows analyze_subprog -> compute_subprog_args ->
(for each call insn) -> analyze_subprog. Subprogs that receive no
FP-derived args are skipped during recursion and analyzed
independently at depth 0.

Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20260410-patch-set-v4-7-5d4eecb343db@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-04-10 15:05:05 -07:00
Eduard Zingerman
8d3219f64d bpf: prepare liveness internal API for static analysis pass
Move the `updated` check and reset from bpf_update_live_stack() into
update_instance() itself, so callers outside the main loop can reuse
it. Similarly, move write_insn_idx assignment out of
reset_stack_write_marks() into its public caller, and thread insn_idx
as a parameter to commit_stack_write_marks() instead of reading it
from liveness->write_insn_idx. Drop the unused `env` parameter from
alloc_frame_masks() and mark_stack_read().

Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20260410-patch-set-v4-6-5d4eecb343db@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-04-10 15:05:03 -07:00
Eduard Zingerman
be23266b4a bpf: 4-byte precise clean_verifier_state
Migrate clean_verifier_state() and its liveness queries from 8-byte
SPI granularity to 4-byte half-slot granularity.

In __clean_func_state(), each SPI is cleaned in two independent
halves:
  - half_spi 2*i   (lo): slot_type[0..3]
  - half_spi 2*i+1 (hi): slot_type[4..7]

Slot types STACK_DYNPTR, STACK_ITER and STACK_IRQ_FLAG are never
cleaned, as their slot type markers are required by
destroy_if_dynptr_stack_slot(), is_iter_reg_valid_uninit() and
is_irq_flag_reg_valid_uninit() for correctness.

When only the hi half is dead, spilled_ptr metadata is destroyed and
the lo half's STACK_SPILL bytes are downgraded to STACK_MISC or
STACK_ZERO. When only the lo half is dead, spilled_ptr is preserved
because the hi half may still need it for state comparison.

Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20260410-patch-set-v4-5-5d4eecb343db@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-04-10 15:04:59 -07:00
Eduard Zingerman
7ca5f68cda bpf: make liveness.c track stack with 4-byte granularity
Convert liveness bitmask type from u64 to spis_t, doubling the number
of trackable stack slots from 64 to 128 to support 4-byte granularity.

Each 8-byte SPI now maps to two consecutive 4-byte sub-slots in the
bitmask: spi*2 half and spi*2+1 half. In verifier.c,
check_stack_write_fixed_off() now reports 4-byte aligned writes of
4-byte writes as half-slot marks and 8-byte aligned 8-byte writes as
two slots. Similar logic applied in check_stack_read_fixed_off().

Queries (is_live_before) are not yet migrated to half-slot
granularity.

Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20260410-patch-set-v4-4-5d4eecb343db@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-04-10 15:04:56 -07:00
Eduard Zingerman
cf3ee1ecf3 bpf: save subprogram name in bpf_subprog_info
Subprogram name can be computed from function info and BTF, but it is
convenient to have the name readily available for logging purposes.
Update comment saying that bpf_subprog_info->start has to be the first
field, this is no longer true, relevant sites access .start field
by it's name.

Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20260410-patch-set-v4-2-5d4eecb343db@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-04-10 15:01:56 -07:00
Eduard Zingerman
33dfc521c2 bpf: share several utility functions as internal API
Namely:
- bpf_subprog_is_global
- bpf_vlog_alignment

Acked-by: Mykyta Yatsenko <yatsenko@meta.com>
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20260410-patch-set-v4-1-5d4eecb343db@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-04-10 15:01:55 -07:00
Thomas Gleixner
d6e152d905 clockevents: Prevent timer interrupt starvation
Calvin reported an odd NMI watchdog lockup which claims that the CPU locked
up in user space. He provided a reproducer, which sets up a timerfd based
timer and then rearms it in a loop with an absolute expiry time of 1ns.

As the expiry time is in the past, the timer ends up as the first expiring
timer in the per CPU hrtimer base and the clockevent device is programmed
with the minimum delta value. If the machine is fast enough, this ends up
in a endless loop of programming the delta value to the minimum value
defined by the clock event device, before the timer interrupt can fire,
which starves the interrupt and consequently triggers the lockup detector
because the hrtimer callback of the lockup mechanism is never invoked.

As a first step to prevent this, avoid reprogramming the clock event device
when:
     - a forced minimum delta event is pending
     - the new expiry delta is less then or equal to the minimum delta

Thanks to Calvin for providing the reproducer and to Borislav for testing
and providing data from his Zen5 machine.

The problem is not limited to Zen5, but depending on the underlying
clock event device (e.g. TSC deadline timer on Intel) and the CPU speed
not necessarily observable.

This change serves only as the last resort and further changes will be made
to prevent this scenario earlier in the call chain as far as possible.

[ tglx: Updated to restore the old behaviour vs. !force and delta <= 0 and
  	fixed up the tick-broadcast handlers as pointed out by Borislav ]

Fixes: d316c57ff6 ("[PATCH] clockevents: add core functionality")
Reported-by: Calvin Owens <calvin@wbinvd.org>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Tested-by: Calvin Owens <calvin@wbinvd.org>
Tested-by: Borislav Petkov <bp@alien8.de>
Link: https://lore.kernel.org/lkml/acMe-QZUel-bBYUh@mozart.vkv.me/
Link: https://patch.msgid.link/20260407083247.562657657@kernel.org
2026-04-10 22:45:38 +02:00
Sechang Lim
4406942e65 bpf: Fix RCU stall in bpf_fd_array_map_clear()
Add a missing cond_resched() in bpf_fd_array_map_clear() loop.

For PROG_ARRAY maps with many entries this loop calls
prog_array_map_poke_run() per entry which can be expensive, and
without yielding this can cause RCU stalls under load:

  rcu: Stack dump where RCU GP kthread last ran:
  CPU: 0 UID: 0 PID: 30932 Comm: kworker/0:2 Not tainted 6.14.0-13195-g967e8def1100 #2 PREEMPT(undef)
  Workqueue: events prog_array_map_clear_deferred
  RIP: 0010:write_comp_data+0x38/0x90 kernel/kcov.c:246
  Call Trace:
   <TASK>
   prog_array_map_poke_run+0x77/0x380 kernel/bpf/arraymap.c:1096
   __fd_array_map_delete_elem+0x197/0x310 kernel/bpf/arraymap.c:925
   bpf_fd_array_map_clear kernel/bpf/arraymap.c:1000 [inline]
   prog_array_map_clear_deferred+0x119/0x1b0 kernel/bpf/arraymap.c:1141
   process_one_work+0x898/0x19d0 kernel/workqueue.c:3238
   process_scheduled_works kernel/workqueue.c:3319 [inline]
   worker_thread+0x770/0x10b0 kernel/workqueue.c:3400
   kthread+0x465/0x880 kernel/kthread.c:464
   ret_from_fork+0x4d/0x80 arch/x86/kernel/process.c:153
   ret_from_fork_asm+0x19/0x30 arch/x86/entry/entry_64.S:245
   </TASK>

Reviewed-by: Sun Jian <sun.jian.kdev@gmail.com>
Fixes: da765a2f59 ("bpf: Add poke dependency tracking for prog array maps")
Signed-off-by: Sechang Lim <rhkrqnwk98@gmail.com>
Link: https://lore.kernel.org/r/20260407103823.3942156-1-rhkrqnwk98@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-04-10 12:10:06 -07:00
Puranjay Mohan
4cbee026db bpf: return VMA snapshot from task_vma iterator
Holding the per-VMA lock across the BPF program body creates a lock
ordering problem when helpers acquire locks that depend on mmap_lock:

  vm_lock -> i_rwsem -> mmap_lock -> vm_lock

Snapshot the VMA under the per-VMA lock in _next() via memcpy(), then
drop the lock before returning. The BPF program accesses only the
snapshot.

The verifier only trusts vm_mm and vm_file pointers (see
BTF_TYPE_SAFE_TRUSTED_OR_NULL in verifier.c). vm_file is reference-
counted with get_file() under the lock and released via fput() on the
next iteration or in _destroy(). vm_mm is already correct because
lock_vma_under_rcu() verifies vma->vm_mm == mm. All other pointers
are left as-is by memcpy() since the verifier treats them as untrusted.

Fixes: 4ac4546821 ("bpf: Introduce task_vma open-coded iterator kfuncs")
Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Mykyta Yatsenko <yatsenko@meta.com>
Link: https://lore.kernel.org/r/20260408154539.3832150-4-puranjay@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-04-10 12:05:16 -07:00
Puranjay Mohan
bee9ef4a40 bpf: switch task_vma iterator from mmap_lock to per-VMA locks
The open-coded task_vma iterator holds mmap_lock for the entire duration
of iteration, increasing contention on this highly contended lock.

Switch to per-VMA locking. Find the next VMA via an RCU-protected maple
tree walk and lock it with lock_vma_under_rcu(). lock_next_vma() is not
used because its fallback takes mmap_read_lock(), and the iterator must
work in non-sleepable contexts.

lock_vma_under_rcu() is a point lookup (mas_walk) that finds the VMA
containing a given address but cannot iterate across gaps. An
RCU-protected vma_next() walk (mas_find) first locates the next VMA's
vm_start to pass to lock_vma_under_rcu().

Between the RCU walk and the lock, the VMA may be removed, shrunk, or
write-locked. On failure, advance past it using vm_end from the RCU
walk. Because the VMA slab is SLAB_TYPESAFE_BY_RCU, vm_end may be
stale; fall back to PAGE_SIZE advancement when it does not make forward
progress. Concurrent VMA insertions at addresses already passed by the
iterator are not detected.

CONFIG_PER_VMA_LOCK is required; return -EOPNOTSUPP without it.

Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
Link: https://lore.kernel.org/r/20260408154539.3832150-3-puranjay@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-04-10 12:05:16 -07:00
Puranjay Mohan
d8e27d2d22 bpf: fix mm lifecycle in open-coded task_vma iterator
The open-coded task_vma iterator reads task->mm locklessly and acquires
mmap_read_trylock() but never calls mmget(). If the task exits
concurrently, the mm_struct can be freed as it is not
SLAB_TYPESAFE_BY_RCU, resulting in a use-after-free.

Safely read task->mm with a trylock on alloc_lock and acquire an mm
reference. Drop the reference via bpf_iter_mmput_async() in _destroy()
and error paths. bpf_iter_mmput_async() is a local wrapper around
mmput_async() with a fallback to mmput() on !CONFIG_MMU.

Reject irqs-disabled contexts (including NMI) up front. Operations used
by _next() and _destroy() (mmap_read_unlock, bpf_iter_mmput_async)
take spinlocks with IRQs disabled (pool->lock, pi_lock). Running from
NMI or from a tracepoint that fires with those locks held could
deadlock.

A trylock on alloc_lock is used instead of the blocking task_lock()
(get_task_mm) to avoid a deadlock when a softirq BPF program iterates
a task that already holds its alloc_lock on the same CPU.

Fixes: 4ac4546821 ("bpf: Introduce task_vma open-coded iterator kfuncs")
Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
Link: https://lore.kernel.org/r/20260408154539.3832150-2-puranjay@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-04-10 12:05:16 -07:00
Tejun Heo
e719e17d99 sched_ext: Warn on task-based SCX op recursion
The kf_tasks[] design assumes task-based SCX ops don't nest - if they
did, kf_tasks[0] would get clobbered. The old scx_kf_allow() WARN_ONCE
caught invalid nesting via kf_mask, but that machinery is gone now.

Add a WARN_ON_ONCE(current->scx.kf_tasks[0]) at the top of each
SCX_CALL_OP_TASK*() macro. Checking kf_tasks[0] alone is sufficient: all
three variants (SCX_CALL_OP_TASK, SCX_CALL_OP_TASK_RET,
SCX_CALL_OP_2TASKS_RET) write to kf_tasks[0], so a non-NULL value at
entry to any of the three means re-entry from somewhere in the family.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Andrea Righi <arighi@nvidia.com>
2026-04-10 07:54:06 -10:00
Tejun Heo
979a98b6e9 sched_ext: Rename scx_kf_allowed_on_arg_tasks() to scx_kf_arg_task_ok()
The "kf_allowed" framing on this helper comes from the old runtime
scx_kf_allowed() gate, which has been removed. Rename it to describe what it
actually does in the new model.

Pure rename, no functional change.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Andrea Righi <arighi@nvidia.com>
2026-04-10 07:54:06 -10:00
Cheng-Yang Chou
7cd9a5d7d4 sched_ext: Remove runtime kfunc mask enforcement
Now that scx_kfunc_context_filter enforces context-sensitive kfunc
restrictions at BPF load time, the per-task runtime enforcement via
scx_kf_mask is redundant. Remove it entirely:

 - Delete enum scx_kf_mask, the kf_mask field on sched_ext_entity, and
   the scx_kf_allow()/scx_kf_disallow()/scx_kf_allowed() helpers along
   with the higher_bits()/highest_bit() helpers they used.
 - Strip the @mask parameter (and the BUILD_BUG_ON checks) from the
   SCX_CALL_OP[_RET]/SCX_CALL_OP_TASK[_RET]/SCX_CALL_OP_2TASKS_RET
   macros and update every call site. Reflow call sites that were
   wrapped only to fit the old 5-arg form and now collapse onto a single
   line under ~100 cols.
 - Remove the in-kfunc scx_kf_allowed() runtime checks from
   scx_dsq_insert_preamble(), scx_dsq_move(), scx_bpf_dispatch_nr_slots(),
   scx_bpf_dispatch_cancel(), scx_bpf_dsq_move_to_local___v2(),
   scx_bpf_sub_dispatch(), scx_bpf_reenqueue_local(), and the per-call
   guard inside select_cpu_from_kfunc().

scx_bpf_task_cgroup() and scx_kf_allowed_on_arg_tasks() were already
cleaned up in the "drop redundant rq-locked check" patch.
scx_kf_allowed_if_unlocked() was rewritten in the preceding "decouple"
patch. No further changes to those helpers here.

Co-developed-by: Juntong Deng <juntong.deng@outlook.com>
Signed-off-by: Juntong Deng <juntong.deng@outlook.com>
Signed-off-by: Cheng-Yang Chou <yphbchou0911@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-04-10 07:54:06 -10:00
Tejun Heo
d1d3c1c6ae sched_ext: Add verifier-time kfunc context filter
Move enforcement of SCX context-sensitive kfunc restrictions from per-task
runtime kf_mask checks to BPF verifier-time filtering, using the BPF core's
struct_ops context information.

A shared .filter callback is attached to each context-sensitive BTF set
and consults a per-op allow table (scx_kf_allow_flags[]) indexed by SCX
ops member offset. Disallowed calls are now rejected at program load time
instead of at runtime.

The old model split reachability across two places: each SCX_CALL_OP*()
set bits naming its op context, and each kfunc's scx_kf_allowed() check
OR'd together the bits it accepted. A kfunc was callable when those two
masks overlapped. The new model transposes the result to the caller side -
each op's allow flags directly list the kfunc groups it may call. The old
bit assignments were:

  Call-site bits:
    ops.select_cpu   = ENQUEUE | SELECT_CPU
    ops.enqueue      = ENQUEUE
    ops.dispatch     = DISPATCH
    ops.cpu_release  = CPU_RELEASE

  Kfunc-group accepted bits:
    enqueue group     = ENQUEUE | DISPATCH
    select_cpu group  = SELECT_CPU | ENQUEUE
    dispatch group    = DISPATCH
    cpu_release group = CPU_RELEASE

Intersecting them yields the reachability now expressed directly by
scx_kf_allow_flags[]:

  ops.select_cpu  -> SELECT_CPU | ENQUEUE
  ops.enqueue     -> SELECT_CPU | ENQUEUE
  ops.dispatch    -> ENQUEUE | DISPATCH
  ops.cpu_release -> CPU_RELEASE

Unlocked ops carried no kf_mask bits and reached only unlocked kfuncs;
that maps directly to UNLOCKED in the new table.

Equivalence was checked by walking every (op, kfunc-group) combination
across SCX ops, SYSCALL, and non-SCX struct_ops callers against the old
scx_kf_allowed() runtime checks. With two intended exceptions (see below),
all combinations reach the same verdict; disallowed calls are now caught at
load time instead of firing scx_error() at runtime.

scx_bpf_dsq_move_set_slice() and scx_bpf_dsq_move_set_vtime() are
exceptions: they have no runtime check at all, but the new filter rejects
them from ops outside dispatch/unlocked. The affected cases are nonsensical
- the values these setters store are only read by
scx_bpf_dsq_move{,_vtime}(), which is itself restricted to
dispatch/unlocked, so a setter call from anywhere else was already dead
code.

Runtime scx_kf_mask enforcement is left in place by this patch and removed
in a follow-up.

Original-patch-by: Juntong Deng <juntong.deng@outlook.com>
Original-patch-by: Cheng-Yang Chou <yphbchou0911@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-04-10 07:54:06 -10:00
Tejun Heo
2193af26a1 sched_ext: Drop redundant rq-locked check from scx_bpf_task_cgroup()
scx_kf_allowed_on_arg_tasks() runs both an scx_kf_allowed(__SCX_KF_RQ_LOCKED)
mask check and a kf_tasks[] check. After the preceding call-site fixes,
every SCX_CALL_OP_TASK*() invocation has kf_mask & __SCX_KF_RQ_LOCKED
non-zero, so the mask check is redundant whenever the kf_tasks[] check
passes. Drop it and simplify the helper to take only @sch and @p.

Fold the locking guarantee into the SCX_CALL_OP_TASK() comment block, which
scx_bpf_task_cgroup() now points to.

No functional change.

Extracted from a larger verifier-time kfunc context filter patch
originally written by Juntong Deng.

Original-patch-by: Juntong Deng <juntong.deng@outlook.com>
Cc: Cheng-Yang Chou <yphbchou0911@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-04-10 07:54:06 -10:00
Tejun Heo
0022b32850 sched_ext: Decouple kfunc unlocked-context check from kf_mask
scx_kf_allowed_if_unlocked() uses !current->scx.kf_mask as a proxy for "no
SCX-tracked lock held". kf_mask is removed in a follow-up patch, so its two
callers - select_cpu_from_kfunc() and scx_dsq_move() - need another basis.

Add a new bool scx_rq.in_select_cpu, set across the SCX_CALL_OP_TASK_RET
that invokes ops.select_cpu(), to capture the one case where SCX itself
holds no lock but try_to_wake_up() holds @p's pi_lock. Together with
scx_locked_rq(), it expresses the same accepted-context set.

select_cpu_from_kfunc() needs a runtime test because it has to take
different locking paths depending on context. Open-code as a three-way
branch. The unlocked branch takes raw_spin_lock_irqsave(&p->pi_lock)
directly - pi_lock alone is enough for the fields the kfunc reads, and is
lighter than task_rq_lock().

scx_dsq_move() doesn't really need a runtime test - its accepted contexts
could be enforced at verifier load time. But since the runtime state is
already there and using it keeps the upcoming load-time filter simpler, just
write it the same way: (scx_locked_rq() || in_select_cpu) &&
!kf_allowed(DISPATCH).

scx_kf_allowed_if_unlocked() is deleted with the conversions.

No semantic change.

v2: s/No functional change/No semantic change/ - the unlocked path now acquires
    pi_lock instead of the heavier task_rq_lock() (Andrea Righi).

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-04-10 07:54:06 -10:00
Tejun Heo
b470e37c1f sched_ext: Fix ops.cgroup_move() invocation kf_mask and rq tracking
sched_move_task() invokes ops.cgroup_move() inside task_rq_lock(tsk), so
@p's rq lock is held. The SCX_CALL_OP_TASK invocation mislabels this:

  - kf_mask = SCX_KF_UNLOCKED (== 0), claiming no lock is held.
  - rq = NULL, so update_locked_rq() doesn't run and scx_locked_rq()
    returns NULL.

Switch to SCX_KF_REST and pass task_rq(p), matching ops.set_cpumask()
from set_cpus_allowed_scx().

Three effects:

  - scx_bpf_task_cgroup() becomes callable (was rejected by
    scx_kf_allowed(__SCX_KF_RQ_LOCKED)). Safe; rq lock is held.

  - scx_bpf_dsq_move() is now rejected (was allowed via the unlocked
    branch). Calling it while holding an unrelated task's rq lock is
    risky; rejection is correct.

  - scx_bpf_select_cpu_*() previously took the unlocked branch in
    select_cpu_from_kfunc() and called task_rq_lock(p, &rf), which
    would deadlock against the already-held pi_lock. Now it takes the
    locked-rq branch and is rejected with -EPERM via the existing
    kf_allowed(SCX_KF_SELECT_CPU | SCX_KF_ENQUEUE) check. Latent
    deadlock fix.

No in-tree scheduler is known to call any of these from ops.cgroup_move().

v2: Add Fixes: tag (Andrea Righi).

Fixes: 18853ba782 ("sched_ext: Track currently locked rq")
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-04-10 07:54:06 -10:00
Tejun Heo
9fb457074f sched_ext: Track @p's rq lock across set_cpus_allowed_scx -> ops.set_cpumask
The SCX_CALL_OP_TASK call site passes rq=NULL incorrectly, leaving
scx_locked_rq() unset. Pass task_rq(p) instead so update_locked_rq()
reflects reality.

v2: Add Fixes: tag (Andrea Righi).

Fixes: 18853ba782 ("sched_ext: Track currently locked rq")
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-04-10 07:54:06 -10:00
Tejun Heo
a37e134317 sched_ext: Add select_cpu kfuncs to scx_kfunc_ids_unlocked
select_cpu_from_kfunc() has an extra scx_kf_allowed_if_unlocked() branch
that accepts calls from unlocked contexts and takes task_rq_lock() itself
- a "callable from unlocked" property encoded in the kfunc body rather
than in set membership. That's fine while the runtime check is the
authoritative gate, but the upcoming verifier-time filter uses set
membership as the source of truth and needs it to reflect every context
the kfunc may be called from.

Add the three select_cpu kfuncs to scx_kfunc_ids_unlocked so their full
set of callable contexts is captured by set membership. This follows the
existing dual-set convention used by scx_bpf_dsq_move{,_vtime} and
scx_bpf_dsq_move_set_{slice,vtime}, which are members of both
scx_kfunc_ids_dispatch and scx_kfunc_ids_unlocked.

While at it, add brief comments on each duplicate BTF_ID_FLAGS block
(including the pre-existing dsq_move ones) explaining the dual
membership.

No runtime behavior change: the runtime check in select_cpu_from_kfunc()
remains the authoritative gate until it is removed along with the rest
of the scx_kf_mask enforcement in a follow-up.

v2: Clarify dispatch-set comment to name scx_bpf_dsq_move*() explicitly so it
    doesn't appear to cover scx_bpf_sub_dispatch() (Andrea Righi).

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-04-10 07:54:06 -10:00
Tejun Heo
9b5501d3c9 sched_ext: Drop TRACING access to select_cpu kfuncs
The select_cpu kfuncs - scx_bpf_select_cpu_dfl(), scx_bpf_select_cpu_and()
and __scx_bpf_select_cpu_and() - take task_rq_lock() internally. Exposing
them via scx_kfunc_set_idle to BPF_PROG_TYPE_TRACING is unsafe: arbitrary
tracing contexts (kprobes, tracepoints, fentry, LSM) may run with @p's
pi_lock state unknown.

Move them out of scx_kfunc_ids_idle into a new scx_kfunc_ids_select_cpu
set registered only for STRUCT_OPS and SYSCALL.

Extracted from a larger verifier-time kfunc context filter patch
originally written by Juntong Deng.

Original-patch-by: Juntong Deng <juntong.deng@outlook.com>
Cc: Cheng-Yang Chou <yphbchou0911@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-04-10 07:54:06 -10:00
Vincent Guittot
78cde54ea5 sched/eevdf: Clear buddies for preempt_short
next buddy should not prevent shorter slice preemption. Don't take buddy
into account when checking if shorter slice entity can preempt and clear it
if the entity with a shorter slice can preempt current.

Test on snapdragon rb5:
hackbench -T -p -l 16000000 -g 2 1> /dev/null &
hackbench runs in cgroup /test-A
cyclictest -t 1 -i 2777 -D 63 --policy=fair --mlock  -h 20000 -q
cyclictest runs in cgroup /test-B

                     tip/sched/core  tip/sched/core    +this patch
cyclictest slice  (ms) (default)2.8             8               8
hackbench slice   (ms) (default)2.8            20              20
Total Samples          |    22679           22595           22686
Average           (us) |       84              94(-12%)        59( 37%)
Median (P50)      (us) |       56              56(  0%)        56(  0%)
90th Percentile   (us) |       64              65(- 2%)        63(  3%)
99th Percentile   (us) |     1047            1273(-22%)        74( 94%)
99.9th Percentile (us) |     2431            4751(-95%)       663( 86%)
Maximum           (us) |     4694            8655(-84%)      3934( 55%)

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260410132321.2897789-1-vincent.guittot@linaro.org
2026-04-10 16:31:50 +02:00
Catalin Marinas
480a9e57cc Merge branches 'for-next/misc', 'for-next/tlbflush', 'for-next/ttbr-macros-cleanup', 'for-next/kselftest', 'for-next/feat_lsui', 'for-next/mpam', 'for-next/hotplug-batched-tlbi', 'for-next/bbml2-fixes', 'for-next/sysreg', 'for-next/generic-entry' and 'for-next/acpi', remote-tracking branches 'arm64/for-next/perf' and 'arm64/for-next/read-once' into for-next/core
* arm64/for-next/perf:
  : Perf updates
  perf/arm-cmn: Fix resource_size_t printk specifier in arm_cmn_init_dtc()
  perf/arm-cmn: Fix incorrect error check for devm_ioremap()
  perf: add NVIDIA Tegra410 C2C PMU
  perf: add NVIDIA Tegra410 CPU Memory Latency PMU
  perf/arm_cspmu: nvidia: Add Tegra410 PCIE-TGT PMU
  perf/arm_cspmu: nvidia: Add Tegra410 PCIE PMU
  perf/arm_cspmu: Add arm_cspmu_acpi_dev_get
  perf/arm_cspmu: nvidia: Add Tegra410 UCF PMU
  perf/arm_cspmu: nvidia: Rename doc to Tegra241
  perf/arm-cmn: Stop claiming entire iomem region
  arm64: cpufeature: Use pmuv3_implemented() function
  arm64: cpufeature: Make PMUVer and PerfMon unsigned
  KVM: arm64: Read PMUVer as unsigned

* arm64/for-next/read-once:
  : Fixes for __READ_ONCE() with CONFIG_LTO=y
  arm64, compiler-context-analysis: Permit alias analysis through __READ_ONCE() with CONFIG_LTO=y
  arm64: Optimize __READ_ONCE() with CONFIG_LTO=y

* for-next/misc:
  : Miscellaneous cleanups/fixes
  arm64: rsi: use linear-map alias for realm config buffer
  arm64: Kconfig: fix duplicate word in CMDLINE help text
  arm64: mte: Skip TFSR_EL1 checks and barriers in synchronous tag check mode
  arm64/hwcap: Generate the KERNEL_HWCAP_ definitions for the hwcaps
  arm64: kexec: Remove duplicate allocation for trans_pgd
  arm64: mm: Use generic enum pgtable_level
  arm64: scs: Remove redundant save/restore of SCS SP on entry to/from EL0
  arm64: remove ARCH_INLINE_*

* for-next/tlbflush:
  : Refactor the arm64 TLB invalidation API and implementation
  arm64: mm: __ptep_set_access_flags must hint correct TTL
  arm64: mm: Provide level hint for flush_tlb_page()
  arm64: mm: Wrap flush_tlb_page() around __do_flush_tlb_range()
  arm64: mm: More flags for __flush_tlb_range()
  arm64: mm: Refactor __flush_tlb_range() to take flags
  arm64: mm: Refactor flush_tlb_page() to use __tlbi_level_asid()
  arm64: mm: Simplify __flush_tlb_range_limit_excess()
  arm64: mm: Simplify __TLBI_RANGE_NUM() macro
  arm64: mm: Re-implement the __flush_tlb_range_op macro in C
  arm64: mm: Inline __TLBI_VADDR_RANGE() into __tlbi_range()
  arm64: mm: Push __TLBI_VADDR() into __tlbi_level()
  arm64: mm: Implicitly invalidate user ASID based on TLBI operation
  arm64: mm: Introduce a C wrapper for by-range TLB invalidation
  arm64: mm: Re-implement the __tlbi_level macro as a C function

* for-next/ttbr-macros-cleanup:
  : Cleanups of the TTBR1_* macros
  arm64/mm: Directly use TTBRx_EL1_CnP
  arm64/mm: Directly use TTBRx_EL1_ASID_MASK
  arm64/mm: Describe TTBR1_BADDR_4852_OFFSET

* for-next/kselftest:
  : arm64 kselftest updates
  selftests/arm64: Implement cmpbr_sigill() to hwcap test

* for-next/feat_lsui:
  : Futex support using FEAT_LSUI instructions to avoid toggling PAN
  arm64: armv8_deprecated: Disable swp emulation when FEAT_LSUI present
  arm64: Kconfig: Add support for LSUI
  KVM: arm64: Use CAST instruction for swapping guest descriptor
  arm64: futex: Support futex with FEAT_LSUI
  arm64: futex: Refactor futex atomic operation
  KVM: arm64: kselftest: set_id_regs: Add test for FEAT_LSUI
  KVM: arm64: Expose FEAT_LSUI to guests
  arm64: cpufeature: Add FEAT_LSUI

* for-next/mpam: (40 commits)
  : Expose MPAM to user-space via resctrl:
  :  - Add architecture context-switch and hiding of the feature from KVM.
  :  - Add interface to allow MPAM to be exposed to user-space using resctrl.
  :  - Add errata workaoround for some existing platforms.
  :  - Add documentation for using MPAM and what shape of platforms can use resctrl
  arm64: mpam: Add initial MPAM documentation
  arm_mpam: Quirk CMN-650's CSU NRDY behaviour
  arm_mpam: Add workaround for T241-MPAM-6
  arm_mpam: Add workaround for T241-MPAM-4
  arm_mpam: Add workaround for T241-MPAM-1
  arm_mpam: Add quirk framework
  arm_mpam: resctrl: Call resctrl_init() on platforms that can support resctrl
  arm64: mpam: Select ARCH_HAS_CPU_RESCTRL
  arm_mpam: resctrl: Add empty definitions for assorted resctrl functions
  arm_mpam: resctrl: Update the rmid reallocation limit
  arm_mpam: resctrl: Add resctrl_arch_rmid_read()
  arm_mpam: resctrl: Allow resctrl to allocate monitors
  arm_mpam: resctrl: Add support for csu counters
  arm_mpam: resctrl: Add monitor initialisation and domain boilerplate
  arm_mpam: resctrl: Add kunit test for control format conversions
  arm_mpam: resctrl: Add support for 'MB' resource
  arm_mpam: resctrl: Wait for cacheinfo to be ready
  arm_mpam: resctrl: Add rmid index helpers
  arm_mpam: resctrl: Convert to/from MPAMs fixed-point formats
  arm_mpam: resctrl: Hide CDP emulation behind CONFIG_EXPERT
  ...

* for-next/hotplug-batched-tlbi:
  : arm64/mm: Enable batched TLB flush in unmap_hotplug_range()
  arm64/mm: Reject memory removal that splits a kernel leaf mapping
  arm64/mm: Enable batched TLB flush in unmap_hotplug_range()

* for-next/bbml2-fixes:
  : Fixes for realm guest and BBML2_NOABORT
  arm64: mm: Remove pmd_sect() and pud_sect()
  arm64: mm: Handle invalid large leaf mappings correctly
  arm64: mm: Fix rodata=full block mapping support for realm guests

* for-next/sysreg:
  : arm64 sysreg updates
  arm64/sysreg: Update ID_AA64SMFR0_EL1 description to DDI0601 2025-12
  arm64/sysreg: Update ID_AA64ZFR0_EL1 description to DDI0601 2025-12
  arm64/sysreg: Update ID_AA64FPFR0_EL1 description to DDI0601 2025-12
  arm64/sysreg: Update ID_AA64ISAR2_EL1 description to DDI0601 2025-12
  arm64/sysreg: Update ID_AA64ISAR0_EL1 description to DDI0601 2025-12
  arm64/sysreg: Update SMIDR_EL1 to DDI0601 2025-06

* for-next/generic-entry:
  : More arm64 refactoring towards using the generic entry code
  arm64: Check DAIF (and PMR) at task-switch time
  arm64: entry: Use split preemption logic
  arm64: entry: Use irqentry_{enter_from,exit_to}_kernel_mode()
  arm64: entry: Consistently prefix arm64-specific wrappers
  arm64: entry: Don't preempt with SError or Debug masked
  entry: Split preemption from irqentry_exit_to_kernel_mode()
  entry: Split kernel mode logic from irqentry_{enter,exit}()
  entry: Move irqentry_enter() prototype later
  entry: Remove local_irq_{enable,disable}_exit_to_user()
  entry: Fix stale comment for irqentry_enter()

* for-next/acpi:
  : arm64 ACPI updates
  ACPI: AGDI: fix missing newline in error message
2026-04-10 14:22:24 +01:00
Rafael J. Wysocki
7431d90cfc Merge branches 'pm-cpuidle', 'pm-opp' and 'pm-sleep'
Merge cpuidle updates, OPP (operating performance points) library
updates, and updates related to system suspend and hibernation for
7.1-rc1:

 - Refine stopped tick handling in the menu cpuidle governor and
   rearrange stopped tick handling in the teo cpuidle governor (Rafael
   Wysocki)

 - Add Panther Lake C-states table to the intel_idle driver (Artem
   Bityutskiy)

 - Clean up dead dependencies on CPU_IDLE in Kconfig (Julian Braha)

 - Simplify cpuidle_register_device() with guard() (Huisong Li)

 - Use performance level if available to distinguish between rates in
   OPP debugfs (Manivannan Sadhasivam)

 - Fix scoped_guard in dev_pm_opp_xlate_required_opp() (Viresh Kumar)

 - Return -ENODATA if the snapshot image is not loaded (Alberto Garcia)

 - Remove inclusion of crypto/hash.h from hibernate_64.c on x86 (Eric
   Biggers)

* pm-cpuidle:
  cpuidle: Simplify cpuidle_register_device() with guard()
  cpuidle: clean up dead dependencies on CPU_IDLE in Kconfig
  intel_idle: Add Panther Lake C-states table
  cpuidle: governors: teo: Rearrange stopped tick handling
  cpuidle: governors: menu: Refine stopped tick handling

* pm-opp:
  OPP: Move break out of scoped_guard in dev_pm_opp_xlate_required_opp()
  OPP: debugfs: Use performance level if available to distinguish between rates

* pm-sleep:
  PM: hibernate: return -ENODATA if the snapshot image is not loaded
  PM: hibernate: x86: Remove inclusion of crypto/hash.h
2026-04-10 12:37:27 +02:00
Rafael J. Wysocki
83e990310d Merge branch 'pm-cpufreq'
Merge cpufreq updates for 7.1-rc1:

 - Update qcom-hw DT bindings to include Eliza hardware (Abel Vesa)

 - Update cpufreq-dt-platdev blocklist (Faruque Ansari)

 - Minor updates to driver and dt-bindings for Tegra (Thierry Reding,
   Rosen Penev)

 - Add MAINTAINERS entry for CPPC driver (Viresh Kumar)

 - Add support for new features: CPPC performance priority, Dynamic EPP,
   Raw EPP, and new unit tests for them to amd-pstate (Gautham Shenoy,
   Mario Limonciello)

 - Fix sysfs files being present when HW missing and broken/outdated
   documentation in the amd-pstate driver (Ninad Naik, Gautham Shenoy)

 - Pass the policy to cpufreq_driver->adjust_perf() to avoid using
   cpufreq_cpu_get() in the .adjust_perf() callback in amd-pstate which
   leads to a scheduling-while-atomic bug (K Prateek Nayak)

 - Clean up dead code in Kconfig for cpufreq (Julian Braha)

 - Remove max_freq_req update for pre-existing cpufreq policy and add a
   boost_freq_req QoS request to save the boost constraint instead of
   overwriting the last scaling_max_freq constraint (Pierre Gondois)

 - Embed cpufreq QoS freq_req objects in cpufreq policy so they all
   are allocated in one go along with the policy to simplify lifetime
   rules and avoid error handling issues (Viresh Kumar)

 - Use DMI max speed when CPPC is unavailable in the acpi-cpufreq
   scaling driver (Henry Tseng)

 - Switch policy_is_shared() in cpufreq to using cpumask_nth() instead
   of cpumask_weight() because the former is more efficient (Yury Norov)

 - Use sysfs_emit() in sysfs show functions for cpufreq governor
   attributes (Thorsten Blum)

 - Update intel_pstate to stop returning an error when "off" is written
   to its status sysfs attribute while the driver is already off (Fabio
   De Francesco)

 - Include current frequency in the debug message printed by
   __cpufreq_driver_target() (Pengjie Zhang)

* pm-cpufreq: (38 commits)
  cpufreq/amd-pstate: Add POWER_SUPPLY select for dynamic EPP
  MAINTAINERS: amd-pstate: Step down as maintainer, add Prateek as reviewer
  cpufreq: Pass the policy to cpufreq_driver->adjust_perf()
  cpufreq/amd-pstate: Pass the policy to amd_pstate_update()
  cpufreq/amd-pstate-ut: Add a unit test for raw EPP
  cpufreq/amd-pstate: Add support for raw EPP writes
  cpufreq/amd-pstate: Add support for platform profile class
  cpufreq/amd-pstate: add kernel command line to override dynamic epp
  cpufreq/amd-pstate: Add dynamic energy performance preference
  Documentation: amd-pstate: fix dead links in the reference section
  cpufreq/amd-pstate: Cache the max frequency in cpudata
  Documentation/amd-pstate: Add documentation for amd_pstate_floor_{freq,count}
  Documentation/amd-pstate: List amd_pstate_prefcore_ranking sysfs file
  Documentation/amd-pstate: List amd_pstate_hw_prefcore sysfs file
  amd-pstate-ut: Add a testcase to validate the visibility of driver attributes
  amd-pstate-ut: Add module parameter to select testcases
  amd-pstate: Introduce a tracepoint trace_amd_pstate_cppc_req2()
  amd-pstate: Add sysfs support for floor_freq and floor_count
  amd-pstate: Add support for CPPC_REQ2 and FLOOR_PERF
  x86/cpufeatures: Add AMD CPPC Performance Priority feature.
  ...
2026-04-10 12:05:32 +02:00
cuitao
3348e1e83a cgroup/rdma: fix swapped arguments in pr_warn() format string
The format string says "device %p ... rdma cgroup %p" but the arguments
were passed as (cg, device), printing them in the wrong order.

Signed-off-by: cuitao <cuitao@kylinos.cn>
Signed-off-by: Tejun Heo <tj@kernel.org>
2026-04-09 22:30:08 -10:00
Jiayuan Chen
a0c584fc18 bpf: Fix use-after-free in offloaded map/prog info fill
When querying info for an offloaded BPF map or program,
bpf_map_offload_info_fill_ns() and bpf_prog_offload_info_fill_ns()
obtain the network namespace with get_net(dev_net(offmap->netdev)).
However, the associated netdev's netns may be racing with teardown
during netns destruction. If the netns refcount has already reached 0,
get_net() performs a refcount_t increment on 0, triggering:

  refcount_t: addition on 0; use-after-free.

Although rtnl_lock and bpf_devs_lock ensure the netdev pointer remains
valid, they cannot prevent the netns refcount from reaching zero.

Fix this by using maybe_get_net() instead of get_net(). maybe_get_net()
uses refcount_inc_not_zero() and returns NULL if the refcount is already
zero, which causes ns_get_path_cb() to fail and the caller to return
-ENOENT -- the correct behavior when the netns is being destroyed.

Fixes: 675fc275a3 ("bpf: offload: report device information for offloaded programs")
Fixes: 52775b33bb ("bpf: offload: report device information about offloaded maps")
Reported-by: Yinhao Hu <dddddd@hust.edu.cn>
Reported-by: Kaiyan Mei <M202472210@hust.edu.cn>
Reviewed-by: Dongliang Mu <dzm91@hust.edu.cn>
Closes: https://lore.kernel.org/bpf/f0aa3678-79c9-47ae-9e8c-02a3d1df160a@hust.edu.cn/
Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/r/20260409023733.168050-1-jiayuan.chen@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-04-09 13:24:32 -07:00
Daniel Borkmann
9f118095dd bpf: Drop pkt_end markers on arithmetic to prevent is_pkt_ptr_branch_taken
When a pkt pointer acquires AT_PKT_END or BEYOND_PKT_END range from
a comparison, and then, known-constant arithmetic is performed,
adjust_ptr_min_max_vals() copies the stale range via dst_reg->raw =
ptr_reg->raw without clearing the negative reg->range sentinel values.

This lets is_pkt_ptr_branch_taken() choose one branch direction and
skip going through the other. Fix this by clearing negative pkt range
values (that is, AT_PKT_END and BEYOND_PKT_END) after arithmetic on
pkt pointers. This ensures is_pkt_ptr_branch_taken() returns unknown
and both branches are properly verified.

Fixes: 6d94e741a8 ("bpf: Support for pointers beyond pkt_end.")
Reported-by: STAR Labs SG <info@starlabs.sg>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/r/20260409155016.536608-1-daniel@iogearbox.net
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-04-09 13:11:31 -07:00
Linus Torvalds
3ffcd57823 last dma-mapping fix for Linux 7.0
A fix for DMA-mapping subsystem, which hides annoying, false-positive
 warnings from DMA-API debug on coherent platforms like x86_64 (Mikhail
 Gavrilov).
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQSrngzkoBtlA8uaaJ+Jp1EFxbsSRAUCadfSDQAKCRCJp1EFxbsS
 RPtKAQCzfZRx2zC6ACG5opRZqqsUAah5lko9ROJZ1CK2yrfvAwEA7tGeQ8YgSUxT
 Xu5BHh7Ff/jk0ph001/6j7OSBZtuKAY=
 =xdCr
 -----END PGP SIGNATURE-----

Merge tag 'dma-mapping-7.0-2026-04-09' of git://git.kernel.org/pub/scm/linux/kernel/git/mszyprowski/linux

Pull dma-mapping fix from Marek Szyprowski:
 "A fix for DMA-mapping subsystem, which hides annoying, false-positive
  warnings from DMA-API debug on coherent platforms like x86_64 (Mikhail
  Gavrilov)"

* tag 'dma-mapping-7.0-2026-04-09' of git://git.kernel.org/pub/scm/linux/kernel/git/mszyprowski/linux:
  dma-debug: suppress cacheline overlap warning when arch has no DMA alignment requirement
2026-04-09 11:02:35 -07:00
Daniel Borkmann
9dba0ae973 bpf: Remove static qualifier from local subprog pointer
The local subprog pointer in create_jt() and visit_abnormal_return_insn()
was declared static.

It is unconditionally assigned via bpf_find_containing_subprog() before
every use. Thus, the static qualifier serves no purpose and rather creates
confusion. Just remove it.

Fixes: e40f5a6bf8 ("bpf: correct stack liveness for tail calls")
Fixes: 493d9e0d60 ("bpf, x86: add support for indirect jumps")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Anton Protopopov <a.s.protopopov@gmail.com>
Link: https://lore.kernel.org/r/20260408191242.526279-3-daniel@iogearbox.net
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-04-08 18:43:28 -07:00
Daniel Borkmann
ee861486e3 bpf: Fix ld_{abs,ind} failure path analysis in subprogs
Usage of ld_{abs,ind} instructions got extended into subprogs some time
ago via commit 09b28d76ea ("bpf: Add abnormal return checks."). These
are only allowed in subprograms when the latter are BTF annotated and
have scalar return types.

The code generator in bpf_gen_ld_abs() has an abnormal exit path (r0=0 +
exit) from legacy cBPF times. While the enforcement is on scalar return
types, the verifier must also simulate the path of abnormal exit if the
packet data load via ld_{abs,ind} failed.

This is currently not the case. Fix it by having the verifier simulate
both success and failure paths, and extend it in similar ways as we do
for tail calls. The success path (r0=unknown, continue to next insn) is
pushed onto stack for later validation and the r0=0 and return to the
caller is done on the fall-through side.

Fixes: 09b28d76ea ("bpf: Add abnormal return checks.")
Reported-by: STAR Labs SG <info@starlabs.sg>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/r/20260408191242.526279-2-daniel@iogearbox.net
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-04-08 18:43:28 -07:00
Daniel Borkmann
6bd96e40f3 bpf: Propagate error from visit_tailcall_insn
Commit e40f5a6bf8 ("bpf: correct stack liveness for tail calls") added
visit_tailcall_insn() but did not check its return value.

Fixes: e40f5a6bf8 ("bpf: correct stack liveness for tail calls")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/r/20260408191242.526279-1-daniel@iogearbox.net
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-04-08 18:43:28 -07:00
Kumar Kartikeya Dwivedi
4f64d5b664 bpf: Make find_linfo widely available
Move find_linfo() as bpf_find_linfo() into core.c to allow for its use
in the verifier in subsequent patches.

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Acked-by: Mykyta Yatsenko <yatsenko@meta.com>
Link: https://lore.kernel.org/r/20260408021359.3786905-4-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-04-08 18:09:56 -07:00
Kumar Kartikeya Dwivedi
fbb98834a9 bpf: Extract bpf_get_linfo_file_line
Extract bpf_get_linfo_file_line as its own function so that the logic to
obtain the file, line, and line number for a given program can be shared
in subsequent patches.

Reviewed-by: Puranjay Mohan <puranjay@kernel.org>
Acked-by: Mykyta Yatsenko <yatsenko@meta.com>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20260408021359.3786905-3-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-04-08 18:09:56 -07:00
Marc Zyngier
2de32a25a3 Merge branch kvm-arm64/hyp-tracing into kvmarm-master/next
* kvm-arm64/hyp-tracing: (40 commits)
  : .
  : EL2 tracing support, adding both 'remote' ring-buffer
  : infrastructure and the tracing itself, courtesy of
  : Vincent Donnefort. From the cover letter:
  :
  : "The growing set of features supported by the hypervisor in protected
  : mode necessitates debugging and profiling tools. Tracefs is the
  : ideal candidate for this task:
  :
  :   * It is simple to use and to script.
  :
  :   * It is supported by various tools, from the trace-cmd CLI to the
  :     Android web-based perfetto.
  :
  :   * The ring-buffer, where are stored trace events consists of linked
  :     pages, making it an ideal structure for sharing between kernel and
  :     hypervisor.
  :
  : This series first introduces a new generic way of creating remote events and
  : remote buffers. Then it adds support to the pKVM hypervisor."
  : .
  tracing: selftests: Extend hotplug testing for trace remotes
  tracing: Non-consuming read for trace remotes with an offline CPU
  tracing: Adjust cmd_check_undefined to show unexpected undefined symbols
  tracing: Restore accidentally removed SPDX tag
  KVM: arm64: avoid unused-variable warning
  tracing: Generate undef symbols allowlist for simple_ring_buffer
  KVM: arm64: tracing: add ftrace dependency
  tracing: add more symbols to whitelist
  tracing: Update undefined symbols allow list for simple_ring_buffer
  KVM: arm64: Fix out-of-tree build for nVHE/pKVM tracing
  tracing: selftests: Add hypervisor trace remote tests
  KVM: arm64: Add selftest event support to nVHE/pKVM hyp
  KVM: arm64: Add hyp_enter/hyp_exit events to nVHE/pKVM hyp
  KVM: arm64: Add event support to the nVHE/pKVM hyp and trace remote
  KVM: arm64: Add trace reset to the nVHE/pKVM hyp
  KVM: arm64: Sync boot clock with the nVHE/pKVM hyp
  KVM: arm64: Add trace remote for the nVHE/pKVM hyp
  KVM: arm64: Add tracing capability for the nVHE/pKVM hyp
  KVM: arm64: Support unaligned fixmap in the pKVM hyp
  KVM: arm64: Initialise hyp_nr_cpus for nVHE hyp
  ...

Signed-off-by: Marc Zyngier <maz@kernel.org>
2026-04-08 12:21:51 +01:00
Anshuman Khandual
5a84b60005 perf/events: Replace READ_ONCE() with standard pgtable accessors
Replace raw READ_ONCE() dereferences of pgtable entries with corresponding
standard page table accessors pxdp_get() in perf_get_pgtable_size(). These
accessors default to READ_ONCE() on platforms that don't override them. So
there is no functional change on such platforms.

However arm64 platform is being extended to support 128 bit page tables via
a new architecture feature i.e FEAT_D128 in which case READ_ONCE() will not
provide required single copy atomic access for 128 bit page table entries.
Although pxdp_get() accessors can later be overridden on arm64 platform to
extend required single copy atomicity support on 128 bit entries.

Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260227062744.2215491-1-anshuman.khandual@arm.com
2026-04-08 13:11:46 +02:00
Michal Koutný
985215804d sched/rt: Cleanup global RT bandwidth functions
The commit 5f6bd380c7 ("sched/rt: Remove default bandwidth control")
and followup changes made a few of the functions unnecessary, drop them
for simplicity.

Signed-off-by: Michal Koutný <mkoutny@suse.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260323-sched-rert_groups-v3-3-1e7d5ed6b249@suse.com
2026-04-08 13:11:44 +02:00
Michal Koutný
4f70a0456d sched/rt: Move group schedulability check to sched_rt_global_validate()
The sched_rt_global_constraints() function is a remnant that used to set
up global RT throttling but that is no more since commit 5f6bd380c7
("sched/rt: Remove default bandwidth control") and the function ended up
only doing schedulability check.
Move the check into the validation function where it fits better.
(The order of validations sched_dl_global_validate() and
sched_rt_global_validate() shouldn't matter.)

Signed-off-by: Michal Koutný <mkoutny@suse.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260323-sched-rert_groups-v3-2-1e7d5ed6b249@suse.com
2026-04-08 13:11:44 +02:00
Michal Koutný
8b016dcec9 sched/rt: Skip group schedulable check with rt_group_sched=0
The warning from the commit 87f1fb77d8 ("sched: Add RT_GROUP WARN
checks for non-root task_groups") is wrong -- it assumes that only
task_groups with rt_rq are traversed, however, the schedulability check
would iterate all task_groups even when rt_group_sched=0 is disabled at
boot time but some non-root task_groups exist.

The schedulability check is supposed to validate:
  a) that children don't overcommit its parent,
  b) no RT task group overcommits global RT limit.
but with rt_group_sched=0 there is no (non-trivial) hierarchy of RT groups,
therefore skip the validation altogether. Otherwise, writes to the
global sched_rt_runtime_us knob will be rejected with incorrect
validation error.

This fix is immaterial with CONFIG_RT_GROUP_SCHED=n.

Fixes: 87f1fb77d8 ("sched: Add RT_GROUP WARN checks for non-root task_groups")
Signed-off-by: Michal Koutný <mkoutny@suse.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260323-sched-rert_groups-v3-1-1e7d5ed6b249@suse.com
2026-04-08 13:11:44 +02:00
Peter Zijlstra
14a8570564 sched/deadline: Use revised wakeup rule for dl_server
John noted that commit 1151354225 ("sched/deadline: Fix 'stuck' dl_server")
unfixed the issue from commit a3a70caf79 ("sched/deadline: Fix dl_server
behaviour").

The issue in commit 1151354225 was for wakeups of the server after the
deadline; in which case you *have* to start a new period. The case for
a3a70caf79 is wakeups before the deadline.

Now, because the server is effectively running a least-laxity policy, it means
that any wakeup during the runnable phase means dl_entity_overflow() will be
true. This means we need to adjust the runtime to allow it to still run until
the existing deadline expires.

Use the revised wakeup rule for dl_defer entities.

Fixes: 1151354225 ("sched/deadline: Fix 'stuck' dl_server")
Reported-by: John Stultz <jstultz@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Juri Lelli <juri.lelli@redhat.com>
Tested-by: John Stultz <jstultz@google.com>
Link: https://patch.msgid.link/20260404102244.GB22575@noisy.programming.kicks-ass.net
2026-04-08 13:11:43 +02:00
Mark Rutland
c5538d0141 entry: Split kernel mode logic from irqentry_{enter,exit}()
The generic irqentry code has entry/exit functions specifically for
exceptions taken from user mode, but doesn't have entry/exit functions
specifically for exceptions taken from kernel mode.

It would be helpful to have separate entry/exit functions specifically
for exceptions taken from kernel mode. This would make the structure of
the entry code more consistent, and would make it easier for
architectures to manage logic specific to exceptions taken from kernel
mode.

Move the logic specific to kernel mode out of irqentry_enter() and
irqentry_exit() into new irqentry_enter_from_kernel_mode() and
irqentry_exit_to_kernel_mode() functions. These are marked
__always_inline and placed in irq-entry-common.h, as with
irqentry_enter_from_user_mode() and irqentry_exit_to_user_mode(), so
that they can be inlined into architecture-specific wrappers. The
existing out-of-line irqentry_enter() and irqentry_exit() functions
retained as callers of the new functions.

The lockdep assertion from irqentry_exit() is moved into
irqentry_exit_to_user_mode() and irqentry_exit_to_kernel_mode(). This
was previously missing from irqentry_exit_to_user_mode() when called
directly, and any new lockdep assertion failure relating from this
change is a latent bug.

Aside from the lockdep change noted above, there should be no functional
change as a result of this change.

[ tglx: Updated kernel doc ]

Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Reviewed-by: Jinjie Ruan <ruanjinjie@huawei.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260407131650.3813777-5-mark.rutland@arm.com
2026-04-08 11:43:32 +02:00
Mark Rutland
22f66e7ef4 entry: Remove local_irq_{enable,disable}_exit_to_user()
local_irq_enable_exit_to_user() and local_irq_disable_exit_to_user() are
never overridden by architecture code, and are always equivalent to
local_irq_enable() and local_irq_disable().

These functions were added on the assumption that arm64 would override
them to manage 'DAIF' exception masking, as described by Thomas Gleixner
in these threads:

  https://lore.kernel.org/all/20190919150809.340471236@linutronix.de/
  https://lore.kernel.org/all/alpine.DEB.2.21.1910240119090.1852@nanos.tec.linutronix.de/

In practice arm64 did not need to override either. Prior to moving to
the generic irqentry code, arm64's management of DAIF was reworked in
commit:

  97d935faac ("arm64: Unmask Debug + SError in do_notify_resume()")

Since that commit, arm64 only masks interrupts during the 'prepare' step
when returning to user mode, and masks other DAIF exceptions later.
Within arm64_exit_to_user_mode(), the arm64 entry code is as follows:

	local_irq_disable();
	exit_to_user_mode_prepare_legacy(regs);
	local_daif_mask();
	mte_check_tfsr_exit();
	exit_to_user_mode();

Remove the unnecessary local_irq_enable_exit_to_user() and
local_irq_disable_exit_to_user() functions.

Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Reviewed-by: Jinjie Ruan <ruanjinjie@huawei.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260407131650.3813777-3-mark.rutland@arm.com
2026-04-08 11:43:31 +02:00
Amery Hung
017f5c4ef7 bpf: Allow overwriting referenced dynptr when refcnt > 1
The verifier currently does not allow overwriting a referenced dynptr's
stack slot to prevent resource leak. This is because referenced dynptr
holds additional resources that requires calling specific helpers to
release. This limitation can be relaxed when there are multiple copies
of the same dynptr. Whether it is the orignial dynptr or one of its
clones, as long as there exists at least one other dynptr with the same
ref_obj_id (to be used to release the reference), its stack slot should
be allowed to be overwritten.

Suggested-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Amery Hung <ameryhung@gmail.com>
Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20260406150548.1354271-2-ameryhung@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-04-07 18:20:49 -07:00
Daniel Borkmann
1b327732c8 bpf: Clear delta when clearing reg id for non-{add,sub} ops
When a non-{add,sub} alu op such as xor is performed on a scalar
register that previously had a BPF_ADD_CONST delta, the else path
in adjust_reg_min_max_vals() only clears dst_reg->id but leaves
dst_reg->delta unchanged.

This stale delta can propagate via assign_scalar_id_before_mov()
when the register is later used in a mov. It gets a fresh id but
keeps the stale delta from the old (now-cleared) BPF_ADD_CONST.
This stale delta can later propagate leading to a verifier-vs-
runtime value mismatch.

The clear_id label already correctly clears both delta and id.
Make the else path consistent by also zeroing the delta when id
is cleared. More generally, this introduces a helper clear_scalar_id()
which internally takes care of zeroing. There are various other
locations in the verifier where only the id is cleared. By using
the helper we catch all current and future locations.

Fixes: 98d7ca374b ("bpf: Track delta between "linked" registers.")
Reported-by: STAR Labs SG <info@starlabs.sg>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/r/20260407192421.508817-2-daniel@iogearbox.net
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-04-07 18:15:42 -07:00
Daniel Borkmann
d7f14173c0 bpf: Fix linked reg delta tracking when src_reg == dst_reg
Consider the case of rX += rX where src_reg and dst_reg are pointers to
the same bpf_reg_state in adjust_reg_min_max_vals(). The latter first
modifies the dst_reg in-place, and later in the delta tracking, the
subsequent is_reg_const(src_reg)/reg_const_value(src_reg) reads the
post-{add,sub} value instead of the original source.

This is problematic since it sets an incorrect delta, which sync_linked_regs()
then propagates to linked registers, thus creating a verifier-vs-runtime
mismatch. Fix it by just skipping this corner case.

Fixes: 98d7ca374b ("bpf: Track delta between "linked" registers.")
Reported-by: STAR Labs SG <info@starlabs.sg>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/r/20260407192421.508817-1-daniel@iogearbox.net
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-04-07 18:15:42 -07:00
Andrey Grodzovsky
1870ddcd94 bpf: Prefer vmlinux symbols over module symbols for unqualified kprobes
When an unqualified kprobe target exists in both vmlinux and a loaded
module, number_of_same_symbols() returns a count greater than 1,
causing kprobe attachment to fail with -EADDRNOTAVAIL even though the
vmlinux symbol is unambiguous.

When no module qualifier is given and the symbol is found in vmlinux,
return the vmlinux-only count without scanning loaded modules. This
preserves the existing behavior for all other cases:
- Symbol only in a module: vmlinux count is 0, falls through to module
  scan as before.
- Symbol qualified with MOD:SYM: mod != NULL, unchanged path.
- Symbol ambiguous within vmlinux itself: count > 1 is returned as-is.

Fixes: 926fe783c8 ("tracing/kprobes: Fix symbol counting logic by looking at modules as well")
Fixes: 9d8616034f ("tracing/kprobes: Add symbol counting check when module loads")
Suggested-by: Ihor Solodrai <ihor.solodrai@linux.dev>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Ihor Solodrai <ihor.solodrai@linux.dev>
Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@crowdstrike.com>
Link: https://lore.kernel.org/r/20260407203912.1787502-2-andrey.grodzovsky@crowdstrike.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-04-07 16:27:52 -07:00
Kumar Kartikeya Dwivedi
57b23c0f61 bpf: Retire rcu_trace_implies_rcu_gp()
RCU Tasks Trace grace period implies RCU grace period, and this
guarantee is expected to remain in the future. Only BPF is the user of
this predicate, hence retire the API and clean up all in-tree users.

RCU Tasks Trace is now implemented on SRCU-fast and its grace period
mechanism always has at least one call to synchronize_rcu() as it is
required for SRCU-fast's correctness (it replaces the smp_mb() that
SRCU-fast readers skip). So, RCU-tt GP will always imply RCU GP.

Reviewed-by: Puranjay Mohan <puranjay@kernel.org>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20260407162234.785270-1-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-04-07 12:24:49 -07:00
Maninder Singh
034db4dd44 workqueue: use NR_STD_WORKER_POOLS instead of hardcoded value
use NR_STD_WORKER_POOLS for irq_work_fns[] array definition.
NR_STD_WORKER_POOLS is also 2, but better to use MACRO.
Initialization loop for_each_bh_worker_pool() also uses same MACRO.

Signed-off-by: Maninder Singh <maninder1.s@samsung.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2026-04-07 08:13:19 -10:00
Linus Torvalds
66d64899ea 8 hotfixes. All are cc:stable and 7 are for MM.
All are singletons - please see the changelogs for details.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCadQzWgAKCRDdBJ7gKXxA
 jp58AP9C4nNI7ReZ9hyH3xg32JZIszk8vzwRUZte/HvvY6YsXQEAt94DZnO8wa3k
 DoYYRjIi9PfhZzI4aRJFXgANWI+0twc=
 =VqfR
 -----END PGP SIGNATURE-----

Merge tag 'mm-hotfixes-stable-2026-04-06-15-27' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull misc fixes from Andrew Morton:
 "Eight hotfixes.  All are cc:stable and seven are for MM.

  All are singletons - please see the changelogs for details"

* tag 'mm-hotfixes-stable-2026-04-06-15-27' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
  ocfs2: fix out-of-bounds write in ocfs2_write_end_inline
  mm/damon/stat: deallocate damon_call() failure leaking damon_ctx
  mm/vma: fix memory leak in __mmap_region()
  mm/memory_hotplug: maintain N_NORMAL_MEMORY during hotplug
  mm/damon/sysfs: dealloc repeat_call_control if damon_call() fails
  mm: reinstate unconditional writeback start in balance_dirty_pages()
  liveupdate: propagate file deserialization failures
  mm: filemap: fix nr_pages calculation overflow in filemap_map_pages()
2026-04-07 10:24:44 -07:00
Zhan Xusheng
09c04714cb alarmtimer: Access timerqueue node under lock in suspend
In alarmtimer_suspend(), timerqueue_getnext() is called under
base->lock, but next->expires is read after the lock is released.

This is safe because suspend freezes all relevant task contexts,
but reading the node while holding the lock makes the code easier
to reason about and not worry about a theoretical UAF.

Signed-off-by: Zhan Xusheng <zhanxusheng@xiaomi.com>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Link: https://patch.msgid.link/20260407143627.19405-1-zhanxusheng@xiaomi.com
2026-04-07 19:14:26 +02:00
Jiayuan Chen
beaf0e96b1 bpf: Drop task_to_inode and inet_conn_established from lsm sleepable hooks
bpf_lsm_task_to_inode() is called under rcu_read_lock() and
bpf_lsm_inet_conn_established() is called from softirq context, so
neither hook can be used by sleepable LSM programs.

Fixes: 423f16108c ("bpf: Augment the set of sleepable LSM hooks")
Reported-by: Quan Sun <2022090917019@std.uestc.edu.cn>
Reported-by: Yinhao Hu <dddddd@hust.edu.cn>
Reported-by: Kaiyan Mei <M202472210@hust.edu.cn>
Reported-by: Dongliang Mu <dzm91@hust.edu.cn>
Closes: https://lore.kernel.org/bpf/3ab69731-24d1-431a-a351-452aafaaf2a5@std.uestc.edu.cn/T/#u
Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev>
Link: https://lore.kernel.org/r/20260407122334.344072-1-jiayuan.chen@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-04-07 07:57:07 -07:00
Josh Snyder
82b915051d tick/nohz: Fix inverted return value in check_tick_dependency() fast path
Commit 56534673ce ("tick/nohz: Optimize check_tick_dependency() with
early return") added a fast path that returns !val when the tick_stop
tracepoint is disabled.

This is inverted: the slow path returns true when a dependency IS found
(val != 0), but !val returns true when val is zero (no dependency).  The
result is that can_stop_full_tick() sees "dependency found" when there are
none, and the tick never stops on nohz_full CPUs.

Fix this by returning !!val instead of !val, matching the slow-path semantics.

Fixes: 56534673ce ("tick/nohz: Optimize check_tick_dependency() with early return")
Signed-off-by: Josh Snyder <josh@code406.com>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Assisted-by: Claude:claude-opus-4-6
Link: https://patch.msgid.link/20260402-fix-idle-tick2-v1-1-eecb589649d3@code406.com
2026-04-07 15:30:21 +02:00
K Prateek Nayak
556146ce5e sched/fair: Avoid overflow in enqueue_entity()
Here is one scenario which was triggered when running:

    stress-ng --yield=32 -t 10000000s&
    while true; do perf bench sched messaging -p -t -l 100000 -g 16; done

on a 256CPUs machine after about an hour into the run:

    __enqeue_entity: entity_key(-141245081754) weight(90891264) overflow_mul(5608800059305154560) vlag(57498) delayed?(0)
    cfs_rq: zero_vruntime(3809707759657809) sum_w_vruntime(0) sum_weight(0) nr_queued(1)
    cfs_rq->curr: entity_key(0) vruntime(3809707759657809) deadline(3809723966988476) weight(37)

The above comes from __enqueue_entity() after a place_entity(). Breaking
this down:

    vlag_initial = 57498
    vlag = (57498 * (37 + 90891264)) / 37 = 141,245,081,754

    vruntime = 3809707759657809 - 141245081754 = 3,809,566,514,576,055
    entity_key(se, cfs_rq) = -141,245,081,754

Now, multiplying the entity_key with its own weight results to
5,608,800,059,305,154,560 (same as what overflow_mul() suggests) but
in Python, without overflow, this would be: -1,2837,944,014,404,397,056

Avoid the overflow (without doing the division for avg_vruntime()), by moving
zero_vruntime to the new entity when it is heavier.

Fixes: 4823725d9d ("sched/fair: Increase weight bits for avg_vruntime")
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
[peterz: suggested 'weight > load' condition]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260407120052.GG3738010@noisy.programming.kicks-ass.net
2026-04-07 14:02:00 +02:00
Joseph Salisbury
c6e80201e0 sched: Use u64 for bandwidth ratio calculations
to_ratio() computes BW_SHIFT-scaled bandwidth ratios from u64 period and
runtime values, but it returns unsigned long.  tg_rt_schedulable() also
stores the current group limit and the accumulated child sum in unsigned
long.

On 32-bit builds, large bandwidth ratios can be truncated and the RT
group sum can wrap when enough siblings are present.  That can let an
overcommitted RT hierarchy pass the schedulability check, and it also
narrows the helper result for other callers.

Return u64 from to_ratio() and use u64 for the RT group totals so
bandwidth ratios are preserved and compared at full width on both 32-bit
and 64-bit builds.

Fixes: b40b2e8eb5 ("sched: rt: multi level group constraints")
Assisted-by: Codex:GPT-5
Signed-off-by: Joseph Salisbury <joseph.salisbury@oracle.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: stable@vger.kernel.org
Link: https://patch.msgid.link/20260403210014.2713404-1-joseph.salisbury@oracle.com
2026-04-07 09:23:52 +02:00
Anton Protopopov
43cd9d9520 bpf: Do not ignore offsets for loads from insn_arrays
When a pointer to PTR_TO_INSN is dereferenced, the offset field
of the BPF_LDX_MEM instruction can be nonzero. Patch the verifier
to not ignore this field.

Reported-by: Jiyong Yang <ksur673@gmail.com>
Fixes: 493d9e0d60 ("bpf, x86: add support for indirect jumps")
Signed-off-by: Anton Protopopov <a.s.protopopov@gmail.com>
Link: https://lore.kernel.org/r/20260406160141.36943-2-a.s.protopopov@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-04-06 18:38:32 -07:00
Gustavo A. R. Silva
18474aed5d bpf: Avoid -Wflex-array-members-not-at-end warnings
Apparently, struct bpf_empty_prog_array exists entirely to populate a
single element of "items" in a global variable. "null_prog" is only
used during the initializer.

None of this is needed; globals will be correctly sized with an array
initializer of a flexible-array member.

So, remove struct bpf_empty_prog_array and adjust the rest of the code,
accordingly.

With these changes, fix the following warnings:

./include/linux/bpf.h:2369:31: warning: structure containing a flexible
array member is not at the end of another structure [-Wflex-array-member-not-at-end]

Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Acked-by: Mykyta Yatsenko <yatsenko@meta.com>
Link: https://lore.kernel.org/r/acr7Whmn0br3xeBP@kspp
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-04-06 18:37:52 -07:00
Kumar Kartikeya Dwivedi
f25777056e bpf: Enable unaligned accesses for syscall ctx
Don't reject usage of fixed unaligned offsets for syscall ctx. Tests
will be added in later commits. Unaligned offsets already work for
variable offsets.

Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20260406194403.1649608-3-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-04-06 15:27:26 -07:00
Kumar Kartikeya Dwivedi
ae5ef001aa bpf: Support variable offsets for syscall PTR_TO_CTX
Allow accessing PTR_TO_CTX with variable offsets in syscall programs.
Fixed offsets are already enabled for all program types that do not
convert their ctx accesses, since the changes we made in the commit
de6c7d99f8 ("bpf: Relax fixed offset check for PTR_TO_CTX"). Note
that we also lift the restriction on passing syscall context into
helpers, which was not permitted before, and passing modified syscall
context into kfuncs.

The structure of check_mem_access can be mostly shared and preserved,
but we must use check_mem_region_access to correctly verify access with
variable offsets.

The check made in check_helper_mem_access is hardened to only allow
PTR_TO_CTX for syscall programs to be passed in as helper memory. This
was the original intention of the existing code anyway, and it makes
little sense for other program types' context to be utilized as a memory
buffer. In case a convincing example presents itself in the future, this
check can be relaxed further.

We also no longer use the last-byte access to simulate helper memory
access, but instead go through check_mem_region_access. Since this no
longer updates our max_ctx_offset, we must do so manually, to keep track
of the maximum offset at which the program ctx may be accessed.

Take care to ensure that when arg_type is ARG_PTR_TO_CTX, we do not
relax any fixed or variable offset constraints around PTR_TO_CTX even in
syscall programs, and require them to be passed unmodified. There are
several reasons why this is necessary. First, if we pass a modified ctx,
then the global subprog's accesses will not update the max_ctx_offset to
its true maximum offset, and can lead to out of bounds accesses. Second,
tail called program (or extension program replacing global subprog) where
their max_ctx_offset exceeds the program they are being called from can
also cause issues. For the latter, unmodified PTR_TO_CTX is the first
requirement for the fix, the second is ensuring max_ctx_offset >= the
program they are being called from, which has to be a separate change
not made in this commit.

All in all, we can hint using arg_type when we expect ARG_PTR_TO_CTX and
make our relaxation around offsets conditional on it.

Drop coverage of syscall tests from verifier_ctx.c temporarily for
negative cases until they are updated in subsequent commits.

Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>
Acked-by: Puranjay Mohan <puranjay@kernel.org>
Acked-by: Mykyta Yatsenko <yatsenko@meta.com>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20260406194403.1649608-2-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-04-06 15:27:26 -07:00
Leo Timmins
307e0c5859 liveupdate: propagate file deserialization failures
luo_session_deserialize() ignored the return value from
luo_file_deserialize().  As a result, a session could be left partially
restored even though the /dev/liveupdate open path treats deserialization
failures as fatal.

Propagate the error so a failed file deserialization aborts session
deserialization instead of silently continuing.

Link: https://lkml.kernel.org/r/20260325044608.8407-1-leotimmins1974@gmail.com
Link: https://lkml.kernel.org/r/20260325044608.8407-2-leotimmins1974@gmail.com
Fixes: 16cec0d265 ("liveupdate: luo_session: add ioctls for file preservation")
Signed-off-by: Leo Timmins <leotimmins1974@gmail.com>
Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Reviewed-by: Pratyush Yadav <pratyush@kernel.org>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-06 11:13:42 -07:00
MingTao Huang
a1aa9ef47c bpf: Fix stale offload->prog pointer after constant blinding
When a dev-bound-only BPF program (BPF_F_XDP_DEV_BOUND_ONLY) undergoes
JIT compilation with constant blinding enabled (bpf_jit_harden >= 2),
bpf_jit_blind_constants() clones the program. The original prog is then
freed in bpf_jit_prog_release_other(), which updates aux->prog to point
to the surviving clone, but fails to update offload->prog.

This leaves offload->prog pointing to the freed original program. When
the network namespace is subsequently destroyed, cleanup_net() triggers
bpf_dev_bound_netdev_unregister(), which iterates ondev->progs and calls
__bpf_prog_offload_destroy(offload->prog). Accessing the freed prog
causes a page fault:

BUG: unable to handle page fault for address: ffffc900085f1038
Workqueue: netns cleanup_net
RIP: 0010:__bpf_prog_offload_destroy+0xc/0x80
Call Trace:
__bpf_offload_dev_netdev_unregister+0x257/0x350
bpf_dev_bound_netdev_unregister+0x4a/0x90
unregister_netdevice_many_notify+0x2a2/0x660
...
cleanup_net+0x21a/0x320

The test sequence that triggers this reliably is:

1. Set net.core.bpf_jit_harden=2 (echo 2 > /proc/sys/net/core/bpf_jit_harden)
2. Run xdp_metadata selftest, which creates a dev-bound-only XDP
   program on a veth inside a netns (./test_progs -t xdp_metadata)
3. cleanup_net -> page fault in __bpf_prog_offload_destroy

Dev-bound-only programs are unique in that they have an offload structure
but go through the normal JIT path instead of bpf_prog_offload_compile().
This means they are subject to constant blinding's prog clone-and-replace,
while also having offload->prog that must stay in sync.

Fix this by updating offload->prog in bpf_jit_prog_release_other(),
alongside the existing aux->prog update. Both are back-pointers to
the prog that must be kept in sync when the prog is replaced.

Fixes: 2b3486bc2d ("bpf: Introduce device-bound XDP programs")
Signed-off-by: MingTao Huang <mintaohuang@tencent.com>
Link: https://lore.kernel.org/r/tencent_BCF692F45859CCE6C22B7B0B64827947D406@qq.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-04-05 18:48:09 -07:00
Weiming Shi
5828b9e5b2 bpf: fix end-of-list detection in cgroup_storage_get_next_key()
list_next_entry() never returns NULL -- when the current element is the
last entry it wraps to the list head via container_of(). The subsequent
NULL check is therefore dead code and get_next_key() never returns
-ENOENT for the last element, instead reading storage->key from a bogus
pointer that aliases internal map fields and copying the result to
userspace.

Replace it with list_entry_is_head() so the function correctly returns
-ENOENT when there are no more entries.

Fixes: de9cbbaadb ("bpf: introduce cgroup storage maps")
Reported-by: Xiang Mei <xmei5@asu.edu>
Signed-off-by: Weiming Shi <bestswngs@gmail.com>
Reviewed-by: Sun Jian <sun.jian.kdev@gmail.com>
Acked-by: Paul Chaignon <paul.chaignon@gmail.com>
Link: https://lore.kernel.org/r/20260403132951.43533-2-bestswngs@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-04-05 18:45:05 -07:00
Mykyta Yatsenko
07738bc566 bpf: Use copy_map_value_locked() in alloc_htab_elem() for BPF_F_LOCK
When a BPF_F_LOCK update races with a concurrent delete, the freed
element can be immediately recycled by alloc_htab_elem(). The fast path
in htab_map_update_elem() performs a lockless lookup and then calls
copy_map_value_locked() under the element's spin_lock. If
alloc_htab_elem() recycles the same memory, it overwrites the value
with plain copy_map_value(), without taking the spin_lock, causing
torn writes.

Use copy_map_value_locked() when BPF_F_LOCK is set so the new element's
value is written under the embedded spin_lock, serializing against any
stale lock holders.

Fixes: 96049f3afd ("bpf: introduce BPF_F_LOCK flag")
Reported-by: Aaron Esau <aaron1esau@gmail.com>
Closes: https://lore.kernel.org/all/CADucPGRvSRpkneb94dPP08YkOHgNgBnskTK6myUag_Mkjimihg@mail.gmail.com/
Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
Link: https://lore.kernel.org/r/20260401-bpf_map_torn_writes-v1-1-782d071c55e7@meta.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-04-05 18:37:32 -07:00