mirror of
https://github.com/torvalds/linux.git
synced 2026-05-12 16:18:45 +02:00
master
51779 Commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
8285917d6f |
mm: memcontrol: prepare for reparenting non-hierarchical stats
To resolve the dying memcg issue, we need to reparent LRU folios of child memcg to its parent memcg. This could cause problems for non-hierarchical stats. As Yosry Ahmed pointed out: In short, if memory is charged to a dying cgroup at the time of reparenting, when the memory gets uncharged the stats updates will occur at the parent. This will update both hierarchical and non-hierarchical stats of the parent, which would corrupt the parent's non-hierarchical stats (because those counters were never incremented when the memory was charged). Now we have the following two types of non-hierarchical stats, and they are only used in CONFIG_MEMCG_V1: a. memcg->vmstats->state_local[i] b. pn->lruvec_stats->state_local[i] To ensure that these non-hierarchical stats work properly, we need to reparent these non-hierarchical stats after reparenting LRU folios. To this end, this commit makes the following preparations: 1. implement reparent_state_local() to reparent non-hierarchical stats 2. make css_killed_work_fn() to be called in rcu work, and implement get_non_dying_memcg_start() and get_non_dying_memcg_end() to avoid race between mod_memcg_state()/mod_memcg_lruvec_state() and reparent_state_local() Link: https://lore.kernel.org/e862995c45a7101a541284b6ebee5e5c32c89066.1772711148.git.zhengqi.arch@bytedance.com Co-developed-by: Yosry Ahmed <yosry@kernel.org> Signed-off-by: Yosry Ahmed <yosry@kernel.org> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Cc: Allen Pais <apais@linux.microsoft.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Baoquan He <bhe@redhat.com> Cc: Chengming Zhou <chengming.zhou@linux.dev> Cc: Chen Ridong <chenridong@huawei.com> Cc: David Hildenbrand <david@kernel.org> Cc: Hamza Mahfooz <hamzamahfooz@linux.microsoft.com> Cc: Harry Yoo <harry.yoo@oracle.com> Cc: Hugh Dickins <hughd@google.com> Cc: Imran Khan <imran.f.khan@oracle.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kamalesh Babulal <kamalesh.babulal@oracle.com> Cc: Lance Yang <lance.yang@linux.dev> Cc: Liam Howlett <Liam.Howlett@oracle.com> Cc: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Michal Koutný <mkoutny@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Usama Arif <usamaarif642@gmail.com> Cc: Vlastimil Babka <vbabka@kernel.org> Cc: Wei Xu <weixugc@google.com> Cc: Yuanchu Xie <yuanchu@google.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
eb0d6d97c2 |
bpf-fixes
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE+soXsSLHKoYyzcli6rmadz2vbToFAmnihOkACgkQ6rmadz2v
bTqjQA/+K6R/teQRwVmP1GDrfBjz2TXUzCN1WQQLzbnJNR96Mzq72+aTWjza89BK
yEUP379qiOeUfEyyV7DNfHh8hAclUAMKuvI3T3pshLQhpOS0+YcpfbakEZbos+My
AzEGhGl2nhT7S5twHFznCpuSaLgqldHkdAy4BZIiFkOS5lPBX9CU++OAslFPM+f8
R28JQYWuv2/b1mRsz8zDmQQXxwH/Rpz9hdJKcpm/kCYYBay3cAFV7ArFJfn+Y5se
9I6mTwNQ+xtSxtsmR/lftlGo1Vv9ah6qM9gKwgju0SkNrS+9UBlNUSmTrJk1fz+d
SxdppCrqxwHY3UVd62eF4fWWgusC+oMuKzTh6d+D/ZkKvnEjdAx5XQ7uUQyYhKil
G12vvKWcHit0Qz9RAhqlEEZ+GIpFTtLql6aW7pRmQKE8/vmQwAD1HBqNqWYKjokW
btlJ3fUOGu8VHtnYbI3FN6VsK8BU9t/xMny9Fys9X4KmtWBLsm4udmiorV9uC+w6
xV2s+x+ahythTEzVICB6BlQotSRyMd9kR5qisJsetWk+7NBY0Bwn7C0kfVGepHh0
WerFSYdSifTvBWQjXnvqmAX7YspmpZvevw8PCtoPq1xq5d1FrYu1K5GX/xzpy+pH
p13afkbN7Mk6OwteFefD1B0ofug3V9sx3HBI72ENs1Z+hh1KdOQ=
=79I2
-----END PGP SIGNATURE-----
Merge tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Pull bpf fixes from Alexei Starovoitov:
"Most of the diff stat comes from Xu Kuohai's fix to emit ENDBR/BTI,
since all JITs had to be touched to move constant blinding out and
pass bpf_verifier_env in.
- Fix use-after-free in arena_vm_close on fork (Alexei Starovoitov)
- Dissociate struct_ops program with map if map_update fails (Amery
Hung)
- Fix out-of-range and off-by-one bugs in arm64 JIT (Daniel Borkmann)
- Fix precedence bug in convert_bpf_ld_abs alignment check (Daniel
Borkmann)
- Fix arg tracking for imprecise/multi-offset in BPF_ST/STX insns
(Eduard Zingerman)
- Copy token from main to subprogs to fix missing kallsyms (Eduard
Zingerman)
- Prevent double close and leak of btf objects in libbpf (Jiri Olsa)
- Fix af_unix null-ptr-deref in sockmap (Michal Luczaj)
- Fix NULL deref in map_kptr_match_type for scalar regs (Mykyta
Yatsenko)
- Avoid unnecessary IPIs. Remove redundant bpf_flush_icache() in
arm64 and riscv JITs (Puranjay Mohan)
- Fix out of bounds access. Validate node_id in arena_alloc_pages()
(Puranjay Mohan)
- Reject BPF-to-BPF calls and callbacks in arm32 JIT (Puranjay Mohan)
- Refactor all JITs to pass bpf_verifier_env to emit ENDBR/BTI for
indirect jump targets on x86-64, arm64 JITs (Xu Kuohai)
- Allow UTF-8 literals in bpf_bprintf_prepare() (Yihan Ding)"
* tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf: (32 commits)
bpf, arm32: Reject BPF-to-BPF calls and callbacks in the JIT
bpf: Dissociate struct_ops program with map if map_update fails
bpf: Validate node_id in arena_alloc_pages()
libbpf: Prevent double close and leak of btf objects
selftests/bpf: cover UTF-8 trace_printk output
bpf: allow UTF-8 literals in bpf_bprintf_prepare()
selftests/bpf: Reject scalar store into kptr slot
bpf: Fix NULL deref in map_kptr_match_type for scalar regs
bpf: Fix precedence bug in convert_bpf_ld_abs alignment check
bpf, arm64: Emit BTI for indirect jump target
bpf, x86: Emit ENDBR for indirect jump targets
bpf: Add helper to detect indirect jump targets
bpf: Pass bpf_verifier_env to JIT
bpf: Move constants blinding out of arch-specific JITs
bpf, sockmap: Take state lock for af_unix iter
bpf, sockmap: Fix af_unix null-ptr-deref in proto update
selftests/bpf: Extend bpf_iter_unix to attempt deadlocking
bpf, sockmap: Fix af_unix iter deadlock
bpf, sockmap: Annotate af_unix sock:: Sk_state data-races
selftests/bpf: verify kallsyms entries for token-loaded subprograms
...
|
||
|
|
f75aeb2de8 |
bpf: Dissociate struct_ops program with map if map_update fails
Currently, when bpf_struct_ops_map_update_elem() fails, the programs' st_ops_assoc will remain set. They may become dangling pointers if the map is freed later, but they will never be dereferenced since the struct_ops attachment did not succeed. However, if one of the programs is subsequently attached as part of another struct_ops map, its st_ops_assoc will be poisoned even though its old st_ops_assoc was stale from a failed attachment. Fix the spurious poisoned st_ops_assoc by dissociating struct_ops programs with a map if the attachment fails. Move bpf_prog_assoc_struct_ops() to after *plink++ to make sure bpf_prog_disassoc_struct_ops() will not miss a program when iterating st_map->links. Note that, dissociating a program from a map requires some attention as it must not reset a poisoned st_ops_assoc or a st_ops_assoc pointing to another map. The former is already guarded in bpf_prog_disassoc_struct_ops(). The latter also will not happen since st_ops_assoc of programs in st_map->links are set by bpf_prog_assoc_struct_ops(), which can only be poisoned or pointing to the current map. Signed-off-by: Amery Hung <ameryhung@gmail.com> Link: https://lore.kernel.org/r/20260417174900.2895486-1-ameryhung@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> |
||
|
|
41d701ddc3 |
cgroup/cpuset: record DL BW alloc CPU for attach rollback
cpuset_can_attach() allocates DL bandwidth only when migrating
deadline tasks to a disjoint CPU mask, but cpuset_cancel_attach()
rolls back based only on nr_migrate_dl_tasks. This makes the DL
bandwidth alloc/free paths asymmetric: rollback can call dl_bw_free()
even when no dl_bw_alloc() was done.
Rollback also needs to undo the reservation against the same CPU/root
domain that was charged. Record the CPU used by dl_bw_alloc() and use
that state in cpuset_cancel_attach(). If no allocation happened,
dl_bw_cpu stays at -1 and rollback skips dl_bw_free(). If allocation
did happen, bandwidth is returned to the same CPU/root domain.
Successful attach paths are unchanged. This only fixes failed attach
rollback accounting.
Fixes:
|
||
|
|
87768582a4 |
dma-mapping updates for Linux 7.0:
- added support for batched cache sync, what improves performance of
dma_map/unmap_sg() operations on ARM64 architecture (Barry Song)
- introduced DMA_ATTR_CC_SHARED attribute for explicitly shared memory
used in confidential computing (Jiri Pirko)
- refactored spaghetti-like code in drivers/of/of_reserved_mem.c and its
clients (Marek Szyprowski, shared branch with device-tree updates to
avoid merge conflicts)
- prepared Contiguous Memory Allocator related code for making dma-buf
drivers modularized (Maxime Ripard)
- added support for benchmarking dma_map_sg() calls to tools/dma utility
(Qinxin Xia)
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQSrngzkoBtlA8uaaJ+Jp1EFxbsSRAUCaeCbdQAKCRCJp1EFxbsS
RHbWAQCt70dzrU0lu0omTR1HdDP4GTYfuM6nZR91e8/itGN1+QD/XH4I/0wuybzk
v5uxbIC6lR3abQRc3YNRXfi+i5j26A4=
=Oee2
-----END PGP SIGNATURE-----
Merge tag 'dma-mapping-7.1-2026-04-16' of git://git.kernel.org/pub/scm/linux/kernel/git/mszyprowski/linux
Pull dma-mapping updates from Marek Szyprowski:
- added support for batched cache sync, what improves performance of
dma_map/unmap_sg() operations on ARM64 architecture (Barry Song)
- introduced DMA_ATTR_CC_SHARED attribute for explicitly shared memory
used in confidential computing (Jiri Pirko)
- refactored spaghetti-like code in drivers/of/of_reserved_mem.c and
its clients (Marek Szyprowski, shared branch with device-tree updates
to avoid merge conflicts)
- prepared Contiguous Memory Allocator related code for making dma-buf
drivers modularized (Maxime Ripard)
- added support for benchmarking dma_map_sg() calls to tools/dma
utility (Qinxin Xia)
* tag 'dma-mapping-7.1-2026-04-16' of git://git.kernel.org/pub/scm/linux/kernel/git/mszyprowski/linux: (24 commits)
dma-buf: heaps: system: document system_cc_shared heap
dma-buf: heaps: system: add system_cc_shared heap for explicitly shared memory
dma-mapping: introduce DMA_ATTR_CC_SHARED for shared memory
mm: cma: Export cma_alloc(), cma_release() and cma_get_name()
dma: contiguous: Export dev_get_cma_area()
dma: contiguous: Make dma_contiguous_default_area static
dma: contiguous: Make dev_get_cma_area() a proper function
dma: contiguous: Turn heap registration logic around
of: reserved_mem: rework fdt_init_reserved_mem_node()
of: reserved_mem: clarify fdt_scan_reserved_mem*() functions
of: reserved_mem: rearrange code a bit
of: reserved_mem: replace CMA quirks by generic methods
of: reserved_mem: switch to ops based OF_DECLARE()
of: reserved_mem: use -ENODEV instead of -ENOENT
of: reserved_mem: remove fdt node from the structure
dma-mapping: fix false kernel-doc comment marker
dma-mapping: Support batch mode for dma_direct_{map,unmap}_sg
dma-mapping: Separate DMA sync issuing and completion waiting
arm64: Provide dcache_inval_poc_nosync helper
arm64: Provide dcache_clean_poc_nosync helper
...
|
||
|
|
c802f460dd |
cgroup/rdma: fix integer overflow in rdmacg_try_charge()
The expression `rpool->resources[index].usage + 1` is computed in int
arithmetic before being assigned to s64 variable `new`. When usage equals
INT_MAX (the default "max" value), the addition overflows to INT_MIN.
This negative value then passes the `new > max` check incorrectly,
allowing a charge that should be rejected and corrupting usage to
negative.
Fix by casting usage to s64 before the addition so the arithmetic is
done in 64-bit.
Fixes:
|
||
|
|
a5b98009f1 |
sched/psi: fix race between file release and pressure write
A potential race condition exists between pressure write and cgroup file
release regarding the priv member of struct kernfs_open_file, which
triggers the uaf reported in [1].
Consider the following scenario involving execution on two separate CPUs:
CPU0 CPU1
==== ====
vfs_rmdir()
kernfs_iop_rmdir()
cgroup_rmdir()
cgroup_kn_lock_live()
cgroup_destroy_locked()
cgroup_addrm_files()
cgroup_rm_file()
kernfs_remove_by_name()
kernfs_remove_by_name_ns()
vfs_write() __kernfs_remove()
new_sync_write() kernfs_drain()
kernfs_fop_write_iter() kernfs_drain_open_files()
cgroup_file_write() kernfs_release_file()
pressure_write() cgroup_file_release()
ctx = of->priv;
kfree(ctx);
of->priv = NULL;
cgroup_kn_unlock()
cgroup_kn_lock_live()
cgroup_get(cgrp)
cgroup_kn_unlock()
if (ctx->psi.trigger) // here, trigger uaf for ctx, that is of->priv
The cgroup_rmdir() is protected by the cgroup_mutex, it also safeguards
the memory deallocation of of->priv performed within cgroup_file_release().
However, the operations involving of->priv executed within pressure_write()
are not entirely covered by the protection of cgroup_mutex. Consequently,
if the code in pressure_write(), specifically the section handling the
ctx variable executes after cgroup_file_release() has completed, a uaf
vulnerability involving of->priv is triggered.
Therefore, the issue can be resolved by extending the scope of the
cgroup_mutex lock within pressure_write() to encompass all code paths
involving of->priv, thereby properly synchronizing the race condition
occurring between cgroup_file_release() and pressure_write().
And, if an live kn lock can be successfully acquired while executing
the pressure write operation, it indicates that the cgroup deletion
process has not yet reached its final stage; consequently, the priv
pointer within open_file cannot be NULL. Therefore, the operation to
retrieve the ctx value must be moved to a point *after* the live kn
lock has been successfully acquired.
In another situation, specifically after entering cgroup_kn_lock_live()
but before acquiring cgroup_mutex, there exists a different class of
race condition:
CPU0: write memory.pressure CPU1: write cgroup.pressure=0
=========================== =============================
kernfs_fop_write_iter()
kernfs_get_active_of(of)
pressure_write()
cgroup_kn_lock_live(memory.pressure)
cgroup_tryget(cgrp)
kernfs_break_active_protection(kn)
... blocks on cgroup_mutex
cgroup_pressure_write()
cgroup_kn_lock_live(cgroup.pressure)
cgroup_file_show(memory.pressure, false)
kernfs_show(false)
kernfs_drain_open_files()
cgroup_file_release(of)
kfree(ctx)
of->priv = NULL
cgroup_kn_unlock()
... acquires cgroup_mutex
ctx = of->priv; // may now be NULL
if (ctx->psi.trigger) // NULL dereference
Consequently, there is a possibility that of->priv is NULL, the pressure
write needs to check for this.
Now that the scope of the cgroup_mutex has been expanded, the original
explicit cgroup_get/put operations are no longer necessary, this is
because acquiring/releasing the live kn lock inherently executes a
cgroup get/put operation.
[1]
BUG: KASAN: slab-use-after-free in pressure_write+0xa4/0x210 kernel/cgroup/cgroup.c:4011
Call Trace:
pressure_write+0xa4/0x210 kernel/cgroup/cgroup.c:4011
cgroup_file_write+0x36f/0x790 kernel/cgroup/cgroup.c:4311
kernfs_fop_write_iter+0x3b0/0x540 fs/kernfs/file.c:352
Allocated by task 9352:
cgroup_file_open+0x90/0x3a0 kernel/cgroup/cgroup.c:4256
kernfs_fop_open+0x9eb/0xcb0 fs/kernfs/file.c:724
do_dentry_open+0x83d/0x13e0 fs/open.c:949
Freed by task 9353:
cgroup_file_release+0xd6/0x100 kernel/cgroup/cgroup.c:4283
kernfs_release_file fs/kernfs/file.c:764 [inline]
kernfs_drain_open_files+0x392/0x720 fs/kernfs/file.c:834
kernfs_drain+0x470/0x600 fs/kernfs/dir.c:525
Fixes:
|
||
|
|
2845989f2e |
bpf: Validate node_id in arena_alloc_pages()
arena_alloc_pages() accepts a plain int node_id and forwards it through
the entire allocation chain without any bounds checking.
Validate node_id before passing it down the allocation chain in
arena_alloc_pages().
Fixes:
|
||
|
|
0b6bc3dbe6 |
tracing latency updates for 7.1:
- Add TIMERLAT_ALIGN osnoise option Add a timer alignment option for timerlat that makes it work like the cyclictest -A option. timelat creates threads to test the latency of the kernel. The alignment option will have these threads trigger at the alignment offsets from each other. Instead of having each thread wake up at the exact same time, if the alignment is set to "20" each thread will wake up at 20 microseconds from the previous one. -----BEGIN PGP SIGNATURE----- iIoEABYKADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCaeIBaBQccm9zdGVkdEBn b29kbWlzLm9yZwAKCRAp5XQQmuv6qqM8AQCyz4uL1lCLQtJb5thG8V9QFRkYh5F3 +DyuRNoht3ijyAD+K4nZPAu4F09feWuHkssONtSZECKEuN6EhiHE4XX6Pgc= =WYYW -----END PGP SIGNATURE----- Merge tag 'trace-latency-v7.1' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull tracing latency update from Steven Rostedt: - Add TIMERLAT_ALIGN osnoise option Add a timer alignment option for timerlat that makes it work like the cyclictest -A option. timelat creates threads to test the latency of the kernel. The alignment option will have these threads trigger at the alignment offsets from each other. Instead of having each thread wake up at the exact same time, if the alignment is set to "20" each thread will wake up at 20 microseconds from the previous one. * tag 'trace-latency-v7.1' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: tracing/osnoise: Add option to align tlat threads |
||
|
|
cb30bf881c |
tracing updates for v7.1:
- Fix printf format warning for bprintf
sunrpc uses a trace_printk() that triggers a printf warning during the
compile. Move the __printf() attribute around for when debugging is not
enabled the warning will go away.
- Remove redundant check for EVENT_FILE_FL_FREED in event_filter_write()
The FREED flag is checked in the call to event_file_file() and then
checked again right afterward, which is unneeded.
- Clean up event_file_file() and event_file_data() helpers
These helper functions played a different role in the past, but now with
eventfs, the READ_ONCE() isn't needed. Simplify the code a bit and also
add a warning to event_file_data() if the file or its data is not present.
- Remove updating file->private_data in tracing open
All access to the file private data is handled by the helper functions,
which do not use file->private_data. Stop updating it on open.
- Show ENUM names in function arguments via BTF in function tracing
When showing the function arguments when func-args option is set for
function tracing, if one of the arguments is found to be an enum, show the
name of the enum instead of its number.
- Add new trace_call__##name() API for tracepoints
Tracepoints are enabled via static_branch() blocks, where when not
enabled, there's only a nop that is in the code where the execution will
just skip over it. When tracing is enabled, the nop is converted to a
direct jump to the tracepoint code. Sometimes more calculations are
required to be performed to update the parameters of the tracepoint. In
this case, trace_##name##_enabled() is called which is a static_branch()
that gets enabled only when the tracepoint is enabled. This allows the
extra calculations to also be skipped by the nop:
if (trace_foo_enabled()) {
x = bar();
trace_foo(x);
}
Where the x=bar() is only performed when foo is enabled. The problem with
this approach is that there's now two static_branch() calls. One for
checking if the tracepoint is enabled, and then again to know if the
tracepoint should be called. The second one is redundant.
Introduce trace_call__foo() that will call the foo() tracepoint directly
without doing a static_branch():
if (trace_foo_enabled()) {
x = bar();
trace_call__foo();
}
- Update various locations to use the new trace_call__##name() API
- Move snapshot code out of trace.c
Cleaning up trace.c to not be a "dump all", move the snapshot code out of
it and into a new trace_snapshot.c file.
- Clean up some "%*.s" to "%*s"
- Allow boot kernel command line options to be called multiple times
Have options like:
ftrace_filter=foo ftrace_filter=bar ftrace_filter=zoo
Equal to:
ftrace_filter=foo,bar,zoo
- Fix ipi_raise event CPU field to be a CPU field
The ipi_raise target_cpus field is defined as a __bitmask(). There is now a
__cpumask() field definition. Update the field to use that.
- Have hist_field_name() use a snprintf() and not a series of strcat()
It's safer to use snprintf() that a series of strcat().
- Fix tracepoint regfunc balancing
A tracepoint can define a "reg" and "unreg" function that gets called
before the tracepoint is enabled, and after it is disabled respectively.
But on error, after the "reg" func is called and the tracepoint is not
enabled, the "unreg" function is not called to tear down what the "reg"
function performed.
- Fix output that shows what histograms are enabled
Event variables are displayed incorrectly in the histogram output.
Instead of "sched.sched_wakeup.$var", it is showing
"$sched.sched_wakeup.var" where the '$' is in the incorrect location.
- Some other simple cleanups.
-----BEGIN PGP SIGNATURE-----
iIoEABYKADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCaeCpvxQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6qt2WAP44m85BbAjBqJe4WR103eOXV+bREBta
dRoReKJOMe519gEAp0rK/HoCvHgHhIGe3gaGdIsNhnaxoFyNWMG/wokoLAY=
=Hg6+
-----END PGP SIGNATURE-----
Merge tag 'trace-v7.1' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull tracing updates from Steven Rostedt:
- Fix printf format warning for bprintf
sunrpc uses a trace_printk() that triggers a printf warning during
the compile. Move the __printf() attribute around for when debugging
is not enabled the warning will go away
- Remove redundant check for EVENT_FILE_FL_FREED in
event_filter_write()
The FREED flag is checked in the call to event_file_file() and then
checked again right afterward, which is unneeded
- Clean up event_file_file() and event_file_data() helpers
These helper functions played a different role in the past, but now
with eventfs, the READ_ONCE() isn't needed. Simplify the code a bit
and also add a warning to event_file_data() if the file or its data
is not present
- Remove updating file->private_data in tracing open
All access to the file private data is handled by the helper
functions, which do not use file->private_data. Stop updating it on
open
- Show ENUM names in function arguments via BTF in function tracing
When showing the function arguments when func-args option is set for
function tracing, if one of the arguments is found to be an enum,
show the name of the enum instead of its number
- Add new trace_call__##name() API for tracepoints
Tracepoints are enabled via static_branch() blocks, where when not
enabled, there's only a nop that is in the code where the execution
will just skip over it. When tracing is enabled, the nop is converted
to a direct jump to the tracepoint code. Sometimes more calculations
are required to be performed to update the parameters of the
tracepoint. In this case, trace_##name##_enabled() is called which is
a static_branch() that gets enabled only when the tracepoint is
enabled. This allows the extra calculations to also be skipped by the
nop:
if (trace_foo_enabled()) {
x = bar();
trace_foo(x);
}
Where the x=bar() is only performed when foo is enabled. The problem
with this approach is that there's now two static_branch() calls. One
for checking if the tracepoint is enabled, and then again to know if
the tracepoint should be called. The second one is redundant
Introduce trace_call__foo() that will call the foo() tracepoint
directly without doing a static_branch():
if (trace_foo_enabled()) {
x = bar();
trace_call__foo();
}
- Update various locations to use the new trace_call__##name() API
- Move snapshot code out of trace.c
Cleaning up trace.c to not be a "dump all", move the snapshot code
out of it and into a new trace_snapshot.c file
- Clean up some "%*.s" to "%*s"
- Allow boot kernel command line options to be called multiple times
Have options like:
ftrace_filter=foo ftrace_filter=bar ftrace_filter=zoo
Equal to:
ftrace_filter=foo,bar,zoo
- Fix ipi_raise event CPU field to be a CPU field
The ipi_raise target_cpus field is defined as a __bitmask(). There is
now a __cpumask() field definition. Update the field to use that
- Have hist_field_name() use a snprintf() and not a series of strcat()
It's safer to use snprintf() that a series of strcat()
- Fix tracepoint regfunc balancing
A tracepoint can define a "reg" and "unreg" function that gets called
before the tracepoint is enabled, and after it is disabled
respectively. But on error, after the "reg" func is called and the
tracepoint is not enabled, the "unreg" function is not called to tear
down what the "reg" function performed
- Fix output that shows what histograms are enabled
Event variables are displayed incorrectly in the histogram output
Instead of "sched.sched_wakeup.$var", it is showing
"$sched.sched_wakeup.var" where the '$' is in the incorrect location
- Some other simple cleanups
* tag 'trace-v7.1' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: (24 commits)
selftests/ftrace: Add test case for fully-qualified variable references
tracing: Fix fully-qualified variable reference printing in histograms
tracepoint: balance regfunc() on func_add() failure in tracepoint_add_func()
tracing: Rebuild full_name on each hist_field_name() call
tracing: Report ipi_raise target CPUs as cpumask
tracing: Remove duplicate latency_fsnotify() stub
tracing: Preserve repeated trace_trigger boot parameters
tracing: Append repeated boot-time tracing parameters
tracing: Remove spurious default precision from show_event_trigger/filter formats
cpufreq: Use trace_call__##name() at guarded tracepoint call sites
tracing: Remove tracing_alloc_snapshot() when snapshot isn't defined
tracing: Move snapshot code out of trace.c and into trace_snapshot.c
mm: damon: Use trace_call__##name() at guarded tracepoint call sites
btrfs: Use trace_call__##name() at guarded tracepoint call sites
spi: Use trace_call__##name() at guarded tracepoint call sites
i2c: Use trace_call__##name() at guarded tracepoint call sites
kernel: Use trace_call__##name() at guarded tracepoint call sites
tracepoint: Add trace_call__##name() API
tracing: trace_mmap.h: fix a kernel-doc warning
tracing: Pretty-print enum parameters in function arguments
...
|
||
|
|
c9e03d5948 |
Probes updates for v7.1
- fprobe: do not zero out unused fgraph_data. This removes unneeded memset of fgraph_data in fprobe entry handler. -----BEGIN PGP SIGNATURE----- iQFPBAABCgA5FiEEh7BulGwFlgAOi5DV2/sHvwUrPxsFAmnhhjEbHG1hc2FtaS5o aXJhbWF0c3VAZ21haWwuY29tAAoJENv7B78FKz8bfBMH/21EM4dL2oDrsR3Ud7G+ R1LYB959jkztJlIKQ9xGBdUpyMd97JYszknpTG4QHb7XW4FrcvZMiHeT9AW6ZBzw zJkWVbSCst7sNYpjmqFcf/p3uUJp3/jHNG6byQ2/oWW9bilWPitY1LDL8u/VwzWN ZN35B09bw2IgKXbxCNoNlASdV8w/nnVI+CEjKSbLJBVQffBzU9losLt5Qu1i5w/A Q5O5ZLkeawQhRzilxVAqae7U949MzRvn28EiLyMJ4rJA0vWH+ijg7vFYmC4hnFde GNzJ7q7S87BmApTsSVT+QBCpIGd2XjLYfiYZBBoZIZE2wXI4fxgF+cd05XE51hrG mTs= =ycSD -----END PGP SIGNATURE----- Merge tag 'probes-v7.1' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull fprobe update from Masami Hiramatsu: - do not zero out unused fgraph_data. This removes unneeded memset of fgraph_data in fprobe entry handler. * tag 'probes-v7.1' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: tracing: fprobe: do not zero out unused fgraph_data |
||
|
|
01f492e181 |
Arm:
- Add support for tracing in the standalone EL2 hypervisor code, which should help both debugging and performance analysis. This uses the new infrastructure for 'remote' trace buffers that can be exposed by non-kernel entities such as firmware, and which came through the tracing tree. - Add support for GICv5 Per Processor Interrupts (PPIs), as the starting point for supporting the new GIC architecture in KVM. - Finally add support for pKVM protected guests, where pages are unmapped from the host as they are faulted into the guest and can be shared back from the guest using pKVM hypercalls. Protected guests are created using a new machine type identifier. As the elusive guestmem has not yet delivered on its promises, anonymous memory is also supported. This is only a first step towards full isolation from the host; for example, the CPU register state and DMA accesses are not yet isolated. Because this does not really yet bring fully what it promises, it is hidden behind CONFIG_ARM_PKVM_GUEST + 'kvm-arm.mode=protected', and also triggers TAINT_USER when a VM is created. Caveat emptor. - Rework the dreaded user_mem_abort() function to make it more maintainable, reducing the amount of state being exposed to the various helpers and rendering a substantial amount of state immutable. - Expand the Stage-2 page table dumper to support NV shadow page tables on a per-VM basis. - Tidy up the pKVM PSCI proxy code to be slightly less hard to follow. - Fix both SPE and TRBE in non-VHE configurations so that they do not generate spurious, out of context table walks that ultimately lead to very bad HW lockups. - A small set of patches fixing the Stage-2 MMU freeing in error cases. - Tighten-up accepted SMC immediate value to be only #0 for host SMCCC calls. - The usual cleanups and other selftest churn. LoongArch: - Use CSR_CRMD_PLV for kvm_arch_vcpu_in_kernel(). - Add DMSINTC irqchip in kernel support. RISC-V: - Fix steal time shared memory alignment checks - Fix vector context allocation leak - Fix array out-of-bounds in pmu_ctr_read() and pmu_fw_ctr_read_hi() - Fix double-free of sdata in kvm_pmu_clear_snapshot_area() - Fix integer overflow in kvm_pmu_validate_counter_mask() - Fix shift-out-of-bounds in make_xfence_request() - Fix lost write protection on huge pages during dirty logging - Split huge pages during fault handling for dirty logging - Skip CSR restore if VCPU is reloaded on the same core - Implement kvm_arch_has_default_irqchip() for KVM selftests - Factored-out ISA checks into separate sources - Added hideleg to struct kvm_vcpu_config - Factored-out VCPU config into separate sources - Support configuration of per-VM HGATP mode from KVM user space s390: - Support for ESA (31-bit) guests inside nested hypervisors. - Remove restriction on memslot alignment, which is not needed anymore with the new gmap code. - Fix LPSW/E to update the bear (which of course is the breaking event address register). x86: - Shut up various UBSAN warnings on reading module parameter before they were initialized. - Don't zero-allocate page tables that are used for splitting hugepages in the TDP MMU, as KVM is guaranteed to set all SPTEs in the page table and thus write all bytes. - As an optimization, bail early when trying to unsync 4KiB mappings if the target gfn can just be mapped with a 2MiB hugepage. x86 generic: - Copy single-chunk MMIO write values into struct kvm_vcpu (more precisely struct kvm_mmio_fragment) to fix use-after-free stack bugs where KVM would dereference stack pointer after an exit to userspace. - Clean up and comment the emulated MMIO code to try to make it easier to maintain (not necessarily "easy", but "easier"). - Move VMXON+VMXOFF and EFER.SVME toggling out of KVM (not *all* of VMX and SVM enabling) as it is needed for trusted I/O. - Advertise support for AVX512 Bit Matrix Multiply (BMM) instructions - Immediately fail the build if a required #define is missing in one of KVM's headers that is included multiple times. - Reject SET_GUEST_DEBUG with -EBUSY if there's an already injected exception, mostly to prevent syzkaller from abusing the uAPI to trigger WARNs, but also because it can help prevent userspace from unintentionally crashing the VM. - Exempt SMM from CPUID faulting on Intel, as per the spec. - Misc hardening and cleanup changes. x86 (AMD): - Fix and optimize IRQ window inhibit handling for AVIC; make it per-vCPU so that KVM doesn't prematurely re-enable AVIC if multiple vCPUs have to-be-injected IRQs. - Clean up and optimize the OSVW handling, avoiding a bug in which KVM would overwrite state when enabling virtualization on multiple CPUs in parallel. This should not be a problem because OSVW should usually be the same for all CPUs. - Drop a WARN in KVM_MEMORY_ENCRYPT_REG_REGION where KVM complains about a "too large" size based purely on user input. - Clean up and harden the pinning code for KVM_MEMORY_ENCRYPT_REG_REGION. - Disallow synchronizing a VMSA of an already-launched/encrypted vCPU, as doing so for an SNP guest will crash the host due to an RMP violation page fault. - Overhaul KVM's APIs for detecting SEV+ guests so that VM-scoped queries are required to hold kvm->lock, and enforce it by lockdep. Fix various bugs where sev_guest() was not ensured to be stable for the whole duration of a function or ioctl. - Convert a pile of kvm->lock SEV code to guard(). - Play nicer with userspace that does not enable KVM_CAP_EXCEPTION_PAYLOAD, for which KVM needs to set CR2 and DR6 as a response to ioctls such as KVM_GET_VCPU_EVENTS (even if the payload would end up in EXITINFO2 rather than CR2, for example). Only set CR2 and DR6 when consumption of the payload is imminent, but on the other hand force delivery of the payload in all paths where userspace retrieves CR2 or DR6. - Use vcpu->arch.cr2 when updating vmcb12's CR2 on nested #VMEXIT instead of vmcb02->save.cr2. The value is out of sync after a save/restore or after a #PF is injected into L2. - Fix a class of nSVM bugs where some fields written by the CPU are not synchronized from vmcb02 to cached vmcb12 after VMRUN, and so are not up-to-date when saved by KVM_GET_NESTED_STATE. - Fix a class of bugs where the ordering between KVM_SET_NESTED_STATE and KVM_SET_{S}REGS could cause vmcb02 to be incorrectly initialized after save+restore. - Add a variety of missing nSVM consistency checks. - Fix several bugs where KVM failed to correctly update VMCB fields on nested #VMEXIT. - Fix several bugs where KVM failed to correctly synthesize #UD or #GP for SVM-related instructions. - Add support for save+restore of virtualized LBRs (on SVM). - Refactor various helpers and macros to improve clarity and (hopefully) make the code easier to maintain. - Aggressively sanitize fields when copying from vmcb12, to guard against unintentionally allowing L1 to utilize yet-to-be-defined features. - Fix several bugs where KVM botched rAX legality checks when emulating SVM instructions. There are remaining issues in that KVM doesn't handle size prefix overrides for 64-bit guests. - Fail emulation of VMRUN/VMLOAD/VMSAVE if mapping vmcb12 fails instead of somewhat arbitrarily synthesizing #GP (i.e. don't double down on AMD's architectural but sketchy behavior of generating #GP for "unsupported" addresses). - Cache all used vmcb12 fields to further harden against TOCTOU bugs. x86 (Intel): - Drop obsolete branch hint prefixes from the VMX instruction macros. - Use ASM_INPUT_RM() in __vmcs_writel() to coerce clang into using a register input when appropriate. - Code cleanups. guest_memfd: - Don't mark guest_memfd folios as accessed, as guest_memfd doesn't support reclaim, the memory is unevictable, and there is no storage to write back to. LoongArch selftests: - Add KVM PMU test cases s390 selftests: - Enable more memory selftests. x86 selftests: - Add support for Hygon CPUs in KVM selftests. - Fix a bug in the MSR test where it would get false failures on AMD/Hygon CPUs with exactly one of RDPID or RDTSCP. - Add an MADV_COLLAPSE testcase for guest_memfd as a regression test for a bug where the kernel would attempt to collapse guest_memfd folios against KVM's will. -----BEGIN PGP SIGNATURE----- iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmnftRQUHHBib256aW5p QHJlZGhhdC5jb20ACgkQv/vSX3jHroPAzwf+NKO4Ktv+7A22ImN0SBl0nlUuulsz vTcw3+hxdRoIw83GdNS+hG5js0wrpMDnbv3t4+VliDNBSSxrBzcSWX2wpilW0Xtw qGo1MWhs2lKPy1NlaRVOwPS6j7uF3AR0TQ1iQLGMedQuCU9WpiKJxyhNXJdbLrt3 8EgFzsvtEsv+jKNRUNDf9+d0j4gZsFyIe+Brhianbw+u3/UCiUClLCdsKPc4+5ZX 08otYXytacGNIf/5Ev1vT4pHkHL0yqKXAtX7LEtaS3+0KrPuLjV4slemivzE9vf5 Evafm5AhA4wpaNMb1ZerhY3T94lsMaJpWxotjR//0Q7C9B59pCQnXCm8mg== =CcE0 -----END PGP SIGNATURE----- Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm Pull kvm updates from Paolo Bonzini: "Arm: - Add support for tracing in the standalone EL2 hypervisor code, which should help both debugging and performance analysis. This uses the new infrastructure for 'remote' trace buffers that can be exposed by non-kernel entities such as firmware, and which came through the tracing tree - Add support for GICv5 Per Processor Interrupts (PPIs), as the starting point for supporting the new GIC architecture in KVM - Finally add support for pKVM protected guests, where pages are unmapped from the host as they are faulted into the guest and can be shared back from the guest using pKVM hypercalls. Protected guests are created using a new machine type identifier. As the elusive guestmem has not yet delivered on its promises, anonymous memory is also supported This is only a first step towards full isolation from the host; for example, the CPU register state and DMA accesses are not yet isolated. Because this does not really yet bring fully what it promises, it is hidden behind CONFIG_ARM_PKVM_GUEST + 'kvm-arm.mode=protected', and also triggers TAINT_USER when a VM is created. Caveat emptor - Rework the dreaded user_mem_abort() function to make it more maintainable, reducing the amount of state being exposed to the various helpers and rendering a substantial amount of state immutable - Expand the Stage-2 page table dumper to support NV shadow page tables on a per-VM basis - Tidy up the pKVM PSCI proxy code to be slightly less hard to follow - Fix both SPE and TRBE in non-VHE configurations so that they do not generate spurious, out of context table walks that ultimately lead to very bad HW lockups - A small set of patches fixing the Stage-2 MMU freeing in error cases - Tighten-up accepted SMC immediate value to be only #0 for host SMCCC calls - The usual cleanups and other selftest churn LoongArch: - Use CSR_CRMD_PLV for kvm_arch_vcpu_in_kernel() - Add DMSINTC irqchip in kernel support RISC-V: - Fix steal time shared memory alignment checks - Fix vector context allocation leak - Fix array out-of-bounds in pmu_ctr_read() and pmu_fw_ctr_read_hi() - Fix double-free of sdata in kvm_pmu_clear_snapshot_area() - Fix integer overflow in kvm_pmu_validate_counter_mask() - Fix shift-out-of-bounds in make_xfence_request() - Fix lost write protection on huge pages during dirty logging - Split huge pages during fault handling for dirty logging - Skip CSR restore if VCPU is reloaded on the same core - Implement kvm_arch_has_default_irqchip() for KVM selftests - Factored-out ISA checks into separate sources - Added hideleg to struct kvm_vcpu_config - Factored-out VCPU config into separate sources - Support configuration of per-VM HGATP mode from KVM user space s390: - Support for ESA (31-bit) guests inside nested hypervisors - Remove restriction on memslot alignment, which is not needed anymore with the new gmap code - Fix LPSW/E to update the bear (which of course is the breaking event address register) x86: - Shut up various UBSAN warnings on reading module parameter before they were initialized - Don't zero-allocate page tables that are used for splitting hugepages in the TDP MMU, as KVM is guaranteed to set all SPTEs in the page table and thus write all bytes - As an optimization, bail early when trying to unsync 4KiB mappings if the target gfn can just be mapped with a 2MiB hugepage x86 generic: - Copy single-chunk MMIO write values into struct kvm_vcpu (more precisely struct kvm_mmio_fragment) to fix use-after-free stack bugs where KVM would dereference stack pointer after an exit to userspace - Clean up and comment the emulated MMIO code to try to make it easier to maintain (not necessarily "easy", but "easier") - Move VMXON+VMXOFF and EFER.SVME toggling out of KVM (not *all* of VMX and SVM enabling) as it is needed for trusted I/O - Advertise support for AVX512 Bit Matrix Multiply (BMM) instructions - Immediately fail the build if a required #define is missing in one of KVM's headers that is included multiple times - Reject SET_GUEST_DEBUG with -EBUSY if there's an already injected exception, mostly to prevent syzkaller from abusing the uAPI to trigger WARNs, but also because it can help prevent userspace from unintentionally crashing the VM - Exempt SMM from CPUID faulting on Intel, as per the spec - Misc hardening and cleanup changes x86 (AMD): - Fix and optimize IRQ window inhibit handling for AVIC; make it per-vCPU so that KVM doesn't prematurely re-enable AVIC if multiple vCPUs have to-be-injected IRQs - Clean up and optimize the OSVW handling, avoiding a bug in which KVM would overwrite state when enabling virtualization on multiple CPUs in parallel. This should not be a problem because OSVW should usually be the same for all CPUs - Drop a WARN in KVM_MEMORY_ENCRYPT_REG_REGION where KVM complains about a "too large" size based purely on user input - Clean up and harden the pinning code for KVM_MEMORY_ENCRYPT_REG_REGION - Disallow synchronizing a VMSA of an already-launched/encrypted vCPU, as doing so for an SNP guest will crash the host due to an RMP violation page fault - Overhaul KVM's APIs for detecting SEV+ guests so that VM-scoped queries are required to hold kvm->lock, and enforce it by lockdep. Fix various bugs where sev_guest() was not ensured to be stable for the whole duration of a function or ioctl - Convert a pile of kvm->lock SEV code to guard() - Play nicer with userspace that does not enable KVM_CAP_EXCEPTION_PAYLOAD, for which KVM needs to set CR2 and DR6 as a response to ioctls such as KVM_GET_VCPU_EVENTS (even if the payload would end up in EXITINFO2 rather than CR2, for example). Only set CR2 and DR6 when consumption of the payload is imminent, but on the other hand force delivery of the payload in all paths where userspace retrieves CR2 or DR6 - Use vcpu->arch.cr2 when updating vmcb12's CR2 on nested #VMEXIT instead of vmcb02->save.cr2. The value is out of sync after a save/restore or after a #PF is injected into L2 - Fix a class of nSVM bugs where some fields written by the CPU are not synchronized from vmcb02 to cached vmcb12 after VMRUN, and so are not up-to-date when saved by KVM_GET_NESTED_STATE - Fix a class of bugs where the ordering between KVM_SET_NESTED_STATE and KVM_SET_{S}REGS could cause vmcb02 to be incorrectly initialized after save+restore - Add a variety of missing nSVM consistency checks - Fix several bugs where KVM failed to correctly update VMCB fields on nested #VMEXIT - Fix several bugs where KVM failed to correctly synthesize #UD or #GP for SVM-related instructions - Add support for save+restore of virtualized LBRs (on SVM) - Refactor various helpers and macros to improve clarity and (hopefully) make the code easier to maintain - Aggressively sanitize fields when copying from vmcb12, to guard against unintentionally allowing L1 to utilize yet-to-be-defined features - Fix several bugs where KVM botched rAX legality checks when emulating SVM instructions. There are remaining issues in that KVM doesn't handle size prefix overrides for 64-bit guests - Fail emulation of VMRUN/VMLOAD/VMSAVE if mapping vmcb12 fails instead of somewhat arbitrarily synthesizing #GP (i.e. don't double down on AMD's architectural but sketchy behavior of generating #GP for "unsupported" addresses) - Cache all used vmcb12 fields to further harden against TOCTOU bugs x86 (Intel): - Drop obsolete branch hint prefixes from the VMX instruction macros - Use ASM_INPUT_RM() in __vmcs_writel() to coerce clang into using a register input when appropriate - Code cleanups guest_memfd: - Don't mark guest_memfd folios as accessed, as guest_memfd doesn't support reclaim, the memory is unevictable, and there is no storage to write back to LoongArch selftests: - Add KVM PMU test cases s390 selftests: - Enable more memory selftests x86 selftests: - Add support for Hygon CPUs in KVM selftests - Fix a bug in the MSR test where it would get false failures on AMD/Hygon CPUs with exactly one of RDPID or RDTSCP - Add an MADV_COLLAPSE testcase for guest_memfd as a regression test for a bug where the kernel would attempt to collapse guest_memfd folios against KVM's will" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (373 commits) KVM: x86: use inlines instead of macros for is_sev_*guest x86/virt: Treat SVM as unsupported when running as an SEV+ guest KVM: SEV: Goto an existing error label if charging misc_cg for an ASID fails KVM: SVM: Move lock-protected allocation of SEV ASID into a separate helper KVM: SEV: use mutex guard in snp_handle_guest_req() KVM: SEV: use mutex guard in sev_mem_enc_unregister_region() KVM: SEV: use mutex guard in sev_mem_enc_ioctl() KVM: SEV: use mutex guard in snp_launch_update() KVM: SEV: Assert that kvm->lock is held when querying SEV+ support KVM: SEV: Document that checking for SEV+ guests when reclaiming memory is "safe" KVM: SEV: Hide "struct kvm_sev_info" behind CONFIG_KVM_AMD_SEV=y KVM: SEV: WARN on unhandled VM type when initializing VM KVM: LoongArch: selftests: Add PMU overflow interrupt test KVM: LoongArch: selftests: Add basic PMU event counting test KVM: LoongArch: selftests: Add cpucfg read/write helpers LoongArch: KVM: Add DMSINTC inject msi to vCPU LoongArch: KVM: Add DMSINTC device support LoongArch: KVM: Make vcpu_is_preempted() as a macro rather than function LoongArch: KVM: Move host CSR_GSTAT save and restore in context switch LoongArch: KVM: Move host CSR_EENTRY save and restore in context switch ... |
||
|
|
440d6635b2 |
mm.git review status for linus..mm-nonmm-stable
Total patches: 126
Reviews/patch: 0.92
Reviewed rate: 76%
- The 2 patch series "pid: make sub-init creation retryable" from Oleg
Nesterov increases the robustness of our creation of init in a new
namespace. By clearing away some historical cruft which is no longer
needed. Also some documentation fixups are provided.
- The 2 patch series "selftests/fchmodat2: Error handling and general"
from Mark Brown has a fixup and a cleanup for the fchmodat2() syscall
selftest.
- The 3 patch series "lib: polynomial: Move to math/ and clean up" from
Andy Shevchenko does as advertised.
- The 3 patch series "hung_task: Provide runtime reset interface for
hung task detector" from Aaron Tomlin gives administrators the ability
to zero out /proc/sys/kernel/hung_task_detect_count.
- The 2 patch series "tools/getdelays: use the static UAPI headers from
tools/include/uapi" from Thomas Weißschuh teaches getdelays to use the
in-kernel UAPI headers rather than the system-provided ones.
- The 5 patch series "watchdog/hardlockup: Improvements to hardlockup"
from Mayank Rungta provides several cleanups and fixups to the
hardlockup detector code and its documentation.
- The 2 patch series "lib/bch: fix undefined behavior from signed
left-shifts" from Josh Law provides a couple of small/theoretical fixes
in the bch code.
- The 2 patch series "ocfs2/dlm: fix two bugs in dlm_match_regions()"
from Junrui Luo does what is claims.
- The 27 patch series "cleanup the RAID5 XOR library" from Christoph
Hellwig is a quite far-reaching cleanup to this code. I can't do better
than to quote Christoph:
The XOR library used for the RAID5 parity is a bit of a mess right
now. The main file sits in crypto/ despite not being cryptography and
not using the crypto API, with the generic implementations sitting in
include/asm-generic and the arch implementations sitting in an asm/
header in theory. The latter doesn't work for many cases, so
architectures often build the code directly into the core kernel, or
create another module for the architecture code.
Change this to a single module in lib/ that also contains the
architecture optimizations, similar to the library work Eric Biggers
has done for the CRC and crypto libraries later. After that it
changes to better calling conventions that allow for smarter
architecture implementations (although none is contained here yet),
and uses static_call to avoid indirection function call overhead.
- The 2 patch series "lib/list_sort: Clean up list_sort() scheduling
workarounds" from Kuan-Wei Chiu cleans up this library code by removing
a hacky thing which was added for UBIFS, which UBIFS doesn't actually
need.
- The 5 patch series "Fix bugs in extract_iter_to_sg()" from Christian
Ehrhardt fixes a few bugs in the scatterlist code, adds in-kernel tests
for the now-fixed bugs and fixes a leak in the test itself.
- The 3 patch series "kdump: Enable LUKS-encrypted dump target support
in ARM64 and PowerPC" from Coiby Xu eenables support of the
LUKS-encrypted device dump target on arm64 and powerpc.
- The 4 patch series "ocfs2: consolidate extent list validation into
block read callbacks" from Joseph Qi addresses ocfs2's validation of
extent list fields - cleanup, simplification, robustness. (Kernel test
robot loves mounting corrupted fs images!)
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCad90rQAKCRDdBJ7gKXxA
jl7rAQD4/Rq7ZSSnEv6FS4gOwc3MgTdWcZZaXkqL1KiWyYhRwAEA+cVCO344+AKb
znBOjet/hUr+/kBwyViifiC8LHzchwM=
=Nfnf
-----END PGP SIGNATURE-----
Merge tag 'mm-nonmm-stable-2026-04-15-04-20' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull non-MM updates from Andrew Morton:
- "pid: make sub-init creation retryable" (Oleg Nesterov)
Make creation of init in a new namespace more robust by clearing away
some historical cruft which is no longer needed. Also some
documentation fixups
- "selftests/fchmodat2: Error handling and general" (Mark Brown)
Fix and a cleanup for the fchmodat2() syscall selftest
- "lib: polynomial: Move to math/ and clean up" (Andy Shevchenko)
- "hung_task: Provide runtime reset interface for hung task detector"
(Aaron Tomlin)
Give administrators the ability to zero out
/proc/sys/kernel/hung_task_detect_count
- "tools/getdelays: use the static UAPI headers from
tools/include/uapi" (Thomas Weißschuh)
Teach getdelays to use the in-kernel UAPI headers rather than the
system-provided ones
- "watchdog/hardlockup: Improvements to hardlockup" (Mayank Rungta)
Several cleanups and fixups to the hardlockup detector code and its
documentation
- "lib/bch: fix undefined behavior from signed left-shifts" (Josh Law)
A couple of small/theoretical fixes in the bch code
- "ocfs2/dlm: fix two bugs in dlm_match_regions()" (Junrui Luo)
- "cleanup the RAID5 XOR library" (Christoph Hellwig)
A quite far-reaching cleanup to this code. I can't do better than to
quote Christoph:
"The XOR library used for the RAID5 parity is a bit of a mess right
now. The main file sits in crypto/ despite not being cryptography
and not using the crypto API, with the generic implementations
sitting in include/asm-generic and the arch implementations
sitting in an asm/ header in theory. The latter doesn't work for
many cases, so architectures often build the code directly into
the core kernel, or create another module for the architecture
code.
Change this to a single module in lib/ that also contains the
architecture optimizations, similar to the library work Eric
Biggers has done for the CRC and crypto libraries later. After
that it changes to better calling conventions that allow for
smarter architecture implementations (although none is contained
here yet), and uses static_call to avoid indirection function call
overhead"
- "lib/list_sort: Clean up list_sort() scheduling workarounds"
(Kuan-Wei Chiu)
Clean up this library code by removing a hacky thing which was added
for UBIFS, which UBIFS doesn't actually need
- "Fix bugs in extract_iter_to_sg()" (Christian Ehrhardt)
Fix a few bugs in the scatterlist code, add in-kernel tests for the
now-fixed bugs and fix a leak in the test itself
- "kdump: Enable LUKS-encrypted dump target support in ARM64 and
PowerPC" (Coiby Xu)
Enable support of the LUKS-encrypted device dump target on arm64 and
powerpc
- "ocfs2: consolidate extent list validation into block read callbacks"
(Joseph Qi)
Cleanup, simplify, and make more robust ocfs2's validation of extent
list fields (Kernel test robot loves mounting corrupted fs images!)
* tag 'mm-nonmm-stable-2026-04-15-04-20' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (127 commits)
ocfs2: validate group add input before caching
ocfs2: validate bg_bits during freefrag scan
ocfs2: fix listxattr handling when the buffer is full
doc: watchdog: fix typos etc
update Sean's email address
ocfs2: use get_random_u32() where appropriate
ocfs2: split transactions in dio completion to avoid credit exhaustion
ocfs2: remove redundant l_next_free_rec check in __ocfs2_find_path()
ocfs2: validate extent block list fields during block read
ocfs2: remove empty extent list check in ocfs2_dx_dir_lookup_rec()
ocfs2: validate dx_root extent list fields during block read
ocfs2: fix use-after-free in ocfs2_fault() when VM_FAULT_RETRY
ocfs2: handle invalid dinode in ocfs2_group_extend
.get_maintainer.ignore: add Askar
ocfs2: validate bg_list extent bounds in discontig groups
checkpatch: exclude forward declarations of const structs
tools/accounting: handle truncated taskstats netlink messages
taskstats: set version in TGID exit notifications
ocfs2/heartbeat: fix slot mapping rollback leaks on error paths
arm64,ppc64le/kdump: pass dm-crypt keys to kdump kernel
...
|
||
|
|
b960430ea8 |
bpf: allow UTF-8 literals in bpf_bprintf_prepare()
bpf_bprintf_prepare() only needs ASCII parsing for conversion
specifiers. Plain text can safely carry bytes >= 0x80, so allow
UTF-8 literals outside '%' sequences while keeping ASCII control
bytes rejected and format specifiers ASCII-only.
This keeps existing parsing rules for format directives unchanged,
while allowing helpers such as bpf_trace_printk() to emit UTF-8
literal text.
Update test_snprintf_negative() in the same commit so selftests keep
matching the new plain-text vs format-specifier split during bisection.
Fixes:
|
||
|
|
4d0a375887 |
bpf: Fix NULL deref in map_kptr_match_type for scalar regs
Commit |
||
|
|
4096fd0e8e |
clockevents: Add missing resets of the next_event_forced flag
The prevention mechanism against timer interrupt starvation missed to reset
the next_event_forced flag in a couple of places:
- When the clock event state changes. That can cause the flag to be
stale over a shutdown/startup sequence
- When a non-forced event is armed, which then prevents rearming before
that event. If that event is far out in the future this will cause
missed timer interrupts.
- In the suspend wakeup handler.
That led to stalls which have been reported by several people.
Add the missing resets, which fixes the problems for the reporters.
Fixes:
|
||
|
|
4245bf4dc5 |
tracing/osnoise: Add option to align tlat threads
Add an option called TIMERLAT_ALIGN to osnoise/options, together with a corresponding setting osnoise/timerlat_align_us. This option sets the alignment of wakeup times between different timerlat threads, similarly to cyclictest's -A/--aligned option. If TIMERLAT_ALIGN is set, the first thread that reaches the first cycle records its first wake-up time. Each following thread sets its first wake-up time to a fixed offset from the recorded time, and increments it by the same offset. Example: osnoise/timerlat_period is set to 1000, osnoise/timerlat_align_us is set to 20. There are four threads, on CPUs 1 to 4. - CPU 4 enters first cycle first. The current time is 20000us, so the wake-up of the first cycle is set to 21000us. This time is recorded. - CPU 2 enter first cycle next. It reads the recorded time, increments it to 21020us, and uses this value as its own wake-up time for the first cycle. - CPU 3 enters first cycle next. It reads the recorded time, increments it to 21040 us, and uses the value as its own wake-up time. - CPU 1 proceeds analogically. In each next cycle, the wake-up time (called "absolute period" in timerlat code) is incremented by the (relative) period of 1000us. Thus, the wake-ups in the following cycles (provided the times are reached and not in the past) will be as follows: CPU 1 CPU 2 CPU 3 CPU 4 21080us 21020us 21040us 21000us 22080us 22020us 22040us 22000us ... ... ... ... Even if any cycle is skipped due to e.g. the first cycle calculation happening later, the alignment stays in place. Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: John Kacur <jkacur@redhat.com> Cc: Luis Goncalves <lgoncalv@redhat.com> Cc: Costa Shulyupin <costa.shul@redhat.com> Link: https://patch.msgid.link/20260416115942.544032-1-tglozar@redhat.com Signed-off-by: Tomas Glozar <tglozar@redhat.com> Reviewed-by: Wander Lairson Costa <wander@redhat.com> Reviewed-by: Crystal Wood <crwood@redhat.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
|
|
07ae6c130b |
bpf: Add helper to detect indirect jump targets
Introduce helper bpf_insn_is_indirect_target to check whether a BPF instruction is an indirect jump target. Since the verifier knows which instructions are indirect jump targets, add a new flag indirect_target to struct bpf_insn_aux_data to mark them. The verifier sets this flag when verifying an indirect jump target instruction, and the helper checks the flag to determine whether an instruction is an indirect jump target. Reviewed-by: Anton Protopopov <a.s.protopopov@gmail.com> #v8 Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com> #v12 Signed-off-by: Xu Kuohai <xukuohai@huawei.com> Link: https://lore.kernel.org/r/20260416064341.151802-4-xukuohai@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> |
||
|
|
d9ef13f727 |
bpf: Pass bpf_verifier_env to JIT
Pass bpf_verifier_env to bpf_int_jit_compile(). The follow-up patch will use env->insn_aux_data in the JIT stage to detect indirect jump targets. Since bpf_prog_select_runtime() can be called by cbpf and lib/test_bpf.c code without verifier, introduce helper __bpf_prog_select_runtime() to accept the env parameter. Remove the call to bpf_prog_select_runtime() in bpf_prog_load(), and switch to call __bpf_prog_select_runtime() in the verifier, with env variable passed. The original bpf_prog_select_runtime() is preserved for cbpf and lib/test_bpf.c, where env is NULL. Now all constants blinding calls are moved into the verifier, except the cbpf and lib/test_bpf.c cases. The instructions arrays are adjusted by bpf_patch_insn_data() function for normal cases, so there is no need to call adjust_insn_arrays() in bpf_jit_blind_constants(). Remove it. Reviewed-by: Anton Protopopov <a.s.protopopov@gmail.com> # v8 Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com> # v12 Acked-by: Hengqi Chen <hengqi.chen@gmail.com> # v14 Signed-off-by: Xu Kuohai <xukuohai@huawei.com> Link: https://lore.kernel.org/r/20260416064341.151802-3-xukuohai@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> |
||
|
|
d3e945223e |
bpf: Move constants blinding out of arch-specific JITs
During the JIT stage, constants blinding rewrites instructions but only rewrites the private instruction copy of the JITed subprog, leaving the global env->prog->insnsi and env->insn_aux_data untouched. This causes a mismatch between subprog instructions and the global state, making it difficult to use the global data in the JIT. To avoid this mismatch, and given that all arch-specific JITs already support constants blinding, move it to the generic verifier code, and switch to rewrite the global env->prog->insnsi with the global states adjusted, as other rewrites in the verifier do. This removes the constants blinding calls in each JIT, which are largely duplicated code across architectures. Since constants blinding is only required for JIT, and there are two JIT entry functions, jit_subprogs() for BPF programs with multiple subprogs and bpf_prog_select_runtime() for programs with no subprogs, move the constants blinding invocation into these two functions. In the verifier path, bpf_patch_insn_data() is used to keep global verifier auxiliary data in sync with patched instructions. A key question is whether this global auxiliary data should be restored on the failure path. Besides instructions, bpf_patch_insn_data() adjusts: - prog->aux->poke_tab - env->insn_array_maps - env->subprog_info - env->insn_aux_data For prog->aux->poke_tab, it is only used by JIT or only meaningful after JIT succeeds, so it does not need to be restored on the failure path. For env->insn_array_maps, when JIT fails, programs using insn arrays are rejected by bpf_insn_array_ready() due to missing JIT addresses. Hence, env->insn_array_maps is only meaningful for JIT and does not need to be restored. For subprog_info, if jit_subprogs fails and CONFIG_BPF_JIT_ALWAYS_ON is not enabled, kernel falls back to interpreter. In this case, env->subprog_info is used to determine subprogram stack depth. So it must be restored on failure. For env->insn_aux_data, it is freed by clear_insn_aux_data() at the end of bpf_check(). Before freeing, clear_insn_aux_data() loops over env->insn_aux_data to release jump targets recorded in it. The loop uses env->prog->len as the array length, but this length no longer matches the actual size of the adjusted env->insn_aux_data array after constants blinding. To address it, a simple approach is to keep insn_aux_data as adjusted after failure, since it will be freed shortly, and record its actual size for the loop in clear_insn_aux_data(). But since clear_insn_aux_data() uses the same index to loop over both env->prog->insnsi and env->insn_aux_data, this approach results in incorrect index for the insnsi array. So an alternative approach is adopted: clone the original env->insn_aux_data before blinding and restore it after failure, similar to env->prog. For classic BPF programs, constants blinding works as before since it is still invoked from bpf_prog_select_runtime(). Reviewed-by: Anton Protopopov <a.s.protopopov@gmail.com> # v8 Reviewed-by: Hari Bathini <hbathini@linux.ibm.com> # powerpc jit Reviewed-by: Pu Lehui <pulehui@huawei.com> # riscv jit Acked-by: Hengqi Chen <hengqi.chen@gmail.com> # loongarch jit Signed-off-by: Xu Kuohai <xukuohai@huawei.com> Link: https://lore.kernel.org/r/20260416064341.151802-2-xukuohai@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> |
||
|
|
fdbfee9fc5 |
Runtime Verification updates for 7.1:
- Refactor da_monitor header to share handlers across monitor types No functional changes, only less code duplication. - Add Hybrid Automata model class Add a new model class that extends deterministic automata by adding constraints on transitions and states. Those constraints can take into account wall-clock time and as such allow RV monitor to make assertions on real time. Add documentation and code generation scripts. - Add stall monitor as hybrid automaton example Add a monitor that triggers a violation when a task is stalling as an example of automaton working with real time variables. - Convert the opid monitor to a hybrid automaton The opid monitor can be heavily simplified if written as a hybrid automaton: instead of tracking preempt and interrupt enable/disable events, it can just run constraints on the preemption/interrupt states when events like wakeup and need_resched verify. - Add support for per-object monitors in DA/HA Allow writing deterministic and hybrid automata monitors for generic objects (e.g. any struct), by exploiting a hash table where objects are saved. This allows to track more than just tasks in RV. For instance it will be used to track deadline entities in deadline monitors. - Add deadline tracepoints and move some deadline utilities Prepare the ground for deadline monitors by defining events and exporting helpers. - Add nomiss deadline monitor Add first example of deadline monitor asserting all entities complete before their deadline. - Improve rvgen error handling Introduce AutomataError exception class and better handle expected exceptions while showing a backtrace for unexpected ones. - Improve python code quality in rvgen Refactor the rvgen generation scripts to align with python best practices: use f-strings instead of %, use len() instead of __len__(), remove semicolons, use context managers for file operations, fix whitespace violations, extract magic strings into constants, remove unused imports and methods. - Fix small bugs in rvgen The generator scripts presented some corner case bugs: logical error in validating what a correct dot file looks like, fix an isinstance() check, enforce a dot file has an initial state, fix type annotations and typos in comments. - rvgen refactoring Refactor automata.py to use iterator-based parsing and handle required arguments directly in argparse. - Allow epoll in rtapp-sleep monitor The epoll_wait call is now rt-friendly so it should be allowed in the sleep monitor as a valid sleep method. -----BEGIN PGP SIGNATURE----- iIoEABYKADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCad4kUxQccm9zdGVkdEBn b29kbWlzLm9yZwAKCRAp5XQQmuv6qiJgAPsGJ0TCTubdn2x6fpwLDKUcSsNsYrok m8gLHeK0rmkKNQD/ajUW+RBsOufXAsnSzUoUcl7sw1TSiQJ085U+0+ZeDgM= =WZqk -----END PGP SIGNATURE----- Merge tag 'trace-rv-v7.1' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull runtime verification updates from Steven Rostedt: - Refactor da_monitor header to share handlers across monitor types No functional changes, only less code duplication. - Add Hybrid Automata model class Add a new model class that extends deterministic automata by adding constraints on transitions and states. Those constraints can take into account wall-clock time and as such allow RV monitor to make assertions on real time. Add documentation and code generation scripts. - Add stall monitor as hybrid automaton example Add a monitor that triggers a violation when a task is stalling as an example of automaton working with real time variables. - Convert the opid monitor to a hybrid automaton The opid monitor can be heavily simplified if written as a hybrid automaton: instead of tracking preempt and interrupt enable/disable events, it can just run constraints on the preemption/interrupt states when events like wakeup and need_resched verify. - Add support for per-object monitors in DA/HA Allow writing deterministic and hybrid automata monitors for generic objects (e.g. any struct), by exploiting a hash table where objects are saved. This allows to track more than just tasks in RV. For instance it will be used to track deadline entities in deadline monitors. - Add deadline tracepoints and move some deadline utilities Prepare the ground for deadline monitors by defining events and exporting helpers. - Add nomiss deadline monitor Add first example of deadline monitor asserting all entities complete before their deadline. - Improve rvgen error handling Introduce AutomataError exception class and better handle expected exceptions while showing a backtrace for unexpected ones. - Improve python code quality in rvgen Refactor the rvgen generation scripts to align with python best practices: use f-strings instead of %, use len() instead of __len__(), remove semicolons, use context managers for file operations, fix whitespace violations, extract magic strings into constants, remove unused imports and methods. - Fix small bugs in rvgen The generator scripts presented some corner case bugs: logical error in validating what a correct dot file looks like, fix an isinstance() check, enforce a dot file has an initial state, fix type annotations and typos in comments. - rvgen refactoring Refactor automata.py to use iterator-based parsing and handle required arguments directly in argparse. - Allow epoll in rtapp-sleep monitor The epoll_wait call is now rt-friendly so it should be allowed in the sleep monitor as a valid sleep method. * tag 'trace-rv-v7.1' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: (32 commits) rv: Allow epoll in rtapp-sleep monitor rv/rvgen: fix _fill_states() return type annotation rv/rvgen: fix unbound loop variable warning rv/rvgen: enforce presence of initial state rv/rvgen: extract node marker string to class constant rv/rvgen: fix isinstance check in Variable.expand() rv/rvgen: make monitor arguments required in rvgen rv/rvgen: remove unused __get_main_name method rv/rvgen: remove unused sys import from dot2c rv/rvgen: refactor automata.py to use iterator-based parsing rv/rvgen: use class constant for init marker rv/rvgen: fix DOT file validation logic error rv/rvgen: fix PEP 8 whitespace violations rv/rvgen: fix typos in automata and generator docstring and comments rv/rvgen: use context managers for file operations rv/rvgen: remove unnecessary semicolons rv/rvgen: replace __len__() calls with len() rv/rvgen: replace % string formatting with f-strings rv/rvgen: remove bare except clauses in generator rv/rvgen: introduce AutomataError exception class ... |
||
|
|
0251e40c48 |
bpf: copy BPF token from main program to subprograms
bpf_jit_subprogs() copies various fields from the main program's aux to
each subprogram's aux, but omits the BPF token. This causes
bpf_prog_kallsyms_add() to fail for subprograms loaded via BPF token,
as bpf_token_capable() falls back to capable() in init_user_ns when
token is NULL.
Copy prog->aux->token to func[i]->aux->token so that subprograms
inherit the same capability delegation as the main program.
Fixes:
|
||
|
|
e4bf304f00 |
ring-buffer updates for 7.1:
- Add remote buffers for pKVM
pKVM has a hypervisor component that is used to protect the guest from the
host kernel. This hypervisor is a black box to the kernel as the kernel is
to user space. The remote buffers are used to have a memory mapping
between the hypervisor and the kernel where kernel may send commands to
enable tracing within the hypervisor. Then the kernel will read this
memory mapping just like user space can read the memory mapped ring buffer
of the kernel tracing system.
Since the hypervisor only has a single context, it doesn't need to worry
about races between normal context, interrupt context and NMIs like the
kernel does. The ring buffer it uses doesn't need to be as complex. The
remote buffers are a simple version of the ring buffer that works in a
single context. They are still per-CPU and use sub buffers. The data
layout is the same as the kernel's ring buffer to share the same parsing.
Currently, only ARM64 implements pKVM, but there's work to implement it
also in x86. The remote buffer code is separated out from the ARM
implementation so that it can be used in the future by x86.
The ARM64 updates for pKVM is in the ARM/KVM tree and it merged in the
remote buffers of this tree.
- Merge commit
|
||
|
|
1521829632 |
ftrace updates for 7.1:
- Speed up ftrace_lookup_symbols() for single lookups The kallsyms lookup in ftrace_lookup_symbols() does a linear search over each symbol. This is fine when it must match multiple strings, but when there's only a single string being searched for, using a binary search is much more efficient. When a single string is passed in to search, use the binary search method. -----BEGIN PGP SIGNATURE----- iIoEABYKADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCad4RehQccm9zdGVkdEBn b29kbWlzLm9yZwAKCRAp5XQQmuv6quZeAQD0rsWOGF0D970asst2WtCOoQF5G2ao l0ED7QZFoB1qcAEA7ZAuV9+zJwFOvhAekifuDEArPwpadlcUC4dU7tJU1w0= =cMjY -----END PGP SIGNATURE----- Merge tag 'ftrace-v7.1' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull ftrace update from Steven Rostedt: - Speed up ftrace_lookup_symbols() for single lookups The kallsyms lookup in ftrace_lookup_symbols() does a linear search over each symbol. This is fine when it must match multiple strings, but when there's only a single string being searched for, using a binary search is much more efficient. When a single string is passed in to search, use the binary search method. * tag 'ftrace-v7.1' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: ftrace: Use kallsyms binary search for single-symbol lookup |
||
|
|
aec2f682d4 |
This update includes the following changes:
API:
- Replace crypto_get_default_rng with crypto_stdrng_get_bytes.
- Remove simd skcipher support.
- Allow algorithm types to be disabled when CRYPTO_SELFTESTS is off.
Algorithms:
- Remove CPU-based des/3des acceleration.
- Add test vectors for authenc(hmac(md5),cbc(aes)).
- Add test vectors for authenc(hmac(md5),cbc(des)).
- Add test vectors for authenc(hmac(md5),rfc3686(ctr(aes))).
- Add test vectors for authenc(hmac(sha1),rfc3686(ctr(aes))).
- Add test vectors for authenc(hmac(sha224),rfc3686(ctr(aes))).
- Add test vectors for authenc(hmac(sha256),rfc3686(ctr(aes))).
- Add test vectors for authenc(hmac(sha384),rfc3686(ctr(aes))).
- Add test vectors for authenc(hmac(sha512),rfc3686(ctr(aes))).
- Replace spin lock with mutex in jitterentropy.
Drivers:
- Add authenc algorithms to safexcel.
- Add support for zstd in qat.
- Add wireless mode support for QAT GEN6.
- Add anti-rollback support for QAT GEN6.
- Add support for ctr(aes), gcm(aes), and ccm(aes) in dthev2.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEn51F/lCuNhUwmDeSxycdCkmxi6cFAmne7qgACgkQxycdCkmx
i6cm0w/9HNFzIWuZWh4Q8k1d/SX32/2p40EMvlw9QFO8wt0gsMtbk6NN5G3sIfhL
36+rT8Vo5yg9MahTqAspXKjP+QTev5D7/nsDa/FzOSA1JxyvBbgV7X33k8EZjcgT
+ffuh0WbaWlutYw07o2h4cNPz1Yp4M0hp2IdzvY0Y3q9D05eiwis1SQzUVPmTs6K
I6OP+4JjJbqubOgJxsltEoeCH9ZP0fObRWmAiVm6rwk9uX4CY32nzi3QOttXQ0su
4F/useoRwWQ1t7FTy8/fcVtFpL/G8hAFSQ4un5ODhDWL7taV5sZPXQBwXUuoVQM6
aNjZlaju/MB7gnAOrBvSsniohAAqRUNR8O7P8QW6mDrFmDhUZ3ZILmCKW+VwF5SG
a4fV94XgBVOnKIqD01cc++8mb6keX/88KJW79AEWLeJ9YZ9BuyFphr9OEBFAIHqx
xG+iEg4uoVxwC52//oGt/yZaZKK3C1y/Zey5bOjfErKq3ATXGIvawaAzdvB9mh6Q
iAnl71JpR4mrs++fAyUCKM+dfvdmQYDq6HJayMdg+IHAIeIvyMnPjsGigdVJvE65
RpBKW4aclfiYaDwX9Jf703mHR1uuKGP1GKpz8U+JXN4Ax2JPg0maC1N3wFkDypYO
HUNKgEk/173f1HTjU0JjbqvqJh+rKQ3ZbHpLxZrYtnSMukDwRO0=
=KoAB
-----END PGP SIGNATURE-----
Merge tag 'v7.1-p1' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6
Pull crypto update from Herbert Xu:
"API:
- Replace crypto_get_default_rng with crypto_stdrng_get_bytes
- Remove simd skcipher support
- Allow algorithm types to be disabled when CRYPTO_SELFTESTS is off
Algorithms:
- Remove CPU-based des/3des acceleration
- Add test vectors for authenc(hmac(md5),cbc({aes,des})) and
authenc(hmac({md5,sha1,sha224,sha256,sha384,sha512}),rfc3686(ctr(aes)))
- Replace spin lock with mutex in jitterentropy
Drivers:
- Add authenc algorithms to safexcel
- Add support for zstd in qat
- Add wireless mode support for QAT GEN6
- Add anti-rollback support for QAT GEN6
- Add support for ctr(aes), gcm(aes), and ccm(aes) in dthev2"
* tag 'v7.1-p1' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: (129 commits)
crypto: af_alg - use sock_kmemdup in alg_setkey_by_key_serial
crypto: vmx - remove CRYPTO_DEV_VMX from Kconfig
crypto: omap - convert reqctx buffer to fixed-size array
crypto: atmel-sha204a - add Thorsten Blum as maintainer
crypto: atmel-ecc - add Thorsten Blum as maintainer
crypto: qat - fix IRQ cleanup on 6xxx probe failure
crypto: geniv - Remove unused spinlock from struct aead_geniv_ctx
crypto: qce - simplify qce_xts_swapiv()
crypto: hisilicon - Fix dma_unmap_single() direction
crypto: talitos - rename first/last to first_desc/last_desc
crypto: talitos - fix SEC1 32k ahash request limitation
crypto: jitterentropy - replace long-held spinlock with mutex
crypto: hisilicon - remove unused and non-public APIs for qm and sec
crypto: hisilicon/qm - drop redundant variable initialization
crypto: hisilicon/qm - remove else after return
crypto: hisilicon/qm - add const qualifier to info_name in struct qm_cmd_dump_item
crypto: hisilicon - fix the format string type error
crypto: ccree - fix a memory leak in cc_mac_digest()
crypto: qat - add support for zstd
crypto: qat - use swab32 macro
...
|
||
|
|
40286d6379 |
pci-v7.1-changes
-----BEGIN PGP SIGNATURE-----
iQJIBAABCgAyFiEEgMe7l+5h9hnxdsnuWYigwDrT+vwFAmnfwfMUHGJoZWxnYWFz
QGdvb2dsZS5jb20ACgkQWYigwDrT+vxwIRAAlN1h5er8aFDbjON5YMXBZqlQmzaC
bjlUHgwm7HkdErTFozyuqhE8QUO1kCm4uMQzeyJdfY9nRWqMDOuKYxMD5j0exk+o
4tbbJg6Xx4dq7Qrawy9PhxyQm/PDAcvs+FRRlGala+qq9o3fxPDOAZVDE/1C8qFQ
Jd7GGd7NZn/NN4xrqST4RQHjO8fwaMwmksWCStsb79kfesQWP6kLADGfIMcWxNUB
2s+oTnK6Hw0tkBv56n6i8mbb0EzS3/RN1daTevGAta1rmfUVVtWGRZ4paMvv0Owi
Rl5+O5Jz6/c1qiXZbUqu5CRQPIy7Dr3JPvURcZX6qbsV8PzWXZr0Wi+geWefGOnp
55y+3OT0vdBGAuXLJhrcU7Clzq9D/TZOt8oTI8IFArUfDlmrAIdozPn7gr+VGre5
QuKymSk3XWtyIbe4o8UeZ4f9g0y6ZY1XvtvB7K1tze+OOmqlkfq966+z8aZuGOKx
ZvAU/NIat5H02EgB4dEVOP8R5vPZlXGT0RLGl1JWRypPWyZDbVVA3z927qRQG5md
IsVq8WaIrB1zyl9g37lZeEaYwP/qCIQsHkMGPYcP4wdOQEV9AQqi5pmjMXnWyQJD
PR1nvmTKW7USRCJ+pz8xPhZh0cj3ENaddORTD3I/0CGVV0y452bU/5rr4T+K04bK
PCJBpxTIDuWDwXc=
=FFRz
-----END PGP SIGNATURE-----
Merge tag 'pci-v7.1-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/pci/pci
Pull pci updates from Bjorn Helgaas:
"Enumeration:
- Allow TLP Processing Hints to be enabled for RCiEPs (George Abraham
P)
- Enable AtomicOps only if we know the Root Port supports them (Gerd
Bayer)
- Don't enable AtomicOps for RCiEPs since none of them need Atomic
Ops and we can't tell whether the Root Complex would support them
(Gerd Bayer)
- Leave Precision Time Measurement disabled until a driver enables it
to avoid PCIe errors (Mika Westerberg)
- Make pci_set_vga_state() fail if bridge doesn't support VGA
routing, i.e., PCI_BRIDGE_CTL_VGA is not writable, and return
errors to vga_get() callers including userspace via
/dev/vga_arbiter (Simon Richter)
- Validate max-link-speed from DT in j721e, brcmstb, mediatek-gen3,
rzg3s drivers (where the actual controller constraints are known),
and remove validation from the generic OF DT accessor (Hans Zhang)
- Remove pc110pad driver (no longer useful after 486 CPU support
removed) and no_pci_devices() (pc110pad was the last user) (Dmitry
Torokhov, Heiner Kallweit)
Resource management:
- Prevent assigning space to unimplemented bridge windows; previously
we mistakenly assumed prefetchable window existed and assigned
space and put a BAR there (Ahmed Naseef)
- Avoid shrinking bridge windows to fit in the initial Root Port
window; fixes one problem with devices with large BARs connected
via switches, e.g., Thunderbolt (Ilpo Järvinen)
- Pass full extent of empty space, not just the aligned space, to
resource_alignf callback so free space before the requested
alignment can be used (Ilpo Järvinen)
- Place small resources before larger ones for better utilization of
address space (Ilpo Järvinen)
- Fix alignment calculation for resource size larger than align,
e.g., bridge windows larger than the 1MB required alignment (Ilpo
Järvinen)
Reset:
- Update slot handling so all ARI functions are treated as being in
the same slot. They're all reset by Secondary Bus Reset, but
previously drivers of ARI functions that appeared to be on a
non-zero device weren't notified and fatal hardware errors could
result (Keith Busch)
- Make sysfs reset_subordinate hotplug safe to avoid spurious hotplug
events (Keith Busch)
- Hide Secondary Bus Reset ('bus') from sysfs reset_methods if masked
by CXL because it has no effect (Vidya Sagar)
- Avoid FLR for AMD NPU device, where it causes the device to hang
(Lizhi Hou)
Error handling:
- Clear only error bits in PCIe Device Status to avoid accidentally
clearing Emergency Power Reduction Detected (Shuai Xue)
- Check for AER errors even in devices without drivers (Lukas Wunner)
- Initialize ratelimit info so DPC and EDR paths log AER error
information (Kuppuswamy Sathyanarayanan)
Power control:
- Add UPD720201/UPD720202 USB 3.0 xHCI Host Controller .compatible so
generic pwrctrl driver can control it (Neil Armstrong)
Hotplug:
- Set LED_HW_PLUGGABLE for NPEM hotplug-capable ports so LED core
doesn't complain when setting brightness fails because the endpoint
is gone (Richard Cheng)
Peer-to-peer DMA:
- Allow wildcards in list of host bridges that support peer-to-peer
DMA between hierarchy domains and add all Google SoCs (Jacob
Moroni)
Endpoint framework:
- Advertise dynamic inbound mapping support in pci-epf-test and
update host pci_endpoint_test to skip doorbell testing if not
advertised by endpoint (Koichiro Den)
- Return 0, not remaining timeout, when MHI eDMA ops complete so
mhi_ep_ring_add_element() doesn't interpret non-zero as failure
(Daniel Hodges)
- Remove vntb and ntb duplicate resource teardown that leads to oops
when .allow_link() fails or .drop_link() is called (Koichiro Den)
- Disable vntb delayed work before clearing BAR mappings and
doorbells to avoid oops caused by doing the work after resources
have been torn down (Koichiro Den)
- Add a way to describe reserved subregions within BARs, e.g.,
platform-owned fixed register windows, and use it for the RK3588
BAR4 DMA ctrl window (Koichiro Den)
- Add BAR_DISABLED for BARs that will never be available to an EPF
driver, and change some BAR_RESERVED annotations to BAR_DISABLED
(Niklas Cassel)
- Add NTB .get_dma_dev() callback for cases where DMA API requires a
different device, e.g., vNTB devices (Koichiro Den)
- Add reserved region types for MSI-X Table and PBA so Endpoint
controllers can them as describe hardware-owned regions in a
BAR_RESERVED BAR (Manikanta Maddireddy)
- Make Tegra194/234 BAR0 programmable and remove 1MB size limit
(Manikanta Maddireddy)
- Expose Tegra BAR2 (MSI-X) and BAR4 (DMA) as 64-bit BAR_RESERVED
(Manikanta Maddireddy)
- Add Tegra194 and Tegra234 device table entries to pci_endpoint_test
(Manikanta Maddireddy)
- Skip the BAR subrange selftest if there are not enough inbound
window resources to run the test (Christian Bruel)
New native PCIe controller drivers:
- Add DT binding and driver for Andes QiLai SoC PCIe host controller
(Randolph Lin)
- Add DT binding and driver for ESWIN PCIe Root Complex (Senchuan
Zhang)
Baikal T-1 PCIe controller driver:
- Remove driver since it never quite became usable (Andy Shevchenko)
Cadence PCIe controller driver:
- Implement byte/word config reads with dword (32-bit) reads because
some Cadence controllers don't support sub-dword accesses (Aksh
Garg)
CIX Sky1 PCIe controller driver:
- Add 'power-domains' to DT binding for SCMI power domain (Gary Yang)
Freescale i.MX6 PCIe controller driver:
- Add i.MX94 and i.MX943 to fsl,imx6q-pcie-ep DT binding (Richard
Zhu)
- Delay instead of polling for L2/L3 Ready after PME_Turn_off when
suspending i.MX6SX because LTSSM registers are inaccessible
(Richard Zhu)
- Separate PERST# assertion (for resetting endpoints) from core reset
(for resetting the RC itself) to prepare for new DTs with PERST#
GPIO in per-Root Port nodes (Sherry Sun)
- Retain Root Port MSI capability on i.MX7D, i.MX8MM, and i.MX8MQ so
MSI from downstream devices will work (Richard Zhu)
- Fix i.MX95 reference clock source selection when internal refclk is
used (Franz Schnyder)
Freescale Layerscape PCIe controller driver:
- Allow building as a removable module (Sascha Hauer)
MediaTek PCIe Gen3 controller driver:
- Use dev_err_probe() to simplify error paths and make deferred probe
messages visible in /sys/kernel/debug/devices_deferred (Chen-Yu
Tsai)
- Power off device if setup fails (Chen-Yu Tsai)
- Integrate new pwrctrl API to enable power control for WiFi/BT
adapters on mainboard or in PCIe or M.2 slots (Chen-Yu Tsai)
NVIDIA Tegra194 PCIe controller driver:
- Poll less aggressively and non-atomically for PME_TO_Ack during
transition to L2 (Vidya Sagar)
- Disable LTSSM after transition to Detect on surprise link down to
stop toggling between Polling and Detect (Manikanta Maddireddy)
- Don't force the device into the D0 state before L2 when suspending
or shutting down the controller (Vidya Sagar)
- Disable PERST# IRQ only in Endpoint mode because it's not
registered in Root Port mode (Manikanta Maddireddy)
- Handle 'nvidia,refclk-select' as optional (Vidya Sagar)
- Disable direct speed change in Endpoint mode so link speed change
is controlled by the host (Vidya Sagar)
- Set LTR values before link up to avoid bogus LTR messages with 0
latency (Vidya Sagar)
- Allow system suspend when the Endpoint link is down (Vidya Sagar)
- Use DWC IP core version, not Tegra custom values, to avoid DWC core
version check warnings (Manikanta Maddireddy)
- Apply ECRC workaround to devices based on DesignWare 5.00a as well
as 4.90a (Manikanta Maddireddy)
- Disable PM Substate L1.2 in Endpoint mode to work around Tegra234
erratum (Vidya Sagar)
- Delay post-PERST# cleanup until core is powered on to avoid CBB
timeout (Manikanta Maddireddy)
- Assert CLKREQ# so switches that forward it to their downstream side
can bring up those links successfully (Vidya Sagar)
- Calibrate pipe to UPHY for Endpoint mode to reset stale PLL state
from any previous bad link state (Vidya Sagar)
- Remove IRQF_ONESHOT flag from Endpoint interrupt registration so
DMA driver and Endpoint controller driver can share the interrupt
line (Vidya Sagar)
- Enable DMA interrupt to support DMA in both Root Port and Endpoint
modes (Vidya Sagar)
- Enable hardware link retraining after link goes down in Endpoint
mode (Vidya Sagar)
- Add DT binding and driver support for core clock monitoring (Vidya
Sagar)
Qualcomm PCIe controller driver:
- Advertise 'Hot-Plug Capable' and set 'No Command Completed Support'
since Qcom Root Ports support hotplug events like DL_Up/Down and
can accept writes to Slot Control without delays between writes
(Krishna Chaitanya Chundru)
Renesas R-Car PCIe controller driver:
- Mark Endpoint BAR0 and BAR2 as Resizable (Koichiro Den)
- Reduce EPC BAR alignment requirement to 4K (Koichiro Den)
Renesas RZ/G3S PCIe controller driver:
- Add RZ/G3E to DT binding and to driver (John Madieu)
- Assert (not deassert) resets in probe error path (John Madieu)
- Assert resets in suspend path in reverse order they were deasserted
during probe (John Madieu)
- Rework inbound window algorithm to prevent mapping more than
intended region and enforce alignment on size, to prepare for
RZ/G3E support (John Madieu)
Rockchip DesignWare PCIe controller driver:
- Add tracepoints for PCIe controller LTSSM transitions and link rate
changes (Shawn Lin)
- Trace LTSSM events collected by the dw-rockchip debug FIFO (Shawn
Lin)
SOPHGO PCIe controller driver:
- Disable ASPM L0s and L1 on Sophgo 2042 PCIe Root Ports that
advertise support for them (Yao Zi)
Synopsys DesignWare PCIe controller driver:
- Continue with system suspend even if an Endpoint doesn't respond
with PME_TO_Ack message (Manivannan Sadhasivam)
- Set Endpoint MSI-X Table Size in the correct function of a
multi-function device when configuring MSI-X, not in Function 0
(Aksh Garg)
- Set Max Link Width and Max Link Speed for all functions of a
multi-function device, not just Function 0 (Aksh Garg)
- Expose PCIe event counters in groups 5-7 in debugfs (Hans Zhang)
Miscellaneous:
- Warn only once about invalid ACS kernel parameter format (Richard
Cheng)
- Suppress FW_BUG warning when writing sysfs 'numa_node' with the
current value (Li RongQing)
- Drop redundant 'depends on PCI' from Kconfig (Julian Braha)"
* tag 'pci-v7.1-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/pci/pci: (165 commits)
PCI/P2PDMA: Add Google SoCs to the P2P DMA host bridge list
PCI/P2PDMA: Allow wildcard Device IDs in host bridge list
PCI: sg2042: Avoid L0s and L1 on Sophgo 2042 PCIe Root Ports
PCI: cadence: Add flags for disabling ASPM capability for broken Root Ports
PCI: tegra194: Add core monitor clock support
dt-bindings: PCI: tegra194: Add monitor clock support
PCI: tegra194: Enable hardware hot reset mode in Endpoint mode
PCI: tegra194: Enable DMA interrupt
PCI: tegra194: Remove IRQF_ONESHOT flag during Endpoint interrupt registration
PCI: tegra194: Calibrate pipe to UPHY for Endpoint mode
PCI: tegra194: Assert CLKREQ# explicitly by default
PCI: tegra194: Fix CBB timeout caused by DBI access before core power-on
PCI: tegra194: Disable L1.2 capability of Tegra234 EP
PCI: dwc: Apply ECRC workaround to DesignWare 5.00a as well
PCI: tegra194: Use DWC IP core version
PCI: tegra194: Free up Endpoint resources during remove()
PCI: tegra194: Allow system suspend when the Endpoint link is not up
PCI: tegra194: Set LTR message request before PCIe link up in Endpoint mode
PCI: tegra194: Disable direct speed change for Endpoint mode
PCI: tegra194: Use devm_gpiod_get_optional() to parse "nvidia,refclk-select"
...
|
||
|
|
334fbe734e |
mm.git review status for linus..mm-stable
Everything: Total patches: 368 Reviews/patch: 1.56 Reviewed rate: 74% Excluding DAMON: Total patches: 316 Reviews/patch: 1.77 Reviewed rate: 81% Excluding DAMON and zram: Total patches: 306 Reviews/patch: 1.81 Reviewed rate: 82% Excluding DAMON, zram and maple_tree: Total patches: 276 Reviews/patch: 2.01 Reviewed rate: 91% Significant patch series in this merge: - The 30 patch series "maple_tree: Replace big node with maple copy" from Liam Howlett is mainly prepararatory work for ongoing development but it does reduce stack usage and is an improvement. - The 12 patch series "mm, swap: swap table phase III: remove swap_map" from Kairui Song offers memory savings by removing the static swap_map. It also yields some CPU savings and implements several cleanups. - The 2 patch series "mm: memfd_luo: preserve file seals" from Pratyush Yadav adds file seal preservation to LUO's memfd code. - The 2 patch series "mm: zswap: add per-memcg stat for incompressible pages" from Jiayuan Chen adds additional userspace stats reportng to zswap. - The 4 patch series "arch, mm: consolidate empty_zero_page" from Mike Rapoport implements some cleanups for our handling of ZERO_PAGE() and zero_pfn. - The 2 patch series "mm/kmemleak: Improve scan_should_stop() implementation" from Zhongqiu Han provides an robustness improvement and some cleanups in the kmemleak code. - The 4 patch series "Improve khugepaged scan logic" from Vernon Yang "improves the khugepaged scan logic and reduces CPU consumption by prioritizing scanning tasks that access memory frequently". - The 2 patch series "Make KHO Stateless" from Jason Miu simplifies Kexec Handover by "transitioning KHO from an xarray-based metadata tracking system with serialization to a radix tree data structure that can be passed directly to the next kernel" - The 3 patch series "mm: vmscan: add PID and cgroup ID to vmscan tracepoints" from Thomas Ballasi and Steven Rostedt enhances vmscan's tracepointing. - The 5 patch series "mm: arch/shstk: Common shadow stack mapping helper and VM_NOHUGEPAGE" from Catalin Marinas is a cleanup for the shadow stack code: remove per-arch code in favour of a generic implementation. - The 2 patch series "Fix KASAN support for KHO restored vmalloc regions" from Pasha Tatashin fixes a WARN() which can be emitted the KHO restores a vmalloc area. - The 4 patch series "mm: Remove stray references to pagevec" from Tal Zussman provides several cleanups, mainly udpating references to "struct pagevec", which became folio_batch three years ago. - The 17 patch series "mm: Eliminate fake head pages from vmemmap optimization" from Kiryl Shutsemau simplifies the HugeTLB vmemmap optimization (HVO) by changing how tail pages encode their relationship to the head page. - The 2 patch series "mm/damon/core: improve DAMOS quota efficiency for core layer filters" from SeongJae Park improves two problematic behaviors of DAMOS that makes it less efficient when core layer filters are used. - The 3 patch series "mm/damon: strictly respect min_nr_regions" from SeongJae Park improves DAMON usability by extending the treatment of the min_nr_regions user-settable parameter. - The 3 patch series "mm/page_alloc: pcp locking cleanup" from Vlastimil Babka is a proper fix for a previously hotfixed SMP=n issue. Code simplifications and cleanups ennsed. - The 16 patch series "mm: cleanups around unmapping / zapping" from David Hildenbrand implements "a bunch of cleanups around unmapping and zapping. Mostly simplifications, code movements, documentation and renaming of zapping functions". - The 6 patch series "support batched checking of the young flag for MGLRU" from Baolin Wang supports batched checking of the young flag for MGLRU. It's part cleanups; one benchmark shows large performance benefits for arm64. - The 5 patch series "memcg: obj stock and slab stat caching cleanups" from Johannes Weiner provides memcg cleanup and robustness improvements. - The 5 patch series "Allow order zero pages in page reporting" from Yuvraj Sakshith enhances page_reporting's free page reporting - it is presently and undesirably order-0 pages when reporting free memory. - The 6 patch series "mm: vma flag tweaks" from Lorenzo Stoakes is cleanup work following from the recent conversion of the VMA flags to a bitmap. - The 10 patch series "mm/damon: add optional debugging-purpose sanity checks" from SeongJae Park adds some more developer-facing debug checks into DAMON core. - The 2 patch series "mm/damon: test and document power-of-2 min_region_sz requirement" from SeongJae Park adds an additional DAMON kunit test and makes some adjustments to the addr_unit parameter handling. - The 3 patch series "mm/damon/core: make passed_sample_intervals comparisons overflow-safe" from SeongJae Park fixes a hard-to-hit time overflow issue in DAMON core. - The 7 patch series "mm/damon: improve/fixup/update ratio calculation, test and documentation" from SeongJae Park is a "batch of misc/minor improvements and fixups" for DAMON. - The 4 patch series "mm: move vma_(kernel|mmu)_pagesize() out of hugetlb.c" from David Hildenbrand fixes a possible issue with dax-device when CONFIG_HUGETLB=n. Some code movement was required. - The 6 patch series "zram: recompression cleanups and tweaks" from Sergey Senozhatsky provides "a somewhat random mix of fixups, recompression cleanups and improvements" in the zram code. - The 11 patch series "mm/damon: support multiple goal-based quota tuning algorithms" from SeongJae Park extend DAMOS quotas goal auto-tuning to support multiple tuning algorithms that users can select. - The 4 patch series "mm: thp: reduce unnecessary start_stop_khugepaged()" from Breno Leitao fixes the khugpaged sysfs handling so we no longer spam the logs with reams of junk when starting/stopping khugepaged. - The 3 patch series "mm: improve map count checks" from Lorenzo Stoakes provides some cleanups and slight fixes in the mremap, mmap and vma code. - The 5 patch series "mm/damon: support addr_unit on default monitoring targets for modules" from SeongJae Park extends the use of DAMON core's addr_unit tunable. - The 5 patch series "mm: khugepaged cleanups and mTHP prerequisites" from Nico Pache provides cleanups in the khugepaged and is a base for Nico's planned khugepaged mTHP support. - The 15 patch series "mm: memory hot(un)plug and SPARSEMEM cleanups" from David Hildenbrand implements code movement and cleanups in the memhotplug and sparsemem code. - The 2 patch series "mm: remove CONFIG_ARCH_ENABLE_MEMORY_HOTREMOVE and cleanup CONFIG_MIGRATION" from David Hildenbrand rationalizes some memhotplug Kconfig support. - The 6 patch series "change young flag check functions to return bool" from Baolin Wang is "a cleanup patchset to change all young flag check functions to return bool". - The 3 patch series "mm/damon/sysfs: fix memory leak and NULL dereference issues" from Josh Law and SeongJae Park fixes a few potential DAMON bugs. - The 25 patch series "mm/vma: convert vm_flags_t to vma_flags_t in vma code" from "converts a lot of the existing use of the legacy vm_flags_t data type to the new vma_flags_t type which replaces it". Mainly in the vma code. - The 21 patch series "mm: expand mmap_prepare functionality and usage" from Lorenzo Stoakes "expands the mmap_prepare functionality, which is intended to replace the deprecated f_op->mmap hook which has been the source of bugs and security issues for some time". Cleanups, documentation, extension of mmap_prepare into filesystem drivers. - The 13 patch series "mm/huge_memory: refactor zap_huge_pmd()" from Lorenzo Stoakes simplifies and cleans up zap_huge_pmd(). Additional cleanups around vm_normal_folio_pmd() and the softleaf functionality are performed. -----BEGIN PGP SIGNATURE----- iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCad3HDQAKCRDdBJ7gKXxA jrUQAPwNhPk5nPSxnyxjAeQtOBHqgCdnICeEismLajPKd9aYRgEA0s2XAu3tSUYi GrBnWImHG3s4ePQxVcPCegWTsOUrXgQ= =1Q7o -----END PGP SIGNATURE----- Merge tag 'mm-stable-2026-04-13-21-45' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: - "maple_tree: Replace big node with maple copy" (Liam Howlett) Mainly prepararatory work for ongoing development but it does reduce stack usage and is an improvement. - "mm, swap: swap table phase III: remove swap_map" (Kairui Song) Offers memory savings by removing the static swap_map. It also yields some CPU savings and implements several cleanups. - "mm: memfd_luo: preserve file seals" (Pratyush Yadav) File seal preservation to LUO's memfd code - "mm: zswap: add per-memcg stat for incompressible pages" (Jiayuan Chen) Additional userspace stats reportng to zswap - "arch, mm: consolidate empty_zero_page" (Mike Rapoport) Some cleanups for our handling of ZERO_PAGE() and zero_pfn - "mm/kmemleak: Improve scan_should_stop() implementation" (Zhongqiu Han) A robustness improvement and some cleanups in the kmemleak code - "Improve khugepaged scan logic" (Vernon Yang) Improve khugepaged scan logic and reduce CPU consumption by prioritizing scanning tasks that access memory frequently - "Make KHO Stateless" (Jason Miu) Simplify Kexec Handover by transitioning KHO from an xarray-based metadata tracking system with serialization to a radix tree data structure that can be passed directly to the next kernel - "mm: vmscan: add PID and cgroup ID to vmscan tracepoints" (Thomas Ballasi and Steven Rostedt) Enhance vmscan's tracepointing - "mm: arch/shstk: Common shadow stack mapping helper and VM_NOHUGEPAGE" (Catalin Marinas) Cleanup for the shadow stack code: remove per-arch code in favour of a generic implementation - "Fix KASAN support for KHO restored vmalloc regions" (Pasha Tatashin) Fix a WARN() which can be emitted the KHO restores a vmalloc area - "mm: Remove stray references to pagevec" (Tal Zussman) Several cleanups, mainly udpating references to "struct pagevec", which became folio_batch three years ago - "mm: Eliminate fake head pages from vmemmap optimization" (Kiryl Shutsemau) Simplify the HugeTLB vmemmap optimization (HVO) by changing how tail pages encode their relationship to the head page - "mm/damon/core: improve DAMOS quota efficiency for core layer filters" (SeongJae Park) Improve two problematic behaviors of DAMOS that makes it less efficient when core layer filters are used - "mm/damon: strictly respect min_nr_regions" (SeongJae Park) Improve DAMON usability by extending the treatment of the min_nr_regions user-settable parameter - "mm/page_alloc: pcp locking cleanup" (Vlastimil Babka) The proper fix for a previously hotfixed SMP=n issue. Code simplifications and cleanups ensued - "mm: cleanups around unmapping / zapping" (David Hildenbrand) A bunch of cleanups around unmapping and zapping. Mostly simplifications, code movements, documentation and renaming of zapping functions - "support batched checking of the young flag for MGLRU" (Baolin Wang) Batched checking of the young flag for MGLRU. It's part cleanups; one benchmark shows large performance benefits for arm64 - "memcg: obj stock and slab stat caching cleanups" (Johannes Weiner) memcg cleanup and robustness improvements - "Allow order zero pages in page reporting" (Yuvraj Sakshith) Enhance free page reporting - it is presently and undesirably order-0 pages when reporting free memory. - "mm: vma flag tweaks" (Lorenzo Stoakes) Cleanup work following from the recent conversion of the VMA flags to a bitmap - "mm/damon: add optional debugging-purpose sanity checks" (SeongJae Park) Add some more developer-facing debug checks into DAMON core - "mm/damon: test and document power-of-2 min_region_sz requirement" (SeongJae Park) An additional DAMON kunit test and makes some adjustments to the addr_unit parameter handling - "mm/damon/core: make passed_sample_intervals comparisons overflow-safe" (SeongJae Park) Fix a hard-to-hit time overflow issue in DAMON core - "mm/damon: improve/fixup/update ratio calculation, test and documentation" (SeongJae Park) A batch of misc/minor improvements and fixups for DAMON - "mm: move vma_(kernel|mmu)_pagesize() out of hugetlb.c" (David Hildenbrand) Fix a possible issue with dax-device when CONFIG_HUGETLB=n. Some code movement was required. - "zram: recompression cleanups and tweaks" (Sergey Senozhatsky) A somewhat random mix of fixups, recompression cleanups and improvements in the zram code - "mm/damon: support multiple goal-based quota tuning algorithms" (SeongJae Park) Extend DAMOS quotas goal auto-tuning to support multiple tuning algorithms that users can select - "mm: thp: reduce unnecessary start_stop_khugepaged()" (Breno Leitao) Fix the khugpaged sysfs handling so we no longer spam the logs with reams of junk when starting/stopping khugepaged - "mm: improve map count checks" (Lorenzo Stoakes) Provide some cleanups and slight fixes in the mremap, mmap and vma code - "mm/damon: support addr_unit on default monitoring targets for modules" (SeongJae Park) Extend the use of DAMON core's addr_unit tunable - "mm: khugepaged cleanups and mTHP prerequisites" (Nico Pache) Cleanups to khugepaged and is a base for Nico's planned khugepaged mTHP support - "mm: memory hot(un)plug and SPARSEMEM cleanups" (David Hildenbrand) Code movement and cleanups in the memhotplug and sparsemem code - "mm: remove CONFIG_ARCH_ENABLE_MEMORY_HOTREMOVE and cleanup CONFIG_MIGRATION" (David Hildenbrand) Rationalize some memhotplug Kconfig support - "change young flag check functions to return bool" (Baolin Wang) Cleanups to change all young flag check functions to return bool - "mm/damon/sysfs: fix memory leak and NULL dereference issues" (Josh Law and SeongJae Park) Fix a few potential DAMON bugs - "mm/vma: convert vm_flags_t to vma_flags_t in vma code" (Lorenzo Stoakes) Convert a lot of the existing use of the legacy vm_flags_t data type to the new vma_flags_t type which replaces it. Mainly in the vma code. - "mm: expand mmap_prepare functionality and usage" (Lorenzo Stoakes) Expand the mmap_prepare functionality, which is intended to replace the deprecated f_op->mmap hook which has been the source of bugs and security issues for some time. Cleanups, documentation, extension of mmap_prepare into filesystem drivers - "mm/huge_memory: refactor zap_huge_pmd()" (Lorenzo Stoakes) Simplify and clean up zap_huge_pmd(). Additional cleanups around vm_normal_folio_pmd() and the softleaf functionality are performed. * tag 'mm-stable-2026-04-13-21-45' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (369 commits) mm: fix deferred split queue races during migration mm/khugepaged: fix issue with tracking lock mm/huge_memory: add and use has_deposited_pgtable() mm/huge_memory: add and use normal_or_softleaf_folio_pmd() mm: add softleaf_is_valid_pmd_entry(), pmd_to_softleaf_folio() mm/huge_memory: separate out the folio part of zap_huge_pmd() mm/huge_memory: use mm instead of tlb->mm mm/huge_memory: remove unnecessary sanity checks mm/huge_memory: deduplicate zap deposited table call mm/huge_memory: remove unnecessary VM_BUG_ON_PAGE() mm/huge_memory: add a common exit path to zap_huge_pmd() mm/huge_memory: handle buggy PMD entry in zap_huge_pmd() mm/huge_memory: have zap_huge_pmd return a boolean, add kdoc mm/huge: avoid big else branch in zap_huge_pmd() mm/huge_memory: simplify vma_is_specal_huge() mm: on remap assert that input range within the proposed VMA mm: add mmap_action_map_kernel_pages[_full]() uio: replace deprecated mmap hook with mmap_prepare in uio_info drivers: hv: vmbus: replace deprecated mmap hook with mmap_prepare mm: allow handling of stacked mmap_prepare hooks in more drivers ... |
||
|
|
4fddde2a73 |
bpf: Fix use-after-free in arena_vm_close on fork
arena_vm_open() only bumps vml->mmap_count but never registers the
child VMA in arena->vma_list. The vml->vma always points at the
parent VMA, so after parent munmap the pointer dangles. If the child
then calls bpf_arena_free_pages(), zap_pages() reads the stale
vml->vma triggering use-after-free.
Fix this by preventing the arena VMA from being inherited across
fork with VM_DONTCOPY, and preventing VMA splits via the may_split
callback.
Also reject mremap with a .mremap callback returning -EINVAL. A
same-size mremap(MREMAP_FIXED) on the full arena VMA reaches
copy_vma() through the following path:
check_prep_vma() - returns 0 early: new_len == old_len
skips VM_DONTEXPAND check
prep_move_vma() - vm_start == old_addr and
vm_end == old_addr + old_len
so may_split is never called
move_vma()
copy_vma_and_data()
copy_vma()
vm_area_dup() - copies vm_private_data (vml pointer)
vm_ops->open() - bumps vml->mmap_count
vm_ops->mremap() - returns -EINVAL, rollback unmaps new VMA
The refcount ensures the rollback's arena_vm_close does not free
the vml shared with the original VMA.
Reported-by: Weiming Shi <bestswngs@gmail.com>
Reported-by: Xiang Mei <xmei5@asu.edu>
Fixes:
|
||
|
|
5bdb4078e1 |
sched_ext: Changes for v7.1
- Cgroup sub-scheduler groundwork. Multiple BPF schedulers can be attached to cgroups and the dispatch path is made hierarchical. This involves substantial restructuring of the core dispatch, bypass, watchdog, and dump paths to be per-scheduler, along with new infrastructure for scheduler ownership enforcement, lifecycle management, and cgroup subtree iteration. The enqueue path is not yet updated and will follow in a later cycle. - scx_bpf_dsq_reenq() generalized to support any DSQ including remote local DSQs and user DSQs. Built on top of this, SCX_ENQ_IMMED guarantees that tasks dispatched to local DSQs either run immediately or get reenqueued back through ops.enqueue(), giving schedulers tighter control over queueing latency. Also useful for opportunistic CPU sharing across sub-schedulers. - ops.dequeue() was only invoked when the core knew a task was in BPF data structures, missing scheduling property change events and skipping callbacks for non-local DSQ dispatches from ops.select_cpu(). Fixed to guarantee exactly one ops.dequeue() call when a task leaves BPF scheduler custody. - Kfunc access validation moved from runtime to BPF verifier time, removing runtime mask enforcement. - Idle SMT sibling prioritization in the idle CPU selection path. - Documentation, selftest, and tooling updates. Misc bug fixes and cleanups. - Merges from tip/sched-core, cgroup/for-7.1, and for-7.0-fixes to resolve dependencies and conflicts for the above changes. -----BEGIN PGP SIGNATURE----- iIQEABYKACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCad0uaA4cdGpAa2VybmVs Lm9yZwAKCRCxYfJx3gVYGbktAQD2HrKdydyEfefz/n4mNpIXh/DFYX49NgKYcgUh sKy4ngD/Sy7nAZS2zwM+36PN6jBV7+cfuoaiKPgCstPFeGsvPwU= =fsgj -----END PGP SIGNATURE----- Merge tag 'sched_ext-for-7.1' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext Pull sched_ext updates from Tejun Heo: - cgroup sub-scheduler groundwork Multiple BPF schedulers can be attached to cgroups and the dispatch path is made hierarchical. This involves substantial restructuring of the core dispatch, bypass, watchdog, and dump paths to be per-scheduler, along with new infrastructure for scheduler ownership enforcement, lifecycle management, and cgroup subtree iteration The enqueue path is not yet updated and will follow in a later cycle - scx_bpf_dsq_reenq() generalized to support any DSQ including remote local DSQs and user DSQs Built on top of this, SCX_ENQ_IMMED guarantees that tasks dispatched to local DSQs either run immediately or get reenqueued back through ops.enqueue(), giving schedulers tighter control over queueing latency Also useful for opportunistic CPU sharing across sub-schedulers - ops.dequeue() was only invoked when the core knew a task was in BPF data structures, missing scheduling property change events and skipping callbacks for non-local DSQ dispatches from ops.select_cpu() Fixed to guarantee exactly one ops.dequeue() call when a task leaves BPF scheduler custody - Kfunc access validation moved from runtime to BPF verifier time, removing runtime mask enforcement - Idle SMT sibling prioritization in the idle CPU selection path - Documentation, selftest, and tooling updates. Misc bug fixes and cleanups * tag 'sched_ext-for-7.1' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext: (134 commits) tools/sched_ext: Add explicit cast from void* in RESIZE_ARRAY() sched_ext: Make string params of __ENUM_set() const tools/sched_ext: Kick home CPU for stranded tasks in scx_qmap sched_ext: Drop spurious warning on kick during scheduler disable sched_ext: Warn on task-based SCX op recursion sched_ext: Rename scx_kf_allowed_on_arg_tasks() to scx_kf_arg_task_ok() sched_ext: Remove runtime kfunc mask enforcement sched_ext: Add verifier-time kfunc context filter sched_ext: Drop redundant rq-locked check from scx_bpf_task_cgroup() sched_ext: Decouple kfunc unlocked-context check from kf_mask sched_ext: Fix ops.cgroup_move() invocation kf_mask and rq tracking sched_ext: Track @p's rq lock across set_cpus_allowed_scx -> ops.set_cpumask sched_ext: Add select_cpu kfuncs to scx_kfunc_ids_unlocked sched_ext: Drop TRACING access to select_cpu kfuncs selftests/sched_ext: Fix wrong DSQ ID in peek_dsq error message sched_ext: Documentation: improve accuracy of task lifecycle pseudo-code selftests/sched_ext: Improve runner error reporting for invalid arguments sched_ext: Documentation: Fix scx_bpf_move_to_local kfunc name sched_ext: Documentation: Add ops.dequeue() to task lifecycle tools/sched_ext: Fix off-by-one in scx_sdt payload zeroing ... |
||
|
|
7de6b4a246 |
workqueue: Changes for v7.1
- New default WQ_AFFN_CACHE_SHARD affinity scope subdivides LLCs into
smaller shards to improve scalability on machines with many CPUs per
LLC.
- Misc: system_dfl_long_wq for long unbound works, devm_alloc_workqueue()
for device-managed allocation, sysfs exposure for ordered workqueues and
the EFI workqueue, removal of HK_TYPE_WQ from wq_unbound_cpumask, and
various small fixes.
-----BEGIN PGP SIGNATURE-----
iIQEABYKACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCad0npw4cdGpAa2VybmVs
Lm9yZwAKCRCxYfJx3gVYGdZmAQD4BbIhTGKGcq89jwQRQpmUIZK6yIIWwd0cSvLC
Biko2AD9FP2M9bqUzo2cZ83AfSC4LTK020e9VmsZStkw+u0s3ws=
=cSEW
-----END PGP SIGNATURE-----
Merge tag 'wq-for-7.1' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq
Pull workqueue updates from Tejun Heo:
- New default WQ_AFFN_CACHE_SHARD affinity scope subdivides LLCs into
smaller shards to improve scalability on machines with many CPUs per
LLC
- Misc:
- system_dfl_long_wq for long unbound works
- devm_alloc_workqueue() for device-managed allocation
- sysfs exposure for ordered workqueues and the EFI workqueue
- removal of HK_TYPE_WQ from wq_unbound_cpumask
- various small fixes
* tag 'wq-for-7.1' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: (21 commits)
workqueue: validate cpumask_first() result in llc_populate_cpu_shard_id()
workqueue: use NR_STD_WORKER_POOLS instead of hardcoded value
workqueue: avoid unguarded 64-bit division
docs: workqueue: document WQ_AFFN_CACHE_SHARD affinity scope
workqueue: add test_workqueue benchmark module
tools/workqueue: add CACHE_SHARD support to wq_dump.py
workqueue: set WQ_AFFN_CACHE_SHARD as the default affinity scope
workqueue: add WQ_AFFN_CACHE_SHARD affinity scope
workqueue: fix typo in WQ_AFFN_SMT comment
workqueue: Remove HK_TYPE_WQ from affecting wq_unbound_cpumask
workqueue: unlink pwqs from wq->pwqs list in alloc_and_link_pwqs() error path
workqueue: Remove NULL wq WARN in __queue_delayed_work()
workqueue: fix parse_affn_scope() prefix matching bug
workqueue: devres: Add device-managed allocate workqueue
workqueue: Add system_dfl_long_wq for long unbound works
tools/workqueue/wq_dump.py: add NODE prefix to all node columns
tools/workqueue/wq_dump.py: fix column alignment in node_nr/max_active section
tools/workqueue/wq_dump.py: remove backslash separator from node_nr/max_active header
efi: Allow to expose the workqueue via sysfs
workqueue: Allow to expose ordered workqueues via sysfs
...
|
||
|
|
b71f0be2d2 |
cgroup: Changes for v7.1
- cgroup_file_notify() locking converted from a global lock to per-cgroup_file spinlock with a lockless fast-path when no notification is needed. - Misc changes including exposing cgroup helpers for sched_ext and minor fixes. -----BEGIN PGP SIGNATURE----- iIQEABYKACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCad0heg4cdGpAa2VybmVs Lm9yZwAKCRCxYfJx3gVYGTFVAP0USl50aZ1SA7Gq84Qp/5v2EN5oH4lVqTlEbPti AMOV5wD+JpYS0BnLhj+Q2jElu3Jyb4drf3h5xYHhf5NS2O60EAE= =j2ad -----END PGP SIGNATURE----- Merge tag 'cgroup-for-7.1' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup updates from Tejun Heo: - cgroup_file_notify() locking converted from a global lock to per-cgroup_file spinlock with a lockless fast-path when no notification is needed - Misc changes including exposing cgroup helpers for sched_ext and minor fixes * tag 'cgroup-for-7.1' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: cgroup/rdma: fix swapped arguments in pr_warn() format string cgroup/dmem: remove region parameter from dmemcg_parse_limit cgroup: replace global cgroup_file_kn_lock with per-cgroup_file lock cgroup: add lockless fast-path checks to cgroup_file_notify() cgroup: reduce cgroup_file_kn_lock hold time in cgroup_file_notify() cgroup: Expose some cgroup helpers |
||
|
|
ecdd4fd8a5 |
bpf: fix arg tracking for imprecise/multi-offset BPF_ST/STX
BPF_STX through ARG_IMPRECISE dst should be recognized as a local
spill and join at_stack with the written value. For example,
consider the following situation:
// r1 = ARG_IMPRECISE{mask=BIT(0)|BIT(1)}
*(u64 *)(r1 + 0) = r8
Here the analysis should produce an equivalent of
at_stack[*] = join(old, r8)
BPF_ST through multi-offset or imprecise dst should join at_stack with
none instead of overwriting the slots. For example, consider the
following situation:
// r1 = ARG_IMPRECISE{mask=BIT(0)|BIT(1)}
*(u64 *)(r1 + 0) = 0
Here the analysis should produce an equivalent of
at_stack[*r1] = join(old, none).
Move the definition of the clear_overlapping_stack_slots() in order to
have __arg_track_join() visible. Remove the OFF_IMPRECISE constant to
avoid having two ways to express imprecise offset.
Only 'offset-imprecise {frame=N, cnt=0}' remains.
Fixes:
|
||
|
|
16c4f0211a |
taskstats: set version in TGID exit notifications
delay accounting started populating taskstats records with a valid version field via fill_pid() and fill_tgid(). Later, commit |
||
|
|
5c0f43e853 |
kernel-7.1-rc1.misc
Please consider pulling these changes from the signed kernel-7.1-rc1.misc tag.
Thanks!
Christian
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCad41SAAKCRCRxhvAZXjc
ou3SAQD3QUQObaY7NvJIxwm72jtO2lY6jF03Pyimt4J9yicXZQEAxkHvPpSwBoLx
n5lnzBJy9t8JQxMEkw+IL8vmGsSMZw0=
=RGh3
-----END PGP SIGNATURE-----
Merge tag 'kernel-7.1-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull pid_namespace updates from Christian Brauner:
- pid_namespace: make init creation more flexible
Annotate ->child_reaper accesses with {READ,WRITE}_ONCE() to protect
the unlocked readers from cpu/compiler reordering, and enforce that
pid 1 in a pid namespace is always the first allocated pid (the
set_tid path already required this).
On top of that, allow opening pid_for_children before the pid
namespace init has been created. This lets one process create the pid
namespace and a different process create the init via setns(), which
makes clone3(set_tid) usable in all cases evenly and is particularly
useful to CRIU when restoring nested containers.
A new selftest covers both the basic create-pidns-then-init flow and
the cross-process variant, and a MAINTAINERS entry for the pid
namespace code is added.
- unrelated signal cleanup: update outdated comment for the removed
freezable_schedule()
* tag 'kernel-7.1-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
signal: update outdated comment for removed freezable_schedule()
MAINTAINERS: add a pid namespace entry
selftests: Add tests for creating pidns init via setns
pid_namespace: allow opening pid_for_children before init was created
pid: check init is created first after idr alloc
pid_namespace: avoid optimization of accesses to ->child_reaper
|
||
|
|
7c8a4671dc |
vfs-7.1-rc1.mount.v2
Please consider pulling these changes from the signed vfs-7.1-rc1.mount.v2 tag.
Thanks!
Christian
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCad3vFgAKCRCRxhvAZXjc
onXwAQDwEGvpMUUiuI/JWFqCA5vY5LXXr/36wdcs0iUL1uy9IgEAyOdnYhYkcaX1
3lm87f6OmYkhlq6enJbco7uT4CUzlQA=
=1Ls8
-----END PGP SIGNATURE-----
Merge tag 'vfs-7.1-rc1.mount.v2' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs mount updates from Christian Brauner:
- Add FSMOUNT_NAMESPACE flag to fsmount() that creates a new mount
namespace with the newly created filesystem attached to a copy of the
real rootfs. This returns a namespace file descriptor instead of an
O_PATH mount fd, similar to how OPEN_TREE_NAMESPACE works for
open_tree().
This allows creating a new filesystem and immediately placing it in a
new mount namespace in a single operation, which is useful for
container runtimes and other namespace-based isolation mechanisms.
This accompanies OPEN_TREE_NAMESPACE and avoids a needless detour via
OPEN_TREE_NAMESPACE to get the same effect. Will be especially useful
when you mount an actual filesystem to be used as the container
rootfs.
- Currently, creating a new mount namespace always copies the entire
mount tree from the caller's namespace. For containers and sandboxes
that intend to build their mount table from scratch this is wasteful:
they inherit a potentially large mount tree only to immediately tear
it down.
This series adds support for creating a mount namespace that contains
only a clone of the root mount, with none of the child mounts. Two
new flags are introduced:
- CLONE_EMPTY_MNTNS (0x400000000) for clone3(), using the 64-bit flag space
- UNSHARE_EMPTY_MNTNS (0x00100000) for unshare()
Both flags imply CLONE_NEWNS. The resulting namespace contains a
single nullfs root mount with an immutable empty directory. The
intended workflow is to then mount a real filesystem (e.g., tmpfs)
over the root and build the mount table from there.
- Allow MOVE_MOUNT_BENEATH to target the caller's rootfs, allowing to
switch out the rootfs without pivot_root(2).
The traditional approach to switching the rootfs involves
pivot_root(2) or a chroot_fs_refs()-based mechanism that atomically
updates fs->root for all tasks sharing the same fs_struct. This has
consequences for fork(), unshare(CLONE_FS), and setns().
This series instead decomposes root-switching into individually
atomic, locally-scoped steps:
fd_tree = open_tree(-EBADF, "/newroot", OPEN_TREE_CLONE | OPEN_TREE_CLOEXEC);
fchdir(fd_tree);
move_mount(fd_tree, "", AT_FDCWD, "/", MOVE_MOUNT_BENEATH | MOVE_MOUNT_F_EMPTY_PATH);
chroot(".");
umount2(".", MNT_DETACH);
Since each step only modifies the caller's own state, the
fork/unshare/setns races are eliminated by design.
A key step to making this possible is to remove the locked mount
restriction. Originally MOVE_MOUNT_BENEATH doesn't support mounting
beneath a mount that is locked. The locked mount protects the
underlying mount from being revealed. This is a core mechanism of
unshare(CLONE_NEWUSER | CLONE_NEWNS). The mounts in the new mount
namespace become locked. That effectively makes the new mount table
useless as the caller cannot ever get rid of any of the mounts no
matter how useless they are.
We can lift this restriction though. We simply transfer the locked
property from the top mount to the mount beneath. This works because
what we care about is to protect the underlying mount aka the parent.
The mount mounted between the parent and the top mount takes over the
job of protecting the parent mount from the top mount mount. This
leaves us free to remove the locked property from the top mount which
can consequently be unmounted:
unshare(CLONE_NEWUSER | CLONE_NEWNS)
and we inherit a clone of procfs on /proc then currently we cannot
unmount it as:
umount -l /proc
will fail with EINVAL because the procfs mount is locked.
After this series we can now do:
mount --beneath -t tmpfs tmpfs /proc
umount -l /proc
after which a tmpfs mount has been placed beneath the procfs mount.
The tmpfs mount has become locked and the procfs mount has become
unlocked.
This means you can safely modify an inherited mount table after
unprivileged namespace creation.
Afterwards we simply make it possible to move a mount beneath the
rootfs allowing to upgrade the rootfs.
Removing the locked restriction makes this very useful for containers
created with unshare(CLONE_NEWUSER | CLONE_NEWNS) to reshuffle an
inherited mount table safely and MOVE_MOUNT_BENEATH makes it possible
to switch out the rootfs instead of using the costly pivot_root(2).
* tag 'vfs-7.1-rc1.mount.v2' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
selftests/namespaces: remove unused utils.h include from listns_efault_test
selftests/fsmount_ns: add missing TARGETS and fix cap test
selftests/empty_mntns: fix wrong CLONE_EMPTY_MNTNS hex value in comment
selftests/empty_mntns: fix statmount_alloc() signature mismatch
selftests/statmount: remove duplicate wait_for_pid()
mount: always duplicate mount
selftests/filesystems: add MOVE_MOUNT_BENEATH rootfs tests
move_mount: allow MOVE_MOUNT_BENEATH on the rootfs
move_mount: transfer MNT_LOCKED
selftests/filesystems: add clone3 tests for empty mount namespaces
selftests/filesystems: add tests for empty mount namespaces
namespace: allow creating empty mount namespaces
selftests: add FSMOUNT_NAMESPACE tests
selftests/statmount: add statmount_alloc() helper
tools: update mount.h header
mount: add FSMOUNT_NAMESPACE
mount: simplify __do_loopback()
mount: start iterating from start of rbtree
|
||
|
|
f5ad410100 |
bpf-next-7.1
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE+soXsSLHKoYyzcli6rmadz2vbToFAmndDWsACgkQ6rmadz2v
bTr/jw//WQ+IowvstytntSbZFhSSKjwUP1J0oz/wAyKxvly+sBQADBQkljqNaEju
Kq48CPWftJXG45x3O5P4GSYOuBnd9nwDS/hM6jA9f3Ok4IEOHAHCxLot0uq52iJa
ieGeJTUEGKFUUEiTuImt/0+Y3aeRQFV0f484+WcmCpdm+cqIXxRnxsMMFuovM4Uj
VUgYaooZteaOcnhZpaX/4bWiXM7x7FibLu9gPu9fyyHJIiVrJD+sMhb/UZtsODZO
gywy9GNs93Xm9ZoRSTpWA4pAvRajqa8DEtLlV8fx4LpvYdHIjdByiTR9CeKHYxrB
vcV1Ty6dGTd6ifFtW6ul1qaF9KeZXQBHxCTmhj4ITek1TMNDfJJD+Iwgc1ll9RL4
RoZ8DJC8Qp2RDH+3b/ptBgfROw1nrwQLuw5cG7mj5mhQdu/z9AMI2ifPk9wv56Zj
OV6wRnDcwFu5SLBUNCMd/ypnigKdWcSHCNvWo2HTtcy771b/fqz60K8dMcIWKH5B
3qvXEBHbSdf48D6t64nOyVuo8RKSIizER5Mj/baabcJqZKoAtVUo2l2vd63hX/OD
v/y51NvI0lH6cOMLka3LHVIVJInOFSKgOUa1aaKQ0KDjQDRRmmy8yY9h6RZ+aHWb
78K7oCNRx/SCLdslYFGSTQdbiI4/JVoDc6cWtHy413m5+L1447A=
=k6te
-----END PGP SIGNATURE-----
Merge tag 'bpf-next-7.1' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Pull bpf updates from Alexei Starovoitov:
- Welcome new BPF maintainers: Kumar Kartikeya Dwivedi, Eduard
Zingerman while Martin KaFai Lau reduced his load to Reviwer.
- Lots of fixes everywhere from many first time contributors. Thank you
All.
- Diff stat is dominated by mechanical split of verifier.c into
multiple components:
- backtrack.c: backtracking logic and jump history
- states.c: state equivalence
- cfg.c: control flow graph, postorder, strongly connected
components
- liveness.c: register and stack liveness
- fixups.c: post-verification passes: instruction patching, dead
code removal, bpf_loop inlining, finalize fastcall
8k line were moved. verifier.c still stands at 20k lines.
Further refactoring is planned for the next release.
- Replace dynamic stack liveness with static stack liveness based on
data flow analysis.
This improved the verification time by 2x for some programs and
equally reduced memory consumption. New logic is in liveness.c and
supported by constant folding in const_fold.c (Eduard Zingerman,
Alexei Starovoitov)
- Introduce BTF layout to ease addition of new BTF kinds (Alan Maguire)
- Use kmalloc_nolock() universally in BPF local storage (Amery Hung)
- Fix several bugs in linked registers delta tracking (Daniel Borkmann)
- Improve verifier support of arena pointers (Emil Tsalapatis)
- Improve verifier tracking of register bounds in min/max and tnum
domains (Harishankar Vishwanathan, Paul Chaignon, Hao Sun)
- Further extend support for implicit arguments in the verifier (Ihor
Solodrai)
- Add support for nop,nop5 instruction combo for USDT probes in libbpf
(Jiri Olsa)
- Support merging multiple module BTFs (Josef Bacik)
- Extend applicability of bpf_kptr_xchg (Kaitao Cheng)
- Retire rcu_trace_implies_rcu_gp() (Kumar Kartikeya Dwivedi)
- Support variable offset context access for 'syscall' programs (Kumar
Kartikeya Dwivedi)
- Migrate bpf_task_work and dynptr to kmalloc_nolock() (Mykyta
Yatsenko)
- Fix UAF in in open-coded task_vma iterator (Puranjay Mohan)
* tag 'bpf-next-7.1' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (241 commits)
selftests/bpf: cover short IPv4/IPv6 inputs with adjust_room
bpf: reject short IPv4/IPv6 inputs in bpf_prog_test_run_skb
selftests/bpf: Use memfd_create instead of shm_open in cgroup_iter_memcg
selftests/bpf: Add test for cgroup storage OOB read
bpf: Fix OOB in pcpu_init_value
selftests/bpf: Fix reg_bounds to match new tnum-based refinement
selftests/bpf: Add tests for non-arena/arena operations
bpf: Allow instructions with arena source and non-arena dest registers
bpftool: add missing fsession to the usage and docs of bpftool
docs/bpf: add missing fsession attach type to docs
bpf: add missing fsession to the verifier log
bpf: Move BTF checking logic into check_btf.c
bpf: Move backtracking logic to backtrack.c
bpf: Move state equivalence logic to states.c
bpf: Move check_cfg() into cfg.c
bpf: Move compute_insn_live_regs() into liveness.c
bpf: Move fixup/post-processing logic from verifier.c into fixups.c
bpf: Simplify do_check_insn()
bpf: Move checks for reserved fields out of the main pass
bpf: Delete unused variable
...
|
||
|
|
88b29f3f57 |
Modules changes for v7.1-rc1
Kernel symbol flags:
- Replace the separate *_gpl symbol sections (__ksymtab_gpl and
__kcrctab_gpl) with a unified symbol table and a new
__kflagstab section. This section stores symbol flags, such as
the GPL-only flag, as an 8-bit bitset for each exported symbol.
This is a cleanup that simplifies symbol lookup in the module
loader by avoiding table fragmentation and will allow a cleaner
way to add more flags later if needed.
Module signature UAPI:
- Move struct module_signature to the UAPI headers to allow reuse
by tools outside the kernel proper, such as kmod and
scripts/sign-file. This also renames a few constants for clarity
and drops unused signature types as preparation for hash-based
module integrity checking work that's in progress.
Sysfs:
- Add a /sys/module/<module>/import_ns sysfs attribute to show
the symbol namespaces imported by loaded modules. This makes it
easier to verify driver API access at runtime on systems that
care about such things (e.g. Android).
Cleanups and fixes:
- Force sh_addr to 0 for all sections in module.lds. This prevents
non-zero section addresses when linking modules with ld.bfd -r,
which confused elfutils.
- Fix a memory leak of charp module parameters on module unload
when the kernel is configured with CONFIG_SYSFS=n.
- Override the -EEXIST error code returned by module_init() to
userspace. This prevents confusion with the errno reserved by
the module loader to indicate that a module is already loaded.
- Simplify the warning message and drop the stack dump on positive
returns from module_init().
- Drop unnecessary extern keywords from function declarations and
synchronize parse_args() arguments with their implementation.
Signed-off-by: Sami Tolvanen <samitolvanen@google.com>
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQSE9au1u/dCZerzchhaByWrOaGnegUCadmI0gAKCRBaByWrOaGn
euC6AQCpeQGQv/Z1Pu9DmBRaRD1MjXg1K1J8DN3qH7L8FbWDwAD9FtzAHw9GPOOP
0aQpDvcYKjdrU8OiuqtENvhzCV1RTA4=
=YaHp
-----END PGP SIGNATURE-----
Merge tag 'modules-7.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/modules/linux
Pull module updates from Sami Tolvanen:
"Kernel symbol flags:
- Replace the separate *_gpl symbol sections (__ksymtab_gpl and
__kcrctab_gpl) with a unified symbol table and a new __kflagstab
section.
This section stores symbol flags, such as the GPL-only flag, as an
8-bit bitset for each exported symbol. This is a cleanup that
simplifies symbol lookup in the module loader by avoiding table
fragmentation and will allow a cleaner way to add more flags later
if needed.
Module signature UAPI:
- Move struct module_signature to the UAPI headers to allow reuse by
tools outside the kernel proper, such as kmod and
scripts/sign-file.
This also renames a few constants for clarity and drops unused
signature types as preparation for hash-based module integrity
checking work that's in progress.
Sysfs:
- Add a /sys/module/<module>/import_ns sysfs attribute to show the
symbol namespaces imported by loaded modules.
This makes it easier to verify driver API access at runtime on
systems that care about such things (e.g. Android).
Cleanups and fixes:
- Force sh_addr to 0 for all sections in module.lds. This prevents
non-zero section addresses when linking modules with 'ld.bfd -r',
which confused elfutils.
- Fix a memory leak of charp module parameters on module unload when
the kernel is configured with CONFIG_SYSFS=n.
- Override the -EEXIST error code returned by module_init() to
userspace. This prevents confusion with the errno reserved by the
module loader to indicate that a module is already loaded.
- Simplify the warning message and drop the stack dump on positive
returns from module_init().
- Drop unnecessary extern keywords from function declarations and
synchronize parse_args() arguments with their implementation"
* tag 'modules-7.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/modules/linux: (23 commits)
module: Simplify warning on positive returns from module_init()
module: Override -EEXIST module return
documentation: remove references to *_gpl sections
module: remove *_gpl sections from vmlinux and modules
module: deprecate usage of *_gpl sections in module loader
module: use kflagstab instead of *_gpl sections
module: populate kflagstab in modpost
module: add kflagstab section to vmlinux and modules
module: define ksym_flags enumeration to represent kernel symbol flags
selftests/bpf: verify_pkcs7_sig: Use 'struct module_signature' from the UAPI headers
sign-file: use 'struct module_signature' from the UAPI headers
tools uapi headers: add linux/module_signature.h
module: Move 'struct module_signature' to UAPI
module: Give MODULE_SIG_STRING a more descriptive name
module: Give 'enum pkey_id_type' a more specific name
module: Drop unused signature types
extract-cert: drop unused definition of PKEY_ID_PKCS7
docs: symbol-namespaces: mention sysfs attribute
module: expose imported namespaces via sysfs
module: Remove extern keyword from param prototypes
...
|
||
|
|
c43267e679 |
arm64 updates for 7.1:
Core features:
- Add support for FEAT_LSUI, allowing futex atomic operations without
toggling Privileged Access Never (PAN)
- Further refactor the arm64 exception handling code towards the
generic entry infrastructure
- Optimise __READ_ONCE() with CONFIG_LTO=y and allow alias analysis
through it
Memory management:
- Refactor the arm64 TLB invalidation API and implementation for better
control over barrier placement and level-hinted invalidation
- Enable batched TLB flushes during memory hot-unplug
- Fix rodata=full block mapping support for realm guests (when
BBML2_NOABORT is available)
Perf and PMU:
- Add support for a whole bunch of system PMUs featured in NVIDIA's
Tegra410 SoC (cspmu extensions for the fabric and PCIe, new drivers
for CPU/C2C memory latency PMUs)
- Clean up iomem resource handling in the Arm CMN driver
- Fix signedness handling of AA64DFR0.{PMUVer,PerfMon}
MPAM (Memory Partitioning And Monitoring):
- Add architecture context-switch and hiding of the feature from KVM
- Add interface to allow MPAM to be exposed to user-space using resctrl
- Add errata workaround for some existing platforms
- Add documentation for using MPAM and what shape of platforms can use
resctrl
Miscellaneous:
- Check DAIF (and PMR, where relevant) at task-switch time
- Skip TFSR_EL1 checks and barriers in synchronous MTE tag check mode
(only relevant to asynchronous or asymmetric tag check modes)
- Remove a duplicate allocation in the kexec code
- Remove redundant save/restore of SCS SP on entry to/from EL0
- Generate the KERNEL_HWCAP_ definitions from the arm64 hwcap
descriptions
- Add kselftest coverage for cmpbr_sigill()
- Update sysreg definitions
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE5RElWfyWxS+3PLO2a9axLQDIXvEFAmnc8DEACgkQa9axLQDI
XvFauRAAhc1cIgoRpgtdZd7+3/g457teDPYA3L/CjJzI28aesIpV/ECrEw2GL4xs
HrQfijF4oyCDbBwh0sAascO/H7RoyOranlbuc+fVJ6Bj6gP9STzR4GmscsWkAMSJ
vA3Jd1DREdDBO2sjw+hGhht84nRlcfY1FyORJP+1JaFH4oWTWsRNeOZIiI3BhxR8
EtFP9E8r2Esxi/FmZb/47m7kYCEH+XsrzQvBQNLVCH899QX2Hn0kAY70ndq2ZiQl
n+zLAe7FBFwKzUVmlgWuhjrWMmK+1TthK/XQuOtxg13dHmX+vE/j+A+dOqRWSfHY
ktNcWaf6m4+TWKVeVTe4E1cnSuwTQTm4VQKd9zaeQxiZYyYJhCQjXuEZg3vDmDbq
F6D3MpTaJHRRWp0rEurxnSBlmQPCBE2IxEBdSrjd/WJ6T9e1oYwWiSJSS7bGCgGr
dd/XLsOY7Um5n4ooIFEZc1de6VO6/VTKjmxnBMgU+Sa1REbLpD438IX/6CjzG5qM
l5Ulke/c6/a/faeVCEpZpD8JuvNOzo9RISDPrNg1KKAL+OSU+9tgmVjIFPhDDB0w
zNTqT7YJIhxlJxnUGWDk8YNsTjT3OzyquY9UT1tBTBqC0k13J2i2ev30toUez7xj
2aV+9qMpunbLtwYhXNun1hBFiYrCxpX7I8ha0hXiXL0CywVOPTI=
=CnVn
-----END PGP SIGNATURE-----
Merge tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
Pull arm64 updates from Catalin Marinas:
"The biggest changes are MPAM enablement in drivers/resctrl and new PMU
support under drivers/perf.
On the core side, FEAT_LSUI lets futex atomic operations with EL0
permissions, avoiding PAN toggling.
The rest is mostly TLB invalidation refactoring, further generic entry
work, sysreg updates and a few fixes.
Core features:
- Add support for FEAT_LSUI, allowing futex atomic operations without
toggling Privileged Access Never (PAN)
- Further refactor the arm64 exception handling code towards the
generic entry infrastructure
- Optimise __READ_ONCE() with CONFIG_LTO=y and allow alias analysis
through it
Memory management:
- Refactor the arm64 TLB invalidation API and implementation for
better control over barrier placement and level-hinted invalidation
- Enable batched TLB flushes during memory hot-unplug
- Fix rodata=full block mapping support for realm guests (when
BBML2_NOABORT is available)
Perf and PMU:
- Add support for a whole bunch of system PMUs featured in NVIDIA's
Tegra410 SoC (cspmu extensions for the fabric and PCIe, new drivers
for CPU/C2C memory latency PMUs)
- Clean up iomem resource handling in the Arm CMN driver
- Fix signedness handling of AA64DFR0.{PMUVer,PerfMon}
MPAM (Memory Partitioning And Monitoring):
- Add architecture context-switch and hiding of the feature from KVM
- Add interface to allow MPAM to be exposed to user-space using
resctrl
- Add errata workaround for some existing platforms
- Add documentation for using MPAM and what shape of platforms can
use resctrl
Miscellaneous:
- Check DAIF (and PMR, where relevant) at task-switch time
- Skip TFSR_EL1 checks and barriers in synchronous MTE tag check mode
(only relevant to asynchronous or asymmetric tag check modes)
- Remove a duplicate allocation in the kexec code
- Remove redundant save/restore of SCS SP on entry to/from EL0
- Generate the KERNEL_HWCAP_ definitions from the arm64 hwcap
descriptions
- Add kselftest coverage for cmpbr_sigill()
- Update sysreg definitions"
* tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (109 commits)
arm64: rsi: use linear-map alias for realm config buffer
arm64: Kconfig: fix duplicate word in CMDLINE help text
arm64: mte: Skip TFSR_EL1 checks and barriers in synchronous tag check mode
arm64/sysreg: Update ID_AA64SMFR0_EL1 description to DDI0601 2025-12
arm64/sysreg: Update ID_AA64ZFR0_EL1 description to DDI0601 2025-12
arm64/sysreg: Update ID_AA64FPFR0_EL1 description to DDI0601 2025-12
arm64/sysreg: Update ID_AA64ISAR2_EL1 description to DDI0601 2025-12
arm64/sysreg: Update ID_AA64ISAR0_EL1 description to DDI0601 2025-12
arm64/hwcap: Generate the KERNEL_HWCAP_ definitions for the hwcaps
arm64: kexec: Remove duplicate allocation for trans_pgd
ACPI: AGDI: fix missing newline in error message
arm64: Check DAIF (and PMR) at task-switch time
arm64: entry: Use split preemption logic
arm64: entry: Use irqentry_{enter_from,exit_to}_kernel_mode()
arm64: entry: Consistently prefix arm64-specific wrappers
arm64: entry: Don't preempt with SError or Debug masked
entry: Split preemption from irqentry_exit_to_kernel_mode()
entry: Split kernel mode logic from irqentry_{enter,exit}()
entry: Move irqentry_enter() prototype later
entry: Remove local_irq_{enable,disable}_exit_to_user()
...
|
||
|
|
1c3b68f0d5 |
Scheduler changes for v7.1:
Fair scheduling updates:
- Skip SCHED_IDLE rq for SCHED_IDLE tasks (Christian Loehle)
- Remove superfluous rcu_read_lock() in the wakeup path (K Prateek Nayak)
- Simplify the entry condition for update_idle_cpu_scan() (K Prateek Nayak)
- Simplify SIS_UTIL handling in select_idle_cpu() (K Prateek Nayak)
- Avoid overflow in enqueue_entity() (K Prateek Nayak)
- Update overutilized detection (Vincent Guittot)
- Prevent negative lag increase during delayed dequeue (Vincent Guittot)
- Clear buddies for preempt_short (Vincent Guittot)
- Implement more complex proportional newidle balance (Peter Zijlstra)
- Increase weight bits for avg_vruntime (Peter Zijlstra)
- Use full weight to __calc_delta() (Peter Zijlstra)
RT and DL scheduling updates:
- Fix incorrect schedstats for rt and dl thread (Dengjun Su)
- Skip group schedulable check with rt_group_sched=0 (Michal Koutný)
- Move group schedulability check to sched_rt_global_validate()
(Michal Koutný)
- Add reporting of runtime left & abs deadline to sched_getattr()
for DEADLINE tasks (Tommaso Cucinotta)
Scheduling topology updates by K Prateek Nayak:
- Compute sd_weight considering cpuset partitions
- Extract "imb_numa_nr" calculation into a separate helper
- Allocate per-CPU sched_domain_shared in s_data
- Switch to assigning "sd->shared" from s_data
- Remove sched_domain_shared allocation with sd_data
Energy-aware scheduling updates:
- Filter false overloaded_group case for EAS (Vincent Guittot)
- PM: EM: Switch to rcu_dereference_all() in wakeup path
(Dietmar Eggemann)
Infrastructure updates:
- Replace use of system_unbound_wq with system_dfl_wq (Marco Crivellari)
Proxy scheduling updates by John Stultz:
- Make class_schedulers avoid pushing current, and get rid of proxy_tag_curr()
- Minimise repeated sched_proxy_exec() checking
- Fix potentially missing balancing with Proxy Exec
- Fix and improve task::blocked_on et al handling
- Add assert_balance_callbacks_empty() helper
- Add logic to zap balancing callbacks if we pick again
- Move attach_one_task() and attach_task() helpers to sched.h
- Handle blocked-waiter migration (and return migration)
- Add K Prateek Nayak to scheduler reviewers for proxy execution
Misc cleanups and fixes by John Stultz, Joseph Salisbury,
Peter Zijlstra, K Prateek Nayak, Michal Koutný, Randy Dunlap,
Shrikanth Hegde, Vincent Guittot, Zhan Xusheng, Xie Yuanbin
and Vincent Guittot.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmncq4oRHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1gxoA/8DD0SsMhBLaZLi+LAdY5fD6rGjOLGBtxz
NgwN8CAvPIFH7qFzPjAk7WtVXoKjF62sRDFvUaBEsliflRzOkBkYr3SnUYRORyBB
VRj7D6ymuWhxnhYsy8+Hviu/93c3GyEO59IYU0wIShxBzYBxqDfNxWvEUQte2Cin
1yFy4CICJeGpsBv9Ev+0LtesxtF5bnaioawbAYcpc2IdYsK+nsMKRvkwg1YSdLmh
v9+vIYuQBrclBn3OR7dsv2krBev5qodYtDZFwdJagE+6aaQv2zhWIfhetPpkzwrq
zhuzVZH+E9404Pn5EqJaw7KmU9eyBBwIUVqBaQfH73eSe5PY0tiSrpPU9foocUjo
4Td9sL11SLzjwpM4bIijW0ezZY8y+4Q0A21GwdcwAx3LPstXcF5GIjQ76dVFPRKN
Unbt6o+9O9NvMLg8CLzwonlFzOoLOrL+5eKJs+caOuOikT+cXnBQrukgB4ck3RAD
PIVD8XnufJTCKiDvx2vravLXsWiA2cg7citVsgc8y5FBcdhzv3YVqXd/lGkqg+09
7rVqE6NRDlkk4G4KZACTK45YVcVwXhQlMU/qiS0IduHdD0NtL9DPnQvdfzQWQehO
30cJ5vZ+fqbHspJ8AdPuqntUyfEvPTCbCT4Ou/AEcvO8NRQu2gplcq9mF4U46WZG
GBPWXvGHzM8=
=NjyS
-----END PGP SIGNATURE-----
Merge tag 'sched-core-2026-04-13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler updates from Ingo Molnar:
"Fair scheduling updates:
- Skip SCHED_IDLE rq for SCHED_IDLE tasks (Christian Loehle)
- Remove superfluous rcu_read_lock() in the wakeup path (K Prateek Nayak)
- Simplify the entry condition for update_idle_cpu_scan() (K Prateek Nayak)
- Simplify SIS_UTIL handling in select_idle_cpu() (K Prateek Nayak)
- Avoid overflow in enqueue_entity() (K Prateek Nayak)
- Update overutilized detection (Vincent Guittot)
- Prevent negative lag increase during delayed dequeue (Vincent Guittot)
- Clear buddies for preempt_short (Vincent Guittot)
- Implement more complex proportional newidle balance (Peter Zijlstra)
- Increase weight bits for avg_vruntime (Peter Zijlstra)
- Use full weight to __calc_delta() (Peter Zijlstra)
RT and DL scheduling updates:
- Fix incorrect schedstats for rt and dl thread (Dengjun Su)
- Skip group schedulable check with rt_group_sched=0 (Michal Koutný)
- Move group schedulability check to sched_rt_global_validate()
(Michal Koutný)
- Add reporting of runtime left & abs deadline to sched_getattr()
for DEADLINE tasks (Tommaso Cucinotta)
Scheduling topology updates by K Prateek Nayak:
- Compute sd_weight considering cpuset partitions
- Extract "imb_numa_nr" calculation into a separate helper
- Allocate per-CPU sched_domain_shared in s_data
- Switch to assigning "sd->shared" from s_data
- Remove sched_domain_shared allocation with sd_data
Energy-aware scheduling updates:
- Filter false overloaded_group case for EAS (Vincent Guittot)
- PM: EM: Switch to rcu_dereference_all() in wakeup path
(Dietmar Eggemann)
Infrastructure updates:
- Replace use of system_unbound_wq with system_dfl_wq (Marco Crivellari)
Proxy scheduling updates by John Stultz:
- Make class_schedulers avoid pushing current, and get rid of proxy_tag_curr()
- Minimise repeated sched_proxy_exec() checking
- Fix potentially missing balancing with Proxy Exec
- Fix and improve task::blocked_on et al handling
- Add assert_balance_callbacks_empty() helper
- Add logic to zap balancing callbacks if we pick again
- Move attach_one_task() and attach_task() helpers to sched.h
- Handle blocked-waiter migration (and return migration)
- Add K Prateek Nayak to scheduler reviewers for proxy execution
Misc cleanups and fixes by John Stultz, Joseph Salisbury, Peter
Zijlstra, K Prateek Nayak, Michal Koutný, Randy Dunlap, Shrikanth
Hegde, Vincent Guittot, Zhan Xusheng, Xie Yuanbin and Vincent Guittot"
* tag 'sched-core-2026-04-13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (46 commits)
sched/eevdf: Clear buddies for preempt_short
sched/rt: Cleanup global RT bandwidth functions
sched/rt: Move group schedulability check to sched_rt_global_validate()
sched/rt: Skip group schedulable check with rt_group_sched=0
sched/fair: Avoid overflow in enqueue_entity()
sched: Use u64 for bandwidth ratio calculations
sched/fair: Prevent negative lag increase during delayed dequeue
sched/fair: Use sched_energy_enabled()
sched: Handle blocked-waiter migration (and return migration)
sched: Move attach_one_task and attach_task helpers to sched.h
sched: Add logic to zap balance callbacks if we pick again
sched: Add assert_balance_callbacks_empty helper
sched/locking: Add special p->blocked_on==PROXY_WAKING value for proxy return-migration
sched: Fix modifying donor->blocked on without proper locking
locking: Add task::blocked_lock to serialize blocked_on state
sched: Fix potentially missing balancing with Proxy Exec
sched: Minimise repeated sched_proxy_exec() checking
sched: Make class_schedulers avoid pushing current, and get rid of proxy_tag_curr()
MAINTAINERS: Add K Prateek Nayak to scheduler reviewers
sched/core: Get this cpu once in ttwu_queue_cond()
...
|
||
|
|
33c66eb5e9 |
Performance events changes for v7.1:
Core updates:
- Try to allocate task_ctx_data quickly, to optimize
O(N^2) algorithm on large systems with O(100k) threads
(Namhyung Kim)
AMD PMU driver IBS support updates and fixes, by Ravi Bangoria:
- Fix interrupt accounting for discarded samples
- Fix a Zen5-specific quirk
- Fix PhyAddrVal handling
- Fix NMI-safety with perf_allow_kernel()
- Fix a race between event add and NMIs
Intel PMU driver updates:
- Only check GP counters for PEBS constraints validation (Dapeng Mi)
MSR driver:
- Turn SMI_COUNT and PPERF on by default, instead of a long
list of CPU models to enable them on (Kan Liang)
Misc cleanups and fixes by Aldf Conte, Anshuman Khandual, Namhyung Kim,
Ravi Bangoria and Yen-Hsiang Hsu.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmncppoRHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1hN2hAAgd8ix2hZjT/v/wH0iIayRKPEI8KqQ0XP
7L0nqHVrNw3gzsxkRBIBljyWdYhHmxYnV2ExgddwXdpcD2j/Vf2YUfNtrE0fXL5J
wD8/M4WxzxA2gwcRgz3kGaeU/I0Ble9TcIdSaI3kJFJarHaDw3jEsif/gfmYbZfm
oX+gjIUCspnUMqb5EqsHdWxPWub87NnddPI8c9hmhq/9IZ4QvhUxS+lQHc+GihpY
MTlTxG10W/+f84w0lyG153KslV1rngIqoQ2uJTRe0fjx3VX1uhsgB3LCrkTzMOWe
GVODiMhN9u5o0pfJLbboSDZ32z3QrsojXbS2Z+ZHqfqbgompIzH9SVh5fFSGKtfK
64CEP4mO90JGGnDYS6vaPJhZrbusZKzuLt0tcn0aIYHD48PNJXhD2tVE76JsnmAj
SicnL78QOQkB8Gi0LuCXhxPXY/KAqFtOgmKV9x+gqJuAFgTXEUhem6IOJjShhwOQ
NfIkXDHz7kmMLblWRmuslGOWfWddRKheQNvuJ+YqbVto6N192PQdSjnBBZjX8GpL
o52FYCbwGXckZ9X+SU55j3lmQbmtS5Rn8PwB7dmHnVIp8bRI62ANQVoNc1feIrAt
7UA0SrIPz94oe8tH8lQYb5d98fv/6+Nroli8Vpik4wQZf1VUrzlEEXv/m9CTOL4G
FA5CtFF1AmA=
=4Vs0
-----END PGP SIGNATURE-----
Merge tag 'perf-core-2026-04-13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull performance events updates from Ingo Molnar:
"Core updates:
- Try to allocate task_ctx_data quickly, to optimize O(N^2) algorithm
on large systems with O(100k) threads (Namhyung Kim)
AMD PMU driver IBS support updates and fixes, by Ravi Bangoria:
- Fix interrupt accounting for discarded samples
- Fix a Zen5-specific quirk
- Fix PhyAddrVal handling
- Fix NMI-safety with perf_allow_kernel()
- Fix a race between event add and NMIs
Intel PMU driver updates:
- Only check GP counters for PEBS constraints validation (Dapeng Mi)
MSR driver:
- Turn SMI_COUNT and PPERF on by default, instead of a long list of
CPU models to enable them on (Kan Liang)
... and misc cleanups and fixes by Aldf Conte, Anshuman Khandual,
Namhyung Kim, Ravi Bangoria and Yen-Hsiang Hsu"
* tag 'perf-core-2026-04-13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf/events: Replace READ_ONCE() with standard pgtable accessors
perf/x86/msr: Make SMI and PPERF on by default
perf/x86/intel/p4: Fix unused variable warning in p4_pmu_init()
perf/x86/intel: Only check GP counters for PEBS constraints validation
perf/x86/amd/ibs: Fix comment typo in ibs_op_data
perf/amd/ibs: Advertise remote socket capability
perf/amd/ibs: Enable streaming store filter
perf/amd/ibs: Enable RIP bit63 hardware filtering
perf/amd/ibs: Enable fetch latency filtering
perf/amd/ibs: Support IBS_{FETCH|OP}_CTL2[Dis] to eliminate RMW race
perf/amd/ibs: Add new MSRs and CPUID bits definitions
perf/amd/ibs: Define macro for ldlat mask and shift
perf/amd/ibs: Avoid race between event add and NMI
perf/amd/ibs: Avoid calling perf_allow_kernel() from the IBS NMI handler
perf/amd/ibs: Preserve PhyAddrVal bit when clearing PhyAddr MSR
perf/amd/ibs: Limit ldlat->l3missonly dependency to Zen5
perf/amd/ibs: Account interrupt for discarded samples
perf/core: Simplify __detach_global_ctx_data()
perf/core: Try to allocate task_ctx_data quickly
perf/core: Pass GFP flags to attach_task_ctx_data()
|
||
|
|
7393febcb1 |
Locking updates for v7.1:
Mutexes:
- Add killable flavor to guard definitions (Davidlohr Bueso)
- Remove the list_head from struct mutex (Matthew Wilcox)
- Rename mutex_init_lockep() (Davidlohr Bueso)
rwsems:
- Remove the list_head from struct rw_semaphore and
replace it with a single pointer (Matthew Wilcox)
- Fix logic error in rwsem_del_waiter() (Andrei Vagin)
Semaphores:
- Remove the list_head from struct semaphore (Matthew Wilcox)
Jump labels:
- Use ATOMIC_INIT() for initialization of .enabled (Thomas Weißschuh)
- Remove workaround for old compilers in initializations
(Thomas Weißschuh)
Lock context analysis changes and improvements:
- Add context analysis for rwsems (Peter Zijlstra)
- Fix rwlock and spinlock lock context annotations (Bart Van Assche)
- Fix rwlock support in <linux/spinlock_up.h> (Bart Van Assche)
- Add lock context annotations in the spinlock implementation
(Bart Van Assche)
- signal: Fix the lock_task_sighand() annotation (Bart Van Assche)
- ww-mutex: Fix the ww_acquire_ctx function annotations
(Bart Van Assche)
- Add lock context support in do_raw_{read,write}_trylock()
(Bart Van Assche)
- arm64, compiler-context-analysis: Permit alias analysis through
__READ_ONCE() with CONFIG_LTO=y (Marco Elver)
- Add __cond_releases() (Peter Zijlstra)
- Add context analysis for mutexes (Peter Zijlstra)
- Add context analysis for rtmutexes (Peter Zijlstra)
- Convert futexes to compiler context analysis (Peter Zijlstra)
Rust integration updates:
- Add atomic fetch_sub() implementation (Andreas Hindborg)
- Refactor various rust_helper_ methods for expansion (Boqun Feng)
- Add Atomic<*{mut,const} T> support (Boqun Feng)
- Add atomic operation helpers over raw pointers (Boqun Feng)
- Add performance-optimal Flag type for atomic booleans, to avoid
slow byte-sized RMWs on architectures that don't support them.
(FUJITA Tomonori)
- Misc cleanups and fixes (Andreas Hindborg, Boqun Feng,
FUJITA Tomonori)
LTO support updates:
- arm64: Optimize __READ_ONCE() with CONFIG_LTO=y (Marco Elver)
- compiler: Simplify generic RELOC_HIDE() (Marco Elver)
Miscellaneous fixes and cleanups by Peter Zijlstra, Randy Dunlap,
Thomas Weißschuh, Davidlohr Bueso and Mikhail Gavrilov.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmncnGERHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1ig7Q//a3UgHjUe96/zuJIv/X1lt5MU0GHP/m/n
Rf6c39P0VWV6iupJtZ6gPmtQBQDyqWsnfE9S6PFDW4P/Njn0CGEBhk5bcYiN7dc5
pN0hfM67rY1Ids2FJo5JVIxw2pNZpZHU4v3dJC84xH1cPwmccHxt3XW67iQnJCY9
6m7RJ3nUfmNC1qLGKtAFQp3N91hK+BYxqZQ1Wn6a0lRWfmYY7WDs8qrr5N6Ezn7W
53ZNXXbXUC09iOO/slOZmFD5tDrp5Z1nPYTeOdFnWYC5SoTvkfauTqmfZRN5sFad
8vRxXHuCsdBthNF+ljobBUhZx9QL4UJMGOJTFVp9dZSj13vI7UNlbfAtwMKM8lsR
L+v+GSsGdQWwrhzaiz53k6ZuUUDECltjwKFFUBy9RPFMtKkpKsgjW+X6I+SFeTQW
QAPzaA/fEK45bvSPUcjn09kKKC1EVIjHQ6NvByoVjPABz92PpJgOR0si2PW5F314
7sKzk4Ra5x8NGDSEiC9uSwB7mIv56/lUq5zuVoz3CB2rehqIpPdeieE8TaXvcmdK
8otsWYcUXDcj/d6en9XBzb0t3LpQ1TumcOGw3xUJhrSoB4DvmWgW788SbGIKklR9
KFQLU2USlm4u8JHEFXOAeaDhWME+eCqP5FCq3YTqxLksiA+oYx3Xui1R+5L4yjRs
bVbp+BIs3N4=
=aw95
-----END PGP SIGNATURE-----
Merge tag 'locking-core-2026-04-13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull locking updates from Ingo Molnar:
"Mutexes:
- Add killable flavor to guard definitions (Davidlohr Bueso)
- Remove the list_head from struct mutex (Matthew Wilcox)
- Rename mutex_init_lockep() (Davidlohr Bueso)
rwsems:
- Remove the list_head from struct rw_semaphore and
replace it with a single pointer (Matthew Wilcox)
- Fix logic error in rwsem_del_waiter() (Andrei Vagin)
Semaphores:
- Remove the list_head from struct semaphore (Matthew Wilcox)
Jump labels:
- Use ATOMIC_INIT() for initialization of .enabled (Thomas Weißschuh)
- Remove workaround for old compilers in initializations
(Thomas Weißschuh)
Lock context analysis changes and improvements:
- Add context analysis for rwsems (Peter Zijlstra)
- Fix rwlock and spinlock lock context annotations (Bart Van Assche)
- Fix rwlock support in <linux/spinlock_up.h> (Bart Van Assche)
- Add lock context annotations in the spinlock implementation
(Bart Van Assche)
- signal: Fix the lock_task_sighand() annotation (Bart Van Assche)
- ww-mutex: Fix the ww_acquire_ctx function annotations
(Bart Van Assche)
- Add lock context support in do_raw_{read,write}_trylock()
(Bart Van Assche)
- arm64, compiler-context-analysis: Permit alias analysis through
__READ_ONCE() with CONFIG_LTO=y (Marco Elver)
- Add __cond_releases() (Peter Zijlstra)
- Add context analysis for mutexes (Peter Zijlstra)
- Add context analysis for rtmutexes (Peter Zijlstra)
- Convert futexes to compiler context analysis (Peter Zijlstra)
Rust integration updates:
- Add atomic fetch_sub() implementation (Andreas Hindborg)
- Refactor various rust_helper_ methods for expansion (Boqun Feng)
- Add Atomic<*{mut,const} T> support (Boqun Feng)
- Add atomic operation helpers over raw pointers (Boqun Feng)
- Add performance-optimal Flag type for atomic booleans, to avoid
slow byte-sized RMWs on architectures that don't support them.
(FUJITA Tomonori)
- Misc cleanups and fixes (Andreas Hindborg, Boqun Feng, FUJITA
Tomonori)
LTO support updates:
- arm64: Optimize __READ_ONCE() with CONFIG_LTO=y (Marco Elver)
- compiler: Simplify generic RELOC_HIDE() (Marco Elver)
Miscellaneous fixes and cleanups by Peter Zijlstra, Randy Dunlap,
Thomas Weißschuh, Davidlohr Bueso and Mikhail Gavrilov"
* tag 'locking-core-2026-04-13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (39 commits)
compiler: Simplify generic RELOC_HIDE()
locking: Add lock context annotations in the spinlock implementation
locking: Add lock context support in do_raw_{read,write}_trylock()
locking: Fix rwlock support in <linux/spinlock_up.h>
lockdep: Raise default stack trace limits when KASAN is enabled
cleanup: Optimize guards
jump_label: remove workaround for old compilers in initializations
jump_label: use ATOMIC_INIT() for initialization of .enabled
futex: Convert to compiler context analysis
locking/rwsem: Fix logic error in rwsem_del_waiter()
locking/rwsem: Add context analysis
locking/rtmutex: Add context analysis
locking/mutex: Add context analysis
compiler-context-analysys: Add __cond_releases()
locking/mutex: Remove the list_head from struct mutex
locking/semaphore: Remove the list_head from struct semaphore
locking/rwsem: Remove the list_head from struct rw_semaphore
rust: atomic: Update a safety comment in impl of `fetch_add()`
rust: sync: atomic: Update documentation for `fetch_add()`
rust: sync: atomic: Add fetch_sub()
...
|
||
|
|
e80d033851 |
Updates for the SMP core code:
- Switch smp_call_on_cpu() to user system_percpu_wq instead of system_wq
a part of the ongoing workqueue restructuring
- Improve the CSD-lock diagnostics for smp_call_function_single() to
provide better debug mechanisms on weakly ordered systems.
- Cache the current CPU number once in smp_call_function*() instead of
retrieving it over and over.
- Add missing kernel-doc comments all over the place
-----BEGIN PGP SIGNATURE-----
iQJEBAABCgAuFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmnbu7EQHHRnbHhAa2Vy
bmVsLm9yZwAKCRCmGPVMDXSYoSTiD/9OrmXH0/NqWnLG2qydztv8fTgDPFtDg7KK
bQ6YQpsldbBFOBFqKt9mMgWPFIqhsmomVTtJ0HKL31w7tpXG84wPvuknC1WuaA8B
l3hsslDhU/RNNNdFOmu3vHtduDTVTy+8DhGdpl06KMlvLOzyYzU9kaHFKGPZ1OZ6
/Zhab9+QKi4uOpSbXa3Lt/xFnwZgitqqn/6RbfwvkeuYBdyVtniff0oE3BGfXuOd
taVN5XLA4g3pD7tznLm8n+HAW6weSdu7feLt8+Q65yNcLU2ioI+mCZV+j/hHqWX2
n0ZA9o0AoaOvQne0ABl4nkh9Wh19zyUnfAEswr3+hsNhFDLJhog6EvoYK6HL5Kso
fqRNn+Ppsxr9aawLUQo7CZRiVOT4LOH8yE88SHELprHFw1y4b1pXXeZHndSOi4iY
bIWQwhysCpSirT0619TXR4i4b5FUGG8PXfwgLSErY+cIVIN2JHoxoMHRgJ3tajbO
CpE5BRSbBleX1rGtG8m1pUifb+1k5u+ktBvmz4kofX8pwEemTBK0FbppixyKenSF
bLrhiFHRPosVmyMrKnuREFKJCZOXDOZ3qQx7WIQ31wTmNIj9O1ZvqVcvhNssjwF4
e+ndi2ki+rP7pm6PlijNJT/ZcVQJrM/l1jvWhZOJo1yguaIngHDYR/uE5MFPjcmp
49ZR7kPoDQ==
=45V4
-----END PGP SIGNATURE-----
Merge tag 'smp-core-2026-04-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull SMP core updates from Thomas Gleixner:
- Switch smp_call_on_cpu() to user system_percpu_wq instead of
system_wq a part of the ongoing workqueue restructuring
- Improve the CSD-lock diagnostics for smp_call_function_single() to
provide better debug mechanisms on weakly ordered systems.
- Cache the current CPU number once in smp_call_function*() instead of
retrieving it over and over.
- Add missing kernel-doc comments all over the place
* tag 'smp-core-2026-04-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
smp: Use system_percpu_wq instead of system_wq
smp: Improve smp_call_function_single() CSD-lock diagnostics
smp: Get this_cpu once in smp_call_function
smp: Add missing kernel-doc comments
|
||
|
|
f21f7b5162 |
Update to the VDSO subsystem:
- Make the handling of compat functions consistent and more robust
- Rework the underlying data store so that it is dynamically
allocated, which allows the conversion of the last holdout SPARC64
to the generic VDSO implementation
- Rework the SPARC64 VDSO to utilize the generic implementation
- Mop up the left overs of the non-generic VDSO support in the core
code.
- Expand the VDSO selftest and make them more robust
- Allow time namespaces to be enabled independently of the generic
VDSO support, which was not possible before due to SPARC64 not
using it.
- Various cleanups and improvements in the related code.
-----BEGIN PGP SIGNATURE-----
iQJEBAABCgAuFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmnb0v8QHHRnbHhAa2Vy
bmVsLm9yZwAKCRCmGPVMDXSYocfqD/9ywgnvwRH6B612mY4PI3qCbLHs6n9f78aH
YwyXmmfBZ5vt1ZtptHD+BAxiIMm9GC+/exdj5zhcOWucnBVhorcloE6evxhkJAMn
RhTQFKkEmcA/UV2Yfct9r+33kgZRyu4IIul4J7hgn2o5T1BqwZbOil0W/O5adr5P
MDLxjT1OLV80ZZWI9qbWcR/aR7W7sHcdwfVPPqjhombRY7f391Mo3dZeM5C2y55x
8TXCEqVpN1RJzFinWEgQN7QpP4OmF0rRuXSrDQpkH6pk/+RSqNlT/QGG7MJtmCQR
E6CeBjNRUn318KiroaGyTKlM9xsL3gNoiCY24ZTwzZxx3g5gSAR3KTCTJhQU0hpu
Svxj+ksqEAyW7fAOIsbce6W8fUPKC2KM+juXgPKcqZ5hjE2fALD+eEYMlq00jSiu
sj71007cM9tZKOXPdWs3Fv7AY2Yj7iiRiRz9gv1wqS1z7ybxiaFjxjLYYakej0tr
rmwBDEGhNow7msZZttr01BRZk9hDUWfIiJtL+0BrgRLNzst2A7WoagtZ2s0Z7Psl
RjtWgYNBDJ878xK0J+Djqb9TyLraGWZShIIna9uYCAJX9i954xfKJ//NOnUkZhcl
jslDLHhdttyJ+TmgIsc1ntUGvYvHqH5ywQpyDfWepMKyIYdaJLHOr2K6bwFnGHdw
uocXvLrkXw==
=8ixX
-----END PGP SIGNATURE-----
Merge tag 'timers-vdso-2026-04-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull vdso updates from Thomas Gleixner:
- Make the handling of compat functions consistent and more robust
- Rework the underlying data store so that it is dynamically allocated,
which allows the conversion of the last holdout SPARC64 to the
generic VDSO implementation
- Rework the SPARC64 VDSO to utilize the generic implementation
- Mop up the left overs of the non-generic VDSO support in the core
code
- Expand the VDSO selftest and make them more robust
- Allow time namespaces to be enabled independently of the generic VDSO
support, which was not possible before due to SPARC64 not using it
- Various cleanups and improvements in the related code
* tag 'timers-vdso-2026-04-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (51 commits)
timens: Use task_lock guard in timens_get*()
timens: Use mutex guard in proc_timens_set_offset()
timens: Simplify some calls to put_time_ns()
timens: Add a __free() wrapper for put_time_ns()
timens: Remove dependency on the vDSO
vdso/timens: Move functions to new file
selftests: vDSO: vdso_test_correctness: Add a test for time()
selftests: vDSO: vdso_test_correctness: Use facilities from parse_vdso.c
selftests: vDSO: vdso_test_correctness: Handle different tv_usec types
selftests: vDSO: vdso_test_correctness: Drop SYS_getcpu fallbacks
selftests: vDSO: vdso_test_gettimeofday: Remove nolibc checks
Revert "selftests: vDSO: parse_vdso: Use UAPI headers instead of libc headers"
random: vDSO: Remove ifdeffery
random: vDSO: Trim vDSO includes
vdso/datapage: Trim down unnecessary includes
vdso/datapage: Remove inclusion of gettimeofday.h
vdso/helpers: Explicitly include vdso/processor.h
vdso/gettimeofday: Add explicit includes
random: vDSO: Add explicit includes
MIPS: vdso: Explicitly include asm/vdso/vdso.h
...
|
||
|
|
c1fe867b5b |
Updates for the timer/timekeeping core:
- A rework of the hrtimer subsystem to reduce the overhead for frequently
armed timers, especially the hrtick scheduler timer.
- Better timer locality decision
- Simplification of the evaluation of the first expiry time by
keeping track of the neighbor timers in the RB-tree by providing a
RB-tree variant with neighbor links. That avoids walking the
RB-tree on removal to find the next expiry time, but even more
important allows to quickly evaluate whether a timer which is
rearmed changes the position in the RB-tree with the modified
expiry time or not. If not, the dequeue/enqueue sequence which both
can end up in rebalancing can be completely avoided.
- Deferred reprogramming of the underlying clock event device. This
optimizes for the situation where a hrtimer callback sets the need
resched bit. In that case the code attempts to defer the
re-programming of the clock event device up to the point where the
scheduler has picked the next task and has the next hrtick timer
armed. In case that there is no immediate reschedule or soft
interrupts have to be handled before reaching the reschedule point
in the interrupt entry code the clock event is reprogrammed in one
of those code paths to prevent that the timer becomes stale.
- Support for clocksource coupled clockevents
The TSC deadline timer is coupled to the TSC. The next event is
programmed in TSC time. Currently this is done by converting the
CLOCK_MONOTONIC based expiry value into a relative timeout,
converting it into TSC ticks, reading the TSC adding the delta
ticks and writing the deadline MSR.
As the timekeeping core has the conversion factors for the TSC
already, the whole back and forth conversion can be completely
avoided. The timekeeping core calculates the reverse conversion
factors from nanoseconds to TSC ticks and utilizes the base
timestamps of TSC and CLOCK_MONOTONIC which are updated once per
tick. This allows a direct conversion into the TSC deadline value
without reading the time and as a bonus keeps the deadline
conversion in sync with the TSC conversion factors, which are
updated by adjtimex() on systems with NTP/PTP enabled.
- Allow inlining of the clocksource read and clockevent write
functions when they are tiny enough, e.g. on x86 RDTSC and WRMSR.
With all those enhancements in place a hrtick enabled scheduler
provides the same performance as without hrtick. But also other hrtimer
users obviously benefit from these optimizations.
- Robustness improvements and cleanups of historical sins in the hrtimer
and timekeeping code.
- Rewrite of the clocksource watchdog.
The clocksource watchdog code has over time reached the state of an
impenetrable maze of duct tape and staples. The original design, which was
made in the context of systems far smaller than today, is based on the
assumption that the to be monitored clocksource (TSC) can be trivially
compared against a known to be stable clocksource (HPET/ACPI-PM timer).
Over the years this rather naive approach turned out to have major
flaws. Long delays between the watchdog invocations can cause wrap
arounds of the reference clocksource. The access to the reference
clocksource degrades on large multi-sockets systems dure to
interconnect congestion. This has been addressed with various
heuristics which degraded the accuracy of the watchdog to the point
that it fails to detect actual TSC problems on older hardware which
exposes slow inter CPU drifts due to firmware manipulating the TSC to
hide SMI time.
The rewrite addresses this by:
- Restricting the validation against the reference clocksource to the
boot CPU which is usually closest to the legacy block which
contains the reference clocksource (HPET/ACPI-PM).
- Do a round robin validation betwen the boot CPU and the other CPUs
based only on the TSC with an algorithm similar to the TSC
synchronization code during CPU hotplug.
- Being more leniant versus remote timeouts
- The usual tiny fixes, cleanups and enhancements all over the place
-----BEGIN PGP SIGNATURE-----
iQJEBAABCgAuFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmnbxdMQHHRnbHhAa2Vy
bmVsLm9yZwAKCRCmGPVMDXSYoeq6EAC4h9wuBr5yCkxmog1Bhlk9cnK0oX1THb7V
Q4z6DrYAiDXP6z4IDwqSW+3vvmNw1QXOeqpyMTiiIcQ5mNSs1IDnCt5HOEwY+ICm
fiSUMYkXkH6xdFWspYWFkD7aExHJRT3hd/bo+WnXGHhHclPj5NHZssLMIDboHrzX
jLV1hljmthfwg/uOXDGmQUPRFjqr2ZQjo7zGA5SwfVg8Krz7g/qRVy2wUns9TdW/
NYwihDm1YV7qkK/+f1GnMdd70toqb1OZo/fS9FPbBrPLdyi8V+UbnFSUeZu8kCwV
KubAzjLZR4xYCnrlaHhoi208GMd0TOvHMOrdAA0zkQHfhmszGl4N0pbF/EI29Ft2
tQG/FUTG+nzgNOrMCPN2nr3u/UOXLP+gO+2hkyyQjqUP35IyaTYQn10JgBmPTdJL
Ab6E8WL9gTMCd+t/bVjdU/B8W9ruMihKBtWkTfMBCcQ9uNJFCEGzrcMF8hzFYRTs
/4rMDr3NlGYydAnbKPj6bkC5gtjBvh/L08kOdUFyXCMSqIzvJkZJ4241ogl1Awi6
VfdwjF5KZCQo3M1ujpep+1L010wC/yulqLt2brQMO9Nt05dRhgwM3lxy7cnlMNm3
NdfMgi+OG0CzQ+ZUpvo20hCgTDUVgWN9g5R3rar8FJX+Ym3T+ZoEKlShZF+fSRjf
YAUIbUyi7A==
=2qc8
-----END PGP SIGNATURE-----
Merge tag 'timers-core-2026-04-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer core updates from Thomas Gleixner:
- A rework of the hrtimer subsystem to reduce the overhead for
frequently armed timers, especially the hrtick scheduler timer:
- Better timer locality decision
- Simplification of the evaluation of the first expiry time by
keeping track of the neighbor timers in the RB-tree by providing
a RB-tree variant with neighbor links. That avoids walking the
RB-tree on removal to find the next expiry time, but even more
important allows to quickly evaluate whether a timer which is
rearmed changes the position in the RB-tree with the modified
expiry time or not. If not, the dequeue/enqueue sequence which
both can end up in rebalancing can be completely avoided.
- Deferred reprogramming of the underlying clock event device. This
optimizes for the situation where a hrtimer callback sets the
need resched bit. In that case the code attempts to defer the
re-programming of the clock event device up to the point where
the scheduler has picked the next task and has the next hrtick
timer armed. In case that there is no immediate reschedule or
soft interrupts have to be handled before reaching the reschedule
point in the interrupt entry code the clock event is reprogrammed
in one of those code paths to prevent that the timer becomes
stale.
- Support for clocksource coupled clockevents
The TSC deadline timer is coupled to the TSC. The next event is
programmed in TSC time. Currently this is done by converting the
CLOCK_MONOTONIC based expiry value into a relative timeout,
converting it into TSC ticks, reading the TSC adding the delta
ticks and writing the deadline MSR.
As the timekeeping core has the conversion factors for the TSC
already, the whole back and forth conversion can be completely
avoided. The timekeeping core calculates the reverse conversion
factors from nanoseconds to TSC ticks and utilizes the base
timestamps of TSC and CLOCK_MONOTONIC which are updated once per
tick. This allows a direct conversion into the TSC deadline value
without reading the time and as a bonus keeps the deadline
conversion in sync with the TSC conversion factors, which are
updated by adjtimex() on systems with NTP/PTP enabled.
- Allow inlining of the clocksource read and clockevent write
functions when they are tiny enough, e.g. on x86 RDTSC and WRMSR.
With all those enhancements in place a hrtick enabled scheduler
provides the same performance as without hrtick. But also other
hrtimer users obviously benefit from these optimizations.
- Robustness improvements and cleanups of historical sins in the
hrtimer and timekeeping code.
- Rewrite of the clocksource watchdog.
The clocksource watchdog code has over time reached the state of an
impenetrable maze of duct tape and staples. The original design,
which was made in the context of systems far smaller than today, is
based on the assumption that the to be monitored clocksource (TSC)
can be trivially compared against a known to be stable clocksource
(HPET/ACPI-PM timer).
Over the years this rather naive approach turned out to have major
flaws. Long delays between the watchdog invocations can cause wrap
arounds of the reference clocksource. The access to the reference
clocksource degrades on large multi-sockets systems dure to
interconnect congestion. This has been addressed with various
heuristics which degraded the accuracy of the watchdog to the point
that it fails to detect actual TSC problems on older hardware which
exposes slow inter CPU drifts due to firmware manipulating the TSC to
hide SMI time.
The rewrite addresses this by:
- Restricting the validation against the reference clocksource to
the boot CPU which is usually closest to the legacy block which
contains the reference clocksource (HPET/ACPI-PM).
- Do a round robin validation betwen the boot CPU and the other
CPUs based only on the TSC with an algorithm similar to the TSC
synchronization code during CPU hotplug.
- Being more leniant versus remote timeouts
- The usual tiny fixes, cleanups and enhancements all over the place
* tag 'timers-core-2026-04-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (75 commits)
alarmtimer: Access timerqueue node under lock in suspend
hrtimer: Fix incorrect #endif comment for BITS_PER_LONG check
posix-timers: Fix stale function name in comment
timers: Get this_cpu once while clearing the idle state
clocksource: Rewrite watchdog code completely
clocksource: Don't use non-continuous clocksources as watchdog
x86/tsc: Handle CLOCK_SOURCE_VALID_FOR_HRES correctly
MIPS: Don't select CLOCKSOURCE_WATCHDOG
parisc: Remove unused clocksource flags
hrtimer: Add a helper to retrieve a hrtimer from its timerqueue node
hrtimer: Remove trailing comma after HRTIMER_MAX_CLOCK_BASES
hrtimer: Mark index and clockid of clock base as const
hrtimer: Drop unnecessary pointer indirection in hrtimer_expire_entry event
hrtimer: Drop spurious space in 'enum hrtimer_base_type'
hrtimer: Don't zero-initialize ret in hrtimer_nanosleep()
hrtimer: Remove hrtimer_get_expires_ns()
timekeeping: Mark offsets array as const
timekeeping/auxclock: Consistently use raw timekeeper for tk_setup_internals()
timer_list: Print offset as signed integer
tracing: Use explicit array size instead of sentinel elements in symbol printing
...
|
||
|
|
db23954eea |
Update for the core interrupt subsystem:
- Invoke add_interrupt_randomness() in handle_percpu_devid_irq() and
cleanup the workaround in the Hyper-V driver, which would now invoke
it twice on ARM64. Removing it from the driver requires to add it to
the x86 system vector entry point.
- Remove the pointles cpu_read_lock() around reading CPU possible mask,
which is read only after init.
- Add documentation for the interaction between device tree bindings and
the interrupt type defines in irq.h.
- Delete stale defines in the matrix allocator and the equivalent in
loongarch.
-----BEGIN PGP SIGNATURE-----
iQJEBAABCgAuFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmnbt64QHHRnbHhAa2Vy
bmVsLm9yZwAKCRCmGPVMDXSYoY/xD/9hdmaaSXX/JySPatJPgkAhekR8cl/brK9t
K6qhMltg/XoUhmU1y0XSfuNDJwEOa4qvW2DbwfnzknIDr0AXvL39S9oNn+1o9I0x
BjIbWwHTAsuLVyjQesqPWwyQtZ8HJCIM+3Ju2rUYz4jYuY8q15GYsbh6QN0pYDNG
F54/OpuNV42/UhX/O01Gxw930GMGnsMkV6ou9dquap23U4FdlIsoY+zU/b09VEU2
MmJZPGwryPpzhSLbCdOGuWA9oM6wd/FoBFYYU31W5OSXm4B0cRs+weE31WMogYKM
lqudBuyNAhCIuZ06sdSvBPRswdvkuTUJovrJwG2r+rV6h953ExujNn+/ZxqraPg7
iPK70Kwp/O2uyMFhHp/6r1u4bylK/AL7q6hace4cZFLqB/Htx5BW+hh/H7Cz6Yan
H3cyBz9XIE7K5BI3lotKnlWVFwkrHgwYXUzHHrMq/UPdQKog1IPh/E29JekqtR+p
RS9QzSF+Gs45I7oMP+7P8o5jAZYuuW+cODnXLlQkuz/bJff3QeftFhOKZkzbFk5x
AFT0DREttxVCGcUCQaQYrWhFshp63+eDXgKGQT1YULleFUC1f9KH2W+6KlqjDwFR
rvrUgmpMxkH2JJz1owct6vJ6wyffFNQtPeCTmq+7GM87HNm6Ji6AaRzQjRnhVGCt
mh9VbBTnFw==
=FvJG
-----END PGP SIGNATURE-----
Merge tag 'irq-core-2026-04-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull core irq updates from Thomas Gleixner:
- Invoke add_interrupt_randomness() in handle_percpu_devid_irq() and
cleanup the workaround in the Hyper-V driver, which would now invoke
it twice on ARM64. Removing it from the driver requires to add it to
the x86 system vector entry point
- Remove the pointles cpu_read_lock() around reading CPU possible mask,
which is read only after init
- Add documentation for the interaction between device tree bindings
and the interrupt type defines in irq.h
- Delete stale defines in the matrix allocator and the equivalent in
loongarch
* tag 'irq-core-2026-04-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
Drivers: hv: Move add_interrupt_randomness() to hypervisor callback sysvec
genirq/chip: Invoke add_interrupt_randomness() in handle_percpu_devid_irq()
genirq/affinity: Remove cpus_read_lock() while reading cpu_possible_mask
genirq/matrix, LoongArch: Delete IRQ_MATRIX_BITS leftovers
genirq: Document interaction between <linux/irq.h> and DT binding defines
|
||
|
|
9236eebd13 |
tracing: Fix fully-qualified variable reference printing in histograms
The syntax for fully-qualified variable references in histograms is subsys.event.$var, which is parsed correctly, but not displayed correctly when printing a histogram spec. The current code puts the $ reference at the beginning of the fully-qualified variable name i.e. $subsys.event.var, which is incorrect. Before: trigger info: hist:keys=next_comm:vals=hitcount:wakeup_lat=common_timestamp.usecs-$sched.sched_wakeup.ts0: ... After: trigger info: hist:keys=next_comm:vals=hitcount:wakeup_lat=common_timestamp.usecs-sched.sched_wakeup.$ts0: ... Link: https://patch.msgid.link/5dee9a86d062a4dd68c2214f3d90ac93811e1951.1776112478.git.zanussi@kernel.org Signed-off-by: Tom Zanussi <zanussi@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
|
|
fad217e16f |
tracepoint: balance regfunc() on func_add() failure in tracepoint_add_func()
When a tracepoint goes through the 0 -> 1 transition, tracepoint_add_func()
invokes the subsystem's ext->regfunc() before attempting to install the
new probe via func_add(). If func_add() then fails (for example, when
allocate_probes() cannot allocate a new probe array under memory pressure
and returns -ENOMEM), the function returns the error without calling the
matching ext->unregfunc(), leaving the side effects of regfunc() behind
with no installed probe to justify them.
For syscall tracepoints this is particularly unpleasant: syscall_regfunc()
bumps sys_tracepoint_refcount and sets SYSCALL_TRACEPOINT on every task.
After a leaked failure, the refcount is stuck at a non-zero value with no
consumer, and every task continues paying the syscall trace entry/exit
overhead until reboot. Other subsystems providing regfunc()/unregfunc()
pairs exhibit similarly scoped persistent state.
Mirror the existing 1 -> 0 cleanup and call ext->unregfunc() in the
func_add() error path, gated on the same condition used there so the
unwind is symmetric with the registration.
Fixes:
|
||
|
|
6170922f13 |
ring-buffer: Prevent off-by-one array access in ring_buffer_desc_page()
As pointed out by Smatch, the ring-buffer descriptor array page_va is counted by nr_page_va, but the accessor ring_buffer_desc_page() allows access off by one. Currently, this does not cause problems, as the page ID always comes from a trusted source. Nonetheless, ensure robustness and fix the accessor. While at it, make the page_id unsigned. Link: https://patch.msgid.link/20260410124527.3563970-1-vdonnefort@google.com Reported-by: Dan Carpenter <error27@gmail.com> Signed-off-by: Vincent Donnefort <vdonnefort@google.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
|
|
5ec1d1e97d |
tracing: Rebuild full_name on each hist_field_name() call
hist_field_name() uses a static MAX_FILTER_STR_VAL buffer for fully
qualified variable-reference names, but it currently appends into that
buffer with strcat() without rebuilding it first. As a result, repeated
calls append a new "system.event.field" name onto the previous one,
which can eventually run past the end of full_name.
Build the name with snprintf() on each call and return NULL if the fully
qualified name does not fit in MAX_FILTER_STR_VAL.
Link: https://patch.msgid.link/20260401112224.85582-1-pengpeng@iscas.ac.cn
Fixes:
|
||
|
|
1111e9bd83 |
ring-buffer: Report header_page overwrite as char
The header_page tracefs metadata currently reports overwrite as an int field with size 1. That makes parsers warn about a type and size mismatch even though the field is only used as a one-byte flag within commit. Keep the shared offset with commit as-is, but report overwrite as char so the declared type matches the hardcoded size. The signedness is already carried separately by the emitted signed field. Link: https://patch.msgid.link/20260406165333.46052-1-create0818@163.com Link: https://bugzilla.kernel.org/show_bug.cgi?id=216999 Signed-off-by: Cao Ruichuang <create0818@163.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
|
|
d7c8087a9c |
Power management updates for 7.1-rc1
- Update qcom-hw DT bindings to include Eliza hardware (Abel Vesa)
- Update cpufreq-dt-platdev blocklist (Faruque Ansari)
- Minor updates to driver and dt-bindings for Tegra (Thierry Reding,
Rosen Penev)
- Add MAINTAINERS entry for CPPC driver (Viresh Kumar)
- Add support for new features: CPPC performance priority, Dynamic EPP,
Raw EPP, and new unit tests for them to amd-pstate (Gautham Shenoy,
Mario Limonciello)
- Fix sysfs files being present when HW missing and broken/outdated
documentation in the amd-pstate driver (Ninad Naik, Gautham Shenoy)
- Pass the policy to cpufreq_driver->adjust_perf() to avoid using
cpufreq_cpu_get() in the .adjust_perf() callback in amd-pstate which
leads to a scheduling-while-atomic bug (K Prateek Nayak)
- Clean up dead code in Kconfig for cpufreq (Julian Braha)
- Remove max_freq_req update for pre-existing cpufreq policy and add a
boost_freq_req QoS request to save the boost constraint instead of
overwriting the last scaling_max_freq constraint (Pierre Gondois)
- Embed cpufreq QoS freq_req objects in cpufreq policy so they all
are allocated in one go along with the policy to simplify lifetime
rules and avoid error handling issues (Viresh Kumar)
- Use DMI max speed when CPPC is unavailable in the acpi-cpufreq
scaling driver (Henry Tseng)
- Switch policy_is_shared() in cpufreq to using cpumask_nth() instead
of cpumask_weight() because the former is more efficient (Yury Norov)
- Use sysfs_emit() in sysfs show functions for cpufreq governor
attributes (Thorsten Blum)
- Update intel_pstate to stop returning an error when "off" is written
to its status sysfs attribute while the driver is already off (Fabio
De Francesco)
- Include current frequency in the debug message printed by
__cpufreq_driver_target() (Pengjie Zhang)
- Refine stopped tick handling in the menu cpuidle governor and
rearrange stopped tick handling in the teo cpuidle governor (Rafael
Wysocki)
- Add Panther Lake C-states table to the intel_idle driver (Artem
Bityutskiy)
- Clean up dead dependencies on CPU_IDLE in Kconfig (Julian Braha)
- Simplify cpuidle_register_device() with guard() (Huisong Li)
- Use performance level if available to distinguish between rates in
OPP debugfs (Manivannan Sadhasivam)
- Fix scoped_guard in dev_pm_opp_xlate_required_opp() (Viresh Kumar)
- Return -ENODATA if the snapshot image is not loaded (Alberto Garcia)
- Remove inclusion of crypto/hash.h from hibernate_64.c on x86 (Eric
Biggers)
- Clean up and rearrange the intel_rapl power capping driver to make
the respective interface drivers (TPMI, MSR, and MMOI) hold their
own settings and primitives and consolidate PL4 and PMU support
flags into rapl_defaults (Kuppuswamy Sathyanarayanan)
- Correct kernel-doc function parameter names in the power capping core
code (Randy Dunlap)
- Remove unneeded casting for HZ_PER_KHZ in devfreq (Andy Shevchenko)
- Use _visible attribute to replace create/remove_sysfs_files() in
devfreq (Pengjie Zhang)
- Add Tegra114 support to activity monitor device in tegra30-devfreq as
a preparation to upcoming EMC controller support (Svyatoslav Ryhel)
- Fix mistakes in cpupower man pages, add the boost and epp options to
the cpupower-frequency-info man page, and add the perf-bias option to
the cpupower-info man page (Roberto Ricci)
- Remove unnecessary extern declarations from getopt.h in arguments
parsing functions in cpufreq-set, cpuidle-info, cpuidle-set,
cpupower-info, and cpupower-set utilities (Kaushlendra Kumar)
-----BEGIN PGP SIGNATURE-----
iQFGBAABCAAwFiEEcM8Aw/RY0dgsiRUR7l+9nS/U47UFAmnY9TISHHJqd0Byand5
c29ja2kubmV0AAoJEO5fvZ0v1OO1G9gH/j5mEqfPpiwX6fQ/ZwOGdNOOPVA5w9j4
KPHSMwMD5lZkoaZfasp2vt27KY5SOoVVvRZ2DKkFJ3Jai4I3cUPZYypga2nre1ag
tgzX4vOjcw2r40Eda6ezWl1h4mca/xJJBX7xH2+hn1JY+Y1in37g50CqMIjKh96z
Uugkk6UZytL1XcF55PMhIUgDf6pDtRT5UOW9xOKOkUt8FVWTJ7ei3HaWyV5kDmVq
b5eQ42+OH7y6sWNnoKczFd8fStvh6J/avoJurBEvcOQhMcjaIaB48G19+KjDg73E
NjrVcgG20P2rltBvV2d0J1TKskZHkaP7XjIeWfkwjGZhee3FL7ssS/g=
=fRCO
-----END PGP SIGNATURE-----
Merge tag 'pm-7.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull power management updates from Rafael Wysocki:
"Once again, cpufreq is the most active development area, mostly
because of the new feature additions and documentation updates in the
amd-pstate driver, but there are also changes in the cpufreq core
related to boost support and other assorted updates elsewhere.
Next up are power capping changes due to the major cleanup of the
Intel RAPL driver.
On the cpuidle front, a new C-states table for Intel Panther Lake is
added to the intel_idle driver, the stopped tick handling in the menu
and teo governors is updated, and there are a couple of cleanups.
Apart from the above, support for Tegra114 is added to devfreq and
there are assorted cleanups of that code, there are also two updates
of the operating performance points (OPP) library, two minor updates
related to hibernation, and cpupower utility man pages updates and
cleanups.
Specifics:
- Update qcom-hw DT bindings to include Eliza hardware (Abel Vesa)
- Update cpufreq-dt-platdev blocklist (Faruque Ansari)
- Minor updates to driver and dt-bindings for Tegra (Thierry Reding,
Rosen Penev)
- Add MAINTAINERS entry for CPPC driver (Viresh Kumar)
- Add support for new features: CPPC performance priority, Dynamic
EPP, Raw EPP, and new unit tests for them to amd-pstate (Gautham
Shenoy, Mario Limonciello)
- Fix sysfs files being present when HW missing and broken/outdated
documentation in the amd-pstate driver (Ninad Naik, Gautham Shenoy)
- Pass the policy to cpufreq_driver->adjust_perf() to avoid using
cpufreq_cpu_get() in the .adjust_perf() callback in amd-pstate
which leads to a scheduling-while-atomic bug (K Prateek Nayak)
- Clean up dead code in Kconfig for cpufreq (Julian Braha)
- Remove max_freq_req update for pre-existing cpufreq policy and add
a boost_freq_req QoS request to save the boost constraint instead
of overwriting the last scaling_max_freq constraint (Pierre
Gondois)
- Embed cpufreq QoS freq_req objects in cpufreq policy so they all
are allocated in one go along with the policy to simplify lifetime
rules and avoid error handling issues (Viresh Kumar)
- Use DMI max speed when CPPC is unavailable in the acpi-cpufreq
scaling driver (Henry Tseng)
- Switch policy_is_shared() in cpufreq to using cpumask_nth() instead
of cpumask_weight() because the former is more efficient (Yury
Norov)
- Use sysfs_emit() in sysfs show functions for cpufreq governor
attributes (Thorsten Blum)
- Update intel_pstate to stop returning an error when "off" is
written to its status sysfs attribute while the driver is already
off (Fabio De Francesco)
- Include current frequency in the debug message printed by
__cpufreq_driver_target() (Pengjie Zhang)
- Refine stopped tick handling in the menu cpuidle governor and
rearrange stopped tick handling in the teo cpuidle governor (Rafael
Wysocki)
- Add Panther Lake C-states table to the intel_idle driver (Artem
Bityutskiy)
- Clean up dead dependencies on CPU_IDLE in Kconfig (Julian Braha)
- Simplify cpuidle_register_device() with guard() (Huisong Li)
- Use performance level if available to distinguish between rates in
OPP debugfs (Manivannan Sadhasivam)
- Fix scoped_guard in dev_pm_opp_xlate_required_opp() (Viresh Kumar)
- Return -ENODATA if the snapshot image is not loaded (Alberto
Garcia)
- Remove inclusion of crypto/hash.h from hibernate_64.c on x86 (Eric
Biggers)
- Clean up and rearrange the intel_rapl power capping driver to make
the respective interface drivers (TPMI, MSR, and MMOI) hold their
own settings and primitives and consolidate PL4 and PMU support
flags into rapl_defaults (Kuppuswamy Sathyanarayanan)
- Correct kernel-doc function parameter names in the power capping
core code (Randy Dunlap)
- Remove unneeded casting for HZ_PER_KHZ in devfreq (Andy Shevchenko)
- Use _visible attribute to replace create/remove_sysfs_files() in
devfreq (Pengjie Zhang)
- Add Tegra114 support to activity monitor device in tegra30-devfreq
as a preparation to upcoming EMC controller support (Svyatoslav
Ryhel)
- Fix mistakes in cpupower man pages, add the boost and epp options
to the cpupower-frequency-info man page, and add the perf-bias
option to the cpupower-info man page (Roberto Ricci)
- Remove unnecessary extern declarations from getopt.h in arguments
parsing functions in cpufreq-set, cpuidle-info, cpuidle-set,
cpupower-info, and cpupower-set utilities (Kaushlendra Kumar)"
* tag 'pm-7.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (74 commits)
cpufreq/amd-pstate: Add POWER_SUPPLY select for dynamic EPP
cpupower: remove extern declarations in cmd functions
cpuidle: Simplify cpuidle_register_device() with guard()
PM / devfreq: tegra30-devfreq: add support for Tegra114
PM / devfreq: use _visible attribute to replace create/remove_sysfs_files()
PM / devfreq: Remove unneeded casting for HZ_PER_KHZ
MAINTAINERS: amd-pstate: Step down as maintainer, add Prateek as reviewer
cpufreq: Pass the policy to cpufreq_driver->adjust_perf()
cpufreq/amd-pstate: Pass the policy to amd_pstate_update()
cpufreq/amd-pstate-ut: Add a unit test for raw EPP
cpufreq/amd-pstate: Add support for raw EPP writes
cpufreq/amd-pstate: Add support for platform profile class
cpufreq/amd-pstate: add kernel command line to override dynamic epp
cpufreq/amd-pstate: Add dynamic energy performance preference
Documentation: amd-pstate: fix dead links in the reference section
cpufreq/amd-pstate: Cache the max frequency in cpudata
Documentation/amd-pstate: Add documentation for amd_pstate_floor_{freq,count}
Documentation/amd-pstate: List amd_pstate_prefcore_ranking sysfs file
Documentation/amd-pstate: List amd_pstate_hw_prefcore sysfs file
amd-pstate-ut: Add a testcase to validate the visibility of driver attributes
...
|
||
|
|
4793dae01f |
Driver core changes for 7.1-rc1
- debugfs:
- Fix NULL pointer dereference in debugfs_create_str()
- Fix misplaced EXPORT_SYMBOL_GPL for debugfs_create_str()
- Fix soundwire debugfs NULL pointer dereference from uninitialized
firmware_file
- device property:
- Make fwnode flags modifications thread safe; widen the field to
unsigned long and use set_bit() / clear_bit() based accessors
- Document how to check for the property presence
- devres:
- Separate struct devres_node from its "subclasses" (struct devres,
struct devres_group); give struct devres_node its own release and
free callbacks for per-type dispatch
- Introduce struct devres_action for devres actions, avoiding the
ARCH_DMA_MINALIGN alignment overhead of struct devres
- Export struct devres_node and its init/add/remove/dbginfo
primitives for use by Rust Devres<T>
- Fix missing node debug info in devm_krealloc()
- Use guard(spinlock_irqsave) where applicable; consolidate unlock
paths in devres_release_group()
- driver_override:
- Convert PCI, WMI, vdpa, s390/cio, s390/ap, and fsl-mc to the
generic driver_override infrastructure, replacing per-bus
driver_override strings, sysfs attributes, and match logic; fixes
a potential UAF from unsynchronized access to driver_override in
bus match() callbacks
- Simplify __device_set_driver_override() logic
- kernfs:
- Send IN_DELETE_SELF and IN_IGNORED inotify events on kernfs
file and directory removal
- Add corresponding selftests for memcg
- platform:
- Allow attaching software nodes when creating platform devices via
a new 'swnode' field in struct platform_device_info
- Add kerneldoc for struct platform_device_info
- software node:
- Move software node initialization from postcore_initcall() to
driver_init(), making it available early in the boot process
- Move kernel_kobj initialization (ksysfs_init) earlier to support
the above
- Remove software_node_exit(); dead code in a built-in unit
- SoC:
- Introduce of_machine_read_compatible() and of_machine_read_model()
OF helpers and export soc_attr_read_machine() to replace direct
accesses to of_root from SoC drivers; also enables
CONFIG_COMPILE_TEST coverage for these drivers
- sysfs:
- Constify attribute group array pointers to
'const struct attribute_group *const *' in sysfs functions,
device_add_groups() / device_remove_groups(), and struct class
- Rust:
- Devres:
- Embed struct devres_node directly in Devres<T> instead of going
through devm_add_action(), avoiding the extra allocation and
the unnecessary ARCH_DMA_MINALIGN alignment
- I/O:
- Turn IoCapable from a marker trait into a functional trait
carrying the raw I/O accessor implementation (io_read /
io_write), providing working defaults for the per-type Io
methods
- Add RelaxedMmio wrapper type, making relaxed accessors usable
in code generic over the Io trait
- Remove overloaded per-type Io methods and per-backend macros
from Mmio and PCI ConfigSpace
- I/O (Register):
- Add IoLoc trait and generic read/write/update methods to the Io
trait, making I/O operations parameterizable by typed locations
- Add register! macro for defining hardware register types with
typed bitfield accessors backed by Bounded values; supports
direct, relative, and array register addressing
- Add write_reg() / try_write_reg() and LocatedRegister trait
- Update PCI sample driver to demonstrate the register! macro
Example:
```
register! {
/// UART control register.
CTRL(u32) @ 0x18 {
/// Receiver enable.
19:19 rx_enable => bool;
/// Parity configuration.
14:13 parity ?=> Parity;
}
/// FIFO watermark and counter register.
WATER(u32) @ 0x2c {
/// Number of datawords in the receive FIFO.
26:24 rx_count;
/// RX interrupt threshold.
17:16 rx_water;
}
}
impl WATER {
fn rx_above_watermark(&self) -> bool {
self.rx_count() > self.rx_water()
}
}
fn init(bar: &pci::Bar<BAR0_SIZE>) {
let water = WATER::zeroed()
.with_const_rx_water::<1>(); // > 3 would not compile
bar.write_reg(water);
let ctrl = CTRL::zeroed()
.with_parity(Parity::Even)
.with_rx_enable(true);
bar.write_reg(ctrl);
}
fn handle_rx(bar: &pci::Bar<BAR0_SIZE>) {
if bar.read(WATER).rx_above_watermark() {
// drain the FIFO
}
}
fn set_parity(bar: &pci::Bar<BAR0_SIZE>, parity: Parity) {
bar.update(CTRL, |r| r.with_parity(parity));
}
```
- IRQ:
- Move 'static bounds from where clauses to trait declarations
for IRQ handler traits
- Misc:
- Enable the generic_arg_infer Rust feature
- Extend Bounded with shift operations, single-bit bool conversion,
and const get()
- Misc:
- Make deferred_probe_timeout default a Kconfig option
- Drop auxiliary_dev_pm_ops; the PM core falls back to driver PM
callbacks when no bus type PM ops are set
- Add conditional guard support for device_lock()
- Add ksysfs.c to the DRIVER CORE MAINTAINERS entry
- Fix kernel-doc warnings in base.h
- Fix stale reference to memory_block_add_nid() in documentation
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQS2q/xV6QjXAdC7k+1FlHeO1qrKLgUCadl5SwAKCRBFlHeO1qrK
LpjDAQCSG3vYznwrngfpmRU5bCB9sdUy/pZiX5px1357+amJkwEA9LgIVQvtHAZW
ZXcQ7Jr+mR3mJEdlatbkWHp3w1VHqAQ=
=y1DV
-----END PGP SIGNATURE-----
Merge tag 'driver-core-7.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/driver-core/driver-core
Pull driver core updates from Danilo Krummrich:
"debugfs:
- Fix NULL pointer dereference in debugfs_create_str()
- Fix misplaced EXPORT_SYMBOL_GPL for debugfs_create_str()
- Fix soundwire debugfs NULL pointer dereference from uninitialized
firmware_file
device property:
- Make fwnode flags modifications thread safe; widen the field to
unsigned long and use set_bit() / clear_bit() based accessors
- Document how to check for the property presence
devres:
- Separate struct devres_node from its "subclasses" (struct devres,
struct devres_group); give struct devres_node its own release and
free callbacks for per-type dispatch
- Introduce struct devres_action for devres actions, avoiding the
ARCH_DMA_MINALIGN alignment overhead of struct devres
- Export struct devres_node and its init/add/remove/dbginfo
primitives for use by Rust Devres<T>
- Fix missing node debug info in devm_krealloc()
- Use guard(spinlock_irqsave) where applicable; consolidate unlock
paths in devres_release_group()
driver_override:
- Convert PCI, WMI, vdpa, s390/cio, s390/ap, and fsl-mc to the
generic driver_override infrastructure, replacing per-bus
driver_override strings, sysfs attributes, and match logic; fixes a
potential UAF from unsynchronized access to driver_override in bus
match() callbacks
- Simplify __device_set_driver_override() logic
kernfs:
- Send IN_DELETE_SELF and IN_IGNORED inotify events on kernfs file
and directory removal
- Add corresponding selftests for memcg
platform:
- Allow attaching software nodes when creating platform devices via a
new 'swnode' field in struct platform_device_info
- Add kerneldoc for struct platform_device_info
software node:
- Move software node initialization from postcore_initcall() to
driver_init(), making it available early in the boot process
- Move kernel_kobj initialization (ksysfs_init) earlier to support
the above
- Remove software_node_exit(); dead code in a built-in unit
SoC:
- Introduce of_machine_read_compatible() and of_machine_read_model()
OF helpers and export soc_attr_read_machine() to replace direct
accesses to of_root from SoC drivers; also enables
CONFIG_COMPILE_TEST coverage for these drivers
sysfs:
- Constify attribute group array pointers to
'const struct attribute_group *const *' in sysfs functions,
device_add_groups() / device_remove_groups(), and struct class
Rust:
- Devres:
- Embed struct devres_node directly in Devres<T> instead of going
through devm_add_action(), avoiding the extra allocation and the
unnecessary ARCH_DMA_MINALIGN alignment
- I/O:
- Turn IoCapable from a marker trait into a functional trait
carrying the raw I/O accessor implementation (io_read /
io_write), providing working defaults for the per-type Io
methods
- Add RelaxedMmio wrapper type, making relaxed accessors usable in
code generic over the Io trait
- Remove overloaded per-type Io methods and per-backend macros
from Mmio and PCI ConfigSpace
- I/O (Register):
- Add IoLoc trait and generic read/write/update methods to the Io
trait, making I/O operations parameterizable by typed locations
- Add register! macro for defining hardware register types with
typed bitfield accessors backed by Bounded values; supports
direct, relative, and array register addressing
- Add write_reg() / try_write_reg() and LocatedRegister trait
- Update PCI sample driver to demonstrate the register! macro
Example:
```
register! {
/// UART control register.
CTRL(u32) @ 0x18 {
/// Receiver enable.
19:19 rx_enable => bool;
/// Parity configuration.
14:13 parity ?=> Parity;
}
/// FIFO watermark and counter register.
WATER(u32) @ 0x2c {
/// Number of datawords in the receive FIFO.
26:24 rx_count;
/// RX interrupt threshold.
17:16 rx_water;
}
}
impl WATER {
fn rx_above_watermark(&self) -> bool {
self.rx_count() > self.rx_water()
}
}
fn init(bar: &pci::Bar<BAR0_SIZE>) {
let water = WATER::zeroed()
.with_const_rx_water::<1>(); // > 3 would not compile
bar.write_reg(water);
let ctrl = CTRL::zeroed()
.with_parity(Parity::Even)
.with_rx_enable(true);
bar.write_reg(ctrl);
}
fn handle_rx(bar: &pci::Bar<BAR0_SIZE>) {
if bar.read(WATER).rx_above_watermark() {
// drain the FIFO
}
}
fn set_parity(bar: &pci::Bar<BAR0_SIZE>, parity: Parity) {
bar.update(CTRL, |r| r.with_parity(parity));
}
```
- IRQ:
- Move 'static bounds from where clauses to trait declarations for
IRQ handler traits
- Misc:
- Enable the generic_arg_infer Rust feature
- Extend Bounded with shift operations, single-bit bool
conversion, and const get()
Misc:
- Make deferred_probe_timeout default a Kconfig option
- Drop auxiliary_dev_pm_ops; the PM core falls back to driver PM
callbacks when no bus type PM ops are set
- Add conditional guard support for device_lock()
- Add ksysfs.c to the DRIVER CORE MAINTAINERS entry
- Fix kernel-doc warnings in base.h
- Fix stale reference to memory_block_add_nid() in documentation"
* tag 'driver-core-7.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/driver-core/driver-core: (67 commits)
bus: fsl-mc: use generic driver_override infrastructure
s390/ap: use generic driver_override infrastructure
s390/cio: use generic driver_override infrastructure
vdpa: use generic driver_override infrastructure
platform/wmi: use generic driver_override infrastructure
PCI: use generic driver_override infrastructure
driver core: make software nodes available earlier
software node: remove software_node_exit()
kernel: ksysfs: initialize kernel_kobj earlier
MAINTAINERS: add ksysfs.c to the DRIVER CORE entry
drivers/base/memory: fix stale reference to memory_block_add_nid()
device property: Document how to check for the property presence
soundwire: debugfs: initialize firmware_file to empty string
debugfs: fix placement of EXPORT_SYMBOL_GPL for debugfs_create_str()
debugfs: check for NULL pointer in debugfs_create_str()
driver core: Make deferred_probe_timeout default a Kconfig option
driver core: simplify __device_set_driver_override() clearing logic
driver core: auxiliary bus: Drop auxiliary_dev_pm_ops
device property: Make modifications of fwnode "flags" thread safe
rust: devres: embed struct devres_node directly
...
|
||
|
|
d568788baa |
hardening updates for v7.1-rc1
- randomize_kstack: Improve implementation across arches (Ryan Roberts) - lkdtm/fortify: Drop unneeded FORTIFY_STR_OBJECT test - refcount: Remove unused __signed_wrap function annotations -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQRSPkdeREjth1dHnSE2KwveOeQkuwUCad16PwAKCRA2KwveOeQk u7crAP4qz8gXCjes76KsZm/YQS8PtOG5JroAVu5Oa4ohw0RfaQD+K/XLow1plcNF 4Bi8zSuv2ifcLysh9qEAbx5+wcHijgo= =woB3 -----END PGP SIGNATURE----- Merge tag 'hardening-v7.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux Pull hardening updates from Kees Cook: - randomize_kstack: Improve implementation across arches (Ryan Roberts) - lkdtm/fortify: Drop unneeded FORTIFY_STR_OBJECT test - refcount: Remove unused __signed_wrap function annotations * tag 'hardening-v7.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: lkdtm/fortify: Drop unneeded FORTIFY_STR_OBJECT test refcount: Remove unused __signed_wrap function annotations randomize_kstack: Unify random source across arches randomize_kstack: Maintain kstack_offset per task |
||
|
|
de639344bb |
audit/stable-7.1 PR 20260410
-----BEGIN PGP SIGNATURE----- iQJIBAABCgAyFiEES0KozwfymdVUl37v6iDy2pc3iXMFAmnZegUUHHBhdWxAcGF1 bC1tb29yZS5jb20ACgkQ6iDy2pc3iXNydxAApWBVRWp/AY7jtCQGWRYAa+6y+bQ0 RWfu8putXaOyk3NTeWP64e87FKsdByR/yflefYxMH+bXc2mwbuUZYAreEVmLCJ1P QxHKuwCkCNOz90n/Y7nlDSDK1GYdzlFkCgidfr4iNSCD58WMTtNNpZREzaNiR8a1 PZ3bFvJH+S7BRCGA6/S/20rNYeWTga56pSrWt6VpMwVHGJ1R4DsD60pT8z0NqMYI BTBLeZ36HlZdwUp+APldKNNDRKG1ZQVKJRO68qcSkopr4vQzK7yL/SJsCdU8MHj2 LccXTCTHHWJbpdiE7BtzPO9UobVZIdcz2wsnJHWxzHYtXlPolgM7F31111GL4HSv V/mq5o7dR3h6nn+1gkWHjOpd/f3J3xl3FaJsH9FIIhPmCRHb4oZI0WG0ZH3mHZBl o6aaWja3PBl0XNA+q87DQVBYDOyVNB4RjuaKy+d7hm4eronTRaZkg3zutrB6/XxP uFbp+Q3diWNMsYO52DKFThL/sStmnnCMIRJuTxd8QaPhLVakaFSkWZycSUH4HijD 8WMk3e4yo3TeD6rCAognwKclj0vCMHS3TLOMXlY0vMD04gwXJ2S81yfyXGT4F5De KkXj61TFMxPyiZ6yrxk86BmoqHL0DUiCDn1rMKbNdIncHedKZoNuy+O/XNLS6No/ hLRvXSI7MNthJ5E= =1rY2 -----END PGP SIGNATURE----- Merge tag 'audit-pr-20260410' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit Pull audit updates from Paul Moore: - Improved handling of unknown status requests from userspace The current kernel code ignores unknown/unused request bits sent from userspace and returns an error code based on the results of the request(s) it does understand. The patch from Ricardo fixes this so that unknown requests return an -EINVAL to userspace, making compatibility a bit easier moving forward. - A number of small style and formatting cleanups * tag 'audit-pr-20260410' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit: audit: handle unknown status requests in audit_receive_msg() audit: fix coding style issues audit: remove redundant initialization of static variables to 0 audit: fix whitespace alignment in include/uapi/linux/audit.h |
||
|
|
ef3da345cc |
vfs-7.1-rc1.misc
Please consider pulling these changes from the signed vfs-7.1-rc1.misc tag.
Thanks!
Christian
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCadjZCwAKCRCRxhvAZXjc
ohhBAQCAmQMlMRAXAgUZFYMTZpeQlcujP5rv+/vT2Tf/xS76YwD/dRDaw1FH294+
qtk/Z1NjleNixzE2sld1K9J32NxeyAc=
=+g9q
-----END PGP SIGNATURE-----
Merge tag 'vfs-7.1-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull misc vfs updates from Christian Brauner:
"Features:
- coredump: add tracepoint for coredump events
- fs: hide file and bfile caches behind runtime const machinery
Fixes:
- fix architecture-specific compat_ftruncate64 implementations
- dcache: Limit the minimal number of bucket to two
- fs/omfs: reject s_sys_blocksize smaller than OMFS_DIR_START
- fs/mbcache: cancel shrink work before destroying the cache
- dcache: permit dynamic_dname()s up to NAME_MAX
Cleanups:
- remove or unexport unused fs_context infrastructure
- trivial ->setattr cleanups
- selftests/filesystems: Assume that TIOCGPTPEER is defined
- writeback: fix kernel-doc function name mismatch for wb_put_many()
- autofs: replace manual symlink buffer allocation in autofs_dir_symlink
- init/initramfs.c: trivial fix: FSM -> Finite-state machine
- fs: remove stale and duplicate forward declarations
- readdir: Introduce dirent_size()
- fs: Replace user_access_{begin/end} by scoped user access
- kernel: acct: fix duplicate word in comment
- fs: write a better comment in step_into() concerning .mnt assignment
- fs: attr: fix comment formatting and spelling issues"
* tag 'vfs-7.1-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (28 commits)
dcache: permit dynamic_dname()s up to NAME_MAX
fs: attr: fix comment formatting and spelling issues
fs: hide file and bfile caches behind runtime const machinery
fs: write a better comment in step_into() concerning .mnt assignment
proc: rename proc_notify_change to proc_setattr
proc: rename proc_setattr to proc_nochmod_setattr
affs: rename affs_notify_change to affs_setattr
adfs: rename adfs_notify_change to adfs_setattr
hfs: update comments on hfs_inode_setattr
kernel: acct: fix duplicate word in comment
fs: Replace user_access_{begin/end} by scoped user access
readdir: Introduce dirent_size()
coredump: add tracepoint for coredump events
fs: remove do_sys_truncate
fs: pass on FTRUNCATE_* flags to do_truncate
fs: fix archiecture-specific compat_ftruncate64
fs: remove stale and duplicate forward declarations
init/initramfs.c: trivial fix: FSM -> Finite-state machine
autofs: replace manual symlink buffer allocation in autofs_dir_symlink
fs/mbcache: cancel shrink work before destroying the cache
...
|
||
|
|
07c3ef5822 |
vfs-7.1-rc1.pidfs
Please consider pulling these changes from the signed vfs-7.1-rc1.pidfs tag.
Thanks!
Christian
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCadjZCwAKCRCRxhvAZXjc
omfuAQDckt5g7vxBr9hKdyrq1//nsu44fst/mRqr2iSYjuKfPQD/VN6Lw9e56Y/q
l4hHxsPPrSSxbijwng7im36iPIGdfwI=
=BbFh
-----END PGP SIGNATURE-----
Merge tag 'vfs-7.1-rc1.pidfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull clone and pidfs updates from Christian Brauner:
"Add three new clone3() flags for pidfd-based process lifecycle
management.
CLONE_AUTOREAP:
CLONE_AUTOREAP makes a child process auto-reap on exit without ever
becoming a zombie. This is a per-process property in contrast to
the existing auto-reap mechanism via SA_NOCLDWAIT or SIG_IGN for
SIGCHLD which applies to all children of a given parent.
Currently the only way to automatically reap children is to set
SA_NOCLDWAIT or SIG_IGN on SIGCHLD. This is a parent-scoped
property affecting all children which makes it unsuitable for
libraries or applications that need selective auto-reaping of
specific children while still being able to wait() on others.
CLONE_AUTOREAP stores an autoreap flag in the child's
signal_struct. When the child exits do_notify_parent() checks this
flag and causes exit_notify() to transition the task directly to
EXIT_DEAD. Since the flag lives on the child it survives
reparenting: if the original parent exits and the child is
reparented to a subreaper or init the child still auto-reaps when
it eventually exits. This is cleaner than forcing the subreaper to
get SIGCHLD and then reaping it. If the parent doesn't care the
subreaper won't care. If there's a subreaper that would care it
would be easy enough to add a prctl() that either just turns back
on SIGCHLD and turns off auto-reaping or a prctl() that just
notifies the subreaper whenever a child is reparented to it.
CLONE_AUTOREAP can be combined with CLONE_PIDFD to allow the parent
to monitor the child's exit via poll() and retrieve exit status via
PIDFD_GET_INFO. Without CLONE_PIDFD it provides a fire-and-forget
pattern. No exit signal is delivered so exit_signal must be zero.
CLONE_THREAD and CLONE_PARENT are rejected: CLONE_THREAD because
autoreap is a process-level property, and CLONE_PARENT because an
autoreap child reparented via CLONE_PARENT could become an
invisible zombie under a parent that never calls wait().
The flag is not inherited by the autoreap process's own children.
Each child that should be autoreaped must be explicitly created
with CLONE_AUTOREAP.
CLONE_NNP:
CLONE_NNP sets no_new_privs on the child at clone time. Unlike
prctl(PR_SET_NO_NEW_PRIVS) which a process sets on itself,
CLONE_NNP allows the parent to impose no_new_privs on the child at
creation without affecting the parent's own privileges.
CLONE_THREAD is rejected because threads share credentials.
CLONE_NNP is useful on its own for any spawn-and-sandbox pattern
but was specifically introduced to enable unprivileged usage of
CLONE_PIDFD_AUTOKILL.
CLONE_PIDFD_AUTOKILL:
This flag ties a child's lifetime to the pidfd returned from
clone3(). When the last reference to the struct file created by
clone3() is closed the kernel sends SIGKILL to the child. A pidfd
obtained via pidfd_open() for the same process does not keep the
child alive and does not trigger autokill - only the specific
struct file from clone3() has this property. This is useful for
container runtimes, service managers, and sandboxed subprocess
execution - any scenario where the child must die if the parent
crashes or abandons the pidfd or just wants a throwaway helper
process.
CLONE_PIDFD_AUTOKILL requires both CLONE_PIDFD and CLONE_AUTOREAP.
It requires CLONE_PIDFD because the whole point is tying the
child's lifetime to the pidfd. It requires CLONE_AUTOREAP because a
killed child with no one to reap it would become a zombie - the
primary use case is the parent crashing or abandoning the pidfd so
no one is around to call waitpid(). CLONE_THREAD is rejected
because autokill targets a process not a thread.
If CLONE_NNP is specified together with CLONE_PIDFD_AUTOKILL an
unprivileged user may spawn a process that is autokilled. The child
cannot escalate privileges via setuid/setgid exec after being
spawned. If CLONE_PIDFD_AUTOKILL is specified without CLONE_NNP the
caller must have have CAP_SYS_ADMIN in its user namespace"
* tag 'vfs-7.1-rc1.pidfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
selftests: check pidfd_info->coredump_code correctness
pidfds: add coredump_code field to pidfd_info
kselftest/coredump: reintroduce null pointer dereference
selftests/pidfd: add CLONE_PIDFD_AUTOKILL tests
selftests/pidfd: add CLONE_NNP tests
selftests/pidfd: add CLONE_AUTOREAP tests
pidfd: add CLONE_PIDFD_AUTOKILL
clone: add CLONE_NNP
clone: add CLONE_AUTOREAP
|
||
|
|
dc0dfa7338 |
namespaces-7.1-rc1.misc
Please consider pulling these changes from the signed namespaces-7.1-rc1.misc tag. Thanks! Christian -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCadjZCwAKCRCRxhvAZXjc ols1AP9rA4gjJOTwHg0/pc+GL4qLSqUP3O4KeuJ8qccBcEUITAD/frpUjR11Ibw/ F78/x1QhDPI8PCcw7kEyAPTfDb9VsgU= =5HBm -----END PGP SIGNATURE----- Merge tag 'namespaces-7.1-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull namespace update from Christian Brauner: "Add two simple helper macros for the namespace infrastructure" * tag 'namespaces-7.1-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: nsproxy: Add FOR_EACH_NS_TYPE() X-macro and CLONE_NS_ALL |
||
|
|
b7d74ea0fd |
vfs-7.1-rc1.kino
Please consider pulling these changes from the signed vfs-7.1-rc1.kino tag. Thanks! Christian -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCadjZCgAKCRCRxhvAZXjc otmnAP4sbsxZQdz2TG2hJuOwnEZOkkxZQOUMc3ERVyZaWXIeTAEA7e5M+8FpoG9n 8ipO76UoaXdGLESrqVdp9EOhLqOW7QY= =uMeJ -----END PGP SIGNATURE----- Merge tag 'vfs-7.1-rc1.kino' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs i_ino updates from Christian Brauner: "For historical reasons, the inode->i_ino field is an unsigned long, which means that it's 32 bits on 32 bit architectures. This has caused a number of filesystems to implement hacks to hash a 64-bit identifier into a 32-bit field, and deprives us of a universal identifier field for an inode. This changes the inode->i_ino field from an unsigned long to a u64. This shouldn't make any material difference on 64-bit hosts, but 32-bit hosts will see struct inode grow by at least 4 bytes. This could have effects on slabcache sizes and field alignment. The bulk of the changes are to format strings and tracepoints, since the kernel itself doesn't care that much about the i_ino field. The first patch changes some vfs function arguments, so check that one out carefully. With this change, we may be able to shrink some inode structures. For instance, struct nfs_inode has a fileid field that holds the 64-bit inode number. With this set of changes, that field could be eliminated. I'd rather leave that sort of cleanups for later just to keep this simple" * tag 'vfs-7.1-rc1.kino' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: nilfs2: fix 64-bit division operations in nilfs_bmap_find_target_in_group() EVM: add comment describing why ino field is still unsigned long vfs: remove externs from fs.h on functions modified by i_ino widening treewide: fix missed i_ino format specifier conversions ext4: fix signed format specifier in ext4_load_inode trace event treewide: change inode->i_ino from unsigned long to u64 nilfs2: widen trace event i_ino fields to u64 f2fs: widen trace event i_ino fields to u64 ext4: widen trace event i_ino fields to u64 zonefs: widen trace event i_ino fields to u64 hugetlbfs: widen trace event i_ino fields to u64 ext2: widen trace event i_ino fields to u64 cachefiles: widen trace event i_ino fields to u64 vfs: widen trace event i_ino fields to u64 net: change sock.sk_ino and sock_i_ino() to u64 audit: widen ino fields to u64 vfs: widen inode hash/lookup functions to u64 |
||
|
|
28483203f7 |
RCU changes for v7.1
NOCB CPU management:
- Consolidate rcu_nocb_cpu_offload() and rcu_nocb_cpu_deoffload() to reduce
code duplication.
- Extract nocb_bypass_needs_flush() helper to reduce duplication in NOCB
bypass path.
rcutorture/torture infrastructure:
- Add NOCB01 config for RCU_LAZY torture testing.
- Add NOCB02 config for NOCB poll mode testing.
- Add TRIVIAL-PREEMPT config for textbook-style preemptible RCU torture.
- Test call_srcu() with preemption both disabled and enabled.
- Remove kvm-check-branches.sh in favor of kvm-series.sh.
- Make hangs more visible in torture.sh output.
- Add informative message for tests without a recheck file.
- Fix numeric test comparison in srcu_lockdep.sh.
- Use torture_shutdown_init() in refscale and rcuscale instead of open-coded
shutdown functions.
- Fix modulo-zero error in torture_hrtimeout_ns().
SRCU:
- Fix SRCU read flavor macro comments.
- Fix s/they disables/they disable/ typo in srcu_read_unlock_fast().
RCU Tasks:
- Document that RCU Tasks Trace grace periods now imply RCU grace periods.
- Remove unnecessary smp_store_release() in cblist_init_generic().
RCU stall:
- Add BOOTPARAM_RCU_STALL_PANIC Kconfig option to allow triggering a kernel
panic on RCU stall via kernel boot parameter.
-----BEGIN PGP SIGNATURE-----
iQJKBAABCgA0FiEEcoCIrlGe4gjE06JJqA4nf2o45hAFAmnMBEAWHGpvZWxhZ25l
bGZAbnZpZGlhLmNvbQAKCRCoDid/ajjmEBCdD/4rUNIocKBmSJWHFeDzl8rvMGSx
RISjeaj8m+HUsQRSq7hecygUkSbGEQuj8KwFDJOsmuzhMRg3hR7yEz1AEphQiiIR
TN0IdGblQd57KqAk6sHsXA5mNjSWZfJQuOkvHsMxIJdhhDRDHPTGtaCBwlYoLc6b
de31LCvCfRL1QgiNAznTlPLRyehI9/+6VASLvKbmi3sQQv9B/KBBWydkryvMJbfW
uc05SUn2qQq2h/NNeYvo6RllXWMPqO4/0ll8Y71ltwFhg8JdS9qXDSFX5cWlLISA
uBA4pC9bk0xi7RGE5v4n9+fNNTr64Tjtd9QoxFvd/Jb6jdKDGjabwojoAYXCsdNw
r8Dp3fKwGofruhUqgO6LyHyG54Ro2paVYjsqr5HW9C6jalZ9+HmH5wOnKw2h5E/i
VRAM6MpwQ869ZeHhDxF8meRKf3+E+UIR95qfNZkABcFnwxgXheT+RASP/UOnoTM7
Zexuzp9c6GQu34MTt0Rz9vNfJta+ZmPlnccan1T3DgoHvcWip7IDhUD1Vqerd9/4
VK4JeeKI0t/fR5ydjs5qiA/O5caIFqtQTD0Ag2vLGQiP72pe2lRroIjD3xNZ2bWS
DN4k6ZCxR2hdPUcWzdYlGe2pYmvAlqCBxZ4fNgrs55uP2EuqW+Y8aJd9RwvRPb2B
wmTZepOh7Ct9+cjTBA==
=wjsx
-----END PGP SIGNATURE-----
Merge tag 'rcu.2026.03.31a' of git://git.kernel.org/pub/scm/linux/kernel/git/rcu/linux
Pull RCU updates from Joel Fernandes:
"NOCB CPU management:
- Consolidate rcu_nocb_cpu_offload() and rcu_nocb_cpu_deoffload() to
reduce code duplication
- Extract nocb_bypass_needs_flush() helper to reduce duplication in
NOCB bypass path
rcutorture/torture infrastructure:
- Add NOCB01 config for RCU_LAZY torture testing
- Add NOCB02 config for NOCB poll mode testing
- Add TRIVIAL-PREEMPT config for textbook-style preemptible RCU
torture
- Test call_srcu() with preemption both disabled and enabled
- Remove kvm-check-branches.sh in favor of kvm-series.sh
- Make hangs more visible in torture.sh output
- Add informative message for tests without a recheck file
- Fix numeric test comparison in srcu_lockdep.sh
- Use torture_shutdown_init() in refscale and rcuscale instead of
open-coded shutdown functions
- Fix modulo-zero error in torture_hrtimeout_ns().
SRCU:
- Fix SRCU read flavor macro comments
- Fix s/they disables/they disable/ typo in srcu_read_unlock_fast()
RCU Tasks:
- Document that RCU Tasks Trace grace periods now imply RCU grace
periods
- Remove unnecessary smp_store_release() in cblist_init_generic()"
* tag 'rcu.2026.03.31a' of git://git.kernel.org/pub/scm/linux/kernel/git/rcu/linux:
rcutorture: Test call_srcu() with preemption disabled and not
rcu: Add BOOTPARAM_RCU_STALL_PANIC Kconfig option
torture: Avoid modulo-zero error in torture_hrtimeout_ns()
rcu/nocb: Extract nocb_bypass_needs_flush() to reduce duplication
rcu/nocb: Consolidate rcu_nocb_cpu_offload/deoffload functions
rcu-tasks: Remove unnecessary smp_store_release() in cblist_init_generic()
rcutorture: Add NOCB02 config for nocb poll mode testing
rcutorture: Add NOCB01 config for RCU_LAZY torture testing
rcu-tasks: Document that RCU Tasks Trace grace periods now imply RCU grace periods
srcu: Fix s/they disables/they disable/ typo in srcu_read_unlock_fast()
srcu: Fix SRCU read flavor macro comments
rcuscale: Ditch rcu_scale_shutdown in favor of torture_shutdown_init()
refscale: Ditch ref_scale_shutdown in favor of torture_shutdown_init()
rcutorture: Fix numeric "test" comparison in srcu_lockdep.sh
torture: Print informative message for test without recheck file
torture: Make hangs more visible in torture.sh output
kvm-check-branches.sh: Remove in favor of kvm-series.sh
rcutorture: Add a textbook-style trivial preemptible RCU
|
||
|
|
76af546488 |
workqueue: validate cpumask_first() result in llc_populate_cpu_shard_id()
On uniprocessor (UP) configs such as nios2, NR_CPUS is 1, so
cpu_shard_id[] is a single-element array (int[1]). In
llc_populate_cpu_shard_id(), cpumask_first(sibling_cpus) returns an
unsigned int that the compiler cannot prove is always 0, triggering
a -Warray-bounds warning when the result is used to index
cpu_shard_id[]:
kernel/workqueue.c:8321:55: warning: array subscript 1 is above
array bounds of 'int[1]' [-Warray-bounds]
8321 | cpu_shard_id[c] = cpu_shard_id[cpumask_first(sibling_cpus)];
| ~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~
This is a false positive: sibling_cpus can never be empty here because
'c' itself is always set in it, so cpumask_first() will always return a
valid CPU. However, the compiler cannot prove this statically, and the
warning only manifests on UP configs where the array size is 1.
Add a bounds check with WARN_ON_ONCE to silence the warning, and store
the result in a local variable to make the code clearer and avoid calling
cpumask_first() twice.
Fixes:
|
||
|
|
e74c3a8891 |
KVM/arm64 updates for 7.1
* New features: - Add support for tracing in the standalone EL2 hypervisor code, which should help both debugging and performance analysis. This comes with a full infrastructure for 'remote' trace buffers that can be exposed by non-kernel entities such as firmware. - Add support for GICv5 Per Processor Interrupts (PPIs), as the starting point for supporting the new GIC architecture in KVM. - Finally add support for pKVM protected guests, with anonymous memory being used as a backing store. About time! * Improvements and bug fixes: - Rework the dreaded user_mem_abort() function to make it more maintainable, reducing the amount of state being exposed to the various helpers and rendering a substantial amount of state immutable. - Expand the Stage-2 page table dumper to support NV shadow page tables on a per-VM basis. - Tidy up the pKVM PSCI proxy code to be slightly less hard to follow. - Fix both SPE and TRBE in non-VHE configurations so that they do not generate spurious, out of context table walks that ultimately lead to very bad HW lockups. - A small set of patches fixing the Stage-2 MMU freeing in error cases. - Tighten-up accepted SMC immediate value to be only #0 for host SMCCC calls. - The usual cleanups and other selftest churn. -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEn9UcU+C1Yxj9lZw9I9DQutE9ekMFAmnWdswACgkQI9DQutE9 ekNYvBAAxj5Zmsx8sJ2CYDTJc2w4XkEjSgDugA+J/s0TMgrzExeBlWCstdhVTncy 68nwOjQl3TotnIrt7q36kko9u7IdD0pHNrk34NtlggLjHfB61n9SNcAA6j4F6zJa GFkHpJSrSnZuUPqapkDnlyhuPkgTIAkEUk2Am9siksSfY4HvRyHZJm2FTdxsdIBn NN9wvQqw2wefTXOQ8gS+oHbPVp1cPbwrF2a3EhzXXv/6W3mUBstXgsijgo07UzCp W6vHCv2wqHbHdf67z3Q3hL+VXlVH6oHlyW99/swqISvqRkH/iSB90+oUojnMRrSm yB6Wmhh8jboCaajWMJhG+veZw+7GMXU4nOrGd1rbnY8cwRl/TQ5YibhRm7DIdvjO xeUluTLJ0NdweQUwE2k4OlgKOuGang3E2p0clmkUO4SstA48MdqR/kpST6guIlWw U5syuNaaaiuwP5QOi9qZmMCNmQ3ZfnZG3nseJFdoyGjhVhf5jyQyv4Du9vGZQFF/ Zkg7yTqC4OWiC+3GkW9YYAySM1MyetivLtd47PGzHPTdtaZziWhNvQ0y+8QjQ+R+ CJNvyS/DvsT7epSya4sLgMP1ZAlih9xkz5sQ6k8NJLBYYXi0v33qwqditErgLLyj S4Ci4WNhHHWIusvCVM7JUBkH0AElpmi506f7F6iHoFLlkYR4t9U= =/SuQ -----END PGP SIGNATURE----- Merge tag 'kvmarm-7.1' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD KVM/arm64 updates for 7.1 * New features: - Add support for tracing in the standalone EL2 hypervisor code, which should help both debugging and performance analysis. This comes with a full infrastructure for 'remote' trace buffers that can be exposed by non-kernel entities such as firmware. - Add support for GICv5 Per Processor Interrupts (PPIs), as the starting point for supporting the new GIC architecture in KVM. - Finally add support for pKVM protected guests, with anonymous memory being used as a backing store. About time! * Improvements and bug fixes: - Rework the dreaded user_mem_abort() function to make it more maintainable, reducing the amount of state being exposed to the various helpers and rendering a substantial amount of state immutable. - Expand the Stage-2 page table dumper to support NV shadow page tables on a per-VM basis. - Tidy up the pKVM PSCI proxy code to be slightly less hard to follow. - Fix both SPE and TRBE in non-VHE configurations so that they do not generate spurious, out of context table walks that ultimately lead to very bad HW lockups. - A small set of patches fixing the Stage-2 MMU freeing in error cases. - Tighten-up accepted SMC immediate value to be only #0 for host SMCCC calls. - The usual cleanups and other selftest churn. |
||
|
|
fa2942918a |
Merge patch series "bpf: Fix OOB in pcpu_init_value and add a test"
xulang <xulang@uniontech.com> says: ==================== Fix OOB read when copying element from a BPF_MAP_TYPE_CGROUP_STORAGE map to another pcpu map with the same value_size that is not rounded up to 8 bytes, and add a test case to reproduce the issue. The root cause is that pcpu_init_value() uses copy_map_value_long() which rounds up the copy size to 8 bytes, but CGROUP_STORAGE map values are not 8-byte aligned (e.g., 4-byte). This causes a 4-byte OOB read when the copy is performed. ==================== Link: https://lore.kernel.org/r/7653EEEC2BAB17DF+20260402073948.2185396-1-xulang@uniontech.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> |
||
|
|
576afddfee |
bpf: Fix OOB in pcpu_init_value
An out-of-bounds read occurs when copying element from a
BPF_MAP_TYPE_CGROUP_STORAGE map to another pcpu map with the
same value_size that is not rounded up to 8 bytes.
The issue happens when:
1. A CGROUP_STORAGE map is created with value_size not aligned to
8 bytes (e.g., 4 bytes)
2. A pcpu map is created with the same value_size (e.g., 4 bytes)
3. Update element in 2 with data in 1
pcpu_init_value assumes that all sources are rounded up to 8 bytes,
and invokes copy_map_value_long to make a data copy, However, the
assumption doesn't stand since there are some cases where the source
may not be rounded up to 8 bytes, e.g., CGROUP_STORAGE, skb->data.
the verifier verifies exactly the size that the source claims, not
the size rounded up to 8 bytes by kernel, an OOB happens when the
source has only 4 bytes while the copy size(4) is rounded up to 8.
Fixes:
|
||
|
|
ac61bffe91 |
bpf: Allow instructions with arena source and non-arena dest registers
The compiler sometimes stores the result of a PTR_TO_ARENA and SCALAR
operation into the scalar register rather than the pointer register.
Relax the verifier to allow operations between a source arena register
and a destination non-arena register, marking the destination's value
as a PTR_TO_ARENA.
Signed-off-by: Emil Tsalapatis <emil@etsalapatis.com>
Acked-by: Song Liu <song@kernel.org>
Fixes:
|
||
|
|
9fd19e3ed7 |
bpf: add missing fsession to the verifier log
The fsession attach type is missed in the verifier log in check_get_func_ip(), bpf_check_attach_target() and check_attach_btf_id(). Update them to make the verifier log proper. Meanwhile, update the corresponding selftests. Acked-by: Leon Hwang <leon.hwang@linux.dev> Signed-off-by: Menglong Dong <dongml2@chinatelecom.cn> Link: https://lore.kernel.org/r/20260412060346.142007-2-dongml2@chinatelecom.cn Signed-off-by: Alexei Starovoitov <ast@kernel.org> |
||
|
|
99a832a2b5 |
bpf: Move BTF checking logic into check_btf.c
BTF validation logic is independent from the main verifier. Move it into check_btf.c Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/r/20260412152936.54262-7-alexei.starovoitov@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> |
||
|
|
ed0b9710bd |
bpf: Move backtracking logic to backtrack.c
Move precision propagation and backtracking logic to backtrack.c to reduce verifier.c size. No functional changes. Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/r/20260412152936.54262-6-alexei.starovoitov@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> |
||
|
|
c82834a5a1 |
bpf: Move state equivalence logic to states.c
verifier.c is huge. Move is_state_visited() to states.c, so that all state equivalence logic is in one file. Mechanical move. No functional changes. Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/r/20260412152936.54262-5-alexei.starovoitov@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> |
||
|
|
f8a8faceab |
bpf: Move check_cfg() into cfg.c
verifier.c is huge. Move check_cfg(), compute_postorder(), compute_scc() into cfg.c Mechanical move. No functional changes. Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/r/20260412152936.54262-4-alexei.starovoitov@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> |
||
|
|
fc150cddee |
bpf: Move compute_insn_live_regs() into liveness.c
verifier.c is huge. Move compute_insn_live_regs() into liveness.c. Mechanical move. No functional changes. Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/r/20260412152936.54262-3-alexei.starovoitov@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> |
||
|
|
449f08fa59 |
bpf: Move fixup/post-processing logic from verifier.c into fixups.c
verifier.c is huge. Split fixup/post-processing logic that runs after the verifier accepted the program into fixups.c. Mechanical move. No functional changes. Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/r/20260412152936.54262-2-alexei.starovoitov@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> |
||
|
|
35bdc192d8 |
workqueue: Fixes for v7.0-rc7
- Fix incomplete activation of multiple inactive works when unplugging a
pool_workqueue, where the pending_pwqs list wasn't being updated for
subsequent works.
-----BEGIN PGP SIGNATURE-----
iIQEABYKACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCadvRzg4cdGpAa2VybmVs
Lm9yZwAKCRCxYfJx3gVYGeLdAP9SZN3b4QRE79fnP/1dzf2RR/XWlFq7x49Wv1nP
mmpwjQEAx/B6YabLNp98bdxaoygUkdz3zBDnL7oaWx/G0p9T/QA=
=80oo
-----END PGP SIGNATURE-----
Merge tag 'wq-for-7.0-rc7-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq
Pull workqueue fix from Tejun Heo:
"This is a fix for a stall which triggers on ordered workqueues when
there are multiple inactive work items during workqueue property
changes through sysfs, which doesn't happen that frequently.
While really late, the fix is very low risk as it just repeats an
operation which is already being performed:
- Fix incomplete activation of multiple inactive works when
unplugging a pool_workqueue, where the pending_pwqs list
wasn't being updated for subsequent works"
* tag 'wq-for-7.0-rc7-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
workqueue: Add pool_workqueue to pending_pwqs list when unplugging multiple inactive works
|
||
|
|
ab3dee2640 |
Two fixes for the time/timers subsystem:
- Invert the inverted fastpath decision in check_tick_dependency(), which
prevents NOHZ full to stop the tick. That's a regression introduced in
the 7.0 merge window.
- Prevent a unpriviledged DoS in the clockevents code, where user space
can starve the timer interrupt by arming a timerfd or posix interval
timer in a tight loop with an absolute expiry time in the past. The
fix turned out to be incomplete and was was amended yesterday to make
it work on some 20 years old AMD machines as well. All issues with it
have been confirmed to be resolved by various reporters.
-----BEGIN PGP SIGNATURE-----
iQJEBAABCgAuFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmnboPYQHHRnbHhAa2Vy
bmVsLm9yZwAKCRCmGPVMDXSYocEDD/4+aEut3QsitcZliE5KTpD4QFVo70Ky5MMe
XvQgOJyvSJeUBjOW3UVROEZvBWxpb0bYRbXGTs37pCKK4DwjOmodAaY6OI5W/atb
13lnVWSiCKkZoRLs3+G+PCxCD3WuS2HhZc27fiFJmXNyLGfaLBVC2trNnheEBmUs
m3DleuJYiuL3tNbQqlKJ2nIXJFp9qpHSkZ241gxinTvFhBz6QhhAkegMbq9DmHTR
dps4xc3dtNm1poEcTKFeZesfpqlgJaX5LoDydtOFtDomKbXX2SzD82nvCBritUty
ICcd99kIpAgREEn+aer7hokV8X3eAOBRrzPe8tqWl2mc55jcjQJHxD8dE3zIAv5a
AkcnIsiItfHLCPB3iP95inUQoeBCvAX7jCWaLFH8EZCz8QfLALSetjn6HSbWeN/6
ANqFsWlKhK7b1+VcdSL8LVdPtEsE2ILEIHR8dF9GqkVXzs1OPEJJHQoqQxUfDhpa
BdAahPCybY+7lcvZko+WJYJ2Xj0TPRbvav9Ol1o8c8xLF3itFDTCL+E2VMoLdpUn
F8SRDo2FUwysO4NVjHzR7YRTBhlF1pC3MgvPCL2kCekPv2TDUgacHjygnd/QGQ8m
cEOZlD/FlCVE+MNSaLYOUFYJ2a7QwO4qCXOhjy6OFMtO3ezJnol9plqCRCJfhkgy
towg2pGIaQ==
=cnPC
-----END PGP SIGNATURE-----
Merge tag 'timers-urgent-2026-04-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer fixes from Thomas Gleixner:
"Two fixes for the time/timers subsystem:
- Invert the inverted fastpath decision in check_tick_dependency(),
which prevents NOHZ full to stop the tick. That's a regression
introduced in the 7.0 merge window.
- Prevent a unpriviledged DoS in the clockevents code, where user
space can starve the timer interrupt by arming a timerfd or posix
interval timer in a tight loop with an absolute expiry time in the
past. The fix turned out to be incomplete and was was amended
yesterday to make it work on some 20 years old AMD machines as
well. All issues with it have been confirmed to be resolved by
various reporters"
* tag 'timers-urgent-2026-04-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
clockevents: Prevent timer interrupt starvation
tick/nohz: Fix inverted return value in check_tick_dependency() fast path
|
||
|
|
02640d8886 |
Fix DL server related slowdown to deferred fair tasks.
Signed-off-by: Ingo Molnar <mingo@kernel.org> -----BEGIN PGP SIGNATURE----- iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmnbSK8RHG1pbmdvQGtl cm5lbC5vcmcACgkQEnMQ0APhK1jwlg/8DwVE6c2SeQH51lvWNpzOp+1C/Q+P7Fxv PofW6L4AQHclj0NHSjSN+NNEuQ4y4VeaxNeJHtRrRBkAV7cPB0MOEqzRU6dpgaA+ Tlcm54X6+9xXjWpkLqCZvJBxmK6/d6NUt4mCJbXa/x7omVDoo/8I646WTbdDRZ1p YIJ63x+U06+81S/3DhsUR0X4+r5GlJDB3iTlEI3wZ26XeLsT/MvWs30QEzeQ1mzM zqqCYZ36aq1ULAz/nRYmGDIEsbwkBU0vXWeZDNeqGvMzNSK8P+Vejdd3wXXPs4vS QxbVCMQv256zmK3B9l7av0aJEwszs0aMqW/PD5QVLxTrWwV35akC0QF5GdAwNVbz m52KFMWElfq/Bu/fc/HrYeuEUY4ViA6/5Wrq021YGctidpEr+BFcFiLumQB6LL8/ 3lOBqwHtiM2qAtEvVCPgfVf3P9KmGl1OAw5TzgzzqKjA5TcDqSAe2CI/wvDhc6WK sSXo6pI3IOE2Q03X5b06Lyqd/NMVhBTb6DfwHTXdYbiDloV3m92UABBFfN9t6/qh IIn3U5JJEydZPaSKP1N9IdYScaRgqFgA2tXcMSqD1EJUYBJjh8Sp2cnv49nlfvK8 fYoSniKlmfnSaS4ooHGp/z44BCceY7x+zfGBgdMfgf31EKypqwn71biXxqvquJoA UZTOfCJ/fRc= =b4Dg -----END PGP SIGNATURE----- Merge tag 'sched-urgent-2026-04-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fix from Ingo Molnar: "Fix DL server related slowdown to deferred fair tasks" * tag 'sched-urgent-2026-04-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/deadline: Use revised wakeup rule for dl_server |
||
|
|
2ec74a0536 |
bpf: Simplify do_check_insn()
Move env->insn_idx++ to the caller, so that most of check_*() calls in do_check_insn() tail call into the next helper. Link: https://lore.kernel.org/r/20260411230001.71664-1-alexei.starovoitov@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> |
||
|
|
ae3f8ca2ba |
bpf: Move checks for reserved fields out of the main pass
Check reserved fields of each insn once in a prepass instead of repeatedly rechecking them during the main verifier pass. Link: https://lore.kernel.org/r/20260411200932.41797-1-alexei.starovoitov@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> |
||
|
|
558b9206d5 |
Probes fixes for v7.0-rc7
- tracing/probe: reject non-closed empty immediate strings Fixed a buffer index underflow bug that occurred when passing an non-closed empty immediate string to the probe event. -----BEGIN PGP SIGNATURE----- iQFPBAABCgA5FiEEh7BulGwFlgAOi5DV2/sHvwUrPxsFAmnaJr8bHG1hc2FtaS5o aXJhbWF0c3VAZ21haWwuY29tAAoJENv7B78FKz8b4PcH/29I02lnjYf4IJ6oiANa K5+RU+Xq0z+Tn079GapHI9vmF8d9M89bI76fqKAVBGHosk+QTdYPLwupbowY72Fq 5uZPpkMXD+cvNJz0qGEq2WOPeB+VGxq8zR/lnZjhZLupJP0DfBp6SpXS0lFx4d0Q x+bw6keG7Guq0ocq2mFNWPo2U8bxlh78UrX+pYoIHyjeO+LVNC7X7Ld3aRDebMck uDnS5oLNMojd4Vi5WYuP1GzGzIUJ0cXDDXnSVqHP5zdt7VV83eZNlfWOKe1YyHEq FoVTrb3dRFFWn2EdXOJB+DjQXWZeZtUsfJTxWm4UgPgV1DmLmBFurfB8Kk0JmX7S 0XU= =vHwk -----END PGP SIGNATURE----- Merge tag 'probes-fixes-v7.0-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull tracing probe fix from Masami Hiramatsu: "Reject non-closed empty immediate strings Fix a buffer index underflow bug that occurred when passing an non-closed empty immediate string to the probe event" * tag 'probes-fixes-v7.0-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: tracing/probe: reject non-closed empty immediate strings |
||
|
|
57205e2dd9 |
bpf: Delete unused variable
'cnt' is set, but not used. Delete it.
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202604111401.eqzyF2kx-lkp@intel.com/
Fixes:
|
||
|
|
ff1c0c5d07 |
Merge branch 'timers/urgent' into timers/core
to resolve the conflict with urgent fixes. |
||
|
|
136deea435 |
bpf: Remove gfp_flags plumbing from bpf_local_storage_update()
Remove the check that rejects sleepable BPF programs from doing
BPF_ANY/BPF_EXIST updates on local storage. This restriction was added
in commit
|
||
|
|
5063e77588 |
bpf: Use kmalloc_nolock() universally in local storage
Switch to kmalloc_nolock() universally in local storage. Socket local
storage didn't move to kmalloc_nolock() when BPF memory allocator was
replaced by it for performance reasons. Now that kfree_rcu() supports
freeing memory allocated by kmalloc_nolock(), we can move the remaining
local storages to use kmalloc_nolock() and cleanup the cluttered free
paths.
Use kfree() instead of kfree_nolock() in bpf_selem_free_trace_rcu() and
bpf_local_storage_free_trace_rcu(). Both callbacks run in process context
where spinning is allowed, so kfree_nolock() is unnecessary.
Benchmark:
./bench -p 1 local-storage-create --storage-type socket \
--batch-size {16,32,64}
The benchmark is a microbenchmark stress-testing how fast local storage
can be created. There is no measurable throughput change for socket local
storage after switching from kzalloc() to kmalloc_nolock().
Socket local storage
batch creation speed diff
--------------- ---- ------------------ ----
Baseline 16 433.9 ± 0.6 k/s
32 434.3 ± 1.4 k/s
64 434.2 ± 0.7 k/s
After 16 439.0 ± 1.9 k/s +1.2%
32 437.3 ± 2.0 k/s +0.7%
64 435.8 ± 2.5k/s +0.4%
Also worth noting that the baseline got a 5% throughput boost when sheaf
replaces percpu partial slab recently [0].
[0] https://lore.kernel.org/bpf/20260123-sheaves-for-all-v4-0-041323d506f7@suse.cz/
Signed-off-by: Amery Hung <ameryhung@gmail.com>
Link: https://lore.kernel.org/r/20260411015419.114016-3-ameryhung@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
||
|
|
49d78adf95 |
sched_ext: Drop spurious warning on kick during scheduler disable
kick_cpus_irq_workfn() warns when scx_kick_syncs is NULL, but this can legitimately happen when a BPF timer or other kick source races with free_kick_syncs() during scheduler disable. Drop the pr_warn_once() and add a comment explaining the race. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Zhao Mengmeng <zhaomengmeng@kylinos.cn> |
||
|
|
2f2ec8e773 |
bpf: Enforce regsafe base id consistency for BPF_ADD_CONST scalars
When regsafe() compares two scalar registers that both carry
BPF_ADD_CONST, check_scalar_ids() maps their full compound id
(aka base | BPF_ADD_CONST flag) as one idmap entry. However,
it never verifies that the underlying base ids, that is, with
the flag stripped are consistent with existing idmap mappings.
This allows construction of two verifier states where the old
state has R3 = R2 + 10 (both sharing base id A) while the current
state has R3 = R4 + 10 (base id C, unrelated to R2). The idmap
creates two independent entries: A->B (for R2) and A|flag->C|flag
(for R3), without catching that A->C conflicts with A->B. State
pruning then incorrectly succeeds.
Fix this by additionally verifying base ID mapping consistency
whenever BPF_ADD_CONST is set: after mapping the compound ids,
also invoke check_ids() on the base IDs (flag bits stripped).
This ensures that if A was already mapped to B from comparing
the source register, any ADD_CONST derivative must also derive
from B, not an unrelated C.
Fixes:
|
||
|
|
e774d5f1bc |
RISC-V updates for v7.0-rc8
Before v7.0 is released, fix a few issues with the CFI patchset,
merged earlier in v7.0-rc, that primarily affect interfaces to
non-kernel code:
- Improve the prctl() interface for per-task indirect branch landing
pad control to expand abbreviations and to resemble the speculation
control prctl() interface
- Expand the "LP" and "SS" abbreviations in the ptrace uapi header
file to "branch landing pad" and "shadow stack", to improve
readability
- Fix a typo in a CFI-related macro name in the ptrace uapi header
file
- Ensure that the indirect branch tracking state and shadow stack
state are unlocked immediately after an exec() on the new task so
that libc subsequently can control it
- While working in this area, clean up the kernel-internal,
cross-architecture prctl() function names by expanding the
abbreviations mentioned above
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEElRDoIDdEz9/svf2Kx4+xDQu9KksFAmnYP5YACgkQx4+xDQu9
KkuoPQ//Yye5D+35EqfA12yP96Vrtg0QCKiqMotz3yLo0T7zh5KosAs/QIE5eQi7
vWRnCld5PsFa0ZS2822oPfQo8pKVO1y7M2ecFWSwaOWq865Xs82M/puqEQF3GFCS
219cg1dTVBGvvKSf4MINUBRprfZmZRT9pzhSk79qHEbHKzwCDk7uah51iUdyPJyd
KX3hshYMLq3rooTHR2wD/ChTpV+pCrt2rSUVbW8+sTUWDfv2sTLauHmemKw7LpdW
C0SulXvcYkGyiqsB5AXW9x2ttJ5hX9diPb73XS6eBCU0CaMl9BVZWNKeqhEMJxKR
wmqIadD8pelf7Jh7wGAbNW4hWqTsO3xRpZH38Y/cGLdhs3cqvKjEmT3fOFWUP9bP
hWv5027gVXVSOmvxhPiUJs7D5WWAz4Q64JZfdJSmDdEWVXcI0v/hzdukuPw4iiT6
DaqOyClTcwc+j1jawFTICXTF7wXfvZT5sjulrmPk1HX4nZ5padKpfQ77AdKHF9Q6
9pC25QHQk42h/R4ynA4lm15YnCOfYvjP25hU7K64gQnqO6qBrolfrA4kJOmdYv/g
1IXsA2YZafJbcXwyFZjWy50uu5gaCM5JhRRFdUrjmB6j3gv9HfBlWJXQywReUjPo
Kq4tnFppxzFVm23COj9j5kyjsFjUhZ8KCft3+n7lrndeOCk5Z3E=
=5/Ct
-----END PGP SIGNATURE-----
Merge tag 'riscv-for-linus-v7.0-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux
Pull RISC-V updates from Paul Walmsley:
"Before v7.0 is released, fix a few issues with the CFI patchset,
merged earlier in v7.0-rc, that primarily affect interfaces to
non-kernel code:
- Improve the prctl() interface for per-task indirect branch landing
pad control to expand abbreviations and to resemble the speculation
control prctl() interface
- Expand the "LP" and "SS" abbreviations in the ptrace uapi header
file to "branch landing pad" and "shadow stack", to improve
readability
- Fix a typo in a CFI-related macro name in the ptrace uapi header
file
- Ensure that the indirect branch tracking state and shadow stack
state are unlocked immediately after an exec() on the new task so
that libc subsequently can control it
- While working in this area, clean up the kernel-internal,
cross-architecture prctl() function names by expanding the
abbreviations mentioned above"
* tag 'riscv-for-linus-v7.0-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux:
prctl: cfi: change the branch landing pad prctl()s to be more descriptive
riscv: ptrace: cfi: expand "SS" references to "shadow stack" in uapi headers
prctl: rename branch landing pad implementation functions to be more explicit
riscv: ptrace: expand "LP" references to "branch landing pads" in uapi headers
riscv: cfi: clear CFI lock status in start_thread()
riscv: ptrace: cfi: fix "PRACE" typo in uapi header
|
||
|
|
2cb27158ad |
bpf: poison dead stack slots
As a sanity check poison stack slots that stack liveness determined to be dead, so that any read from such slots will cause program rejection. If stack liveness logic is incorrect the poison can cause valid program to be rejected, but it also will prevent unsafe program to be accepted. Allow global subprogs "read" poisoned stack slots. The static stack liveness determined that subprog doesn't read certain stack slots, but sizeof(arg_type) based global subprog validation isn't accurate enough to know which slots will actually be read by the callee, so it needs to check full sizeof(arg_type) at the caller. Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20260410-patch-set-v4-14-5d4eecb343db@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> |
||
|
|
2c167d9177 |
bpf: change logging scheme for live stack analysis
Instead of breadcrumbs like:
(d2,cs15) frame 0 insn 18 +live -16
(d2,cs15) frame 0 insn 17 +live -16
Print final accumulated stack use/def data per-func_instance
per-instruction. printed func_instance's are ordered by callsite and
depth. For example:
stack use/def subprog#0 shared_instance_must_write_overwrite (d0,cs0):
0: (b7) r1 = 1
1: (7b) *(u64 *)(r10 -8) = r1 ; def: fp0-8
2: (7b) *(u64 *)(r10 -16) = r1 ; def: fp0-16
3: (bf) r1 = r10
4: (07) r1 += -8
5: (bf) r2 = r10
6: (07) r2 += -16
7: (85) call pc+7 ; use: fp0-8 fp0-16
8: (bf) r1 = r10
9: (07) r1 += -16
10: (bf) r2 = r10
11: (07) r2 += -8
12: (85) call pc+2 ; use: fp0-8 fp0-16
13: (b7) r0 = 0
14: (95) exit
stack use/def subprog#1 forwarding_rw (d1,cs7):
15: (85) call pc+1 ; use: fp0-8 fp0-16
16: (95) exit
stack use/def subprog#1 forwarding_rw (d1,cs12):
15: (85) call pc+1 ; use: fp0-8 fp0-16
16: (95) exit
stack use/def subprog#2 write_first_read_second (d2,cs15):
17: (7a) *(u64 *)(r1 +0) = 42
18: (79) r0 = *(u64 *)(r2 +0) ; use: fp0-8 fp0-16
19: (95) exit
For groups of three or more consecutive stack slots, abbreviate as
follows:
25: (85) call bpf_loop#181 ; use: fp2-8..-512 fp1-8..-512 fp0-8..-512
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20260410-patch-set-v4-10-5d4eecb343db@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
||
|
|
6762e3a0bc |
bpf: simplify liveness to use (callsite, depth) keyed func_instances
Rework func_instance identification and remove the dynamic liveness
API, completing the transition to fully static stack liveness analysis.
Replace callchain-based func_instance keys with (callsite, depth)
pairs. The full callchain (all ancestor callsites) is no longer part
of the hash key; only the immediate callsite and the call depth
matter. This does not lose precision in practice and simplifies the
data structure significantly: struct callchain is removed entirely,
func_instance stores just callsite, depth.
Drop must_write_acc propagation. Previously, must_write marks were
accumulated across successors and propagated to the caller via
propagate_to_outer_instance(). Instead, callee entry liveness
(live_before at subprog start) is pulled directly back to the
caller's callsite in analyze_subprog() after each callee returns.
Since (callsite, depth) instances are shared across different call
chains that invoke the same subprog at the same depth, must_write
marks from one call may be stale for another. To handle this,
analyze_subprog() records into a fresh_instance() when the instance
was already visited (must_write_initialized), then merge_instances()
combines the results: may_read is unioned, must_write is intersected.
This ensures only slots written on ALL paths through all call sites
are marked as guaranteed writes.
This replaces commit_stack_write_marks() logic.
Skip recursive descent into callees that receive no FP-derived
arguments (has_fp_args() check). This is needed because global
subprogram calls can push depth beyond MAX_CALL_FRAMES (max depth
is 64 for global calls but only 8 frames are accommodated for FP
passing). It also handles the case where a callback subprog cannot be
determined by argument tracking: such callbacks will be processed by
analyze_subprog() at depth 0 independently.
Update lookup_instance() (used by is_live_before queries) to search
for the func_instance with maximal depth at the corresponding
callsite, walking depth downward from frameno to 0. This accounts for
the fact that instance depth no longer corresponds 1:1 to
bpf_verifier_state->curframe, since skipped non-FP calls create gaps.
Remove the dynamic public liveness API from verifier.c:
- bpf_mark_stack_{read,write}(), bpf_reset/commit_stack_write_marks()
- bpf_update_live_stack(), bpf_reset_live_stack_callchain()
- All call sites in check_stack_{read,write}_fixed_off(),
check_stack_range_initialized(), mark_stack_slot_obj_read(),
mark/unmark_stack_slots_{dynptr,iter,irq_flag}()
- The per-instruction write mark accumulation in do_check()
- The bpf_update_live_stack() call in prepare_func_exit()
mark_stack_read() and mark_stack_write() become static functions in
liveness.c, called only from the static analysis pass. The
func_instance->updated and must_write_dropped flags are removed.
Remove spis_single_slot(), spis_one_bit() helpers from bpf_verifier.h
as they are no longer used.
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Tested-by: Paul Chaignon <paul.chaignon@gmail.com>
Link: https://lore.kernel.org/r/20260410-patch-set-v4-9-5d4eecb343db@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
||
|
|
fed53dbcdb |
bpf: record arg tracking results in bpf_liveness masks
After arg tracking reaches a fixed point, perform a single linear scan
over the converged at_in[] state and translate each memory access into
liveness read/write masks on the func_instance:
- Load/store instructions: FP-derived pointer's frame and offset(s)
are converted to half-slot masks targeting
per_frame_masks->{may_read,must_write}
- Helper/kfunc calls: record_call_access() queries
bpf_helper_stack_access_bytes() / bpf_kfunc_stack_access_bytes()
for each FP-derived argument to determine access size and direction.
Unknown access size (S64_MIN) conservatively marks all slots from
fp_off to fp+0 as read.
- Imprecise pointers (frame == ARG_IMPRECISE): conservatively mark
all slots in every frame covered by the pointer's frame bitmask
as fully read.
- Static subprog calls with unresolved arguments: conservatively mark
all frames as fully read.
Instead of a call to clean_live_states(), start cleaning the current
state continuously as registers and stack become dead since the static
analysis provides complete liveness information. This makes
clean_live_states() and bpf_verifier_state->cleaned unnecessary.
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20260410-patch-set-v4-8-5d4eecb343db@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
||
|
|
bf0c571f7f |
bpf: introduce forward arg-tracking dataflow analysis
The analysis is a basis for static liveness tracking mechanism
introduced by the next two commits.
A forward fixed-point analysis that tracks which frame's FP each
register value is derived from, and at what byte offset. This is
needed because a callee can receive a pointer to its caller's stack
frame (e.g. r1 = fp-16 at the call site), then do *(u64 *)(r1 + 0)
inside the callee — a cross-frame stack access that the callee's local
liveness must attribute to the caller's stack.
Each register holds an arg_track value from a three-level lattice:
- Precise {frame=N, off=[o1,o2,...]} — known frame index and
up to 4 concrete byte offsets
- Offset-imprecise {frame=N, off_cnt=0} — known frame, unknown offset
- Fully-imprecise {frame=ARG_IMPRECISE, mask=bitmask} — unknown frame,
mask says which frames might be involved
At CFG merge points the lattice moves toward imprecision (same
frame+offset stays precise, same frame different offsets merges offset
sets or becomes offset-imprecise, different frames become
fully-imprecise with OR'd bitmask).
The analysis also tracks spills/fills to the callee's own stack
(at_stack_in/out), so FP derived values spilled and reloaded.
This pass is run recursively per call site: when subprog A calls B
with specific FP-derived arguments, B is re-analyzed with those entry
args. The recursion follows analyze_subprog -> compute_subprog_args ->
(for each call insn) -> analyze_subprog. Subprogs that receive no
FP-derived args are skipped during recursion and analyzed
independently at depth 0.
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20260410-patch-set-v4-7-5d4eecb343db@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
||
|
|
8d3219f64d |
bpf: prepare liveness internal API for static analysis pass
Move the `updated` check and reset from bpf_update_live_stack() into update_instance() itself, so callers outside the main loop can reuse it. Similarly, move write_insn_idx assignment out of reset_stack_write_marks() into its public caller, and thread insn_idx as a parameter to commit_stack_write_marks() instead of reading it from liveness->write_insn_idx. Drop the unused `env` parameter from alloc_frame_masks() and mark_stack_read(). Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20260410-patch-set-v4-6-5d4eecb343db@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> |
||
|
|
be23266b4a |
bpf: 4-byte precise clean_verifier_state
Migrate clean_verifier_state() and its liveness queries from 8-byte SPI granularity to 4-byte half-slot granularity. In __clean_func_state(), each SPI is cleaned in two independent halves: - half_spi 2*i (lo): slot_type[0..3] - half_spi 2*i+1 (hi): slot_type[4..7] Slot types STACK_DYNPTR, STACK_ITER and STACK_IRQ_FLAG are never cleaned, as their slot type markers are required by destroy_if_dynptr_stack_slot(), is_iter_reg_valid_uninit() and is_irq_flag_reg_valid_uninit() for correctness. When only the hi half is dead, spilled_ptr metadata is destroyed and the lo half's STACK_SPILL bytes are downgraded to STACK_MISC or STACK_ZERO. When only the lo half is dead, spilled_ptr is preserved because the hi half may still need it for state comparison. Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20260410-patch-set-v4-5-5d4eecb343db@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> |
||
|
|
7ca5f68cda |
bpf: make liveness.c track stack with 4-byte granularity
Convert liveness bitmask type from u64 to spis_t, doubling the number of trackable stack slots from 64 to 128 to support 4-byte granularity. Each 8-byte SPI now maps to two consecutive 4-byte sub-slots in the bitmask: spi*2 half and spi*2+1 half. In verifier.c, check_stack_write_fixed_off() now reports 4-byte aligned writes of 4-byte writes as half-slot marks and 8-byte aligned 8-byte writes as two slots. Similar logic applied in check_stack_read_fixed_off(). Queries (is_live_before) are not yet migrated to half-slot granularity. Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20260410-patch-set-v4-4-5d4eecb343db@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> |
||
|
|
cf3ee1ecf3 |
bpf: save subprogram name in bpf_subprog_info
Subprogram name can be computed from function info and BTF, but it is convenient to have the name readily available for logging purposes. Update comment saying that bpf_subprog_info->start has to be the first field, this is no longer true, relevant sites access .start field by it's name. Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20260410-patch-set-v4-2-5d4eecb343db@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> |
||
|
|
33dfc521c2 |
bpf: share several utility functions as internal API
Namely: - bpf_subprog_is_global - bpf_vlog_alignment Acked-by: Mykyta Yatsenko <yatsenko@meta.com> Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20260410-patch-set-v4-1-5d4eecb343db@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> |
||
|
|
d6e152d905 |
clockevents: Prevent timer interrupt starvation
Calvin reported an odd NMI watchdog lockup which claims that the CPU locked
up in user space. He provided a reproducer, which sets up a timerfd based
timer and then rearms it in a loop with an absolute expiry time of 1ns.
As the expiry time is in the past, the timer ends up as the first expiring
timer in the per CPU hrtimer base and the clockevent device is programmed
with the minimum delta value. If the machine is fast enough, this ends up
in a endless loop of programming the delta value to the minimum value
defined by the clock event device, before the timer interrupt can fire,
which starves the interrupt and consequently triggers the lockup detector
because the hrtimer callback of the lockup mechanism is never invoked.
As a first step to prevent this, avoid reprogramming the clock event device
when:
- a forced minimum delta event is pending
- the new expiry delta is less then or equal to the minimum delta
Thanks to Calvin for providing the reproducer and to Borislav for testing
and providing data from his Zen5 machine.
The problem is not limited to Zen5, but depending on the underlying
clock event device (e.g. TSC deadline timer on Intel) and the CPU speed
not necessarily observable.
This change serves only as the last resort and further changes will be made
to prevent this scenario earlier in the call chain as far as possible.
[ tglx: Updated to restore the old behaviour vs. !force and delta <= 0 and
fixed up the tick-broadcast handlers as pointed out by Borislav ]
Fixes:
|
||
|
|
4406942e65 |
bpf: Fix RCU stall in bpf_fd_array_map_clear()
Add a missing cond_resched() in bpf_fd_array_map_clear() loop.
For PROG_ARRAY maps with many entries this loop calls
prog_array_map_poke_run() per entry which can be expensive, and
without yielding this can cause RCU stalls under load:
rcu: Stack dump where RCU GP kthread last ran:
CPU: 0 UID: 0 PID: 30932 Comm: kworker/0:2 Not tainted 6.14.0-13195-g967e8def1100 #2 PREEMPT(undef)
Workqueue: events prog_array_map_clear_deferred
RIP: 0010:write_comp_data+0x38/0x90 kernel/kcov.c:246
Call Trace:
<TASK>
prog_array_map_poke_run+0x77/0x380 kernel/bpf/arraymap.c:1096
__fd_array_map_delete_elem+0x197/0x310 kernel/bpf/arraymap.c:925
bpf_fd_array_map_clear kernel/bpf/arraymap.c:1000 [inline]
prog_array_map_clear_deferred+0x119/0x1b0 kernel/bpf/arraymap.c:1141
process_one_work+0x898/0x19d0 kernel/workqueue.c:3238
process_scheduled_works kernel/workqueue.c:3319 [inline]
worker_thread+0x770/0x10b0 kernel/workqueue.c:3400
kthread+0x465/0x880 kernel/kthread.c:464
ret_from_fork+0x4d/0x80 arch/x86/kernel/process.c:153
ret_from_fork_asm+0x19/0x30 arch/x86/entry/entry_64.S:245
</TASK>
Reviewed-by: Sun Jian <sun.jian.kdev@gmail.com>
Fixes:
|
||
|
|
4cbee026db |
bpf: return VMA snapshot from task_vma iterator
Holding the per-VMA lock across the BPF program body creates a lock
ordering problem when helpers acquire locks that depend on mmap_lock:
vm_lock -> i_rwsem -> mmap_lock -> vm_lock
Snapshot the VMA under the per-VMA lock in _next() via memcpy(), then
drop the lock before returning. The BPF program accesses only the
snapshot.
The verifier only trusts vm_mm and vm_file pointers (see
BTF_TYPE_SAFE_TRUSTED_OR_NULL in verifier.c). vm_file is reference-
counted with get_file() under the lock and released via fput() on the
next iteration or in _destroy(). vm_mm is already correct because
lock_vma_under_rcu() verifies vma->vm_mm == mm. All other pointers
are left as-is by memcpy() since the verifier treats them as untrusted.
Fixes:
|
||
|
|
bee9ef4a40 |
bpf: switch task_vma iterator from mmap_lock to per-VMA locks
The open-coded task_vma iterator holds mmap_lock for the entire duration of iteration, increasing contention on this highly contended lock. Switch to per-VMA locking. Find the next VMA via an RCU-protected maple tree walk and lock it with lock_vma_under_rcu(). lock_next_vma() is not used because its fallback takes mmap_read_lock(), and the iterator must work in non-sleepable contexts. lock_vma_under_rcu() is a point lookup (mas_walk) that finds the VMA containing a given address but cannot iterate across gaps. An RCU-protected vma_next() walk (mas_find) first locates the next VMA's vm_start to pass to lock_vma_under_rcu(). Between the RCU walk and the lock, the VMA may be removed, shrunk, or write-locked. On failure, advance past it using vm_end from the RCU walk. Because the VMA slab is SLAB_TYPESAFE_BY_RCU, vm_end may be stale; fall back to PAGE_SIZE advancement when it does not make forward progress. Concurrent VMA insertions at addresses already passed by the iterator are not detected. CONFIG_PER_VMA_LOCK is required; return -EOPNOTSUPP without it. Signed-off-by: Puranjay Mohan <puranjay@kernel.org> Link: https://lore.kernel.org/r/20260408154539.3832150-3-puranjay@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org> |
||
|
|
d8e27d2d22 |
bpf: fix mm lifecycle in open-coded task_vma iterator
The open-coded task_vma iterator reads task->mm locklessly and acquires
mmap_read_trylock() but never calls mmget(). If the task exits
concurrently, the mm_struct can be freed as it is not
SLAB_TYPESAFE_BY_RCU, resulting in a use-after-free.
Safely read task->mm with a trylock on alloc_lock and acquire an mm
reference. Drop the reference via bpf_iter_mmput_async() in _destroy()
and error paths. bpf_iter_mmput_async() is a local wrapper around
mmput_async() with a fallback to mmput() on !CONFIG_MMU.
Reject irqs-disabled contexts (including NMI) up front. Operations used
by _next() and _destroy() (mmap_read_unlock, bpf_iter_mmput_async)
take spinlocks with IRQs disabled (pool->lock, pi_lock). Running from
NMI or from a tracepoint that fires with those locks held could
deadlock.
A trylock on alloc_lock is used instead of the blocking task_lock()
(get_task_mm) to avoid a deadlock when a softirq BPF program iterates
a task that already holds its alloc_lock on the same CPU.
Fixes:
|
||
|
|
e719e17d99 |
sched_ext: Warn on task-based SCX op recursion
The kf_tasks[] design assumes task-based SCX ops don't nest - if they did, kf_tasks[0] would get clobbered. The old scx_kf_allow() WARN_ONCE caught invalid nesting via kf_mask, but that machinery is gone now. Add a WARN_ON_ONCE(current->scx.kf_tasks[0]) at the top of each SCX_CALL_OP_TASK*() macro. Checking kf_tasks[0] alone is sufficient: all three variants (SCX_CALL_OP_TASK, SCX_CALL_OP_TASK_RET, SCX_CALL_OP_2TASKS_RET) write to kf_tasks[0], so a non-NULL value at entry to any of the three means re-entry from somewhere in the family. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Andrea Righi <arighi@nvidia.com> |
||
|
|
979a98b6e9 |
sched_ext: Rename scx_kf_allowed_on_arg_tasks() to scx_kf_arg_task_ok()
The "kf_allowed" framing on this helper comes from the old runtime scx_kf_allowed() gate, which has been removed. Rename it to describe what it actually does in the new model. Pure rename, no functional change. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Andrea Righi <arighi@nvidia.com> |
||
|
|
7cd9a5d7d4 |
sched_ext: Remove runtime kfunc mask enforcement
Now that scx_kfunc_context_filter enforces context-sensitive kfunc restrictions at BPF load time, the per-task runtime enforcement via scx_kf_mask is redundant. Remove it entirely: - Delete enum scx_kf_mask, the kf_mask field on sched_ext_entity, and the scx_kf_allow()/scx_kf_disallow()/scx_kf_allowed() helpers along with the higher_bits()/highest_bit() helpers they used. - Strip the @mask parameter (and the BUILD_BUG_ON checks) from the SCX_CALL_OP[_RET]/SCX_CALL_OP_TASK[_RET]/SCX_CALL_OP_2TASKS_RET macros and update every call site. Reflow call sites that were wrapped only to fit the old 5-arg form and now collapse onto a single line under ~100 cols. - Remove the in-kfunc scx_kf_allowed() runtime checks from scx_dsq_insert_preamble(), scx_dsq_move(), scx_bpf_dispatch_nr_slots(), scx_bpf_dispatch_cancel(), scx_bpf_dsq_move_to_local___v2(), scx_bpf_sub_dispatch(), scx_bpf_reenqueue_local(), and the per-call guard inside select_cpu_from_kfunc(). scx_bpf_task_cgroup() and scx_kf_allowed_on_arg_tasks() were already cleaned up in the "drop redundant rq-locked check" patch. scx_kf_allowed_if_unlocked() was rewritten in the preceding "decouple" patch. No further changes to those helpers here. Co-developed-by: Juntong Deng <juntong.deng@outlook.com> Signed-off-by: Juntong Deng <juntong.deng@outlook.com> Signed-off-by: Cheng-Yang Chou <yphbchou0911@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com> |
||
|
|
d1d3c1c6ae |
sched_ext: Add verifier-time kfunc context filter
Move enforcement of SCX context-sensitive kfunc restrictions from per-task
runtime kf_mask checks to BPF verifier-time filtering, using the BPF core's
struct_ops context information.
A shared .filter callback is attached to each context-sensitive BTF set
and consults a per-op allow table (scx_kf_allow_flags[]) indexed by SCX
ops member offset. Disallowed calls are now rejected at program load time
instead of at runtime.
The old model split reachability across two places: each SCX_CALL_OP*()
set bits naming its op context, and each kfunc's scx_kf_allowed() check
OR'd together the bits it accepted. A kfunc was callable when those two
masks overlapped. The new model transposes the result to the caller side -
each op's allow flags directly list the kfunc groups it may call. The old
bit assignments were:
Call-site bits:
ops.select_cpu = ENQUEUE | SELECT_CPU
ops.enqueue = ENQUEUE
ops.dispatch = DISPATCH
ops.cpu_release = CPU_RELEASE
Kfunc-group accepted bits:
enqueue group = ENQUEUE | DISPATCH
select_cpu group = SELECT_CPU | ENQUEUE
dispatch group = DISPATCH
cpu_release group = CPU_RELEASE
Intersecting them yields the reachability now expressed directly by
scx_kf_allow_flags[]:
ops.select_cpu -> SELECT_CPU | ENQUEUE
ops.enqueue -> SELECT_CPU | ENQUEUE
ops.dispatch -> ENQUEUE | DISPATCH
ops.cpu_release -> CPU_RELEASE
Unlocked ops carried no kf_mask bits and reached only unlocked kfuncs;
that maps directly to UNLOCKED in the new table.
Equivalence was checked by walking every (op, kfunc-group) combination
across SCX ops, SYSCALL, and non-SCX struct_ops callers against the old
scx_kf_allowed() runtime checks. With two intended exceptions (see below),
all combinations reach the same verdict; disallowed calls are now caught at
load time instead of firing scx_error() at runtime.
scx_bpf_dsq_move_set_slice() and scx_bpf_dsq_move_set_vtime() are
exceptions: they have no runtime check at all, but the new filter rejects
them from ops outside dispatch/unlocked. The affected cases are nonsensical
- the values these setters store are only read by
scx_bpf_dsq_move{,_vtime}(), which is itself restricted to
dispatch/unlocked, so a setter call from anywhere else was already dead
code.
Runtime scx_kf_mask enforcement is left in place by this patch and removed
in a follow-up.
Original-patch-by: Juntong Deng <juntong.deng@outlook.com>
Original-patch-by: Cheng-Yang Chou <yphbchou0911@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
|
||
|
|
2193af26a1 |
sched_ext: Drop redundant rq-locked check from scx_bpf_task_cgroup()
scx_kf_allowed_on_arg_tasks() runs both an scx_kf_allowed(__SCX_KF_RQ_LOCKED) mask check and a kf_tasks[] check. After the preceding call-site fixes, every SCX_CALL_OP_TASK*() invocation has kf_mask & __SCX_KF_RQ_LOCKED non-zero, so the mask check is redundant whenever the kf_tasks[] check passes. Drop it and simplify the helper to take only @sch and @p. Fold the locking guarantee into the SCX_CALL_OP_TASK() comment block, which scx_bpf_task_cgroup() now points to. No functional change. Extracted from a larger verifier-time kfunc context filter patch originally written by Juntong Deng. Original-patch-by: Juntong Deng <juntong.deng@outlook.com> Cc: Cheng-Yang Chou <yphbchou0911@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com> |
||
|
|
0022b32850 |
sched_ext: Decouple kfunc unlocked-context check from kf_mask
scx_kf_allowed_if_unlocked() uses !current->scx.kf_mask as a proxy for "no
SCX-tracked lock held". kf_mask is removed in a follow-up patch, so its two
callers - select_cpu_from_kfunc() and scx_dsq_move() - need another basis.
Add a new bool scx_rq.in_select_cpu, set across the SCX_CALL_OP_TASK_RET
that invokes ops.select_cpu(), to capture the one case where SCX itself
holds no lock but try_to_wake_up() holds @p's pi_lock. Together with
scx_locked_rq(), it expresses the same accepted-context set.
select_cpu_from_kfunc() needs a runtime test because it has to take
different locking paths depending on context. Open-code as a three-way
branch. The unlocked branch takes raw_spin_lock_irqsave(&p->pi_lock)
directly - pi_lock alone is enough for the fields the kfunc reads, and is
lighter than task_rq_lock().
scx_dsq_move() doesn't really need a runtime test - its accepted contexts
could be enforced at verifier load time. But since the runtime state is
already there and using it keeps the upcoming load-time filter simpler, just
write it the same way: (scx_locked_rq() || in_select_cpu) &&
!kf_allowed(DISPATCH).
scx_kf_allowed_if_unlocked() is deleted with the conversions.
No semantic change.
v2: s/No functional change/No semantic change/ - the unlocked path now acquires
pi_lock instead of the heavier task_rq_lock() (Andrea Righi).
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
|
||
|
|
b470e37c1f |
sched_ext: Fix ops.cgroup_move() invocation kf_mask and rq tracking
sched_move_task() invokes ops.cgroup_move() inside task_rq_lock(tsk), so
@p's rq lock is held. The SCX_CALL_OP_TASK invocation mislabels this:
- kf_mask = SCX_KF_UNLOCKED (== 0), claiming no lock is held.
- rq = NULL, so update_locked_rq() doesn't run and scx_locked_rq()
returns NULL.
Switch to SCX_KF_REST and pass task_rq(p), matching ops.set_cpumask()
from set_cpus_allowed_scx().
Three effects:
- scx_bpf_task_cgroup() becomes callable (was rejected by
scx_kf_allowed(__SCX_KF_RQ_LOCKED)). Safe; rq lock is held.
- scx_bpf_dsq_move() is now rejected (was allowed via the unlocked
branch). Calling it while holding an unrelated task's rq lock is
risky; rejection is correct.
- scx_bpf_select_cpu_*() previously took the unlocked branch in
select_cpu_from_kfunc() and called task_rq_lock(p, &rf), which
would deadlock against the already-held pi_lock. Now it takes the
locked-rq branch and is rejected with -EPERM via the existing
kf_allowed(SCX_KF_SELECT_CPU | SCX_KF_ENQUEUE) check. Latent
deadlock fix.
No in-tree scheduler is known to call any of these from ops.cgroup_move().
v2: Add Fixes: tag (Andrea Righi).
Fixes:
|
||
|
|
9fb457074f |
sched_ext: Track @p's rq lock across set_cpus_allowed_scx -> ops.set_cpumask
The SCX_CALL_OP_TASK call site passes rq=NULL incorrectly, leaving
scx_locked_rq() unset. Pass task_rq(p) instead so update_locked_rq()
reflects reality.
v2: Add Fixes: tag (Andrea Righi).
Fixes:
|
||
|
|
a37e134317 |
sched_ext: Add select_cpu kfuncs to scx_kfunc_ids_unlocked
select_cpu_from_kfunc() has an extra scx_kf_allowed_if_unlocked() branch
that accepts calls from unlocked contexts and takes task_rq_lock() itself
- a "callable from unlocked" property encoded in the kfunc body rather
than in set membership. That's fine while the runtime check is the
authoritative gate, but the upcoming verifier-time filter uses set
membership as the source of truth and needs it to reflect every context
the kfunc may be called from.
Add the three select_cpu kfuncs to scx_kfunc_ids_unlocked so their full
set of callable contexts is captured by set membership. This follows the
existing dual-set convention used by scx_bpf_dsq_move{,_vtime} and
scx_bpf_dsq_move_set_{slice,vtime}, which are members of both
scx_kfunc_ids_dispatch and scx_kfunc_ids_unlocked.
While at it, add brief comments on each duplicate BTF_ID_FLAGS block
(including the pre-existing dsq_move ones) explaining the dual
membership.
No runtime behavior change: the runtime check in select_cpu_from_kfunc()
remains the authoritative gate until it is removed along with the rest
of the scx_kf_mask enforcement in a follow-up.
v2: Clarify dispatch-set comment to name scx_bpf_dsq_move*() explicitly so it
doesn't appear to cover scx_bpf_sub_dispatch() (Andrea Righi).
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
|
||
|
|
9b5501d3c9 |
sched_ext: Drop TRACING access to select_cpu kfuncs
The select_cpu kfuncs - scx_bpf_select_cpu_dfl(), scx_bpf_select_cpu_and() and __scx_bpf_select_cpu_and() - take task_rq_lock() internally. Exposing them via scx_kfunc_set_idle to BPF_PROG_TYPE_TRACING is unsafe: arbitrary tracing contexts (kprobes, tracepoints, fentry, LSM) may run with @p's pi_lock state unknown. Move them out of scx_kfunc_ids_idle into a new scx_kfunc_ids_select_cpu set registered only for STRUCT_OPS and SYSCALL. Extracted from a larger verifier-time kfunc context filter patch originally written by Juntong Deng. Original-patch-by: Juntong Deng <juntong.deng@outlook.com> Cc: Cheng-Yang Chou <yphbchou0911@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com> |
||
|
|
78cde54ea5 |
sched/eevdf: Clear buddies for preempt_short
next buddy should not prevent shorter slice preemption. Don't take buddy
into account when checking if shorter slice entity can preempt and clear it
if the entity with a shorter slice can preempt current.
Test on snapdragon rb5:
hackbench -T -p -l 16000000 -g 2 1> /dev/null &
hackbench runs in cgroup /test-A
cyclictest -t 1 -i 2777 -D 63 --policy=fair --mlock -h 20000 -q
cyclictest runs in cgroup /test-B
tip/sched/core tip/sched/core +this patch
cyclictest slice (ms) (default)2.8 8 8
hackbench slice (ms) (default)2.8 20 20
Total Samples | 22679 22595 22686
Average (us) | 84 94(-12%) 59( 37%)
Median (P50) (us) | 56 56( 0%) 56( 0%)
90th Percentile (us) | 64 65(- 2%) 63( 3%)
99th Percentile (us) | 1047 1273(-22%) 74( 94%)
99.9th Percentile (us) | 2431 4751(-95%) 663( 86%)
Maximum (us) | 4694 8655(-84%) 3934( 55%)
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260410132321.2897789-1-vincent.guittot@linaro.org
|
||
|
|
480a9e57cc |
Merge branches 'for-next/misc', 'for-next/tlbflush', 'for-next/ttbr-macros-cleanup', 'for-next/kselftest', 'for-next/feat_lsui', 'for-next/mpam', 'for-next/hotplug-batched-tlbi', 'for-next/bbml2-fixes', 'for-next/sysreg', 'for-next/generic-entry' and 'for-next/acpi', remote-tracking branches 'arm64/for-next/perf' and 'arm64/for-next/read-once' into for-next/core
* arm64/for-next/perf:
: Perf updates
perf/arm-cmn: Fix resource_size_t printk specifier in arm_cmn_init_dtc()
perf/arm-cmn: Fix incorrect error check for devm_ioremap()
perf: add NVIDIA Tegra410 C2C PMU
perf: add NVIDIA Tegra410 CPU Memory Latency PMU
perf/arm_cspmu: nvidia: Add Tegra410 PCIE-TGT PMU
perf/arm_cspmu: nvidia: Add Tegra410 PCIE PMU
perf/arm_cspmu: Add arm_cspmu_acpi_dev_get
perf/arm_cspmu: nvidia: Add Tegra410 UCF PMU
perf/arm_cspmu: nvidia: Rename doc to Tegra241
perf/arm-cmn: Stop claiming entire iomem region
arm64: cpufeature: Use pmuv3_implemented() function
arm64: cpufeature: Make PMUVer and PerfMon unsigned
KVM: arm64: Read PMUVer as unsigned
* arm64/for-next/read-once:
: Fixes for __READ_ONCE() with CONFIG_LTO=y
arm64, compiler-context-analysis: Permit alias analysis through __READ_ONCE() with CONFIG_LTO=y
arm64: Optimize __READ_ONCE() with CONFIG_LTO=y
* for-next/misc:
: Miscellaneous cleanups/fixes
arm64: rsi: use linear-map alias for realm config buffer
arm64: Kconfig: fix duplicate word in CMDLINE help text
arm64: mte: Skip TFSR_EL1 checks and barriers in synchronous tag check mode
arm64/hwcap: Generate the KERNEL_HWCAP_ definitions for the hwcaps
arm64: kexec: Remove duplicate allocation for trans_pgd
arm64: mm: Use generic enum pgtable_level
arm64: scs: Remove redundant save/restore of SCS SP on entry to/from EL0
arm64: remove ARCH_INLINE_*
* for-next/tlbflush:
: Refactor the arm64 TLB invalidation API and implementation
arm64: mm: __ptep_set_access_flags must hint correct TTL
arm64: mm: Provide level hint for flush_tlb_page()
arm64: mm: Wrap flush_tlb_page() around __do_flush_tlb_range()
arm64: mm: More flags for __flush_tlb_range()
arm64: mm: Refactor __flush_tlb_range() to take flags
arm64: mm: Refactor flush_tlb_page() to use __tlbi_level_asid()
arm64: mm: Simplify __flush_tlb_range_limit_excess()
arm64: mm: Simplify __TLBI_RANGE_NUM() macro
arm64: mm: Re-implement the __flush_tlb_range_op macro in C
arm64: mm: Inline __TLBI_VADDR_RANGE() into __tlbi_range()
arm64: mm: Push __TLBI_VADDR() into __tlbi_level()
arm64: mm: Implicitly invalidate user ASID based on TLBI operation
arm64: mm: Introduce a C wrapper for by-range TLB invalidation
arm64: mm: Re-implement the __tlbi_level macro as a C function
* for-next/ttbr-macros-cleanup:
: Cleanups of the TTBR1_* macros
arm64/mm: Directly use TTBRx_EL1_CnP
arm64/mm: Directly use TTBRx_EL1_ASID_MASK
arm64/mm: Describe TTBR1_BADDR_4852_OFFSET
* for-next/kselftest:
: arm64 kselftest updates
selftests/arm64: Implement cmpbr_sigill() to hwcap test
* for-next/feat_lsui:
: Futex support using FEAT_LSUI instructions to avoid toggling PAN
arm64: armv8_deprecated: Disable swp emulation when FEAT_LSUI present
arm64: Kconfig: Add support for LSUI
KVM: arm64: Use CAST instruction for swapping guest descriptor
arm64: futex: Support futex with FEAT_LSUI
arm64: futex: Refactor futex atomic operation
KVM: arm64: kselftest: set_id_regs: Add test for FEAT_LSUI
KVM: arm64: Expose FEAT_LSUI to guests
arm64: cpufeature: Add FEAT_LSUI
* for-next/mpam: (40 commits)
: Expose MPAM to user-space via resctrl:
: - Add architecture context-switch and hiding of the feature from KVM.
: - Add interface to allow MPAM to be exposed to user-space using resctrl.
: - Add errata workaoround for some existing platforms.
: - Add documentation for using MPAM and what shape of platforms can use resctrl
arm64: mpam: Add initial MPAM documentation
arm_mpam: Quirk CMN-650's CSU NRDY behaviour
arm_mpam: Add workaround for T241-MPAM-6
arm_mpam: Add workaround for T241-MPAM-4
arm_mpam: Add workaround for T241-MPAM-1
arm_mpam: Add quirk framework
arm_mpam: resctrl: Call resctrl_init() on platforms that can support resctrl
arm64: mpam: Select ARCH_HAS_CPU_RESCTRL
arm_mpam: resctrl: Add empty definitions for assorted resctrl functions
arm_mpam: resctrl: Update the rmid reallocation limit
arm_mpam: resctrl: Add resctrl_arch_rmid_read()
arm_mpam: resctrl: Allow resctrl to allocate monitors
arm_mpam: resctrl: Add support for csu counters
arm_mpam: resctrl: Add monitor initialisation and domain boilerplate
arm_mpam: resctrl: Add kunit test for control format conversions
arm_mpam: resctrl: Add support for 'MB' resource
arm_mpam: resctrl: Wait for cacheinfo to be ready
arm_mpam: resctrl: Add rmid index helpers
arm_mpam: resctrl: Convert to/from MPAMs fixed-point formats
arm_mpam: resctrl: Hide CDP emulation behind CONFIG_EXPERT
...
* for-next/hotplug-batched-tlbi:
: arm64/mm: Enable batched TLB flush in unmap_hotplug_range()
arm64/mm: Reject memory removal that splits a kernel leaf mapping
arm64/mm: Enable batched TLB flush in unmap_hotplug_range()
* for-next/bbml2-fixes:
: Fixes for realm guest and BBML2_NOABORT
arm64: mm: Remove pmd_sect() and pud_sect()
arm64: mm: Handle invalid large leaf mappings correctly
arm64: mm: Fix rodata=full block mapping support for realm guests
* for-next/sysreg:
: arm64 sysreg updates
arm64/sysreg: Update ID_AA64SMFR0_EL1 description to DDI0601 2025-12
arm64/sysreg: Update ID_AA64ZFR0_EL1 description to DDI0601 2025-12
arm64/sysreg: Update ID_AA64FPFR0_EL1 description to DDI0601 2025-12
arm64/sysreg: Update ID_AA64ISAR2_EL1 description to DDI0601 2025-12
arm64/sysreg: Update ID_AA64ISAR0_EL1 description to DDI0601 2025-12
arm64/sysreg: Update SMIDR_EL1 to DDI0601 2025-06
* for-next/generic-entry:
: More arm64 refactoring towards using the generic entry code
arm64: Check DAIF (and PMR) at task-switch time
arm64: entry: Use split preemption logic
arm64: entry: Use irqentry_{enter_from,exit_to}_kernel_mode()
arm64: entry: Consistently prefix arm64-specific wrappers
arm64: entry: Don't preempt with SError or Debug masked
entry: Split preemption from irqentry_exit_to_kernel_mode()
entry: Split kernel mode logic from irqentry_{enter,exit}()
entry: Move irqentry_enter() prototype later
entry: Remove local_irq_{enable,disable}_exit_to_user()
entry: Fix stale comment for irqentry_enter()
* for-next/acpi:
: arm64 ACPI updates
ACPI: AGDI: fix missing newline in error message
|
||
|
|
7431d90cfc |
Merge branches 'pm-cpuidle', 'pm-opp' and 'pm-sleep'
Merge cpuidle updates, OPP (operating performance points) library updates, and updates related to system suspend and hibernation for 7.1-rc1: - Refine stopped tick handling in the menu cpuidle governor and rearrange stopped tick handling in the teo cpuidle governor (Rafael Wysocki) - Add Panther Lake C-states table to the intel_idle driver (Artem Bityutskiy) - Clean up dead dependencies on CPU_IDLE in Kconfig (Julian Braha) - Simplify cpuidle_register_device() with guard() (Huisong Li) - Use performance level if available to distinguish between rates in OPP debugfs (Manivannan Sadhasivam) - Fix scoped_guard in dev_pm_opp_xlate_required_opp() (Viresh Kumar) - Return -ENODATA if the snapshot image is not loaded (Alberto Garcia) - Remove inclusion of crypto/hash.h from hibernate_64.c on x86 (Eric Biggers) * pm-cpuidle: cpuidle: Simplify cpuidle_register_device() with guard() cpuidle: clean up dead dependencies on CPU_IDLE in Kconfig intel_idle: Add Panther Lake C-states table cpuidle: governors: teo: Rearrange stopped tick handling cpuidle: governors: menu: Refine stopped tick handling * pm-opp: OPP: Move break out of scoped_guard in dev_pm_opp_xlate_required_opp() OPP: debugfs: Use performance level if available to distinguish between rates * pm-sleep: PM: hibernate: return -ENODATA if the snapshot image is not loaded PM: hibernate: x86: Remove inclusion of crypto/hash.h |
||
|
|
83e990310d |
Merge branch 'pm-cpufreq'
Merge cpufreq updates for 7.1-rc1:
- Update qcom-hw DT bindings to include Eliza hardware (Abel Vesa)
- Update cpufreq-dt-platdev blocklist (Faruque Ansari)
- Minor updates to driver and dt-bindings for Tegra (Thierry Reding,
Rosen Penev)
- Add MAINTAINERS entry for CPPC driver (Viresh Kumar)
- Add support for new features: CPPC performance priority, Dynamic EPP,
Raw EPP, and new unit tests for them to amd-pstate (Gautham Shenoy,
Mario Limonciello)
- Fix sysfs files being present when HW missing and broken/outdated
documentation in the amd-pstate driver (Ninad Naik, Gautham Shenoy)
- Pass the policy to cpufreq_driver->adjust_perf() to avoid using
cpufreq_cpu_get() in the .adjust_perf() callback in amd-pstate which
leads to a scheduling-while-atomic bug (K Prateek Nayak)
- Clean up dead code in Kconfig for cpufreq (Julian Braha)
- Remove max_freq_req update for pre-existing cpufreq policy and add a
boost_freq_req QoS request to save the boost constraint instead of
overwriting the last scaling_max_freq constraint (Pierre Gondois)
- Embed cpufreq QoS freq_req objects in cpufreq policy so they all
are allocated in one go along with the policy to simplify lifetime
rules and avoid error handling issues (Viresh Kumar)
- Use DMI max speed when CPPC is unavailable in the acpi-cpufreq
scaling driver (Henry Tseng)
- Switch policy_is_shared() in cpufreq to using cpumask_nth() instead
of cpumask_weight() because the former is more efficient (Yury Norov)
- Use sysfs_emit() in sysfs show functions for cpufreq governor
attributes (Thorsten Blum)
- Update intel_pstate to stop returning an error when "off" is written
to its status sysfs attribute while the driver is already off (Fabio
De Francesco)
- Include current frequency in the debug message printed by
__cpufreq_driver_target() (Pengjie Zhang)
* pm-cpufreq: (38 commits)
cpufreq/amd-pstate: Add POWER_SUPPLY select for dynamic EPP
MAINTAINERS: amd-pstate: Step down as maintainer, add Prateek as reviewer
cpufreq: Pass the policy to cpufreq_driver->adjust_perf()
cpufreq/amd-pstate: Pass the policy to amd_pstate_update()
cpufreq/amd-pstate-ut: Add a unit test for raw EPP
cpufreq/amd-pstate: Add support for raw EPP writes
cpufreq/amd-pstate: Add support for platform profile class
cpufreq/amd-pstate: add kernel command line to override dynamic epp
cpufreq/amd-pstate: Add dynamic energy performance preference
Documentation: amd-pstate: fix dead links in the reference section
cpufreq/amd-pstate: Cache the max frequency in cpudata
Documentation/amd-pstate: Add documentation for amd_pstate_floor_{freq,count}
Documentation/amd-pstate: List amd_pstate_prefcore_ranking sysfs file
Documentation/amd-pstate: List amd_pstate_hw_prefcore sysfs file
amd-pstate-ut: Add a testcase to validate the visibility of driver attributes
amd-pstate-ut: Add module parameter to select testcases
amd-pstate: Introduce a tracepoint trace_amd_pstate_cppc_req2()
amd-pstate: Add sysfs support for floor_freq and floor_count
amd-pstate: Add support for CPPC_REQ2 and FLOOR_PERF
x86/cpufeatures: Add AMD CPPC Performance Priority feature.
...
|
||
|
|
3348e1e83a |
cgroup/rdma: fix swapped arguments in pr_warn() format string
The format string says "device %p ... rdma cgroup %p" but the arguments were passed as (cg, device), printing them in the wrong order. Signed-off-by: cuitao <cuitao@kylinos.cn> Signed-off-by: Tejun Heo <tj@kernel.org> |
||
|
|
a0c584fc18 |
bpf: Fix use-after-free in offloaded map/prog info fill
When querying info for an offloaded BPF map or program, bpf_map_offload_info_fill_ns() and bpf_prog_offload_info_fill_ns() obtain the network namespace with get_net(dev_net(offmap->netdev)). However, the associated netdev's netns may be racing with teardown during netns destruction. If the netns refcount has already reached 0, get_net() performs a refcount_t increment on 0, triggering: refcount_t: addition on 0; use-after-free. Although rtnl_lock and bpf_devs_lock ensure the netdev pointer remains valid, they cannot prevent the netns refcount from reaching zero. Fix this by using maybe_get_net() instead of get_net(). maybe_get_net() uses refcount_inc_not_zero() and returns NULL if the refcount is already zero, which causes ns_get_path_cb() to fail and the caller to return -ENOENT -- the correct behavior when the netns is being destroyed. Fixes: |
||
|
|
9f118095dd |
bpf: Drop pkt_end markers on arithmetic to prevent is_pkt_ptr_branch_taken
When a pkt pointer acquires AT_PKT_END or BEYOND_PKT_END range from
a comparison, and then, known-constant arithmetic is performed,
adjust_ptr_min_max_vals() copies the stale range via dst_reg->raw =
ptr_reg->raw without clearing the negative reg->range sentinel values.
This lets is_pkt_ptr_branch_taken() choose one branch direction and
skip going through the other. Fix this by clearing negative pkt range
values (that is, AT_PKT_END and BEYOND_PKT_END) after arithmetic on
pkt pointers. This ensures is_pkt_ptr_branch_taken() returns unknown
and both branches are properly verified.
Fixes:
|
||
|
|
3ffcd57823 |
last dma-mapping fix for Linux 7.0
A fix for DMA-mapping subsystem, which hides annoying, false-positive warnings from DMA-API debug on coherent platforms like x86_64 (Mikhail Gavrilov). -----BEGIN PGP SIGNATURE----- iHUEABYIAB0WIQSrngzkoBtlA8uaaJ+Jp1EFxbsSRAUCadfSDQAKCRCJp1EFxbsS RPtKAQCzfZRx2zC6ACG5opRZqqsUAah5lko9ROJZ1CK2yrfvAwEA7tGeQ8YgSUxT Xu5BHh7Ff/jk0ph001/6j7OSBZtuKAY= =xdCr -----END PGP SIGNATURE----- Merge tag 'dma-mapping-7.0-2026-04-09' of git://git.kernel.org/pub/scm/linux/kernel/git/mszyprowski/linux Pull dma-mapping fix from Marek Szyprowski: "A fix for DMA-mapping subsystem, which hides annoying, false-positive warnings from DMA-API debug on coherent platforms like x86_64 (Mikhail Gavrilov)" * tag 'dma-mapping-7.0-2026-04-09' of git://git.kernel.org/pub/scm/linux/kernel/git/mszyprowski/linux: dma-debug: suppress cacheline overlap warning when arch has no DMA alignment requirement |
||
|
|
9dba0ae973 |
bpf: Remove static qualifier from local subprog pointer
The local subprog pointer in create_jt() and visit_abnormal_return_insn() was declared static. It is unconditionally assigned via bpf_find_containing_subprog() before every use. Thus, the static qualifier serves no purpose and rather creates confusion. Just remove it. Fixes: |
||
|
|
ee861486e3 |
bpf: Fix ld_{abs,ind} failure path analysis in subprogs
Usage of ld_{abs,ind} instructions got extended into subprogs some time
ago via commit
|
||
|
|
6bd96e40f3 |
bpf: Propagate error from visit_tailcall_insn
Commit |
||
|
|
4f64d5b664 |
bpf: Make find_linfo widely available
Move find_linfo() as bpf_find_linfo() into core.c to allow for its use in the verifier in subsequent patches. Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Acked-by: Mykyta Yatsenko <yatsenko@meta.com> Link: https://lore.kernel.org/r/20260408021359.3786905-4-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> |
||
|
|
fbb98834a9 |
bpf: Extract bpf_get_linfo_file_line
Extract bpf_get_linfo_file_line as its own function so that the logic to obtain the file, line, and line number for a given program can be shared in subsequent patches. Reviewed-by: Puranjay Mohan <puranjay@kernel.org> Acked-by: Mykyta Yatsenko <yatsenko@meta.com> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20260408021359.3786905-3-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> |
||
|
|
2de32a25a3 |
Merge branch kvm-arm64/hyp-tracing into kvmarm-master/next
* kvm-arm64/hyp-tracing: (40 commits) : . : EL2 tracing support, adding both 'remote' ring-buffer : infrastructure and the tracing itself, courtesy of : Vincent Donnefort. From the cover letter: : : "The growing set of features supported by the hypervisor in protected : mode necessitates debugging and profiling tools. Tracefs is the : ideal candidate for this task: : : * It is simple to use and to script. : : * It is supported by various tools, from the trace-cmd CLI to the : Android web-based perfetto. : : * The ring-buffer, where are stored trace events consists of linked : pages, making it an ideal structure for sharing between kernel and : hypervisor. : : This series first introduces a new generic way of creating remote events and : remote buffers. Then it adds support to the pKVM hypervisor." : . tracing: selftests: Extend hotplug testing for trace remotes tracing: Non-consuming read for trace remotes with an offline CPU tracing: Adjust cmd_check_undefined to show unexpected undefined symbols tracing: Restore accidentally removed SPDX tag KVM: arm64: avoid unused-variable warning tracing: Generate undef symbols allowlist for simple_ring_buffer KVM: arm64: tracing: add ftrace dependency tracing: add more symbols to whitelist tracing: Update undefined symbols allow list for simple_ring_buffer KVM: arm64: Fix out-of-tree build for nVHE/pKVM tracing tracing: selftests: Add hypervisor trace remote tests KVM: arm64: Add selftest event support to nVHE/pKVM hyp KVM: arm64: Add hyp_enter/hyp_exit events to nVHE/pKVM hyp KVM: arm64: Add event support to the nVHE/pKVM hyp and trace remote KVM: arm64: Add trace reset to the nVHE/pKVM hyp KVM: arm64: Sync boot clock with the nVHE/pKVM hyp KVM: arm64: Add trace remote for the nVHE/pKVM hyp KVM: arm64: Add tracing capability for the nVHE/pKVM hyp KVM: arm64: Support unaligned fixmap in the pKVM hyp KVM: arm64: Initialise hyp_nr_cpus for nVHE hyp ... Signed-off-by: Marc Zyngier <maz@kernel.org> |
||
|
|
5a84b60005 |
perf/events: Replace READ_ONCE() with standard pgtable accessors
Replace raw READ_ONCE() dereferences of pgtable entries with corresponding standard page table accessors pxdp_get() in perf_get_pgtable_size(). These accessors default to READ_ONCE() on platforms that don't override them. So there is no functional change on such platforms. However arm64 platform is being extended to support 128 bit page tables via a new architecture feature i.e FEAT_D128 in which case READ_ONCE() will not provide required single copy atomic access for 128 bit page table entries. Although pxdp_get() accessors can later be overridden on arm64 platform to extend required single copy atomicity support on 128 bit entries. Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260227062744.2215491-1-anshuman.khandual@arm.com |
||
|
|
985215804d |
sched/rt: Cleanup global RT bandwidth functions
The commit
|
||
|
|
4f70a0456d |
sched/rt: Move group schedulability check to sched_rt_global_validate()
The sched_rt_global_constraints() function is a remnant that used to set
up global RT throttling but that is no more since commit
|
||
|
|
8b016dcec9 |
sched/rt: Skip group schedulable check with rt_group_sched=0
The warning from the commit |
||
|
|
14a8570564 |
sched/deadline: Use revised wakeup rule for dl_server
John noted that commit |
||
|
|
c5538d0141 |
entry: Split kernel mode logic from irqentry_{enter,exit}()
The generic irqentry code has entry/exit functions specifically for exceptions taken from user mode, but doesn't have entry/exit functions specifically for exceptions taken from kernel mode. It would be helpful to have separate entry/exit functions specifically for exceptions taken from kernel mode. This would make the structure of the entry code more consistent, and would make it easier for architectures to manage logic specific to exceptions taken from kernel mode. Move the logic specific to kernel mode out of irqentry_enter() and irqentry_exit() into new irqentry_enter_from_kernel_mode() and irqentry_exit_to_kernel_mode() functions. These are marked __always_inline and placed in irq-entry-common.h, as with irqentry_enter_from_user_mode() and irqentry_exit_to_user_mode(), so that they can be inlined into architecture-specific wrappers. The existing out-of-line irqentry_enter() and irqentry_exit() functions retained as callers of the new functions. The lockdep assertion from irqentry_exit() is moved into irqentry_exit_to_user_mode() and irqentry_exit_to_kernel_mode(). This was previously missing from irqentry_exit_to_user_mode() when called directly, and any new lockdep assertion failure relating from this change is a latent bug. Aside from the lockdep change noted above, there should be no functional change as a result of this change. [ tglx: Updated kernel doc ] Signed-off-by: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Reviewed-by: Jinjie Ruan <ruanjinjie@huawei.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260407131650.3813777-5-mark.rutland@arm.com |
||
|
|
22f66e7ef4 |
entry: Remove local_irq_{enable,disable}_exit_to_user()
local_irq_enable_exit_to_user() and local_irq_disable_exit_to_user() are
never overridden by architecture code, and are always equivalent to
local_irq_enable() and local_irq_disable().
These functions were added on the assumption that arm64 would override
them to manage 'DAIF' exception masking, as described by Thomas Gleixner
in these threads:
https://lore.kernel.org/all/20190919150809.340471236@linutronix.de/
https://lore.kernel.org/all/alpine.DEB.2.21.1910240119090.1852@nanos.tec.linutronix.de/
In practice arm64 did not need to override either. Prior to moving to
the generic irqentry code, arm64's management of DAIF was reworked in
commit:
|
||
|
|
017f5c4ef7 |
bpf: Allow overwriting referenced dynptr when refcnt > 1
The verifier currently does not allow overwriting a referenced dynptr's stack slot to prevent resource leak. This is because referenced dynptr holds additional resources that requires calling specific helpers to release. This limitation can be relaxed when there are multiple copies of the same dynptr. Whether it is the orignial dynptr or one of its clones, as long as there exists at least one other dynptr with the same ref_obj_id (to be used to release the reference), its stack slot should be allowed to be overwritten. Suggested-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Amery Hung <ameryhung@gmail.com> Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20260406150548.1354271-2-ameryhung@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> |
||
|
|
1b327732c8 |
bpf: Clear delta when clearing reg id for non-{add,sub} ops
When a non-{add,sub} alu op such as xor is performed on a scalar
register that previously had a BPF_ADD_CONST delta, the else path
in adjust_reg_min_max_vals() only clears dst_reg->id but leaves
dst_reg->delta unchanged.
This stale delta can propagate via assign_scalar_id_before_mov()
when the register is later used in a mov. It gets a fresh id but
keeps the stale delta from the old (now-cleared) BPF_ADD_CONST.
This stale delta can later propagate leading to a verifier-vs-
runtime value mismatch.
The clear_id label already correctly clears both delta and id.
Make the else path consistent by also zeroing the delta when id
is cleared. More generally, this introduces a helper clear_scalar_id()
which internally takes care of zeroing. There are various other
locations in the verifier where only the id is cleared. By using
the helper we catch all current and future locations.
Fixes:
|
||
|
|
d7f14173c0 |
bpf: Fix linked reg delta tracking when src_reg == dst_reg
Consider the case of rX += rX where src_reg and dst_reg are pointers to
the same bpf_reg_state in adjust_reg_min_max_vals(). The latter first
modifies the dst_reg in-place, and later in the delta tracking, the
subsequent is_reg_const(src_reg)/reg_const_value(src_reg) reads the
post-{add,sub} value instead of the original source.
This is problematic since it sets an incorrect delta, which sync_linked_regs()
then propagates to linked registers, thus creating a verifier-vs-runtime
mismatch. Fix it by just skipping this corner case.
Fixes:
|
||
|
|
1870ddcd94 |
bpf: Prefer vmlinux symbols over module symbols for unqualified kprobes
When an unqualified kprobe target exists in both vmlinux and a loaded module, number_of_same_symbols() returns a count greater than 1, causing kprobe attachment to fail with -EADDRNOTAVAIL even though the vmlinux symbol is unambiguous. When no module qualifier is given and the symbol is found in vmlinux, return the vmlinux-only count without scanning loaded modules. This preserves the existing behavior for all other cases: - Symbol only in a module: vmlinux count is 0, falls through to module scan as before. - Symbol qualified with MOD:SYM: mod != NULL, unchanged path. - Symbol ambiguous within vmlinux itself: count > 1 is returned as-is. Fixes: |
||
|
|
57b23c0f61 |
bpf: Retire rcu_trace_implies_rcu_gp()
RCU Tasks Trace grace period implies RCU grace period, and this guarantee is expected to remain in the future. Only BPF is the user of this predicate, hence retire the API and clean up all in-tree users. RCU Tasks Trace is now implemented on SRCU-fast and its grace period mechanism always has at least one call to synchronize_rcu() as it is required for SRCU-fast's correctness (it replaces the smp_mb() that SRCU-fast readers skip). So, RCU-tt GP will always imply RCU GP. Reviewed-by: Puranjay Mohan <puranjay@kernel.org> Reviewed-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20260407162234.785270-1-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> |
||
|
|
034db4dd44 |
workqueue: use NR_STD_WORKER_POOLS instead of hardcoded value
use NR_STD_WORKER_POOLS for irq_work_fns[] array definition. NR_STD_WORKER_POOLS is also 2, but better to use MACRO. Initialization loop for_each_bh_worker_pool() also uses same MACRO. Signed-off-by: Maninder Singh <maninder1.s@samsung.com> Signed-off-by: Tejun Heo <tj@kernel.org> |
||
|
|
66d64899ea |
8 hotfixes. All are cc:stable and 7 are for MM.
All are singletons - please see the changelogs for details. -----BEGIN PGP SIGNATURE----- iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCadQzWgAKCRDdBJ7gKXxA jp58AP9C4nNI7ReZ9hyH3xg32JZIszk8vzwRUZte/HvvY6YsXQEAt94DZnO8wa3k DoYYRjIi9PfhZzI4aRJFXgANWI+0twc= =VqfR -----END PGP SIGNATURE----- Merge tag 'mm-hotfixes-stable-2026-04-06-15-27' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull misc fixes from Andrew Morton: "Eight hotfixes. All are cc:stable and seven are for MM. All are singletons - please see the changelogs for details" * tag 'mm-hotfixes-stable-2026-04-06-15-27' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: ocfs2: fix out-of-bounds write in ocfs2_write_end_inline mm/damon/stat: deallocate damon_call() failure leaking damon_ctx mm/vma: fix memory leak in __mmap_region() mm/memory_hotplug: maintain N_NORMAL_MEMORY during hotplug mm/damon/sysfs: dealloc repeat_call_control if damon_call() fails mm: reinstate unconditional writeback start in balance_dirty_pages() liveupdate: propagate file deserialization failures mm: filemap: fix nr_pages calculation overflow in filemap_map_pages() |
||
|
|
09c04714cb |
alarmtimer: Access timerqueue node under lock in suspend
In alarmtimer_suspend(), timerqueue_getnext() is called under base->lock, but next->expires is read after the lock is released. This is safe because suspend freezes all relevant task contexts, but reading the node while holding the lock makes the code easier to reason about and not worry about a theoretical UAF. Signed-off-by: Zhan Xusheng <zhanxusheng@xiaomi.com> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Link: https://patch.msgid.link/20260407143627.19405-1-zhanxusheng@xiaomi.com |
||
|
|
beaf0e96b1 |
bpf: Drop task_to_inode and inet_conn_established from lsm sleepable hooks
bpf_lsm_task_to_inode() is called under rcu_read_lock() and
bpf_lsm_inet_conn_established() is called from softirq context, so
neither hook can be used by sleepable LSM programs.
Fixes:
|
||
|
|
82b915051d |
tick/nohz: Fix inverted return value in check_tick_dependency() fast path
Commit |
||
|
|
556146ce5e |
sched/fair: Avoid overflow in enqueue_entity()
Here is one scenario which was triggered when running:
stress-ng --yield=32 -t 10000000s&
while true; do perf bench sched messaging -p -t -l 100000 -g 16; done
on a 256CPUs machine after about an hour into the run:
__enqeue_entity: entity_key(-141245081754) weight(90891264) overflow_mul(5608800059305154560) vlag(57498) delayed?(0)
cfs_rq: zero_vruntime(3809707759657809) sum_w_vruntime(0) sum_weight(0) nr_queued(1)
cfs_rq->curr: entity_key(0) vruntime(3809707759657809) deadline(3809723966988476) weight(37)
The above comes from __enqueue_entity() after a place_entity(). Breaking
this down:
vlag_initial = 57498
vlag = (57498 * (37 + 90891264)) / 37 = 141,245,081,754
vruntime = 3809707759657809 - 141245081754 = 3,809,566,514,576,055
entity_key(se, cfs_rq) = -141,245,081,754
Now, multiplying the entity_key with its own weight results to
5,608,800,059,305,154,560 (same as what overflow_mul() suggests) but
in Python, without overflow, this would be: -1,2837,944,014,404,397,056
Avoid the overflow (without doing the division for avg_vruntime()), by moving
zero_vruntime to the new entity when it is heavier.
Fixes:
|
||
|
|
c6e80201e0 |
sched: Use u64 for bandwidth ratio calculations
to_ratio() computes BW_SHIFT-scaled bandwidth ratios from u64 period and
runtime values, but it returns unsigned long. tg_rt_schedulable() also
stores the current group limit and the accumulated child sum in unsigned
long.
On 32-bit builds, large bandwidth ratios can be truncated and the RT
group sum can wrap when enough siblings are present. That can let an
overcommitted RT hierarchy pass the schedulability check, and it also
narrows the helper result for other callers.
Return u64 from to_ratio() and use u64 for the RT group totals so
bandwidth ratios are preserved and compared at full width on both 32-bit
and 64-bit builds.
Fixes:
|
||
|
|
43cd9d9520 |
bpf: Do not ignore offsets for loads from insn_arrays
When a pointer to PTR_TO_INSN is dereferenced, the offset field
of the BPF_LDX_MEM instruction can be nonzero. Patch the verifier
to not ignore this field.
Reported-by: Jiyong Yang <ksur673@gmail.com>
Fixes:
|
||
|
|
18474aed5d |
bpf: Avoid -Wflex-array-members-not-at-end warnings
Apparently, struct bpf_empty_prog_array exists entirely to populate a single element of "items" in a global variable. "null_prog" is only used during the initializer. None of this is needed; globals will be correctly sized with an array initializer of a flexible-array member. So, remove struct bpf_empty_prog_array and adjust the rest of the code, accordingly. With these changes, fix the following warnings: ./include/linux/bpf.h:2369:31: warning: structure containing a flexible array member is not at the end of another structure [-Wflex-array-member-not-at-end] Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org> Acked-by: Mykyta Yatsenko <yatsenko@meta.com> Link: https://lore.kernel.org/r/acr7Whmn0br3xeBP@kspp Signed-off-by: Alexei Starovoitov <ast@kernel.org> |
||
|
|
f25777056e |
bpf: Enable unaligned accesses for syscall ctx
Don't reject usage of fixed unaligned offsets for syscall ctx. Tests will be added in later commits. Unaligned offsets already work for variable offsets. Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20260406194403.1649608-3-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> |
||
|
|
ae5ef001aa |
bpf: Support variable offsets for syscall PTR_TO_CTX
Allow accessing PTR_TO_CTX with variable offsets in syscall programs.
Fixed offsets are already enabled for all program types that do not
convert their ctx accesses, since the changes we made in the commit
|
||
|
|
307e0c5859 |
liveupdate: propagate file deserialization failures
luo_session_deserialize() ignored the return value from
luo_file_deserialize(). As a result, a session could be left partially
restored even though the /dev/liveupdate open path treats deserialization
failures as fatal.
Propagate the error so a failed file deserialization aborts session
deserialization instead of silently continuing.
Link: https://lkml.kernel.org/r/20260325044608.8407-1-leotimmins1974@gmail.com
Link: https://lkml.kernel.org/r/20260325044608.8407-2-leotimmins1974@gmail.com
Fixes:
|
||
|
|
a1aa9ef47c |
bpf: Fix stale offload->prog pointer after constant blinding
When a dev-bound-only BPF program (BPF_F_XDP_DEV_BOUND_ONLY) undergoes
JIT compilation with constant blinding enabled (bpf_jit_harden >= 2),
bpf_jit_blind_constants() clones the program. The original prog is then
freed in bpf_jit_prog_release_other(), which updates aux->prog to point
to the surviving clone, but fails to update offload->prog.
This leaves offload->prog pointing to the freed original program. When
the network namespace is subsequently destroyed, cleanup_net() triggers
bpf_dev_bound_netdev_unregister(), which iterates ondev->progs and calls
__bpf_prog_offload_destroy(offload->prog). Accessing the freed prog
causes a page fault:
BUG: unable to handle page fault for address: ffffc900085f1038
Workqueue: netns cleanup_net
RIP: 0010:__bpf_prog_offload_destroy+0xc/0x80
Call Trace:
__bpf_offload_dev_netdev_unregister+0x257/0x350
bpf_dev_bound_netdev_unregister+0x4a/0x90
unregister_netdevice_many_notify+0x2a2/0x660
...
cleanup_net+0x21a/0x320
The test sequence that triggers this reliably is:
1. Set net.core.bpf_jit_harden=2 (echo 2 > /proc/sys/net/core/bpf_jit_harden)
2. Run xdp_metadata selftest, which creates a dev-bound-only XDP
program on a veth inside a netns (./test_progs -t xdp_metadata)
3. cleanup_net -> page fault in __bpf_prog_offload_destroy
Dev-bound-only programs are unique in that they have an offload structure
but go through the normal JIT path instead of bpf_prog_offload_compile().
This means they are subject to constant blinding's prog clone-and-replace,
while also having offload->prog that must stay in sync.
Fix this by updating offload->prog in bpf_jit_prog_release_other(),
alongside the existing aux->prog update. Both are back-pointers to
the prog that must be kept in sync when the prog is replaced.
Fixes:
|
||
|
|
5828b9e5b2 |
bpf: fix end-of-list detection in cgroup_storage_get_next_key()
list_next_entry() never returns NULL -- when the current element is the
last entry it wraps to the list head via container_of(). The subsequent
NULL check is therefore dead code and get_next_key() never returns
-ENOENT for the last element, instead reading storage->key from a bogus
pointer that aliases internal map fields and copying the result to
userspace.
Replace it with list_entry_is_head() so the function correctly returns
-ENOENT when there are no more entries.
Fixes:
|
||
|
|
07738bc566 |
bpf: Use copy_map_value_locked() in alloc_htab_elem() for BPF_F_LOCK
When a BPF_F_LOCK update races with a concurrent delete, the freed
element can be immediately recycled by alloc_htab_elem(). The fast path
in htab_map_update_elem() performs a lockless lookup and then calls
copy_map_value_locked() under the element's spin_lock. If
alloc_htab_elem() recycles the same memory, it overwrites the value
with plain copy_map_value(), without taking the spin_lock, causing
torn writes.
Use copy_map_value_locked() when BPF_F_LOCK is set so the new element's
value is written under the embedded spin_lock, serializing against any
stale lock holders.
Fixes:
|