linux

mirror of https://github.com/torvalds/linux.git synced 2026-05-25 23:52:08 +02:00

Author	SHA1	Message	Date
Linus Torvalds	f0e77c598e	bpf-fixes Merge tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf Pull bpf fixes from Alexei Starovoitov: - Fix bpf_throw() and global subprog combination (Kumar Kartikeya Dwivedi) - Fix out of bounds access in BPF interpreter (Yazhou Tang) - Fix potential out of bounds access in inner per-cpu array map (Guannan Wang) - Reject NULL data/sig in bpf_verify_pkcs7_signature (KP Singh) * tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf: libbpf: fix off-by-one in emit_signature_match jump offset bpf: Reject NULL data/sig in bpf_verify_pkcs7_signature selftests/bpf: Cover global subprog exception leaks bpf: Check global subprog exception paths bpf: make bpf_session_is_return() reference optional bpf: Use array_map_meta_equal for percpu array inner map replacement selftests/bpf: Add test for large offset bpf-to-bpf call bpf: Fix s16 truncation for large bpf-to-bpf call offsets bpf: Fix out-of-bounds read in bpf_patch_call_args()	2026-05-24 09:53:17 -07:00
Linus Torvalds	79bd2dded1	sched_ext: Fixes for v7.1-rc4 Merge tag 'sched_ext-for-7.1-rc4-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext Pull sched_ext fixes from Tejun Heo: - Spurious WARN in ops_dequeue() racing with concurrent dispatch - Self-deadlock between scheduler disable and a concurrent sub-sched enable * tag 'sched_ext-for-7.1-rc4-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext: sched_ext: Fix spurious WARN on stale ops_state in ops_dequeue() sched_ext: Fix deadlock between scx_root_disable() and concurrent forks	2026-05-22 16:43:33 -07:00
Linus Torvalds	de37e502a3	cgroup: Fixes for v7.1-rc4 Merge tag 'cgroup-for-7.1-rc4-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup fixes from Tejun Heo: "Two rstat fixes: - Out-of-bounds access in the css_rstat_updated() BPF kfunc when called with an unchecked user-supplied cpu - Over-strict NMI guard after the recent switch to try_cmpxchg left sparc and ppc64 unable to queue rstat updates from NMI" * tag 'cgroup-for-7.1-rc4-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: cgroup: rstat: relax NMI guard after switch to try_cmpxchg cgroup/rstat: validate cpu before css_rstat_cpu() access	2026-05-22 16:28:47 -07:00
Linus Torvalds	1c04dcd891	dma-mapping fixes for Linux 7.1 Merge tag 'dma-mapping-7.1-2026-05-22' of git://git.kernel.org/pub/scm/linux/kernel/git/mszyprowski/linux Pull dma-mapping fixes from Marek Szyprowski: "Two minor updates for the DMA-mapping code, mainly fixing some rare corner cases (Petr Tesarik, Jianpeng Chang)" * tag 'dma-mapping-7.1-2026-05-22' of git://git.kernel.org/pub/scm/linux/kernel/git/mszyprowski/linux: dma-mapping: move dma_map_resource() sanity check into debug code dma-direct: fix use of max_pfn	2026-05-22 06:16:00 -07:00
Linus Torvalds	23884007af	tracing fixes for v7.1: Merge tag 'trace-v7.1-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull tracing fixes from Steven Rostedt: - Avoid NULL return from hist_field_name() The function hist_field_name() is directly passed to a strcat() which does not handle "NULL" characters. Return a zero length string when size is greater than the limit. This is used only to output already created histograms and no field currently is greater than the limit. But it should still not return NULL. - Do not call map->ops->elt_free() on allocation failure When elt_alloc() fails, it should not call the map->ops->elt_free() function if it exists, as that function may not be able to handle the free on allocation failures. The ->elt_free() should only be called when elt_alloc() succeeds. * tag 'trace-v7.1-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: tracing: Do not call map->ops->elt_free() if elt_alloc() fails tracing: Avoid NULL return from hist_field_name() on truncation	2026-05-22 06:09:58 -07:00
Linus Torvalds	7acfa2c5f4	ring-buffer fixes for 7.1: Merge tag 'trace-ringbuffer-v7.1-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull ring-buffer fixes from Steven Rostedt: - Fix reporting MISSED EVENTS in trace iterator When the "trace" file is read with tracing enabled, if the writer were to pass the iterator reader, it resets, sets a "missed_events" flag and continues. The tracing output checks for missed events and if there are some, it prints out "[LOST EVENTS]" to let the user know events were dropped. But the clearing of the missed_events happened when the tracing system queried the ring buffer iterator about missed events. This was premature as the ring buffer is per CPU, and the tracing code reads all the CPU buffers and checks for missed events when it is read. If the CPU iterator that had missed events isn't printed next, the output for the LOST EVENTS is lost. Clear the missed_events flag when the iterator moves to the next event and not when the missed_events flag is queried. Also clear it on reset. - Flush and stop the persistent ring buffer on panic On panic the persistent ring buffer is used to debug what caused the panic. But on some architectures, it requires flushing the memory from cache, otherwise, the ring buffer persistent memory may not have the last events and this could also cause the ring buffer to be corrupted on the next boot. - Fix nr_subbufs initialization in simple_ring_buffer_init_mm The remote simple ring buffer meta data nr_subbufs is initialized too early and gets cleared later on, making it zero and not reflect the actual number of sub-buffers. - Fix unload_page for simple_ring_buffer init rollback On error, the pages loaded need to be unloaded. To unload a page it is expected that: page = load_page(va); -> unload_page(page). But the code was doing: unload_page(va) and not unload_page(page). - Create output file from cmd_check_undefined The check for undefined symbols checks if the file .o.checked exists and if so it skips doing the work. But the .o.checked file never was created making every build do the work even when it was already done previously. * tag 'trace-ringbuffer-v7.1-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: tracing: Create output file from cmd_check_undefined tracing: Fix unload_page for simple_ring_buffer init rollback tracing: Fix nr_subbufs initialization in simple_ring_buffer_init_mm() ring-buffer: Flush and stop persistent ring buffer on panic ring-buffer: Fix reporting of missed events in iterator	2026-05-21 14:05:09 -07:00
Samuele Mariotti	0c1a9dce20	sched_ext: Fix spurious WARN on stale ops_state in ops_dequeue() ops_dequeue() can race with finish_dispatch() and spuriously trigger the "queued task must be in BPF scheduler's custody" warning. ops_dequeue() snapshots p->scx.ops_state via atomic_long_read_acquire() and then, in the SCX_OPSS_QUEUED arm, asserts that SCX_TASK_IN_CUSTODY is set. The two reads are not atomic w.r.t. a concurrent finish_dispatch() running on another CPU: CPU 1 CPU 2 ===== ===== dequeue_task_scx() ops_dequeue() opss = read_acquire(ops_state) = SCX_OPSS_QUEUED finish_dispatch() cmpxchg ops_state: SCX_OPSS_QUEUED -> SCX_OPSS_DISPATCHING [succeeds] dispatch_enqueue(SCX_DSQ_GLOBAL, SCX_ENQ_CLEAR_OPSS) call_task_dequeue() p->scx.flags &= ~SCX_TASK_IN_CUSTODY WARN_ON_ONCE(!(p->scx.flags & SCX_TASK_IN_CUSTODY)) /* opss is stale: QUEUED, * but task already claimed */ set_release(ops_state, SCX_OPSS_NONE) The race has been observed via two distinct call chains: the most common goes through sched_setaffinity(), a rarer variant through sched_change_begin(). For SCX_DSQ_GLOBAL / SCX_DSQ_BYPASS, dispatch_enqueue() clears SCX_TASK_IN_CUSTODY before clearing ops_state to SCX_OPSS_NONE (intentional, to avoid concurrent non-atomic RMW of p->scx.flags against ops_dequeue()). The window between those two writes is exactly what ops_dequeue() observes as "QUEUED without custody". The observed state is not actually inconsistent, it just means CPU 1 has already claimed the task and the QUEUED value held by CPU 2 is stale. Re-read ops_state in that case; the next read is guaranteed to return SCX_OPSS_DISPATCHING or SCX_OPSS_NONE, both of which exit the switch cleanly. The retry is bounded: once IN_CUSTODY is cleared, ops_state has already advanced past QUEUED for this dispatch cycle, and a fresh QUEUED would require re-enqueue under p's rq lock, which CPU 2 holds. 
Changes in v2: - Use READ_ONCE() for p->scx.flags to ensure fresh reads and prevent compiler reordering in the lockless path - Add cpu_relax() to reduce power consumption and improve performance during the spin-wait - Use unlikely() to optimize branch prediction for the common case - Expand the in-code comment to document the race condition and bounded retry guarantee Fixes: `ebf1ccff79` ("sched_ext: Fix ops.dequeue() semantics") Suggested-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Samuele Mariotti <smariotti@disroot.org> Signed-off-by: Paolo Valente <paolo.valente@unimore.it> Signed-off-by: Tejun Heo <tj@kernel.org>	2026-05-21 06:27:44 -10:00
Masami Hiramatsu (Google)	8f0f5c4fb9	tracing: Do not call map->ops->elt_free() if elt_alloc() fails In paths where tracing_map_elt_alloc() failed to allocate objects, the map->ops->elt_alloc() call was never successful. In this case, map->ops->elt_free() should not be called. Link: https://sashiko.dev/#/patchset/20260520223101.34710-1-rosenp%40gmail.com Cc: stable@vger.kernel.org Cc: Tom Zanussi <tom.zanussi@linux.intel.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Rosen Penev <rosenp@gmail.com> Reported-by: Sashiko <sashiko-bot@kernel.org> Fixes: `2734b62952` ("tracing: Add per-element variable support to tracing_map") Link: https://patch.msgid.link/177933895460.108746.5396070821443932634.stgit@devnote2 Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2026-05-21 11:29:03 -04:00
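The rule described in the commit above (skip the ->elt_free() callback when ->elt_alloc() never succeeded) can be illustrated with a small user-space sketch. This is not the kernel's tracing_map code: `struct elt`, `struct elt_ops`, `elt_create()`, `elt_destroy()` and the helper callbacks are hypothetical names chosen for the illustration.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical element with callback-owned private data. */
struct elt {
	void *private_data;
};

/* Hypothetical ops table, loosely modelled on the elt_alloc/elt_free pair. */
struct elt_ops {
	int (*elt_alloc)(struct elt *elt);
	void (*elt_free)(struct elt *elt);
};

static int free_calls; /* counts how often the free callback ran */

static int failing_alloc(struct elt *elt)
{
	(void)elt;
	return -1; /* simulate an allocation failure */
}

static int working_alloc(struct elt *elt)
{
	elt->private_data = malloc(16);
	return elt->private_data ? 0 : -1;
}

static void counting_free(struct elt *elt)
{
	free(elt->private_data);
	elt->private_data = NULL;
	free_calls++;
}

static struct elt *elt_create(const struct elt_ops *ops)
{
	struct elt *elt;

	assert(ops);
	elt = calloc(1, sizeof(*elt));
	if (!elt)
		return NULL;

	if (ops->elt_alloc && ops->elt_alloc(elt) != 0) {
		/* elt_alloc() failed: release only what this function
		 * allocated and do not call ops->elt_free(). */
		free(elt);
		return NULL;
	}
	return elt;
}

static void elt_destroy(const struct elt_ops *ops, struct elt *elt)
{
	if (!elt)
		return;
	if (ops->elt_free)
		ops->elt_free(elt); /* safe: elt_alloc() succeeded earlier */
	free(elt);
}
```

With `failing_alloc` installed, `elt_create()` returns NULL and `free_calls` stays at zero; with `working_alloc`, a later `elt_destroy()` runs the free callback exactly once.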
Thomas Weißschuh	057caace52	tracing: Create output file from cmd_check_undefined As the output file is currently never created, the check will run every time, even if the inputs have not changed. Create an empty output file which allows make to skip the execution when it is not necessary. Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Vincent Donnefort <vdonnefort@google.com> Cc: Marc Zyngier <maz@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Link: https://patch.msgid.link/20260520-tracing-ringbuffer-check-v1-1-d979cfab1338@weissschuh.net Fixes: `1211907ac0` ("tracing: Generate undef symbols allowlist for simple_ring_buffer") Fixes: `58b4bd1839` ("tracing: Adjust cmd_check_undefined to show unexpected undefined symbols") Reviewed-by: Nathan Chancellor <nathan@kernel.org> Tested-by: Nathan Chancellor <nathan@kernel.org> Signed-off-by: Thomas Weißschuh <linux@weissschuh.net> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2026-05-21 08:31:55 -04:00
Vincent Donnefort	a0a2f42a37	tracing: Fix unload_page for simple_ring_buffer init rollback The unload_page callback expects the return value of load_page() as its argument: ret = load_page(va); unload(ret). Fix the rollback code in simple_ring_buffer_init_mm() where the descriptor's VA is used instead of the loaded page address. Link: https://patch.msgid.link/20260512141614.1759430-1-vdonnefort@google.com Fixes: `635923081c` ("tracing: load/unload page callbacks for simple_ring_buffer") Signed-off-by: Vincent Donnefort <vdonnefort@google.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2026-05-21 08:26:22 -04:00
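The contract stated in the commit above (unload takes the value that load returned, not the descriptor's VA) can be sketched in user-space C as follows. The `load_page()` and `unload_page()` bodies here are hypothetical stand-ins, as are `init_pages()`, `NR_DESC` and `FAKE_OFFSET`; only the shape of the rollback loop is the point.

```c
#include <assert.h>

#define NR_DESC 4
#define FAKE_OFFSET 0x1000UL /* hypothetical VA-to-page translation */

static unsigned long loaded[NR_DESC]; /* page addresses currently loaded */
static int nr_loaded;

/* Stand-in for load_page(): returns the loaded page address, 0 on failure. */
static unsigned long load_page(unsigned long va)
{
	unsigned long page;

	if (va == 0 || nr_loaded >= NR_DESC)
		return 0;
	page = va + FAKE_OFFSET;
	loaded[nr_loaded++] = page;
	return page;
}

/* Stand-in for unload_page(): only accepts a value load_page() returned. */
static int unload_page(unsigned long page)
{
	for (int i = 0; i < nr_loaded; i++) {
		if (loaded[i] == page) {
			loaded[i] = loaded[--nr_loaded];
			return 0;
		}
	}
	return -1; /* unknown address, e.g. a raw descriptor VA */
}

/* Load every VA; on failure, roll back using the saved load_page() results. */
static int init_pages(const unsigned long *va, int n)
{
	unsigned long pages[NR_DESC];
	int i;

	if (n > NR_DESC)
		return -1;

	for (i = 0; i < n; i++) {
		pages[i] = load_page(va[i]);
		if (!pages[i])
			goto rollback;
	}
	return 0;

rollback:
	while (--i >= 0) {
		int ret = unload_page(pages[i]); /* not unload_page(va[i]) */

		assert(ret == 0);
		(void)ret;
	}
	return -1;
}
```

In this sketch, passing `va[i]` to `unload_page()` during rollback would return -1 and leave the pages loaded, which mirrors the kind of leak the fix addresses.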
David Carlier	c2d2856cf6	tracing: Fix nr_subbufs initialization in simple_ring_buffer_init_mm() nr_subbufs in the ring buffer metadata is always initialized to zero because it is assigned from cpu_buffer->nr_pages before the page initialization loop has run. While nr_subbufs is not currently read by the kernel, it should reflect the actual buffer geometry in the meta page for correctness. Move the assignment after the page loop so that cpu_buffer->nr_pages holds the final count. Link: https://patch.msgid.link/20260512135420.99194-1-devnexen@gmail.com Fixes: `34e5b958bd` ("tracing: Introduce simple_ring_buffer") Reviewed-by: Vincent Donnefort <vdonnefort@google.com> Assisted-by: Claude:claude-opus-4-7 Signed-off-by: David Carlier <devnexen@gmail.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2026-05-21 08:24:59 -04:00
Masami Hiramatsu (Google)	a494d3c8d5	ring-buffer: Flush and stop persistent ring buffer on panic On real hardware, panic and machine reboot may not flush hardware cache to memory. This means the persistent ring buffer, which relies on a coherent state of memory, may not have its events written to the buffer and they may be lost. Moreover, there may be inconsistency with the counters which are used for validation of the integrity of the persistent ring buffer which may cause all data to be discarded. To avoid this issue, stop recording of the ring buffer on panic and flush the cache of the ring buffer's memory. Fixes: `e645535a95` ("tracing: Add option to use memmapped memory for trace boot instance") Cc: stable@vger.kernel.org Cc: Will Deacon <will@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Ian Rogers <irogers@google.com> Link: https://patch.msgid.link/177751969602.2136606.12031934362587643488.stgit@mhiramat.tok.corp.google.com Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Acked-by: Catalin Marinas <catalin.marinas@arm.com> Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2026-05-21 08:20:58 -04:00
Steven Rostedt	a254b6d13b	ring-buffer: Fix reporting of missed events in iterator When tracing is active while reading the trace file, if the iterator reading the buffer detects that the writer has passed the iterator head, it will reset and set a "missed events" flag. This flag is passed to the output processing to show the user that events were missed: CPU:4 [LOST EVENTS] The problem is that the flag is reset after it is checked in ring_buffer_iter_dropped(). But the "trace" file iterates over all the CPU ring buffers and it will check if they are dropped when figuring out which buffer to print next. This prematurely clears the missed_events flag if the CPU buffer with the missed events is not the one that is printed next. On the iteration where the CPU buffer with the missed events is printed, the check if it had missed events would return false and the output does not show that events were missed. Do not reset the missed_events flag when checking if there were missed events, but instead clear it when moving the iterator head to the next event. Cc: stable@vger.kernel.org Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://patch.msgid.link/20260520220801.4fd09d13@fedora Fixes: `c9b7a4a72f` ("ring-buffer/tracing: Have iterator acknowledge dropped events") Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2026-05-21 08:20:29 -04:00
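The behaviour change in the commit above can be modelled in a few lines of user-space C: querying the flag leaves it set, and only advancing the iterator clears it. `struct cpu_iter`, `iter_dropped()`, `iter_advance()` and `pick_next_cpu()` are hypothetical names for this sketch and do not reproduce the ring buffer API.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-CPU iterator state. */
struct cpu_iter {
	bool missed_events;    /* set when the writer overran this iterator */
	unsigned long next_ts; /* timestamp of the next event to print */
};

/* Query only: checking must not consume the flag. */
static bool iter_dropped(const struct cpu_iter *it)
{
	return it->missed_events;
}

/* Moving to the next event is what clears the flag. */
static void iter_advance(struct cpu_iter *it, unsigned long new_ts)
{
	it->missed_events = false;
	it->next_ts = new_ts;
}

/* Pick the CPU with the oldest pending event, peeking at every iterator
 * on the way, and report whether the chosen one missed events. */
static int pick_next_cpu(const struct cpu_iter *iters, int nr_cpus,
			 bool *dropped)
{
	int best = 0;

	assert(iters && dropped && nr_cpus > 0);
	*dropped = iter_dropped(&iters[0]);
	for (int cpu = 1; cpu < nr_cpus; cpu++) {
		bool d = iter_dropped(&iters[cpu]);

		if (iters[cpu].next_ts < iters[best].next_ts) {
			best = cpu;
			*dropped = d;
		}
	}
	return best;
}
```

Because the scan only reads the flag, a CPU whose buffer lost events but is not printed next still reports the loss when its turn comes.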
David Carlier	576ec047d2	tracing: Avoid NULL return from hist_field_name() on truncation hist_field_name() returns "" everywhere except the fully-qualified VAR_REF/EXPR case, where snprintf() truncation returns NULL early and bypasses the bottom NULL->"" guard. Callers don't expect NULL: strcat(expr, hist_field_name(field, 0)) at trace_events_hist.c:1758 and the strcmp() in the sort-key match loop at :4804 both deref it. system and event_name are bounded by MAX_EVENT_NAME_LEN, but the field name on a VAR_REF is kstrdup'd from a histogram variable name parsed out of the trigger string and has no length cap, so a long enough var name in a fully qualified reference can reach the truncation path. Keep the length check but leave field_name as "" on overflow. Link: https://patch.msgid.link/20260508195747.25492-1-devnexen@gmail.com Fixes: `5ec1d1e97d` ("tracing: Rebuild full_name on each hist_field_name() call") Signed-off-by: David Carlier <devnexen@gmail.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2026-05-20 16:10:56 -04:00
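The approach in the commit above (keep the length check, but hand back an empty string on overflow so callers such as strcat() and strcmp() never see NULL) might look like this in a simplified user-space form. `qualified_name()` and `FULL_NAME_LEN` are hypothetical; the real hist_field_name() handles more cases.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define FULL_NAME_LEN 16 /* hypothetical limit, kept small for the demo */

/* Build "system.event.field"; on truncation return "" rather than NULL. */
static const char *qualified_name(const char *system, const char *event,
				  const char *field)
{
	static char full_name[FULL_NAME_LEN];
	int len;

	assert(system && event && field);
	len = snprintf(full_name, sizeof(full_name), "%s.%s.%s",
		       system, event, field);
	if (len < 0 || (size_t)len >= sizeof(full_name))
		return ""; /* overflow: empty string, never NULL */
	return full_name;
}
```

A caller can then pass the result straight to `strcat()` or `strcmp()` without a NULL check, at the cost of an empty name in the output for over-long references.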
Cunlong Li	22572dbcd3	cgroup: rstat: relax NMI guard after switch to try_cmpxchg Commit `36df6e3dbd` ("cgroup: make css_rstat_updated nmi safe") used this_cpu_cmpxchg() for the lockless insertion, and therefore required both ARCH_HAVE_NMI_SAFE_CMPXCHG and ARCH_HAS_NMI_SAFE_THIS_CPU_OPS in the NMI guard: on archs without the latter, this_cpu_cmpxchg() falls back to "local_irq_save() + plain cmpxchg", and local_irq_save() cannot mask NMIs. Commit `3309b63a22` ("cgroup: rstat: use LOCK CMPXCHG in css_rstat_updated") later replaced this_cpu_cmpxchg() with plain try_cmpxchg() to fix cross-CPU lockless-list corruption, but left the NMI guard untouched. After that switch, css_rstat_updated() no longer performs any this_cpu_*() RMW operations and only relies on the arch having NMI-safe cmpxchg, so ARCH_HAS_NMI_SAFE_THIS_CPU_OPS is no longer required in the guard. Relax the guard accordingly so that archs which have HAVE_NMI and ARCH_HAVE_NMI_SAFE_CMPXCHG but not ARCH_HAS_NMI_SAFE_THIS_CPU_OPS (e.g. sparc, powerpc on PPC64/BOOK3S) can benefit from the existing CONFIG_MEMCG_NMI_SAFETY_REQUIRES_ATOMIC path. Without this, the css is never queued in NMI on those archs, and the atomics staged by account_{slab,kmem}_nmi_safe() are not drained by flush_nmi_stats(). Fixes: `3309b63a22` ("cgroup: rstat: use LOCK CMPXCHG in css_rstat_updated") Signed-off-by: Cunlong Li <shenxiaogll@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2026-05-20 09:44:35 -10:00
Linus Torvalds	df685633c3	RCU fixes for v7.1 Merge tag 'rcu-fixes.v7.1-20260519a' of git://git.kernel.org/pub/scm/linux/kernel/git/rcu/linux Pull RCU fixes from Boqun Feng: "Fix a regression introduced by commit `61bbcfb505` ("srcu: Push srcu_node allocation to GP when non-preemptible"). SRCU may queue works on CPUs that are "possible" but have never been online. In such a case, the work callbacks may not be executed until the corresponding CPU comes online, and as the callbacks accumulate, workqueue lockups will fire. Fix this by avoiding queuing works on CPUs that have never been online" * tag 'rcu-fixes.v7.1-20260519a' of git://git.kernel.org/pub/scm/linux/kernel/git/rcu/linux: srcu: Don't queue workqueue handlers to never-online CPUs	2026-05-20 10:15:30 -05:00
KP Singh	49b18315be	bpf: Reject NULL data/sig in bpf_verify_pkcs7_signature __bpf_dynptr_data() can return NULL (FILE dynptrs, any non-contiguous backing). bpf_verify_pkcs7_signature() forwards the pointer to verify_pkcs7_signature() unchecked, causing a NULL deref in asn1_ber_decoder() reachable from a sleepable BPF LSM at lsm.s/bpf. NULL-check both pointers and reject with -EINVAL. Mirrors the guards already in kernel/bpf/crypto.c. Fixes: `865b0566d8` ("bpf: Add bpf_verify_pkcs7_signature() kfunc") Reported-by: Xianrui Dong <dongxianrui1@gmail.com> Signed-off-by: KP Singh <kpsingh@kernel.org> Reviewed-by: Amery Hung <ameryhung@gmail.com> Acked-by: Song Liu <song@kernel.org> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20260520024059.313468-1-kpsingh@kernel.org Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>	2026-05-20 05:12:05 +02:00
Qing Ming	8817005efb	cgroup/rstat: validate cpu before css_rstat_cpu() access css_rstat_updated() is exposed as a BPF kfunc and accepts a caller-provided cpu argument. The function uses cpu for per-cpu rstat lookups without checking whether it refers to a valid possible CPU. A BPF iter/cgroup program with CAP_BPF and CAP_PERFMON can pass an invalid cpu value. On an unfixed UBSAN_BOUNDS test kernel, cpu == 0x7fffffff triggers: UBSAN: array-index-out-of-bounds in kernel/cgroup/rstat.c:31:9 index 2147483647 is out of range for type 'long unsigned int [64]' Call Trace: css_rstat_updated bpf_iter_run_prog cgroup_iter_seq_show bpf_seq_read Add cpu validation to the BPF-facing css_rstat_updated() kfunc and move the common implementation to __css_rstat_updated() for in-kernel callers. Fixes: `a319185be9` ("cgroup: bpf: enable bpf programs to integrate with rstat") Signed-off-by: Qing Ming <a0yami@mailbox.org> Signed-off-by: Tejun Heo <tj@kernel.org>	2026-05-18 09:31:52 -10:00
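The split described in the commit above (a validating entry point for untrusted callers and an unchecked helper for trusted ones) can be sketched in user-space C. The names `rstat_updated_checked()`, `rstat_updated_unchecked()` and `NR_CPUS_DEMO` are hypothetical, and returning -EINVAL is an assumption made for the illustration; the commit message does not say how the kernel function reports an invalid cpu.

```c
#include <assert.h>
#include <errno.h>

#define NR_CPUS_DEMO 64 /* hypothetical number of possible CPUs */

static unsigned long rstat_pending[NR_CPUS_DEMO];

/* Trusted path: callers guarantee that cpu is in range. */
static void rstat_updated_unchecked(int cpu)
{
	rstat_pending[cpu]++;
}

/* Untrusted path: validate the caller-supplied cpu before indexing. */
static int rstat_updated_checked(int cpu)
{
	if (cpu < 0 || cpu >= NR_CPUS_DEMO)
		return -EINVAL; /* assumption: reject instead of indexing */
	rstat_updated_unchecked(cpu);
	return 0;
}
```

With this split, a value such as 0x7fffffff from the untrusted side is rejected before it is ever used as an array index.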
Paul E. McKenney	593889c401	srcu: Don't queue workqueue handlers to never-online CPUs While an srcu_struct structure is in the midst of switching from CPU-0 to all-CPUs state, it can attempt to invoke callbacks for CPUs that have never been online. Worse yet, it can attempt to invoke callbacks for CPUs that never will be online, even including imaginary CPUs not in cpu_possible_mask. This can cause hangs on s390, which is not set up to deal with workqueue handlers being scheduled on such CPUs. This commit therefore causes Tree SRCU to refrain from queueing workqueue handlers on CPUs that have not yet (and might never) come online. Because callbacks are not invoked on CPUs that have not been online, it is an error to invoke call_srcu(), synchronize_srcu(), or synchronize_srcu_expedited() on a CPU that is not yet fully online. However, it turns out to be less code to redirect the callbacks from too-early invocations of call_srcu() than to warn about such invocations. This commit therefore also redirects callbacks queued on not-yet-fully-online CPUs to the boot CPU. Reported-by: Vasily Gorbik <gor@linux.ibm.com> Fixes: `61bbcfb505` ("srcu: Push srcu_node allocation to GP when non-preemptible") Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Tested-by: Vasily Gorbik <gor@linux.ibm.com> Tested-by: Samir <samir@linux.ibm.com> Reviewed-by: Shrikanth Hegde <sshegde@linux.ibm.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Signed-off-by: Boqun Feng <boqun@kernel.org>	2026-05-18 12:27:18 -07:00
Jianpeng Chang	af0c3f0586	dma-mapping: move dma_map_resource() sanity check into debug code dma_map_resource() uses pfn_valid() to ensure the range is not RAM. However, pfn_valid() only checks for availability of the memory map for a PFN but it does not ensure that the PFN is actually backed by RAM. On ARM64 with SPARSEMEM (128MB section granularity), MMIO addresses that share a section with RAM will falsely trigger the WARN_ON_ONCE and cause dma_map_resource() to return DMA_MAPPING_ERROR. This causes a WARNING on Raspberry Pi 4 during spi_bcm2835 probe because the SPI FIFO register (0xfe204004) falls in the same sparsemem section as the end of RAM (0xf8000000-0xfbffffff), both in section 31 (0xf8000000-0xffffffff). Move the sanity check from dma_map_resource() into debug_dma_map_phys() and replace the unreliable pfn_valid() with pfn_valid() && !PageReserved(), which correctly identifies actual usable RAM without false positives for MMIO regions that happen to have struct pages. Since dma_map_resource() is dma_map_phys(DMA_ATTR_MMIO), the check applies equally to both APIs. Any non-reserved page represents kernel memory to a sufficient degree that using DMA_ATTR_MMIO on it is almost certainly wrong and risks breaking coherency on non-coherent platforms. ZONE_DEVICE pages used for PCI P2P DMA (MEMORY_DEVICE_PCI_P2PDMA) have PageReserved set, so they will not trigger a false positive. The check no longer blocks the mapping and uses err_printk() to integrate with dma-debug filtering. Fixes: `f7326196a7` ("dma-mapping: export new dma_*map_phys() interface") Reviewed-by: Robin Murphy <robin.murphy@arm.com> Signed-off-by: Jianpeng Chang <jianpeng.chang.cn@windriver.com> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com> Link: https://lore.kernel.org/r/20260513072209.1486986-1-jianpeng.chang.cn@windriver.com	2026-05-18 09:04:59 +02:00
Tejun Heo	515e3996a4	sched_ext: Fix deadlock between scx_root_disable() and concurrent forks scx_root_disable() enters SCX_DISABLING before it grabs scx_enable_mutex to clear __scx_switched_all and scx_switching_all. task_should_scx() short-circuits on DISABLING, so forks in that window land on fair while next_active_class() still skips fair - the new tasks stall. This can deadlock the disable path itself: scx_alloc_and_add_sched() runs under scx_enable_mutex and creates a helper kthread; if that new kthread is one of the stalled fair tasks, the mutex holder waits forever and scx_root_disable() can never make progress. Only sub-sched support exposes this, since sub-sched enables are the only path where scx_alloc_and_add_sched() can race the root's disable. Move the DISABLING check after @scx_switching_all. @scx_switching_all serves as a proxy for __scx_switched_all, so while it's set, forks keep going to scx. Once cleared, DISABLING applies normally. v2: Reword in-source comment and description. (Andrea) Fixes: `337ec00b1d` ("sched_ext: Implement cgroup sub-sched enabling and disabling") Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com>	2026-05-17 09:06:38 -10:00
Linus Torvalds	e5d505e366	tracing fixes for 7.1: Merge tag 'trace-v7.1-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull tracing fixes from Steven Rostedt: - Add more functions to the remote allowed list randconfig found more functions that are allowed for the remote code for s390 and arm. Add them to the allowed list. - Fix remote_test error path If one of the simple ring buffers fails to load, the code is supposed to rollback its initialized buffers. Instead of rolling back the buffers for the failed load, it uses the global variable and rolls back all the successfully loaded buffers. * tag 'trace-v7.1-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: tracing: Fix desc in error path for the trace remote test module ring-buffer remote: Avoid unexpected symbol warnings (arm, s390)	2026-05-17 12:02:31 -07:00
Kumar Kartikeya Dwivedi	3d562d35a0	bpf: Check global subprog exception paths Global subprogs are verified independently and are not descended into when their callers are symbolically executed. This means a caller can hold references or locks across a global subprog call that may throw, while the verifier only checks the non-exceptional return path at the call site. Record whether a subprog might throw in the CFG summary pass, alongside the existing might_sleep and packet-data-changing summaries, and propagate that effect through reachable callees. When a global subprog is marked as possibly throwing, push the normal continuation and validate the exceptional path immediately at the call site, avoiding a synthetic exception state and associated special case in the pruning checks. Fixes: `f18b03faba` ("bpf: Implement BPF exceptions") Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20260517075530.3461166-2-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-05-17 11:15:05 -07:00
Linus Torvalds	ec296ebf6d	Miscellaneous IRQ fixes: - Fix use-after-free in irq_work_single() on PREEMPT_RT (Jiayuan Chen) - Don't call add_interrupt_randomness() for NMIs in handle_percpu_devid_irq() (Mark Rutland) - Remove unused function in the ath79-cpu irqchip driver causing LKP CI build warnings (Rosen Penev) - Fix IRQ allocation/teardown leakage regressions in the GICv5 irqchip driver (Sascha Bischoff) - Fix an IRQ trigger type regression in the Meson S4 SoC irqchip driver (Xianwei Zhao) - Fix CPU offlining regression in the RiscV IMSIC irqchip driver (Yong-Xuan Wang) Signed-off-by: Ingo Molnar <mingo@kernel.org> Merge tag 'irq-urgent-2026-05-17' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull IRQ fixes from Ingo Molnar: - Fix use-after-free in irq_work_single() on PREEMPT_RT (Jiayuan Chen) - Don't call add_interrupt_randomness() for NMIs in handle_percpu_devid_irq() (Mark Rutland) - Remove unused function in the ath79-cpu irqchip driver causing LKP CI build warnings (Rosen Penev) - Fix IRQ allocation/teardown leakage regressions in the GICv5 irqchip driver (Sascha Bischoff) - Fix an IRQ trigger type 
regression in the Meson S4 SoC irqchip driver (Xianwei Zhao) - Fix CPU offlining regression in the RiscV IMSIC irqchip driver (Yong-Xuan Wang) * tag 'irq-urgent-2026-05-17' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: irq_work: Fix use-after-free in irq_work_single() on PREEMPT_RT irqchip/riscv-imsic: Clear interrupt move state during CPU offlining irqchip/meson-gpio: Use the correct register in meson_s4_gpio_irq_set_type() irqchip/ath79-cpu: Remove unused function genirq/chip: Don't call add_interrupt_randomness() for NMIs irqchip/gic-v5: Allocate ITS parent LPIs as a range irqchip/gic-v5: Support range allocation for LPIs irqchip/gic-v5: Move LPI allocation into the LPI domain	2026-05-17 10:34:15 -07:00
Vincent Donnefort	55a0005518	tracing: Fix desc in error path for the trace remote test module During initialisation in remote_test_load(), if one of the simple_ring_buffer fails to initialise, the error path attempts to rollback initialised buffers. However, the rollback incorrectly uses the global pointer to the trace descriptor, which is only set upon successful load completion. Fix the error path by using the local pointer to the descriptor. Link: https://patch.msgid.link/20260515201616.337469-1-vdonnefort@google.com Fixes: `ea908a2b79` ("tracing: Add a trace remote module for testing") Reported-by: Sashiko <sashiko-bot@kernel.org> Signed-off-by: Vincent Donnefort <vdonnefort@google.com> base-commit: `5d6919055d` Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2026-05-16 16:11:04 -04:00
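A minimal user-space sketch of the error-path pattern this commit describes: the global descriptor pointer is published only after a fully successful load, so the rollback has to go through the local pointer. All names here (`struct desc`, `init_buf()`, `load()`, `global_desc`) are hypothetical stand-ins, not the identifiers used in the remote test module.

```c
#include <stddef.h>
#include <stdlib.h>

#define NR_BUFS 4

struct desc {
    int initialised[NR_BUFS];   /* 1 once buffer i has been set up */
};

/* Hypothetical global: published only after a fully successful load. */
static struct desc *global_desc;

/* Pretend initialisation: fails for the buffer index given in fail_at. */
static int init_buf(struct desc *d, int i, int fail_at)
{
    if (i == fail_at)
        return -1;
    d->initialised[i] = 1;
    return 0;
}

static void teardown_buf(struct desc *d, int i)
{
    d->initialised[i] = 0;
}

/* Returns 0 on success, -1 on failure; fail_at < 0 means "never fail". */
static int load(int fail_at)
{
    struct desc *d = calloc(1, sizeof(*d));
    int i;

    if (!d)
        return -1;

    for (i = 0; i < NR_BUFS; i++) {
        if (init_buf(d, i, fail_at))
            goto rollback;
    }

    global_desc = d;    /* publish only once everything succeeded */
    return 0;

rollback:
    /* Roll back through the local pointer: global_desc is not set yet. */
    while (--i >= 0)
        teardown_buf(d, i);
    free(d);
    return -1;
}
```

Calling `load(2)` in this sketch fails on the third buffer, unwinds buffers 1 and 0 through `d`, and leaves `global_desc` untouched.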
Arnd Bergmann	96350db80e	ring-buffer remote: Avoid unexpected symbol warnings (arm, s390) The now more verbose check found more architecture-specific symbols missing from the allowlist during randconfig testing on s390 and 32-bit arm: Unexpected symbols in kernel/trace/simple_ring_buffer.o: U __aeabi_unwind_cpp_pr1 Unexpected symbols in kernel/trace/simple_ring_buffer.o: U __s390_indirect_jump_r1 U __s390_indirect_jump_r10 U __s390_indirect_jump_r14 U __s390_indirect_jump_r2 U __s390_indirect_jump_r5 U __s390_indirect_jump_r7 U __s390_indirect_jump_r8 U __s390_indirect_jump_r9 make[6]: *** [/home/arnd/arm-soc/kernel/trace/Makefile:160: kernel/trace/simple_ring_buffer.o.checked] Error 1 Add these to the list and keep it roughly sorted into sanitizer and architecture symbols. Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Marc Zyngier <maz@kernel.org> Cc: Nathan Chancellor <nathan@kernel.org> Cc: Vincent Donnefort <vdonnefort@google.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Link: https://patch.msgid.link/20260515105717.1023007-1-arnd@kernel.org Fixes: `1211907ac0` ("tracing: Generate undef symbols allowlist for simple_ring_buffer") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2026-05-15 14:59:30 -04:00
Arnd Bergmann	a828abbb89	bpf: make bpf_session_is_return() reference optional Building without CONFIG_BPF_EVENTS produces a build-time warning: WARN: resolve_btfids: unresolved symbol bpf_session_is_return The function is actually defined in kernel/trace/bpf_trace.o, which is built conditionally based on configuration. Make the reference to this function conditional as well, as is already done in the bpf verifier for other functions. Fixes: `8fe4dc4f64` ("bpf: change prototype of bpf_session_{cookie,is_return}") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Link: https://lore.kernel.org/r/20260515113242.2706303-1-arnd@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-05-15 09:49:16 -07:00
Linus Torvalds	eb5441518f	audit/stable-7.1 PR 20260513 Merge tag 'audit-pr-20260513' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit Pull audit fixes from Paul Moore: - Correctly log the inheritable capabilities - Honor AUDIT_LOCKED in the AUDIT_TRIM and AUDIT_MAKE_EQUIV commands * tag 'audit-pr-20260513' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit: audit: enforce AUDIT_LOCKED for AUDIT_TRIM and AUDIT_MAKE_EQUIV audit: fix incorrect inheritable capability in CAPSET records	2026-05-14 08:53:24 -07:00
Linus Torvalds	31e62c2ebb	ptrace: slightly saner 'get_dumpable()' logic The 'dumpability' of a task is fundamentally about the memory image of the task - the concept comes from whether it can core dump or not - and makes no sense when you don't have an associated mm. And almost all users do in fact use it only for the case where the task has a mm pointer. But we have one odd special case: ptrace_may_access() uses 'dumpable' to check various other things entirely independently of the MM (typically explicitly using flags like PTRACE_MODE_READ_FSCREDS). Including for threads that no longer have a VM (and maybe never did, like most kernel threads). It's not what this flag was designed for, but it is what it is. The ptrace code does check that the uid/gid matches, so you do have to be uid-0 to see kernel thread details, but this means that the traditional "drop capabilities" model doesn't make any difference for any of this. Make it all make a bit more sense by saying that if you don't have a MM pointer, we'll use a cached "last dumpability" flag if the thread ever had a MM (it will be zero for kernel threads since it is never set), and require a proper CAP_SYS_PTRACE capability to override. Reported-by: Qualys Security Advisory <qsa@qualys.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Kees Cook <kees@kernel.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2026-05-14 08:32:11 -07:00
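The rule in the last paragraph of this message can be sketched roughly as follows. This is a simplified illustration, not the kernel's `ptrace_may_access()`: the struct layout, the field name `last_dumpable`, and both helper functions are assumptions made for the example, and the uid/gid checks mentioned in the message are left out.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, heavily simplified stand-ins for the kernel structures. */
struct mm {
    int dumpable;               /* non-zero: memory image may be dumped */
};

struct task {
    struct mm *mm;              /* NULL for kernel threads and exited tasks */
    int last_dumpable;          /* cached when the task last had an mm;
                                   stays 0 for tasks that never had one */
};

/* Use the live mm when present, otherwise fall back to the cached flag. */
static int task_dumpable(const struct task *t)
{
    return t->mm ? t->mm->dumpable : t->last_dumpable;
}

/* A non-dumpable target can only be accessed with the ptrace capability. */
static bool may_access(const struct task *t, bool has_cap_sys_ptrace)
{
    if (task_dumpable(t))
        return true;
    return has_cap_sys_ptrace;
}
```

In this sketch a kernel thread (no `mm`, cached flag never set) is reachable only by a caller holding the capability, while a thread that once had a dumpable `mm` keeps its last recorded state.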
Guannan Wang	5939801753	bpf: Use array_map_meta_equal for percpu array inner map replacement percpu_array_map_ops.map_meta_equal points to the generic bpf_map_meta_equal(), which does not compare max_entries. When a percpu array serves as an inner map, replacing it with one that has fewer max_entries bypasses the check. Since percpu_array_map_gen_lookup() inlines the original template's index_mask as a JIT immediate, a lookup on the replacement map can access pptrs[] out of bounds. Point percpu_array_map_ops.map_meta_equal to array_map_meta_equal(), which already enforces the max_entries equality check. Add a selftest to verify that replacing a percpu array inner map with a differently-sized one is rejected. Fixes: `db69718b8e` ("bpf: inline bpf_map_lookup_elem() for PERCPU_ARRAY maps") Signed-off-by: Guannan Wang <wgnbuaa@gmail.com> Acked-by: Mykyta Yatsenko <yatsenko@meta.com> Link: https://lore.kernel.org/r/20260514074454.77491-1-wgnbuaa@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-05-14 08:18:50 -07:00
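To illustrate why the missing max_entries comparison matters, here is a small user-space sketch. It assumes a simplified metadata struct and a simplified inlined lookup whose bound and mask are constants taken from the template map; the names `map_meta`, `meta_equal_generic()`, `meta_equal_array()`, and `inlined_slot()` are hypothetical, and the real comparison functions check more fields than shown.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical subset of the metadata compared on inner-map replacement. */
struct map_meta {
    uint32_t key_size;
    uint32_t value_size;
    uint32_t max_entries;
};

/* Generic-style comparison: does not look at max_entries. */
static bool meta_equal_generic(const struct map_meta *a,
                               const struct map_meta *b)
{
    return a->key_size == b->key_size && a->value_size == b->value_size;
}

/* Array-style comparison: additionally requires equal max_entries. */
static bool meta_equal_array(const struct map_meta *a,
                             const struct map_meta *b)
{
    return meta_equal_generic(a, b) && a->max_entries == b->max_entries;
}

/* Mask derived from the template size: next power of two, minus one. */
static uint32_t index_mask(uint32_t max_entries)
{
    uint64_t n = 1;

    while (n < max_entries)
        n <<= 1;
    return (uint32_t)(n - 1);
}

/*
 * Lookup with bound and mask baked in from the template map.
 * Returns the slot that would be dereferenced, or -1 if rejected.
 */
static int64_t inlined_slot(const struct map_meta *tmpl, uint32_t idx)
{
    if (idx >= tmpl->max_entries)
        return -1;
    return (int64_t)(idx & index_mask(tmpl->max_entries));
}
```

With an 8-entry template and a 4-entry replacement, the generic comparison accepts the swap, yet the inlined lookup still lets index 5 through, which lies past the end of the smaller map. The array-style comparison rejects the replacement up front.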
Linus Torvalds	59a62ea458	sched_ext: Fixes for v7.1-rc3 Bulk is hardening of the new sub-scheduler infrastructure. - UAFs and lifecycle bugs on the sub-sched attach/detach paths: parent sub_kset freed under a racing child, list_del_rcu on an uninitialized list head, ops->priv stomped by concurrent attach/detach, and a UAF in the init-failure error path. - Task state-machine reorg closing concurrent enable-vs-dead races: a task exiting during the unlocked init window could trip NULL ops derefs or skip exit_task() cleanup. - A scx_link_sched() self-deadlock on scx_sched_lock. - isolcpus: stop dereferencing the now-RCU-protected HK_TYPE_DOMAIN cpumask without RCU, and stop rejecting BPF schedulers when only cpuset isolated partitions are active. - PREEMPT_RT: disable irq_work runs in hardirq context so dumps show the failing task rather than the irq_work kthread. - Assorted !CONFIG_EXT_SUB_SCHED, randconfig, and selftest build fixes. Merge tag 'sched_ext-for-7.1-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext Pull sched_ext fixes from Tejun Heo: "The bulk of this is hardening of the new sub-scheduler infrastructure. 
- UAFs and lifecycle bugs on the sub-sched attach/detach paths: parent sub_kset freed under a racing child, list_del_rcu on an uninitialized list head, ops->priv stomped by concurrent attach/detach, and a UAF in the init-failure error path - Task state-machine reorg closing concurrent enable-vs-dead races: a task exiting during the unlocked init window could trip NULL ops derefs or skip exit_task() cleanup - A scx_link_sched() self-deadlock on scx_sched_lock - isolcpus: stop dereferencing the now-RCU-protected HK_TYPE_DOMAIN cpumask without RCU, and stop rejecting BPF schedulers when only cpuset isolated partitions are active - PREEMPT_RT: disable irq_work runs in hardirq context so dumps show the failing task rather than the irq_work kthread - Assorted !CONFIG_EXT_SUB_SCHED, randconfig, and selftest build fixes" * tag 'sched_ext-for-7.1-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext: sched_ext: Use HK_TYPE_DOMAIN_BOOT to detect isolcpus= domain isolation sched_ext: Defer sub_kset base put to scx_sched_free_rcu_work sched_ext: INIT_LIST_HEAD() &sch->all in scx_alloc_and_add_sched() sched_ext: Drop NONE early return in scx_disable_and_exit_task() sched_ext: Avoid UAF in scx_root_enable_workfn() init failure path sched_ext: Clear ops->priv on scx_alloc_and_add_sched() error paths sched_ext: Fix ops->priv clobber on concurrent attach/detach selftests/sched_ext: Fix build error in dequeue selftest sched_ext: Handle SCX_TASK_NONE in disable/switched_from paths sched_ext: Close sub-sched init race with post-init DEAD recheck sched_ext: Close root-enable vs sched_ext_dead() race with SCX_TASK_INIT_BEGIN sched_ext: Replace SCX_TASK_OFF_TASKS flag with SCX_TASK_DEAD state sched_ext: Inline scx_init_task() and move RESET_RUNNABLE_AT into scx_set_task_state() sched_ext: Cleanups in preparation for the SCX_TASK_INIT_BEGIN/DEAD work sched_ext: Use IRQ_WORK_INIT_HARD() to initialize sch->disable_irq_work sched_ext: Fix !CONFIG_EXT_SUB_SCHED build 
warnings sched_ext: Drop unused scx_find_sub_sched() stub sched_ext: Move scx_error() out of scx_link_sched()'s lock region	2026-05-13 15:00:40 -07:00
Linus Torvalds	0913b580f8	cgroup: Fixes for v7.1-rc3 - cpuset fixes: - Partition invalidation could return CPUs still in use by sibling partitions, producing overlapping effective_cpus. - cpuset_can_attach() over-reserved DL bandwidth on moves that stayed within the same root domain. - Pending DL migration state leaked into later attaches when a later can_attach() check failed. - Reorder PF_EXITING and __GFP_HARDWALL checks so dying tasks can allocate from any node and exit quickly. - dmem: propagate -ENOMEM instead of spinning forever when the fallback pool allocation also fails. - selftests/cgroup: percpu test error-path leak, bogus numeric comparison of cpuset strings, and a zero-length read() that silently passed OOM-kill tests. Merge tag 'cgroup-for-7.1-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup fixes from Tejun Heo: - cpuset fixes: - Partition invalidation could return CPUs still in use by sibling partitions, producing overlapping effective_cpus - cpuset_can_attach() over-reserved DL bandwidth on moves that stayed within the same root domain - Pending DL migration state leaked into later attaches when a later can_attach() check failed - Reorder PF_EXITING and __GFP_HARDWALL checks so dying tasks can allocate from any node and exit quickly - dmem: propagate -ENOMEM instead of spinning forever when the fallback pool allocation also fails - selftests/cgroup: percpu test error-path leak, bogus numeric comparison of cpuset strings, and a zero-length read() that silently passed OOM-kill tests * tag 'cgroup-for-7.1-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: cgroup/cpuset: Return only actually allocated CPUs during partition invalidation selftests/cgroup: Fix 
error path leaks in test_percpu_basic cgroup/cpuset: Reserve DL bandwidth only for root-domain moves cgroup/cpuset: Reset DL migration state on can_attach() failure selftests/cgroup: Fix string comparison in write_test selftests/cgroup: Fix cg_read_strcmp() empty string comparison cgroup/dmem: Return -ENOMEM on failed pool preallocation cgroup/cpuset: move PF_EXITING check before __GFP_HARDWALL in cpuset_current_node_allowed()	2026-05-13 14:56:31 -07:00
Linus Torvalds	50599e4c68	workqueue: Fixes for v7.1-rc3 - Plug a wq->cpu_pwq leak on the WQ_UNBOUND allocation failure path. - Fix a cancel_delayed_work_sync() livelock against drain_workqueue() caused by the drain/destroy reject path leaving WORK_STRUCT_PENDING set with no owner. Merge tag 'wq-for-7.1-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq Pull workqueue fixes from Tejun Heo: - Plug a wq->cpu_pwq leak on the WQ_UNBOUND allocation failure path - Fix a cancel_delayed_work_sync() livelock against drain_workqueue() caused by the drain/destroy reject path leaving WORK_STRUCT_PENDING set with no owner * tag 'wq-for-7.1-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: workqueue: Fix wq->cpu_pwq leak in alloc_and_link_pwqs() WQ_UNBOUND path workqueue: Release PENDING in __queue_work() drain/destroy reject path	2026-05-13 14:49:13 -07:00
Andrea Righi	6ae315d379	sched_ext: Use HK_TYPE_DOMAIN_BOOT to detect isolcpus= domain isolation scx_enable() refuses to attach a BPF scheduler when isolcpus=domain is in effect by comparing housekeeping_cpumask(HK_TYPE_DOMAIN) against cpu_possible_mask. Since commit `27c3a5967f` ("sched/isolation: Convert housekeeping cpumasks to rcu pointers"), HK_TYPE_DOMAIN's cpumask is RCU protected and dereferencing it requires either RCU read lock, the cpu_hotplug write lock, or the cpuset lock; scx_enable() holds none of these, so booting with isolcpus=domain and attaching any BPF scheduler triggers the following lockdep splat: ============================= WARNING: suspicious RCU usage ----------------------------- kernel/sched/isolation.c:60 suspicious rcu_dereference_check() usage! 1 lock held by scx_flash/281: #0: ffffffff8379fce0 (update_mutex){+.+.}-{4:4}, at: bpf_struct_ops_link_create+0x134/0x1c0 Call Trace: dump_stack_lvl+0x6f/0xb0 lockdep_rcu_suspicious.cold+0x37/0x70 housekeeping_cpumask+0xcd/0xe0 scx_enable.isra.0+0x17/0x120 bpf_scx_reg+0x5e/0x80 bpf_struct_ops_link_create+0x151/0x1c0 __sys_bpf+0x1e4b/0x33c0 __x64_sys_bpf+0x21/0x30 do_syscall_64+0x117/0xf80 entry_SYSCALL_64_after_hwframe+0x77/0x7f In addition, commit `03ff735101` ("cpuset: Update HK_TYPE_DOMAIN cpumask from cpuset") made HK_TYPE_DOMAIN include cpuset isolated partitions as well, which means the current check also rejects BPF schedulers when a cpuset partition is active. That contradicts the original intent of commit `9f391f94a1` ("sched_ext: Disallow loading BPF scheduler if isolcpus= domain isolation is in effect"), which explicitly noted that cpuset partitions are honored through per-task cpumasks and should not be rejected. Switch to housekeeping_enabled(HK_TYPE_DOMAIN_BOOT), which reads only the housekeeping flag bit (no RCU dereference) and reflects exactly the boot-time isolcpus= configuration that the error message refers to. 
Fixes: `27c3a5967f` ("sched/isolation: Convert housekeeping cpumasks to rcu pointers") Cc: stable@vger.kernel.org # v7.0+ Signed-off-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Frederic Weisbecker <frederic@kernel.org>	2026-05-13 10:02:57 -10:00
sunshaojie	345f401666	cgroup/cpuset: Return only actually allocated CPUs during partition invalidation In update_parent_effective_cpumask() with partcmd_invalidate, the CPUs to return to the parent are computed as: adding = cpumask_and(tmp->addmask, xcpus, parent->effective_xcpus); where xcpus = user_xcpus(cs) which returns cs->exclusive_cpus (if set) or cs->cpus_allowed. When exclusive_cpus is not set, user_xcpus(cs) can contain CPUs that were never actually granted to the partition due to sibling exclusion in compute_excpus(). Consequently, the invalidation may return CPUs to the parent that remain in use by sibling partitions, causing overlapping effective_cpus and triggering the WARN_ON_ONCE(1) in generate_sched_domains(). Use cs->effective_xcpus instead, which reflects the CPUs actually granted to this partition. Reproducer (on a 4-CPU machine): cd /sys/fs/cgroup mkdir a1 b1 # a1 becomes partition root with CPUs 0-1 echo "0-1" > a1/cpuset.cpus echo "root" > a1/cpuset.cpus.partition # b1 becomes partition root with CPUs 1-2, but sibling exclusion # reduces its effective_xcpus to CPU 2 only echo "1-2" > b1/cpuset.cpus echo "root" > b1/cpuset.cpus.partition # b1 changes cpus_allowed to 0-1 -> partition invalidation echo "0-1" > b1/cpuset.cpus # Expected: CPUs 2-3 (only CPU 2 returned from b1) # Actual: CPUs 1-3 (CPU 0-1 returned, overlapping with a1) cat cpuset.cpus.effective dmesg will also show a WARNING from generate_sched_domains() reporting overlapping partition root effective_cpus. Fixes: `2a3602030d` ("cgroup/cpuset: Don't invalidate sibling partitions on cpuset.cpus conflict") Cc: stable@vger.kernel.org # v7.0+ Signed-off-by: sunshaojie <sunshaojie@kylinos.cn> Tested-by: Chen Ridong <chenridong@huaweicloud.com> Reviewed-by: Chen Ridong <chenridong@huaweicloud.com> Reviewed-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2026-05-13 08:54:53 -10:00
Linus Torvalds	1f63dd8ca0	liveupdate fixes for v7.1-rc4 A few fixes for kexec handover and liveupdate: * make sure KHO is skipped for crash kernel * fix error reporting in memfd preservation if it fails mid-loop * don't allow preserving memfds whose page count exceeds UINT_MAX * fix documentation of memfd seals preservation to match the code Merge tag 'fixes-2026-05-13' of git://git.kernel.org/pub/scm/linux/kernel/git/liveupdate/linux Pull liveupdate fixes from Mike Rapoport: "A few fixes for kexec handover and liveupdate: - make sure KHO is skipped for crash kernel - fix error reporting in memfd preservation if it fails mid-loop - don't allow preserving memfds whose page count exceeds UINT_MAX - fix documentation of memfd seals preservation to match the code" * tag 'fixes-2026-05-13' of git://git.kernel.org/pub/scm/linux/kernel/git/liveupdate/linux: mm/memfd_luo: document preservation of file seals mm/memfd_luo: reject memfds whose page count exceeds UINT_MAX mm/memfd_luo: report error when restoring a folio fails mid-loop kho: skip KHO for crash kernel	2026-05-13 08:24:50 -07:00
Tejun Heo	cceb874eee	sched_ext: Defer sub_kset base put to scx_sched_free_rcu_work scx_sub_enable_workfn() pins parent->kobj before dropping scx_sched_lock, but that does not pin parent->sub_kset. Concurrent disable can kset_unregister and free sub_kset before scx_alloc_and_add_sched() dereferences it. Split sub_kset teardown: kobject_del() at disable keeps sysfs removal; defer kobject_put() to scx_sched_free_rcu_work so the memory survives. A racing child sees state_in_sysfs=0 with valid memory, sysfs_create_dir() fails, and the existing exit_kind gate in scx_link_sched() turns it away with -ENOENT. Fixes: `411d3ef1a7` ("sched_ext: Unregister sub_kset on scheduler disable") Signed-off-by: Tejun Heo <tj@kernel.org>	2026-05-12 11:28:56 -10:00
Tejun Heo	b273b75b8d	sched_ext: INIT_LIST_HEAD() &sch->all in scx_alloc_and_add_sched() On scx_link_sched() error paths (parent disabled, hash insert failure), &sch->all is never added to scx_sched_all. The cleanup path runs scx_unlink_sched() unconditionally, which calls list_del_rcu(&sch->all) on a list_head that was never initialized, triggering a corruption warning. Initialize &sch->all. Fixes: `54be8de423` ("sched_ext: Factor out scx_link_sched() and scx_unlink_sched()") Signed-off-by: Tejun Heo <tj@kernel.org>	2026-05-12 11:28:56 -10:00
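Why initializing the node makes the unconditional unlink harmless can be seen in a small user-space re-creation of a circular doubly linked list. This is a sketch under assumptions: the helpers below mimic the shape of the kernel's list primitives but are written from scratch for the example, and they omit the RCU and poisoning behaviour of `list_del_rcu()`.

```c
#include <stddef.h>

struct list_head {
    struct list_head *next, *prev;
};

/* An initialized node points at itself, i.e. it is an empty list. */
static void init_list_head(struct list_head *h)
{
    h->next = h;
    h->prev = h;
}

/* Insert entry right after head. */
static void list_add(struct list_head *entry, struct list_head *head)
{
    entry->next = head->next;
    entry->prev = head;
    head->next->prev = entry;
    head->next = entry;
}

/*
 * Unlink entry from whatever list it is on. On a self-pointing node this
 * only rewrites the node's own pointers, so it is a harmless no-op. On a
 * node whose pointers were never set up, it would write through garbage.
 */
static void list_del(struct list_head *entry)
{
    entry->prev->next = entry->next;
    entry->next->prev = entry->prev;
}
```

A cleanup path that always calls `list_del()` is therefore safe as long as every node was passed through `init_list_head()` first, whether or not it was ever added to a list.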
Tejun Heo	39e25a2100	sched_ext: Drop NONE early return in scx_disable_and_exit_task() `d3e73a0808` ("sched_ext: Handle SCX_TASK_NONE in disable/switched_from paths") skipped the trailing scx_set_task_sched(p, NULL) on NONE tasks. After scx_fail_parent() parks a task at NONE/sched=parent and the parent is later freed via queue_rcu_work() during root_disable, the preserved p->scx.sched dangles - print_scx_info() from sched_show_task() reads sch->ops.name from freed memory. Drop the early return. __scx_disable_and_exit_task() already short- circuits on NONE and the SUB_INIT block was cleared by scx_fail_parent()'s earlier call, so clearing p->scx.sched is the only work left - and the one thing the path actually needs. v2: Extend the SUB_INIT block comment to note that the flag is only set on the sub-enable path, so it's always clear on the NONE re-entry (Andrea). Fixes: `d3e73a0808` ("sched_ext: Handle SCX_TASK_NONE in disable/switched_from paths") Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com>	2026-05-12 11:13:58 -10:00
Sergio Correia	f9e1c1324b	audit: enforce AUDIT_LOCKED for AUDIT_TRIM and AUDIT_MAKE_EQUIV AUDIT_ADD_RULE and AUDIT_DEL_RULE correctly check for AUDIT_LOCKED and return -EPERM, but AUDIT_TRIM and AUDIT_MAKE_EQUIV do not. This allows a process with CAP_AUDIT_CONTROL to modify directory tree watches and equivalence mappings even when the audit configuration has been locked, undermining the purpose of the lock. Add AUDIT_LOCKED checks to both commands. Cc: stable@vger.kernel.org Reviewed-by: Ricardo Robaina <rrobaina@redhat.com> Assisted-by: Claude:claude-opus-4-6 Signed-off-by: Sergio Correia <scorreia@redhat.com> Signed-off-by: Paul Moore <paul@paul-moore.com>	2026-05-12 16:10:38 -04:00
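The behaviour the commit asks for can be sketched as a lock check applied to every configuration-changing command. The enum values and function names below are hypothetical, and the sketch centralizes the check in one helper for brevity; the commit message does not say the kernel structures it that way.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical command set standing in for the audit netlink commands. */
enum cmd {
    CMD_GET_STATUS,     /* read-only */
    CMD_ADD_RULE,
    CMD_DEL_RULE,
    CMD_TRIM,           /* modifies directory tree watches */
    CMD_MAKE_EQUIV,     /* modifies equivalence mappings */
};

static bool cmd_modifies_config(enum cmd c)
{
    switch (c) {
    case CMD_ADD_RULE:
    case CMD_DEL_RULE:
    case CMD_TRIM:
    case CMD_MAKE_EQUIV:
        return true;
    default:
        return false;
    }
}

/* Returns 0 if the command may proceed, -EPERM if the config is locked. */
static int check_cmd(enum cmd c, bool locked)
{
    if (locked && cmd_modifies_config(c))
        return -EPERM;
    return 0;
}
```

The bug described above corresponds to `CMD_TRIM` and `CMD_MAKE_EQUIV` being absent from the set of commands that the lock covers.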
Sergio Correia	e4a640475e	audit: fix incorrect inheritable capability in CAPSET records __audit_log_capset() records the effective capability set into the inheritable field due to a copy-paste error. Every CAPSET audit record therefore reports cap_pi (process inheritable) with the value of cap_effective instead of cap_inheritable. This silently corrupts audit data used for compliance and forensic analysis: an attacker who modifies inheritable capabilities to prepare for a privilege-escalating exec would have the change masked in the audit trail. The bug has been present since the original introduction of CAPSET audit records in 2008. Cc: stable@vger.kernel.org Fixes: `e68b75a027` ("When the capset syscall is used it is not possible for audit to record the actual capbilities being added/removed. This patch adds a new record type which emits the target pid and the eff, inh, and perm cap sets.") Reviewed-by: Ricardo Robaina <rrobaina@redhat.com> Assisted-by: Claude:claude-opus-4-6 Signed-off-by: Sergio Correia <scorreia@redhat.com> Signed-off-by: Paul Moore <paul@paul-moore.com>	2026-05-12 16:05:57 -04:00
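The copy-paste error is easy to picture with a stripped-down example. The structs and function names here are hypothetical stand-ins rather than the kernel's audit types; only the shape of the mistake is taken from the message.

```c
#include <stdint.h>

/* Hypothetical capability sets of a process, one bit per capability. */
struct caps {
    uint64_t inheritable;
    uint64_t permitted;
    uint64_t effective;
};

/* Hypothetical audit record holding a snapshot of the three sets. */
struct capset_record {
    uint64_t inheritable;
    uint64_t permitted;
    uint64_t effective;
};

/* The described bug: the effective set lands in the inheritable field. */
static void record_capset_buggy(struct capset_record *r, const struct caps *c)
{
    r->inheritable = c->effective;      /* copy-paste slip */
    r->permitted = c->permitted;
    r->effective = c->effective;
}

/* The fix: each field is copied from its own source. */
static void record_capset_fixed(struct capset_record *r, const struct caps *c)
{
    r->inheritable = c->inheritable;
    r->permitted = c->permitted;
    r->effective = c->effective;
}
```

With the buggy variant, a change that touches only the inheritable set leaves no trace in the record, which is the masking effect the message warns about.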
Linus Torvalds	1d5dcaa3bd	Probes fixes for v7.1-rc3 - kprobes: skip non-symbol addresses in kprobe_add_ksym_blacklist() Since the ftrace adds its NOPs at .kprobes.text section (which stores an array), a wrong entry is added when loading a module which uses "__kprobes" attribute. To solve this, add "notrace" to __kprobes functions. - test_kprobes: clear kprobes between test runs Clear all kprobes in the test program after running a test set, because Kunit test can run several times. - fprobe: Fix unregister_fprobe() to wait for RCU grace period Since the fprobe data structure is removed with hlist_del_rcu(), it should wait for the RCU grace period. If the caller waits for RCU, we can use the async variant (e.g. eBPF) Merge tag 'probes-fixes-v7.1-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull probes fixes from Masami Hiramatsu: - kprobes: skip non-symbol addresses in kprobe_add_ksym_blacklist() Since the ftrace adds its NOPs at .kprobes.text section (which stores an array), a wrong entry is added when loading a module which uses "__kprobes" attribute. To solve this, add "notrace" to __kprobes functions - test_kprobes: clear kprobes between test runs Clear all kprobes in the test program after running a test set, because Kunit test can run several times - fprobe: Fix unregister_fprobe() to wait for RCU grace period Since the fprobe data structure is removed with hlist_del_rcu(), it should wait for the RCU grace period. 
If the caller waits for RCU, we can use the async variant (e.g. eBPF) * tag 'probes-fixes-v7.1-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: fprobe: Fix unregister_fprobe() to wait for RCU grace period test_kprobes: clear kprobes between test runs kprobes: skip non-symbol addresses in kprobe_add_ksym_blacklist()	2026-05-12 10:18:02 -07:00
Tejun Heo	9a415cc537	sched_ext: Avoid UAF in scx_root_enable_workfn() init failure path In scx_root_enable_workfn(), put_task_struct(p) is called before scx_error() dereferences p->comm and p->pid. If the iterator's reference is the last drop, the task is freed synchronously and the deref becomes a UAF. Move put_task_struct() past scx_error(). Reported-by: Sashiko <sashiko-bot@kernel.org> Closes: https://lore.kernel.org/all/20260511214031.AF5E9C2BCB0@smtp.kernel.org/ Fixes: `f0e1a0643a` ("sched_ext: Implement BPF extensible scheduler class") Cc: stable@vger.kernel.org # v6.12+ Signed-off-by: Tejun Heo <tj@kernel.org>	2026-05-11 12:05:48 -10:00
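The ordering fix can be illustrated with a tiny reference-counted object in user space. This is a sketch only: `struct task`, `task_new()`, `task_put()`, and `report_and_put()` are hypothetical names, and a plain integer stands in for the kernel's atomic reference count.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical reference-counted object with the fields the report uses. */
struct task {
    int refcount;
    int pid;
    char comm[16];
};

static struct task *task_new(int pid, const char *comm)
{
    struct task *t = calloc(1, sizeof(*t));

    if (!t)
        return NULL;
    t->refcount = 1;
    t->pid = pid;
    strncpy(t->comm, comm, sizeof(t->comm) - 1);
    return t;
}

/* Drops one reference; frees the object when it was the last one. */
static void task_put(struct task *t)
{
    if (--t->refcount == 0)
        free(t);
}

/*
 * Read everything needed from the task first, then drop the reference.
 * Doing it the other way round would read freed memory whenever this
 * reference happens to be the last one.
 */
static void report_and_put(struct task *t, char *buf, size_t len)
{
    snprintf(buf, len, "init failed for %s[%d]", t->comm, t->pid);
    task_put(t);        /* t must not be touched after this point */
}
```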
Guopeng Zhang	5dd74441cb	cgroup/cpuset: Reserve DL bandwidth only for root-domain moves cpuset_can_attach() currently adds the bandwidth of all migrating SCHED_DEADLINE tasks to sum_migrate_dl_bw. If the source and destination cpuset effective CPU masks do not overlap, the whole sum is then reserved in the destination root domain. set_cpus_allowed_dl(), however, subtracts bandwidth from the source root domain only when the affinity change really moves the task between root domains. A DL task can move between cpusets that are still in the same root domain, so including that task in sum_migrate_dl_bw can reserve destination bandwidth without a matching source-side subtraction. Share the root-domain move test with set_cpus_allowed_dl(). Keep nr_migrate_dl_tasks counting all migrating deadline tasks for cpuset DL task accounting, but add to sum_migrate_dl_bw only for tasks that need a root-domain bandwidth move. Keep using the destination cpuset effective CPU mask and leave the broader can_attach()/attach() transaction model unchanged. Fixes: `2ef269ef1a` ("cgroup/cpuset: Free DL BW in case can_attach() fails") Cc: stable@vger.kernel.org # v6.10+ Signed-off-by: Guopeng Zhang <zhangguopeng@kylinos.cn> Reviewed-by: Waiman Long <longman@redhat.com> Acked-by: Juri Lelli <juri.lelli@redhat.com> Tested-by: Juri Lelli <juri.lelli@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2026-05-11 10:27:14 -10:00
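The accounting split described above can be sketched as follows. The sketch assumes each task carries explicit source and destination root-domain ids, which is a simplification made for the example; the message describes the real test as shared with set_cpus_allowed_dl() and based on CPU masks. The struct and function names are hypothetical.

```c
#include <stddef.h>

/* Hypothetical view of a migrating SCHED_DEADLINE task. */
struct dl_task {
    unsigned long long bw;      /* bandwidth reserved for the task */
    int src_rd;                 /* id of the root domain it leaves */
    int dst_rd;                 /* id of the root domain it would join */
};

struct migrate_acct {
    int nr_tasks;               /* all migrating deadline tasks */
    unsigned long long sum_bw;  /* bandwidth to reserve at the destination */
};

/*
 * Count every migrating task, but add bandwidth only for tasks that
 * really change root domain, so the destination reservation is matched
 * by a subtraction on the source side.
 */
static struct migrate_acct account(const struct dl_task *tasks, size_t n)
{
    struct migrate_acct acct = { 0, 0 };
    size_t i;

    for (i = 0; i < n; i++) {
        acct.nr_tasks++;
        if (tasks[i].src_rd != tasks[i].dst_rd)
            acct.sum_bw += tasks[i].bw;
    }
    return acct;
}
```

For two tasks where only the second crosses root domains, the sketch counts both tasks but reserves only the second task's bandwidth.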
Jann Horn	c1fa0bb633	exit: prevent preemption of oopsing TASK_DEAD task When an already-exiting task oopses, make_task_dead() currently calls do_task_dead() with preemption enabled. That is forbidden: do_task_dead() calls __schedule(), which has a comment saying "WARNING: must be called with preemption disabled!". If an oopsing task is preempted in do_task_dead(), between becoming TASK_DEAD and entering the scheduler explicitly, bad things happen: finish_task_switch() assumes that once the scheduler has switched away from a TASK_DEAD task, the task can never run again and its stack is no longer needed; but that assumption apparently doesn't hold if the dead task was preempted (the SM_PREEMPT case). This means that the scheduler ends up repeatedly dropping references on the dead task's stack, which can lead to use-after-free or double-free of the entire task stack; in other words, two tasks can end up running on the same stack, resulting in various kinds of memory corruption. (This does not just affect "recursively oopsing" tasks; it is enough to oops once during task exit, for example in a file_operations::release handler) Fixes: `7f80a2fd7d` ("exit: Stop poorly open coding do_task_dead in make_task_dead") Cc: stable@kernel.org Signed-off-by: Jann Horn <jannh@google.com> Acked-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2026-05-11 08:55:11 -07:00
Yazhou Tang	58a8f3e250	bpf: Fix s16 truncation for large bpf-to-bpf call offsets Currently, the BPF instruction set allows bpf-to-bpf calls (or internal calls, pseudo calls) to use a 32-bit imm field to represent the relative jump offset. However, when JIT is disabled or falls back to the interpreter, the verifier invokes bpf_patch_call_args() to rewrite the call instruction. In this function, the 32-bit imm is downcast to s16 and stored in the off field. void bpf_patch_call_args(struct bpf_insn *insn, u32 stack_depth) { stack_depth = max_t(u32, stack_depth, 1); insn->off = (s16) insn->imm; insn->imm = interpreters_args[(round_up(stack_depth, 32) / 32) - 1] - __bpf_call_base_args; insn->code = BPF_JMP \| BPF_CALL_ARGS; } If the original imm exceeds the s16 range (i.e., a jump offset greater than 32767 instructions), this downcast silently truncates the offset, resulting in an incorrect call target. Fix this by: 1. In bpf_patch_call_args(), keeping the imm field unchanged and using the off field to store the index of the interpreter function. 2. In ___bpf_prog_run() for the JMP_CALL_ARGS case, retrieving the interpreter function pointer from the interpreters_args array using the off field as the index, and passing the original imm to calculate the last argument of the interpreter function. After these changes, the truncation issue is resolved, and __bpf_call_base_args is also no longer needed and can be removed, which makes the code cleaner. Performance: In ___bpf_prog_run() for the JMP_CALL_ARGS case, changing the retrieval of the interpreter function pointer from pointer addition to direct array indexing improves performance. The possible reason is that the latter has better instruction-level parallelism. See the v5 discussion [1] for more details. 
[1] https://lore.kernel.org/bpf/f120c3c4-6999-414a-b514-518bb64b4758@zju.edu.cn/ To avoid requiring bpftool changes, keep the new imm/off encoding internal and restore the legacy xlated dump layout in bpf_insn_prepare_dump(). For bpf-to-bpf call offsets that do not fit in s16, export off as 0 instead of a truncated and misleading value. Fixes: `1ea47e01ad` ("bpf: add support for bpf_call to interpreter") Fixes: `7105e828c0` ("bpf: allow for correlation of maps and helpers in dump") Suggested-by: Xu Kuohai <xukuohai@huaweicloud.com> Suggested-by: Puranjay Mohan <puranjay@kernel.org> Co-developed-by: Tianci Cao <ziye@zju.edu.cn> Signed-off-by: Tianci Cao <ziye@zju.edu.cn> Co-developed-by: Shenghao Yuan <shenghaoyuan0928@163.com> Signed-off-by: Shenghao Yuan <shenghaoyuan0928@163.com> Signed-off-by: Yazhou Tang <tangyazhou518@outlook.com> Link: https://lore.kernel.org/r/20260506094714.419842-3-tangyazhou@zju.edu.cn Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-05-11 08:27:02 -07:00
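The truncation described in the commit above can be reproduced outside the kernel. The following is a minimal userspace sketch, not kernel code: `truncate_to_s16()` is a hypothetical helper that mimics only the old `(s16) insn->imm` downcast, and it assumes the wrap-around conversion that GCC and Clang implement for out-of-range narrowing casts.

```c
#include <stdint.h>

/*
 * Hypothetical helper (not part of the kernel): mimics the old
 * "insn->off = (s16) insn->imm" assignment in bpf_patch_call_args().
 * Narrowing an out-of-range value is implementation-defined in older
 * C standards; GCC and Clang wrap modulo 2^16, which is assumed here.
 */
static int16_t truncate_to_s16(int32_t imm)
{
	return (int16_t)imm;
}

/* Returns 1 when a 32-bit call offset survives the s16 downcast intact. */
static int offset_fits_s16(int32_t imm)
{
	return imm >= INT16_MIN && imm <= INT16_MAX;
}
```

Under that assumption, an offset of 1000 round-trips unchanged, while an offset of 40000 instructions comes back as -25536, which is the kind of silently wrong call target the commit describes.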
Yazhou Tang	4314a44564	bpf: Fix out-of-bounds read in bpf_patch_call_args() The interpreters_args array only accommodates stack depths up to MAX_BPF_STACK (512 bytes). However, do_misc_fixups() may allow a larger stack depth if JIT is requested. If JIT compilation later fails and falls back to the interpreter, the verifier invokes bpf_patch_call_args() with this oversized stack depth. This causes a load-time out-of-bounds (OOB) read when calculating the interpreter function pointer index. Fix this by changing bpf_patch_call_args() to return an int and explicitly rejecting the JIT fallback (returning -EINVAL) if the stack depth exceeds MAX_BPF_STACK. Fixes: `1ea47e01ad` ("bpf: add support for bpf_call to interpreter") Co-developed-by: Tianci Cao <ziye@zju.edu.cn> Signed-off-by: Tianci Cao <ziye@zju.edu.cn> Co-developed-by: Shenghao Yuan <shenghaoyuan0928@163.com> Signed-off-by: Shenghao Yuan <shenghaoyuan0928@163.com> Signed-off-by: Yazhou Tang <tangyazhou518@outlook.com> Acked-by: Xu Kuohai <xukuohai@huawei.com> Link: https://lore.kernel.org/r/20260506094714.419842-2-tangyazhou@zju.edu.cn Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-05-11 08:27:01 -07:00
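The bounds check described in the commit above can be sketched in userspace as follows. This is an illustration under stated assumptions, not the kernel implementation: `interpreter_index()` and `STACK_STEP` are hypothetical names, the index formula is taken from the `bpf_patch_call_args()` snippet quoted in the previous commit message, and the table size of 16 entries follows from `MAX_BPF_STACK / 32`.

```c
#include <errno.h>
#include <stdint.h>

#define MAX_BPF_STACK 512	/* limit named in the commit message */
#define STACK_STEP 32		/* granularity used by the quoted round_up() */

/*
 * Hypothetical helper (not kernel code): computes the index into a
 * table of MAX_BPF_STACK / STACK_STEP interpreter variants, rejecting
 * stack depths the table cannot cover instead of reading past its end.
 */
static int interpreter_index(uint32_t stack_depth)
{
	if (stack_depth > MAX_BPF_STACK)
		return -EINVAL;
	if (stack_depth == 0)
		stack_depth = 1;	/* mirrors max_t(u32, stack_depth, 1) */
	return (int)((stack_depth + STACK_STEP - 1) / STACK_STEP) - 1;
}
```

With these constants a depth of 512 maps to index 15, the last valid slot, while a depth of 544 would have produced index 16 without the check; the sketch returns `-EINVAL` for it instead.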
Jiayuan Chen	91840be8f7	irq_work: Fix use-after-free in irq_work_single() on PREEMPT_RT On PREEMPT_RT, non-HARD irq_work runs in per-CPU kthreads via run_irq_workd(), so irq_work_sync() uses rcuwait() to wait for BUSY==0. After irq_work_single() clears BUSY via atomic_cmpxchg(), it still dereferences @work for irq_work_is_hard() and rcuwait_wake_up(). An irq_work_sync() caller on another CPU that enters after BUSY is cleared can observe BUSY==0 immediately, return, and free the work before those accesses complete — causing a use-after-free. Fix this by wrapping run_irq_workd() in guard(rcu)() so that the entire irq_work_single() execution is within an RCU read-side critical section. Then add synchronize_rcu() in irq_work_sync() after rcuwait_wait_event() to ensure the caller waits for the RCU grace period before returning, preventing premature frees. Fixes: `810979682c` ("irq_work: Allow irq_work_sync() to sleep if irq_work() no IRQ support.") Suggested-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Suggested-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Link: https://patch.msgid.link/20260330073234.303732-1-jiayuan.chen@linux.dev	2026-05-11 16:28:04 +02:00
Mark Rutland	512718bbc5	genirq/chip: Don't call add_interrupt_randomness() for NMIs Recently handle_percpu_devid_irq() was changed to call add_interrupt_randomness(). This introduced a potential deadlock when handle_percpu_devid_irq() is used to handle an NMI, which can be detected with lockdep, e.g. ================================ WARNING: inconsistent lock state 7.1.0-rc2-pnmi #465 Not tainted -------------------------------- inconsistent {INITIAL USE} -> {IN-NMI} usage. perf/695 [HC1[1]:SC0[0]:HE0:SE1] takes: ffff00837dfd3a18 (&base->lock){-.-.}-{2:2}, at: lock_timer_base+0x6c/0xac {INITIAL USE} state was registered at: _raw_spin_lock_irqsave+0x68/0xb0 lock_timer_base+0x6c/0xac __mod_timer+0x100/0x32c add_timer_global+0x2c/0x40 __queue_delayed_work+0xf0/0x140 queue_delayed_work_on+0x134/0x138 mem_cgroup_css_online+0x30c/0x310 online_css+0x34/0x10c cgroup_init_subsys+0x158/0x1c8 cgroup_init+0x440/0x524 start_kernel+0x888/0x998 other info that might help us debug this: Possible unsafe locking scenario: CPU0 ---- lock(&base->lock); <Interrupt> lock(&base->lock); * DEADLOCK * Call trace: _raw_spin_lock_irqsave+0x68/0xb0 lock_timer_base+0x6c/0xac add_timer_on+0x78/0x16c add_interrupt_randomness+0x124/0x134 handle_percpu_devid_irq+0xd4/0x16c handle_irq_desc+0x40/0x58 generic_handle_domain_nmi+0x28/0x50 __gic_handle_nmi.isra.0+0x4c/0xa0 gic_handle_irq+0x38/0x2bc call_on_irq_stack+0x30/0x48 do_interrupt_handler+0x80/0x98 el1_interrupt+0x90/0xac el1h_64_irq_handler+0x18/0x24 el1h_64_irq+0x80/0x84 [...] During review, Thomas pointed out it wouldn't be safe for handle_percpu_devid_irq() to call add_interrupt_randomness() if it was used to handle NMIs: https://lore.kernel.org/lkml/87bjgik042.ffs@tglx/ ... but evidently people missed that handle_percpu_devid_irq() is used for NMIs. 
While it might seem that NMIs should be handled with a separate handle_percpu_devid_nmi() function, for various structural reasons this was impractical, and handle_percpu_devid_irq() has been expected to be used for NMIs since commits: `21bbbc50f3` ("irqchip/gic-v3: Switch high priority PPIs over to handle_percpu_devid_irq()") `5ff78c8de9` ("genirq: Kill handle_percpu_devid_fasteoi_nmi()") Taking the above into account, avoid the deadlock by not calling add_interrupt_randomness() when handle_percpu_devid_irq() is called in an NMI context. This is consistent with other NMI handling flows, which do not call add_interrupt_randomness(). At the same time, update the kernel-doc comment to make it clear that handle_percpu_devid_irq() can be called in NMI context. The rest of handle_percpu_devid_irq() is currently NMI safe and doesn't need to change. Fixes: `fd7400cfcb` ("genirq/chip: Invoke add_interrupt_randomness() in handle_percpu_devid_irq()") Reported-by: Ada Couprie Diaz <ada.coupriediaz@arm.com> Signed-off-by: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Reviewed-by: Jinjie Ruan <ruanjinjie@huawei.com> Reviewed-by: Marc Zyngier <maz@kernel.org> Link: https://patch.msgid.link/20260507110518.3128248-1-mark.rutland@arm.com	2026-05-11 14:56:04 +02:00
Masami Hiramatsu (Google)	657b594b20	fprobe: Fix unregister_fprobe() to wait for RCU grace period Commit `4346ba1604` ("fprobe: Rewrite fprobe on function-graph tracer") changed fprobe to register struct fprobe to an rcu-hlist, but it forgot to wait for an RCU grace period. Thus there can be a use-after-free if the fprobe is released right after unregistering. This can happen with fprobe events and the sample module code. To fix this issue, add synchronize_rcu() in unregister_fprobe(). Note that BPF is OK because fprobe is used as a part of bpf_kprobe_multi_link. This unregisters its fprobe in bpf_kprobe_multi_link_release() and it is deallocated via bpf_kprobe_multi_link_dealloc(), which is invoked from the bpf_link_defer_dealloc_rcu_gp() RCU callback. For BPF, this also introduces unregister_fprobe_async(), which does NOT wait for an RCU grace period. Link: https://lore.kernel.org/all/177813998919.256460.2809243930741138224.stgit@mhiramat.tok.corp.google.com/ Fixes: `4346ba1604` ("fprobe: Rewrite fprobe on function-graph tracer") Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>	2026-05-11 19:04:46 +09:00

51847 Commits