mirror of
https://github.com/torvalds/linux.git
synced 2026-05-12 16:18:45 +02:00
724d197aae
51045 Commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
724d197aae |
tracing: Remove tracing_alloc_snapshot() when snapshot isn't defined
The function tracing_alloc_snapshot() is only used between trace.c and
trace_snapshot.c. When snapshot isn't configured, it's not used at all.
The stub function was defined as a global with no users and no prototype
causing build issues.
Remove the function when snapshot isn't configured as nothing is calling
it.
Also remove the EXPORT_SYMBOL_GPL() that was associated with it as it's
not used outside of the tracing subsystem which also includes any modules.
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20260328101946.2c4ef4a5@robin
Reported-by: Mark Brown <broonie@kernel.org>
Closes: https://lore.kernel.org/all/acb-IuZ4vDkwwQLW@sirena.co.uk/
Fixes:
|
||
|
|
bade44fe54 |
tracing: Move snapshot code out of trace.c and into trace_snapshot.c
The trace.c file was a dumping ground for most tracing code. Start organizing it better by moving various functions out into their own files. Move all the snapshot code, including the max trace code into its own trace_snapshot.c file. Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://patch.msgid.link/20260324140145.36352d6a@gandalf.local.home Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
|
|
7b1c2d7547 |
kernel: Use trace_call__##name() at guarded tracepoint call sites
Replace trace_foo() with the new trace_call__foo() at sites already guarded by trace_foo_enabled(), avoiding a redundant static_branch_unlikely() re-evaluation inside the tracepoint. trace_call__foo() calls the tracepoint callbacks directly without utilizing the static branch again. Cc: David Vernet <void@manifault.com> Cc: Andrea Righi <arighi@nvidia.com> Cc: Changwoo Min <changwoo@igalia.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: Ben Segall <bsegall@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Valentin Schneider <vschneid@redhat.com> Cc: Thomas Gleixner <tglx@kernel.org> Cc: "Yury Norov [NVIDIA]" <yury.norov@gmail.com> Cc: "Paul E. McKenney" <paulmck@kernel.org> Cc: Rik van Riel <riel@surriel.com> Cc: Roman Kisel <romank@linux.microsoft.com> Cc: Joel Fernandes <joelagnelf@nvidia.com> Cc: "Rafael J. Wysocki" <rafael.j.wysocki@intel.com> Cc: Ulf Hansson <ulf.hansson@linaro.org> Link: https://patch.msgid.link/20260323160052.17528-3-vineeth@bitbyteword.org Suggested-by: Steven Rostedt <rostedt@goodmis.org> Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Vineeth Pillai (Google) <vineeth@bitbyteword.org> Assisted-by: Claude:claude-sonnet-4-6 Acked-by: Tejun Heo <tj@kernel.org> Acked-by: Thomas Gleixner <tglx@kernel.org> Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
|
|
6b0c7c28f7 |
tracing: Pretty-print enum parameters in function arguments
Currently, print_function_args() prints enum parameter values in decimal format, reducing trace log readability. Use BTF information to resolve enum parameters and print their symbolic names (where available). This improves readability by showing meaningful identifiers instead of raw numbers. Before: mod_memcg_lruvec_state(lruvec=0xffff..., idx=5, val=320) After: mod_memcg_lruvec_state(lruvec=0xffff..., idx=5 [NR_SLAB_RECLAIMABLE_B], val=320) Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Masami Hiramatsu <mhiramat@kernel.org> Link: https://patch.msgid.link/20260209071949.4040193-1-dolinux.peng@gmail.com Signed-off-by: Donglin Peng <pengdonglin@xiaomi.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
|
|
f54f08b1b8 |
tracing: Free up file->private_data for use by individual events
The tracing_open_file_tr() function currently copies the trace_event_file pointer from inode->i_private to file->private_data when the file is successfully opened. This duplication is not particularly useful, as all event code should utilize event_file_file() or event_file_data() to retrieve a trace_event_file pointer from a file struct and these access functions read file->f_inode->i_private. Moreover, this setup requires the code for opening hist files to explicitly clear file->private_data before calling single_open(), since this function expects the private_data member to be set to NULL and uses it to store a pointer to a seq_file. Remove the unnecessary setting of file->private_data in tracing_open_file_tr() and simplify the hist code. Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Tom Zanussi <zanussi@kernel.org> Link: https://patch.msgid.link/20260219162737.314231-6-petr.pavlu@suse.com Signed-off-by: Petr Pavlu <petr.pavlu@suse.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
|
|
f0eaed2723 |
tracing: Clean up access to trace_event_file from a file pointer
The tracing code provides two functions event_file_file() and event_file_data() to obtain a trace_event_file pointer from a file struct. The primary method to use is event_file_file(), as it checks for the EVENT_FILE_FL_FREED flag to determine whether the event is being removed. The second function event_file_data() is an optimization for retrieving the same data when the event_mutex is still held. In the past, when removing an event directory in remove_event_file_dir(), the code set i_private to NULL for all event files and readers were expected to check for this state to recognize that the event is being removed. In the case of event_id_read(), the value was read using event_file_data() without acquiring the event_mutex. This required event_file_data() to use READ_ONCE() when retrieving the i_private data. With the introduction of eventfs, i_private is assigned when an eventfs inode is allocated and remains set throughout its lifetime. Remove the now unnecessary READ_ONCE() access to i_private in both event_file_file() and event_file_data(). Inline the access to i_private in remove_event_file_dir(), which allows event_file_data() to handle i_private solely as a trace_event_file pointer. Add a check in event_file_data() to ensure that the event_mutex is held and that file->flags doesn't have the EVENT_FILE_FL_FREED flag set. Finally, move event_file_data() immediately after event_file_code() since the latter provides a comment explaining how both functions should be used together. Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Tom Zanussi <zanussi@kernel.org> Link: https://patch.msgid.link/20260219162737.314231-5-petr.pavlu@suse.com Signed-off-by: Petr Pavlu <petr.pavlu@suse.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
|
|
f55c09dabb |
tracing: Remove unnecessary check for EVENT_FILE_FL_FREED
The event_filter_write() function calls event_file_file() to retrieve a trace_event_file associated with a given file struct. If a non-NULL pointer is returned, the function then checks whether the trace_event_file instance has the EVENT_FILE_FL_FREED flag set. This check is redundant because event_file_file() already performs this validation and returns NULL if the flag is set. The err value is also already initialized to -ENODEV. Remove the unnecessary check for EVENT_FILE_FL_FREED in event_filter_write(). Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Tom Zanussi <zanussi@kernel.org> Link: https://patch.msgid.link/20260219162737.314231-4-petr.pavlu@suse.com Signed-off-by: Petr Pavlu <petr.pavlu@suse.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
|
|
473e470f16 |
tracing: move __printf() attribute on __ftrace_vbprintk()
The sunrpc change to use trace_printk() for debugging caused
a new warning for every instance of dprintk() in some configurations,
when -Wformat-security is enabled:
fs/nfs/getroot.c: In function 'nfs_get_root':
fs/nfs/getroot.c:90:17: error: format not a string literal and no format arguments [-Werror=format-security]
90 | nfs_errorf(fc, "NFS: Couldn't getattr on root");
I've been slowly chipping away at those warnings over time with the
intention of enabling them by default in the future. While I could not
figure out why this only happens for this one instance, I see that the
__trace_bprintk() function is always called with a local variable as
the format string, rather than a literal.
Move the __printf(2,3) annotation on this function from the declaration
to the caller. As this is can only be validated for literals, the
attribute on the declaration causes the warnings every time, but
removing it entirely introduces a new warning on the __ftrace_vbprintk()
definition.
The format strings still get checked because the underlying literal keeps
getting passed into __trace_printk() in the "else" branch, which is not
taken but still evaluated for compile-time warnings.
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Anna Schumaker <anna.schumaker@oracle.com>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: Simon Horman <horms@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Yury Norov <ynorov@nvidia.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Link: https://patch.msgid.link/20260203164545.3174910-1-arnd@kernel.org
Fixes:
|
||
|
|
d5273fd3ca |
bpf-fixes
-----BEGIN PGP SIGNATURE----- iQIzBAABCAAdFiEE+soXsSLHKoYyzcli6rmadz2vbToFAmnAGisACgkQ6rmadz2v bTqjsw/9GfHT/fdnjfA/q27TQH28ZdrZfq90BpI3m5BfTO8/l+Kt+g1HDGpku+C/ iWh66rg9t/P9nMvtdzvPsdT833UbwbY6fPEK3r7ANgf7SBb1DNvaGHBM6XNefvZV j+VcykKUaEo8U1GeG+gI4TyAALSqvvMeBPYpAPZDUYguYLyE+YIl2Pl6tWt+A7yf 9V3JjCSz63t75qqnhY2SIBZv2pqWiMaCI8uPgaF7drhQM5Xc0l/R75CMPGeF9BrT GRtTVJhY+6UyI2Q0ZRSRSVHZ1j2kYHI/eK3Kamxwal5hNh37BYHm3pT5TSHbZTe1 xO7c1AB0vds8kznRkclQfsMdjVwuBQj03ukLVNqnnaaE4Ir7JlXlXYgeG0KJbbfW kQG8UyDD7tMWZkvaA0Z51FC88WJNLJoNAku519alcMtgAf1CrxzG9aUAYEWE4erh E/FKKvFqQ6T0mOFSXlk1NFeMjNXcg5Tu2KKKKOjAWT6goUc4hw80IWydTyxMy32m 8/eLmdTZpAQovc2rS+5LSTigQ3DT082J950sxdQ3yRaLTWBGNC06gkA/WcRq2ZI+ hBdW6GI1XFwkXGw5+F9fN9Bt5FmE42v44i+RrlNZV1R5bVr0Za/ofkWP3dm1/SOg QRSJk30hx9JveR9gD/xWawycYFuwmha/BL0tur2T32M67MneJpo= =Ye1S -----END PGP SIGNATURE----- Merge tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf Pull bpf fixes from Alexei Starovoitov: - Fix how linked registers track zero extension of subregisters (Daniel Borkmann) - Fix unsound scalar fork for OR instructions (Daniel Wade) - Fix exception exit lock check for subprogs (Ihor Solodrai) - Fix undefined behavior in interpreter for SDIV/SMOD instructions (Jenny Guanni Qu) - Release module's BTF when module is unloaded (Kumar Kartikeya Dwivedi) - Fix constant blinding for PROBE_MEM32 instructions (Sachin Kumar) - Reset register ID for END instructions to prevent incorrect value tracking (Yazhou Tang) * tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf: selftests/bpf: Add a test cases for sync_linked_regs regarding zext propagation bpf: Fix sync_linked_regs regarding BPF_ADD_CONST32 zext propagation selftests/bpf: Add tests for maybe_fork_scalars() OR vs AND handling bpf: Fix unsound scalar forking in maybe_fork_scalars() for BPF_OR selftests/bpf: Add tests for sdiv32/smod32 with INT_MIN dividend bpf: Fix undefined behavior in interpreter sdiv/smod for INT_MIN selftests/bpf: Add tests for bpf_throw lock leak from subprogs bpf: Fix exception exit lock checking for subprogs bpf: Release module BTF IDR before module unload selftests/bpf: Fix pkg-config call on static builds bpf: Fix constant blinding for PROBE_MEM32 stores selftests/bpf: Add test for BPF_END register ID reset bpf: Reset register ID for BPF_END value tracking |
||
|
|
ac57fa9faf |
tracing fixes for 7.0:
- Revert "tracing: Remove pid in task_rename tracing output" A change was made to remove the pid field from the task_rename event because it was thought that it was always done for the current task and recording the pid would be redundant. This turned out to be incorrect and there are a few corner case where this is not true and caused some regressions in tooling. - Fix the reading from user space for migration The reading of user space uses a seq lock type of logic where it uses a per-cpu temporary buffer and disables migration, then enables preemption, does the copy from user space, disables preemption, enables migration and checks if there was any schedule switches while preemption was enabled. If there was a context switch, then it is considered that the per-cpu buffer could be corrupted and it tries again. There's a protection check that tests if it takes a hundred tries, it issues a warning and exits out to prevent a live lock. This was triggered because the task was selected by the load balancer to be migrated to another CPU, every time preemption is enabled the migration task would schedule in try to migrate the task but can't because migration is disabled and let it run again. This caused the scheduler to schedule out the task every time it enabled preemption and made the loop never exit (until the 100 iteration test triggered). Fix this by enabling and disabling preemption and keeping migration enabled if the reading from user space needs to be done again. This will let the migration thread migrate the task and the copy from user space will likely pass on the next iteration. - Fix trace_marker copy option freeing The "copy_trace_marker" option allows a tracing instance to get a copy of a write to the trace_marker file of the top level instance. This is managed by a link list protected by RCU. When an instance is removed, a check is made if the option is set, and if so synchronized_rcu() is called. The problem is that an iteration is made to reset all the flags to what they were when the instance was created (to perform clean ups) was done before the check of the copy_trace_marker option and that option was cleared, so the synchronize_rcu() was never called. Move the clearing of all the flags after the check of copy_trace_marker to do synchronize_rcu() so that the option is still set if it was before and the synchronization is performed. - Fix entries setting when validating the persistent ring buffer When validating the persistent ring buffer on boot up, the number of events per sub-buffer is added to the sub-buffer meta page. The validator was updating cpu_buffer->head_page (the first sub-buffer of the per-cpu buffer) and not the "head_page" variable that was iterating the sub-buffers. This was causing the first sub-buffer to be assigned the entries for each sub-buffer and not the sub-buffer that was supposed to be updated. - Use "hash" value to update the direct callers When updating the ftrace direct callers, it assigned a temporary callback to all the callback functions of the ftrace ops and not just the functions represented by the passed in hash. This causes an unnecessary slow down of the functions of the ftrace_ops that is not being modified. Only update the functions that are going to be modified to call the ftrace loop function so that the update can be made on those functions. -----BEGIN PGP SIGNATURE----- iIoEABYKADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCacAMahQccm9zdGVkdEBn b29kbWlzLm9yZwAKCRAp5XQQmuv6qr0sAQCoI4L3iAR5HU1z8dw2GWhOz9fTnzfw 9VPRZAsga9J5xgEA1Y0bvKBM0UPHFAL2POkaILYV1aT00lZ7aIVHPqfdYgA= =OoGW -----END PGP SIGNATURE----- Merge tag 'trace-v7.0-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull tracing fixes from Steven Rostedt: - Revert "tracing: Remove pid in task_rename tracing output" A change was made to remove the pid field from the task_rename event because it was thought that it was always done for the current task and recording the pid would be redundant. This turned out to be incorrect and there are a few corner case where this is not true and caused some regressions in tooling. - Fix the reading from user space for migration The reading of user space uses a seq lock type of logic where it uses a per-cpu temporary buffer and disables migration, then enables preemption, does the copy from user space, disables preemption, enables migration and checks if there was any schedule switches while preemption was enabled. If there was a context switch, then it is considered that the per-cpu buffer could be corrupted and it tries again. There's a protection check that tests if it takes a hundred tries, it issues a warning and exits out to prevent a live lock. This was triggered because the task was selected by the load balancer to be migrated to another CPU, every time preemption is enabled the migration task would schedule in try to migrate the task but can't because migration is disabled and let it run again. This caused the scheduler to schedule out the task every time it enabled preemption and made the loop never exit (until the 100 iteration test triggered). Fix this by enabling and disabling preemption and keeping migration enabled if the reading from user space needs to be done again. This will let the migration thread migrate the task and the copy from user space will likely pass on the next iteration. - Fix trace_marker copy option freeing The "copy_trace_marker" option allows a tracing instance to get a copy of a write to the trace_marker file of the top level instance. This is managed by a link list protected by RCU. When an instance is removed, a check is made if the option is set, and if so synchronized_rcu() is called. The problem is that an iteration is made to reset all the flags to what they were when the instance was created (to perform clean ups) was done before the check of the copy_trace_marker option and that option was cleared, so the synchronize_rcu() was never called. Move the clearing of all the flags after the check of copy_trace_marker to do synchronize_rcu() so that the option is still set if it was before and the synchronization is performed. - Fix entries setting when validating the persistent ring buffer When validating the persistent ring buffer on boot up, the number of events per sub-buffer is added to the sub-buffer meta page. The validator was updating cpu_buffer->head_page (the first sub-buffer of the per-cpu buffer) and not the "head_page" variable that was iterating the sub-buffers. This was causing the first sub-buffer to be assigned the entries for each sub-buffer and not the sub-buffer that was supposed to be updated. - Use "hash" value to update the direct callers When updating the ftrace direct callers, it assigned a temporary callback to all the callback functions of the ftrace ops and not just the functions represented by the passed in hash. This causes an unnecessary slow down of the functions of the ftrace_ops that is not being modified. Only update the functions that are going to be modified to call the ftrace loop function so that the update can be made on those functions. * tag 'trace-v7.0-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: ftrace: Use hash argument for tmp_ops in update_ftrace_direct_mod ring-buffer: Fix to update per-subbuf entries of persistent ring buffer tracing: Fix trace_marker copy link list updates tracing: Fix failure to read user space from system call trace events tracing: Revert "tracing: Remove pid in task_rename tracing output" |
||
|
|
ebfd9b7af2 |
Miscellaneous perf fixes:
- Fix a PMU driver crash on AMD EPYC systems, caused by
a race condition in x86_pmu_enable()
- Fix a possible counter-initialization bug in x86_pmu_enable()
- Fix a counter inheritance bug in inherit_event() and
__perf_event_read()
- Fix an Intel PMU driver branch constraints handling bug
found by UBSAN
- Fix the Intel PMU driver's new Off-Module Response (OMR)
support code for Diamond Rapids / Nova lake, to fix a snoop
information parsing bug
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmm/ptcRHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1i8yBAAsbAbTzyJOxiE29beCiW3r9V4ELtrSlpG
syUtZ4Y1cozK1w6/8S2oWHJkuWa+ToO6bn9rKxNemWIrnsxe5B4iKfeNWRFTz24g
41xTXaxUB9c7vBgv4BDvr/ykkYGQybkn0Bf/U5rufzvIlst9bx7zKVAnIT9Qws37
UTMY96XGYY5HNzGSZbQkpQ4cs8n72U+00OBHMTWtH8NJT+fRmaM312Q8F6wNKgH2
YtaAjwb55BU5+hQUz5YN96xQGYaoj0s8UtOk4a3tS/t0F8mOodDVTxzMdHKToQmD
SbscGvfC+bg5zjoYGFEU+cXoBMkkZlBPqZdLVQAGEy3YZT+JIdmyJhn9FD6HVuVw
OzyG9VuY+TxOFrFQdMs3Xfa0vZ7AO1c90HB08P4T7nWaMioR1iFAF31MSVEMXuzd
ZROHplWNIqDeOzmerXgZq4JWy3Bpaam8fH1B5/qN450oAxaym3lCOoCZidJYgy3g
CVBF/6BO7DlpKiy9lXknRscItwIiRmZ9Xr+sOmOMQRGqQkC6Ykk/Hj2sA1qKPuQ5
ruRqqaL9cznttAoJR2jZ0Ijyu7usgxB/y066nR1zXKdvEdNcntaic+QybHxbQoYx
kyYNoR1dg+AWLb5juT68abkP4trZ8EUa7Q29OX1SzTk+0U7M0fO3/rq9gt71HiHR
WSjRDQPfWrI=
=jp5H
-----END PGP SIGNATURE-----
Merge tag 'perf-urgent-2026-03-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fixes from Ingo Molnar:
- Fix a PMU driver crash on AMD EPYC systems, caused by
a race condition in x86_pmu_enable()
- Fix a possible counter-initialization bug in x86_pmu_enable()
- Fix a counter inheritance bug in inherit_event() and
__perf_event_read()
- Fix an Intel PMU driver branch constraints handling bug
found by UBSAN
- Fix the Intel PMU driver's new Off-Module Response (OMR)
support code for Diamond Rapids / Nova lake, to fix a snoop
information parsing bug
* tag 'perf-urgent-2026-03-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf/x86/intel: Fix OMR snoop information parsing issues
perf/x86/intel: Add missing branch counters constraint apply
perf: Make sure to use pmu_ctx->pmu for groups
x86/perf: Make sure to program the counter value for stopped events on migration
perf/x86: Move event pointer setup earlier in x86_pmu_enable()
|
||
|
|
50b35c9e50 |
ftrace: Use hash argument for tmp_ops in update_ftrace_direct_mod
The modify logic registers temporary ftrace_ops object (tmp_ops) to trigger
the slow path for all direct callers to be able to safely modify attached
addresses.
At the moment we use ops->func_hash for tmp_ops filter, which represents all
the systems attachments. It's faster to use just the passed hash filter, which
contains only the modified sites and is always a subset of the ops->func_hash.
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Andrii Nakryiko <andrii@kernel.org>
Cc: Menglong Dong <menglong8.dong@gmail.com>
Cc: Song Liu <song@kernel.org>
Link: https://patch.msgid.link/20260312123738.129926-1-jolsa@kernel.org
Fixes:
|
||
|
|
f35dbac694 |
ring-buffer: Fix to update per-subbuf entries of persistent ring buffer
Since the validation loop in rb_meta_validate_events() updates the same
cpu_buffer->head_page->entries, the other subbuf entries are not updated.
Fix to use head_page to update the entries field, since it is the cursor
in this loop.
Cc: stable@vger.kernel.org
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Ian Rogers <irogers@google.com>
Fixes:
|
||
|
|
07183aac4a |
tracing: Fix trace_marker copy link list updates
When the "copy_trace_marker" option is enabled for an instance, anything
written into /sys/kernel/tracing/trace_marker is also copied into that
instances buffer. When the option is set, that instance's trace_array
descriptor is added to the marker_copies link list. This list is protected
by RCU, as all iterations uses an RCU protected list traversal.
When the instance is deleted, all the flags that were enabled are cleared.
This also clears the copy_trace_marker flag and removes the trace_array
descriptor from the list.
The issue is after the flags are called, a direct call to
update_marker_trace() is performed to clear the flag. This function
returns true if the state of the flag changed and false otherwise. If it
returns true here, synchronize_rcu() is called to make sure all readers
see that its removed from the list.
But since the flag was already cleared, the state does not change and the
synchronization is never called, leaving a possible UAF bug.
Move the clearing of all flags below the updating of the copy_trace_marker
option which then makes sure the synchronization is performed.
Also use the flag for checking the state in update_marker_trace() instead
of looking at if the list is empty.
Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20260318185512.1b6c7db4@gandalf.local.home
Fixes:
|
||
|
|
edca33a562 |
tracing: Fix failure to read user space from system call trace events
The system call trace events call trace_user_fault_read() to read the user
space part of some system calls. This is done by grabbing a per-cpu
buffer, disabling migration, enabling preemption, calling
copy_from_user(), disabling preemption, enabling migration and checking if
the task was preempted while preemption was enabled. If it was, the buffer
is considered corrupted and it tries again.
There's a safety mechanism that will fail out of this loop if it fails 100
times (with a warning). That warning message was triggered in some
pi_futex stress tests. Enabling the sched_switch trace event and
traceoff_on_warning, showed the problem:
pi_mutex_hammer-1375 [006] d..21 138.981648: sched_switch: prev_comm=pi_mutex_hammer prev_pid=1375 prev_prio=95 prev_state=R+ ==> next_comm=migration/6 next_pid=47 next_prio=0
migration/6-47 [006] d..2. 138.981651: sched_switch: prev_comm=migration/6 prev_pid=47 prev_prio=0 prev_state=S ==> next_comm=pi_mutex_hammer next_pid=1375 next_prio=95
pi_mutex_hammer-1375 [006] d..21 138.981656: sched_switch: prev_comm=pi_mutex_hammer prev_pid=1375 prev_prio=95 prev_state=R+ ==> next_comm=migration/6 next_pid=47 next_prio=0
migration/6-47 [006] d..2. 138.981659: sched_switch: prev_comm=migration/6 prev_pid=47 prev_prio=0 prev_state=S ==> next_comm=pi_mutex_hammer next_pid=1375 next_prio=95
pi_mutex_hammer-1375 [006] d..21 138.981664: sched_switch: prev_comm=pi_mutex_hammer prev_pid=1375 prev_prio=95 prev_state=R+ ==> next_comm=migration/6 next_pid=47 next_prio=0
migration/6-47 [006] d..2. 138.981667: sched_switch: prev_comm=migration/6 prev_pid=47 prev_prio=0 prev_state=S ==> next_comm=pi_mutex_hammer next_pid=1375 next_prio=95
pi_mutex_hammer-1375 [006] d..21 138.981671: sched_switch: prev_comm=pi_mutex_hammer prev_pid=1375 prev_prio=95 prev_state=R+ ==> next_comm=migration/6 next_pid=47 next_prio=0
migration/6-47 [006] d..2. 138.981675: sched_switch: prev_comm=migration/6 prev_pid=47 prev_prio=0 prev_state=S ==> next_comm=pi_mutex_hammer next_pid=1375 next_prio=95
pi_mutex_hammer-1375 [006] d..21 138.981679: sched_switch: prev_comm=pi_mutex_hammer prev_pid=1375 prev_prio=95 prev_state=R+ ==> next_comm=migration/6 next_pid=47 next_prio=0
migration/6-47 [006] d..2. 138.981682: sched_switch: prev_comm=migration/6 prev_pid=47 prev_prio=0 prev_state=S ==> next_comm=pi_mutex_hammer next_pid=1375 next_prio=95
pi_mutex_hammer-1375 [006] d..21 138.981687: sched_switch: prev_comm=pi_mutex_hammer prev_pid=1375 prev_prio=95 prev_state=R+ ==> next_comm=migration/6 next_pid=47 next_prio=0
migration/6-47 [006] d..2. 138.981690: sched_switch: prev_comm=migration/6 prev_pid=47 prev_prio=0 prev_state=S ==> next_comm=pi_mutex_hammer next_pid=1375 next_prio=95
pi_mutex_hammer-1375 [006] d..21 138.981695: sched_switch: prev_comm=pi_mutex_hammer prev_pid=1375 prev_prio=95 prev_state=R+ ==> next_comm=migration/6 next_pid=47 next_prio=0
migration/6-47 [006] d..2. 138.981698: sched_switch: prev_comm=migration/6 prev_pid=47 prev_prio=0 prev_state=S ==> next_comm=pi_mutex_hammer next_pid=1375 next_prio=95
pi_mutex_hammer-1375 [006] d..21 138.981703: sched_switch: prev_comm=pi_mutex_hammer prev_pid=1375 prev_prio=95 prev_state=R+ ==> next_comm=migration/6 next_pid=47 next_prio=0
migration/6-47 [006] d..2. 138.981706: sched_switch: prev_comm=migration/6 prev_pid=47 prev_prio=0 prev_state=S ==> next_comm=pi_mutex_hammer next_pid=1375 next_prio=95
pi_mutex_hammer-1375 [006] d..21 138.981711: sched_switch: prev_comm=pi_mutex_hammer prev_pid=1375 prev_prio=95 prev_state=R+ ==> next_comm=migration/6 next_pid=47 next_prio=0
migration/6-47 [006] d..2. 138.981714: sched_switch: prev_comm=migration/6 prev_pid=47 prev_prio=0 prev_state=S ==> next_comm=pi_mutex_hammer next_pid=1375 next_prio=95
pi_mutex_hammer-1375 [006] d..21 138.981719: sched_switch: prev_comm=pi_mutex_hammer prev_pid=1375 prev_prio=95 prev_state=R+ ==> next_comm=migration/6 next_pid=47 next_prio=0
migration/6-47 [006] d..2. 138.981722: sched_switch: prev_comm=migration/6 prev_pid=47 prev_prio=0 prev_state=S ==> next_comm=pi_mutex_hammer next_pid=1375 next_prio=95
pi_mutex_hammer-1375 [006] d..21 138.981727: sched_switch: prev_comm=pi_mutex_hammer prev_pid=1375 prev_prio=95 prev_state=R+ ==> next_comm=migration/6 next_pid=47 next_prio=0
migration/6-47 [006] d..2. 138.981730: sched_switch: prev_comm=migration/6 prev_pid=47 prev_prio=0 prev_state=S ==> next_comm=pi_mutex_hammer next_pid=1375 next_prio=95
pi_mutex_hammer-1375 [006] d..21 138.981735: sched_switch: prev_comm=pi_mutex_hammer prev_pid=1375 prev_prio=95 prev_state=R+ ==> next_comm=migration/6 next_pid=47 next_prio=0
migration/6-47 [006] d..2. 138.981738: sched_switch: prev_comm=migration/6 prev_pid=47 prev_prio=0 prev_state=S ==> next_comm=pi_mutex_hammer next_pid=1375 next_prio=95
What happened was the task 1375 was flagged to be migrated. When
preemption was enabled, the migration thread woke up to migrate that task,
but failed because migration for that task was disabled. This caused the
loop to fail to exit because the task scheduled out while trying to read
user space.
Every time the task enabled preemption the migration thread would schedule
in, try to migrate the task, fail and let the task continue. But because
the loop would only enable preemption with migration disabled, it would
always fail because each time it enabled preemption to read user space,
the migration thread would try to migrate it.
To solve this, when the loop fails to read user space without being
scheduled out, enabled and disable preemption with migration enabled. This
will allow the migration task to successfully migrate the task and the
next loop should succeed to read user space without being scheduled out.
Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20260316130734.1858a998@gandalf.local.home
Fixes:
|
||
|
|
bc308be380 |
bpf: Fix sync_linked_regs regarding BPF_ADD_CONST32 zext propagation
Jenny reported that in sync_linked_regs() the BPF_ADD_CONST32 flag is
checked on known_reg (the register narrowed by a conditional branch)
instead of reg (the linked target register created by an alu32 operation).
Example case with reg:
1. r6 = bpf_get_prandom_u32()
2. r7 = r6 (linked, same id)
3. w7 += 5 (alu32 -- r7 gets BPF_ADD_CONST32, zero-extended by CPU)
4. if w6 < 0xFFFFFFFC goto safe (narrows r6 to [0xFFFFFFFC, 0xFFFFFFFF])
5. sync_linked_regs() propagates to r7 but does NOT call zext_32_to_64()
6. Verifier thinks r7 is [0x100000001, 0x100000004] instead of [1, 4]
Since known_reg above does not have BPF_ADD_CONST32 set above, zext_32_to_64()
is never called on alu32-derived linked registers. This causes the verifier
to track incorrect 64-bit bounds, while the CPU correctly zero-extends the
32-bit result.
The code checking known_reg->id was correct however (see scalars_alu32_wrap
selftest case), but the real fix needs to handle both directions - zext
propagation should be done when either register has BPF_ADD_CONST32, since
the linked relationship involves a 32-bit operation regardless of which
side has the flag.
Example case with known_reg (exercised also by scalars_alu32_wrap):
1. r1 = r0; w1 += 0x100 (alu32 -- r1 gets BPF_ADD_CONST32)
2. if r1 > 0x80 - known_reg = r1 (has BPF_ADD_CONST32), reg = r0 (doesn't)
Hence, fix it by checking for (reg->id | known_reg->id) & BPF_ADD_CONST32.
Moreover, sync_linked_regs() also has a soundness issue when two linked
registers used different ALU widths: one with BPF_ADD_CONST32 and the
other with BPF_ADD_CONST64. The delta relationship between linked registers
assumes the same arithmetic width though. When one register went through
alu32 (CPU zero-extends the 32-bit result) and the other went through
alu64 (no zero-extension), the propagation produces incorrect bounds.
Example:
r6 = bpf_get_prandom_u32() // fully unknown
if r6 >= 0x100000000 goto out // constrain r6 to [0, U32_MAX]
r7 = r6
w7 += 1 // alu32: r7.id = N | BPF_ADD_CONST32
r8 = r6
r8 += 2 // alu64: r8.id = N | BPF_ADD_CONST64
if r7 < 0xFFFFFFFF goto out // narrows r7 to [0xFFFFFFFF, 0xFFFFFFFF]
At the branch on r7, sync_linked_regs() runs with known_reg=r7
(BPF_ADD_CONST32) and reg=r8 (BPF_ADD_CONST64). The delta path
computes:
r8 = r7 + (delta_r8 - delta_r7) = 0xFFFFFFFF + (2 - 1) = 0x100000000
Then, because known_reg->id has BPF_ADD_CONST32, zext_32_to_64(r8) is
called, truncating r8 to [0, 0]. But r8 used a 64-bit ALU op -- the
CPU does NOT zero-extend it. The actual CPU value of r8 is
0xFFFFFFFE + 2 = 0x100000000, not 0. The verifier now underestimates
r8's 64-bit bounds, which is a soundness violation.
Fix sync_linked_regs() by skipping propagation when the two registers
have mixed ALU widths (one BPF_ADD_CONST32, the other BPF_ADD_CONST64).
Lastly, fix regsafe() used for path pruning: the existing checks used
"& BPF_ADD_CONST" to test for offset linkage, which treated
BPF_ADD_CONST32 and BPF_ADD_CONST64 as equivalent.
Fixes:
|
||
|
|
c845894ebd |
bpf: Fix unsound scalar forking in maybe_fork_scalars() for BPF_OR
maybe_fork_scalars() is called for both BPF_AND and BPF_OR when the
source operand is a constant. When dst has signed range [-1, 0], it
forks the verifier state: the pushed path gets dst = 0, the current
path gets dst = -1.
For BPF_AND this is correct: 0 & K == 0.
For BPF_OR this is wrong: 0 | K == K, not 0.
The pushed path therefore tracks dst as 0 when the runtime value is K,
producing an exploitable verifier/runtime divergence that allows
out-of-bounds map access.
Fix this by passing env->insn_idx (instead of env->insn_idx + 1) to
push_stack(), so the pushed path re-executes the ALU instruction with
dst = 0 and naturally computes the correct result for any opcode.
Fixes:
|
||
|
|
c77b30bd1d |
bpf: Fix undefined behavior in interpreter sdiv/smod for INT_MIN
The BPF interpreter's signed 32-bit division and modulo handlers use
the kernel abs() macro on s32 operands. The abs() macro documentation
(include/linux/math.h) explicitly states the result is undefined when
the input is the type minimum. When DST contains S32_MIN (0x80000000),
abs((s32)DST) triggers undefined behavior and returns S32_MIN unchanged
on arm64/x86. This value is then sign-extended to u64 as
0xFFFFFFFF80000000, causing do_div() to compute the wrong result.
The verifier's abstract interpretation (scalar32_min_max_sdiv) computes
the mathematically correct result for range tracking, creating a
verifier/interpreter mismatch that can be exploited for out-of-bounds
map value access.
Introduce abs_s32() which handles S32_MIN correctly by casting to u32
before negating, avoiding signed overflow entirely. Replace all 8
abs((s32)...) call sites in the interpreter's sdiv32/smod32 handlers.
s32 is the only affected case -- the s64 division/modulo handlers do
not use abs().
Fixes:
|
||
|
|
6c2128505f |
bpf: Fix exception exit lock checking for subprogs
process_bpf_exit_full() passes check_lock = !curframe to
check_resource_leak(), which is false in cases when bpf_throw() is
called from a static subprog. This makes check_resource_leak() to skip
validation of active_rcu_locks, active_preempt_locks, and
active_irq_id on exception exits from subprogs.
At runtime bpf_throw() unwinds the stack via ORC without releasing any
user-acquired locks, which may cause various issues as the result.
Fix by setting check_lock = true for exception exits regardless of
curframe, since exceptions bypass all intermediate frame
cleanup. Update the error message prefix to "bpf_throw" for exception
exits to distinguish them from normal BPF_EXIT.
Fix reject_subprog_with_rcu_read_lock test which was previously
passing for the wrong reason. Test program returned directly from the
subprog call without closing the RCU section, so the error was
triggered by the unclosed RCU lock on normal exit, not by
bpf_throw. Update __msg annotations for affected tests to match the
new "bpf_throw" error prefix.
The spin_lock case is not affected because they are already checked [1]
at the call site in do_check_insn() before bpf_throw can run.
[1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/kernel/bpf/verifier.c?h=v7.0-rc4#n21098
Assisted-by: Claude:claude-opus-4-6
Fixes:
|
||
|
|
e9825d1c79 |
Power management fixes for 7.0-rc5
- Consolidate the handling of two special cases in the idle loop that
occur when only one CPU idle state is present (Rafael Wysocki)
- Fix a race condition related to device removal in the runtime PM
core code that may cause a stale device object pointer to be
dereferenced (Bart Van Assche)
-----BEGIN PGP SIGNATURE-----
iQFGBAABCAAwFiEEcM8Aw/RY0dgsiRUR7l+9nS/U47UFAmm8Ad4SHHJqd0Byand5
c29ja2kubmV0AAoJEO5fvZ0v1OO1yFYIAKaPfbV2uWEbB0l+g7QBk4lEGc+CUxN+
jpBut/hWepoLDzQyqaROtTd8R5dFmKASsLXpNVgDjEDX+HCkFzdmlUlDZar0B8zZ
lZTcQypFPWtKeTD5OuelZK/trT09SdsB5Yeav1O6KXM5bT0ISn/f729gkPGIXXm5
+L8R2SvrEYu1dMZ3ithRq+4/bx1Pqav/g+0aD/tQMBDxjSE0G+d3hPFxvPueDMZX
L8CNC5CcH3vwNrI0NR9EH/z491q1Ddo6XUBW+ja2UGztdBTVmm5fEktDGxVjN8B8
A+ZT0RX1itcV+ngjtNYpXbRlMLwCkTvM8rVLpOANcAmsbNOCYGyM80s=
=aptB
-----END PGP SIGNATURE-----
Merge tag 'pm-7.0-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull power management fixes from Rafael Wysocki:
"These fix an idle loop issue exposed by recent changes and a race
condition related to device removal in the runtime PM core code:
- Consolidate the handling of two special cases in the idle loop that
occur when only one CPU idle state is present (Rafael Wysocki)
- Fix a race condition related to device removal in the runtime PM
core code that may cause a stale device object pointer to be
dereferenced (Bart Van Assche)"
* tag 'pm-7.0-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
PM: runtime: Fix a race condition related to device removal
sched: idle: Consolidate the handling of two special cases
|
||
|
|
146bd2a87a |
bpf: Release module BTF IDR before module unload
Gregory reported in [0] that the global_map_resize test when run in
repeatedly ends up failing during program load. This stems from the fact
that BTF reference has not dropped to zero after the previous run's
module is unloaded, and the older module's BTF is still discoverable and
visible. Later, in libbpf, load_module_btfs() will find the ID for this
stale BTF, open its fd, and then it will be used during program load
where later steps taking module reference using btf_try_get_module()
fail since the underlying module for the BTF is gone.
Logically, once a module is unloaded, it's associated BTF artifacts
should become hidden. The BTF object inside the kernel may still remain
alive as long its reference counts are alive, but it should no longer be
discoverable.
To fix this, let us call btf_free_id() from the MODULE_STATE_GOING case
for the module unload to free the BTF associated IDR entry, and disable
its discovery once module unload returns to user space. If a race
happens during unload, the outcome is non-deterministic anyway. However,
user space should be able to rely on the guarantee that once it has
synchronously established a successful module unload, no more stale
artifacts associated with this module can be obtained subsequently.
Note that we must be careful to not invoke btf_free_id() in btf_put()
when btf_is_module() is true now. There could be a window where the
module unload drops a non-terminal reference, frees the IDR, but the
same ID gets reused and the second unconditional btf_free_id() ends up
releasing an unrelated entry.
To avoid a special case for btf_is_module() case, set btf->id to zero to
make btf_free_id() idempotent, such that we can unconditionally invoke it
from btf_put(), and also from the MODULE_STATE_GOING case. Since zero is
an invalid IDR, the idr_remove() should be a noop.
Note that we can be sure that by the time we reach final btf_put() for
btf_is_module() case, the btf_free_id() is already done, since the
module itself holds the BTF reference, and it will call this function
for the BTF before dropping its own reference.
[0]: https://lore.kernel.org/bpf/cover.1773170190.git.grbell@redhat.com
Fixes:
|
||
|
|
f4c31b07b1 |
sched: idle: Consolidate the handling of two special cases
There are two special cases in the idle loop that are handled
inconsistently even though they are analogous.
The first one is when a cpuidle driver is absent and the default CPU
idle time power management implemented by the architecture code is used.
In that case, the scheduler tick is stopped every time before invoking
default_idle_call().
The second one is when a cpuidle driver is present, but there is only
one idle state in its table. In that case, the scheduler tick is never
stopped at all.
Since each of these approaches has its drawbacks, reconcile them with
the help of one simple heuristic. Namely, stop the tick if the CPU has
been woken up by it in the previous iteration of the idle loop, or let
it tick otherwise.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Christian Loehle <christian.loehle@arm.com>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Qais Yousef <qyousef@layalina.io>
Reviewed-by: Aboorva Devarajan <aboorvad@linux.ibm.com>
Fixes:
|
||
|
|
8a91ebb337 |
6 hotfixes. 4 are cc:stable. 3 are for MM.
All are singletons - please see the changelogs for details. -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCabhW5gAKCRDdBJ7gKXxA jiy+AQDq0x5J94Ezavp0//wqpQ3gwCbo+E8cOE4qb3ig20gBFAEA2hr8qvWYeGCV r+FhX6PAmY6fxL8e/akqR3h7fR7iPAk= =g8d7 -----END PGP SIGNATURE----- Merge tag 'mm-hotfixes-stable-2026-03-16-12-15' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull misc fixes from Andrew Morton: "6 hotfixes. 4 are cc:stable. 3 are for MM. All are singletons - please see the changelogs for details" * tag 'mm-hotfixes-stable-2026-03-16-12-15' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: MAINTAINERS: update email address for Ignat Korchagin mm/huge_memory: fix early failure try_to_migrate() when split huge pmd for shared THP mm/rmap: fix incorrect pte restoration for lazyfree folios mm/huge_memory: fix use of NULL folio in move_pages_huge_pmd() build_bug.h: correct function parameters names in kernel-doc crash_dump: don't log dm-crypt key bytes in read_key_from_user_keying |
||
|
|
d9bf296c39 |
Probes fixes for v7.0-rc3
- kprobes: avoid crash when rmmod/insmod after ftrace killed This fixes a kernel crash caused by kprobes on the symbol in a module which is unloaded after ftrace_kill() is called. - kprobes: Remove unneeded warnings from __arm_kprobe_ftrace() Remove unneeded WARN messages which can be kicked if the kprobe is using ftrace and it fails to enable the ftrace. Since kprobes correctly handle such failure, we don't need to warn it. -----BEGIN PGP SIGNATURE----- iQFPBAABCgA5FiEEh7BulGwFlgAOi5DV2/sHvwUrPxsFAmm2hk8bHG1hc2FtaS5o aXJhbWF0c3VAZ21haWwuY29tAAoJENv7B78FKz8bBZkH+gP3OllhdIU3AUB+vXEb UEE3VE5IZRufSgtjbbJnYI3b8U2dWXw7wmb+fBJ0i0Zf6F+2IUr3hUg1pHNARvlL MMu1YW7PG8gKGUsNc7jpHBVXrnefA4XpzXe7wtxaGvqAV16nL/6xhZlanGgL50Gv +F9cETMIGZ8duF6XVgEMmUUCg88Iwpp9MjzDQOjpRK7z41LND3ccJ2V88ODzDevj idnRbnk12Q9b7xZwGL5P5Ab163kHPFExsXQQnPmy/0fLcyGi8U3hY5EwYhOT85J0 qRZVHpR1yrXqBpaxD/7as5/pfuUluxdzwmDkyRuQGYW6y7xkhlVNEGWqsWZeYLuW IZU= =Zkm3 -----END PGP SIGNATURE----- Merge tag 'probes-fixes-v7.0-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull probes fixes from Masami Hiramatsu: - Avoid crash when rmmod/insmod after ftrace killed This fixes a kernel crash caused by kprobes on the symbol in a module which is unloaded after ftrace_kill() is called. - Remove unneeded warnings from __arm_kprobe_ftrace() Remove unneeded WARN messages which can be triggered if the kprobe is using ftrace and it fails to enable the ftrace. Since kprobes correctly handle such failure, we don't need to warn it. * tag 'probes-fixes-v7.0-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: kprobes: Remove unneeded warnings from __arm_kprobe_ftrace() kprobes: avoid crash when rmmod/insmod after ftrace killed |
||
|
|
164cb546e9 |
Fix function tracer recursion bug by marking jiffies_64_to_clock_t()
notrace. Signed-off-by: Ingo Molnar <mingo@kernel.org> -----BEGIN PGP SIGNATURE----- iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmm2J6MRHG1pbmdvQGtl cm5lbC5vcmcACgkQEnMQ0APhK1hQ9g//Qwq0Esh58yP0UkySArlf7xrOEBrfIF2R W7aLDRevhTCklN5z4BlkCaT7XyfZL19CF2AXW1PZ9fSJKlM/vK0xkFToSiKuG6Pc 2BvfYOvePrmT+8AMnhFe2ekzJclUi9s+ars3H0i59Xu6D+CCOqDvvdGoVkdfuL5+ eYlxVZK2TtrZEDkdXkBD0vPvcrXqYxDnZD3Pkpys9NQBxSyIDwVdCr+uNVv3hCsf L6TAnOzUbI7jWp805izAgdQ4E0P2ywb83nK8op++p4J8PpVaGSnW6LmfDmBsKvRG V+cyqXdMToZjEayBdjDgk3GH6M4K2lFWsg1TVfqE1K05fwEdNDYJY7G7eKvhrLH1 2oNvsEMKLXNVSTgn7gSr63IQBC7+J0eKkmBp1Lv9WWjqYC4QYacrFm4Fc5OCmec0 GI3QQwPvNa+i7GByvLbXe8+8q/BLIWOIcCutlU8GRdYaghA1lOynI8dQmKfSdM+j O1TMIqI56ymffsiBe4d9iUyzCWUdB1LXR0Hm36wm+SJmydc6/Ael1gUlFjqXpGKq Ui4h5rRtNvZgsqmsGGgjMvdiZ9dN/e7tz4tdyp4oCcpKhfDT7qNLdHeoyvDxhDs/ uTvpAK5iH6Wxa91DXt4M+17gf97Ws1dIHU6G9og4B1l5bptyXmrMuVPmK8L/wxDu POJGsxuIStA= =Obpb -----END PGP SIGNATURE----- Merge tag 'timers-urgent-2026-03-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull timer fix from Ingo Molnar: "Fix function tracer recursion bug by marking jiffies_64_to_clock_t() notrace" * tag 'timers-urgent-2026-03-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: time/jiffies: Mark jiffies_64_to_clock_t() notrace |
||
|
|
63724e9519 |
More MM-CID fixes, mostly fixing hangs/races:
- Fix CID hangs due to a race between concurrent forks
- Fix vfork()/CLONE_VM MMCID bug causing hangs
- Remove pointless preemption guard
- Fix CID task list walk performance regression on large systems
by removing the known-flaky and slow counting logic using
for_each_process_thread() in mm_cid_*fixup_tasks_to_cpus(),
and implementing a simple sched_mm_cid::node list instead
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmm2JvIRHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1hTqg/+K7b4LDOi3nVblmoj6q+mQj2i8DFPbi10
zeAWJJnamYWPvUi+Wxq30JjZJ9v+15Ddcmbhea9m/3u1YO6nAL5TbGeQcJ2LU/7p
Ynu9cznv9PfqO4X7WQc3gJC9xx8PbcM00E3JzGxDX/3NDmDBaTOwwuTp41ymcbhm
cGfnUQWGt81sMummVzqehszfIRMZHnWflYDJ2gC66rcGXMNBlEX125F8jybOm66n
Ez6gO7e9EGn28+hZIufySsxaeeK/3NFVKj1UjGP/FMuBwQFAjHPv61nic33nOKXT
yrw7U8DIaYUqFN4d1lplTG72j2YSUj7snn3Q+ubxpzFmOt7RmouVqwlVGEoey5fh
cEe2VYSQFoZKQioWWyms1LP1hTOa2JkNVhdjBfRZ8IM+Wp47OaDiw1h1+zwwMDbJ
xpDAXEuU+sBZiv2SeBLFQgrGj58gb8pdjN4o47X89mx8TKYWtStrCMsD+MF10LBm
dz780Eiinbw5D8JBsxU/ehETpgrAAVmo1KbFx2Q2grAgkJs7jSqBN2KF8NpmH/ZS
Jk8SpQOn4Vp8iO32TbpsV/GErG9EQgixQxnkTukv2Qd9kguhmjwbi/blN3rLBlBb
XbmR9rRAMfAjlPrk84tn9ecXNWO0NV83IYheAwjip36alSbOs+OcxdhrZ78nxh8C
EsKqGl3PeOk=
=ce5G
-----END PGP SIGNATURE-----
Merge tag 'sched-urgent-2026-03-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fixes from Ingo Molnar:
"More MM-CID fixes, mostly fixing hangs/races:
- Fix CID hangs due to a race between concurrent forks
- Fix vfork()/CLONE_VM MMCID bug causing hangs
- Remove pointless preemption guard
- Fix CID task list walk performance regression on large systems
by removing the known-flaky and slow counting logic using
for_each_process_thread() in mm_cid_*fixup_tasks_to_cpus(), and
implementing a simple sched_mm_cid::node list instead"
* tag 'sched-urgent-2026-03-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/mmcid: Avoid full tasklist walks
sched/mmcid: Remove pointless preempt guard
sched/mmcid: Handle vfork()/CLONE_VM correctly
sched/mmcid: Prevent CID stalls due to concurrent forks
|
||
|
|
9abff5748e |
workqueue: Fixes for v7.0-rc3
- Improve workqueue stall diagnostics: dump all busy workers (not just
running ones), show wall-clock duration of in-flight work items, and
add a sample module for reproducing stalls.
- Fix POOL_BH vs WQ_BH flag namespace mismatch in pr_cont_worker_id().
- Rename pool->watchdog_ts to pool->last_progress_ts and related
functions for clarity.
-----BEGIN PGP SIGNATURE-----
iIQEABYKACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCabRyOg4cdGpAa2VybmVs
Lm9yZwAKCRCxYfJx3gVYGaAEAP9xJvKzVtyXBMWxIQNJKqN58VaT/5bNk3tYfm3O
ZOuBrQD+Lsxjvv/pSSKFscZJ/x0dthdYncYI8DF3/G6Lnf+LDAc=
=MT3y
-----END PGP SIGNATURE-----
Merge tag 'wq-for-7.0-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq
Pull workqueue fixes from Tejun Heo:
- Improve workqueue stall diagnostics: dump all busy workers (not just
running ones), show wall-clock duration of in-flight work items, and
add a sample module for reproducing stalls
- Fix POOL_BH vs WQ_BH flag namespace mismatch in pr_cont_worker_id()
- Rename pool->watchdog_ts to pool->last_progress_ts and related
functions for clarity
* tag 'wq-for-7.0-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
workqueue: Rename show_cpu_pool{s,}_hog{s,}() to reflect broadened scope
workqueue: Add stall detector sample module
workqueue: Show all busy workers in stall diagnostics
workqueue: Show in-flight work item duration in stall diagnostics
workqueue: Rename pool->watchdog_ts to pool->last_progress_ts
workqueue: Use POOL_BH instead of WQ_BH when checking pool flags
|
||
|
|
b073bcb8d4 |
cgroup: Fixes for v7.0-rc3
- Hide PF_EXITING tasks from cgroup.procs to avoid exposing dead tasks that haven't been removed yet, fixing a systemd timeout issue on PREEMPT_RT. - Call rebuild_sched_domains() directly in CPU hotplug instead of deferring to a workqueue, fixing a race where online/offline CPUs could briefly appear in stale sched domains. -----BEGIN PGP SIGNATURE----- iIQEABYKACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCabRyLQ4cdGpAa2VybmVs Lm9yZwAKCRCxYfJx3gVYGXk4APwKw2HtGyI3OQAHfDBL+wlblPtf8acz0zpDwGCT 9+TWFQD/Rhmtvkb/X/LTwT5PKJksoHOfkD4MqmVMGStKGdxtBAY= =DJ17 -----END PGP SIGNATURE----- Merge tag 'cgroup-for-7.0-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup fixes from Tejun Heo: - Hide PF_EXITING tasks from cgroup.procs to avoid exposing dead tasks that haven't been removed yet, fixing a systemd timeout issue on PREEMPT_RT - Call rebuild_sched_domains() directly in CPU hotplug instead of deferring to a workqueue, fixing a race where online/offline CPUs could briefly appear in stale sched domains * tag 'cgroup-for-7.0-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: cgroup: Don't expose dead tasks in cgroup cgroup/cpuset: Call rebuild_sched_domains() directly in hotplug |
||
|
|
8369b2e97d |
sched_ext: Fixes for v7.0-rc3
- Fix data races flagged by KCSAN: add missing READ_ONCE()/WRITE_ONCE() annotations for lock-free accesses to module parameters and dsq->seq. - Fix silent truncation of upper 32 enqueue flags (SCX_ENQ_PREEMPT and above) when passed through the int sched_class interface. - Documentation updates: scheduling class precedence, task ownership state machine, example scheduler descriptions, config list cleanup. - Selftest fix for format specifier and buffer length in file_write_long(). -----BEGIN PGP SIGNATURE----- iIQEABYKACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCabRyHg4cdGpAa2VybmVs Lm9yZwAKCRCxYfJx3gVYGZiWAQCmUOHiGAk73p9DDn6Zyrm+o/iQm/iOinchBeUs ZiG0bgEAn15giAnLCA5Zs6cG7PemxBH1v7ctyzTjh1VsBds0rwo= =zXix -----END PGP SIGNATURE----- Merge tag 'sched_ext-for-7.0-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext Pull sched_ext fixes from Tejun Heo: - Fix data races flagged by KCSAN: add missing READ_ONCE()/WRITE_ONCE() annotations for lock-free accesses to module parameters and dsq->seq - Fix silent truncation of upper 32 enqueue flags (SCX_ENQ_PREEMPT and above) when passed through the int sched_class interface - Documentation updates: scheduling class precedence, task ownership state machine, example scheduler descriptions, config list cleanup - Selftest fix for format specifier and buffer length in file_write_long() * tag 'sched_ext-for-7.0-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext: sched_ext: Use WRITE_ONCE() for the write side of scx_enable helper pointer sched_ext: Fix enqueue_task_scx() truncation of upper enqueue flags sched_ext: Documentation: Update sched-ext.rst sched_ext: Use READ_ONCE() for scx_slice_bypass_us in scx_bypass() sched_ext: Documentation: Mention scheduling class precedence sched_ext: Document task ownership state machine sched_ext: Use READ_ONCE() for lock-free reads of module param variables sched_ext/selftests: Fix format specifier and buffer length in file_write_long() sched_ext: Use WRITE_ONCE() for the write side of dsq->seq update |
||
|
|
5ef268cb7a |
kprobes: Remove unneeded warnings from __arm_kprobe_ftrace()
Remove unneeded warnings for handled errors from __arm_kprobe_ftrace()
because all caller handled the error correctly.
Link: https://lore.kernel.org/all/177261531182.1312989.8737778408503961141.stgit@mhiramat.tok.corp.google.com/
Reported-by: Zw Tang <shicenci@gmail.com>
Closes: https://lore.kernel.org/all/CAPHJ_V+J6YDb_wX2nhXU6kh466Dt_nyDSas-1i_Y8s7tqY-Mzw@mail.gmail.com/
Fixes:
|
||
|
|
e113f0b46d |
kprobes: avoid crash when rmmod/insmod after ftrace killed
After we hit ftrace is killed by some errors, the kernel crash if
we remove modules in which kprobe probes.
BUG: unable to handle page fault for address: fffffbfff805000d
PGD 817fcc067 P4D 817fcc067 PUD 817fc8067 PMD 101555067 PTE 0
Oops: Oops: 0000 [#1] SMP KASAN PTI
CPU: 4 UID: 0 PID: 2012 Comm: rmmod Tainted: G W OE
Tainted: [W]=WARN, [O]=OOT_MODULE, [E]=UNSIGNED_MODULE
RIP: 0010:kprobes_module_callback+0x89/0x790
RSP: 0018:ffff88812e157d30 EFLAGS: 00010a02
RAX: 1ffffffff805000d RBX: dffffc0000000000 RCX: ffffffff86a8de90
RDX: ffffed1025c2af9b RSI: 0000000000000008 RDI: ffffffffc0280068
RBP: 0000000000000000 R08: 0000000000000001 R09: ffffed1025c2af9a
R10: ffff88812e157cd7 R11: 205d323130325420 R12: 0000000000000002
R13: ffffffffc0290488 R14: 0000000000000002 R15: ffffffffc0280040
FS: 00007fbc450dd740(0000) GS:ffff888420331000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: fffffbfff805000d CR3: 000000010f624000 CR4: 00000000000006f0
Call Trace:
<TASK>
notifier_call_chain+0xc6/0x280
blocking_notifier_call_chain+0x60/0x90
__do_sys_delete_module.constprop.0+0x32a/0x4e0
do_syscall_64+0x5d/0xfa0
entry_SYSCALL_64_after_hwframe+0x76/0x7e
This is because the kprobe on ftrace does not correctly handles
the kprobe_ftrace_disabled flag set by ftrace_kill().
To prevent this error, check kprobe_ftrace_disabled in
__disarm_kprobe_ftrace() and skip all ftrace related operations.
Link: https://lore.kernel.org/all/176473947565.1727781.13110060700668331950.stgit@mhiramat.tok.corp.google.com/
Reported-by: Ye Bin <yebin10@huawei.com>
Closes: https://lore.kernel.org/all/20251125020536.2484381-1-yebin@huaweicloud.com/
Fixes:
|
||
|
|
4b9ce67196 |
perf: Make sure to use pmu_ctx->pmu for groups
Oliver reported that x86_pmu_del() ended up doing an out-of-bound memory access
when group_sched_in() fails and needs to roll back.
This *should* be handled by the transaction callbacks, but he found that when
the group leader is a software event, the transaction handlers of the wrong PMU
are used. Despite the move_group case in perf_event_open() and group_sched_in()
using pmu_ctx->pmu.
Turns out, inherit uses event->pmu to clone the events, effectively undoing the
move_group case for all inherited contexts. Fix this by also making inherit use
pmu_ctx->pmu, ensuring all inherited counters end up in the same pmu context.
Similarly, __perf_event_read() should use equally use pmu_ctx->pmu for the
group case.
Fixes:
|
||
|
|
192d852129 |
sched/mmcid: Avoid full tasklist walks
Chasing vfork()'ed tasks on a CID ownership mode switch requires a full
task list walk, which is obviously expensive on large systems.
Avoid that by keeping a list of tasks using a mm MMCID entity in mm::mm_cid
and walk this list instead. This removes the proven to be flaky counting
logic and avoids a full task list walk in the case of vfork()'ed tasks.
Fixes:
|
||
|
|
7574ac6e49 |
sched/mmcid: Remove pointless preempt guard
This is a leftover from the early versions of this function where it could
be invoked without mm::mm_cid::lock held.
Remove it and add lockdep asserts instead.
Fixes:
|
||
|
|
28b5a13950 |
sched/mmcid: Handle vfork()/CLONE_VM correctly
Matthieu and Jiri reported stalls where a task endlessly loops in
mm_get_cid() when scheduling in.
It turned out that the logic which handles vfork()'ed tasks is broken. It
is invoked when the number of tasks associated to a process is smaller than
the number of MMCID users. It then walks the task list to find the
vfork()'ed task, but accounts all the already processed tasks as well.
If that double processing brings the number of to be handled tasks to 0,
the walk stops and the vfork()'ed task's CID is not fixed up. As a
consequence a subsequent schedule in fails to acquire a (transitional) CID
and the machine stalls.
Cure this by removing the accounting condition and make the fixup always
walk the full task list if it could not find the exact number of users in
the process' thread list.
Fixes:
|
||
|
|
b2e48c429e |
sched/mmcid: Prevent CID stalls due to concurrent forks
A newly forked task is accounted as MMCID user before the task is visible
in the process' thread list and the global task list. This creates the
following problem:
CPU1 CPU2
fork()
sched_mm_cid_fork(tnew1)
tnew1->mm.mm_cid_users++;
tnew1->mm_cid.cid = getcid()
-> preemption
fork()
sched_mm_cid_fork(tnew2)
tnew2->mm.mm_cid_users++;
// Reaches the per CPU threshold
mm_cid_fixup_tasks_to_cpus()
for_each_other(current, p)
....
As tnew1 is not visible yet, this fails to fix up the already allocated CID
of tnew1. As a consequence a subsequent schedule in might fail to acquire a
(transitional) CID and the machine stalls.
Move the invocation of sched_mm_cid_fork() after the new task becomes
visible in the thread and the task list to prevent this.
This also makes it symmetrical vs. exit() where the task is removed as CID
user before the task is removed from the thread and task lists.
Fixes:
|
||
|
|
755a648e78 |
time/jiffies: Mark jiffies_64_to_clock_t() notrace
The trace_clock_jiffies() function that handles the "uptime" clock for
tracing calls jiffies_64_to_clock_t(). This causes the function tracer to
constantly recurse when the tracing clock is set to "uptime". Mark it
notrace to prevent unnecessary recursion when using the "uptime" clock.
Fixes:
|
||
|
|
36f46b0e36 |
crash_dump: don't log dm-crypt key bytes in read_key_from_user_keying
When debug logging is enabled, read_key_from_user_keying() logs the first
8 bytes of the key payload and partially exposes the dm-crypt key. Stop
logging any key bytes.
Link: https://lkml.kernel.org/r/20260227230008.858641-2-thorsten.blum@linux.dev
Fixes:
|
||
|
|
2321a9596d |
bpf: Fix constant blinding for PROBE_MEM32 stores
BPF_ST | BPF_PROBE_MEM32 immediate stores are not handled by
bpf_jit_blind_insn(), allowing user-controlled 32-bit immediates to
survive unblinded into JIT-compiled native code when bpf_jit_harden >= 1.
The root cause is that convert_ctx_accesses() rewrites BPF_ST|BPF_MEM
to BPF_ST|BPF_PROBE_MEM32 for arena pointer stores during verification,
before bpf_jit_blind_constants() runs during JIT compilation. The
blinding switch only matches BPF_ST|BPF_MEM (mode 0x60), not
BPF_ST|BPF_PROBE_MEM32 (mode 0xa0). The instruction falls through
unblinded.
Add BPF_ST|BPF_PROBE_MEM32 cases to bpf_jit_blind_insn() alongside the
existing BPF_ST|BPF_MEM cases. The blinding transformation is identical:
load the blinded immediate into BPF_REG_AX via mov+xor, then convert
the immediate store to a register store (BPF_STX).
The rewritten STX instruction must preserve the BPF_PROBE_MEM32 mode so
the architecture JIT emits the correct arena addressing (R12-based on
x86-64). Cannot use the BPF_STX_MEM() macro here because it hardcodes
BPF_MEM mode; construct the instruction directly instead.
Fixes:
|
||
|
|
a3125bc018 |
bpf: Reset register ID for BPF_END value tracking
When a register undergoes a BPF_END (byte swap) operation, its scalar
value is mutated in-place. If this register previously shared a scalar ID
with another register (e.g., after an `r1 = r0` assignment), this tie must
be broken.
Currently, the verifier misses resetting `dst_reg->id` to 0 for BPF_END.
Consequently, if a conditional jump checks the swapped register, the
verifier incorrectly propagates the learned bounds to the linked register,
leading to false confidence in the linked register's value and potentially
allowing out-of-bounds memory accesses.
Fix this by explicitly resetting `dst_reg->id` to 0 in the BPF_END case
to break the scalar tie, similar to how BPF_NEG handles it via
`__mark_reg_known`.
Fixes:
|
||
|
|
d557640e4c |
sched: idle: Make skipping governor callbacks more consistent
If the cpuidle governor .select() callback is skipped because there
is only one idle state in the cpuidle driver, the .reflect() callback
should be skipped as well, at least for consistency (if not for
correctness), so do it.
Fixes:
|
||
|
|
2fcfe5951e |
sched_ext: Use WRITE_ONCE() for the write side of scx_enable helper pointer
scx_enable() uses double-checked locking to lazily initialize a static
kthread_worker pointer. The fast path reads helper locklessly:
if (!READ_ONCE(helper)) { // lockless read -- no helper_mutex
The write side initializes helper under helper_mutex, but previously
used a plain assignment:
helper = kthread_run_worker(0, "scx_enable_helper");
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
plain write -- KCSAN data race with READ_ONCE() above
Since READ_ONCE() on the fast path and the plain write on the
initialization path access the same variable without a common lock,
they constitute a data race. KCSAN requires that all sides of a
lock-free access use READ_ONCE()/WRITE_ONCE() consistently.
Use a temporary variable to stage the result of kthread_run_worker(),
and only WRITE_ONCE() into helper after confirming the pointer is
valid. This avoids a window where a concurrent caller on the fast path
could observe an ERR pointer via READ_ONCE(helper) before the error
check completes.
Fixes:
|
||
|
|
6ff1020c2f |
Make clock_adjtime() syscall timex validation slightly more
permissive for auxiliary clocks, to not reject syscalls based on the status field that do not try to modify the status field. This makes ABI behavior in clock_adjtime() consistent with CLOCK_REALTIME. Signed-off-by: Ingo Molnar <mingo@kernel.org> -----BEGIN PGP SIGNATURE----- iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmmsxzkRHG1pbmdvQGtl cm5lbC5vcmcACgkQEnMQ0APhK1hq8g//fRTp9p2pVfmRWUoxWELrT/bMK1r+D6F3 6BYkwp68peRhchVrFxkI/Y37rjAIC8CXZSPuvkubqIROrH3gA7SCCQYCcZKdss+t i3lbpQF8IbagPIS5btpOAN2KRCu2S7aqjDdH0rWb9VhQdlW7fI71Z72Uz07YEA+q TWpy3gE531P/dgAqcvIAyMHnFZDCb1S6z8wZvT3SV4r4GkczfXpTFyNHHtETSu0V 7isuOBfloM4HpDU50oUotlqBiwigH27J2Ad6aIrnCA7iaQPrzREysG+8E96ShhaB g6+qaQS5gTgFryA1bggA6LzGveLOI8bjy2kZ2SnZWuFPj46OReGIuwK4kyY07jz2 xk0sd37alN16ETKhGVLfAgjmzVGoKVNnp4ak9J3VmMbxWEmXeObuOC8SmF9VImc1 4bRaG9+Tlfd4DtOOz2+E4VcPE1D9A2tMw4esgUaXRrrp4GlEcKOJ5PRlWj0uGvrh xLPLbL0XIiWsjMsHdVs4Gq9Z0MvfRHc4VLOviIqLFtHox2DscZypPkyjKAv5inp0 /VWyUYJkkr07RMQQ3nqHnP+lzAfO2aSeZ72D9NnHStL3RPbGC4jYvpoi8dnH0/TT PKJgj2jb7u3h+1cxKBi1RM0JbxUYD5+4N8zfJISa9uqkHZ3XY3VyuuT+2RHO6CQp d1BdX0V4oDA= =zjov -----END PGP SIGNATURE----- Merge tag 'timers-urgent-2026-03-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull timer fix from Ingo Molnar: "Make clock_adjtime() syscall timex validation slightly more permissive for auxiliary clocks, to not reject syscalls based on the status field that do not try to modify the status field. This makes the ABI behavior in clock_adjtime() consistent with CLOCK_REALTIME" * tag 'timers-urgent-2026-03-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: timekeeping: Fix timex status validation for auxiliary clocks |
||
|
|
b1b9a9d0b5 |
Fix a DL scheduler bug that may corrupt internal metrics during
PI and setscheduler() syscalls, resulting in kernel warnings and misbehavior. Found during stress-testing. Signed-off-by: Ingo Molnar <mingo@kernel.org> -----BEGIN PGP SIGNATURE----- iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmmsxW0RHG1pbmdvQGtl cm5lbC5vcmcACgkQEnMQ0APhK1jzXRAAjqcTwaC72cd+6cnh+tE9/fcjXf1JtK5e TxdTygsgBAbXh63rD4y4cRPueqBR1ne52TAV0lI8Z1pBM/XthnaF4MJBue6B8EdX SQIE7hpOh6R81I6hnuhNoNsAy95jQvYXN5SFaKMuNacWNVX8k3vPzN5XPxa7yHLN MVUL+O9c7Xwg4v30Nz/QIv0mFoPosbh4PIdeVpD/ghJAXtXhsCg7EYOivEk9UsSy TAcq3qRnfDyroIOc5/dnSglEwX12LQqVFBba97nI/TCjaH23PsUIt2Dg2rpJbJ+k bLh4hGpOoyQvgE/PSEdoMl1F9pXw3XiUOzAGrFJdqn0iKL+7WzuTEQH+vAToGZQv 4hF5BtMjLrAYY/MVsD8qJGm/pne5nTIo2gSsG7LZPwCmMj0rDUGXfO4G8N8LHhT7 ExQ/t2+z0BczsKdvF3VKX+RweT51AOYOWcmLIdA9h1jdAy858GVmTzSWDveAEJ0L yToPQ0UMCz985g9il6Rdb5cIphD7DjuUeFNnYTCm63cVpZdA4j8Da74r4KfP2jNY tRcbiUy+A7MwqW5aERgwBtI6XCz6QZqW3svJW9yYghf40lgNGAcDCTTdf2r7g0Ho Q0pQVxEk9mXD5N1otjzSS4piLbzoMaPH1L4W6ceHN1RzBjfSJED3tmfGUHZUDqNE w33GhhQAFpA= =vP5l -----END PGP SIGNATURE----- Merge tag 'sched-urgent-2026-03-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fix from Ingo Molnar: "Fix a DL scheduler bug that may corrupt internal metrics during PI and setscheduler() syscalls, resulting in kernel warnings and misbehavior. Found during stress-testing" * tag 'sched-urgent-2026-03-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/deadline: Fix missing ENQUEUE_REPLENISH during PI de-boosting |
||
|
|
8b7f4cd3ac |
bpf-fixes
-----BEGIN PGP SIGNATURE----- iQIzBAABCAAdFiEE+soXsSLHKoYyzcli6rmadz2vbToFAmmsahoACgkQ6rmadz2v bToD4Q/9Hr2lAELsTRZIENBTYYTfk48Rzrr+rJbZWLfX4nOu9RwhVboTw0gvJr8r QEteYO8+gP5hrIuifcC+vdzW12CpgziBK7/Qj2NB7MGSX0ZCLK0ShQK6TAbNE8uZ xA4qMcJp0BWJpgfU51z4Ur5nMzG0Th2XlY+SGwHYZ/53qRMmP7Tr8R9wPltKuEll J+tS6BE6bFKMQe1CuIYlYfYj9kO3j4yx9S48j1e8x9LXc2/2arfsOIZQ6L11rnxZ YTa9jbcRL0YdLy7d4hMLyPQquF9hiHDrJLJih+YNmbyOCPGD3HP0ib2c2g0ip+qk WlNFLc2K9w0eS02uuAytaxwArdOpDUHG+skWe+dGt95Bk3Xld0hW/JSI5hrohM1E 0YKjuZeGGvTkI5FJ5kk+BWoApy+FveiN1w7VsHZgr1tix5NRZRgJP/5swM+0eSp/ neTzp08fC10gi+GOnUpdF/lI4wEayoGMpo68BTGiM/hP6Qh268wNWgTHaau3Sk/z wfJb90VTcDHk+N8mnbi4970RZ6OdqX2n7aBy8D7hBW31kHhP27zofzYX51CFD8Ln 5NQDU3wX/hUw9tNwoghCNe12//gfDBi6rO/GoEztNFfVVef4QuzsfwKKdxzUMc9a jZvyDyGh87rGLauDWylq4s23vgxgIG/p114uv+eFp3AMkOfT/Dk= =knWl -----END PGP SIGNATURE----- Merge tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf Pull bpf fixes from Alexei Starovoitov: - Fix u32/s32 bounds when ranges cross min/max boundary (Eduard Zingerman) - Fix precision backtracking with linked registers (Eduard Zingerman) - Fix linker flags detection for resolve_btfids (Ihor Solodrai) - Fix race in update_ftrace_direct_add/del (Jiri Olsa) - Fix UAF in bpf_trampoline_link_cgroup_shim (Lang Xu) * tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf: resolve_btfids: Fix linker flags detection selftests/bpf: add reproducer for spurious precision propagation through calls bpf: collect only live registers in linked regs Revert "selftests/bpf: Update reg_bound range refinement logic" selftests/bpf: test refining u32/s32 bounds when ranges cross min/max boundary bpf: Fix u32/s32 bounds when ranges cross min/max boundary bpf: Fix a UAF issue in bpf_trampoline_link_cgroup_shim ftrace: Add missing ftrace_lock to update_ftrace_direct_add/del |
||
|
|
aed0af05a8 |
tracing fixes for 7.0:
- Fix possible NULL pointer dereference in trace_data_alloc()
On the error path in trace_data_alloc(), it can call trigger_data_free()
with a NULL pointer. This use to be a kfree() but was changed to
trigger_data_free() to clean up any partial initialization. The issue is
that trigger_data_free() does not expect a NULL pointer. Have
trigger_data_free() return safely on NULL pointer.
- Fix multiple events on the command line and bootconfig
If multiple events are enabled on the command line separately and not
grouped, only the last event gets enabled. That is:
trace_event=sched_switch trace_event=sched_waking
Will only enable sched_waking where as:
trace_event=sched_switch,sched_waking
Will enable both.
The bootconfig makes it even worse as the second way is the more common
method.
The issue is that a temporary buffer is used to store the events to enable
later in boot. Each time the cmdline callback is called, it overwrites
what was previously there.
Have the callback append the next value (delimited by a comma) if the
temporary buffer already has content.
- Fix command line trace_buffer_size if >= 2G
The logic to allocate the trace buffer uses "int" for the size parameter
in the command line code causing overflow issues if more that 2G is
specified.
-----BEGIN PGP SIGNATURE-----
iIoEABYKADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCaaxEIRQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6qn+QAQCM6aJm0ZqDD2dM262M1mQpkU7sW3Dz
hZfBpo3YlH55fQEAklsaD96+yKN7PLl1Vh4c0zCelMHZA7kgck/3GqaFAgA=
=rn/Z
-----END PGP SIGNATURE-----
Merge tag 'trace-v7.0-rc2-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull tracing fixes from Steven Rostedt:
- Fix possible NULL pointer dereference in trace_data_alloc()
On the trace_data_alloc() error path, it can call trigger_data_free()
with a NULL pointer. This used to be a kfree() but was changed to
trigger_data_free() to clean up any partial initialization. The issue
is that trigger_data_free() does not expect a NULL pointer. Have
trigger_data_free() return safely on NULL pointer.
- Fix multiple events on the command line and bootconfig
If multiple events are enabled on the command line separately and not
grouped, only the last event gets enabled. That is:
trace_event=sched_switch trace_event=sched_waking
will only enable sched_waking whereas:
trace_event=sched_switch,sched_waking
will enable both.
The bootconfig makes it even worse as the second way is the more
common method.
The issue is that a temporary buffer is used to store the events to
enable later in boot. Each time the cmdline callback is called, it
overwrites what was previously there.
Have the callback append the next value (delimited by a comma) if the
temporary buffer already has content.
- Fix command line trace_buffer_size if >= 2G
The logic to allocate the trace buffer uses "int" for the size
parameter in the command line code causing overflow issues if more
that 2G is specified.
* tag 'trace-v7.0-rc2-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
tracing: Fix trace_buf_size= cmdline parameter with sizes >= 2G
tracing: Fix enabling multiple events on the kernel command line and bootconfig
tracing: Add NULL pointer check to trigger_data_free()
|
||
|
|
57ccf5ccdc |
sched_ext: Fix enqueue_task_scx() truncation of upper enqueue flags
enqueue_task_scx() takes int enq_flags from the sched_class interface.
SCX enqueue flags starting at bit 32 (SCX_ENQ_PREEMPT and above) are
silently truncated when passed through activate_task(). extra_enq_flags
was added as a workaround - storing high bits in rq->scx.extra_enq_flags
and OR-ing them back in enqueue_task_scx(). However, the OR target is
still the int parameter, so the high bits are lost anyway.
The current impact is limited as the only affected flag is SCX_ENQ_PREEMPT
which is informational to the BPF scheduler - its loss means the scheduler
doesn't know about preemption but doesn't cause incorrect behavior.
Fix by renaming the int parameter to core_enq_flags and introducing a
u64 enq_flags local that merges both sources. All downstream functions
already take u64 enq_flags.
Fixes:
|
||
|
|
2658a1720a |
bpf: collect only live registers in linked regs
Fix an inconsistency between func_states_equal() and
collect_linked_regs():
- regsafe() uses check_ids() to verify that cached and current states
have identical register id mapping.
- func_states_equal() calls regsafe() only for registers computed as
live by compute_live_registers().
- clean_live_states() is supposed to remove dead registers from cached
states, but it can skip states belonging to an iterator-based loop.
- collect_linked_regs() collects all registers sharing the same id,
ignoring the marks computed by compute_live_registers().
Linked registers are stored in the state's jump history.
- backtrack_insn() marks all linked registers for an instruction
as precise whenever one of the linked registers is precise.
The above might lead to a scenario:
- There is an instruction I with register rY known to be dead at I.
- Instruction I is reached via two paths: first A, then B.
- On path A:
- There is an id link between registers rX and rY.
- Checkpoint C is created at I.
- Linked register set {rX, rY} is saved to the jump history.
- rX is marked as precise at I, causing both rX and rY
to be marked precise at C.
- On path B:
- There is no id link between registers rX and rY,
otherwise register states are sub-states of those in C.
- Because rY is dead at I, check_ids() returns true.
- Current state is considered equal to checkpoint C,
propagate_precision() propagates spurious precision
mark for register rY along the path B.
- Depending on a program, this might hit verifier_bug()
in the backtrack_insn(), e.g. if rY ∈ [r1..r5]
and backtrack_insn() spots a function call.
The reproducer program is in the next patch.
This was hit by sched_ext scx_lavd scheduler code.
Changes in tests:
- verifier_scalar_ids.c selftests need modification to preserve
some registers as live for __msg() checks.
- exceptions_assert.c adjusted to match changes in the verifier log,
R0 is dead after conditional instruction and thus does not get
range.
- precise.c adjusted to match changes in the verifier log, register r9
is dead after comparison and it's range is not important for test.
Reported-by: Emil Tsalapatis <emil@etsalapatis.com>
Fixes:
|
||
|
|
d008ba8be8 |
tracing: Fix trace_buf_size= cmdline parameter with sizes >= 2G
Some of the sizing logic through tracer_alloc_buffers() uses int
internally, causing unexpected behavior if the user passes a value that
does not fit in an int (on my x86 machine, the result is uselessly tiny
buffers).
Fix by plumbing the parameter's real type (unsigned long) through to the
ring buffer allocation functions, which already use unsigned long.
It has always been possible to create larger ring buffers via the sysfs
interface: this only affects the cmdline parameter.
Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/bff42a4288aada08bdf74da3f5b67a2c28b761f8.1772852067.git.calvin@wbinvd.org
Fixes:
|
||
|
|
fbc7aef517 |
bpf: Fix u32/s32 bounds when ranges cross min/max boundary
Same as in __reg64_deduce_bounds(), refine s32/u32 ranges in __reg32_deduce_bounds() in the following situations: - s32 range crosses U32_MAX/0 boundary, positive part of the s32 range overlaps with u32 range: 0 U32_MAX | [xxxxxxxxxxxxxx u32 range xxxxxxxxxxxxxx] | |----------------------------|----------------------------| |xxxxx s32 range xxxxxxxxx] [xxxxxxx| 0 S32_MAX S32_MIN -1 - s32 range crosses U32_MAX/0 boundary, negative part of the s32 range overlaps with u32 range: 0 U32_MAX | [xxxxxxxxxxxxxx u32 range xxxxxxxxxxxxxx] | |----------------------------|----------------------------| |xxxxxxxxx] [xxxxxxxxxxxx s32 range | 0 S32_MAX S32_MIN -1 - No refinement if ranges overlap in two intervals. This helps for e.g. consider the following program: call %[bpf_get_prandom_u32]; w0 &= 0xffffffff; if w0 < 0x3 goto 1f; // on fall-through u32 range [3..U32_MAX] if w0 s> 0x1 goto 1f; // on fall-through s32 range [S32_MIN..1] if w0 s< 0x0 goto 1f; // range can be narrowed to [S32_MIN..-1] r10 = 0; 1: ...; The reg_bounds.c selftest is updated to incorporate identical logic, refinement based on non-overflowing range halves: ((x ∩ [0, smax]) ∩ (y ∩ [0, smax])) ∪ ((x ∩ [smin,-1]) ∩ (y ∩ [smin,-1])) Reported-by: Andrea Righi <arighi@nvidia.com> Reported-by: Emil Tsalapatis <emil@etsalapatis.com> Closes: https://lore.kernel.org/bpf/aakqucg4vcujVwif@gpd4/T/ Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com> Acked-by: Shung-Hsi Yu <shung-hsi.yu@suse.com> Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20260306-bpf-32-bit-range-overflow-v3-1-f7f67e060a6b@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> |